Job initialization and scheduling takes a very long time (15+ minutes)

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Job initialization and scheduling takes a very long time (15+ minutes)

OranShuster
Flink version - 1.12.2
Operation type - Session cluster, running on k8s with k8s HA and S3 used for
savepoints, checkpoints and state

We have this version installed for a couple of months now and it was working
perfectly. A couple of weeks ago when submitting any job the job will stay
in "initializing" state for about 10-15 minutes and then move to the running
state but the tasks themselves will be stuck i "scheduled" state for another
5 minutes

we have 17 running jobs on this cluster but most of them are very small and
i cannot correlate when this issue began with some increase in job count

Jobmanager logs (after submitting a job called "bot mitigation")-
(Logs related to checkpoints were filtered here since as far as i understand
it's not relevant to jobs that are not running yet, it was just way too much
noise)
 
15:02:16.113 2021-05-11 22:02:16,112 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap flink-1-12-resourcemanager-leader.
15:02:04.618 2021-05-11 22:02:04,618 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
flink-1-12-044421de4ae92698885225479233e75d-jobmanager-leader.
15:01:40.583 2021-05-11 22:01:40,583 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
flink-1-12-50910ef637f19e052840691fe2342183-jobmanager-leader.
15:00:27.553 2021-05-11 22:00:27,553 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver
[] - Creating a new watch on ConfigMap
flink-1-12-7606619c350810409e22e5c7398838c1-jobmanager-leader.
15:00:17.552 2021-05-11 22:00:17,551 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap flink-1-12-resourcemanager-leader.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve
slot request, no ResourceManager connected. Adding as pending request
[SlotRequestId{b600089c82ea4f51d43b5010c99fca87}]
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve
slot request, no ResourceManager connected. Adding as pending request
[SlotRequestId{decf0b393af4c82154937f8310818f67}]
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s
jukwaa access -> Flat Map (1/10) (e098296dd48c57c7578ec3dab456ae62) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s
jukwaa access -> Flat Map (4/10) (d26038276c9646f225051cbeaf4cbaa1) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s
jukwaa access -> Flat Map (5/10) (3f5e999ad7876c45c44fc633e4cc6542) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s
jukwaa access -> Flat Map (6/10) (4fa434576db3cec4979c719a331031a9) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger,
SameSessionIdDetection) (2/2) (41984298935fa45df3cb1c864ad8c81f) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger,
SameIpDetection) (2/2) (dd25197745029b0bc73a63d91a70ce40) switched from
CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger,
SamePageNameDetection) (2/2) (1ac2ed0aba3863bbaa3156280e047893) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger,
SameHostDetection) (1/2) (f205e7d22fb52ce0aa33b780f17fa000) switched from
CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve
slot request, no ResourceManager connected. Adding as pending request
[SlotRequestId{138e3de0153006ffe317c2bb40557d1f}]
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s
jukwaa access -> Flat Map (2/10) (06ea559a3f2c333ffffb90c93d0076c6) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s
jukwaa access -> Flat Map (9/10) (96a2374190af5ee84d7ab31a4c7c2a2c) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger,
SameSessionIdDetection) (1/2) (e7ad1cf1cb110d11c6f17a0ea3ffb58b) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger,
SamePageNameDetection) (1/2) (c41979321fdae4a1d539c67896f3e3dc) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger,
SameHostDetection) (2/2) (81457e82829bd6300051ed01170fc626) switched from
CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve
slot request, no ResourceManager connected. Adding as pending request
[SlotRequestId{cbc4fb135ccacb567378c569e8799e56}]
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve
slot request, no ResourceManager connected. Adding as pending request
[SlotRequestId{5a4ad54567ba54b71f5d9b7935d7ae27}]
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve
slot request, no ResourceManager connected. Adding as pending request
[SlotRequestId{51e015ff1e7d20842f2f4321b10949ac}]
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s
jukwaa access -> Flat Map (3/10) (53f3d64ab079234bbc2b7c2821d1509d) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s
jukwaa access -> Flat Map (7/10) (c18989fcfe6bd7fff57c1b4e44209cf9) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s
jukwaa access -> Flat Map (8/10) (bf8128138977c6809f5daa0252d61bd7) switched
from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s
jukwaa access -> Flat Map (10/10) (09f049b9fbf90b57a0dbb036612f1329)
switched from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger,
SameIpDetection) (1/2) (3ecbe4c82b2f52b038cc57fa23d56288) switched from
CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger,
SummaryAction) -> Sink: bot mitigation stats sink (1/1)
(235c13ff1dfc4b100acecbd8907be5b5) switched from CREATED to SCHEDULED.
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve
slot request, no ResourceManager connected. Adding as pending request
[SlotRequestId{99c68bfecfc7b60b43546a055dbd5e8b}]
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve
slot request, no ResourceManager connected. Adding as pending request
[SlotRequestId{b85b3944efa6a4f7b5c9d99c95be307d}]
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve
slot request, no ResourceManager connected. Adding as pending request
[SlotRequestId{bd0ed34969d68afe82476c78d92ab3a0}]
14:59:46.614 2021-05-11 21:59:46,614 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve
slot request, no ResourceManager connected. Adding as pending request
[SlotRequestId{b802729a1a63c950019f20ed0d470063}]
14:59:46.613 2021-05-11 21:59:46,613 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job Flink Bot
Mitigation Job (e6dd018db15cd04d24bde2602efbbf55) switched from state
CREATED to RUNNING.
14:59:46.613 2021-05-11 21:59:46,613 INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Starting scheduling with
scheduling strategy
[org.apache.flink.runtime.scheduler.strategy.PipelinedRegionSchedulingStrategy]
14:59:46.612 2021-05-11 21:59:46,612 INFO
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] -
Starting DefaultLeaderRetrievalService with
KubernetesLeaderRetrievalDriver{configMapName='flink-1-12-resourcemanager-leader'}.
14:59:46.612 2021-05-11 21:59:46,612 INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Starting execution of job
Flink Bot Mitigation Job (e6dd018db15cd04d24bde2602efbbf55) under job master
id 83b9f2c54ee06fd406a9a1c8c29f483a.
14:59:36.599 2021-05-11 21:59:36,599 INFO
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl [] - JobManager
runner for job Flink Bot Mitigation Job (e6dd018db15cd04d24bde2602efbbf55)
was granted leadership with session id 06a9a1c8-c29f-483a-83b9-f2c54ee06fd4
at akka.tcp://flink@flink-jobmanager-1-12:6123/user/rpc/jobmanager_66.
14:59:36.597 2021-05-11 21:59:36,597 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - All
0 checkpoints found are already downloaded.
14:59:36.597 2021-05-11 21:59:36,597 INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Using failover strategy
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@226ea85e
for Flink Bot Mitigation Job (e6dd018db15cd04d24bde2602efbbf55).
14:59:36.597 2021-05-11 21:59:36,597 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] -
Found 0 checkpoints in
KubernetesStateHandleStore{configMapName='flink-1-12-e6dd018db15cd04d24bde2602efbbf55-jobmanager-leader'}.
14:59:36.590 2021-05-11 21:59:36,590 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] -
Recovering checkpoints from
KubernetesStateHandleStore{configMapName='flink-1-12-e6dd018db15cd04d24bde2602efbbf55-jobmanager-leader'}.
14:59:36.590 2021-05-11 21:59:36,590 INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - No state backend has been
configured, using default (Memory / JobManager) MemoryStateBackend (data in
heap memory / checkpoints to JobManager) (checkpoints:
's3://houzz-flink-1-12-session-cluster-production/checkpoints', savepoints:
's3://houzz-flink-1-12-session-cluster-production/savepoints', asynchronous:
TRUE, maxStateSize: 5242880)
14:59:36.589 2021-05-11 21:59:36,589 INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Successfully ran
initialization on master in 0 ms.
14:59:36.589 2021-05-11 21:59:36,589 INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Running initialization on
master for job Flink Bot Mitigation Job (e6dd018db15cd04d24bde2602efbbf55).
14:59:36.589 2021-05-11 21:59:36,589 INFO
org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] -
Built 1 pipelined regions in 0 ms
14:59:36.588 2021-05-11 21:59:36,588 INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Using restart back off
time strategy
FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=2147483647,
backoffTimeMS=1000) for Flink Bot Mitigation Job
(e6dd018db15cd04d24bde2602efbbf55).
14:59:36.587 2021-05-11 21:59:36,587 INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Initializing job Flink Bot
Mitigation Job (e6dd018db15cd04d24bde2602efbbf55).
14:59:36.587 2021-05-11 21:59:36,586 INFO
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] -
Starting DefaultLeaderElectionService with
KubernetesLeaderElectionDriver{configMapName='flink-1-12-e6dd018db15cd04d24bde2602efbbf55-jobmanager-leader'}.
14:59:36.587 2021-05-11 21:59:36,587 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint
for org.apache.flink.runtime.jobmaster.JobMaster at
akka://flink/user/rpc/jobmanager_66 .
14:56:12.678 2021-05-11 21:56:12,677 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
flink-1-12-7694dbd48b64c4e5f1b4d5d87966150e-jobmanager-leader.
14:55:09.565 2021-05-11 21:55:09,565 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver
[] - Creating a new watch on ConfigMap
flink-1-12-0937ffb39f6626f5a457cec002bfc522-jobmanager-leader.
14:54:59.564 2021-05-11 21:54:59,563 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
flink-1-12-556808f8c77ed581c8cb35d7ae5d9b31-jobmanager-leader.
14:54:49.562 2021-05-11 21:54:49,562 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
flink-1-12-69a56519afb31a03fdca0bf0b3eb72b9-jobmanager-leader.
14:54:39.895 2021-05-11 21:54:39,895 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap flink-1-12-dispatcher-leader.
14:54:00.437 2021-05-11 21:54:00,437 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver
[] - Creating a new watch on ConfigMap
flink-1-12-b083375e263a88af7a28135422c12d7b-jobmanager-leader.
14:53:50.417 2021-05-11 21:53:50,417 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap flink-1-12-resourcemanager-leader.
14:53:48.061 2021-05-11 21:53:48,061 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap flink-1-12-resourcemanager-leader.
14:53:40.416 2021-05-11 21:53:40,416 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap flink-1-12-resourcemanager-leader.
14:53:36.651 2021-05-11 21:53:36,651 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
flink-1-12-e55c62f6a4e1fab1037e37239a6e2f65-jobmanager-leader.
14:51:01.722 2021-05-11 21:51:01,721 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
flink-1-12-7606619c350810409e22e5c7398838c1-jobmanager-leader.
14:47:45.352 2021-05-11 21:47:45,352 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
flink-1-12-8eb2c33937730a89feeb17d93d2a3213-jobmanager-leader.
14:47:27.738 2021-05-11 21:47:27,738 INFO
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector []
- New leader elected b9a79352-2ee3-452c-b3a3-2029f8ba2f6b for
flink-1-12-e6dd018db15cd04d24bde2602efbbf55-jobmanager-leader.
14:47:27.733 2021-05-11 21:47:27,727 INFO
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Added
JobGraph(jobId: e6dd018db15cd04d24bde2602efbbf55) to
KubernetesStateHandleStore{configMapName='flink-1-12-dispatcher-leader'}.
14:47:27.733 2021-05-11 21:47:27,728 INFO
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector []
- Create KubernetesLeaderElector
flink-1-12-e6dd018db15cd04d24bde2602efbbf55-jobmanager-leader with lock
identity b9a79352-2ee3-452c-b3a3-2029f8ba2f6b.
14:47:27.516 2021-05-11 21:47:27,516 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting job
e6dd018db15cd04d24bde2602efbbf55 (Flink Bot Mitigation Job).
14:47:27.515 2021-05-11 21:47:27,514 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
JobGraph submission e6dd018db15cd04d24bde2602efbbf55 (Flink Bot Mitigation
Job).
14:47:26.565 2021-05-11 21:47:26,565 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
- Submitting Job with JobId=e6dd018db15cd04d24bde2602efbbf55.
14:47:26.565 2021-05-11 21:47:26,565 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor []
- Job e6dd018db15cd04d24bde2602efbbf55 is submitted.

A couple of things i'm noticing but don't really understand-
1. Why are so many watches being created?
2.Cannot serve slot request, no ResourceManager connected. Adding as pending
request - I have enough task slots and according to flink's metrics, the
taskmanager have quite a lot of free memory space



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: Job initialization and scheduling takes a very long time (15+ minutes)

OranShuster
Solved this.
After trying multiple things i just decided to delete the massive number of
old leader configMaps from my k8s cluster (200+)
After that the time to initialize + schedule is down to a couple of seconds



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/