Flink version - 1.12.2
Operation type - Session cluster, running on k8s with k8s HA and S3 used for savepoints, checkpoints and state We have this version installed for a couple of months now and it was working perfectly. A couple of weeks ago when submitting any job the job will stay in "initializing" state for about 10-15 minutes and then move to the running state but the tasks themselves will be stuck i "scheduled" state for another 5 minutes we have 17 running jobs on this cluster but most of them are very small and i cannot correlate when this issue began with some increase in job count Jobmanager logs (after submitting a job called "bot mitigation")- (Logs related to checkpoints were filtered here since as far as i understand it's not relevant to jobs that are not running yet, it was just way too much noise) 15:02:16.113 2021-05-11 22:02:16,112 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-resourcemanager-leader. 15:02:04.618 2021-05-11 22:02:04,618 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-044421de4ae92698885225479233e75d-jobmanager-leader. 15:01:40.583 2021-05-11 22:01:40,583 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-50910ef637f19e052840691fe2342183-jobmanager-leader. 15:00:27.553 2021-05-11 22:00:27,553 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver [] - Creating a new watch on ConfigMap flink-1-12-7606619c350810409e22e5c7398838c1-jobmanager-leader. 15:00:17.552 2021-05-11 22:00:17,551 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-resourcemanager-leader. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{b600089c82ea4f51d43b5010c99fca87}] 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{decf0b393af4c82154937f8310818f67}] 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s jukwaa access -> Flat Map (1/10) (e098296dd48c57c7578ec3dab456ae62) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s jukwaa access -> Flat Map (4/10) (d26038276c9646f225051cbeaf4cbaa1) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s jukwaa access -> Flat Map (5/10) (3f5e999ad7876c45c44fc633e4cc6542) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s jukwaa access -> Flat Map (6/10) (4fa434576db3cec4979c719a331031a9) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger, SameSessionIdDetection) (2/2) (41984298935fa45df3cb1c864ad8c81f) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger, SameIpDetection) (2/2) (dd25197745029b0bc73a63d91a70ce40) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger, SamePageNameDetection) (2/2) (1ac2ed0aba3863bbaa3156280e047893) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger, SameHostDetection) (1/2) (f205e7d22fb52ce0aa33b780f17fa000) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{138e3de0153006ffe317c2bb40557d1f}] 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s jukwaa access -> Flat Map (2/10) (06ea559a3f2c333ffffb90c93d0076c6) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s jukwaa access -> Flat Map (9/10) (96a2374190af5ee84d7ab31a4c7c2a2c) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger, SameSessionIdDetection) (1/2) (e7ad1cf1cb110d11c6f17a0ea3ffb58b) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger, SamePageNameDetection) (1/2) (c41979321fdae4a1d539c67896f3e3dc) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger, SameHostDetection) (2/2) (81457e82829bd6300051ed01170fc626) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{cbc4fb135ccacb567378c569e8799e56}] 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{5a4ad54567ba54b71f5d9b7935d7ae27}] 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{51e015ff1e7d20842f2f4321b10949ac}] 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s jukwaa access -> Flat Map (3/10) (53f3d64ab079234bbc2b7c2821d1509d) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s jukwaa access -> Flat Map (7/10) (c18989fcfe6bd7fff57c1b4e44209cf9) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s jukwaa access -> Flat Map (8/10) (bf8128138977c6809f5daa0252d61bd7) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: k8s jukwaa access -> Flat Map (10/10) (09f049b9fbf90b57a0dbb036612f1329) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger, SameIpDetection) (1/2) (3ecbe4c82b2f52b038cc57fa23d56288) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(SlidingProcessingTimeWindows(600000, 300000), ProcessingTimeTrigger, SummaryAction) -> Sink: bot mitigation stats sink (1/1) (235c13ff1dfc4b100acecbd8907be5b5) switched from CREATED to SCHEDULED. 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{99c68bfecfc7b60b43546a055dbd5e8b}] 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{b85b3944efa6a4f7b5c9d99c95be307d}] 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{bd0ed34969d68afe82476c78d92ab3a0}] 14:59:46.614 2021-05-11 21:59:46,614 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{b802729a1a63c950019f20ed0d470063}] 14:59:46.613 2021-05-11 21:59:46,613 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job Flink Bot Mitigation Job (e6dd018db15cd04d24bde2602efbbf55) switched from state CREATED to RUNNING. 14:59:46.613 2021-05-11 21:59:46,613 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Starting scheduling with scheduling strategy [org.apache.flink.runtime.scheduler.strategy.PipelinedRegionSchedulingStrategy] 14:59:46.612 2021-05-11 21:59:46,612 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='flink-1-12-resourcemanager-leader'}. 14:59:46.612 2021-05-11 21:59:46,612 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Starting execution of job Flink Bot Mitigation Job (e6dd018db15cd04d24bde2602efbbf55) under job master id 83b9f2c54ee06fd406a9a1c8c29f483a. 14:59:36.599 2021-05-11 21:59:36,599 INFO org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl [] - JobManager runner for job Flink Bot Mitigation Job (e6dd018db15cd04d24bde2602efbbf55) was granted leadership with session id 06a9a1c8-c29f-483a-83b9-f2c54ee06fd4 at akka.tcp://flink@flink-jobmanager-1-12:6123/user/rpc/jobmanager_66. 14:59:36.597 2021-05-11 21:59:36,597 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - All 0 checkpoints found are already downloaded. 14:59:36.597 2021-05-11 21:59:36,597 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@226ea85e for Flink Bot Mitigation Job (e6dd018db15cd04d24bde2602efbbf55). 14:59:36.597 2021-05-11 21:59:36,597 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Found 0 checkpoints in KubernetesStateHandleStore{configMapName='flink-1-12-e6dd018db15cd04d24bde2602efbbf55-jobmanager-leader'}. 14:59:36.590 2021-05-11 21:59:36,590 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Recovering checkpoints from KubernetesStateHandleStore{configMapName='flink-1-12-e6dd018db15cd04d24bde2602efbbf55-jobmanager-leader'}. 14:59:36.590 2021-05-11 21:59:36,590 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - No state backend has been configured, using default (Memory / JobManager) MemoryStateBackend (data in heap memory / checkpoints to JobManager) (checkpoints: 's3://houzz-flink-1-12-session-cluster-production/checkpoints', savepoints: 's3://houzz-flink-1-12-session-cluster-production/savepoints', asynchronous: TRUE, maxStateSize: 5242880) 14:59:36.589 2021-05-11 21:59:36,589 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Successfully ran initialization on master in 0 ms. 14:59:36.589 2021-05-11 21:59:36,589 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Running initialization on master for job Flink Bot Mitigation Job (e6dd018db15cd04d24bde2602efbbf55). 14:59:36.589 2021-05-11 21:59:36,589 INFO org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] - Built 1 pipelined regions in 0 ms 14:59:36.588 2021-05-11 21:59:36,588 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using restart back off time strategy FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=2147483647, backoffTimeMS=1000) for Flink Bot Mitigation Job (e6dd018db15cd04d24bde2602efbbf55). 14:59:36.587 2021-05-11 21:59:36,587 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Initializing job Flink Bot Mitigation Job (e6dd018db15cd04d24bde2602efbbf55). 14:59:36.587 2021-05-11 21:59:36,586 INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Starting DefaultLeaderElectionService with KubernetesLeaderElectionDriver{configMapName='flink-1-12-e6dd018db15cd04d24bde2602efbbf55-jobmanager-leader'}. 14:59:36.587 2021-05-11 21:59:36,587 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at akka://flink/user/rpc/jobmanager_66 . 14:56:12.678 2021-05-11 21:56:12,677 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-7694dbd48b64c4e5f1b4d5d87966150e-jobmanager-leader. 14:55:09.565 2021-05-11 21:55:09,565 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver [] - Creating a new watch on ConfigMap flink-1-12-0937ffb39f6626f5a457cec002bfc522-jobmanager-leader. 14:54:59.564 2021-05-11 21:54:59,563 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-556808f8c77ed581c8cb35d7ae5d9b31-jobmanager-leader. 14:54:49.562 2021-05-11 21:54:49,562 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-69a56519afb31a03fdca0bf0b3eb72b9-jobmanager-leader. 14:54:39.895 2021-05-11 21:54:39,895 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-dispatcher-leader. 14:54:00.437 2021-05-11 21:54:00,437 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver [] - Creating a new watch on ConfigMap flink-1-12-b083375e263a88af7a28135422c12d7b-jobmanager-leader. 14:53:50.417 2021-05-11 21:53:50,417 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-resourcemanager-leader. 14:53:48.061 2021-05-11 21:53:48,061 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-resourcemanager-leader. 14:53:40.416 2021-05-11 21:53:40,416 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-resourcemanager-leader. 14:53:36.651 2021-05-11 21:53:36,651 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-e55c62f6a4e1fab1037e37239a6e2f65-jobmanager-leader. 14:51:01.722 2021-05-11 21:51:01,721 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-7606619c350810409e22e5c7398838c1-jobmanager-leader. 14:47:45.352 2021-05-11 21:47:45,352 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap flink-1-12-8eb2c33937730a89feeb17d93d2a3213-jobmanager-leader. 14:47:27.738 2021-05-11 21:47:27,738 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected b9a79352-2ee3-452c-b3a3-2029f8ba2f6b for flink-1-12-e6dd018db15cd04d24bde2602efbbf55-jobmanager-leader. 14:47:27.733 2021-05-11 21:47:27,727 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Added JobGraph(jobId: e6dd018db15cd04d24bde2602efbbf55) to KubernetesStateHandleStore{configMapName='flink-1-12-dispatcher-leader'}. 14:47:27.733 2021-05-11 21:47:27,728 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - Create KubernetesLeaderElector flink-1-12-e6dd018db15cd04d24bde2602efbbf55-jobmanager-leader with lock identity b9a79352-2ee3-452c-b3a3-2029f8ba2f6b. 14:47:27.516 2021-05-11 21:47:27,516 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting job e6dd018db15cd04d24bde2602efbbf55 (Flink Bot Mitigation Job). 14:47:27.515 2021-05-11 21:47:27,514 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received JobGraph submission e6dd018db15cd04d24bde2602efbbf55 (Flink Bot Mitigation Job). 14:47:26.565 2021-05-11 21:47:26,565 INFO org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - Submitting Job with JobId=e6dd018db15cd04d24bde2602efbbf55. 14:47:26.565 2021-05-11 21:47:26,565 INFO org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] - Job e6dd018db15cd04d24bde2602efbbf55 is submitted. A couple of things i'm noticing but don't really understand- 1. Why are so many watches being created? 2.Cannot serve slot request, no ResourceManager connected. Adding as pending request - I have enough task slots and according to flink's metrics, the taskmanager have quite a lot of free memory space -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
Solved this.
After trying multiple things i just decided to delete the massive number of old leader configMaps from my k8s cluster (200+) After that the time to initialize + schedule is down to a couple of seconds -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
Free forum by Nabble | Edit this page |