Hello,
I have standalone Flink cluster with JobManager HA. Last night, JobManager failovered because of the connection timeout to Zookeeper. Job is successfully running under new leader JobManager, but when I see the old leader JobManager log, it is trying to re-submit job and getting errors. ( for almost 24 hours now) Here is the log. ----- 2016-07-27 20:56:09,218 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 20:56:09 Job execution switched to status RESTARTING.) because the expected leader session ID None did not equal the received leader session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). 2016-07-27 20:56:19,218 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper. 2016-07-27 20:56:19,218 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 20:56:19 Job execution switched to status CREATED.) because the expected leader session ID None did not equal the received leader session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). 2016-07-27 20:56:19,219 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 1 checkpoints in ZooKeeper. 2016-07-27 20:56:19,221 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Initialized with Checkpoint 40229 @ 1469620528216 for 978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints. 2016-07-27 20:56:19,222 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 20:56:19 Job execution switched to status RUNNING.) because the expected leader session ID None did not equal the received leader session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). 2016-07-27 20:56:19,222 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0) switched from CREATED to SCHEDULED 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0) switched from SCHEDULED to CANCELED 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c) switched from CREATED to CANCELED 2016-07-27 20:56:19,223 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 20:56:19 Job execution switched to status FAILING.) because the expected leader session ID None did not equal the received leader session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc) switched from CREATED to CANCELED 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503) switched from CREATED to CANCELED 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778) switched from CREATED to CANCELED 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc) switched from CREATED to CANCELED 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to CANCELED 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to CANCELED 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to CANCELED 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - TriggerWindow(GlobalWindows(), FoldingStateDescriptor{name=window-contents, defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152) switched from CREATED to CANCELED 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - TriggerWindow(GlobalWindows(), FoldingStateDescriptor{name=window-contents, defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9) switched from CREATED to CANCELED 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - TriggerWindow(GlobalWindows(), FoldingStateDescriptor{name=window-contents, defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec) switched from CREATED to CANCELED 2016-07-27 20:56:19,223 INFO org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy - Delaying retry of job execution for 10000 ms ... 2016-07-27 20:56:19,223 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 20:56:19 Job execution switched to status RESTARTING.) because the expected leader session ID None did not equal the received leader session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). ---- Could anyone advise me why this happens and how I can recover from this situation? (restart JobManager?) Regards, Hironori Ogibayashi |
Which version of Flink are you running on? I think this might have
been fixed for the 1.1 release (http://people.apache.org/~uce/flink-1.1.0-rc1/). It looks like the ExecutionGraph is still trying to restart although the JobManager is not the leader anymore. If you could provide the complete logs of both JobManagers, that would be helpful to be sure what is happening. You can recover from this by restarting the respective JobManager process (by running "jobmanager.sh stop" script on that machine and starting again via "jobmanager.sh start cluster") . – Ufuk On Wed, Jul 27, 2016 at 2:00 PM, Hironori Ogibayashi <[hidden email]> wrote: > Hello, > > I have standalone Flink cluster with JobManager HA. > Last night, JobManager failovered because of the connection timeout to > Zookeeper. > Job is successfully running under new leader JobManager, but when > I see the old leader JobManager log, it is trying to re-submit job and > getting errors. ( for almost 24 hours now) > > Here is the log. > > ----- > 2016-07-27 20:56:09,218 WARN > org.apache.flink.runtime.jobmanager.JobManager - > Discard message > LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 > 20:56:09 Job execution switched to status RESTARTING.) because the > expected leader session ID None did not equal the received leader > session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). > 2016-07-27 20:56:19,218 INFO > org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore > - Recovering checkpoints from ZooKeeper. > 2016-07-27 20:56:19,218 WARN > org.apache.flink.runtime.jobmanager.JobManager - > Discard message > LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 > 20:56:19 Job execution switched to status CREATED.) because the > expected leader session ID None did not equal the received leader > session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). > 2016-07-27 20:56:19,219 INFO > org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore > - Found 1 checkpoints in ZooKeeper. > 2016-07-27 20:56:19,221 INFO > org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore > - Initialized with Checkpoint 40229 @ 1469620528216 for > 978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints. > 2016-07-27 20:56:19,222 WARN > org.apache.flink.runtime.jobmanager.JobManager - > Discard message > LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 > 20:56:19 Job execution switched to status RUNNING.) because the > expected leader session ID None did not equal the received leader > session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). > 2016-07-27 20:56:19,222 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0) > switched from CREATED to SCHEDULED > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0) > switched from SCHEDULED to CANCELED > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c) > switched from CREATED to CANCELED > 2016-07-27 20:56:19,223 WARN > org.apache.flink.runtime.jobmanager.JobManager - > Discard message > LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 > 20:56:19 Job execution switched to status FAILING.) because the > expected leader session ID None did not equal the received leader > session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc) > switched from CREATED to CANCELED > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503) > switched from CREATED to CANCELED > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778) > switched from CREATED to CANCELED > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc) > switched from CREATED to CANCELED > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat > Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to > CANCELED > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat > Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to > CANCELED > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat > Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to > CANCELED > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > TriggerWindow(GlobalWindows(), > FoldingStateDescriptor{name=window-contents, > defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), > serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), > WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> > Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152) > switched from CREATED to CANCELED > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > TriggerWindow(GlobalWindows(), > FoldingStateDescriptor{name=window-contents, > defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), > serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), > WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> > Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9) > switched from CREATED to CANCELED > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > TriggerWindow(GlobalWindows(), > FoldingStateDescriptor{name=window-contents, > defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), > serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), > WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> > Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec) > switched from CREATED to CANCELED > 2016-07-27 20:56:19,223 INFO > org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy > - Delaying retry of job execution for 10000 ms ... > 2016-07-27 20:56:19,223 WARN > org.apache.flink.runtime.jobmanager.JobManager - > Discard message > LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 > 20:56:19 Job execution switched to status RESTARTING.) because the > expected leader session ID None did not equal the received leader > session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). > ---- > > Could anyone advise me why this happens and how I can recover from > this situation? (restart JobManager?) > > Regards, > Hironori Ogibayashi |
Thank you so much for your quick response.
I am running Flink 1.0.3. I have attached jobmanager logs. The failover happend around 7/26 21:13 JST. Regards, Hironori 2016-07-27 21:21 GMT+09:00 Ufuk Celebi <[hidden email]>: > Which version of Flink are you running on? I think this might have > been fixed for the 1.1 release > (http://people.apache.org/~uce/flink-1.1.0-rc1/). > > It looks like the ExecutionGraph is still trying to restart although > the JobManager is not the leader anymore. If you could provide the > complete logs of both JobManagers, that would be helpful to be sure > what is happening. > > You can recover from this by restarting the respective JobManager > process (by running "jobmanager.sh stop" script on that machine and > starting again via "jobmanager.sh start cluster") . > > – Ufuk > > On Wed, Jul 27, 2016 at 2:00 PM, Hironori Ogibayashi > <[hidden email]> wrote: >> Hello, >> >> I have standalone Flink cluster with JobManager HA. >> Last night, JobManager failovered because of the connection timeout to >> Zookeeper. >> Job is successfully running under new leader JobManager, but when >> I see the old leader JobManager log, it is trying to re-submit job and >> getting errors. ( for almost 24 hours now) >> >> Here is the log. >> >> ----- >> 2016-07-27 20:56:09,218 WARN >> org.apache.flink.runtime.jobmanager.JobManager - >> Discard message >> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >> 20:56:09 Job execution switched to status RESTARTING.) because the >> expected leader session ID None did not equal the received leader >> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >> 2016-07-27 20:56:19,218 INFO >> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >> - Recovering checkpoints from ZooKeeper. >> 2016-07-27 20:56:19,218 WARN >> org.apache.flink.runtime.jobmanager.JobManager - >> Discard message >> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >> 20:56:19 Job execution switched to status CREATED.) because the >> expected leader session ID None did not equal the received leader >> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >> 2016-07-27 20:56:19,219 INFO >> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >> - Found 1 checkpoints in ZooKeeper. >> 2016-07-27 20:56:19,221 INFO >> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >> - Initialized with Checkpoint 40229 @ 1469620528216 for >> 978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints. >> 2016-07-27 20:56:19,222 WARN >> org.apache.flink.runtime.jobmanager.JobManager - >> Discard message >> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >> 20:56:19 Job execution switched to status RUNNING.) because the >> expected leader session ID None did not equal the received leader >> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >> 2016-07-27 20:56:19,222 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - >> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0) >> switched from CREATED to SCHEDULED >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - >> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0) >> switched from SCHEDULED to CANCELED >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - >> Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c) >> switched from CREATED to CANCELED >> 2016-07-27 20:56:19,223 WARN >> org.apache.flink.runtime.jobmanager.JobManager - >> Discard message >> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >> 20:56:19 Job execution switched to status FAILING.) because the >> expected leader session ID None did not equal the received leader >> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - >> Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc) >> switched from CREATED to CANCELED >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - >> Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503) >> switched from CREATED to CANCELED >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - >> Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778) >> switched from CREATED to CANCELED >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - >> Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc) >> switched from CREATED to CANCELED >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat >> Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to >> CANCELED >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat >> Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to >> CANCELED >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat >> Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to >> CANCELED >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - >> TriggerWindow(GlobalWindows(), >> FoldingStateDescriptor{name=window-contents, >> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), >> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), >> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> >> Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152) >> switched from CREATED to CANCELED >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - >> TriggerWindow(GlobalWindows(), >> FoldingStateDescriptor{name=window-contents, >> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), >> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), >> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> >> Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9) >> switched from CREATED to CANCELED >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - >> TriggerWindow(GlobalWindows(), >> FoldingStateDescriptor{name=window-contents, >> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), >> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), >> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> >> Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec) >> switched from CREATED to CANCELED >> 2016-07-27 20:56:19,223 INFO >> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy >> - Delaying retry of job execution for 10000 ms ... >> 2016-07-27 20:56:19,223 WARN >> org.apache.flink.runtime.jobmanager.JobManager - >> Discard message >> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >> 20:56:19 Job execution switched to status RESTARTING.) because the >> expected leader session ID None did not equal the received leader >> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >> ---- >> >> Could anyone advise me why this happens and how I can recover from >> this situation? (restart JobManager?) >> >> Regards, >> Hironori Ogibayashi jobmanager_log.tgz (5M) Download Attachment |
Thanks for the logs. Looking through them it's caused by this issue:
https://issues.apache.org/jira/browse/FLINK-3800. The ExecutionGraph (Flink's internal scheduling structure) is not terminated properly and tries to restart the job over and over again. This is fixed for 1.1.0. Is it an option for you to upgrade to 1.1 when it's out? We might need to backport this fix for 1.0.4. The work around is as I've described, just restart jobmanager-5. On Wed, Jul 27, 2016 at 2:55 PM, Hironori Ogibayashi <[hidden email]> wrote: > Thank you so much for your quick response. > I am running Flink 1.0.3. > > I have attached jobmanager logs. The failover happend around 7/26 21:13 JST. > > Regards, > Hironori > > 2016-07-27 21:21 GMT+09:00 Ufuk Celebi <[hidden email]>: >> Which version of Flink are you running on? I think this might have >> been fixed for the 1.1 release >> (http://people.apache.org/~uce/flink-1.1.0-rc1/). >> >> It looks like the ExecutionGraph is still trying to restart although >> the JobManager is not the leader anymore. If you could provide the >> complete logs of both JobManagers, that would be helpful to be sure >> what is happening. >> >> You can recover from this by restarting the respective JobManager >> process (by running "jobmanager.sh stop" script on that machine and >> starting again via "jobmanager.sh start cluster") . >> >> – Ufuk >> >> On Wed, Jul 27, 2016 at 2:00 PM, Hironori Ogibayashi >> <[hidden email]> wrote: >>> Hello, >>> >>> I have standalone Flink cluster with JobManager HA. >>> Last night, JobManager failovered because of the connection timeout to >>> Zookeeper. >>> Job is successfully running under new leader JobManager, but when >>> I see the old leader JobManager log, it is trying to re-submit job and >>> getting errors. ( for almost 24 hours now) >>> >>> Here is the log. >>> >>> ----- >>> 2016-07-27 20:56:09,218 WARN >>> org.apache.flink.runtime.jobmanager.JobManager - >>> Discard message >>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>> 20:56:09 Job execution switched to status RESTARTING.) because the >>> expected leader session ID None did not equal the received leader >>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>> 2016-07-27 20:56:19,218 INFO >>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>> - Recovering checkpoints from ZooKeeper. >>> 2016-07-27 20:56:19,218 WARN >>> org.apache.flink.runtime.jobmanager.JobManager - >>> Discard message >>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>> 20:56:19 Job execution switched to status CREATED.) because the >>> expected leader session ID None did not equal the received leader >>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>> 2016-07-27 20:56:19,219 INFO >>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>> - Found 1 checkpoints in ZooKeeper. >>> 2016-07-27 20:56:19,221 INFO >>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>> - Initialized with Checkpoint 40229 @ 1469620528216 for >>> 978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints. >>> 2016-07-27 20:56:19,222 WARN >>> org.apache.flink.runtime.jobmanager.JobManager - >>> Discard message >>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>> 20:56:19 Job execution switched to status RUNNING.) because the >>> expected leader session ID None did not equal the received leader >>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>> 2016-07-27 20:56:19,222 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0) >>> switched from CREATED to SCHEDULED >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0) >>> switched from SCHEDULED to CANCELED >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c) >>> switched from CREATED to CANCELED >>> 2016-07-27 20:56:19,223 WARN >>> org.apache.flink.runtime.jobmanager.JobManager - >>> Discard message >>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>> 20:56:19 Job execution switched to status FAILING.) because the >>> expected leader session ID None did not equal the received leader >>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc) >>> switched from CREATED to CANCELED >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503) >>> switched from CREATED to CANCELED >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778) >>> switched from CREATED to CANCELED >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc) >>> switched from CREATED to CANCELED >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat >>> Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to >>> CANCELED >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat >>> Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to >>> CANCELED >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat >>> Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to >>> CANCELED >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> TriggerWindow(GlobalWindows(), >>> FoldingStateDescriptor{name=window-contents, >>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), >>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), >>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> >>> Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152) >>> switched from CREATED to CANCELED >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> TriggerWindow(GlobalWindows(), >>> FoldingStateDescriptor{name=window-contents, >>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), >>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), >>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> >>> Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9) >>> switched from CREATED to CANCELED >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> TriggerWindow(GlobalWindows(), >>> FoldingStateDescriptor{name=window-contents, >>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), >>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), >>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> >>> Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec) >>> switched from CREATED to CANCELED >>> 2016-07-27 20:56:19,223 INFO >>> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy >>> - Delaying retry of job execution for 10000 ms ... >>> 2016-07-27 20:56:19,223 WARN >>> org.apache.flink.runtime.jobmanager.JobManager - >>> Discard message >>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>> 20:56:19 Job execution switched to status RESTARTING.) because the >>> expected leader session ID None did not equal the received leader >>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>> ---- >>> >>> Could anyone advise me why this happens and how I can recover from >>> this situation? (restart JobManager?) >>> >>> Regards, >>> Hironori Ogibayashi |
Thank you for telling me about the cause.
It recovered by restarting jobmanager-5 and jobmanager-1. I restart jobmanager-1 because when I restarted jobmanager-5 , checkpointing started to fail with the following message. ---- 2016-07-28 10:42:28,217 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Failed to trigger checkpoint (19 consecutive failed attempts so far) org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /flink/flink_prod/checkpoint-counter/978ef000cca5a3aa6f3461274102f82c at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270) at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:274) at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:270) at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:267) at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:253) at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:41) at org.apache.flink.shaded.org.apache.curator.framework.recipes.shared.SharedValue.trySetValue(SharedValue.java:168) at org.apache.flink.shaded.org.apache.curator.framework.recipes.shared.SharedCount.trySetCount(SharedCount.java:111) at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.getAndIncrement(ZooKeeperCheckpointIDCounter.java:121) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:411) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:339) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$ScheduledTrigger.run(CheckpointCoordinator.java:928) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) ---- Anyway, thank you so much for your advice. I think it would be great if the fix will be backported to 1.0.4. Regards, Hionori 2016-07-28 0:08 GMT+09:00 Ufuk Celebi <[hidden email]>: > Thanks for the logs. Looking through them it's caused by this issue: > https://issues.apache.org/jira/browse/FLINK-3800. The ExecutionGraph > (Flink's internal scheduling structure) is not terminated properly and > tries to restart the job over and over again. > > This is fixed for 1.1.0. Is it an option for you to upgrade to 1.1 > when it's out? We might need to backport this fix for 1.0.4. The work > around is as I've described, just restart jobmanager-5. > > > > On Wed, Jul 27, 2016 at 2:55 PM, Hironori Ogibayashi > <[hidden email]> wrote: >> Thank you so much for your quick response. >> I am running Flink 1.0.3. >> >> I have attached jobmanager logs. The failover happend around 7/26 21:13 JST. >> >> Regards, >> Hironori >> >> 2016-07-27 21:21 GMT+09:00 Ufuk Celebi <[hidden email]>: >>> Which version of Flink are you running on? I think this might have >>> been fixed for the 1.1 release >>> (http://people.apache.org/~uce/flink-1.1.0-rc1/). >>> >>> It looks like the ExecutionGraph is still trying to restart although >>> the JobManager is not the leader anymore. If you could provide the >>> complete logs of both JobManagers, that would be helpful to be sure >>> what is happening. >>> >>> You can recover from this by restarting the respective JobManager >>> process (by running "jobmanager.sh stop" script on that machine and >>> starting again via "jobmanager.sh start cluster") . >>> >>> – Ufuk >>> >>> On Wed, Jul 27, 2016 at 2:00 PM, Hironori Ogibayashi >>> <[hidden email]> wrote: >>>> Hello, >>>> >>>> I have standalone Flink cluster with JobManager HA. >>>> Last night, JobManager failovered because of the connection timeout to >>>> Zookeeper. >>>> Job is successfully running under new leader JobManager, but when >>>> I see the old leader JobManager log, it is trying to re-submit job and >>>> getting errors. ( for almost 24 hours now) >>>> >>>> Here is the log. >>>> >>>> ----- >>>> 2016-07-27 20:56:09,218 WARN >>>> org.apache.flink.runtime.jobmanager.JobManager - >>>> Discard message >>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>>> 20:56:09 Job execution switched to status RESTARTING.) because the >>>> expected leader session ID None did not equal the received leader >>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>>> 2016-07-27 20:56:19,218 INFO >>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>>> - Recovering checkpoints from ZooKeeper. >>>> 2016-07-27 20:56:19,218 WARN >>>> org.apache.flink.runtime.jobmanager.JobManager - >>>> Discard message >>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>>> 20:56:19 Job execution switched to status CREATED.) because the >>>> expected leader session ID None did not equal the received leader >>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>>> 2016-07-27 20:56:19,219 INFO >>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>>> - Found 1 checkpoints in ZooKeeper. >>>> 2016-07-27 20:56:19,221 INFO >>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>>> - Initialized with Checkpoint 40229 @ 1469620528216 for >>>> 978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints. >>>> 2016-07-27 20:56:19,222 WARN >>>> org.apache.flink.runtime.jobmanager.JobManager - >>>> Discard message >>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>>> 20:56:19 Job execution switched to status RUNNING.) because the >>>> expected leader session ID None did not equal the received leader >>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>>> 2016-07-27 20:56:19,222 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0) >>>> switched from CREATED to SCHEDULED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0) >>>> switched from SCHEDULED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 WARN >>>> org.apache.flink.runtime.jobmanager.JobManager - >>>> Discard message >>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>>> 20:56:19 Job execution switched to status FAILING.) because the >>>> expected leader session ID None did not equal the received leader >>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat >>>> Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to >>>> CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat >>>> Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to >>>> CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat >>>> Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to >>>> CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> TriggerWindow(GlobalWindows(), >>>> FoldingStateDescriptor{name=window-contents, >>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), >>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), >>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> >>>> Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> TriggerWindow(GlobalWindows(), >>>> FoldingStateDescriptor{name=window-contents, >>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), >>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), >>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> >>>> Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> TriggerWindow(GlobalWindows(), >>>> FoldingStateDescriptor{name=window-contents, >>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1), >>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000), >>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map -> >>>> Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec) >>>> switched from CREATED to CANCELED >>>> 2016-07-27 20:56:19,223 INFO >>>> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy >>>> - Delaying retry of job execution for 10000 ms ... >>>> 2016-07-27 20:56:19,223 WARN >>>> org.apache.flink.runtime.jobmanager.JobManager - >>>> Discard message >>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016 >>>> 20:56:19 Job execution switched to status RESTARTING.) because the >>>> expected leader session ID None did not equal the received leader >>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4). >>>> ---- >>>> >>>> Could anyone advise me why this happens and how I can recover from >>>> this situation? (restart JobManager?) >>>> >>>> Regards, >>>> Hironori Ogibayashi |
Free forum by Nabble | Edit this page |