(DEPRECATED) Apache Flink User Mailing List archive.

JobManager trying to re-submit jobs after failover

Classic

List

Threaded

5 messages Options

Hironori Ogibayashi

JobManager trying to re-submit jobs after failover

Hello,

I have standalone Flink cluster with JobManager HA.
Last night, JobManager failovered because of the connection timeout to
Zookeeper.
Job is successfully running under new leader JobManager, but when
I see the old leader JobManager log, it is trying to re-submit job and
getting errors. ( for almost 24 hours now)

Here is the log.

-----
2016-07-27 20:56:09,218 WARN
org.apache.flink.runtime.jobmanager.JobManager -
Discard message
LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
20:56:09 Job execution switched to status RESTARTING.) because the
expected leader session ID None did not equal the received leader
session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
2016-07-27 20:56:19,218 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
- Recovering checkpoints from ZooKeeper.
2016-07-27 20:56:19,218 WARN
org.apache.flink.runtime.jobmanager.JobManager -
Discard message
LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
20:56:19 Job execution switched to status CREATED.) because the
expected leader session ID None did not equal the received leader
session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
2016-07-27 20:56:19,219 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
- Found 1 checkpoints in ZooKeeper.
2016-07-27 20:56:19,221 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
- Initialized with Checkpoint 40229 @ 1469620528216 for
978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints.
2016-07-27 20:56:19,222 WARN
org.apache.flink.runtime.jobmanager.JobManager -
Discard message
LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
20:56:19 Job execution switched to status RUNNING.) because the
expected leader session ID None did not equal the received leader
session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
2016-07-27 20:56:19,222 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph -
Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
switched from CREATED to SCHEDULED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph -
Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
switched from SCHEDULED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph -
Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 WARN
org.apache.flink.runtime.jobmanager.JobManager -
Discard message
LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
20:56:19 Job execution switched to status FAILING.) because the
expected leader session ID None did not equal the received leader
session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph -
Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph -
Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph -
Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph -
Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to
CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to
CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to
CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph -
TriggerWindow(GlobalWindows(),
FoldingStateDescriptor{name=window-contents,
defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph -
TriggerWindow(GlobalWindows(),
FoldingStateDescriptor{name=window-contents,
defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph -
TriggerWindow(GlobalWindows(),
FoldingStateDescriptor{name=window-contents,
defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy
- Delaying retry of job execution for 10000 ms ...
2016-07-27 20:56:19,223 WARN
org.apache.flink.runtime.jobmanager.JobManager -
Discard message
LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
20:56:19 Job execution switched to status RESTARTING.) because the
expected leader session ID None did not equal the received leader
session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
----

Could anyone advise me why this happens and how I can recover from
this situation? (restart JobManager?)

Regards,
Hironori Ogibayashi

Ufuk Celebi

Re: JobManager trying to re-submit jobs after failover

Which version of Flink are you running on? I think this might have
been fixed for the 1.1 release
(http://people.apache.org/~uce/flink-1.1.0-rc1/).

It looks like the ExecutionGraph is still trying to restart although
the JobManager is not the leader anymore. If you could provide the
complete logs of both JobManagers, that would be helpful to be sure
what is happening.

You can recover from this by restarting the respective JobManager
process (by running "jobmanager.sh stop" script on that machine and
starting again via "jobmanager.sh start cluster") .

– Ufuk

On Wed, Jul 27, 2016 at 2:00 PM, Hironori Ogibayashi
<[hidden email]> wrote:

> Hello,
>
> I have standalone Flink cluster with JobManager HA.
> Last night, JobManager failovered because of the connection timeout to
> Zookeeper.
> Job is successfully running under new leader JobManager, but when
> I see the old leader JobManager log, it is trying to re-submit job and
> getting errors. ( for almost 24 hours now)
>
> Here is the log.
>
> -----
> 2016-07-27 20:56:09,218 WARN
> org.apache.flink.runtime.jobmanager.JobManager -
> Discard message
> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
> 20:56:09 Job execution switched to status RESTARTING.) because the
> expected leader session ID None did not equal the received leader
> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
> 2016-07-27 20:56:19,218 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
> - Recovering checkpoints from ZooKeeper.
> 2016-07-27 20:56:19,218 WARN
> org.apache.flink.runtime.jobmanager.JobManager -
> Discard message
> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
> 20:56:19 Job execution switched to status CREATED.) because the
> expected leader session ID None did not equal the received leader
> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
> 2016-07-27 20:56:19,219 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
> - Found 1 checkpoints in ZooKeeper.
> 2016-07-27 20:56:19,221 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
> - Initialized with Checkpoint 40229 @ 1469620528216 for
> 978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints.
> 2016-07-27 20:56:19,222 WARN
> org.apache.flink.runtime.jobmanager.JobManager -
> Discard message
> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
> 20:56:19 Job execution switched to status RUNNING.) because the
> expected leader session ID None did not equal the received leader
> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
> 2016-07-27 20:56:19,222 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
> switched from CREATED to SCHEDULED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
> switched from SCHEDULED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 WARN
> org.apache.flink.runtime.jobmanager.JobManager -
> Discard message
> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
> 20:56:19 Job execution switched to status FAILING.) because the
> expected leader session ID None did not equal the received leader
> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
> Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to
> CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
> Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to
> CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
> Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to
> CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> TriggerWindow(GlobalWindows(),
> FoldingStateDescriptor{name=window-contents,
> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
> Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> TriggerWindow(GlobalWindows(),
> FoldingStateDescriptor{name=window-contents,
> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
> Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> TriggerWindow(GlobalWindows(),
> FoldingStateDescriptor{name=window-contents,
> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
> Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy
> - Delaying retry of job execution for 10000 ms ...
> 2016-07-27 20:56:19,223 WARN
> org.apache.flink.runtime.jobmanager.JobManager -
> Discard message
> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
> 20:56:19 Job execution switched to status RESTARTING.) because the
> expected leader session ID None did not equal the received leader
> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
> ----
>
> Could anyone advise me why this happens and how I can recover from
> this situation? (restart JobManager?)
>
> Regards,
> Hironori Ogibayashi

Hironori Ogibayashi

Re: JobManager trying to re-submit jobs after failover

Thank you so much for your quick response.
I am running Flink 1.0.3.

I have attached jobmanager logs. The failover happend around 7/26 21:13 JST.

Regards,
Hironori

2016-07-27 21:21 GMT+09:00 Ufuk Celebi <[hidden email]>:

> Which version of Flink are you running on? I think this might have
> been fixed for the 1.1 release
> (http://people.apache.org/~uce/flink-1.1.0-rc1/).
>
> It looks like the ExecutionGraph is still trying to restart although
> the JobManager is not the leader anymore. If you could provide the
> complete logs of both JobManagers, that would be helpful to be sure
> what is happening.
>
> You can recover from this by restarting the respective JobManager
> process (by running "jobmanager.sh stop" script on that machine and
> starting again via "jobmanager.sh start cluster") .
>
> – Ufuk
>
> On Wed, Jul 27, 2016 at 2:00 PM, Hironori Ogibayashi
> <[hidden email]> wrote:
>> Hello,
>>
>> I have standalone Flink cluster with JobManager HA.
>> Last night, JobManager failovered because of the connection timeout to
>> Zookeeper.
>> Job is successfully running under new leader JobManager, but when
>> I see the old leader JobManager log, it is trying to re-submit job and
>> getting errors. ( for almost 24 hours now)
>>
>> Here is the log.
>>
>> -----
>> 2016-07-27 20:56:09,218 WARN
>> org.apache.flink.runtime.jobmanager.JobManager -
>> Discard message
>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>> 20:56:09 Job execution switched to status RESTARTING.) because the
>> expected leader session ID None did not equal the received leader
>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>> 2016-07-27 20:56:19,218 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>> - Recovering checkpoints from ZooKeeper.
>> 2016-07-27 20:56:19,218 WARN
>> org.apache.flink.runtime.jobmanager.JobManager -
>> Discard message
>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>> 20:56:19 Job execution switched to status CREATED.) because the
>> expected leader session ID None did not equal the received leader
>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>> 2016-07-27 20:56:19,219 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>> - Found 1 checkpoints in ZooKeeper.
>> 2016-07-27 20:56:19,221 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>> - Initialized with Checkpoint 40229 @ 1469620528216 for
>> 978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints.
>> 2016-07-27 20:56:19,222 WARN
>> org.apache.flink.runtime.jobmanager.JobManager -
>> Discard message
>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>> 20:56:19 Job execution switched to status RUNNING.) because the
>> expected leader session ID None did not equal the received leader
>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>> 2016-07-27 20:56:19,222 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
>> switched from CREATED to SCHEDULED
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
>> switched from SCHEDULED to CANCELED
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>> Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c)
>> switched from CREATED to CANCELED
>> 2016-07-27 20:56:19,223 WARN
>> org.apache.flink.runtime.jobmanager.JobManager -
>> Discard message
>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>> 20:56:19 Job execution switched to status FAILING.) because the
>> expected leader session ID None did not equal the received leader
>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>> Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc)
>> switched from CREATED to CANCELED
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>> Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503)
>> switched from CREATED to CANCELED
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>> Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778)
>> switched from CREATED to CANCELED
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>> Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc)
>> switched from CREATED to CANCELED
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
>> Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to
>> CANCELED
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
>> Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to
>> CANCELED
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
>> Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to
>> CANCELED
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>> TriggerWindow(GlobalWindows(),
>> FoldingStateDescriptor{name=window-contents,
>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
>> Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152)
>> switched from CREATED to CANCELED
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>> TriggerWindow(GlobalWindows(),
>> FoldingStateDescriptor{name=window-contents,
>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
>> Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9)
>> switched from CREATED to CANCELED
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>> TriggerWindow(GlobalWindows(),
>> FoldingStateDescriptor{name=window-contents,
>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
>> Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec)
>> switched from CREATED to CANCELED
>> 2016-07-27 20:56:19,223 INFO
>> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy
>> - Delaying retry of job execution for 10000 ms ...
>> 2016-07-27 20:56:19,223 WARN
>> org.apache.flink.runtime.jobmanager.JobManager -
>> Discard message
>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>> 20:56:19 Job execution switched to status RESTARTING.) because the
>> expected leader session ID None did not equal the received leader
>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>> ----
>>
>> Could anyone advise me why this happens and how I can recover from
>> this situation? (restart JobManager?)
>>
>> Regards,
>> Hironori Ogibayashi

jobmanager_log.tgz (5M) Download Attachment

Ufuk Celebi

Re: JobManager trying to re-submit jobs after failover

Thanks for the logs. Looking through them it's caused by this issue:
https://issues.apache.org/jira/browse/FLINK-3800. The ExecutionGraph
(Flink's internal scheduling structure) is not terminated properly and
tries to restart the job over and over again.

This is fixed for 1.1.0. Is it an option for you to upgrade to 1.1
when it's out? We might need to backport this fix for 1.0.4. The work
around is as I've described, just restart jobmanager-5.

On Wed, Jul 27, 2016 at 2:55 PM, Hironori Ogibayashi
<[hidden email]> wrote:

> Thank you so much for your quick response.
> I am running Flink 1.0.3.
>
> I have attached jobmanager logs. The failover happend around 7/26 21:13 JST.
>
> Regards,
> Hironori
>
> 2016-07-27 21:21 GMT+09:00 Ufuk Celebi <[hidden email]>:
>> Which version of Flink are you running on? I think this might have
>> been fixed for the 1.1 release
>> (http://people.apache.org/~uce/flink-1.1.0-rc1/).
>>
>> It looks like the ExecutionGraph is still trying to restart although
>> the JobManager is not the leader anymore. If you could provide the
>> complete logs of both JobManagers, that would be helpful to be sure
>> what is happening.
>>
>> You can recover from this by restarting the respective JobManager
>> process (by running "jobmanager.sh stop" script on that machine and
>> starting again via "jobmanager.sh start cluster") .
>>
>> – Ufuk
>>
>> On Wed, Jul 27, 2016 at 2:00 PM, Hironori Ogibayashi
>> <[hidden email]> wrote:
>>> Hello,
>>>
>>> I have standalone Flink cluster with JobManager HA.
>>> Last night, JobManager failovered because of the connection timeout to
>>> Zookeeper.
>>> Job is successfully running under new leader JobManager, but when
>>> I see the old leader JobManager log, it is trying to re-submit job and
>>> getting errors. ( for almost 24 hours now)
>>>
>>> Here is the log.
>>>
>>> -----
>>> 2016-07-27 20:56:09,218 WARN
>>> org.apache.flink.runtime.jobmanager.JobManager -
>>> Discard message
>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>> 20:56:09 Job execution switched to status RESTARTING.) because the
>>> expected leader session ID None did not equal the received leader
>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>> 2016-07-27 20:56:19,218 INFO
>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>> - Recovering checkpoints from ZooKeeper.
>>> 2016-07-27 20:56:19,218 WARN
>>> org.apache.flink.runtime.jobmanager.JobManager -
>>> Discard message
>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>> 20:56:19 Job execution switched to status CREATED.) because the
>>> expected leader session ID None did not equal the received leader
>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>> 2016-07-27 20:56:19,219 INFO
>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>> - Found 1 checkpoints in ZooKeeper.
>>> 2016-07-27 20:56:19,221 INFO
>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>> - Initialized with Checkpoint 40229 @ 1469620528216 for
>>> 978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints.
>>> 2016-07-27 20:56:19,222 WARN
>>> org.apache.flink.runtime.jobmanager.JobManager -
>>> Discard message
>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>> 20:56:19 Job execution switched to status RUNNING.) because the
>>> expected leader session ID None did not equal the received leader
>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>> 2016-07-27 20:56:19,222 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
>>> switched from CREATED to SCHEDULED
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
>>> switched from SCHEDULED to CANCELED
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>> Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c)
>>> switched from CREATED to CANCELED
>>> 2016-07-27 20:56:19,223 WARN
>>> org.apache.flink.runtime.jobmanager.JobManager -
>>> Discard message
>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>> 20:56:19 Job execution switched to status FAILING.) because the
>>> expected leader session ID None did not equal the received leader
>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>> Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc)
>>> switched from CREATED to CANCELED
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>> Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503)
>>> switched from CREATED to CANCELED
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>> Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778)
>>> switched from CREATED to CANCELED
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>> Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc)
>>> switched from CREATED to CANCELED
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
>>> Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to
>>> CANCELED
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
>>> Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to
>>> CANCELED
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
>>> Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to
>>> CANCELED
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>> TriggerWindow(GlobalWindows(),
>>> FoldingStateDescriptor{name=window-contents,
>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
>>> Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152)
>>> switched from CREATED to CANCELED
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>> TriggerWindow(GlobalWindows(),
>>> FoldingStateDescriptor{name=window-contents,
>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
>>> Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9)
>>> switched from CREATED to CANCELED
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>> TriggerWindow(GlobalWindows(),
>>> FoldingStateDescriptor{name=window-contents,
>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
>>> Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec)
>>> switched from CREATED to CANCELED
>>> 2016-07-27 20:56:19,223 INFO
>>> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy
>>> - Delaying retry of job execution for 10000 ms ...
>>> 2016-07-27 20:56:19,223 WARN
>>> org.apache.flink.runtime.jobmanager.JobManager -
>>> Discard message
>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>> 20:56:19 Job execution switched to status RESTARTING.) because the
>>> expected leader session ID None did not equal the received leader
>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>> ----
>>>
>>> Could anyone advise me why this happens and how I can recover from
>>> this situation? (restart JobManager?)
>>>
>>> Regards,
>>> Hironori Ogibayashi

Hironori Ogibayashi

Re: JobManager trying to re-submit jobs after failover

Thank you for telling me about the cause.
It recovered by restarting jobmanager-5 and jobmanager-1.

I restart jobmanager-1 because when I restarted jobmanager-5 ,
checkpointing started to
fail with the following message.

----
2016-07-28 10:42:28,217 WARN
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Failed
to trigger checkpoint (19 consecutive failed attempts so far)
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode
= NoNode for /flink/flink_prod/checkpoint-counter/978ef000cca5a3aa6f3461274102f82c
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:274)
at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:270)
at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:267)
at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:253)
at org.apache.flink.shaded.org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:41)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.shared.SharedValue.trySetValue(SharedValue.java:168)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.shared.SharedCount.trySetCount(SharedCount.java:111)
at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.getAndIncrement(ZooKeeperCheckpointIDCounter.java:121)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:411)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:339)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$ScheduledTrigger.run(CheckpointCoordinator.java:928)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
----

Anyway, thank you so much for your advice.
I think it would be great if the fix will be backported to 1.0.4.

Regards,
Hionori

2016-07-28 0:08 GMT+09:00 Ufuk Celebi <[hidden email]>:

> Thanks for the logs. Looking through them it's caused by this issue:
> https://issues.apache.org/jira/browse/FLINK-3800. The ExecutionGraph
> (Flink's internal scheduling structure) is not terminated properly and
> tries to restart the job over and over again.
>
> This is fixed for 1.1.0. Is it an option for you to upgrade to 1.1
> when it's out? We might need to backport this fix for 1.0.4. The work
> around is as I've described, just restart jobmanager-5.
>
>
>
> On Wed, Jul 27, 2016 at 2:55 PM, Hironori Ogibayashi
> <[hidden email]> wrote:
>> Thank you so much for your quick response.
>> I am running Flink 1.0.3.
>>
>> I have attached jobmanager logs. The failover happend around 7/26 21:13 JST.
>>
>> Regards,
>> Hironori
>>
>> 2016-07-27 21:21 GMT+09:00 Ufuk Celebi <[hidden email]>:
>>> Which version of Flink are you running on? I think this might have
>>> been fixed for the 1.1 release
>>> (http://people.apache.org/~uce/flink-1.1.0-rc1/).
>>>
>>> It looks like the ExecutionGraph is still trying to restart although
>>> the JobManager is not the leader anymore. If you could provide the
>>> complete logs of both JobManagers, that would be helpful to be sure
>>> what is happening.
>>>
>>> You can recover from this by restarting the respective JobManager
>>> process (by running "jobmanager.sh stop" script on that machine and
>>> starting again via "jobmanager.sh start cluster") .
>>>
>>> – Ufuk
>>>
>>> On Wed, Jul 27, 2016 at 2:00 PM, Hironori Ogibayashi
>>> <[hidden email]> wrote:
>>>> Hello,
>>>>
>>>> I have standalone Flink cluster with JobManager HA.
>>>> Last night, JobManager failovered because of the connection timeout to
>>>> Zookeeper.
>>>> Job is successfully running under new leader JobManager, but when
>>>> I see the old leader JobManager log, it is trying to re-submit job and
>>>> getting errors. ( for almost 24 hours now)
>>>>
>>>> Here is the log.
>>>>
>>>> -----
>>>> 2016-07-27 20:56:09,218 WARN
>>>> org.apache.flink.runtime.jobmanager.JobManager -
>>>> Discard message
>>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>>> 20:56:09 Job execution switched to status RESTARTING.) because the
>>>> expected leader session ID None did not equal the received leader
>>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>>> 2016-07-27 20:56:19,218 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>> - Recovering checkpoints from ZooKeeper.
>>>> 2016-07-27 20:56:19,218 WARN
>>>> org.apache.flink.runtime.jobmanager.JobManager -
>>>> Discard message
>>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>>> 20:56:19 Job execution switched to status CREATED.) because the
>>>> expected leader session ID None did not equal the received leader
>>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>>> 2016-07-27 20:56:19,219 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>> - Found 1 checkpoints in ZooKeeper.
>>>> 2016-07-27 20:56:19,221 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>> - Initialized with Checkpoint 40229 @ 1469620528216 for
>>>> 978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints.
>>>> 2016-07-27 20:56:19,222 WARN
>>>> org.apache.flink.runtime.jobmanager.JobManager -
>>>> Discard message
>>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>>> 20:56:19 Job execution switched to status RUNNING.) because the
>>>> expected leader session ID None did not equal the received leader
>>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>>> 2016-07-27 20:56:19,222 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
>>>> switched from CREATED to SCHEDULED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>>> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
>>>> switched from SCHEDULED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>>> Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 WARN
>>>> org.apache.flink.runtime.jobmanager.JobManager -
>>>> Discard message
>>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>>> 20:56:19 Job execution switched to status FAILING.) because the
>>>> expected leader session ID None did not equal the received leader
>>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>>> Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>>> Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>>> Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>>> Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
>>>> Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to
>>>> CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
>>>> Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to
>>>> CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat
>>>> Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to
>>>> CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>>> TriggerWindow(GlobalWindows(),
>>>> FoldingStateDescriptor{name=window-contents,
>>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
>>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
>>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
>>>> Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>>> TriggerWindow(GlobalWindows(),
>>>> FoldingStateDescriptor{name=window-contents,
>>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
>>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
>>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
>>>> Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>>> TriggerWindow(GlobalWindows(),
>>>> FoldingStateDescriptor{name=window-contents,
>>>> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
>>>> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
>>>> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
>>>> Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec)
>>>> switched from CREATED to CANCELED
>>>> 2016-07-27 20:56:19,223 INFO
>>>> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy
>>>> - Delaying retry of job execution for 10000 ms ...
>>>> 2016-07-27 20:56:19,223 WARN
>>>> org.apache.flink.runtime.jobmanager.JobManager -
>>>> Discard message
>>>> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
>>>> 20:56:19 Job execution switched to status RESTARTING.) because the
>>>> expected leader session ID None did not equal the received leader
>>>> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
>>>> ----
>>>>
>>>> Could anyone advise me why this happens and how I can recover from
>>>> this situation? (restart JobManager?)
>>>>
>>>> Regards,
>>>> Hironori Ogibayashi