http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Flink-1-2-1-JobManager-Election-Deadlock-tp15427p15456.html
related to Curator. Very valuable information!
> Thanks for informing us. As far as I know, we were not aware of any deadlock
> in the JobManager election. Let's hope that the updated Curator version
> fixed the problem. We will defenitely keep an eye on this. Feel free to
> contact the dev@ mailing list, if the problem still exists in 1.3.2.
>
> Regards,
> Timo
>
> Am 06.09.17 um 22:58 schrieb James Bucher:
>
> It seems I forgot to attach the full logs. They should be attached now.
>
> From: James Bucher <
[hidden email]>
> Date: Wednesday, September 6, 2017 at 10:43 AM
> To: "
[hidden email]" <
[hidden email]>
> Subject: Flink 1.2.1 JobManager Election Deadlock
>
> Hey all,
>
> Just wanted to report this for posterity in case someone else sees something
> similar. We were running Flink 1.2.1 in Kubernetes. We use an HA setup with
> an external Zookeeper and S3 for checkpointing. We recently noticed a job
> that appears to have deadlocked on JobManager Leader election. We think the
> issue happened something like the following:
>
> Job was up and running normally
> Some cluster event caused the JobManager Pod (process) to get restarted.
> JobManager came up again but got stuck on LeaderElection. At this time the
> JobManager UI sent back a response with "Service temporarily unavailable due
> to an ongoing leader election. Please refresh."
> JobManager never exited this state.
> This state persisted across Pod restarts/deletes.
>
> In order to try to pin down this problem we brought up the Job in another
> Flink Cluster and started debugging this issue. As a first step I upped the
> logging level on the JobManager and applied the change. This resulted in the
> following log (Full logs attached to this email):
>
> 2017-09-05 18:50:55,072 TRACE
> org.apache.flink.shaded.org.apache.curator.utils.DefaultTracerDriver -
> Trace: GetDataBuilderImpl-Background - 4 ms
>
> 2017-09-05 18:50:55,075 DEBUG
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore -
> Received CHILD_ADDED event (path: /4e4cdc8fb8c1437c620cd4063bd265e1)
>
> 2017-09-05 18:50:55,088 DEBUG
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore -
> Received CHILD_ADDED event notification for job
> 4e4cdc8fb8c1437c620cd4063bd265e1
>
> 2017-09-05 18:50:56,072 DEBUG org.apache.zookeeper.ClientCnxn
> - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null
> serverPath:null finished:false header:: 1,3 replyHeader:: 1,21476396435,0
> request:: '/traveler-profile.us-west-2c.test.expedia.com,F response::
> s{4305881524,4305881524,1497456679255,1497456679255,0,3,0,0,0,3,21475572278}
>
> 2017-09-05 18:50:56,078 DEBUG org.apache.zookeeper.ClientCnxn
> - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null
> serverPath:null finished:false header:: 2,3 replyHeader:: 2,21476396435,0
> request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue,F
> response::
> s{4305881525,4305881525,1497456679260,1497456679260,0,1,0,0,0,1,4305881526}
>
> 2017-09-05 18:50:56,079 DEBUG org.apache.zookeeper.ClientCnxn
> - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null
> serverPath:null finished:false header:: 3,3 replyHeader:: 3,21476396435,0
> request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip,F
> response::
> s{4305881526,4305881526,1497456679263,1497456679263,0,1,0,0,0,1,4305881527}
>
> 2017-09-05 18:50:56,080 DEBUG org.apache.zookeeper.ClientCnxn
> - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null
> serverPath:null finished:false header:: 4,3 replyHeader:: 4,21476396435,0
> request::
> '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink,F
> response::
> s{4305881527,4305881527,1497456679267,1497456679267,0,1,0,0,0,1,4305881528}
>
> 2017-09-05 18:50:56,081 DEBUG org.apache.zookeeper.ClientCnxn
> - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null
> serverPath:null finished:false header:: 5,3 replyHeader:: 5,21476396435,0
> request::
> '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default,F
> response::
> s{4305881528,4305881528,1497456679270,1497456679270,0,27,0,0,0,5,21475449005}
>
> 2017-09-05 18:50:56,085 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Starting ZooKeeperLeaderRetrievalService.
>
> 2017-09-05 18:50:56,087 DEBUG org.apache.zookeeper.ClientCnxn
> - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null
> serverPath:null finished:false header:: 6,3 replyHeader:: 6,21476396435,0
> request:: '/traveler-profile.us-west-2c.test.expedia.com,F response::
> s{4305881524,4305881524,1497456679255,1497456679255,0,3,0,0,0,3,21475572278}
>
> 2017-09-05 18:50:56,087 DEBUG org.apache.zookeeper.ClientCnxn
> - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null
> serverPath:null finished:false header:: 7,3 replyHeader:: 7,21476396435,0
> request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue,F
> response::
> s{4305881525,4305881525,1497456679260,1497456679260,0,1,0,0,0,1,4305881526}
>
> 2017-09-05 18:50:56,088 DEBUG org.apache.zookeeper.ClientCnxn
> - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null
> serverPath:null finished:false header:: 8,3 replyHeader:: 8,21476396435,0
> request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip,F
> response::
> s{4305881526,4305881526,1497456679263,1497456679263,0,1,0,0,0,1,4305881527}
>
> 2017-09-05 18:50:56,090 DEBUG org.apache.zookeeper.ClientCnxn
> - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null
> serverPath:null finished:false header:: 9,3 replyHeader:: 9,21476396435,0
> request::
> '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink,F
> response::
> s{4305881527,4305881527,1497456679267,1497456679267,0,1,0,0,0,1,4305881528}
>
> 2017-09-05 18:50:56,091 DEBUG org.apache.zookeeper.ClientCnxn
> - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null
> serverPath:null finished:false header:: 10,3 replyHeader:: 10,21476396435,0
> request::
> '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default,F
> response::
> s{4305881528,4305881528,1497456679270,1497456679270,0,27,0,0,0,5,21475449005}
>
> 2017-09-05 18:50:56,092 DEBUG org.apache.zookeeper.ClientCnxn
> - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
> clientPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader
> serverPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader
> finished:false header:: 11,3 replyHeader:: 11,21476396435,0 request::
> '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader,T
> response::
> s{21475449005,21475449005,1503519415753,1503519415753,0,1,0,0,0,1,21475449006}
>
> 2017-09-05 18:50:56,092 TRACE
> org.apache.flink.shaded.org.apache.curator.utils.DefaultTracerDriver -
> Trace: ExistsBuilderImpl-Background - 1 ms
>
> 2017-09-05 18:50:56,093 DEBUG org.apache.zookeeper.ClientCnxn
> - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
> clientPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader
> serverPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader
> finished:false header:: 12,4 replyHeader:: 12,21476396435,0 request::
> '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader,T
> response::
> ,s{21475449005,21475449005,1503519415753,1503519415753,0,1,0,0,0,1,21475449006}
>
> 2017-09-05 18:50:56,093 TRACE
> org.apache.flink.shaded.org.apache.curator.utils.DefaultTracerDriver -
> Trace: GetDataBuilderImpl-Background - 1 ms
>
> 2017-09-05 18:50:56,093 DEBUG
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Leader node has changed.
>
>
> At this point I dug into the logs to try and see what was going on. It
> quickly became apparent that this probably had something to do with Apache
> Curator which Flink relies upon to do the leader election. At this point I
> grabbed a dump of zookeeper using:
https://github.com/ctapmex/zkTreeUtil:>
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <root
> path="/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink">
> <zknode name="default">
> <zknode name="leaderlatch">
> <zknode name="00000000000000000000000000000000">
> <zknode name="job_manager_lock"/>
> </zknode>
> <zknode ephemeral="true"
> name="_c_9cd3f122-65de-40e7-8e2b-aa34e6a85b83-latch-0000000020"/>
> </zknode>
> <zknode name="checkpoint-counter">
> <zknode name="4e4cdc8fb8c1437c620cd4063bd265e1"
> value="���"/>
> </zknode>
> <zknode name="checkpoints">
> <zknode name="4e4cdc8fb8c1437c620cd4063bd265e1"/>
> </zknode>
> <zknode name="jobgraphs">
> <zknode name="4e4cdc8fb8c1437c620cd4063bd265e1"
> value="���sr�;org.apache.flink.runtime.state.RetrievableStreamStateHandle��U�+�L�wrappedStreamStateHandlet�2Lorg/apache/flink/runtime/state/StreamStateHandle;xpsr�9org.apache.flink.runtime.state.filesystem.FileStateHandle�u�b���J�	stateSizeL�filePatht�Lorg/apache/flink/core/fs/Path;xp�����h�sr�org.apache.flink.core.fs.Path��������L�urit�Ljava/net/URI;xpsr�java.net.URI�x.C�I��L�stringt�Ljava/lang/String;xpt��s3a://ewetest-traveler-profile/client-state/krazyglue/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/zookeeperCheckpoints/submittedJobGraph6507f982fe76x"/>
> </zknode>
> <zknode name="leader">
> <zknode name="00000000000000000000000000000000"/>
> </zknode>
> </zknode>
> </root>
>
> In a healthy cluster there is generally a node under “job_manager_lock”
> which represents the elected leader. This pointed to a problem with the
> Apache Curator framework. I opened up both Flink 1.2. And 1.3.2 to see what
> version of Curator Flink 1.2.1 and 1.3.2 used. I noticed that Flink 1.3.2
> uses Curator 2.12.0 whereas Flink 1.2.1 uses 2.8.0. Looking at the release
> notes for curator there were quite a few bugs might be the cause of this
> (
https://issues.apache.org/jira/browse/CURATOR-264 for example). I was
> curious to see if leader election would become unstuck if the cluster was
> upgraded to Flink 1.3.2. I upgraded the JobManager to 1.3.2 and sure enough
> leader election proceeded correctly. Upon Upgrading the TaskManagers the job
> restored from its last checkpoint correctly and continued.
>
> Other interesting things of note:
>
> Restarting/Killing the JobManager didn’t resolve the issue. I suspect this
> is due to bad state in zookeeper
> Zookeeper was running throughout this time. We didn’t see issues in our
> kafka cluster and Zookeeper has a leader elected correctly and the Zookeeper
> command line utility worked without issue.
> Restarts of the JobManager appeared to correctly create ephemeral nodes in
> zookeeper to represent possible leaders (which also appear to get cleaned up
> correctly).
>
> As far as I can tell this issue is resolved with the newer Apache Curator
> version in 1.3.2. It may be useful to keep an eye out for this issue
> however.
>
> Thanks,
> James Bucher
>
>
>