Flink recovers even cancelled jobs after zookeeper failure

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Flink recovers even cancelled jobs after zookeeper failure

Yunus Olgun
Hi,
We are using Flink 1.4.0 at zookeeper high availability mode and with externalized checkpoints. Today after we have restarted a zookeeper node, several Flink clusters have lost connection to the zookeeper. This triggered a leader election at effected clusters. After the leader election, the new leader recovered running jobs. Up to this point, everything is good.

Then the new leader also has restarted old canceled jobs. This seems to be a bug and caused problems. A snippet from the logs is at the bottom of this email.

Is this a known problem? If it helps, we usually cancel through Flink UI. Also, any ideas on how to prevent such a thing from happening again are appreciated.

Regards,
Yunus


2018-05-15 16:31:50,332 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Unable to read additional data from server sessionid 0x15f58694b13bcda, likely server has closed socket, closing socket connection and attempting reconnect
2018-05-15 16:31:50,434 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: SUSPENDED
2018-05-15 16:31:50,434 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper suspended. The contender akka.tcp://flink@ino10:44855
1/user/jobmanager no longer participates in the leader election.
2018-05-15 16:31:50,434 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from
ZooKeeper.
2018-05-15 16:31:50,434 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from
ZooKeeper.
2018-05-15 16:31:50,434 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monn
itored (temporarily).
2018-05-15 16:31:50,496 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS conff
iguration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5118766953326230150.conf'. Will continue connection to Zookeeper server without SASL autthentication, if Zookeeper server allows it.
2018-05-15 16:31:50,496 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server nebula-0.xxxxx.lan/172.30.0.100:2181
2018-05-15 16:31:50,496 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-05-15 16:31:50,496 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to nebula-0.xxxxx.lan/172.30.0.100:2181, initiatt
ing session
2018-05-15 16:31:50,498 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server nebula-0.xxxxx.lan/172.30.0.100:2181,,
 sessionid = 0x15f58694b13bcda, negotiated timeout = 60000
onfiguration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5118766953326230150.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-05-15 16:31:50,496 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server nebula-0.xxxxx.lan/172.30.0.100:2181
2018-05-15 16:31:50,496 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-05-15 16:31:50,496 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to nebula-0.xxxxx.lan/172.30.0.100:2181, initiating session
2018-05-15 16:31:50,498 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server nebula-0.xxxxx.lan/172.30.0.100:2181, sessionid = 0x15f58694b13bcda, negotiated timeout = 60000
2018-05-15 16:31:50,498 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: RECONNECTED
2018-05-15 16:31:50,498 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-05-15 16:31:50,499 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-05-15 16:31:50,499 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-05-15 16:31:50,499 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
2018-05-15 16:31:51,364 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@ino10:44851/user/jobmanager was granted leadership with leader session ID Some(0c6e1947-c2a6-45cf-b4c6-e006357affb6).
2018-05-15 16:31:51,372 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Delaying recovery of all jobs by 10000 milliseconds.
2018-05-15 16:31:51,375 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - Trying to associate with JobManager leader akka.tcp://flink@ino10:44851/user/jobmanager
2018-05-15 16:31:51,375 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#692352189] - leader session 0c6e1947-c2a6-45cf-b4c6-e006357affb6
2018-05-15 16:31:51,399 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - TaskManager 09d5caf33316cec17d6f2d089ed5bd94 has started.
2018-05-15 16:31:51,399 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at ino11 (akka.tcp://flink@ino11:42603/user/taskmanager) as dfe46e4854ee1e2f5072783d61cfbb81. Current number of registered hosts is 1. Current number of alive task slots is 8.
2018-05-15 16:31:51,401 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at ino11 (akka.tcp://flink@ino11:45607/user/taskmanager) as ca2881ee6ab11f9b5ffb7acb249494f5. Current number of registered hosts is 2. Current number of alive task slots is 16.
2018-05-15 16:31:51,402 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - TaskManager 659642a31157802db9de994f7d6669e5 has started.
2018-05-15 16:31:51,413 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - TaskManager 515a0db8bd16ed1dfc23332cebafe92f has started.
2018-05-15 16:31:51,413 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at ino10 (akka.tcp://flink@ino10:33003/user/taskmanager) as 30d3491cf7edf3c48bc9498558052e63. Current number of registered hosts is 3. Current number of alive task slots is 24.
2018-05-15 16:31:51,430 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - TaskManager b938f5ba1e40e17f60bc64e9bc39d79a has started.
2018-05-15 16:31:51,430 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at ino10 (akka.tcp://flink@ino10:40339/user/taskmanager) as 0b01240b5169f94d2b870a3fd25f41df. Current number of registered hosts is 4. Current number of alive task slots is 32.
2018-05-15 16:32:01,389 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Attempting to recover all jobs.

2018-05-15 16:32:01,391 INFO  org.apache.flink.runtime.jobmanager.JobManager                - There are 7 jobs to recover. Starting the job recovery.
2018-05-15 16:32:01,391 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Attempting to recover job cd8ba3012f3eefcae4dea2f1487fe0e2.
2018-05-15 16:32:01,391 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Attempting to recover job 8fdf7c630f47836fc2ce615fff96dada.
2018-05-15 16:32:01,393 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Attempting to recover job 6b1117a5b7873d0cf68e084404ba750b.
2018-05-15 16:32:01,394 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Attempting to recover job cca8fe39a515c63d407c83232e009bd8.
2018-05-15 16:32:01,395 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Attempting to recover job 5fd8f4d97c91c4c66c6748457ab8b66b.
2018-05-15 16:32:01,396 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Attempting to recover job d87590afbb9fb8bc24a4ef38729a9967.
2018-05-15 16:32:01,396 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Attempting to recover job b74b14696f637126b60a079b7a7d4710.