(DEPRECATED) Apache Flink User Mailing List archive.

Flink recovers even cancelled jobs after zookeeper failure

Classic

List

Threaded

1 message

Yunus Olgun

Flink recovers even cancelled jobs after zookeeper failure

Hi,

We are using Flink 1.4.0 at zookeeper high availability mode and with externalized checkpoints. Today after we have restarted a zookeeper node, several Flink clusters have lost connection to the zookeeper. This triggered a leader election at effected clusters. After the leader election, the new leader recovered running jobs. Up to this point, everything is good.

Then the new leader also has restarted old canceled jobs. This seems to be a bug and caused problems. A snippet from the logs is at the bottom of this email.

Is this a known problem? If it helps, we usually cancel through Flink UI. Also, any ideas on how to prevent such a thing from happening again are appreciated.

Regards,

Yunus

2018-05-15 16:31:50,332 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x15f58694b13bcda, likely server has closed socket, closing socket connection and attempting reconnect

2018-05-15 16:31:50,434 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED

2018-05-15 16:31:50,434 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@ino10:44855

1/user/jobmanager no longer participates in the leader election.

2018-05-15 16:31:50,434 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from

ZooKeeper.

2018-05-15 16:31:50,434 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from

ZooKeeper.

2018-05-15 16:31:50,434 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monn

itored (temporarily).

2018-05-15 16:31:50,496 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS conff

iguration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5118766953326230150.conf'. Will continue connection to Zookeeper server without SASL autthentication, if Zookeeper server allows it.

2018-05-15 16:31:50,496 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server nebula-0.xxxxx.lan/172.30.0.100:2181

2018-05-15 16:31:50,496 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed

2018-05-15 16:31:50,496 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to nebula-0.xxxxx.lan/172.30.0.100:2181, initiatt

ing session

2018-05-15 16:31:50,498 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server nebula-0.xxxxx.lan/172.30.0.100:2181,,

sessionid = 0x15f58694b13bcda, negotiated timeout = 60000

onfiguration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5118766953326230150.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.

2018-05-15 16:31:50,496 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server nebula-0.xxxxx.lan/172.30.0.100:2181

2018-05-15 16:31:50,496 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed

2018-05-15 16:31:50,496 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to nebula-0.xxxxx.lan/172.30.0.100:2181, initiating session

2018-05-15 16:31:50,498 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server nebula-0.xxxxx.lan/172.30.0.100:2181, sessionid = 0x15f58694b13bcda, negotiated timeout = 60000

2018-05-15 16:31:50,498 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED

2018-05-15 16:31:50,498 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.

2018-05-15 16:31:50,499 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.

2018-05-15 16:31:50,499 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.

2018-05-15 16:31:51,364 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@ino10:44851/user/jobmanager was granted leadership with leader session ID Some(0c6e1947-c2a6-45cf-b4c6-e006357affb6).

2018-05-15 16:31:51,372 INFO org.apache.flink.runtime.jobmanager.JobManager - Delaying recovery of all jobs by 10000 milliseconds.

2018-05-15 16:31:51,375 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink@ino10:44851/user/jobmanager

2018-05-15 16:31:51,375 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#692352189] - leader session 0c6e1947-c2a6-45cf-b4c6-e006357affb6

2018-05-15 16:31:51,399 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - TaskManager 09d5caf33316cec17d6f2d089ed5bd94 has started.

2018-05-15 16:31:51,399 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at ino11 (akka.tcp://flink@ino11:42603/user/taskmanager) as dfe46e4854ee1e2f5072783d61cfbb81. Current number of registered hosts is 1. Current number of alive task slots is 8.

2018-05-15 16:31:51,401 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at ino11 (akka.tcp://flink@ino11:45607/user/taskmanager) as ca2881ee6ab11f9b5ffb7acb249494f5. Current number of registered hosts is 2. Current number of alive task slots is 16.

2018-05-15 16:31:51,402 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - TaskManager 659642a31157802db9de994f7d6669e5 has started.

2018-05-15 16:31:51,413 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - TaskManager 515a0db8bd16ed1dfc23332cebafe92f has started.

2018-05-15 16:31:51,413 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at ino10 (akka.tcp://flink@ino10:33003/user/taskmanager) as 30d3491cf7edf3c48bc9498558052e63. Current number of registered hosts is 3. Current number of alive task slots is 24.

2018-05-15 16:31:51,430 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - TaskManager b938f5ba1e40e17f60bc64e9bc39d79a has started.

2018-05-15 16:31:51,430 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at ino10 (akka.tcp://flink@ino10:40339/user/taskmanager) as 0b01240b5169f94d2b870a3fd25f41df. Current number of registered hosts is 4. Current number of alive task slots is 32.

2018-05-15 16:32:01,389 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover all jobs.

2018-05-15 16:32:01,391 INFO org.apache.flink.runtime.jobmanager.JobManager - There are 7 jobs to recover. Starting the job recovery.

2018-05-15 16:32:01,391 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover job cd8ba3012f3eefcae4dea2f1487fe0e2.

2018-05-15 16:32:01,391 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover job 8fdf7c630f47836fc2ce615fff96dada.

2018-05-15 16:32:01,393 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover job 6b1117a5b7873d0cf68e084404ba750b.

2018-05-15 16:32:01,394 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover job cca8fe39a515c63d407c83232e009bd8.

2018-05-15 16:32:01,395 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover job 5fd8f4d97c91c4c66c6748457ab8b66b.

2018-05-15 16:32:01,396 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover job d87590afbb9fb8bc24a4ef38729a9967.

2018-05-15 16:32:01,396 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover job b74b14696f637126b60a079b7a7d4710.