Flink Jobmanager Failover in HA mode

Flink Jobmanager Failover in HA mode

Helmut Zechmann-2
Hi all,

We have a problem with Flink 1.5.2 high availability in standalone mode.

We have two JobManagers running. When I shut down the main JobManager, the failover (standby) JobManager hits a fatal error during failover.

Logs:


2018-08-17 14:38:16,478 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://[hidden email]:29095] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2018-08-17 14:38:31,449 WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [null] failed with java.net.ConnectException: Connection refused: seg-1.adjust.com/178.162.219.66:29095
2018-08-17 14:38:31,451 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://[hidden email]:29095] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[hidden email]:29095]] Caused by: [Connection refused: seg-1.adjust.com/178.162.219.66:29095]
2018-08-17 14:38:41,379 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler  - Could not retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://[hidden email]:29095/user/dispatcher#-1599908403]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        [... shortened ...]
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://[hidden email]:29095/user/dispatcher#-1599908403]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
        ... 9 more
2018-08-17 14:38:48,005 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://seg-2.adjust.com:8083 was granted leadership with leaderSessionID=708d1a64-c353-448b-9101-7eb3f910970e
2018-08-17 14:38:48,005 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://[hidden email]:30169/user/resourcemanager was granted leadership with fencing token 8de829de14876a367a80d37194b944ee
2018-08-17 14:38:48,006 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
2018-08-17 14:38:48,007 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://[hidden email]:30169/user/dispatcher was granted leadership with fencing token 684f50f8-327c-47e1-a53c-931c4f4ea3e5
2018-08-17 14:38:48,007 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2018-08-17 14:38:48,021 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(b951bbf518bcf6cc031be6d2ccc441bb, null).
2018-08-17 14:38:48,028 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(06ed64f48fa0a7cffde53b99cbaa073f, null).
2018-08-17 14:38:48,035 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
        at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
        [... shortened ...]
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
        at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
        at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936)
        at org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:291)
        at org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:281)
        at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:38)
        ... 21 more
Caused by: java.lang.Exception: Cannot set up the user code libraries: /var/lib/flink/ceph/prod/1.5-batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/blob_p-a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb (No such file or directory)
        at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:134)
        ... 25 more
Caused by: java.io.FileNotFoundException: /var/lib/flink/ceph/prod/1.5-batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/blob_p-a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        [... shortened ...]
        ... 25 more
2018-08-17 14:38:48,036 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Shutting down BLOB cache
2018-08-17 14:38:48,038 INFO  org.apache.flink.runtime.blob.BlobServer                      - Stopped BLOB server at 0.0.0.0:27073
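
To line up the recovered job graphs with the missing blob file, the log above can be summarized mechanically. The following is only a minimal sketch: the two regular expressions are modeled on the exact log lines quoted in this post, and the function name `summarize_failover` is made up for illustration.

```python
import re

# Patterns modeled on the log lines quoted above (an assumption about the
# log format, not an official Flink log specification).
RECOVERED = re.compile(r"Recovered SubmittedJobGraph\(([0-9a-f]{32}),")
MISSING = re.compile(
    r"(/\S*/blob/job_([0-9a-f]{32})/\S+) \(No such file or directory\)"
)


def summarize_failover(log_text):
    """Return (recovered job IDs, {job ID: [missing blob paths]})."""
    recovered = RECOVERED.findall(log_text)
    missing = {}
    for path, job_id in MISSING.findall(log_text):
        paths = missing.setdefault(job_id, [])
        # The same path repeats along the "Caused by" chain; keep it once.
        if path not in paths:
            paths.append(path)
    return recovered, missing
```

Run against the log above, this should report two recovered job graphs, of which only job b951bbf518bcf6cc031be6d2ccc441bb has a missing blob. So the standby did read both job graphs from ZooKeeper and failed only when loading the user code for the first one.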


Our HA config is:


high-availability: zookeeper
high-availability.cluster-id: 1.5-batch
high-availability.storageDir: file:///var/lib/flink/ceph//prod/1.5-batch/ha_state
high-availability.zookeeper.path.root: /1.5-batch
high-availability.zookeeper.quorum: kafka-4:2181,kafka-5:2181,kafka-6:2181
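
Judging from the path in the stack trace, the blobs appear to be looked up under `<storageDir>/<cluster-id>/blob/job_<jobId>`. That layout is inferred from the trace in this post, not taken from the Flink documentation. Under that assumption, a rough sketch for deriving the directory from the config above, so it can be checked on each JobManager host, might look like this (`parse_flink_conf` and `ha_blob_dir` are hypothetical helper names):

```python
import posixpath
from urllib.parse import urlparse


def parse_flink_conf(text):
    """Parse flat 'key: value' lines as they appear in the config above."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(": ")
        conf[key.strip()] = value.strip()
    return conf


def ha_blob_dir(conf, job_id):
    """Blob directory for a job, using the layout inferred from the trace."""
    uri = urlparse(conf["high-availability.storageDir"])
    if uri.scheme != "file":
        raise ValueError("this sketch only handles file:// storage directories")
    # normpath collapses the doubled slash ('ceph//prod') in the configured path
    root = posixpath.normpath(uri.path)
    cluster_id = conf["high-availability.cluster-id"]
    return posixpath.join(root, cluster_id, "blob", "job_" + job_id)
```

Calling `os.path.isdir(ha_blob_dir(conf, job_id))` on both JobManager hosts would then show whether the directory is visible from each of them. With a `file://` storage directory this only works if the path is a mount shared by both hosts.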


Any ideas what the problem might be here?


Best,

Helmut

Re: Flink Jobmanager Failover in HA mode

Dominik Wosiński
I have faced this issue, but in 1.4.0 IIRC. This seems to be related to https://issues.apache.org/jira/browse/FLINK-10011. What was the status of the jobs when the main JobManager was stopped?

2018-08-17 17:08 GMT+02:00 Helmut Zechmann <[hidden email]>:

Re: Flink Jobmanager Failover in HA mode

Helmut Zechmann-2
Hi Dominik,

All jobs on the cluster (batch-only jobs without state) were in status FINISHED.

Best,

Helmut

On Fri, Aug 17, 2018 at 8:04 PM Dominik Wosiński <[hidden email]> wrote: