(DEPRECATED) Apache Flink User Mailing List archive.

Issue with single job yarn flink cluster HA

Classic

List

Threaded

10 messages Options

Dinesh J

Issue with single job yarn flink cluster HA

Hi all,

We have single job yarn flink cluster setup with High Availability.

Sometimes job manager failure successfully restarts next attempt from current checkpoint.

But occasionally we are getting below error.

{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}

Hadoop version using : Hadoop 2.7.1.2.4.0.0-169
Flink version: flink-1.7.2
Zookeeper version: 3.4.6-169--1

Below is the flink configuration

high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s


Can someone let know if I am missing something or is it a known issue?

Is it something related to hostname ip mapping issue or zookeeper version issue?

Thanks,

Dinesh

Dinesh J

Re: Issue with single job yarn flink cluster HA

Attaching the job manager log for reference.

2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
2020-03-22 11:39:02,724 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,791 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,861 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,931 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,001 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,071 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,141 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,211 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,281 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,351 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,421 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]

Thanks,

Dinesh

On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <[hidden email]> wrote:

Hi all,

We have single job yarn flink cluster setup with High Availability.

Sometimes job manager failure successfully restarts next attempt from current checkpoint.

But occasionally we are getting below error.

{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}

Hadoop version using : Hadoop 2.7.1.2.4.0.0-169
Flink version: flink-1.7.2
Zookeeper version: 3.4.6-169--1

Below is the flink configuration

high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s


Can someone let know if I am missing something or is it a known issue?

Is it something related to hostname ip mapping issue or zookeeper version issue?

Thanks,

Dinesh

Andrey Zagrebin-5

Re: Issue with single job yarn flink cluster HA

Hi Dinesh,

If the current leader crashes (e.g. due to network failures) then getting these messages do not look like a problem during the leader re-election.

They look to me just as warnings that caused failover.

Do you observe any problem with your application? Does the failover not work, e.g. no leader is elected or a job is not restarted after the current leader failure?

Best,

Andrey

On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <[hidden email]> wrote:

Attaching the job manager log for reference.

2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
2020-03-22 11:39:02,724 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,791 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,861 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,931 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,001 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,071 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,141 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,211 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,281 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,351 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,421 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]

Thanks,
Dinesh
On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <[hidden email]> wrote:
Hi all,
We have single job yarn flink cluster setup with High Availability.
Sometimes job manager failure successfully restarts next attempt from current checkpoint.
But occasionally we are getting below error.
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
Hadoop version using : Hadoop 2.7.1.2.4.0.0-169
Flink version: flink-1.7.2
Zookeeper version: 3.4.6-169--1
Below is the flink configuration
high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s


Can someone let know if I am missing something or is it a known issue?
Is it something related to hostname ip mapping issue or zookeeper version issue?
Thanks,
Dinesh

Dinesh J

Re: Issue with single job yarn flink cluster HA

Hi Andrey,

Yes . The job is not restarting sometimes after the current leader failure.

Below is the message displayed when trying to reach the application master url via yarn ui and message remains the same even if the yarn job is running for 2 days.

During this time , even current yarn application attempt is not getting failed and no containers are launched for jobmanager and taskmanager.

{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}

Thanks,

Dinesh

On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <[hidden email]> wrote:

Hi Dinesh,

If the current leader crashes (e.g. due to network failures) then getting these messages do not look like a problem during the leader re-election.
They look to me just as warnings that caused failover.

Do you observe any problem with your application? Does the failover not work, e.g. no leader is elected or a job is not restarted after the current leader failure?

Best,
Andrey
On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <[hidden email]> wrote:
Attaching the job manager log for reference.

2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
2020-03-22 11:39:02,724 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,791 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,861 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,931 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,001 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,071 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,141 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,211 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,281 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,351 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,421 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]

Thanks,
Dinesh
On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <[hidden email]> wrote:
Hi all,
We have single job yarn flink cluster setup with High Availability.
Sometimes job manager failure successfully restarts next attempt from current checkpoint.
But occasionally we are getting below error.
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
Hadoop version using : Hadoop 2.7.1.2.4.0.0-169
Flink version: flink-1.7.2
Zookeeper version: 3.4.6-169--1
Below is the flink configuration
high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s


Can someone let know if I am missing something or is it a known issue?
Is it something related to hostname ip mapping issue or zookeeper version issue?
Thanks,
Dinesh

Yang Wang

Re: Issue with single job yarn flink cluster HA

Hi Dinesh,

First, i think the error message your provided is not a problem. It just indicates that the leader

election is still ongoing. When it finished, the new leader will start the a new dispatcher to provide

the webui and rest service.

From your jobmanager logs "Connection refused: host1/ipaddress1:28681", we could know that

the old jobmanager has failed. When a new jobmanager started, since the old jobmanager still

hold the lock of leader latch. So Flink tries to connect with it. After it tries few times, since the old

jobmanager zookeeper client do not update the leader latch, then the new jobmanager will elect

successfully and be the active leader. It is just how the leader election works.

In a nutshell, the root cause is old jobmanager crashed and it does not lose the leader immediately.

It is the by-design behavior.

If you really want to make the recovery faster, i think you could decrease "high-availability.zookeeper.client.connection-timeout"

and "high-availability.zookeeper.client.session-timeout". Please keep in mind that too small value

will also cause unexpected failover because of network problem.

Best,

Yang

Dinesh J <[hidden email]> 于2020年3月25日周三下午4:20写道：

Hi Andrey,
Yes . The job is not restarting sometimes after the current leader failure.
Below is the message displayed when trying to reach the application master url via yarn ui and message remains the same even if the yarn job is running for 2 days.
During this time , even current yarn application attempt is not getting failed and no containers are launched for jobmanager and taskmanager.

{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}

Thanks,
Dinesh
On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Dinesh,

If the current leader crashes (e.g. due to network failures) then getting these messages do not look like a problem during the leader re-election.
They look to me just as warnings that caused failover.

Do you observe any problem with your application? Does the failover not work, e.g. no leader is elected or a job is not restarted after the current leader failure?

Best,
Andrey
On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <[hidden email]> wrote:
Attaching the job manager log for reference.

2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
2020-03-22 11:39:02,724 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,791 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,861 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,931 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,001 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,071 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,141 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,211 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,281 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,351 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,421 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]

Thanks,
Dinesh
On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <[hidden email]> wrote:
Hi all,
We have single job yarn flink cluster setup with High Availability.
Sometimes job manager failure successfully restarts next attempt from current checkpoint.
But occasionally we are getting below error.
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
Hadoop version using : Hadoop 2.7.1.2.4.0.0-169
Flink version: flink-1.7.2
Zookeeper version: 3.4.6-169--1
Below is the flink configuration
high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s


Can someone let know if I am missing something or is it a known issue?
Is it something related to hostname ip mapping issue or zookeeper version issue?
Thanks,
Dinesh

Dinesh J

Re: Issue with single job yarn flink cluster HA

HI Yang,

Thanks for the clarification and suggestion. But my problem was that recovery never happens and the message "leader election ongoing" is what the message displayed forever.

Do you think increasing akka.ask.timeout and akka.tcp.timeout will help in case of a heavy/highload cluster as this issue happens mainly during heavy load in cluster?

Best,

Dinesh

On Mon, Mar 30, 2020 at 2:29 PM Yang Wang <[hidden email]> wrote:

Hi Dinesh,

First, i think the error message your provided is not a problem. It just indicates that the leader
election is still ongoing. When it finished, the new leader will start the a new dispatcher to provide
the webui and rest service.

From your jobmanager logs "Connection refused: host1/ipaddress1:28681", we could know that
the old jobmanager has failed. When a new jobmanager started, since the old jobmanager still
hold the lock of leader latch. So Flink tries to connect with it. After it tries few times, since the old
jobmanager zookeeper client do not update the leader latch, then the new jobmanager will elect
successfully and be the active leader. It is just how the leader election works.

In a nutshell, the root cause is old jobmanager crashed and it does not lose the leader immediately.
It is the by-design behavior.

If you really want to make the recovery faster, i think you could decrease "high-availability.zookeeper.client.connection-timeout"
and "high-availability.zookeeper.client.session-timeout". Please keep in mind that too small value
will also cause unexpected failover because of network problem.

Best,
Yang
Dinesh J <[hidden email]> 于2020年3月25日周三下午4:20写道：
Hi Andrey,
Yes . The job is not restarting sometimes after the current leader failure.
Below is the message displayed when trying to reach the application master url via yarn ui and message remains the same even if the yarn job is running for 2 days.
During this time , even current yarn application attempt is not getting failed and no containers are launched for jobmanager and taskmanager.

{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}

Thanks,
Dinesh
On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Dinesh,

If the current leader crashes (e.g. due to network failures) then getting these messages do not look like a problem during the leader re-election.
They look to me just as warnings that caused failover.

Do you observe any problem with your application? Does the failover not work, e.g. no leader is elected or a job is not restarted after the current leader failure?

Best,
Andrey
On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <[hidden email]> wrote:
Attaching the job manager log for reference.

2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
2020-03-22 11:39:02,724 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,791 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,861 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,931 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,001 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,071 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,141 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,211 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,281 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,351 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,421 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]

Thanks,
Dinesh
On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <[hidden email]> wrote:
Hi all,
We have single job yarn flink cluster setup with High Availability.
Sometimes job manager failure successfully restarts next attempt from current checkpoint.
But occasionally we are getting below error.
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
Hadoop version using : Hadoop 2.7.1.2.4.0.0-169
Flink version: flink-1.7.2
Zookeeper version: 3.4.6-169--1
Below is the flink configuration
high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s


Can someone let know if I am missing something or is it a known issue?
Is it something related to hostname ip mapping issue or zookeeper version issue?
Thanks,
Dinesh

Yang Wang

Re: Issue with single job yarn flink cluster HA

I think your problem is not about akka timeout. Increase the timeout could help in a

heavy load cluster, especially for the network is not very good. However, that is not

your case now.

I am not sure about the "never recovery". Do you mean the logs "Connection refused"

keep going and do not have other logs? How long does it stay in "leader election onging".

Usually, it takes at most 60s. Since if the old jobmanager crashed, then it will lose

the leadership after zookeeper session timeout. So when the new jobmanager always

could not grant the leadership, it may because of some problem of zookeeper.

Maybe you need to share the complete jobmanager logs so that we could know what

is happening in the jobmanager.

Best,

Yang

Dinesh J <[hidden email]> 于2020年3月31日周二上午3:46写道：

HI Yang,
Thanks for the clarification and suggestion. But my problem was that recovery never happens and the message "leader election ongoing" is what the message displayed forever.
Do you think increasing akka.ask.timeout and akka.tcp.timeout will help in case of a heavy/highload cluster as this issue happens mainly during heavy load in cluster?

Best,
Dinesh
On Mon, Mar 30, 2020 at 2:29 PM Yang Wang <[hidden email]> wrote:
Hi Dinesh,

First, i think the error message your provided is not a problem. It just indicates that the leader
election is still ongoing. When it finished, the new leader will start the a new dispatcher to provide
the webui and rest service.

From your jobmanager logs "Connection refused: host1/ipaddress1:28681", we could know that
the old jobmanager has failed. When a new jobmanager started, since the old jobmanager still
hold the lock of leader latch. So Flink tries to connect with it. After it tries few times, since the old
jobmanager zookeeper client do not update the leader latch, then the new jobmanager will elect
successfully and be the active leader. It is just how the leader election works.

In a nutshell, the root cause is old jobmanager crashed and it does not lose the leader immediately.
It is the by-design behavior.

If you really want to make the recovery faster, i think you could decrease "high-availability.zookeeper.client.connection-timeout"
and "high-availability.zookeeper.client.session-timeout". Please keep in mind that too small value
will also cause unexpected failover because of network problem.

Best,
Yang
Dinesh J <[hidden email]> 于2020年3月25日周三下午4:20写道：
Hi Andrey,
Yes . The job is not restarting sometimes after the current leader failure.
Below is the message displayed when trying to reach the application master url via yarn ui and message remains the same even if the yarn job is running for 2 days.
During this time , even current yarn application attempt is not getting failed and no containers are launched for jobmanager and taskmanager.

{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}

Thanks,
Dinesh
On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Dinesh,

If the current leader crashes (e.g. due to network failures) then getting these messages do not look like a problem during the leader re-election.
They look to me just as warnings that caused failover.

Do you observe any problem with your application? Does the failover not work, e.g. no leader is elected or a job is not restarted after the current leader failure?

Best,
Andrey
On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <[hidden email]> wrote:
Attaching the job manager log for reference.

2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
2020-03-22 11:39:02,724 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,791 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,861 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,931 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,001 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,071 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,141 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,211 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,281 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,351 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,421 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]

Thanks,
Dinesh
On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <[hidden email]> wrote:
Hi all,
We have single job yarn flink cluster setup with High Availability.
Sometimes job manager failure successfully restarts next attempt from current checkpoint.
But occasionally we are getting below error.
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
Hadoop version using : Hadoop 2.7.1.2.4.0.0-169
Flink version: flink-1.7.2
Zookeeper version: 3.4.6-169--1
Below is the flink configuration
high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s


Can someone let know if I am missing something or is it a known issue?
Is it something related to hostname ip mapping issue or zookeeper version issue?
Thanks,
Dinesh

Dinesh J

Re: Issue with single job yarn flink cluster HA

Hi Yang,

I am attaching one full jobmanager log for a job which I reran today. This a job that tries to read from savepoint.

Same error message "leader election onging" is displayed. And this stays the same even after 30 minutes. If I leave the job without yarn kill, it stays the same forever.

Based on your suggestions till now, I guess it might be some zookeeper problem. If that is the case, what can I lookout for in zookeeper to figure out the issue?

Thanks,

Dinesh

On Tue, Mar 31, 2020 at 7:42 AM Yang Wang <[hidden email]> wrote:

I think your problem is not about akka timeout. Increase the timeout could help in a
heavy load cluster, especially for the network is not very good. However, that is not
your case now.

I am not sure about the "never recovery". Do you mean the logs "Connection refused"
keep going and do not have other logs? How long does it stay in "leader election onging".
Usually, it takes at most 60s. Since if the old jobmanager crashed, then it will lose
the leadership after zookeeper session timeout. So when the new jobmanager always
could not grant the leadership, it may because of some problem of zookeeper.

Maybe you need to share the complete jobmanager logs so that we could know what
is happening in the jobmanager.

Best,
Yang
Dinesh J <[hidden email]> 于2020年3月31日周二上午3:46写道：
HI Yang,
Thanks for the clarification and suggestion. But my problem was that recovery never happens and the message "leader election ongoing" is what the message displayed forever.
Do you think increasing akka.ask.timeout and akka.tcp.timeout will help in case of a heavy/highload cluster as this issue happens mainly during heavy load in cluster?

Best,
Dinesh
On Mon, Mar 30, 2020 at 2:29 PM Yang Wang <[hidden email]> wrote:
Hi Dinesh,

First, i think the error message your provided is not a problem. It just indicates that the leader
election is still ongoing. When it finished, the new leader will start the a new dispatcher to provide
the webui and rest service.

From your jobmanager logs "Connection refused: host1/ipaddress1:28681", we could know that
the old jobmanager has failed. When a new jobmanager started, since the old jobmanager still
hold the lock of leader latch. So Flink tries to connect with it. After it tries few times, since the old
jobmanager zookeeper client do not update the leader latch, then the new jobmanager will elect
successfully and be the active leader. It is just how the leader election works.

In a nutshell, the root cause is old jobmanager crashed and it does not lose the leader immediately.
It is the by-design behavior.

If you really want to make the recovery faster, i think you could decrease "high-availability.zookeeper.client.connection-timeout"
and "high-availability.zookeeper.client.session-timeout". Please keep in mind that too small value
will also cause unexpected failover because of network problem.

Best,
Yang
Dinesh J <[hidden email]> 于2020年3月25日周三下午4:20写道：
Hi Andrey,
Yes . The job is not restarting sometimes after the current leader failure.
Below is the message displayed when trying to reach the application master url via yarn ui and message remains the same even if the yarn job is running for 2 days.
During this time , even current yarn application attempt is not getting failed and no containers are launched for jobmanager and taskmanager.

{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}

Thanks,
Dinesh
On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Dinesh,

If the current leader crashes (e.g. due to network failures) then getting these messages do not look like a problem during the leader re-election.
They look to me just as warnings that caused failover.

Do you observe any problem with your application? Does the failover not work, e.g. no leader is elected or a job is not restarted after the current leader failure?

Best,
Andrey
On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <[hidden email]> wrote:
Attaching the job manager log for reference.

2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
2020-03-22 11:39:02,724 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,791 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,861 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,931 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,001 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,071 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,141 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,211 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,281 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,351 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,421 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]

Thanks,
Dinesh
On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <[hidden email]> wrote:
Hi all,
We have single job yarn flink cluster setup with High Availability.
Sometimes job manager failure successfully restarts next attempt from current checkpoint.
But occasionally we are getting below error.
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
Hadoop version using : Hadoop 2.7.1.2.4.0.0-169
Flink version: flink-1.7.2
Zookeeper version: 3.4.6-169--1
Below is the flink configuration
high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s


Can someone let know if I am missing something or is it a known issue?
Is it something related to hostname ip mapping issue or zookeeper version issue?
Thanks,
Dinesh

full_log_failed_container.log (68K) Download Attachment

Andrey Zagrebin-5

Re: Issue with single job yarn flink cluster HA

Hi Dinesh,

Thanks for sharing the logs. There were couple of HA fixes since 1.7, e.g. [1] and [2].

I would suggest to try Flink 1.10.

If the problem persists, could you also find the logs of the failed Job Manager before the failover?

Best,

Andrey

[1] https://jira.apache.org/jira/browse/FLINK-14316

[2] https://jira.apache.org/jira/browse/FLINK-11843

On Tue, Mar 31, 2020 at 6:49 AM Dinesh J <[hidden email]> wrote:

Hi Yang,
I am attaching one full jobmanager log for a job which I reran today. This a job that tries to read from savepoint.
Same error message "leader election onging" is displayed. And this stays the same even after 30 minutes. If I leave the job without yarn kill, it stays the same forever.
Based on your suggestions till now, I guess it might be some zookeeper problem. If that is the case, what can I lookout for in zookeeper to figure out the issue?

Thanks,
Dinesh
On Tue, Mar 31, 2020 at 7:42 AM Yang Wang <[hidden email]> wrote:
I think your problem is not about akka timeout. Increase the timeout could help in a
heavy load cluster, especially for the network is not very good. However, that is not
your case now.

I am not sure about the "never recovery". Do you mean the logs "Connection refused"
keep going and do not have other logs? How long does it stay in "leader election onging".
Usually, it takes at most 60s. Since if the old jobmanager crashed, then it will lose
the leadership after zookeeper session timeout. So when the new jobmanager always
could not grant the leadership, it may because of some problem of zookeeper.

Maybe you need to share the complete jobmanager logs so that we could know what
is happening in the jobmanager.

Best,
Yang
Dinesh J <[hidden email]> 于2020年3月31日周二上午3:46写道：
HI Yang,
Thanks for the clarification and suggestion. But my problem was that recovery never happens and the message "leader election ongoing" is what the message displayed forever.
Do you think increasing akka.ask.timeout and akka.tcp.timeout will help in case of a heavy/highload cluster as this issue happens mainly during heavy load in cluster?

Best,
Dinesh
On Mon, Mar 30, 2020 at 2:29 PM Yang Wang <[hidden email]> wrote:
Hi Dinesh,

First, i think the error message your provided is not a problem. It just indicates that the leader
election is still ongoing. When it finished, the new leader will start the a new dispatcher to provide
the webui and rest service.

From your jobmanager logs "Connection refused: host1/ipaddress1:28681", we could know that
the old jobmanager has failed. When a new jobmanager started, since the old jobmanager still
hold the lock of leader latch. So Flink tries to connect with it. After it tries few times, since the old
jobmanager zookeeper client do not update the leader latch, then the new jobmanager will elect
successfully and be the active leader. It is just how the leader election works.

In a nutshell, the root cause is old jobmanager crashed and it does not lose the leader immediately.
It is the by-design behavior.

If you really want to make the recovery faster, i think you could decrease "high-availability.zookeeper.client.connection-timeout"
and "high-availability.zookeeper.client.session-timeout". Please keep in mind that too small value
will also cause unexpected failover because of network problem.

Best,
Yang
Dinesh J <[hidden email]> 于2020年3月25日周三下午4:20写道：
Hi Andrey,
Yes . The job is not restarting sometimes after the current leader failure.
Below is the message displayed when trying to reach the application master url via yarn ui and message remains the same even if the yarn job is running for 2 days.
During this time , even current yarn application attempt is not getting failed and no containers are launched for jobmanager and taskmanager.

{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}

Thanks,
Dinesh
On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Dinesh,

If the current leader crashes (e.g. due to network failures) then getting these messages do not look like a problem during the leader re-election.
They look to me just as warnings that caused failover.

Do you observe any problem with your application? Does the failover not work, e.g. no leader is elected or a job is not restarted after the current leader failure?

Best,
Andrey
On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <[hidden email]> wrote:
Attaching the job manager log for reference.

2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
2020-03-22 11:39:02,724 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,791 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,861 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,931 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,001 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,071 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,141 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,211 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,281 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,351 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,421 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]

Thanks,
Dinesh
On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <[hidden email]> wrote:
Hi all,
We have single job yarn flink cluster setup with High Availability.
Sometimes job manager failure successfully restarts next attempt from current checkpoint.
But occasionally we are getting below error.
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
Hadoop version using : Hadoop 2.7.1.2.4.0.0-169
Flink version: flink-1.7.2
Zookeeper version: 3.4.6-169--1
Below is the flink configuration
high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s


Can someone let know if I am missing something or is it a known issue?
Is it something related to hostname ip mapping issue or zookeeper version issue?
Thanks,
Dinesh

Dinesh J

Re: Issue with single job yarn flink cluster HA

Hi Andrey,

Sure We will try to use Flink 1.10 to see if HA issues we are facing is fixed and update in this thread.

Thanks,

Dinesh

On Thu, Apr 2, 2020 at 3:22 PM Andrey Zagrebin <[hidden email]> wrote:

Hi Dinesh,

Thanks for sharing the logs. There were couple of HA fixes since 1.7, e.g. [1] and [2].
I would suggest to try Flink 1.10.
If the problem persists, could you also find the logs of the failed Job Manager before the failover?

Best,
Andrey

[1] https://jira.apache.org/jira/browse/FLINK-14316
[2] https://jira.apache.org/jira/browse/FLINK-11843
On Tue, Mar 31, 2020 at 6:49 AM Dinesh J <[hidden email]> wrote:
Hi Yang,
I am attaching one full jobmanager log for a job which I reran today. This a job that tries to read from savepoint.
Same error message "leader election onging" is displayed. And this stays the same even after 30 minutes. If I leave the job without yarn kill, it stays the same forever.
Based on your suggestions till now, I guess it might be some zookeeper problem. If that is the case, what can I lookout for in zookeeper to figure out the issue?

Thanks,
Dinesh
On Tue, Mar 31, 2020 at 7:42 AM Yang Wang <[hidden email]> wrote:
I think your problem is not about akka timeout. Increase the timeout could help in a
heavy load cluster, especially for the network is not very good. However, that is not
your case now.

I am not sure about the "never recovery". Do you mean the logs "Connection refused"
keep going and do not have other logs? How long does it stay in "leader election onging".
Usually, it takes at most 60s. Since if the old jobmanager crashed, then it will lose
the leadership after zookeeper session timeout. So when the new jobmanager always
could not grant the leadership, it may because of some problem of zookeeper.

Maybe you need to share the complete jobmanager logs so that we could know what
is happening in the jobmanager.

Best,
Yang
Dinesh J <[hidden email]> 于2020年3月31日周二上午3:46写道：
HI Yang,
Thanks for the clarification and suggestion. But my problem was that recovery never happens and the message "leader election ongoing" is what the message displayed forever.
Do you think increasing akka.ask.timeout and akka.tcp.timeout will help in case of a heavy/highload cluster as this issue happens mainly during heavy load in cluster?

Best,
Dinesh
On Mon, Mar 30, 2020 at 2:29 PM Yang Wang <[hidden email]> wrote:
Hi Dinesh,

First, i think the error message your provided is not a problem. It just indicates that the leader
election is still ongoing. When it finished, the new leader will start the a new dispatcher to provide
the webui and rest service.

From your jobmanager logs "Connection refused: host1/ipaddress1:28681", we could know that
the old jobmanager has failed. When a new jobmanager started, since the old jobmanager still
hold the lock of leader latch. So Flink tries to connect with it. After it tries few times, since the old
jobmanager zookeeper client do not update the leader latch, then the new jobmanager will elect
successfully and be the active leader. It is just how the leader election works.

In a nutshell, the root cause is old jobmanager crashed and it does not lose the leader immediately.
It is the by-design behavior.

If you really want to make the recovery faster, i think you could decrease "high-availability.zookeeper.client.connection-timeout"
and "high-availability.zookeeper.client.session-timeout". Please keep in mind that too small value
will also cause unexpected failover because of network problem.

Best,
Yang
Dinesh J <[hidden email]> 于2020年3月25日周三下午4:20写道：
Hi Andrey,
Yes . The job is not restarting sometimes after the current leader failure.
Below is the message displayed when trying to reach the application master url via yarn ui and message remains the same even if the yarn job is running for 2 days.
During this time , even current yarn application attempt is not getting failed and no containers are launched for jobmanager and taskmanager.

{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}

Thanks,
Dinesh
On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Dinesh,

If the current leader crashes (e.g. due to network failures) then getting these messages do not look like a problem during the leader re-election.
They look to me just as warnings that caused failover.

Do you observe any problem with your application? Does the failover not work, e.g. no leader is elected or a job is not restarted after the current leader failure?

Best,
Andrey
On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <[hidden email]> wrote:
Attaching the job manager log for reference.

2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
2020-03-22 11:39:02,724 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,791 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,861 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,931 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,001 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,071 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,141 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,211 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,281 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,351 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,421 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]

Thanks,
Dinesh
On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <[hidden email]> wrote:
Hi all,
We have single job yarn flink cluster setup with High Availability.
Sometimes job manager failure successfully restarts next attempt from current checkpoint.
But occasionally we are getting below error.
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
Hadoop version using : Hadoop 2.7.1.2.4.0.0-169
Flink version: flink-1.7.2
Zookeeper version: 3.4.6-169--1
Below is the flink configuration
high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s


Can someone let know if I am missing something or is it a known issue?
Is it something related to hostname ip mapping issue or zookeeper version issue?
Thanks,
Dinesh