Zookeeper Session Timeout

snntr
Hi everyone,

I observed the following behavior with Flink 1.0.2 on Hadoop 2.4.1,
running a YARN session in HA mode:

2016-05-10 18:39:14,546 INFO  org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 52444ms for sessionid 0x2544821cf2f818a, closing socket connection and attempting reconnect

2016-05-10 18:39:14,546 INFO  org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 54871ms for sessionid 0x154481fce7881c8, closing socket connection and attempting reconnect

2016-05-10 18:39:14,730 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED

2016-05-10 18:39:14,872 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED

2016-05-10 18:39:14,907 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).

2016-05-10 18:39:14,943 INFO  org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#1292460688 was revoked leadership.


I am confused by the timeouts of roughly 50,000ms, since our
flink-conf.yaml states:

> recovery.zookeeper.client.connection-timeout: 30000
> recovery.zookeeper.client.session-timeout: 120000
> recovery.zookeeper.client.retry-wait: 5000
> recovery.zookeeper.client.max-retry-attempts: 5

So I would have expected a timeout of around 120,000ms. 50,000ms is our
configured akka.watch.heartbeat.interval. Is this value used instead here?
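
As far as I know, a misspelled key in the Flink configuration is not rejected; it is just never read, and the default applies instead. So it can be worth checking the key names before looking anywhere else. Below is a minimal sketch in Python, assuming the four recovery.zookeeper.client.* keys are the intended ones. The helper name find_suspect_keys and the inline sample (which contains a deliberately misspelled key) are made up for illustration and are not part of Flink.

```python
import difflib

# Key names assumed to be the intended ones (taken from the settings in this thread).
EXPECTED_KEYS = {
    "recovery.zookeeper.client.connection-timeout",
    "recovery.zookeeper.client.session-timeout",
    "recovery.zookeeper.client.retry-wait",
    "recovery.zookeeper.client.max-retry-attempts",
}


def find_suspect_keys(conf_text, expected=EXPECTED_KEYS):
    """Return {key_in_file: closest_expected_key} for keys that look like typos."""
    suspects = {}
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not "key: value".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key = line.split(":", 1)[0].strip()
        if key in expected:
            continue
        # A near miss (very high similarity) is most likely a misspelling.
        close = difflib.get_close_matches(key, sorted(expected), n=1, cutoff=0.9)
        if close:
            suspects[key] = close[0]
    return suspects


# Hypothetical sample with one deliberately misspelled key.
sample = """\
reocvery.zookeeper.client.connection-timeout: 30000
recovery.zookeeper.client.session-timeout: 120000
akka.watch.heartbeat.interval: 50 s
"""

print(find_suspect_keys(sample))
```

Unrelated keys such as akka.watch.heartbeat.interval are not flagged, because only near misses of the expected names are reported.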

Cheers,

Konstantin

--
Konstantin Knauf * [hidden email] * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
Re: Zookeeper Session Timeout

Till Rohrmann-2
Hi Konstantin,

I've checked, and the CuratorFramework client should be started with the correct session timeout (see ZooKeeperUtils.java:90). However, the ZooKeeper server enforces a minimum and a maximum session timeout (minSessionTimeout and maxSessionTimeout, see http://zookeeper.apache.org/doc/r3.3.1/zookeeperAdmin.html). This range limits the session timeout the client can actually negotiate. I don't think that the max session timeout is 50s, but maybe you could check this.
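
To illustrate the limit: the ZooKeeper admin guide documents minSessionTimeout and maxSessionTimeout with defaults of 2 and 20 times tickTime. Below is a minimal sketch of the resulting clamping, assuming tickTime=2000 as in the sample zoo.cfg. The function name is hypothetical and not a ZooKeeper API, and your server may well be configured differently.

```python
def negotiate_session_timeout(requested_ms, tick_time_ms=2000, min_ms=None, max_ms=None):
    """Clamp the timeout requested by the client to the server's allowed range.

    The defaults follow the documented server defaults:
    minSessionTimeout = 2 * tickTime, maxSessionTimeout = 20 * tickTime.
    tick_time_ms=2000 is an assumption (the value in the sample zoo.cfg).
    """
    if min_ms is None:
        min_ms = 2 * tick_time_ms
    if max_ms is None:
        max_ms = 20 * tick_time_ms
    return max(min_ms, min(requested_ms, max_ms))


# The client asks for 120000 ms (recovery.zookeeper.client.session-timeout).
print(negotiate_session_timeout(120000))  # → 40000
```

With these assumed defaults, a requested 120000 ms would be cut down to 40000 ms, so the effective value depends on tickTime and on any explicit minSessionTimeout/maxSessionTimeout in the server's zoo.cfg.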

The akka.watch.heartbeat.interval should not be used by ZooKeeper.

Cheers,
Till

On Wed, May 11, 2016 at 1:12 PM, Konstantin Knauf <[hidden email]> wrote:
> [...]