Timeout error in ZooKeeper

Timeout error in ZooKeeper

Samir Chauhan

Hi,

Yesterday morning I got the error below in ZooKeeper. After this error, Flink could not connect to ZK and the jobs hung. I had to cancel and redeploy all my jobs to bring them back to a normal state.

2020-02-28 02:45:56,811 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@368] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x1701028573403f3, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
        at java.lang.Thread.run(Thread.java:748)

Around the same time, I saw the error below in Flink.

2020-02-28 02:46:49,095 ERROR org.apache.curator.ConnectionState                            - Connection timed out for connection string (zk-cs:2181) and timeout (3000) / elapsed (14305)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
      at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
      at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
      at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
      at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835)
      at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
      at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
      at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)

 

Has anyone faced a similar error before?

My environment is:

Azure Kubernetes 1.15.7
Flink 1.6.0
ZooKeeper 3.4.10

 

Warm Regards,

Samir Chauhan

 

 


There's a reason we support Fair Dealing. YOU.


This email and any files transmitted with it or attached to it (the [Email]) may contain confidential, proprietary or legally privileged information and is intended solely for the use of the individual or entity to whom it is addressed. If you are not the intended recipient of the Email, you must not, directly or indirectly, copy, use, print, distribute, disclose to any other party or take any action in reliance on any part of the Email. Please notify the system manager or sender of the error and delete all copies of the Email immediately.

No statement in the Email should be construed as investment advice being given within or outside Singapore. Prudential Assurance Company Singapore (Pte) Limited (PACS) and each of its related entities shall not be responsible for any losses, claims, penalties, costs or damages arising from or in connection with the use of the Email or the information therein, in whole or in part. You are solely responsible for conducting any virus checks prior to opening, accessing or disseminating the Email.

PACS (Company Registration No. 199002477Z) is a company incorporated under the laws of Singapore and has its registered office at 30 Cecil Street, #30-01, Prudential Tower, Singapore 049712.

PACS is an indirect wholly owned subsidiary of Prudential plc of the United Kingdom. PACS and Prudential plc are not affiliated in any manner with Prudential Financial, Inc., a company whose principal place of business is in the United States of America.

Re: Timeout error in ZooKeeper

Till Rohrmann
Hi Samir,

It is hard to tell exactly what happened without the Flink logs. However, newer Flink versions include some ZooKeeper improvements and bug fixes [1], so it might make sense to upgrade your Flink version.


Cheers,
Till


RE: Timeout error in ZooKeeper

Samir Chauhan

Hi [hidden email],

 

Thanks for the response. Unfortunately, I could not capture much of the log on the Flink side. I am still attaching whatever I could collect.

I found this old ticket about the same error. I am not sure whether it is related.

https://issues.apache.org/jira/browse/ZOOKEEPER-1582

I also read somewhere that it could be related to znodes containing too much data or having too many children. By default, ZooKeeper has a 1 MB transport limit.
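
To make the 1 MB limit concrete, here is a minimal sketch in plain Java of a rough pre-check on a znode payload. It assumes the limit in question is ZooKeeper's `jute.maxbuffer` setting, whose documented default is 0xfffff bytes (just under 1 MB). The class name `ZnodeSizeCheck` and the method `fitsDefaultLimit` are hypothetical and not part of any ZooKeeper or Flink API. The real limit applies to the whole serialized packet (path, headers, or a full list of children), so a payload that passes this check may still be rejected.

```java
import java.nio.charset.StandardCharsets;

public class ZnodeSizeCheck {
    // Assumed limit: the documented default of jute.maxbuffer (0xfffff = 1048575 bytes).
    public static final int DEFAULT_JUTE_MAXBUFFER = 0xfffff;

    // Rough check on the payload alone; the actual request also carries
    // the znode path and protocol headers, so this is only a lower bound.
    public static boolean fitsDefaultLimit(byte[] payload) {
        return payload.length < DEFAULT_JUTE_MAXBUFFER;
    }

    public static void main(String[] args) {
        byte[] small = "job-graph-pointer".getBytes(StandardCharsets.UTF_8);
        byte[] large = new byte[2 * 1024 * 1024]; // 2 MB of zeros

        System.out.println("small payload fits: " + fitsDefaultLimit(small));
        System.out.println("large payload fits: " + fitsDefaultLimit(large));
    }
}
```

A znode with very many children can hit the same limit on reads, because the list of child names is returned in a single response.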

 

Warm Regards,

Samir Chauhan

 



Attachment: flink.log (545K)

Re: Timeout error in ZooKeeper

Yang Wang
Hi Samir,

It seems that your ZooKeeper connection timeout is set to 3000 ms, and the client
could not connect to the server for 14305 ms, possibly because of a full GC or a
network problem. When it reconnected, the "ConnectionLossException" was thrown.


Have you changed any of the ZooKeeper client timeout settings in Flink?
Or could you confirm the timeout settings on the ZooKeeper server side?
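
The two numbers in the Flink log line can be pulled out and compared with a few lines of plain Java. This is only a sketch for reading the log, not Curator's own timeout logic; the class name `TimeoutCheck` and its methods are hypothetical, and the only values taken from the thread are the 3000 ms timeout and the 14305 ms elapsed time.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeoutCheck {
    // Matches the "timeout (N) / elapsed (M)" part of the log message.
    private static final Pattern LOG =
            Pattern.compile("timeout \\((\\d+)\\) / elapsed \\((\\d+)\\)");

    // Returns {timeoutMs, elapsedMs}, or null if the line has no such part.
    public static long[] parse(String line) {
        Matcher m = LOG.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
    }

    // How far past the configured timeout the connection attempt ran.
    public static long overshootMs(long timeoutMs, long elapsedMs) {
        return elapsedMs - timeoutMs;
    }

    public static void main(String[] args) {
        String line = "Connection timed out for connection string (zk-cs:2181) "
                + "and timeout (3000) / elapsed (14305)";
        long[] v = parse(line);
        System.out.println("timeout=" + v[0] + " ms, elapsed=" + v[1]
                + " ms, overshoot=" + overshootMs(v[0], v[1]) + " ms");
    }
}
```

An elapsed time several times larger than the timeout suggests a long stall (a GC pause or a network outage) rather than a timeout that is only slightly too tight.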


Best,
Yang
