Should flink job manager crash during zookeeper upgrade?

Barisa Obradovic
I'm trying to understand whether the behaviour of the Flink JobManager during a
ZooKeeper upgrade is expected or not.

I'm running Flink 1.11.2 in Kubernetes, with ZooKeeper server 3.5.4-beta.
While I'm upgrading ZooKeeper, there is 20 seconds of ZooKeeper downtime.
I'd expect either the Flink job to restart or a few warnings in the logs during
those 20 seconds. Instead, I see the whole Flink JVM crash (and later the pod
restarts).

I expected Flink to retry ZooKeeper requests internally, so I'm
surprised it crashes. Is this expected, or is it a bug?

From the logs:

org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:00.197 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:00.197 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to zdzk.servicexxx/192.168.190.92:2181, initiating session
[09-Feb-2021 11:30:00.198 UTC] WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_192]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:02.294 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:02.295 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to zdzk.servicexxx/192.168.190.92:2181, initiating session
[09-Feb-2021 11:30:02.295 UTC] WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_192]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:03.841 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:03.842 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to zdzk.servicexxx/192.168.190.92:2181, initiating session
[09-Feb-2021 11:30:03.842 UTC] WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.rea



FYI: I've asked the same question on Stack Overflow:
https://stackoverflow.com/questions/66120905/should-flink-job-manager-crash-during-zookeeper-upgrade



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Should flink job manager crash during zookeeper upgrade?

Matthias
Hi Barisa,
thanks for sharing this. I'm gonna add Till to this thread. He might have some insights.

Best,
Matthias

On Wed, Feb 10, 2021 at 4:19 PM Barisa Obradovic <[hidden email]> wrote:
I'm trying to understand if behaviour of the flink jobmanager during
zookeeper upgrade is expected or not.


Re: Should flink job manager crash during zookeeper upgrade?

Barisa Obradovic
Great, thank you for the help, Matthias.




Re: Should flink job manager crash during zookeeper upgrade?

Till Rohrmann
Hi Barisa,

Could you give us the full logs of the run? It looks a bit like you exceeded the maximum number of retry attempts while you upgraded your ZooKeeper cluster. You can increase the retry budget via recovery.zookeeper.client.retry-wait and recovery.zookeeper.client.max-retry-attempts.

From Flink's perspective it is intended that the system fails after some time when it cannot connect to the ZooKeeper cluster.

Cheers,
Till

On Wed, Feb 10, 2021 at 10:43 PM Barisa Obradovic <[hidden email]> wrote:
Great, thank you for help Matthias
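
For reference, a minimal sketch of how the two settings Till mentions might look in flink-conf.yaml. The values are illustrative assumptions, not the ones used in this thread, and should be sized to cover the expected ZooKeeper downtime. The sketch uses the high-availability.zookeeper.client.* key names, which, to my knowledge, are the current names in Flink 1.11, with the recovery.zookeeper.client.* names accepted as deprecated aliases; check the configuration reference for your version.

```yaml
# flink-conf.yaml fragment. Values are illustrative assumptions, not from this thread.

# Wait between ZooKeeper client retries, in milliseconds
# (the default is 5000, to my knowledge).
high-availability.zookeeper.client.retry-wait: 5000

# Number of retries before the client gives up
# (the default is 3, to my knowledge; 10 is an arbitrary larger value).
high-availability.zookeeper.client.max-retry-attempts: 10
```

Raising either value widens the window Flink tolerates before it fails, so a value that is too large can also delay the detection of a ZooKeeper cluster that is really gone.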




Re: Should flink job manager crash during zookeeper upgrade?

Barisa Obradovic
Thank you, Till, that's perfect.
I increased the max retry attempts a bit, and now it works like a charm (no
restarts).



