I'm trying to understand if the behaviour of the Flink JobManager during a ZooKeeper upgrade is expected or not.

I'm running Flink 1.11.2 in Kubernetes, with ZooKeeper server 3.5.4-beta. While I'm doing the ZooKeeper upgrade, there is about 20 seconds of ZooKeeper downtime. I'd expect either the Flink job to restart or a few warnings in the logs during those 20 seconds. Instead, I see the whole Flink JVM crash (and later the pod restarts). I expected Flink to retry ZooKeeper requests internally, so I'm surprised that it crashes.

Is this expected, or is it a bug?

From the logs:

org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:00.197 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:00.197 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to zdzk.servicexxx/192.168.190.92:2181, initiating session
[09-Feb-2021 11:30:00.198 UTC] WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_192]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:02.294 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:02.295 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to zdzk.servicexxx/192.168.190.92:2181, initiating session
[09-Feb-2021 11:30:02.295 UTC] WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_192]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:03.841 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:03.842 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to zdzk.servicexxx/192.168.190.92:2181, initiating session
[09-Feb-2021 11:30:03.842 UTC] WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.rea

FYI: I've asked the same question on Stack Overflow: https://stackoverflow.com/questions/66120905/should-flink-job-manager-crash-during-zookeeper-upgrade

-- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Hi Barisa,

thanks for sharing this. I'm gonna add Till to this thread. He might have some insights.

Best,
Matthias

On Wed, Feb 10, 2021 at 4:19 PM Barisa Obradovic <[hidden email]> wrote:
> I'm trying to understand if the behaviour of the Flink JobManager during
Great, thank you for the help, Matthias.
Hi Barisa,

Could you give us the full logs of the run? It looks a bit like you exceeded the maximum number of retry attempts while you upgraded your ZooKeeper cluster. You can increase the retry budget via recovery.zookeeper.client.retry-wait and recovery.zookeeper.client.max-retry-attempts. From Flink's perspective, it is intended that the system fails after some time when it cannot connect to the ZooKeeper cluster.

Cheers,
Till

On Wed, Feb 10, 2021 at 10:43 PM Barisa Obradovic <[hidden email]> wrote:
> Great, thank you for the help, Matthias.
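
As a rough sketch of what Till suggests, the two settings could go into flink-conf.yaml along the lines below. The values are illustrative assumptions only; the thread does not say which values were actually used. In Flink 1.11 these options are documented under the high-availability.zookeeper.client.* prefix, and the recovery.zookeeper.client.* names above are, to my knowledge, older deprecated aliases for the same options.

```yaml
# flink-conf.yaml (fragment); the values are illustrative assumptions, not taken from this thread.

# Wait time between ZooKeeper client retries, in milliseconds.
high-availability.zookeeper.client.retry-wait: 5000

# Number of connection retries before the client gives up.
high-availability.zookeeper.client.max-retry-attempts: 10
```

The idea is to pick values so that the total retry window comfortably exceeds the expected ZooKeeper downtime (about 20 seconds in this case). Since Flink fails on purpose once the retries are exhausted, it is probably wiser to size the window to the planned outage than to set it arbitrarily high.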
Thank you, Till, that's perfect.

I increased the max retry attempts a bit, and now it works like a charm (no restarts).