ZooKeeper connection SUSPENDING

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

ZooKeeper connection SUSPENDING

Kenzyme
Hi,

Related to https://mail-archives.apache.org/mod_mbox/flink-dev/201709.mbox/%3CCA+faj9yvPyzmmLoEWAMPgXDP6kx+0oed1Z5k4s3K9sgiCFyb=w@...%3E and https://issues.apache.org/jira/browse/FLINK-10052, I was wondering if there's a way to prevent Flink instances from failing while doing a rolling restart on ZK followers while still keeping the quorum?

This is what was shown in Flink logs while restarting ZK :
ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).

I was able to reproduce this twice with a quorum of 5 ZK nodes while doing some ZK maintenance.

Thanks!

Kenzyme Le


Reply | Threaded
Open this post in threaded view
|

Re: ZooKeeper connection SUSPENDING

r_khachatryan
Hi,

AFAIK, the features discussed in the threads you mentioned are not yet implemented. So there is no way to avoid Job restarts in case of ZK rolling restarts.
I'm pulling in Till as he might know better.

Regards,
Roman


On Fri, Oct 16, 2020 at 7:45 PM Kenzyme <[hidden email]> wrote:
Hi,

Related to https://mail-archives.apache.org/mod_mbox/flink-dev/201709.mbox/%3CCA+faj9yvPyzmmLoEWAMPgXDP6kx+0oed1Z5k4s3K9sgiCFyb=w@...%3E and https://issues.apache.org/jira/browse/FLINK-10052, I was wondering if there's a way to prevent Flink instances from failing while doing a rolling restart on ZK followers while still keeping the quorum?

This is what was shown in Flink logs while restarting ZK :
ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).

I was able to reproduce this twice with a quorum of 5 ZK nodes while doing some ZK maintenance.

Thanks!

Kenzyme Le


Reply | Threaded
Open this post in threaded view
|

Re: ZooKeeper connection SUSPENDING

Kenzyme
Hi Roman,

Thank you for your reply.

I'm not 100% sure if those features discussed in the threads will fix the issue, but they seemed related in some way.

Basically, the expected behaviour I had for Flink was similar to how Kafka works i.e.  Kafka services continues w/o disruption whenever ZK quorum is maintained during rolling updates.

Best,

Kenzyme Le


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, October 19th, 2020 at 4:38 PM, Khachatryan Roman <[hidden email]> wrote:
Hi,

AFAIK, the features discussed in the threads you mentioned are not yet implemented. So there is no way to avoid Job restarts in case of ZK rolling restarts.
I'm pulling in Till as he might know better.

Regards,
Roman


On Fri, Oct 16, 2020 at 7:45 PM Kenzyme <[hidden email]> wrote:
Hi,

Related to https://mail-archives.apache.org/mod_mbox/flink-dev/201709.mbox/%3CCA+faj9yvPyzmmLoEWAMPgXDP6kx+0oed1Z5k4s3K9sgiCFyb=w@...%3E and https://issues.apache.org/jira/browse/FLINK-10052, I was wondering if there's a way to prevent Flink instances from failing while doing a rolling restart on ZK followers while still keeping the quorum?

This is what was shown in Flink logs while restarting ZK :
ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).

I was able to reproduce this twice with a quorum of 5 ZK nodes while doing some ZK maintenance.

Thanks!

Kenzyme Le



Reply | Threaded
Open this post in threaded view
|

Re: ZooKeeper connection SUSPENDING

Till Rohrmann
Hi Kenzyme,

at the moment Flink will stop the execution of jobs when it loses its connection to ZooKeeper for whatever reason. If a ZK rolling update can cause the connection loss to the quorum, then it's what you are seeing. FLINK-10052 wants to add a feature which allows Flink to tolerate a SUSPENDED ZK connection for a short amount of time. I haven't tried it out with a rolling ZK update but it might solve the problem you are observing.

Cheers,
Till

On Tue, Oct 20, 2020 at 5:41 AM Kenzyme <[hidden email]> wrote:
Hi Roman,

Thank you for your reply.

I'm not 100% sure if those features discussed in the threads will fix the issue, but they seemed related in some way.

Basically, the expected behaviour I had for Flink was similar to how Kafka works i.e.  Kafka services continues w/o disruption whenever ZK quorum is maintained during rolling updates.

Best,

Kenzyme Le


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, October 19th, 2020 at 4:38 PM, Khachatryan Roman <[hidden email]> wrote:
Hi,

AFAIK, the features discussed in the threads you mentioned are not yet implemented. So there is no way to avoid Job restarts in case of ZK rolling restarts.
I'm pulling in Till as he might know better.

Regards,
Roman


On Fri, Oct 16, 2020 at 7:45 PM Kenzyme <[hidden email]> wrote:
Hi,

Related to https://mail-archives.apache.org/mod_mbox/flink-dev/201709.mbox/%3CCA+faj9yvPyzmmLoEWAMPgXDP6kx+0oed1Z5k4s3K9sgiCFyb=w@...%3E and https://issues.apache.org/jira/browse/FLINK-10052, I was wondering if there's a way to prevent Flink instances from failing while doing a rolling restart on ZK followers while still keeping the quorum?

This is what was shown in Flink logs while restarting ZK :
ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).

I was able to reproduce this twice with a quorum of 5 ZK nodes while doing some ZK maintenance.

Thanks!

Kenzyme Le



Reply | Threaded
Open this post in threaded view
|

Re: ZooKeeper connection SUSPENDING

Kenzyme
Hi Till,

Thank you for the explanation.

Looking forward for this feature to be added in the near future.

Best,

Kenzyme



-------- Original Message --------
On Oct. 20, 2020, 8:46 a.m., Till Rohrmann < [hidden email]> wrote:

Hi Kenzyme,

at the moment Flink will stop the execution of jobs when it loses its connection to ZooKeeper for whatever reason. If a ZK rolling update can cause the connection loss to the quorum, then it's what you are seeing. FLINK-10052 wants to add a feature which allows Flink to tolerate a SUSPENDED ZK connection for a short amount of time. I haven't tried it out with a rolling ZK update but it might solve the problem you are observing.

Cheers,
Till

On Tue, Oct 20, 2020 at 5:41 AM Kenzyme <[hidden email]> wrote:
Hi Roman,

Thank you for your reply.

I'm not 100% sure if those features discussed in the threads will fix the issue, but they seemed related in some way.

Basically, the expected behaviour I had for Flink was similar to how Kafka works i.e.  Kafka services continues w/o disruption whenever ZK quorum is maintained during rolling updates.

Best,

Kenzyme Le


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, October 19th, 2020 at 4:38 PM, Khachatryan Roman <[hidden email]> wrote:
Hi,

AFAIK, the features discussed in the threads you mentioned are not yet implemented. So there is no way to avoid Job restarts in case of ZK rolling restarts.
I'm pulling in Till as he might know better.

Regards,
Roman


On Fri, Oct 16, 2020 at 7:45 PM Kenzyme <[hidden email]> wrote:
Hi,

Related to https://mail-archives.apache.org/mod_mbox/flink-dev/201709.mbox/%3CCA+faj9yvPyzmmLoEWAMPgXDP6kx+0oed1Z5k4s3K9sgiCFyb=w@...%3E and https://issues.apache.org/jira/browse/FLINK-10052, I was wondering if there's a way to prevent Flink instances from failing while doing a rolling restart on ZK followers while still keeping the quorum?

This is what was shown in Flink logs while restarting ZK :
ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).

I was able to reproduce this twice with a quorum of 5 ZK nodes while doing some ZK maintenance.

Thanks!

Kenzyme Le