Flink + Consul as HA backend. What do you think?

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink + Consul as HA backend. What do you think?

Krzysztof Białek
I'd like to get your opinion about this idea. I found related JIRA issue FLINK-2366, but it seems to be dead. To attract your attention I copy my comment here.

As an experiment I've implemented Flink HA on top of Consul. The implementation is working fine in the "lab" but is not battle tested yet. The source code is available at https://github.com/kbialek/flink/tree/feature/consul (flink-runtime package org.apache.flink.runtime.consul)

Why?. Generally I'd like to keep as less moving parts as possible. We do not have Zookeeper running, but Consul is already in place. And in the end freedom of choice is a good thing.

It would be great to see built-in Consul support in Flink someday, but if it is not expected then I suggest a little refactoring to open possibility to configure HighAvailabilityServicesFactory. As far as I can see this should be enough to inject any HA implementation.

Regards,
Krzysztof

Reply | Threaded
Open this post in threaded view
|

Re: Flink + Consul as HA backend. What do you think?

Chesnay Schepler
Hello,

I don't know anything about Consul but the prospect of having other options beside Zookeeper is very interesting. It's rather surprising how little you had to modify existing classes to get this to work.

It may take a bit until someone provides proper feedback as the community is currently prepping 2 releases (1.4.1 and 1.5), please don't be discouraged by this.

I saw that your branch was based on the 1.4 version. In 1.5 we reworked the distributed architecture of Flink (in an initiative commonly referred to as FLIP-6) which may affect your work.

2 things to note from my side:
It would also be helpful if you could explain the differences between ZK and Consul and how they stack up in terms of guarantees etc. .
How did you test your solution so far? (Like how long was a cluster running, what failure scenarios)

On 13.02.2018 21:38, Krzysztof Białek wrote:
I'd like to get your opinion about this idea. I found related JIRA issue FLINK-2366, but it seems to be dead. To attract your attention I copy my comment here.

As an experiment I've implemented Flink HA on top of Consul. The implementation is working fine in the "lab" but is not battle tested yet. The source code is available at https://github.com/kbialek/flink/tree/feature/consul (flink-runtime package org.apache.flink.runtime.consul)

Why?. Generally I'd like to keep as less moving parts as possible. We do not have Zookeeper running, but Consul is already in place. And in the end freedom of choice is a good thing.

It would be great to see built-in Consul support in Flink someday, but if it is not expected then I suggest a little refactoring to open possibility to configure HighAvailabilityServicesFactory. As far as I can see this should be enough to inject any HA implementation.

Regards,
Krzysztof


Reply | Threaded
Open this post in threaded view
|

Re: Flink + Consul as HA backend. What do you think?

Krzysztof Białek
I have very little experience with ZK and cannot explain the differences between ZK and Consul by myself. However there are some comparisions available:
https://www.consul.io/intro/vs/zookeeper.html - done by Consul so may be biased

Regarding testing - I did basic failover scenarios on my workstation with 2 JobManagers, 2 TaskManagers and WindowJoin example app with checkpointing and restarting turned on.
I was running the cluster no longer than for few hours.

For now I'd like to open Flink for alternative HA backends (https://issues.apache.org/jira/browse/FLINK-8660)


On Wed, Feb 14, 2018 at 1:47 PM, Chesnay Schepler <[hidden email]> wrote:
Hello,

I don't know anything about Consul but the prospect of having other options beside Zookeeper is very interesting. It's rather surprising how little you had to modify existing classes to get this to work.

It may take a bit until someone provides proper feedback as the community is currently prepping 2 releases (1.4.1 and 1.5), please don't be discouraged by this.

I saw that your branch was based on the 1.4 version. In 1.5 we reworked the distributed architecture of Flink (in an initiative commonly referred to as FLIP-6) which may affect your work.

2 things to note from my side:
It would also be helpful if you could explain the differences between ZK and Consul and how they stack up in terms of guarantees etc. .
How did you test your solution so far? (Like how long was a cluster running, what failure scenarios)


On 13.02.2018 21:38, Krzysztof Białek wrote:
I'd like to get your opinion about this idea. I found related JIRA issue FLINK-2366, but it seems to be dead. To attract your attention I copy my comment here.

As an experiment I've implemented Flink HA on top of Consul. The implementation is working fine in the "lab" but is not battle tested yet. The source code is available at https://github.com/kbialek/flink/tree/feature/consul (flink-runtime package org.apache.flink.runtime.consul)

Why?. Generally I'd like to keep as less moving parts as possible. We do not have Zookeeper running, but Consul is already in place. And in the end freedom of choice is a good thing.

It would be great to see built-in Consul support in Flink someday, but if it is not expected then I suggest a little refactoring to open possibility to configure HighAvailabilityServicesFactory. As far as I can see this should be enough to inject any HA implementation.

Regards,
Krzysztof



Reply | Threaded
Open this post in threaded view
|

Re: Flink + Consul as HA backend. What do you think?

Krzysztof Białek
Alright, just came across the first real-life problem with my Consul HA implementation.
In Consul KV store there is a limit of 512kB per node and JobGraph of one of my apps exceeded it.
In ZK there seems to be similar zNode Limit = 1MB
How did you workaround it? Or maybe I serialize the JobGraph wrong?

On Thu, Feb 15, 2018 at 8:47 AM, Krzysztof Białek <[hidden email]> wrote:
I have very little experience with ZK and cannot explain the differences between ZK and Consul by myself. However there are some comparisions available:
https://www.consul.io/intro/vs/zookeeper.html - done by Consul so may be biased

Regarding testing - I did basic failover scenarios on my workstation with 2 JobManagers, 2 TaskManagers and WindowJoin example app with checkpointing and restarting turned on.
I was running the cluster no longer than for few hours.

For now I'd like to open Flink for alternative HA backends (https://issues.apache.org/jira/browse/FLINK-8660)


On Wed, Feb 14, 2018 at 1:47 PM, Chesnay Schepler <[hidden email]> wrote:
Hello,

I don't know anything about Consul but the prospect of having other options beside Zookeeper is very interesting. It's rather surprising how little you had to modify existing classes to get this to work.

It may take a bit until someone provides proper feedback as the community is currently prepping 2 releases (1.4.1 and 1.5), please don't be discouraged by this.

I saw that your branch was based on the 1.4 version. In 1.5 we reworked the distributed architecture of Flink (in an initiative commonly referred to as FLIP-6) which may affect your work.

2 things to note from my side:
It would also be helpful if you could explain the differences between ZK and Consul and how they stack up in terms of guarantees etc. .
How did you test your solution so far? (Like how long was a cluster running, what failure scenarios)


On 13.02.2018 21:38, Krzysztof Białek wrote:
I'd like to get your opinion about this idea. I found related JIRA issue FLINK-2366, but it seems to be dead. To attract your attention I copy my comment here.

As an experiment I've implemented Flink HA on top of Consul. The implementation is working fine in the "lab" but is not battle tested yet. The source code is available at https://github.com/kbialek/flink/tree/feature/consul (flink-runtime package org.apache.flink.runtime.consul)

Why?. Generally I'd like to keep as less moving parts as possible. We do not have Zookeeper running, but Consul is already in place. And in the end freedom of choice is a good thing.

It would be great to see built-in Consul support in Flink someday, but if it is not expected then I suggest a little refactoring to open possibility to configure HighAvailabilityServicesFactory. As far as I can see this should be enough to inject any HA implementation.

Regards,
Krzysztof




Reply | Threaded
Open this post in threaded view
|

Re: Flink + Consul as HA backend. What do you think?

Fabian Hueske-2
Hi,

all data is stored in a distributed file system or object store (HDFS, S3, Ceph, ...) and ZooKeeper only stores pointers to that data.

Cheers, Fabian

2018-02-15 11:08 GMT+01:00 Krzysztof Białek <[hidden email]>:
Alright, just came across the first real-life problem with my Consul HA implementation.
In Consul KV store there is a limit of 512kB per node and JobGraph of one of my apps exceeded it.
In ZK there seems to be similar zNode Limit = 1MB
How did you workaround it? Or maybe I serialize the JobGraph wrong?

On Thu, Feb 15, 2018 at 8:47 AM, Krzysztof Białek <[hidden email]> wrote:
I have very little experience with ZK and cannot explain the differences between ZK and Consul by myself. However there are some comparisions available:
https://www.consul.io/intro/vs/zookeeper.html - done by Consul so may be biased

Regarding testing - I did basic failover scenarios on my workstation with 2 JobManagers, 2 TaskManagers and WindowJoin example app with checkpointing and restarting turned on.
I was running the cluster no longer than for few hours.

For now I'd like to open Flink for alternative HA backends (https://issues.apache.org/jira/browse/FLINK-8660)


On Wed, Feb 14, 2018 at 1:47 PM, Chesnay Schepler <[hidden email]> wrote:
Hello,

I don't know anything about Consul but the prospect of having other options beside Zookeeper is very interesting. It's rather surprising how little you had to modify existing classes to get this to work.

It may take a bit until someone provides proper feedback as the community is currently prepping 2 releases (1.4.1 and 1.5), please don't be discouraged by this.

I saw that your branch was based on the 1.4 version. In 1.5 we reworked the distributed architecture of Flink (in an initiative commonly referred to as FLIP-6) which may affect your work.

2 things to note from my side:
It would also be helpful if you could explain the differences between ZK and Consul and how they stack up in terms of guarantees etc. .
How did you test your solution so far? (Like how long was a cluster running, what failure scenarios)


On 13.02.2018 21:38, Krzysztof Białek wrote:
I'd like to get your opinion about this idea. I found related JIRA issue FLINK-2366, but it seems to be dead. To attract your attention I copy my comment here.

As an experiment I've implemented Flink HA on top of Consul. The implementation is working fine in the "lab" but is not battle tested yet. The source code is available at https://github.com/kbialek/flink/tree/feature/consul (flink-runtime package org.apache.flink.runtime.consul)

Why?. Generally I'd like to keep as less moving parts as possible. We do not have Zookeeper running, but Consul is already in place. And in the end freedom of choice is a good thing.

It would be great to see built-in Consul support in Flink someday, but if it is not expected then I suggest a little refactoring to open possibility to configure HighAvailabilityServicesFactory. As far as I can see this should be enough to inject any HA implementation.

Regards,
Krzysztof





Reply | Threaded
Open this post in threaded view
|

Re: Flink + Consul as HA backend. What do you think?

Krzysztof Białek
Alright, I have checkpoints saving implemented that way. I will apply this same pattern to jobgraphs.


On Thu, Feb 15, 2018 at 11:13 AM, Fabian Hueske <[hidden email]> wrote:
Hi,

all data is stored in a distributed file system or object store (HDFS, S3, Ceph, ...) and ZooKeeper only stores pointers to that data.

Cheers, Fabian

2018-02-15 11:08 GMT+01:00 Krzysztof Białek <[hidden email]>:
Alright, just came across the first real-life problem with my Consul HA implementation.
In Consul KV store there is a limit of 512kB per node and JobGraph of one of my apps exceeded it.
In ZK there seems to be similar zNode Limit = 1MB
How did you workaround it? Or maybe I serialize the JobGraph wrong?

On Thu, Feb 15, 2018 at 8:47 AM, Krzysztof Białek <[hidden email]> wrote:
I have very little experience with ZK and cannot explain the differences between ZK and Consul by myself. However there are some comparisions available:
https://www.consul.io/intro/vs/zookeeper.html - done by Consul so may be biased

Regarding testing - I did basic failover scenarios on my workstation with 2 JobManagers, 2 TaskManagers and WindowJoin example app with checkpointing and restarting turned on.
I was running the cluster no longer than for few hours.

For now I'd like to open Flink for alternative HA backends (https://issues.apache.org/jira/browse/FLINK-8660)


On Wed, Feb 14, 2018 at 1:47 PM, Chesnay Schepler <[hidden email]> wrote:
Hello,

I don't know anything about Consul but the prospect of having other options beside Zookeeper is very interesting. It's rather surprising how little you had to modify existing classes to get this to work.

It may take a bit until someone provides proper feedback as the community is currently prepping 2 releases (1.4.1 and 1.5), please don't be discouraged by this.

I saw that your branch was based on the 1.4 version. In 1.5 we reworked the distributed architecture of Flink (in an initiative commonly referred to as FLIP-6) which may affect your work.

2 things to note from my side:
It would also be helpful if you could explain the differences between ZK and Consul and how they stack up in terms of guarantees etc. .
How did you test your solution so far? (Like how long was a cluster running, what failure scenarios)


On 13.02.2018 21:38, Krzysztof Białek wrote:
I'd like to get your opinion about this idea. I found related JIRA issue FLINK-2366, but it seems to be dead. To attract your attention I copy my comment here.

As an experiment I've implemented Flink HA on top of Consul. The implementation is working fine in the "lab" but is not battle tested yet. The source code is available at https://github.com/kbialek/flink/tree/feature/consul (flink-runtime package org.apache.flink.runtime.consul)

Why?. Generally I'd like to keep as less moving parts as possible. We do not have Zookeeper running, but Consul is already in place. And in the end freedom of choice is a good thing.

It would be great to see built-in Consul support in Flink someday, but if it is not expected then I suggest a little refactoring to open possibility to configure HighAvailabilityServicesFactory. As far as I can see this should be enough to inject any HA implementation.

Regards,
Krzysztof