Flink rolling upgrade support


Flink rolling upgrade support

Andrew Hoblitzell
Hi. Does Apache Flink currently have support for zero downtime or the ability to do rolling upgrades?

If so, what concerns should we watch for, and what best practices might exist? Are there version management and data inconsistency issues to watch for?

Re: Flink rolling upgrade support

Aljoscha Krettek
Hi,
zero-downtime updates are currently not supported. What is supported in Flink right now is a savepoint-shutdown-restore cycle: you first take a savepoint (which is essentially a checkpoint with some metadata), then you cancel your job, then you do whatever you need to do (update machines, update Flink, update the job), and finally restore from the savepoint.

A possible solution for a zero-downtime update would be to take a savepoint, then start a second Flink job from that savepoint, then shut down the first job. With this, your data sinks would need to be able to handle being written to by two jobs at the same time, i.e. writes should probably be idempotent.

This is the link to the savepoint doc:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/savepoints.html
Does that help?

Cheers,
Aljoscha
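The idempotent-writes requirement above can be illustrated with a small sketch. This is plain Java, independent of any Flink API, and the `KeyedUpsertSink` name and structure are hypothetical: the point is only that a keyed upsert applied twice (once from the old job and once from the new job restored from the same savepoint) leaves the store in the same state.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an idempotent sink: writes are keyed upserts,
// so a duplicate write during the window where both jobs run is harmless.
class KeyedUpsertSink {
    private final Map<String, Long> store = new ConcurrentHashMap<>();

    // An upsert is idempotent: repeating it does not change the result.
    void write(String key, long value) {
        store.put(key, value);
    }

    Long read(String key) {
        return store.get(key);
    }

    public static void main(String[] args) {
        KeyedUpsertSink sink = new KeyedUpsertSink();
        sink.write("count:user-1", 42L);
        sink.write("count:user-1", 42L); // duplicate write during the overlap window
        System.out.println(sink.read("count:user-1")); // prints 42
    }
}
```

Contrast this with an append-style sink (e.g. incrementing a counter per event), where the same overlap would double-count.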


Re: Flink rolling upgrade support

Stephan Ewen
Hi Andrew!

Would be great to know if what Aljoscha described works for you. Ideally, this costs no more than a failure/recovery cycle, which one typically also gets with rolling upgrades.

Best,
Stephan




Re: Flink rolling upgrade support

Ron Crocker
Hi Stephan -

I agree that the savepoint-shutdown-restore model is nominally the same as a rolling restart, with one notable exception: a lack of atomicity. There is a gap between invoking the savepoint command and the shutdown command. My application isn't fortunate enough to have idempotent operations: replaying events ends up double-counting. With the current model (or at least as far as I can tell from the documentation you linked), I will double-process some events that arrive slightly after the savepoint.

One thing that could alleviate this is an atomic shutdown-with-savepoint (or savepoint-with-shutdown, I’m not so picky about which way it is, I only want it to be atomic). With this, I can be assured that the savepoint matches the actual last-processed state. 

My understanding of the processing within Flink is that this could be modeled by injecting a “savepoint” event followed by a “shutdown” event into the event stream, but my understanding is a bit cartoonish, so I'm sure it's more involved.
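The savepoint-event-followed-by-shutdown-event idea can be sketched as in-band control messages. This is a plain-Java toy model, not Flink internals, and all names are hypothetical: because no data events sit between the two markers, the snapshot taken at the savepoint marker is exactly the operator state at shutdown, so a restore replays nothing twice.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of an atomic savepoint-with-shutdown: the savepoint and shutdown
// markers travel in-band with the data, so the snapshot taken at the marker
// matches the operator state at the moment processing stops.
class AtomicStopModel {
    static final String SAVEPOINT = "#SAVEPOINT";
    static final String SHUTDOWN = "#SHUTDOWN";

    private final Map<String, Integer> counts = new HashMap<>();
    private Map<String, Integer> snapshot;

    // Returns false once the shutdown marker is seen.
    boolean process(String event) {
        if (event.equals(SAVEPOINT)) {
            snapshot = new HashMap<>(counts); // snapshot current state
            return true;
        }
        if (event.equals(SHUTDOWN)) {
            return false; // stop; nothing was processed after the snapshot
        }
        counts.merge(event, 1, Integer::sum); // count data events
        return true;
    }

    Map<String, Integer> snapshot() { return snapshot; }
    Map<String, Integer> state() { return counts; }

    public static void main(String[] args) {
        AtomicStopModel op = new AtomicStopModel();
        for (String e : new String[]{"a", "b", "a", SAVEPOINT, SHUTDOWN}) {
            if (!op.process(e)) break;
        }
        // SAVEPOINT and SHUTDOWN are adjacent, so snapshot == final state:
        // restoring from the snapshot double-processes no events.
        System.out.println(op.snapshot().equals(op.state())); // prints true
    }
}
```

The non-atomic failure mode is the same stream with data events between the two markers: those events would be processed by the old job but not captured in the savepoint, and would be replayed after restore.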

Ron
Ron Crocker
Principal Engineer & Architect
( ( •)) New Relic
M: +1 630 363 8835



Re: Flink rolling upgrade support

Greg Hogan
In reply to this post by Aljoscha Krettek
Aljoscha,

For the second possible solution, is there also a requirement that the data sinks handle out-of-order writes? If the new job outpaces the old job, which is then terminated, the final write from the old job could overwrite "newer" writes from the new job.
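One way to tolerate this, sketched below in plain Java (the `VersionedSink` name and version scheme are hypothetical, not a Flink API): each write carries a monotonically increasing version (e.g. an event-time timestamp or sequence number), and the sink rejects any write older than what it already holds, so a stale final write from the lagging job cannot clobber a newer value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a sink that tolerates out-of-order writes from two
// overlapping jobs: a write carries a version, and stale writes are dropped.
class VersionedSink {
    private static final class Entry {
        final long version;
        final long value;
        Entry(long version, long value) { this.version = version; this.value = value; }
    }

    private final Map<String, Entry> store = new HashMap<>();

    // Apply the write only if it is at least as new as what is stored.
    void write(String key, long version, long value) {
        Entry cur = store.get(key);
        if (cur == null || version >= cur.version) {
            store.put(key, new Entry(version, value));
        }
    }

    Long read(String key) {
        Entry e = store.get(key);
        return e == null ? null : e.value;
    }

    public static void main(String[] args) {
        VersionedSink sink = new VersionedSink();
        sink.write("k", 7, 100); // new job, running ahead
        sink.write("k", 5, 90);  // old job's final, stale write: rejected
        System.out.println(sink.read("k")); // prints 100
    }
}
```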

Greg



Re: Flink rolling upgrade support

Aljoscha Krettek
Hi Greg,
Yes, certainly; there are more requirements to this than the quick sketch I gave above, and that seems to be one of them.

Cheers,
Aljoscha



Re: Flink rolling upgrade support

Gyula Fóra
Hi!

I think in many cases it is more convenient to have a savepoint-and-stop operation to use when upgrading the cluster/job, but it should not be required. If the output of your job needs to be exactly-once and you don't have an external deduplication mechanism, then even the current fault-tolerance mechanism is not good enough to serve you under normal operations.

Cheers,
Gyula 




Re: Flink rolling upgrade support

Moiz Jinia
In reply to this post by Aljoscha Krettek
When a second job instance is started in parallel from a savepoint, my incoming Kafka messages would get sharded between the two running instances of the job (since they would both belong to the same consumer group). So when I stop the older version of the job, I stand to lose data (in spite of the fact that my downstream consumer is idempotent).

If I use a different consumer group for the new job version (and start it from a savepoint), will the savepoint ensure that the second job instance starts from the correct offsets? Do I need to do anything extra to make this work (for example, set the uid on the source of the job)?
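For context, Flink's Kafka source checkpoints partition offsets in its own operator state, so on restore from a savepoint it seeks to those saved offsets rather than consulting the consumer group's committed offsets; stable uids on the source are about making sure the restored state finds its operator at all. The precedence rule can be sketched as a plain-Java toy model (not the Flink API; all names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of how a source picks its starting position: offsets restored
// from a savepoint take precedence over the consumer group's committed
// offsets, so a job started from a savepoint resumes where the savepoint
// was taken, regardless of the consumer group name it uses.
class SourceStartPosition {
    // Decide the starting offset for one partition.
    static long startingOffset(Map<String, Long> savepointOffsets,
                               Map<String, Long> groupOffsets,
                               String partition) {
        Long restored = savepointOffsets.get(partition);
        if (restored != null) {
            return restored; // savepoint state wins
        }
        // No restored state: fall back to the group's committed offset (or 0).
        return groupOffsets.getOrDefault(partition, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> savepoint = new HashMap<>();
        savepoint.put("topic-0", 1200L);
        Map<String, Long> group = new HashMap<>();
        group.put("topic-0", 3400L); // committed by another job in the group
        System.out.println(startingOffset(savepoint, group, "topic-0")); // prints 1200
    }
}
```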

Thanks!
Moiz

Re: Flink rolling upgrade support

Aljoscha Krettek
This was now answered in your other thread, right?

Best,
Aljoscha



Re: Flink rolling upgrade support

Moiz Jinia
Yup! Thanks.

Moiz

sent from phone


Re: Flink rolling upgrade support

Ted Yu



Re: Flink rolling upgrade support

Moiz Jinia
No. This is the thread that answers my question -

