Recovery from job manager crash using checkpoints

Recovery from job manager crash using checkpoints

min.tan

Hi,

I can use checkpoints to recover Flink state when a task manager crashes.

I cannot use checkpoints to recover Flink state when a job manager crashes.

Do I need to set up ZooKeeper to keep the state when a job manager crashes?

Regards

Min




Re: Recovery from job manager crash using checkpoints

miki haiat
Which kind of deployment are you using: standalone, YARN, or something else?


Re: Recovery from job manager crash using checkpoints

Biao Liu
In reply to this post by min.tan
Hi Min,

> Do I need to set up ZooKeeper to keep the state when a job manager crashes?

I guess you need to set up HA [1] properly. Besides that, I would also suggest checking the state backend.
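For reference, a minimal flink-conf.yaml sketch of what ZooKeeper HA plus a durable state backend could look like (the quorum address and HDFS paths below are placeholders, adjust them to your environment):

high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha/
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints

With something like this in place, the metadata needed for recovery lives outside the JM process, so a restarted JM can find the previous jobs and their latest checkpoints.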



Re: Recovery from job manager crash using checkpoints

tison
Hi Min,

I guess you are using standalone high availability: when a TM fails, the JM can recover the job from its in-memory checkpoint store.

However, when the JM fails, since you don't persist state to an HA backend such as ZooKeeper, even if the JM is relaunched by the YARN RM or superseded by a standby, the new one knows nothing about the previous jobs.

In short, you need to set up ZooKeeper, as you mentioned.
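As a side note, assuming you run Flink on YARN: ZooKeeper HA is usually paired with a higher number of application attempts so that YARN keeps restarting the JM after failures. A flink-conf.yaml sketch, with a purely illustrative value:

yarn.application-attempts: 10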

Best,
tison.



RE: Recovery from job manager crash using checkpoints

min.tan

Thanks for the helpful reply.

One more question: does this ZooKeeper/HA requirement also apply to savepoints?

Can I bounce a single-JobManager cluster and rerun my Flink job from its previous state with a savepoint directory? e.g.

./bin/flink run -s savepointDirectory myJob.jar

Regards,

Min

Re: Recovery from job manager crash using checkpoints

tison
Hi Min,

For your question, the answer is no.

In the standalone case Flink uses an in-memory checkpoint store, which can pick up the savepoint you pass on the command line and recover state from it.

Besides, stopping the job with a savepoint and resuming it from that savepoint is the standard path for migrating jobs.
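To make that concrete, a hedged sketch of the usual sequence (job id, target directory and jar name are placeholders; the exact stop/cancel variant depends on your Flink version):

# trigger a savepoint for the running job
./bin/flink savepoint <jobId> hdfs:///flink/savepoints
# cancel the job (newer versions can also stop with a savepoint in one step)
./bin/flink cancel <jobId>
# after bouncing the cluster, resume from the savepoint path printed above
./bin/flink run -s <savepointPath> myJob.jar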

Best,
tison.

