Recovery from job manager crash using checkpoints

Recovery from job manager crash using checkpoints

min.tan

Hi,

I can use checkpoints to recover Flink state when a task manager crashes.

I cannot use checkpoints to recover Flink state when a job manager crashes.

Do I need to set up ZooKeeper to keep the state when a job manager crashes?

Regards

Min




Re: Recovery from job manager crash using checkpoints

miki haiat
Which kind of deployment are you using: standalone, YARN, or something else?


Re: Recovery from job manager crash using checkpoints

Biao Liu
In reply to this post by min.tan
Hi Min,

> Do I need to set up ZooKeeper to keep the state when a job manager crashes?

I guess you need to set up HA [1] properly. Besides that, I would also suggest checking the state backend.
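For reference, a minimal flink-conf.yaml sketch of what ZooKeeper HA plus a durable state backend could look like (the quorum address and HDFS paths below are placeholders, adjust them to your environment):

high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha/
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints

With something like this in place, the metadata needed for recovery lives outside the JM process, so a restarted JM can find the previous jobs and their latest checkpoints.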



Re: Recovery from job manager crash using checkpoints

tison
Hi Min,

I guess you are using standalone high availability: when a TM fails, the JM can recover the job from its in-memory checkpoint store.

However, when the JM fails, since you don't persist state to an HA backend such as ZooKeeper, even if the JM is relaunched by the YARN RM or superseded by a standby, the new one knows nothing about the previous jobs.

In short, you need to set up ZooKeeper, as you mentioned.
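As a side note, assuming you run Flink on YARN: ZooKeeper HA is usually paired with a higher number of application attempts so that YARN keeps restarting the JM after failures. A flink-conf.yaml sketch, with a purely illustrative value:

yarn.application-attempts: 10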

Best,
tison.



RE: Recovery from job manager crash using checkpoints

min.tan

Thanks for the helpful reply.

One more question: does this ZooKeeper/HA requirement also apply to savepoints?

Can I bounce a single-JobManager cluster and rerun my Flink job from its previous state with a savepoint directory? e.g.

./bin/flink run -s savepointDirectory myJob.jar

Regards,

Min

Re: Recovery from job manager crash using checkpoints

tison
Hi Min,

For your question, the answer is no.

In the standalone case Flink uses an in-memory checkpoint store, which can pick up the savepoint you pass on the command line and recover state from it.

Besides, stopping the job with a savepoint and resuming it from that savepoint is the standard path for migrating jobs.
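To make that concrete, a hedged sketch of the usual sequence (job id, target directory and jar name are placeholders; the exact stop/cancel variant depends on your Flink version):

# trigger a savepoint for the running job
./bin/flink savepoint <jobId> hdfs:///flink/savepoints
# cancel the job (newer versions can also stop with a savepoint in one step)
./bin/flink cancel <jobId>
# after bouncing the cluster, resume from the savepoint path printed above
./bin/flink run -s <savepointPath> myJob.jar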

Best,
tison.

