Hi Flink Folks:

If I have a Flink application configured with 10 restarts and it fails and restarts:
1. Does the job keep the same id?
2. Does the automatically restarted application pick up from the last checkpoint? I am assuming it does but just want to confirm.

Also, if it is running on AWS EMR, I believe EMR/Yarn is configured to restart the application 3 times (after it has exhausted its restart policy). If that is the case:
1. Does the job get a new id? I believe it does, but just want to confirm.
2. Does the Yarn restart honor the last checkpoint? I believe it does not, but is there a way to make it restart from the last checkpoint of the failed job (after it has exhausted its restart policy)?

Thanks
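For readers skimming this thread, here is a minimal sketch of the kind of setup being asked about: a fixed-delay restart strategy with 10 attempts plus periodic checkpointing. The class name, checkpoint interval, and restart delay are illustrative assumptions, not details from the actual application.

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds (interval chosen only for illustration).
        env.enableCheckpointing(60_000L);

        // Allow up to 10 automatic restarts, 10 seconds apart, before the job is declared failed.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, Time.of(10, TimeUnit.SECONDS)));

        env.fromElements(1, 2, 3).print();
        env.execute("restart-example");
    }
}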
Hi M,

Regarding your questions:
1. Yes. The id is fixed once the job graph is generated.
2. Yes.

Regarding Yarn mode:
1. The job id stays the same, because the job graph is generated once on the client side and persisted in DFS for reuse.
2. Yes, if high availability is enabled.

Thanks,
Zhu Zhu

M Singh <[hidden email]> wrote on Sat, May 23, 2020 at 4:06 AM:
Just to share some additional information. When a Flink application is deployed on Yarn and it exhausts its restart policy, the whole Yarn application fails. If you then start another instance (a new Yarn application), it cannot recover from the latest checkpoint automatically, even if high availability is configured, because the clusterId (i.e. the applicationId) has changed.

Best,
Yang

Zhu Zhu <[hidden email]> wrote on Mon, May 25, 2020 at 11:17 AM:
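A common workaround for this situation (not something discussed in this thread, so treat the details as an assumption): enable retained, externalized checkpoints so the latest checkpoint survives the failed Yarn application, then start the new submission from it explicitly with flink run -s <checkpoint-path>. A minimal sketch of the retention setup, with a placeholder checkpoint directory:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedCheckpointsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoints go to a DFS path; "hdfs:///flink/checkpoints" is a placeholder.
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));
        env.enableCheckpointing(60_000L);

        // Keep the latest completed checkpoint even after the job fails or is cancelled,
        // so a later, separate submission can be started from it via flink run -s <path>.
        env.getCheckpointConfig()
           .enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        env.fromElements("a", "b", "c").print();
        env.execute("retained-checkpoints-example");
    }
}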
Hi Zhu Zhu:

Just to clarify - from what I understand, EMR also has a default number of restart attempts (I think it is 3). So if EMR restarts the job, the job id is the same since the job graph is the same. Thanks for the clarification.
Hi Zhu Zhu:

I have another clarification - it looks like if I run the same app multiple times, its job id changes. So even though the graph is the same, the job id does not appear to depend only on the job graph, since different runs of the same app do not get the same id. Please let me know if I've missed anything. Thanks
Hi,

If you submit the same job multiple times, it will be assigned a different JobID each time. For Flink, different job submissions are considered to be different jobs. Once a job has been submitted, it keeps the same JobID, which is important in order to retrieve the checkpoints associated with that job.

Cheers,
Till

On Tue, May 26, 2020 at 12:42 PM M Singh <[hidden email]> wrote:
Hi M,

Sorry I missed your message. The JobID will not change for a generated JobGraph. However, a new JobGraph is generated each time a job is submitted, so multiple submissions will have multiple JobGraphs. This is because different submissions are considered different jobs, as Till mentioned. For example, you can submit an application to a cluster multiple times at the same time, and different JobIDs are needed to differentiate them.

Thanks,
Zhu Zhu

Till Rohrmann <[hidden email]> wrote on Wed, May 27, 2020 at 10:05 PM:
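To make the "one JobID per submission" point concrete, here is a small sketch using the JobClient API (the example itself is an addition, not something from the thread): each run of the program is a separate submission, so printing the JobID on two runs of the exact same program shows two different IDs.

import org.apache.flink.api.common.JobID;
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JobIdExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).print();

        // Each submission generates a fresh JobGraph and therefore a fresh JobID,
        // even if the program (and thus the logical graph) is identical to a previous run.
        JobClient jobClient = env.executeAsync("job-id-example");
        JobID jobId = jobClient.getJobID();
        System.out.println("Submitted with JobID: " + jobId);
    }
}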
Hi Till/Zhu/Yang:

Thanks for your replies. So just to clarify - the job id remains the same as long as the job's restart attempts have not been exhausted. Does Yarn also resubmit the job in case of failures, and if so, is the job id different? Thanks
Hi,

Yarn won't resubmit the job. In case of a process failure where Yarn restarts the Flink Master, the Master will recover the submitted jobs from a persistent storage system.

Cheers,
Till

On Thu, May 28, 2020 at 4:05 PM M Singh <[hidden email]> wrote:
Thanks Till - in the case of a restart of the Flink Master, I believe the job id will be different. Thanks
Restarting the Flink Master does not change the job id within one Yarn application. To put it simply: in a Yarn application that runs a Flink cluster, the job id of a job does not change once the job is submitted. You can even submit a Flink application multiple times to that cluster (if it is session mode), but each submission will be treated as a different job and will have a different job id.

Thanks,
Zhu Zhu

M Singh <[hidden email]> wrote on Fri, May 29, 2020 at 4:59 AM: