(DEPRECATED) Apache Flink User Mailing List archive.

JobManager restarts all jobs

Dominik Wosiński

JobManager restarts all jobs

Hey,
I have an issue that is possibly connected with configuration. When I start the JobManager in ZooKeeper HA mode, everything works fine: jobs are executed, etc.

But when I cancel any job, or a job fails, then after the leader re-election the new JobManager restarts all jobs, even those that had failed or been cancelled.

This causes the JobManager to fail. I am using the Blob Server on HDFS, and when a job is cancelled or fails, its blob data is removed immediately. The new JobManager then tries to access blob data that no longer exists and throws a "no such file" exception.

The log looks like this:

Fabian Hueske-2

Re: JobManager restarts all jobs

Hi Dominik,

Which version are you running?

Till (in CC) is most familiar with the job recovery and might be able to help.

Best,

Fabian

2018-06-21 11:01 GMT+02:00 Dominik Wosiński <[hidden email]>:

Till Rohrmann

Re: JobManager restarts all jobs

Hi Dominik,

could you share the cluster entrypoint log files with us? I guess this problem is related to FLINK-9575 [1]. I think we should not delete the blob files if we could not delete the JobGraph from the submitted job graph store.

[1] https://issues.apache.org/jira/browse/FLINK-9575
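
The ordering suggested above (remove the JobGraph first, and only then delete the blobs) could be sketched roughly as follows. This is only an illustration under stated assumptions: `JobGraphStore`, `BlobStore`, and `cleanUpFinishedJob` are hypothetical names made up for this sketch. They are not Flink's actual classes, and this is not the change made for FLINK-9575.

```java
import java.util.HashSet;
import java.util.Set;

class JobCleanupSketch {

    /** Hypothetical stand-in for the HA store that holds submitted job graphs. */
    interface JobGraphStore {
        void remove(String jobId) throws Exception;
    }

    /** Hypothetical stand-in for the blob storage (for example, a directory on HDFS). */
    interface BlobStore {
        void deleteBlobs(String jobId);
    }

    /**
     * Cleans up a job that was cancelled or has failed.
     * Returns true only if both the job graph and the blobs were removed.
     */
    static boolean cleanUpFinishedJob(String jobId, JobGraphStore graphs, BlobStore blobs) {
        try {
            graphs.remove(jobId);
        } catch (Exception e) {
            // The job graph is still in the store, so a newly elected leader
            // may try to recover this job. Keep the blobs so that recovery
            // does not fail on missing files.
            return false;
        }
        // The job graph is gone, so no leader will recover the job
        // and its blobs can be deleted safely.
        blobs.deleteBlobs(jobId);
        return true;
    }

    public static void main(String[] args) {
        Set<String> blobData = new HashSet<>();
        blobData.add("job-1");

        // Simulate a store from which the job graph cannot be removed.
        JobGraphStore failingStore = id -> { throw new Exception("store unreachable"); };
        boolean cleaned = cleanUpFinishedJob("job-1", failingStore, blobData::remove);

        System.out.println("cleaned up: " + cleaned);
        System.out.println("blobs kept: " + blobData.contains("job-1"));
    }
}
```

With this ordering, a failed removal from the job graph store leaves the blobs in place, so a job that the new leader recovers can still find its files.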

Cheers,

Till

On Fri, Jun 22, 2018 at 9:57 AM Fabian Hueske <[hidden email]> wrote:
