JobManager restarts all jobs

Dominik Wosiński
Hey,
I have an issue that is possibly connected with configuration. When I start the JobManager in ZooKeeper HA mode, everything works fine: jobs are executed, etc.

But when I cancel a job, or a job fails, then after the next leader re-election the new JobManager restarts all jobs, including those that had failed or been cancelled.

This causes the JobManager to fail. I am using the Blob Server on HDFS, and when a job is cancelled or fails its blob data is removed immediately, so the new JobManager tries to access blob data that no longer exists and throws a "no such file" exception.

The log looks like this:

Re: JobManager restarts all jobs

Fabian Hueske-2
Hi Dominik,

Which version are you running?
Till (in CC) is most familiar with the job recovery and might be able to help.

Best,
Fabian

(Replying to Dominik Wosiński's message of 2018-06-21 11:01 GMT+02:00.)

Re: JobManager restarts all jobs

Till Rohrmann
Hi Dominik,

Could you share the cluster entrypoint log files with us? I guess this problem is related to FLINK-9575 [1]. I think we should not delete the blob files if we could not delete the JobGraph from the submitted job graph store.
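
To illustrate the ordering I have in mind, here is a minimal sketch in Python. It is a toy model under stated assumptions, not Flink's actual code: the classes `InMemoryJobGraphStore` and `InMemoryBlobStore` and the function `cleanup_finished_job` are hypothetical names made up for this example. The idea is simply that the job graph is removed first, and the blobs are deleted only if that removal succeeded.

```python
class JobGraphStoreError(Exception):
    """Raised when the (hypothetical) job graph store cannot remove an entry."""


class InMemoryJobGraphStore:
    """Toy stand-in for the submitted job graph store; not a Flink class."""

    def __init__(self, fail_on_remove=False):
        self.graphs = {}
        self.fail_on_remove = fail_on_remove

    def put(self, job_id, graph):
        self.graphs[job_id] = graph

    def remove(self, job_id):
        if self.fail_on_remove:
            raise JobGraphStoreError(f"could not remove job graph {job_id}")
        self.graphs.pop(job_id, None)


class InMemoryBlobStore:
    """Toy stand-in for the blob storage (HDFS in the setup described above)."""

    def __init__(self):
        self.blobs = {}

    def put(self, job_id, data):
        self.blobs[job_id] = data

    def delete(self, job_id):
        self.blobs.pop(job_id, None)


def cleanup_finished_job(job_id, graph_store, blob_store):
    """Remove the job graph first; delete blobs only if that succeeded.

    Returns True if the job was fully cleaned up, False otherwise.
    """
    try:
        graph_store.remove(job_id)
    except JobGraphStoreError:
        # The graph is still stored, so a new leader may recover the job.
        # Keep the blobs so that recovery does not hit missing files.
        return False
    blob_store.delete(job_id)
    return True
```

With this ordering, a job whose graph could not be removed may still be recovered after a leader change, but the recovery at least finds its blob data.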


Cheers,
Till

(Replying to Fabian Hueske's message of Fri, Jun 22, 2018 at 9:57 AM.)