Automatically resuming failed jobs in K8s
Posted by Averell
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Automatically-resuming-failed-jobs-in-K8s-tp35846.html
Hi,
I'm running some jobs using native Kubernetes. Sometimes, due to an unrelated
issue with our K8s cluster (e.g., a K8s node crashing), my Flink pods are gone.
The JM pod, as it is deployed using a Deployment, is re-created automatically.
However, all of my jobs are lost.
What I have to do now is the following (see the sketch after this list):
1. Re-upload the jars
2. Find the path to the last checkpoint of each job
3. Resubmit the job
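For reference, this is roughly what those three steps look like when I do them by
hand against the JobManager's REST interface. It is just a sketch: the host, jar
path and checkpoint path are placeholders, and I'm assuming the standard
/jars/upload and /jars/:jarid/run endpoints.

import requests

JM = "http://<jobmanager-rest-host>:8081"   # REST address of the re-created JM pod (placeholder)

# Step 1: re-upload the jar via the REST interface
with open("/path/to/my-job.jar", "rb") as f:          # placeholder path
    resp = requests.post(
        f"{JM}/jars/upload",
        files={"jarfile": ("my-job.jar", f, "application/x-java-archive")},
    )
resp.raise_for_status()
jar_id = resp.json()["filename"].split("/")[-1]       # /jars/upload returns the stored file name

# Step 2: today the path to the last checkpoint has to be found by hand
last_checkpoint = "s3://<base_path>/<prev_job_id>/chk-2345"   # placeholder

# Step 3: resubmit the job, restoring from that checkpoint
resp = requests.post(
    f"{JM}/jars/{jar_id}/run",
    json={"savepointPath": last_checkpoint, "allowNonRestoredState": False},
)
resp.raise_for_status()
print("new job:", resp.json()["jobid"])

It works, but it needs the old job id and the checkpoint number, which is exactly
what I'd like to stop looking up manually.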
Is there any existing option to automate those steps? E.g.
1. Can I use a jar file stored on the JM's file system or on S3 instead of
uploading the jar file via the REST interface?
2. When restoring a job, I need to provide the full path to the last checkpoint
(s3://<base_path>/<prev_job_id>/chk-2345). Is there any option to provide just
the base_path? (See the sketch after this list.)
3. Can I store the info needed to restore the jobs in the K8s deployment config?
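To make question 2 concrete, below is roughly the lookup I would like to avoid
scripting myself. Bucket and prefix names are placeholders, and picking the
highest-numbered chk-N prefix is just my assumption about the checkpoint layout
on S3, not an existing Flink option.

import re
import boto3

def latest_checkpoint(bucket: str, base_prefix: str) -> str:
    """Return the s3:// path of the highest-numbered chk-N prefix under base_prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    best_n, best_prefix = -1, None
    for page in paginator.paginate(Bucket=bucket, Prefix=base_prefix, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            m = re.search(r"chk-(\d+)/$", cp["Prefix"])
            if m and int(m.group(1)) > best_n:
                best_n, best_prefix = int(m.group(1)), cp["Prefix"]
    if best_prefix is None:
        raise RuntimeError(f"no chk-* prefixes under s3://{bucket}/{base_prefix}")
    return f"s3://{bucket}/{best_prefix.rstrip('/')}"

# e.g. latest_checkpoint("<bucket>", "<base_path>/<prev_job_id>/")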
Thanks a lot.
Regards,
Averell