Automatically resuming failed jobs in K8s


Automatically resuming failed jobs in K8s

Averell
Hi,
I'm running some jobs using native Kubernetes. Sometimes, due to an unrelated
issue with our K8s cluster (e.g. a K8s node crashing), my Flink pods are gone.
The JM pod, as it is deployed using a Deployment, is re-created
automatically. However, all of my jobs are lost.
What I have to do now is:
1. Re-upload the jars
2. Find the path to the last checkpoint of each job
3. Resubmit the job (roughly as in the sketch below)
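
For reference, the manual resubmission in step 3 currently looks roughly like
this (the checkpoint path and jar name are just placeholders, and the exact
flags for targeting the session cluster are omitted):

./bin/flink run \
-s s3://<base_path>/<prev_job_id>/chk-2345/ \
my-job.jar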

Is there any existing option to automate those steps? E.g.:
1. Can I use a jar file stored on the JM's file system or on S3 instead of
uploading the jar file via the REST interface?
2. When restoring the job, I need to provide the full path of the last
checkpoint (s3://<base_path>/<prev_job_id>/chk-2345/). Is there any option
to just provide the base_path?
3. Can I store the info needed to restore the jobs in the K8s deployment config?

Thanks a lot.

Regards,
Averell




Re: Automatically resuming failed jobs in K8s

Yang Wang
Hi Averell,

Thanks for trying the native K8s integration. All of the issues you describe come from high availability (HA)
not being configured. If you start an HA Flink cluster, like the kubernetes-session.sh command below, then when a
JobManager/TaskManager terminates unexpectedly, all the jobs can recover and restore from the latest checkpoint.
Even if you delete the Flink cluster, when you start a new one with the same cluster-id, the jobs can
also be recovered. Note that this only applies to jobs that have not failed terminally or been canceled.

Please remember that you need to put the S3 filesystem jar into the plugin directory of the image manually[1].
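
As a minimal sketch of such an image (assuming the official flink:1.10 base
image and the Presto S3 filesystem; adjust names and versions to your setup):

FROM flink:1.10.1
# Copy the bundled S3 filesystem jar from opt/ into its own plugins/ sub-directory
RUN mkdir -p /opt/flink/plugins/s3-fs-presto && \
    cp /opt/flink/opt/flink-s3-fs-presto-*.jar /opt/flink/plugins/s3-fs-presto/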

./bin/kubernetes-session.sh \
-Dkubernetes.cluster-id=k8s-ha-session-1 \
-Dkubernetes.container.image=<IMAGE> \
-Djobmanager.heap.size=4096m \
-Dtaskmanager.memory.process.size=4096m \
-Dtaskmanager.numberOfTaskSlots=4 \
-Dkubernetes.jobmanager.cpu=1 \
-Dkubernetes.taskmanager.cpu=2 \
-Dhigh-availability=zookeeper \
-Dhigh-availability.zookeeper.quorum=<ZK_QUORUM>:2181 \
-Dhigh-availability.storageDir=s3://your-s3/flink-ha-k8s \
-Drestart-strategy=fixed-delay \
-Drestart-strategy.fixed-delay.attempts=10
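
With the session cluster running, a job can then be submitted (or re-submitted)
by pointing the client at the same cluster-id, for example like this (a sketch
assuming Flink 1.10; the jar path is a placeholder):

./bin/flink run -d \
-e kubernetes-session \
-Dkubernetes.cluster-id=k8s-ha-session-1 \
/path/to/my-job.jar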


Moreover, we do not have a native K8s HA yet, so we still need to use the ZooKeeper-based HA. Native K8s HA
is already planned[2], and I hope it can be done soon; then enabling HA for the native K8s integration will be more convenient.


Best,
Yang




Re: Automatically resuming failed jobs in K8s

Averell