Automatically resuming failed jobs in K8s


Automatically resuming failed jobs in K8s

Averell
Hi,
I'm running some jobs using native Kubernetes. Sometimes, due to an unrelated
issue with our K8s cluster (e.g. a K8s node crashing), my Flink pods are gone.
The JM pod, as it is deployed using a Deployment, is re-created
automatically. However, all of my jobs are lost.
What I have to do now is:
1. Re-upload the jars
2. Find the path to the last checkpoint of each job
3. Resubmit the job (roughly as in the sketch below)
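
For reference, the manual resubmission in step 3 currently looks roughly like
this (the checkpoint path and jar name are just placeholders, and the exact
flags for targeting the session cluster are omitted):

./bin/flink run \
-s s3://<base_path>/<prev_job_id>/chk-2345/ \
my-job.jar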

Is there any existing option to automate those steps? E.g.:
1. Can I use a jar file stored on the JM's file system or on S3 instead of
uploading the jar file via the REST interface?
2. When restoring the job, I need to provide the full path of the last
checkpoint (s3://<base_path>/<prev_job_id>/chk-2345/). Is there any option
to just provide the base_path?
3. Can I store the info needed to restore the jobs in the K8s deployment config?

Thanks a lot.

Regards,
Averell




Re: Automatically resuming failed jobs in K8s

Yang Wang
Hi Averell,

Thanks for trying the native K8s integration. All of the issues you describe come from high availability (HA)
not being configured. If you start an HA Flink cluster, like the kubernetes-session.sh command below, then when a
JobManager/TaskManager terminates unexpectedly, all the jobs can recover and restore from the latest checkpoint.
Even if you delete the Flink cluster, when you start a new one with the same cluster-id, the jobs can
also be recovered. Note that this only applies to jobs that have not failed terminally or been canceled.

Please remember that you need to put the S3 filesystem jar into the plugin directory of the image manually[1].
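
As a minimal sketch of such an image (assuming the official flink:1.10 base
image and the Presto S3 filesystem; adjust names and versions to your setup):

FROM flink:1.10.1
# Copy the bundled S3 filesystem jar from opt/ into its own plugins/ sub-directory
RUN mkdir -p /opt/flink/plugins/s3-fs-presto && \
    cp /opt/flink/opt/flink-s3-fs-presto-*.jar /opt/flink/plugins/s3-fs-presto/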

./bin/kubernetes-session.sh \
-Dkubernetes.cluster-id=k8s-ha-session-1 \
-Dkubernetes.container.image=<IMAGE> \
-Djobmanager.heap.size=4096m \
-Dtaskmanager.memory.process.size=4096m \
-Dtaskmanager.numberOfTaskSlots=4 \
-Dkubernetes.jobmanager.cpu=1 \
-Dkubernetes.taskmanager.cpu=2 \
-Dhigh-availability=zookeeper \
-Dhigh-availability.zookeeper.quorum=<ZK_QUORUM>:2181 \
-Dhigh-availability.storageDir=s3://your-s3/flink-ha-k8s \
-Drestart-strategy=fixed-delay \
-Drestart-strategy.fixed-delay.attempts=10
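
With the session cluster running, a job can then be submitted (or re-submitted)
by pointing the client at the same cluster-id, for example like this (a sketch
assuming Flink 1.10; the jar path is a placeholder):

./bin/flink run -d \
-e kubernetes-session \
-Dkubernetes.cluster-id=k8s-ha-session-1 \
/path/to/my-job.jar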


Moreover, we do not have a native K8s HA yet, so we still need to use the ZooKeeper-based HA. Native K8s HA
is already planned[2], and I hope it can be done soon; then enabling HA for the native K8s integration will be more convenient.


Best,
Yang




Re: Automatically resuming failed jobs in K8s

Averell