Hi,
I'm running some jobs on the native Kubernetes integration. Sometimes, due to an unrelated issue with our K8s cluster (e.g. a K8s node crashed), my Flink pods are gone. The JM pod, since it is deployed via a Deployment, is re-created automatically; however, all of my jobs are lost. What I have to do now is:

1. Re-upload the jars
2. Find the path to the last checkpoint of each job
3. Resubmit the jobs

Is there any existing option to automate those steps? E.g.:

1. Can I use a jar file stored in the JM's file system or on S3 instead of uploading the jar file via the REST interface?
2. When restoring a job, I need to provide the full path of the last checkpoint (s3://<base_path>/<prev_job_id>/chk-2345/). Is there any option to just provide the base_path?
3. Can I store the info needed to restore the jobs in the K8s deployment config?

Thanks a lot.

Regards,
Averell
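P.S. For reference, the manual restore I do today looks roughly like the sketch below (the JobManager address and jar path are placeholders, and the checkpoint path is the one I have to look up by hand):

    # resubmit the job from the last retained checkpoint
    ./bin/flink run \
        -m <jobmanager-rest-address>:8081 \
        -s s3://<base_path>/<prev_job_id>/chk-2345/ \
        -d path/to/my-job.jar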
Hi Averell,

Thanks for trying the native K8s integration. All of your issues are due to high availability not being configured. If you start an HA Flink cluster, like the following, then when a JobManager/TaskManager terminates exceptionally, all the jobs can recover and restore from the latest checkpoint. Even if you delete the Flink cluster, when you start a new one with the same cluster-id, it can also be recovered. Note that this only holds as long as the jobs have not failed or been canceled. Please remember that you need to put the S3 filesystem jar into the plugin directory of the image manually [1].

    ./bin/kubernetes-session.sh \
      -Dkubernetes.cluster-id=k8s-ha-session-1 \
      -Dkubernetes.container.image=<IMAGE> \
      -Djobmanager.heap.size=4096m \
      -Dtaskmanager.memory.process.size=4096m \
      -Dtaskmanager.numberOfTaskSlots=4 \
      -Dkubernetes.jobmanager.cpu=1 \
      -Dkubernetes.taskmanager.cpu=2 \
      -Dhigh-availability=zookeeper \
      -Dhigh-availability.zookeeper.quorum=<ZK_QUORUM>:2181 \
      -Dhigh-availability.storageDir=s3://your-s3/flink-ha-k8s \
      -Drestart-strategy=fixed-delay \
      -Drestart-strategy.fixed-delay.attempts=10

Moreover, we do not have a native K8s HA service yet, so we still need to use the ZooKeeper-based HA. It is already planned [2], and I hope it can be done soon; then enabling HA for the native K8s integration will be more convenient.

Best,
Yang
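For the S3 filesystem jar, the manual step could look roughly like this (a sketch assuming the official Flink image, where FLINK_HOME is set and the optional jars ship under opt/, and the Presto S3 filesystem):

    # run inside the image build (e.g. a Dockerfile RUN step) to enable the
    # S3 filesystem via the plugin mechanism
    mkdir -p $FLINK_HOME/plugins/s3-fs-presto
    cp $FLINK_HOME/opt/flink-s3-fs-presto-*.jar $FLINK_HOME/plugins/s3-fs-presto/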
Thank you very much, Yang.