curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
  https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
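For anyone trying to reproduce this: the POST above only triggers the savepoint asynchronously. As far as I understand the REST API, it returns a "request-id", and the result has to be polled from /jobs/<jobid>/savepoints/<request-id>. Below is a rough sketch of that, not a tested tool. FLINK_REST and JOB_ID are placeholders, the json_field helper is my own (a crude sed extraction, so no jq is needed), and the response field names are what I believe the API returns.

```shell
#!/bin/sh
# Placeholders (assumptions): point these at your own REST endpoint and job id.
FLINK_REST="${FLINK_REST:-http://localhost:8081}"
JOB_ID="${JOB_ID:-00000000000000000000000000000000}"

# Hypothetical helper: extract a string field from a small JSON response.
# $1 = field name, stdin = JSON text. Crude, but avoids a jq dependency.
json_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}

# Trigger cancel-with-savepoint, then poll until the operation is no longer
# IN_PROGRESS. Prints the final status response on success.
cancel_with_savepoint() {
  target_dir="$1"
  request_id=$(curl -s -H "Content-Type: application/json" -X POST \
    -d "{\"target-directory\":\"$target_dir\",\"cancel-job\":true}" \
    "$FLINK_REST/jobs/$JOB_ID/savepoints" | json_field request-id)
  [ -n "$request_id" ] || { echo "no request-id returned" >&2; return 1; }

  while :; do
    response=$(curl -s "$FLINK_REST/jobs/$JOB_ID/savepoints/$request_id")
    # An empty response may simply mean the REST endpoint has already shut down.
    [ -n "$response" ] || { echo "no response from REST endpoint" >&2; return 1; }
    case "$response" in
      *IN_PROGRESS*) sleep 1 ;;
      *) echo "$response"; return 0 ;;
    esac
  done
}
```

Something like `cancel_with_savepoint hdfs://namenode:8020/tmp/xyz1 | json_field location` should then print the savepoint path, assuming the operation completes before the endpoint goes away.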
I would assume that after taking the savepoint the JVM should exit. After all, the k8s resource is of kind: Job, and in a job cluster a cancellation should exit the JVM and hence the pod. It does seem to do some things right: it stops a bunch of components (the JobMaster, the SlotPool, the ZooKeeper coordinator etc.) and it also removes the checkpoint counter, but it does not exit. After a little while the job is restarted, which does not make sense and is absolutely not the right thing to do (to me at least).
Further, if I delete the deployment and the job from k8s and restart the job and deployment with fromSavepoint, it refuses to honor the savepoint. I have to delete the ZK chroot for it to consider the savepoint.
Thus the process of cancelling and resuming from a savepoint on a k8s job cluster deployment seems to be: cancel with savepoint, delete the job and the deployment from k8s, delete the ZK chroot, and then restart the job and deployment with fromSavepoint.
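The cleanup part of that workaround (delete the k8s job and deployment, then clear the ZK chroot) could be scripted roughly as below. This is only a sketch: every name in it (JM_JOB, TM_DEPLOYMENT, ZK_SERVER, ZK_CHROOT and their default values) is a placeholder I made up, not something taken from my actual setup. The function only prints the commands so they can be reviewed before running anything destructive.

```shell
#!/bin/sh
# All values below are hypothetical placeholders; substitute your own names.
JM_JOB="${JM_JOB:-anomalyecho-jobmanager}"
TM_DEPLOYMENT="${TM_DEPLOYMENT:-anomalyecho-taskmanager}"
ZK_SERVER="${ZK_SERVER:-zookeeper:2181}"
ZK_CHROOT="${ZK_CHROOT:-/k8s_anomalyecho}"

# Print the cleanup commands instead of executing them (a dry run).
cleanup_plan() {
  echo "kubectl delete job $JM_JOB"
  echo "kubectl delete deployment $TM_DEPLOYMENT"
  # 'deleteall' is the ZooKeeper 3.5+ CLI command; on 3.4 the equivalent is 'rmr'.
  echo "zkCli.sh -server $ZK_SERVER deleteall $ZK_CHROOT"
}
```

Once the printed commands look right, `cleanup_plan | sh` would run them, after which the job and deployment can be re-created with the savepoint path.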
Logs are attached.
2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager [hidden email]://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
Regards.