(DEPRECATED) Apache Flink User Mailing List archive.

K8s job cluster and cancel and resume from a save point ?


Vishal Santoshi

Re: K8s job cluster and cancel and resume from a save point ?

I confirm that 1.8.0 fixes all of the above issues. The JM process exits with code 0 and the pod terminates (TERMINATED state). This is true for both the PATCH cancel and the POST save point with cancel shown above.

Thank you for fixing this issue.

On Wed, Mar 13, 2019 at 10:17 AM Vishal Santoshi <[hidden email]> wrote:

BTW, does 1.8 also fix cancelling with a save point? That too is broken in 1.7.2:
curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":true}'    https://*************/jobs/00000000000000000000000000000000/savepoints
On Tue, Mar 12, 2019 at 11:55 AM Vishal Santoshi <[hidden email]> wrote:
Awesome, thanks!
On Tue, Mar 12, 2019 at 11:53 AM Gary Yao <[hidden email]> wrote:
The RC artifacts are only deployed to the Maven Central Repository when the RC
is promoted to a release. As written in the 1.8.0 RC1 voting email [1], you
can find the maven artifacts, and the Flink binaries here:

- https://repository.apache.org/content/repositories/orgapacheflink-1210/
- https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/

Alternatively, you can apply the patch yourself, and build Flink 1.7 from
sources [2]. On my machine this takes around 10 minutes if tests are skipped.

Best,
Gary

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink
On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi <[hidden email]> wrote:
Do you have a Maven repository (on Maven Central) set up for the 1.8 release candidate? We could test it for you.

Without 1.8 and this exit code fix we are essentially held up.
On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <[hidden email]> wrote:
Nobody can tell with 100% certainty. We want to give the RC some exposure
first, and there is also a release process that is prescribed by the ASF [1].
You can look at past releases to get a feeling for how long the release
process lasts [2].

[1] http://www.apache.org/legal/release-policy.html#release-approval
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0
On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <[hidden email]> wrote:
And when is the 1.8.0 release expected ?
On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <[hidden email]> wrote:
:) That makes so much more sense. Is k8s native flink a part of this release ?
On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <[hidden email]> wrote:
Hi Vishal,

This issue was fixed recently [1], and the patch will be released with 1.8. If
the Flink job gets cancelled, the JVM should exit with code 0. There is a
release candidate [2], which you can test.

Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-10743
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <[hidden email]> wrote:
Thanks Vijay,

This is the larger issue. The cancellation routine is itself broken.

On cancellation Flink does remove the checkpoint counter
2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
but exits with a non-zero code
2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.

That I think is an issue. A cancelled job is a complete job and thus the exit code should be 0 for k8s to mark it complete.
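
To illustrate the point about exit codes, here is a minimal sketch of how one might inspect what Kubernetes actually recorded. It assumes `kubectl` access to the namespace; the pod and job names passed to the functions are placeholders, not names from this thread. A Kubernetes Job only counts a pod as succeeded when its containers exit with 0; a non-zero exit is treated as a failure and, depending on the restart policy and backoff limit, is retried, which would be consistent with the restart described in this thread. On Linux only the low 8 bits of an exit status are visible to the parent process, so a value logged as 1444 would show up as 164.

```shell
#!/bin/sh
# Sketch only: pod and job names are supplied by the caller (placeholders).

# Exit code of the first container in a pod, once it has terminated.
pod_exit_code() {
    kubectl get pod "$1" \
        -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'
}

# Number of pods of a Job that finished successfully (empty if none yet).
job_succeeded_count() {
    kubectl get job "$1" -o jsonpath='{.status.succeeded}'
}

# Exit status as seen by the parent process: only the low 8 bits survive.
reported_exit_code() {
    echo $(( $1 % 256 ))
}

# How a Kubernetes Job interprets a container exit code.
classify_exit_code() {
    case "$1" in
        0)  echo "succeeded" ;;
        "") echo "not terminated yet" ;;
        *)  echo "failed" ;;
    esac
}
```

For example, `classify_exit_code "$(pod_exit_code my-job-pod)"` (with a hypothetical pod name) prints how the Job controller would see the JobManager's exit.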
On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <[hidden email]> wrote:
Yes Vishal. That's correct.

Regards
Bhaskar
On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <[hidden email]> wrote:
This is really not cool, but here you go. This seems to work. Agreed that this should not be this painful. The cancel does not exit with an exit code of 0, and thus the job has to be deleted manually. Vijay, does this align with what you have had to do?
Take a save point. This returns a request id
curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'    https://*************/jobs/00000000000000000000000000000000/savepoints
Make sure the save point succeeded
curl  --request GET   https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
Cancel the job
curl  --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
Delete the job and deployment
kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml

kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
Edit job-cluster-job-deployment.yaml. Add/edit:
args: ["job-cluster",
       "--fromSavepoint",
       "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
       "--job-classname", .........
Restart
kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml

kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
Check in the UI that the job restored from that specific save point.
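
The REST part of the steps above (take a save point, wait for it to succeed, then cancel) could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: `FLINK_URL` and `TARGET_DIR` are placeholders, the response shapes (`request-id`, `status.id`, `operation.location`) are assumed to match the Flink 1.7 REST documentation, the `sed`-based JSON parsing only handles simple single-line responses, and the polling interval and attempt limit are arbitrary.

```shell
#!/bin/sh
# Sketch only: FLINK_URL and TARGET_DIR are placeholders, not real endpoints.
FLINK_URL="${FLINK_URL:-https://flink.example.invalid}"
JOB_ID="${JOB_ID:-00000000000000000000000000000000}"
TARGET_DIR="${TARGET_DIR:-hdfs://namenode.example.invalid:8020/tmp/savepoints}"

# Print the value of the string field "$1" from a one-line JSON document on stdin.
json_string_field() {
    sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}

# Trigger a save point without cancelling; prints the request id.
trigger_savepoint() {
    curl --silent --fail --header "Content-Type: application/json" \
         --request POST \
         --data "{\"target-directory\":\"$TARGET_DIR\",\"cancel-job\":false}" \
         "$FLINK_URL/jobs/$JOB_ID/savepoints" | json_string_field request-id
}

# Poll the request id "$1" until the save point completes; prints its location.
wait_for_savepoint() {
    attempts=0
    while [ "$attempts" -lt 30 ]; do
        body=$(curl --silent --fail "$FLINK_URL/jobs/$JOB_ID/savepoints/$1") || return 1
        status=$(printf '%s\n' "$body" | json_string_field id)
        if [ "$status" = "COMPLETED" ]; then
            location=$(printf '%s\n' "$body" | json_string_field location)
            # A completed request without a location is treated as a failure.
            [ -n "$location" ] || return 1
            printf '%s\n' "$location"
            return 0
        fi
        attempts=$((attempts + 1))
        sleep 2
    done
    return 1
}

# Cancel the job with the PATCH request used in the steps above.
cancel_job() {
    curl --silent --fail --request PATCH "$FLINK_URL/jobs/$JOB_ID?mode=cancel"
}
```

A run would then look like `location=$(wait_for_savepoint "$(trigger_savepoint)") && cancel_job`, after which `$location` is the path to put after `--fromSavepoint` in the job manifest. Cancelling only after the save point is confirmed avoids losing state if the save point fails.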
On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <[hidden email]> wrote:
Yes, it's supposed to work, but unfortunately it was not working. The Flink community needs to respond to this behavior.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <[hidden email]> wrote:
Aah.
Let me try this out and will get back to you.
Though I would assume that save point with cancel is a single atomic step, rather than a save point followed by a cancellation (else why would that be an option).
Thanks again.

On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <[hidden email]> wrote:
Hi Vishal,

yarn-cancel isn't meant only for YARN clusters. It works for all clusters, and it is the recommended command.

Use the following command to trigger a save point:
curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' \
     https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

Then issue yarn-cancel.
After that, follow the process to restore from the save point.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <[hidden email]> wrote:
Hello Vijay,

Thank you for the reply. This, though, is a k8s deployment (rather than YARN), but maybe they follow the same lifecycle. I issue a save point with cancel as documented at https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, a straight-up
curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
     https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

I would assume that after taking the save point the JVM should exit; after all, the k8s deployment is of kind: job, and if it is a job cluster then a cancellation should exit the JVM and hence the pod. It does seem to do some things right. It stops a bunch of stuff (the JobMaster, the SlotPool, the ZooKeeper coordinator, etc.). It also removes the checkpoint counter, but it does not exit the job. And after a little bit the job is restarted, which does not make sense and is absolutely not the right thing to do (to me at least).

Further, if I delete the deployment and the job from k8s and restart them with --fromSavepoint, it refuses to honor the save point. I have to delete the ZK chroot for it to consider the save point.

Thus the process of cancelling and resuming from a save point on a k8s job cluster deployment seems to be:
1. Cancel with save point as defined here: https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
2. Delete the job manager job and task manager deployments from k8s almost immediately.
3. Clear the ZK chroot for the 0000000...... job and maybe the checkpoints directory.
4. Restart the job with --fromSavepoint.
Can somebody confirm that this is indeed the process?

Logs are attached.

2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.

2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.

2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).

2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.

2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.

2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.

2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.

2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down

2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.

2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper

2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.

2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.

2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper

2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.

2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).

2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.

2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.

2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.

2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..

2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.

2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.

2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager [hidden email]://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.

2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.

After a little bit

Starting the job-cluster

used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`

Starting standalonejob as a console application on host anomalyecho-mmg6t.
..
..

Regards.

On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <[hidden email]> wrote:
Hi Vishal

Save point with cancellation internally uses the /cancel REST API, which is not a stable API. It always exits with 404. The best way to do this is:

a) First call the save point REST API.
b) Then call the /yarn-cancel REST API (as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@...%3E).
c) Then, when resuming your job, pass the save point path returned by (a) as an argument to the run jar REST API.
The above is the smoother way.
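
As a rough sketch of steps (b) and (c), assuming the `GET /jobs/:jobid/yarn-cancel` endpoint and the `savepointPath` field of the `POST /jars/:jarid/run` request body as described in the Flink REST documentation: `FLINK_URL` is a placeholder, and the jar id passed to `run_jar_from_savepoint` is hypothetical. The run jar endpoint applies to a session cluster with an uploaded jar; a job cluster would take the save point path through its startup arguments instead.

```shell
#!/bin/sh
# Sketch only: FLINK_URL is a placeholder; the jar id is supplied by the caller.
FLINK_URL="${FLINK_URL:-https://flink.example.invalid}"
JOB_ID="${JOB_ID:-00000000000000000000000000000000}"

# (b) URL of the GET-based cancel endpoint.
yarn_cancel_url() {
    printf '%s/jobs/%s/yarn-cancel\n' "$FLINK_URL" "$JOB_ID"
}

cancel_via_yarn_cancel() {
    curl --silent --fail --request GET "$(yarn_cancel_url)"
}

# (c) JSON body for the run jar request, restoring from save point path "$1".
jar_run_body() {
    printf '{"savepointPath":"%s"}\n' "$1"
}

# $1 = jar id (hypothetical), $2 = save point path returned by step (a).
run_jar_from_savepoint() {
    curl --silent --fail --header "Content-Type: application/json" \
         --request POST --data "$(jar_run_body "$2")" \
         "$FLINK_URL/jars/$1/run"
}
```

Keeping the URL and body construction in separate functions makes it easy to print and inspect the request before sending anything to a live cluster.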

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <[hidden email]> wrote:
There are some issues I see and would like some feedback on.

1. On cancellation with save point with a target directory, the k8s job does not exit (it is not a deployment). I would assume that on cancellation the JVM should exit, after cleanup etc., and thus the pod should too. That does not happen, and thus the job pod remains live. Is that expected?

2. To resume from a save point it seems that I have to delete the job id (0000000000....) from ZooKeeper (this is HA), else it defaults to the latest checkpoint no matter what.

I am curious what the tested process is in 1.7.2 for cancelling with a save point and resuming, and what the cogent story is around the job id (which defaults to 000000000000..). Note that --job-id does not work with 1.7.2, so even though that does not make sense, I still cannot provide a new job id.

Regards,

Vishal.