Yes, that would work. It might still be interesting to understand why you ran into the timeout, though. Was the state simply large enough that the savepoint took longer than expected, or was there a network issue? That is just for you to better understand the underlying cause. I'm glad the savepoint creation was successful in the end.
Best, Matthias
Hi Matthias, you are correct. After a few minutes I took another look at my savepoint folder and the data was there. I think increasing the timeout may resolve the problem?
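The behaviour reported here (the CLI gives up, yet the savepoint data appears a few minutes later) is consistent with the `TimeoutException` in the stack trace coming from a bounded wait on a `CompletableFuture` in the client. A minimal sketch of that pattern follows. It is not Flink code: the class `ClientTimeoutSketch`, the methods `slowSavepoint` and `timesOut`, and the placeholder path are all hypothetical, and the delays are arbitrary.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ClientTimeoutSketch {

    // Hypothetical stand-in for a slow savepoint: finishes after delayMillis
    // and returns a placeholder path (not a real savepoint location).
    public static CompletableFuture<String> slowSavepoint(long delayMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return "s3://example-bucket/savepoint-placeholder";
        });
    }

    // Waits at most timeoutMillis for the result; returns true if the wait
    // timed out. Timing out does not cancel the underlying work.
    public static boolean timesOut(CompletableFuture<String> future, long timeoutMillis)
            throws Exception {
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> future = slowSavepoint(500);

        // The "client" stops waiting after 50 ms, long before the work is done.
        System.out.println("client timed out: " + timesOut(future, 50));

        // The work itself keeps running and still completes later.
        System.out.println("eventual result: " + future.get());
    }
}
```

If this is indeed what happened, raising the client-side wait would only make the command report success; it would not make the savepoint itself any faster. In Flink versions that offer the `client.timeout` configuration option, that is probably the setting to look at, though whether it governs this particular wait in your version is an assumption worth checking against the documentation.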
Hi Robert, it would be interesting to see the corresponding taskmanager/jobmanager logs. That would help in finding out why the savepoint creation failed. Just to verify: The savepoint data wasn't written to S3 even after the timeout happened, was it?
Best, Matthias
I triggered a savepoint from a currently running job. Although the directory structure gets created in the MINIO S3 store, the command ultimately fails without writing the data.
root@flink-client:/opt/flink# ./bin/flink list --target kubernetes-session -Dkubernetes.cluster-id=flink-jobmanager -Dkubernetes.namespace=cmdaa
2021-05-27 17:37:00,409 INFO org.apache.flink.kubernetes.KubernetesClusterDescriptor [] - Retrieve flink cluster flink-jobmanager successfully, JobManager Web Interface: http://flink-jobmanager-rest.cmdaa:8081
Waiting for response...
------------------ Running/Restarting Jobs -------------------
27.05.2021 16:50:00 : 72f614340dc1a7416d0613362d1ef83b : Streaming Log Count (RUNNING)
--------------------------------------------------------------
No scheduled jobs.
root@flink-client:/opt/flink# ./bin/flink savepoint 72f614340dc1a7416d0613362d1ef83b --target kubernetes-session -Dkubernetes.cluster-id=flink-jobmanager -Dkubernetes.namespace=cmdaa
2021-05-27 17:37:58,776 INFO org.apache.flink.kubernetes.KubernetesClusterDescriptor [] - Retrieve flink cluster flink-jobmanager successfully, JobManager Web Interface: http://flink-jobmanager-rest.cmdaa:8081
Triggering savepoint for job 72f614340dc1a7416d0613362d1ef83b.
Waiting for response...
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.util.FlinkException: Triggering a savepoint for the job 72f614340dc1a7416d0613362d1ef83b failed.
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:777)
at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:754)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:751)
at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1072)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:771)
... 7 more
root@flink-client:/opt/flink#
-- Robert Cullen 240-475-4490