Triggering Savepoint fails to write data to S3 store

Robert Cullen

I triggered a savepoint for a currently running job. Although the directory structure gets created in the MinIO S3 store, the command ultimately fails without writing the data.

root@flink-client:/opt/flink# ./bin/flink list --target kubernetes-session -Dkubernetes.cluster-id=flink-jobmanager -Dkubernetes.namespace=cmdaa
2021-05-27 17:37:00,409 INFO  org.apache.flink.kubernetes.KubernetesClusterDescriptor      [] - Retrieve flink cluster flink-jobmanager successfully, JobManager Web Interface: http://flink-jobmanager-rest.cmdaa:8081
Waiting for response...
------------------ Running/Restarting Jobs -------------------
27.05.2021 16:50:00 : 72f614340dc1a7416d0613362d1ef83b : Streaming Log Count (RUNNING)
--------------------------------------------------------------
No scheduled jobs.
root@flink-client:/opt/flink# ./bin/flink savepoint 72f614340dc1a7416d0613362d1ef83b --target kubernetes-session -Dkubernetes.cluster-id=flink-jobmanager -Dkubernetes.namespace=cmdaa
2021-05-27 17:37:58,776 INFO  org.apache.flink.kubernetes.KubernetesClusterDescriptor      [] - Retrieve flink cluster flink-jobmanager successfully, JobManager Web Interface: http://flink-jobmanager-rest.cmdaa:8081
Triggering savepoint for job 72f614340dc1a7416d0613362d1ef83b.
Waiting for response...

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.util.FlinkException: Triggering a savepoint for the job 72f614340dc1a7416d0613362d1ef83b failed.
        at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:777)
        at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:754)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
        at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:751)
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1072)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:771)
        ... 7 more
root@flink-client:/opt/flink#
--
Robert Cullen
240-475-4490

Re: Triggering Savepoint fails to write data to S3 store

Matthias
Hi Robert,
It would help to see the corresponding TaskManager/JobManager logs to find out why the savepoint creation failed. Just to verify: the savepoint data wasn't written to S3 even after the timeout happened, was it?

Best,
Matthias

On Thu, May 27, 2021 at 7:50 PM Robert Cullen <[hidden email]> wrote:

I triggered a savepoint for a currently running job. Although the directory structure gets created in the MinIO S3 store, the command ultimately fails without writing the data.

Re: Triggering Savepoint fails to write data to S3 store

Robert Cullen
Hi Matthias, you are correct. After a few minutes I took another look at my savepoint folder and the data was there. Would increasing the timeout resolve the problem?

On Fri, May 28, 2021 at 8:21 AM Matthias Pohl <[hidden email]> wrote:
Hi Robert,
It would help to see the corresponding TaskManager/JobManager logs to find out why the savepoint creation failed. Just to verify: the savepoint data wasn't written to S3 even after the timeout happened, was it?

Re: Triggering Savepoint fails to write data to S3 store

Matthias
Yes, that would work. But it might still be interesting to understand why you ran into the timeout. Was it a large state that took longer than expected, or a network issue? That is mainly for you to understand the underlying issue better. In any case, I'm glad the savepoint creation was successful in the end.

Best,
Matthias
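
For readers who want to try the timeout increase discussed above, here is a minimal sketch. It assumes a Flink release in which the time the CLI waits for a response is governed by the client.timeout option (default 60 s; older releases used akka.client.timeout instead), and it assumes the client reads its configuration from /opt/flink/conf/flink-conf.yaml. The value 600 s is purely illustrative and does not come from the thread.

```yaml
# /opt/flink/conf/flink-conf.yaml on the client (path is an assumption)
# How long the CLI waits for a response such as a savepoint acknowledgement.
# 600 s is an example value; choose one that fits your state size and storage.
client.timeout: 600 s
```

This only changes how long the CLI waits. As the thread shows, the savepoint itself can still complete on the cluster after the client has given up, so it is worth checking the savepoint directory before retrying.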

On Fri, May 28, 2021 at 2:35 PM Robert Cullen <[hidden email]> wrote:
Hi Matthias, you are correct. After a few minutes I took another look at my savepoint folder and the data was there. Would increasing the timeout resolve the problem?
