(DEPRECATED) Apache Flink User Mailing List archive.

Flink 1.11 job stop with save point timeout error

Classic

List

Threaded

4 messages Options

Ivan Yang

Flink 1.11 job stop with save point timeout error

Hello everyone,

We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange behavior when we stop a job to a save point got following time out error.

I checked Flink web console, the save point is created in s3 in 1 second.The job is fairly simple, so 1 second for savepoint generation is expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds when it returns this error. So afterwards, the job is hanging (it still says running, but actually not doing anything). I need run another command to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works” in 1.19.1 for us before

flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop 88d9b46f59d131428e2a18c9c7b3aa3f

Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.

------------------------------------------------------------

 The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "88d9b46f59d131428e2a18c9c7b3aa3f".

	at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)

	at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)

	at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)

	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)

	at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)

	at java.security.AccessController.doPrivileged(Native Method)

	at javax.security.auth.Subject.doAs(Subject.java:422)

	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)

	at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)

	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)

Caused by: java.util.concurrent.TimeoutException

	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)

	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)

	at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)

	... 9 more

flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ 

Thanks in advance,

Ivan

rmetzger0

Re: Flink 1.11 job stop with save point timeout error

Hi Ivan,

thanks a lot for your message. Can you post the JobManager log here as well? It might contain additional information on the reason for the timeout.

On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <[hidden email]> wrote:

Hello everyone,

We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange behavior when we stop a job to a save point got following time out error.
I checked Flink web console, the save point is created in s3 in 1 second.The job is fairly simple, so 1 second for savepoint generation is expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds when it returns this error. So afterwards, the job is hanging (it still says running, but actually not doing anything). I need run another command to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works” in 1.19.1 for us before

flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop 88d9b46f59d131428e2a18c9c7b3aa3f
Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.

------------------------------------------------------------
The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "88d9b46f59d131428e2a18c9c7b3aa3f".
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
... 9 more
flink@flink-jobmanager-fcf5d84c5-sz4wk:~$

Thanks in advance,
Ivan

Ivan Yang

Re: Flink 1.11 job stop with save point timeout error

Hi Robert,

Below is the job manager log after issuing the “flink stop” command

====================

2020-07-24 19:24:12,388 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 1 (type=CHECKPOINT) @ 1595618652138 for job 853c59916ac33dfbf46503b33289929e.

2020-07-24 19:24:13,914 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 1 for job 853c59916ac33dfbf46503b33289929e (7146 bytes in 1774 ms).

2020-07-24 19:27:59,299 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Triggering stop-with-savepoint for job 853c59916ac33dfbf46503b33289929e.

2020-07-24 19:27:59,655 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 2 (type=SYNC_SAVEPOINT) @ 1595618879302 for job 853c59916ac33dfbf46503b33289929e.

2020-07-24 19:28:00,962 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 2 for job 853c59916ac33dfbf46503b33289929e (7147 bytes in 1240 ms).

======================

It looks normal to me. 

In the kubernetes deployment cluster, we set up a metric reporter, it has these keys in the flink-config.yaml

# Metrics Reporter Configuration
metrics.reporters: wavefront
metrics.reporter.wavefront.interval: 60 SECONDS
metrics.reporter.wavefront.env: prod
metrics.reporter.wavefront.class: com.xxxxxxxxx.flink.monitor.WavefrontReporter
metrics.reporter.wavefront.host: xxxxxx
metrics.reporter.wavefront.token: xxxxxxxxxx
metrics.scope.tm: flink.taskmanager

Could this reporter interval interfere the job manager? I test the same job in a standalone 
Flink 1.11.0 without the reporter, Flink stop worked, and no hanging nor timeout. Also the same reporter is used in 1.9.1 version where we didn’t have issue on “flink stop”.

Thanks 
Ivan

On Jul 24, 2020, at 5:15 AM, Robert Metzger <[hidden email]> wrote:

Hi Ivan,
thanks a lot for your message. Can you post the JobManager log here as well? It might contain additional information on the reason for the timeout.

On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <[hidden email]> wrote:
Hello everyone,

We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange behavior when we stop a job to a save point got following time out error.
I checked Flink web console, the save point is created in s3 in 1 second.The job is fairly simple, so 1 second for savepoint generation is expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds when it returns this error. So afterwards, the job is hanging (it still says running, but actually not doing anything). I need run another command to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works” in 1.19.1 for us before

flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop 88d9b46f59d131428e2a18c9c7b3aa3f
Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.

------------------------------------------------------------
The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "88d9b46f59d131428e2a18c9c7b3aa3f".
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
... 9 more
flink@flink-jobmanager-fcf5d84c5-sz4wk:~$

Thanks in advance,
Ivan

Congxian Qiu

Re: Flink 1.11 job stop with save point timeout error

Hi Ivan

From the JM log, the savepoint complete with 1 second, and the timeout exception said that the stop-with-savepoint can not be completed in 60s(this was calculated by 20 -- RestOptions#RETRAY_MAX_ATTEMPTS * 3s -- RestOptions#RETRY_DELAY. you can check the logic here[1]). I'm not sure what the root cause is currently, could you please share the complete job JM log. thanks.

[1] https://github.com/apache/flink/blob/abd58adb7aad54d242b67219498c211e9e18168b/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L382

Best,

Congxian

Ivan Yang <[hidden email]> 于2020年7月25日周六上午3:48写道：

Hi Robert,
Below is the job manager log after issuing the “flink stop” command

====================
2020-07-24 19:24:12,388 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 1 (type=CHECKPOINT) @ 1595618652138 for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:24:13,914 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1 for job 853c59916ac33dfbf46503b33289929e (7146 bytes in 1774 ms).
2020-07-24 19:27:59,299 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering stop-with-savepoint for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:27:59,655 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 2 (type=SYNC_SAVEPOINT) @ 1595618879302 for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:28:00,962 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 2 for job 853c59916ac33dfbf46503b33289929e (7147 bytes in 1240 ms).
======================

It looks normal to me.

In the kubernetes deployment cluster, we set up a metric reporter, it has these keys in the flink-config.yaml

# Metrics Reporter Configuration
metrics.reporters: wavefront
metrics.reporter.wavefront.interval: 60 SECONDS
metrics.reporter.wavefront.env: prod
metrics.reporter.wavefront.class: com.xxxxxxxxx.flink.monitor.WavefrontReporter
metrics.reporter.wavefront.host: xxxxxx
metrics.reporter.wavefront.token: xxxxxxxxxx
metrics.scope.tm: flink.taskmanager

Could this reporter interval interfere the job manager? I test the same job in a standalone
Flink 1.11.0 without the reporter, Flink stop worked, and no hanging nor timeout. Also the same reporter is used in 1.9.1 version where we didn’t have issue on “flink stop”.

Thanks
Ivan

On Jul 24, 2020, at 5:15 AM, Robert Metzger <[hidden email]> wrote:

Hi Ivan,
thanks a lot for your message. Can you post the JobManager log here as well? It might contain additional information on the reason for the timeout.

On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <[hidden email]> wrote:
Hello everyone,

We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange behavior when we stop a job to a save point got following time out error.
I checked Flink web console, the save point is created in s3 in 1 second.The job is fairly simple, so 1 second for savepoint generation is expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds when it returns this error. So afterwards, the job is hanging (it still says running, but actually not doing anything). I need run another command to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works” in 1.19.1 for us before

flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop 88d9b46f59d131428e2a18c9c7b3aa3f
Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.

------------------------------------------------------------
The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "88d9b46f59d131428e2a18c9c7b3aa3f".
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
... 9 more
flink@flink-jobmanager-fcf5d84c5-sz4wk:~$

Thanks in advance,
Ivan