Flink 1.11 job stop with save point timeout error

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink 1.11 job stop with save point timeout error

Ivan Yang
Hello everyone,

We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange behavior when we stop a job to a save point got following time out error.
I checked Flink web console, the save point is created in s3 in 1 second.The job is fairly simple, so 1 second for savepoint generation is expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds when it returns this error. So afterwards, the job is hanging (it still says running, but actually not doing anything). I need run another command to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works” in 1.19.1 for us before



flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop 88d9b46f59d131428e2a18c9c7b3aa3f
Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "88d9b46f59d131428e2a18c9c7b3aa3f".
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
... 9 more
flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ 


Thanks in advance,
Ivan
Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.11 job stop with save point timeout error

rmetzger0
Hi Ivan,
thanks a lot for your message. Can you post the JobManager log here as well? It might contain additional information on the reason for the timeout.

On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <[hidden email]> wrote:
Hello everyone,

We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange behavior when we stop a job to a save point got following time out error.
I checked Flink web console, the save point is created in s3 in 1 second.The job is fairly simple, so 1 second for savepoint generation is expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds when it returns this error. So afterwards, the job is hanging (it still says running, but actually not doing anything). I need run another command to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works” in 1.19.1 for us before



flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop 88d9b46f59d131428e2a18c9c7b3aa3f
Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "88d9b46f59d131428e2a18c9c7b3aa3f".
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
... 9 more
flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ 


Thanks in advance,
Ivan
Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.11 job stop with save point timeout error

Ivan Yang
Hi Robert,
Below is the job manager log after issuing the “flink stop” command

====================
2020-07-24 19:24:12,388 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 1 (type=CHECKPOINT) @ 1595618652138 for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:24:13,914 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 1 for job 853c59916ac33dfbf46503b33289929e (7146 bytes in 1774 ms).
2020-07-24 19:27:59,299 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Triggering stop-with-savepoint for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:27:59,655 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 2 (type=SYNC_SAVEPOINT) @ 1595618879302 for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:28:00,962 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 2 for job 853c59916ac33dfbf46503b33289929e (7147 bytes in 1240 ms).
======================

It looks normal to me. 

In the kubernetes deployment cluster, we set up a metric reporter, it has these keys in the flink-config.yaml

# Metrics Reporter Configuration
metrics.reporters: wavefront
metrics.reporter.wavefront.interval: 60 SECONDS
metrics.reporter.wavefront.env: prod
metrics.reporter.wavefront.class: com.xxxxxxxxx.flink.monitor.WavefrontReporter
metrics.reporter.wavefront.host: xxxxxx
metrics.reporter.wavefront.token: xxxxxxxxxx
metrics.scope.tm: flink.taskmanager

Could this reporter interval interfere the job manager? I test the same job in a standalone 
Flink 1.11.0 without the reporter, Flink stop worked, and no hanging nor timeout. Also the same reporter is used in 1.9.1 version where we didn’t have issue on “flink stop”.

Thanks 
Ivan


On Jul 24, 2020, at 5:15 AM, Robert Metzger <[hidden email]> wrote:

Hi Ivan,
thanks a lot for your message. Can you post the JobManager log here as well? It might contain additional information on the reason for the timeout.

On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <[hidden email]> wrote:
Hello everyone,

We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange behavior when we stop a job to a save point got following time out error.
I checked Flink web console, the save point is created in s3 in 1 second.The job is fairly simple, so 1 second for savepoint generation is expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds when it returns this error. So afterwards, the job is hanging (it still says running, but actually not doing anything). I need run another command to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works” in 1.19.1 for us before



flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop 88d9b46f59d131428e2a18c9c7b3aa3f
Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "88d9b46f59d131428e2a18c9c7b3aa3f".
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
... 9 more
flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ 


Thanks in advance,
Ivan

Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.11 job stop with save point timeout error

Congxian Qiu
Hi Ivan
   From the JM log, the savepoint complete with 1 second, and the timeout exception said that the stop-with-savepoint can not be completed in 60s(this was calculated by 20 -- RestOptions#RETRAY_MAX_ATTEMPTS * 3s -- RestOptions#RETRY_DELAY. you can check the logic here[1]). I'm not sure what the root cause is currently, could you please share the complete job JM log. thanks.


Ivan Yang <[hidden email]> 于2020年7月25日周六 上午3:48写道:
Hi Robert,
Below is the job manager log after issuing the “flink stop” command

====================
2020-07-24 19:24:12,388 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 1 (type=CHECKPOINT) @ 1595618652138 for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:24:13,914 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 1 for job 853c59916ac33dfbf46503b33289929e (7146 bytes in 1774 ms).
2020-07-24 19:27:59,299 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Triggering stop-with-savepoint for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:27:59,655 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 2 (type=SYNC_SAVEPOINT) @ 1595618879302 for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:28:00,962 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 2 for job 853c59916ac33dfbf46503b33289929e (7147 bytes in 1240 ms).
======================

It looks normal to me. 

In the kubernetes deployment cluster, we set up a metric reporter, it has these keys in the flink-config.yaml

# Metrics Reporter Configuration
metrics.reporters: wavefront
metrics.reporter.wavefront.interval: 60 SECONDS
metrics.reporter.wavefront.env: prod
metrics.reporter.wavefront.class: com.xxxxxxxxx.flink.monitor.WavefrontReporter
metrics.reporter.wavefront.host: xxxxxx
metrics.reporter.wavefront.token: xxxxxxxxxx
metrics.scope.tm: flink.taskmanager

Could this reporter interval interfere the job manager? I test the same job in a standalone 
Flink 1.11.0 without the reporter, Flink stop worked, and no hanging nor timeout. Also the same reporter is used in 1.9.1 version where we didn’t have issue on “flink stop”.

Thanks 
Ivan


On Jul 24, 2020, at 5:15 AM, Robert Metzger <[hidden email]> wrote:

Hi Ivan,
thanks a lot for your message. Can you post the JobManager log here as well? It might contain additional information on the reason for the timeout.

On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <[hidden email]> wrote:
Hello everyone,

We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange behavior when we stop a job to a save point got following time out error.
I checked Flink web console, the save point is created in s3 in 1 second.The job is fairly simple, so 1 second for savepoint generation is expected. We use kubernetes deployment. I clocked it, it’s about 60 seconds when it returns this error. So afterwards, the job is hanging (it still says running, but actually not doing anything). I need run another command to cancel it. Anyone has idea what’s going on here? BTW, “flink stop works” in 1.19.1 for us before



flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop 88d9b46f59d131428e2a18c9c7b3aa3f
Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "88d9b46f59d131428e2a18c9c7b3aa3f".
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
... 9 more
flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ 


Thanks in advance,
Ivan