Hello everyone,
We recently upgraded Flink from 1.9.1 to 1.11.0, and found one strange behavior: when we stop a job with a savepoint, we get the timeout error below. I checked the Flink web console, and the savepoint is created in S3 within 1 second. The job is fairly simple, so 1 second for savepoint generation is expected. We use a Kubernetes deployment. I clocked it: the command returns this error after about 60 seconds. Afterwards, the job is hanging (it still says RUNNING, but is actually not doing anything), and I need to run another command to cancel it. Does anyone have an idea what's going on here? BTW, "flink stop" worked for us in 1.9.1 before.

flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop 88d9b46f59d131428e2a18c9c7b3aa3f
Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "88d9b46f59d131428e2a18c9c7b3aa3f".
        at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
        at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
        at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
        at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
        ... 9 more
flink@flink-jobmanager-fcf5d84c5-sz4wk:~$

Thanks in advance,
Ivan
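For reference, the stop/cancel workflow described above can be sketched roughly as the following CLI session. This is a minimal sketch, assuming a Flink 1.11 client; the job ID and the S3 savepoint target are placeholders, not values from a real cluster:

```shell
# List running jobs to find the job ID
flink list

# Stop with a savepoint; -p/--savepointPath sets an explicit target
# directory (otherwise state.savepoints.dir from the config is used)
flink stop -p s3://my-bucket/savepoints <jobId>

# If the job still shows as RUNNING after the client times out,
# a plain cancel tears it down (without taking a savepoint)
flink cancel <jobId>
```

Note that the client timing out does not necessarily mean the savepoint failed on the cluster side; as this thread shows, the savepoint can complete while the client still reports a TimeoutException.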
Hi Ivan,

Thanks a lot for your message. Can you post the JobManager log here as well? It might contain additional information on the reason for the timeout.

On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <[hidden email]> wrote:
Hi Robert,
Below is the job manager log after issuing the "flink stop" command:

====================
2020-07-24 19:24:12,388 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 1 (type=CHECKPOINT) @ 1595618652138 for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:24:13,914 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1 for job 853c59916ac33dfbf46503b33289929e (7146 bytes in 1774 ms).
2020-07-24 19:27:59,299 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering stop-with-savepoint for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:27:59,655 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 2 (type=SYNC_SAVEPOINT) @ 1595618879302 for job 853c59916ac33dfbf46503b33289929e.
2020-07-24 19:28:00,962 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 2 for job 853c59916ac33dfbf46503b33289929e (7147 bytes in 1240 ms).
======================

It looks normal to me. In the Kubernetes deployment cluster, we set up a metrics reporter; it has these keys in flink-conf.yaml:

# Metrics Reporter Configuration
metrics.reporters: wavefront
metrics.reporter.wavefront.interval: 60 SECONDS
metrics.reporter.wavefront.env: prod
metrics.reporter.wavefront.class: com.xxxxxxxxx.flink.monitor.WavefrontReporter
metrics.reporter.wavefront.host: xxxxxx
metrics.reporter.wavefront.token: xxxxxxxxxx
metrics.scope.tm: flink.taskmanager

Could this reporter interval interfere with the job manager? I tested the same job on a standalone Flink 1.11.0 without the reporter: "flink stop" worked, with no hanging and no timeout. The same reporter was also used in 1.9.1, where we didn't have this issue with "flink stop".

Thanks,
Ivan
Hi Ivan,

From the JM log, the savepoint completed within 1 second, while the timeout exception says that the stop-with-savepoint could not be completed within 60s (this is calculated as 20 -- RestOptions#RETRY_MAX_ATTEMPTS -- times 3s -- RestOptions#RETRY_DELAY; you can check the logic here[1]). I'm not sure what the root cause is currently; could you please share the complete JM log? Thanks.

Ivan Yang <[hidden email]> wrote on Sat, Jul 25, 2020 at 3:48 AM:
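If the client-side retry budget really is the 60s limit being hit, it can presumably be raised in flink-conf.yaml via the REST client options behind the RestOptions constants mentioned above. A sketch, with illustrative values rather than recommendations:

```
# flink-conf.yaml -- client-side REST retry settings (Flink 1.11)
# defaults: 20 attempts x 3000 ms delay = roughly the 60 s seen above
rest.retry.max-attempts: 40
rest.retry.delay: 3000
```

This only changes how long the CLI keeps polling; it would not explain why the stop-with-savepoint operation never reports completion in the first place.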