"Futures timed out" when trying to cancel a job with savepoint


"Futures timed out" when trying to cancel a job with savepoint

Julio Biason
Hey guys,

We just built a brand new Flink 1.4.0 cluster with HA and everything seems to be working fine, but we are getting some errors with savepoints.

For example, I have a running job

------------------ Running/Restarting Jobs -------------------
25.07.2018 11:55:18 : e5280bad25a7f19122f98483f94aba26 : Mr Banks (RUNNING)
--------------------------------------------------------------

If I try to create a savepoint with

flink savepoint e5280bad25a7f19122f98483f94aba26

The command just stays there and never returns (I waited about 10 minutes, with no response). Then I tried to cancel with savepoint:

flink cancel e5280bad25a7f19122f98483f94aba26 -s

And I got a

java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]

I checked the JobManager logs, but I can't see any problems. I also checked the Hadoop logs for errors (thinking the problem might be in the underlying system), but it seems it did create the nodes properly; at least, there are no errors there either.

Is there anything else I should check?

PS: My state is not that big (my napkin calculations say it's less than 1 GB), so the state taking too long to be saved doesn't seem to be the problem.

--
Julio Biason, Software Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554

Re: "Futures timed out" when trying to cancel a job with savepoint

vino yang
Hi Julio,

We also encountered this problem on YARN: the savepoint had completed and the JobManager (JM) had been stopped successfully, but the client was still trying to reach the original JM port, which caused the timeout. This looks like a problem in Flink itself. I can't give you a root cause, but we worked around the timeout by splitting the operation into two steps:
1) Trigger the savepoint first;
2) Then execute the cancel command.
I hope this is useful as a reference. You can also create an issue on JIRA. Please remember to include detailed logs, exceptions, and version information to help with analyzing the problem.
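
The two steps above could be scripted roughly as follows. This is only a sketch: the function name `savepoint_then_cancel` and the `FLINK_BIN` variable are made up for illustration, and the only commands taken from this thread are `flink savepoint <jobId>` and `flink cancel <jobId>`. It assumes the `flink` CLI returns a non-zero exit status when the savepoint fails.

```shell
# Path to the Flink CLI; FLINK_BIN is a hypothetical override, defaulting to "flink" on PATH.
FLINK_BIN="${FLINK_BIN:-flink}"

# Take a savepoint first, and cancel the job only if the savepoint command succeeded.
savepoint_then_cancel() {
    job_id="$1"

    # Step 1: trigger the savepoint and wait for the CLI to return.
    if ! "$FLINK_BIN" savepoint "$job_id"; then
        echo "savepoint failed; job $job_id was not cancelled" >&2
        return 1
    fi

    # Step 2: plain cancel, without the -s flag.
    "$FLINK_BIN" cancel "$job_id"
}
```

Usage would be `savepoint_then_cancel e5280bad25a7f19122f98483f94aba26`. Note that if the savepoint command itself never returns, as in the original report, this function blocks at step 1 and the job keeps running.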

Thanks, vino.

In reply to Julio Biason's message of 2018-07-25 20:08 GMT+08:00.