"Futures timed out" when trying to cancel a job with savepoint

Posted by Julio Biason on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Futures-timed-out-when-trying-to-cancel-a-job-with-savepoint-tp21808.html

Hey guys,

We just built a brand new Flink 1.4.0 cluster with HA and everything seems to be working fine, but we are getting some errors with savepoints.

For example, I have a running job:

------------------ Running/Restarting Jobs -------------------
25.07.2018 11:55:18 : e5280bad25a7f19122f98483f94aba26 : Mr Banks (RUNNING)
--------------------------------------------------------------

If I try to create a savepoint with

flink savepoint e5280bad25a7f19122f98483f94aba26

the command just hangs and never returns (I waited about 10 minutes with no response). Then I tried to cancel with a savepoint:

flink cancel e5280bad25a7f19122f98483f94aba26 -s

And I got this error:

java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]

I checked the jobmanager logs, but I can't see any problems. I also checked the Hadoop logs for errors (thinking the problem might be in the underlying system), but it seems it did create the nodes properly -- at least, there are no errors there either.

Is there anything else I should check?

PS: My state is not that big (my napkin calculations say it's less than 1 GB), so it doesn't seem to be a problem of the state taking too long to save.

--
Julio Biason, Software Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554