Hey guys,

We just built a brand new Flink 1.4.0 cluster with HA, and everything seems to be working fine, but we are getting some errors with savepoints.

For example, I have a running job:

------------------ Running/Restarting Jobs -------------------
25.07.2018 11:55:18 : e5280bad25a7f19122f98483f94aba26 : Mr Banks (RUNNING)
--------------------------------------------------------------

If I try to create a savepoint with

    flink savepoint e5280bad25a7f19122f98483f94aba26

the command just hangs and never returns (I waited about 10 minutes with no response).

Then I tried to cancel with a savepoint:

    flink cancel e5280bad25a7f19122f98483f94aba26 -s

and I got

    java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]

I checked the jobmanager logs, but I can't see any problems there. I also checked the Hadoop logs (thinking the problem might be in the underlying system), but it seems it created the nodes properly; at least, there are no errors there either.

Is there anything else I should check?

PS: My state is not that big (my napkin calculations say it's less than 1 GB), so it doesn't seem to be a case of the state taking too long to save.

--
Julio Biason, Software Engineer
AZION | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
Hi Julio,

We also ran into this problem on YARN: the savepoint had completed and the JM had been stopped successfully, but the client kept trying to reach the original JM port, which caused the timeout. This looks like a problem in Flink itself.

I can't give you a definitive answer, but we worked around the timeout by splitting the operation into two steps:

1) Trigger the savepoint first;
2) Then execute the cancel command.

I hope this is useful as a reference. You can also create an issue on JIRA; please remember to include detailed logs, exceptions, and version information to help with analyzing the problem.

Thanks, vino.

In reply to Julio Biason <[hidden email]>, 2018-07-25 20:08 GMT+08:00.
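The two-step workaround vino describes (savepoint first, then a plain cancel) could be wrapped in a small shell function so that the cancel only runs if the savepoint command returned successfully. This is a minimal sketch, not something from the thread: the function name `savepoint_then_cancel` and the `FLINK_BIN` variable are made up for illustration, and it assumes the `flink savepoint <jobId> [targetDirectory]` and `flink cancel <jobId>` forms of the CLI.

```shell
#!/usr/bin/env bash
# Sketch of the two-step workaround: take the savepoint on its own,
# then issue a plain cancel (no -s) only if the savepoint succeeded.
# savepoint_then_cancel and FLINK_BIN are hypothetical names, not part of Flink.

savepoint_then_cancel() {
  local job_id="$1"
  local target_dir="${2:-}"          # optional savepoint target directory
  local flink="${FLINK_BIN:-flink}"  # path to the flink CLI (assumed to be on PATH)

  if [ -z "$job_id" ]; then
    echo "usage: savepoint_then_cancel <jobId> [targetDirectory]" >&2
    return 2
  fi

  # Step 1: trigger the savepoint; stop here if the command fails.
  if [ -n "$target_dir" ]; then
    "$flink" savepoint "$job_id" "$target_dir" || return 1
  else
    "$flink" savepoint "$job_id" || return 1
  fi

  # Step 2: cancel the job without asking for another savepoint.
  "$flink" cancel "$job_id"
}
```

With the job from the original post, this would be called as `savepoint_then_cancel e5280bad25a7f19122f98483f94aba26`. Note that it does not help if the standalone `flink savepoint` call itself hangs, as reported in the first message; in that case the function simply blocks at step 1 and the job is never cancelled.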