Hi,

I have a question regarding Flink. I'm running Flink in standalone mode on one host (JobManager and TaskManager on the same host). At first, I was able to submit and cancel jobs normally; the jobs showed up in the web UI and ran.

However, after about a month, when I canceled the old job and submitted a new one, I got org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. At that point, I could still run flink list to list the current jobs and flink cancel to cancel a job, but flink run failed: an exception was thrown and the job was not shown in the web UI.

When I tried to stop the standalone cluster using stop-cluster, it said 'no cluster was found', so I had to find the PIDs of the Flink processes and stop them manually. After I ran start-cluster to create a new standalone cluster, I was able to submit jobs normally again.

The shortened stack trace (full stack trace at google docs link):

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 7ef1cbddb744cd5769297f4059f7c531)
    at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:261)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rest.ConnectionClosedException: Channel became inactive.
Caused by: org.apache.flink.runtime.rest.ConnectionClosedException: Channel became inactive.
    ... 37 more

The error is consistent: it always happens after I let Flink run for a while, usually more than one month.

Why am I not able to submit jobs to Flink after a while? What happened here?

Regards,
Son
Hi Son,

From your description, this may be caused by the cleanup policy of the '/tmp' file system, which removes temporary files regularly.

Son Mai <[hidden email]> wrote on Wed, Feb 27, 2019 at 10:27 AM:
Benchao Li School of Electronics Engineering and Computer Science, Peking University Tel:+86-15650713730 Email: [hidden email]; [hidden email] |
It seems that something is wrong with the RestServer and the RestClient could not connect to it. You can check the standalonesession log to investigate the cause.

By the way, the cause of "no cluster was found" is that your PID information was cleaned up for some reason. The PID information is stored in your temp directory; it should look like /tmp/flink-user-taskexecutor.pid or /tmp/flink-user-standalonesession.pid.

On Wed, Feb 27, 2019 at 10:27 AM Son Mai <[hidden email]> wrote:
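
The PID-file explanation in this reply can be checked with a short script. The following is only a minimal sketch, not part of Flink: the function names (`read_pid`, `pid_is_alive`, `flink_pid_status`) are hypothetical, the `flink-*.pid` pattern is inferred from the two example paths in the reply, and reading the last line of each file as the PID is an assumption about the file layout. It runs on POSIX systems only.

```python
import glob
import os


def read_pid(path):
    """Return the last PID recorded in a PID file, or None if unreadable.

    Assumption: the file holds one PID per line and the last line is the
    most recently started process.
    """
    try:
        with open(path) as handle:
            lines = [line.strip() for line in handle if line.strip()]
        return int(lines[-1]) if lines else None
    except (OSError, ValueError):
        return None


def pid_is_alive(pid):
    """Report whether a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def flink_pid_status(pid_dir="/tmp"):
    """Map each flink-*.pid file name in pid_dir to a (pid, alive) tuple."""
    status = {}
    for path in sorted(glob.glob(os.path.join(pid_dir, "flink-*.pid"))):
        pid = read_pid(path)
        alive = pid is not None and pid_is_alive(pid)
        status[os.path.basename(path)] = (pid, alive)
    return status


if __name__ == "__main__":
    found = flink_pid_status()
    if not found:
        print("No flink-*.pid files found in /tmp.")
    for name, (pid, alive) in found.items():
        print(f"{name}: pid={pid} alive={alive}")
```

If the script finds no PID files while the Flink processes are still running, that would be consistent with the files having been cleaned out of /tmp. In that case it may be worth checking the `env.pid.dir` option in the Flink configuration documentation for your version, which sets the directory the PID files are written to, so they can be kept outside /tmp.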