ProgramInvocationException when I submit job by 'flink run' after running Flink stand-alone more than 1 month?

son
Hi,
I have a question about Flink.
I'm running Flink in stand-alone mode on a single host (JobManager and TaskManager on the same machine). At first I was able to submit and cancel jobs normally; the jobs showed up in the web UI and ran.
However, after about one month, when I canceled the old job and submitted a new one, I got org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
At that point I could still run flink list to list the current jobs and flink cancel to cancel a job, but flink run failed: the exception was thrown and the job was not shown in the web UI.
When I tried to stop the stand-alone cluster with stop-cluster, it said 'no cluster was found'. I had to find the pids of the Flink processes and stop them manually. After that, running start-cluster to create a new stand-alone cluster let me submit jobs normally again.
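The manual cleanup looked roughly like this (a hypothetical sketch, not the exact commands from the post; a hard-coded sample stands in for real `ps -ef` output so the snippet is self-contained, and the entrypoint class names are the ones recent Flink versions use):

```shell
# When stop-cluster.sh reports "no cluster was found", the Flink JVMs are
# usually still alive and can be located in the process list by their
# entrypoint class names.
proclist='12345 org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint
12346 org.apache.flink.runtime.taskexecutor.TaskManagerRunner'
# Equivalent of: ps -ef | grep -E 'StandaloneSession|TaskManagerRunner'
pids=$(printf '%s\n' "$proclist" | grep -E 'StandaloneSession|TaskManagerRunner' | awk '{print $1}')
printf '%s\n' "$pids"   # each reported pid can then be stopped with: kill <pid>
```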
The shortened stack trace (full stack trace at the google docs link):
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 7ef1cbddb744cd5769297f4059f7c531)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob (RestClusterClient.java:261)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rest.ConnectionClosedException: Channel became inactive.
Caused by: org.apache.flink.runtime.rest.ConnectionClosedException: Channel became inactive.
... 37 more
The error is consistent: it always happens after Flink has been running for a while, usually more than one month. Why am I no longer able to submit jobs to Flink after a while? What is happening here?
Regards,

Son


Re: ProgramInvocationException when I submit job by 'flink run' after running Flink stand-alone more than 1 month?

Benchao Li
Hi Son, 

According to your description, this may be caused by the '/tmp' retention policy, which removes temporary files on a regular schedule.
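For example, a retention policy that purges files older than 30 days would silently remove Flink's pid files. A hypothetical sketch that simulates such a cleaner with `find`, run against a scratch directory:

```shell
# Demo of what an age-based /tmp cleaner (tmpwatch, systemd-tmpfiles, a cron
# job, ...) effectively does: delete files not touched for N days.
tmpdir=$(mktemp -d)
pidfile="$tmpdir/flink-demo-standalonesession.pid"
echo 12345 > "$pidfile"
# Back-date the pid file so it looks 40 days old (GNU touch; BSD fallback).
touch -d '40 days ago' "$pidfile" 2>/dev/null || touch -t 201901150000 "$pidfile"
# A cleaner purging files older than 30 days removes the pid file, which is
# exactly what later breaks stop-cluster.sh:
find "$tmpdir" -type f -mtime +30 -delete
```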

Son Mai <[hidden email]> wrote on Wed, Feb 27, 2019 at 10:27 AM:



--
Benchao Li
School of Electronics Engineering and Computer Science, Peking University
Tel:+86-15650713730
Email: [hidden email]; [hidden email]

Re: ProgramInvocationException when I submit job by 'flink run' after running Flink stand-alone more than 1 month?

Zhenghua Gao
In reply to this post by son
It seems something is wrong with the RestServer and the RestClient could not connect to it.
You can check the standalonesession log to investigate the cause.

By the way, the cause of "no cluster was found" is that your pid information was cleaned up for some reason.
The pid information is stored in your TMP directory; the files look like /tmp/flink-user-taskexecutor.pid or /tmp/flink-user-standalonesession.pid.
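If that is the cause, one workaround is to move these directories out of /tmp in conf/flink-conf.yaml. A sketch assuming Flink's documented env.pid.dir and io.tmp.dirs options; the paths are examples, adjust them to your setup:

```yaml
# Hypothetical flink-conf.yaml fragment: keep pid files and temporary data
# in directories that the system's /tmp cleaner does not purge.
env.pid.dir: /var/run/flink      # flink-<user>-<service>.pid files
io.tmp.dirs: /var/lib/flink/tmp  # TaskManager temporary files
```

The directories must exist and be writable by the user running Flink, and the cluster must be restarted for the change to take effect.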

On Wed, Feb 27, 2019 at 10:27 AM Son Mai <[hidden email]> wrote: