ProgramInvocationException when I submit job by 'flink run' after running Flink stand-alone more than 1 month?

son
Hi,
I have a question about Flink.
I'm running Flink in stand-alone mode on a single host (JobManager and TaskManager on the same machine). At first I was able to submit and cancel jobs normally; the jobs showed up in the web UI and ran.
However, after about one month, when I canceled the old job and submitted a new one, I got org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
At that point I could still run flink list to list the current jobs and flink cancel to cancel a job, but flink run failed: the exception was thrown and the job was not shown in the web UI.
When I tried to stop the stand-alone cluster with stop-cluster, it said 'no cluster was found'. I had to find the pids of the Flink processes and stop them manually. After that, running start-cluster to create a new stand-alone cluster let me submit jobs normally again.
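The manual cleanup looked roughly like this (a hypothetical sketch, not the exact commands from the post; a hard-coded sample stands in for real `ps -ef` output so the snippet is self-contained, and the entrypoint class names are the ones recent Flink versions use):

```shell
# When stop-cluster.sh reports "no cluster was found", the Flink JVMs are
# usually still alive and can be located in the process list by their
# entrypoint class names.
proclist='12345 org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint
12346 org.apache.flink.runtime.taskexecutor.TaskManagerRunner'
# Equivalent of: ps -ef | grep -E 'StandaloneSession|TaskManagerRunner'
pids=$(printf '%s\n' "$proclist" | grep -E 'StandaloneSession|TaskManagerRunner' | awk '{print $1}')
printf '%s\n' "$pids"   # each reported pid can then be stopped with: kill <pid>
```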
The shortened stack trace (full stack trace at the google docs link):
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 7ef1cbddb744cd5769297f4059f7c531)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob (RestClusterClient.java:261)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rest.ConnectionClosedException: Channel became inactive.
Caused by: org.apache.flink.runtime.rest.ConnectionClosedException: Channel became inactive.
... 37 more
The error is consistent: it always happens after Flink has been running for a while, usually more than one month. Why am I no longer able to submit jobs to Flink after a while? What is happening here?
Regards,

Son


Re: ProgramInvocationException when I submit job by 'flink run' after running Flink stand-alone more than 1 month?

Benchao Li
Hi Son, 

According to your description, this may be caused by the '/tmp' retention policy, which removes temporary files on a regular schedule.
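For example, a retention policy that purges files older than 30 days would silently remove Flink's pid files. A hypothetical sketch that simulates such a cleaner with `find`, run against a scratch directory:

```shell
# Demo of what an age-based /tmp cleaner (tmpwatch, systemd-tmpfiles, a cron
# job, ...) effectively does: delete files not touched for N days.
tmpdir=$(mktemp -d)
pidfile="$tmpdir/flink-demo-standalonesession.pid"
echo 12345 > "$pidfile"
# Back-date the pid file so it looks 40 days old (GNU touch; BSD fallback).
touch -d '40 days ago' "$pidfile" 2>/dev/null || touch -t 201901150000 "$pidfile"
# A cleaner purging files older than 30 days removes the pid file, which is
# exactly what later breaks stop-cluster.sh:
find "$tmpdir" -type f -mtime +30 -delete
```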

Son Mai <[hidden email]> wrote on Wed, Feb 27, 2019 at 10:27 AM:



--
Benchao Li
School of Electronics Engineering and Computer Science, Peking University
Tel:+86-15650713730
Email: [hidden email]; [hidden email]

Re: ProgramInvocationException when I submit job by 'flink run' after running Flink stand-alone more than 1 month?

Zhenghua Gao
In reply to this post by son
It seems something is wrong with the RestServer and the RestClient could not connect to it.
You can check the standalonesession log to investigate the cause.

By the way, the cause of "no cluster was found" is that your pid information was cleaned up for some reason.
The pid information is stored in your TMP directory; the files look like /tmp/flink-user-taskexecutor.pid or /tmp/flink-user-standalonesession.pid.
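If that is the cause, one workaround is to move these directories out of /tmp in conf/flink-conf.yaml. A sketch assuming Flink's documented env.pid.dir and io.tmp.dirs options; the paths are examples, adjust them to your setup:

```yaml
# Hypothetical flink-conf.yaml fragment: keep pid files and temporary data
# in directories that the system's /tmp cleaner does not purge.
env.pid.dir: /var/run/flink      # flink-<user>-<service>.pid files
io.tmp.dirs: /var/lib/flink/tmp  # TaskManager temporary files
```

The directories must exist and be writable by the user running Flink, and the cluster must be restarted for the change to take effect.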

On Wed, Feb 27, 2019 at 10:27 AM Son Mai <[hidden email]> wrote: