Issue with job status

Vijay Bhaskar
Hi,
I am using Flink 1.9 and am facing the issue below.
Suppose I deploy a job and there are not enough slots: the job is stuck waiting for slots, but Flink reports its status as "RUNNING" even though it is not actually running.
To me this looks like a bug. It affects our production setup, where we use the job status to monitor for duplicate jobs.
Moreover, when we issue the flink stop command to stop such jobs, they do not terminate, because stop is tied to a savepoint, which fails. So the job does not stop at all; it stays stuck forever until we cancel it manually using the CLI. We don't want manual intervention.
If this is a bug, I would like to open a Jira ticket for it.

Regards
Bhaskar

Re: Issue with job status

rmetzger0
Hi Bhaskar,

The definition of when a job is marked as RUNNING in Flink is debatable.
For a streaming job, RUNNING means that all tasks are running; for a batch job, however, the job is already RUNNING once some of its tasks are running.
Since the scheduler does not distinguish between these types of jobs, the current definition of RUNNING is that some tasks are running.

Since you mention that you use job status = RUNNING for monitoring, you could also compare the total number of tasks with the number of running tasks. If all tasks are running, you can consider the overall job to be running (per your definition).
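
A minimal sketch of that check, not a definitive implementation. It assumes the job details have already been fetched as JSON (for example from the REST endpoint /jobs/<jobid>) and that each entry in "vertices" carries a "parallelism" value and a "tasks" map of per-state counts. The function name all_tasks_running and the sample data are made up for illustration.

```python
from typing import Any, Dict


def all_tasks_running(job_details: Dict[str, Any]) -> bool:
    """Return True only if every task of every vertex is in state RUNNING.

    Assumed layout of job_details (verify against your Flink version):
    {"vertices": [{"parallelism": 4, "tasks": {"RUNNING": 4, ...}}, ...]}
    """
    vertices = job_details.get("vertices", [])
    if not vertices:
        # No vertex information: do not treat the job as running.
        return False
    for vertex in vertices:
        counts = vertex.get("tasks", {})
        # Fall back to the sum of all state counts if parallelism is missing.
        total = vertex.get("parallelism", sum(counts.values()))
        if total == 0 or counts.get("RUNNING", 0) != total:
            return False
    return True


# Hypothetical job that is still waiting for slots: 2 of 4 tasks are running.
waiting_for_slots = {
    "vertices": [
        {"parallelism": 4, "tasks": {"RUNNING": 2, "SCHEDULED": 2}},
    ]
}
print(all_tasks_running(waiting_for_slots))  # prints False
```

A monitor could then report the job as running only when both the job status is RUNNING and this function returns True.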

Regarding the problem with "stop": since not all tasks are running, the checkpoint for stopping the job won't go through, which is why the stop fails.
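
To avoid cancelling by hand, one possible workaround is a wrapper that tries "flink stop" first and falls back to "flink cancel" when the stop fails or does not return in time. The following is only a sketch under assumptions: the helper names (run_cli, stop_or_cancel), the 60 second timeout, and the "flink" binary being on the PATH are all illustrative, and a cancel does not take a savepoint.

```python
import subprocess
from typing import Callable, List

# A runner executes a command with a timeout and returns its exit code.
Runner = Callable[[List[str], float], int]


def run_cli(cmd: List[str], timeout_s: float) -> int:
    """Run a CLI command; raises subprocess.TimeoutExpired if it hangs."""
    return subprocess.run(cmd, timeout=timeout_s).returncode


def stop_or_cancel(
    job_id: str,
    runner: Runner = run_cli,
    timeout_s: float = 60.0,   # assumed value, tune for your savepoint sizes
    flink_bin: str = "flink",  # assumes the Flink CLI is on the PATH
) -> str:
    """Try a graceful stop (with savepoint); fall back to cancel."""
    try:
        if runner([flink_bin, "stop", job_id], timeout_s) == 0:
            return "stopped"
    except subprocess.TimeoutExpired:
        pass  # stop did not finish in time; fall through to cancel
    try:
        if runner([flink_bin, "cancel", job_id], timeout_s) == 0:
            return "cancelled"
    except subprocess.TimeoutExpired:
        pass
    return "failed"
```

The runner is passed in as a parameter so the fallback logic can be exercised without a cluster. Because cancelling discards the chance to take a savepoint, this fallback fits best for jobs that never got all of their slots in the first place.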

I understand that these semantics are not optimal for your use case, but I hope that you can work around them based on my response.

Best,
Robert

On Thu, Jun 18, 2020 at 4:23 PM Vijay Bhaskar <[hidden email]> wrote:
> [...]