(DEPRECATED) Apache Flink User Mailing List archive.

No Slots available exception in Apache Flink Job Manager while Scheduling

Josson Paul

No Slots available exception in Apache Flink Job Manager while Scheduling

Set up
------
Flink version 1.8.3

Zookeeper HA cluster

1 ResourceManager/Dispatcher (Same Node)
1 TaskManager
4 pipelines running with various parallelisms

Issue
------

Occasionally, when the Job Manager gets restarted, we notice that not all of the pipelines get scheduled. The error reported by the Job Manager is 'not enough slots are available'. This should not be the case, because the Task Manager was deployed with sufficient slots for the number of pipelines and the parallelism we have.
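The "sufficient slots" claim can be checked with simple arithmetic. The following is a minimal sketch, assuming Flink's default slot sharing, under which a job needs as many slots as its highest operator parallelism. The pipeline names, parallelism values, and slot count are hypothetical and do not come from the cluster described here.

```python
def required_slots(max_parallelism_per_job):
    """Slots needed if each job shares slots across its own operators."""
    return sum(max_parallelism_per_job.values())


def has_enough_slots(max_parallelism_per_job, task_managers, slots_per_task_manager):
    """Compare the total demand against the slots the cluster offers."""
    offered = task_managers * slots_per_task_manager
    return required_slots(max_parallelism_per_job) <= offered


# Hypothetical numbers: 4 pipelines on 1 TaskManager with 8 slots.
pipelines = {"pipeline-a": 4, "pipeline-b": 2, "pipeline-c": 1, "pipeline-d": 1}
print(required_slots(pipelines))
print(has_enough_slots(pipelines, task_managers=1, slots_per_task_manager=8))
```

If this check passes and scheduling still fails, the shortage is not one of capacity, and some slots are being reported as occupied when they should be free.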

We further noticed that the slot report sent by the Task Manager contains slots filled with old CANCELLED job IDs. I am not sure why the Task Manager still holds the details of the old jobs. A thread dump on the Task Manager confirms that the old pipelines are not running.

It is not just one or two slot reports that are wrong. When the issue occurs, all the slot reports sent by the TM are wrong and contain old job IDs. This continues until I restart the TM.
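To illustrate the symptom, here is a small sketch of how stale entries in a slot report could be picked out. The slot report is modelled as a plain dictionary from slot ID to job ID (None for a free slot). This is a simplified, hypothetical model rather than Flink's internal SlotReport class, and the slot and job names are made up.

```python
def find_stale_slots(slot_report, running_job_ids):
    """Return the slots still assigned to jobs that are no longer running."""
    return {
        slot_id: job_id
        for slot_id, job_id in slot_report.items()
        if job_id is not None and job_id not in running_job_ids
    }


def count_free_slots(slot_report):
    """Count the slots the report advertises as free."""
    return sum(1 for job_id in slot_report.values() if job_id is None)


# Hypothetical report: three of four slots still point at a cancelled job.
report = {
    "slot-0": "old-cancelled-job",
    "slot-1": "old-cancelled-job",
    "slot-2": "old-cancelled-job",
    "slot-3": None,
}
print(find_stale_slots(report, running_job_ids=set()))
print(count_free_slots(report))
```

In this model only one slot looks free, so a restarted job that needs more than one slot would fail with a "not enough slots" error even though nothing is actually running.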

I also noticed that when we cancel a job, the leader/leaderlatch entries in ZooKeeper don't get cleared for that job. Is that expected?

/leader/d8beed9c9261dcf191cc7fde46869b64/job_manager_lock
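A rough way to list such leftover entries is sketched below. It assumes the child names under the leader node have already been collected, for example with `ls` in the ZooKeeper CLI, and that job entries are named by a 32-character hexadecimal job ID like the one above. The `/leader` root is taken from the path shown here; the real root depends on the HA configuration. The helper name and inputs are hypothetical.

```python
import re

# Assumption: job entries are 32 lowercase hex characters, as in the path above.
JOB_ID_PATTERN = re.compile(r"^[0-9a-f]{32}$")


def leftover_leader_znodes(leader_children, running_job_ids, root="/leader"):
    """Return lock paths for job entries whose job is no longer running."""
    return [
        f"{root}/{child}/job_manager_lock"
        for child in sorted(leader_children)
        if JOB_ID_PATTERN.match(child) and child not in running_job_ids
    ]


# Hypothetical input: one cancelled job left behind, no jobs running.
children = ["d8beed9c9261dcf191cc7fde46869b64"]
for path in leftover_leader_znodes(children, running_job_ids=set()):
    print(path)
```

Entries that do not look like job IDs are skipped, so any other nodes under the same root are left out of the result.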

I am aware of https://issues.apache.org/jira/browse/FLINK-12865, but that is not the issue happening in this case.

Thanks
Josson

Xintong Song

Re: No Slots available exception in Apache Flink Job Manager while Scheduling

Linking to the jira ticket, for the record.

https://issues.apache.org/jira/browse/FLINK-17560

Thank you~

Xintong Song

In reply to Josson Paul's original message of Sat, May 9, 2020 at 2:14 AM.