Slots Leak Observed when


Slots Leak Observed when

burgesschen
Hi guys,

Our team is observing a stability issue on our standalone Flink clusters.

Background: The Kafka cluster our Flink jobs read from and write to has some
issues, and every 10 to 15 minutes one of the partition leaders switches. This
causes the jobs that read from or write to that topic to fail and restart.
Usually this is not a problem, since the jobs can restart and work with the new
partition leader. However, one of those restarts can put the jobs into a
failing state from which they never recover.

In the failing state, the JobManager has the exception:

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate all requires slots within timeout of 300000 ms. Slots
required: 24, slots allocated: 12
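
For reference, the 300000 ms in that message is the slot allocation timeout;
assuming it corresponds to the slot.request.timeout entry in flink-conf.yaml,
it can be raised as a stopgap while the leak itself is investigated, e.g.:

    # hypothetical value: wait 10 minutes instead of the default 5
    slot.request.timeout: 600000

That only postpones the failure, though; it does not free the stuck slots.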

During that time, two of the TaskManagers report that all of their slots are
occupied; however, according to the JobManager's dashboard, no job is deployed
to those two TaskManagers.
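
One way to see what the JobManager thinks about those slots is the monitoring
REST API. Here is a minimal sketch, assuming the REST endpoint is reachable on
the default port 8081 and that /taskmanagers exposes slotsNumber and freeSlots
per TaskManager:

    import json
    from urllib.request import urlopen

    # Hypothetical JobManager address; adjust to your cluster.
    JOBMANAGER = "http://localhost:8081"

    # /taskmanagers lists every registered TaskManager with its slot counts.
    with urlopen(f"{JOBMANAGER}/taskmanagers") as resp:
        info = json.load(resp)

    for tm in info["taskmanagers"]:
        total, free = tm["slotsNumber"], tm["freeSlots"]
        # Zero free slots on a TaskManager that shows no deployed tasks in
        # the web UI would match the leak described above.
        print(f"{tm['id']}: {total - free}/{total} slots in use")

Comparing that output with the web UI makes the mismatch easy to spot.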

My guess is that, since the jobs restart fairly frequently, during one of the
restarts the slots were not released properly when the jobs failed, leaving the
JobManager falsely believing that those two TaskManagers' slots are still
occupied.

It sounds like the issue described in
https://issues.apache.org/jira/browse/FLINK-9932
but we are using 1.6.2, and according to the JIRA ticket this bug was fixed in
1.6.2.

Please let me know if you have any ideas about how we can prevent this. Thank
you so much!





Re: Slots Leak Observed when

Xintong Song
Hi,
It would be helpful for understanding the problem if you could share the logs.

Thank you~

Xintong Song




Re: Slots Leak Observed when

Till Rohrmann
Hi,

Have you tried one of the latest Flink versions to see whether the problem still exists? I'm asking because there are some improvements which allow for slot reconciliation between the TaskManager and the JobMaster [1]. As a side note, the community is no longer supporting Flink 1.6.x.

For further debugging, the DEBUG logs would be necessary.
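
For example, enabling DEBUG only for the slot-management components keeps the
log volume manageable. A minimal sketch for conf/log4j.properties, with the
package selection being a suggestion rather than an exhaustive list:

    # Slot requests and releases on the JobMaster/ResourceManager side
    log4j.logger.org.apache.flink.runtime.jobmaster=DEBUG
    log4j.logger.org.apache.flink.runtime.resourcemanager=DEBUG
    # Slot allocation and freeing on the TaskManager side
    log4j.logger.org.apache.flink.runtime.taskexecutor=DEBUG

Both the JobManager and TaskManager logs around one of the restarts would be
interesting.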


Cheers,
Till
