Hi guys,
Our team is observing a stability issue on our standalone Flink clusters.

Background: the Kafka cluster our Flink jobs read from and write to has some issues, and every 10 to 15 minutes one of the partition leaders switches. This causes jobs that read from or write to that topic to fail and restart. Usually this is not a problem, since the jobs can restart and work with the new partition leader. However, one of those restarts can put the jobs into a failing state from which they never recover.

In the failing state, the JobManager reports this exception:

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 24, slots allocated: 12

During that time, 2 of the TaskManagers report that all of their slots are occupied. However, the JobManager dashboard shows no job deployed to those 2 TaskManagers.

My guess is that, since the jobs restart fairly frequently, on one of those restarts the slots are not released properly when the jobs fail, so the JobManager falsely believes that the slots on those 2 TaskManagers are still occupied.

This sounds like the issue described in https://issues.apache.org/jira/browse/FLINK-9932, but we are using 1.6.2 and, according to the JIRA ticket, that bug is fixed in 1.6.2.

Please let me know if you have any ideas about the cause or how we can prevent it.

Thank you so much!

-- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Hi,

It would be helpful for understanding the problem if you could share the logs.

Thank you~
Xintong Song

On Wed, Jan 15, 2020 at 12:23 AM burgesschen <[hidden email]> wrote:

Hi,

have you tried one of the latest Flink versions to see whether the problem still exists? I'm asking because there are some improvements which allow for slot reconciliation between the TaskManager and the JobMaster [1]. As a side note, the community is no longer supporting Flink 1.6.x.

For further debugging, the DEBUG logs would be necessary.

Cheers,
Till

On Wed, Jan 15, 2020 at 7:25 AM Xintong Song <[hidden email]> wrote:
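
While collecting the logs, it may also help to record how many slots the JobManager side thinks each TaskManager has free at the moment the job gets stuck. The following is only a minimal sketch, not something posted in this thread. It assumes the JobManager's REST endpoint `/taskmanagers` returns a `taskmanagers` list whose entries carry `id`, `slotsNumber` and `freeSlots` fields, and the function names and the example address are hypothetical.

```python
import json
from urllib.request import urlopen


def fully_occupied(taskmanagers):
    """Return the ids of TaskManagers that have slots but report none free.

    `taskmanagers` is a list of dicts; the field names "id", "slotsNumber"
    and "freeSlots" are assumed from the REST response, not confirmed here.
    """
    return [
        tm["id"]
        for tm in taskmanagers
        if tm.get("slotsNumber", 0) > 0 and tm.get("freeSlots", 0) == 0
    ]


def fetch_taskmanagers(base_url):
    """Fetch the TaskManager list from the JobManager REST endpoint."""
    with urlopen(base_url.rstrip("/") + "/taskmanagers") as resp:
        return json.load(resp)["taskmanagers"]


# Example call (hypothetical address, replace with the real JobManager):
# print(fully_occupied(fetch_taskmanagers("http://localhost:8081")))
```

If this lists TaskManagers as fully occupied while no job is running on them, that would be consistent with the leaked-slot guess in the original post. It only shows the JobManager side's view, so the TaskManager logs are still needed to compare both sides.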