(DEPRECATED) Apache Flink User Mailing List archive.

Deadlock in SafetyNetCloseableRegistry?

bupt_ljy

Deadlock in SafetyNetCloseableRegistry?

Hi all,

I started a Flink program that runs on YARN. At first it could not acquire enough resources, so this exception was thrown:

“org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 16, slots allocated: 7”.

The JobManager then restarted automatically, but checkpoints no longer succeed because each one “expired before completing”. All the TaskManagers are blocked, and there seems to be a deadlock in SafetyNetCloseableRegistry, which may be why the whole TaskManager is blocked. Here is the TaskManager’s stack:

Best, Jiayi Liao

Attachment: out (68K)
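One way to check whether the JVM itself considers a set of threads deadlocked is the standard `ThreadMXBean.findDeadlockedThreads()` call (`jstack` prints the same information at the end of a dump). The following is a minimal sketch, not Flink code: the class name `DeadlockCheckDemo`, the worker thread names, and the helper methods are hypothetical and exist only to show the call on a deliberately deadlocked pair of threads.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockCheckDemo {

    /** Returns the number of threads the JVM reports as deadlocked (0 if none). */
    static int countDeadlockedThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock is found
        return ids == null ? 0 : ids.length;
    }

    /** Starts two daemon threads that take two monitors in opposite order. */
    static void startDeadlockedPair() {
        final Object lockA = new Object();
        final Object lockB = new Object();
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        startWorker("demo-worker-1", lockA, lockB, bothHoldFirstLock);
        startWorker("demo-worker-2", lockB, lockA, bothHoldFirstLock);
    }

    private static void startWorker(String name, Object first, Object second,
                                    CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other worker holds its first lock
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // Never reached: the other worker holds this monitor.
                }
            }
        }, name);
        t.setDaemon(true); // daemon threads do not keep the JVM alive
        t.start();
    }

    /** Polls until the JVM reports a deadlock or the timeout elapses. */
    static int waitForDeadlock(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        int found = countDeadlockedThreads();
        while (found == 0 && System.currentTimeMillis() < deadline) {
            Thread.sleep(20);
            found = countDeadlockedThreads();
        }
        return found;
    }

    public static void main(String[] args) throws InterruptedException {
        startDeadlockedPair();
        System.out.println("Deadlocked threads reported: " + waitForDeadlock(5000));
    }
}
```

If the same check on a blocked TaskManager reports no deadlocked threads, the blockage is more likely a thread waiting on something that never arrives than a lock cycle.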

bupt_ljy

Re: Deadlock in SafetyNetCloseableRegistry?

Hi, all

Sorry for attaching this again. The Flink version is 1.6, and the deadlock stack is:

"CloseableReaperThread" #54 daemon prio=5 os_prio=0 tid=0x00007f4d6d3af000 nid=0x32f6 in Object.wait() [0x00007f4d3fdfe000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000aefacb70> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
        - locked <0x00000000aefacb70> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
        at org.apache.flink.core.fs.SafetyNetCloseableRegistry$CloseableReaperThread.run(SafetyNetCloseableRegistry.java:193)

This thread is created in the AsyncCheckpointRunnable class and gets stuck, so the next checkpoint can’t acquire the lock in the performCheckpoint method and times out. How can I avoid this?

Best, Jiayi Liao
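For comparison, here is a minimal sketch of what any thread blocked in `ReferenceQueue.remove()` looks like from the outside. It is not Flink code, and the class name `ReaperWaitDemo` and thread name `DemoReaperThread` are hypothetical. The point is only that a thread parked in `remove()` with nothing enqueued reports a waiting state, and that `Object.wait()` releases the monitor shown as "locked" in the dump, so this frame alone does not show whether the thread is blocking anything else.

```java
import java.lang.ref.ReferenceQueue;

public class ReaperWaitDemo {

    /** Starts a reaper-style daemon thread and returns its state once it has blocked. */
    static Thread.State observeReaperState() throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Thread reaper = new Thread(() -> {
            try {
                // Blocks until a reference is enqueued; none is ever registered here.
                queue.remove();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "DemoReaperThread");
        reaper.setDaemon(true);
        reaper.start();

        // Poll for up to about 5 seconds until the thread reaches the blocking call.
        long deadline = System.currentTimeMillis() + 5000;
        while (reaper.getState() != Thread.State.WAITING
                && System.currentTimeMillis() < deadline) {
            Thread.sleep(10);
        }
        Thread.State state = reaper.getState();

        // Clean up: interrupting the thread makes remove() throw and the thread exit.
        reaper.interrupt();
        reaper.join(1000);
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Reaper thread state: " + observeReaperState());
    }
}
```

Looking at the other threads in the same dump, especially any that are BLOCKED on a monitor, would show more directly which thread holds the lock that the checkpoint is waiting for.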

Original Message

Sender: bupt_ljy<[hidden email]>

Recipient: user<[hidden email]>

Date: Tuesday, Sep 11, 2018 22:22

Subject: Deadlock in SafetyNetCloseableRegistry?
