Deadlock in SafetyNetCloseableRegistry?

Deadlock in SafetyNetCloseableRegistry?

bupt_ljy

Hi all,

   I started a Flink program that runs on YARN. At first it could not acquire enough resources, so this exception was thrown:

“org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 16, slots allocated: 7”.

  Then the JobManager restarted automatically, but checkpoints no longer succeed because they “expired before completing”. All the TaskManagers are blocked, and there seems to be a deadlock in SafetyNetCloseableRegistry, which may be why the whole TaskManager is blocked. Here is the TaskManager’s stack:

   

  Best, Jiayi Liao



Re: Deadlock in SafetyNetCloseableRegistry?

bupt_ljy

Hi, all

Sorry, I am attaching this again. The Flink version is 1.6, and the deadlock stack is:


"CloseableReaperThread" #54 daemon prio=5 os_prio=0 tid=0x00007f4d6d3af000 nid=0x32f6 in Object.wait() [0x00007f4d3fdfe000]

   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000aefacb70> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
        - locked <0x00000000aefacb70> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
        at org.apache.flink.core.fs.SafetyNetCloseableRegistry$CloseableReaperThread.run(SafetyNetCloseableRegistry.java:193)


       This thread is created in the AsyncCheckpointRunnable class and gets stuck, so the next checkpoint can’t acquire the lock in the performCheckpoint method and times out. How can I avoid this?
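
       To help read the trace above, here is a minimal sketch in plain Java, with no Flink code involved. It starts a daemon thread that blocks in ReferenceQueue.remove() on an empty queue and reports the thread state it observes. The names ReaperWaitSketch, SketchReaperThread and observeReaperState are made up for this illustration and are not Flink's implementation. The sketch only reproduces the thread state seen in the dump; on its own it does not show whether that thread is what blocks the checkpoint.

```java
import java.lang.ref.ReferenceQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; this is not Flink's SafetyNetCloseableRegistry.
class ReaperWaitSketch {

    /**
     * Starts a daemon thread that blocks in ReferenceQueue.remove() on an
     * empty queue and returns the thread state observed once it has parked
     * (or the last state seen if it has not parked within 5 seconds).
     */
    static Thread.State observeReaperState() {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();

        Thread reaper = new Thread(() -> {
            try {
                while (true) {
                    // Blocks until the garbage collector enqueues a reference.
                    // Nothing is ever registered with this queue, so it blocks indefinitely.
                    queue.remove();
                }
            } catch (InterruptedException e) {
                // Interrupted by the caller: leave the loop and let the thread end.
            }
        }, "SketchReaperThread");
        reaper.setDaemon(true);
        reaper.start();

        // Poll until the thread has reached the blocking call, with an upper bound.
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (reaper.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }

        Thread.State observed = reaper.getState();
        reaper.interrupt(); // stop the sketch thread
        return observed;
    }

    public static void main(String[] args) {
        System.out.println("Reaper thread state: " + observeReaperState());
    }
}
```

       Taking a thread dump (for example with jstack) while such a thread is parked should show frames similar to the ReferenceQueue.remove frames in the trace above, though the exact frames depend on the JDK version.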

       Best, Jiayi Liao

 Original Message 
Sender: bupt_ljy<[hidden email]>
Recipient: user<[hidden email]>
Date: Tuesday, Sep 11, 2018 22:22
Subject: Deadlock in SafetyNetCloseableRegistry?
