Hi, all
Sorry for attaching this again. The Flink version is 1.6, and the deadlock stack is:
"CloseableReaperThread" #54 daemon prio=5 os_prio=0 tid=0x00007f4d6d3af000 nid=0x32f6 in Object.wait() [0x00007f4d3fdfe000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000aefacb70> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
        - locked <0x00000000aefacb70> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
        at org.apache.flink.core.fs.SafetyNetCloseableRegistry$CloseableReaperThread.run(SafetyNetCloseableRegistry.java:193)
This thread is created in the AsyncCheckpointRunnable class and gets stuck, so the next checkpoint can’t acquire the lock in the performCheckpoint method and times out. How can I avoid this?
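For reference, the thread state in the dump above can be reproduced outside Flink with a small sketch. This is not Flink's code: the class name `ReaperDemo`, the method `observeReaperState`, and the thread name `DemoReaperThread` are made up for illustration. It only shows that a daemon thread blocked in `ReferenceQueue.remove()` with nothing enqueued should report the WAITING state; it does not reproduce the checkpoint lock problem.

```java
import java.lang.ref.ReferenceQueue;

public class ReaperDemo {

    /**
     * Starts a daemon thread that blocks in ReferenceQueue.remove()
     * and returns the state that thread reports once it is blocked.
     * Hypothetical demo code, not taken from Flink.
     */
    static Thread.State observeReaperState() throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();

        Thread reaper = new Thread(() -> {
            try {
                while (true) {
                    // Blocks until a reference is enqueued; nothing ever is here.
                    queue.remove();
                }
            } catch (InterruptedException e) {
                // Interrupted by the caller: leave the loop and let the thread end.
            }
        }, "DemoReaperThread");
        reaper.setDaemon(true);
        reaper.start();

        // Poll for up to about two seconds until the thread has blocked.
        long deadline = System.nanoTime() + 2_000_000_000L;
        while (reaper.getState() != Thread.State.WAITING
                && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }

        Thread.State state = reaper.getState();
        reaper.interrupt(); // stop the demo thread
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(observeReaperState());
    }
}
```

Taking a jstack dump while the demo thread is blocked should show frames similar to the ones above, ending in `ReferenceQueue.remove`.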
Best, Jiayi Liao
Original Message
Date: Tuesday, Sep 11, 2018 22:22
Subject: Deadlock in SafetyNetCloseableRegistry?
Hi, all
I started a Flink program that runs on YARN. At first it couldn’t acquire enough resources, so this exception was thrown:
“org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 16, slots allocated: 7”.
Then the JobManager restarted automatically, but it can no longer trigger checkpoints; they fail with “expired before completing”. All the TaskManagers are blocked, and there seems to be a deadlock in SafetyNetCloseableRegistry, which may be why the whole TaskManager is blocked. Here is the TaskManager’s stack:
Best, Jiayi Liao