http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Checkpoint-fail-due-to-timeout-tp42125p42473.html
Hi Alexey,
You should definitely investigate why the job is stuck.
1. First of all, is it completely stuck, or is something moving? Use Flink metrics [1] (number of bytes/records processed) and go through all of the operators/tasks to check this.
2. Stack traces like the one you quoted:
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:319)
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:291)
you can most likely ignore. Such a task ("Legacy Source Thread - Source: digital-itx-eastus2 -> Filter (6/6)#0") is backpressured, and the problem lies downstream.
3. To check which tasks are backpressured, you can also use Flink metrics: check the "isBackPressured" metric. Again, backpressured tasks are most likely not the source of the problem. Check downstream from the backpressured task.
4. The first (most upstream) task that is not backpressured, but is accepting/processing data from backpressured tasks, is the interesting one. It is causing the backpressure, and you need to investigate what the problem is. Take a look at its stack traces, and maybe attach a remote profiler and profile its code (if it's making slow progress). Maybe it's stuck in your user code doing something.
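As a rough illustration of steps 3 and 4, here is a minimal Python sketch of the search. It assumes you have already collected, by whatever means you prefer, the job topology and the set of tasks whose "isBackPressured" metric is true. The function name, the task names, and the dictionary layout are all hypothetical and not part of any Flink API.

```python
from typing import Dict, List, Set


def find_suspect_tasks(downstream: Dict[str, List[str]],
                       backpressured: Set[str]) -> List[str]:
    """Return tasks that are NOT backpressured themselves but receive
    data from at least one backpressured task.

    downstream:    maps each task name to the tasks it sends data to
                   (hypothetical representation of the job graph).
    backpressured: names of tasks whose isBackPressured metric is true
                   (assumed to have been collected beforehand).
    """
    suspects: List[str] = []
    for task, outputs in downstream.items():
        if task not in backpressured:
            continue  # only look directly downstream of backpressured tasks
        for out in outputs:
            if out not in backpressured and out not in suspects:
                suspects.append(out)
    return suspects


# Hypothetical three-task job: the source is backpressured, the rest are not.
topology = {
    "Source -> Filter": ["KeyedProcess"],
    "KeyedProcess": ["Sink"],
    "Sink": [],
}
print(find_suspect_tasks(topology, {"Source -> Filter"}))
```

In this made-up example the sketch points at "KeyedProcess", which is the task whose stack traces and profile would be worth looking at first.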
Please let us know what you have found out.
Piotrek