Hello,
I recently upgraded from Flink 1.9 to 1.10. Checkpointing succeeds the first couple of times and then starts failing because of timeouts. The checkpoint duration grows with every checkpoint and starts exceeding 10 minutes. I do not see any exceptions in the logs, and I have enabled debug logging at the "org.apache.flink" level. Garbage collection looks fine and there is no backpressure. This used to work as-is with Flink 1.9 without any issue. How do I investigate this?
Any pointers on how to investigate why checkpoints take so long to complete?
Omkar
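[Editor's note: the 10-minute limit mentioned above is the checkpoint timeout. As a hedged sketch with illustrative values (not taken from this thread), it is configured roughly like this:]

```yaml
# Illustrative flink-conf.yaml fragment (example values, not from this thread).
# The execution.checkpointing.* keys exist from Flink 1.11 onwards; on 1.10
# the equivalent is set in code via CheckpointConfig#setCheckpointTimeout.
execution.checkpointing.interval: 1min
execution.checkpointing.timeout: 10min   # checkpoint is declared failed if not finished within this
```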
I have followed the memory migration guide
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html
and I am now using taskmanager.memory.flink.size instead of taskmanager.heap.size.
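[Editor's note: the migration that guide describes boils down to replacing the legacy heap option with the new total-Flink-memory option, from which the other components are derived. A minimal before/after sketch with illustrative sizes:]

```yaml
# Before (Flink 1.9 legacy option):
# taskmanager.heap.size: 4096m

# After (Flink 1.10 unified memory model); taskmanager.memory.flink.size
# covers heap + managed + network memory, so the components are derived:
taskmanager.memory.flink.size: 4096m
# Optionally pin individual components, e.g. network buffer memory:
# taskmanager.memory.network.fraction: 0.1
# taskmanager.memory.network.min: 64mb
# taskmanager.memory.network.max: 1gb
```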
From: Deshpande, Omkar <[hidden email]>
Sent: Monday, September 14, 2020 6:23 PM
To: [hidden email]
Subject: flink checkpoint timeout
Hi Omkar,
First of all, check the checkpoint tab of the web UI [1] to see whether many subtasks fail to complete in time or just a few of them. If it is many, your checkpoint timeout may simply be too short for the current workload. If it is only a few, some task may be stuck on a slow machine, or may be unable to grab the checkpoint lock to run the synchronous phase of checkpointing; you can use the thread dump feature [2] (requires upgrading to Flink 1.11) or jstack to see what is happening inside the Java process.
Best
Yun Tang
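[Editor's note: as a sketch of what to look for in such a dump, the plain-JDK code below (not Flink code; `ThreadDumpScan` and `stuckThreads` are names made up for this example) dumps the current JVM's threads and flags the ones that are BLOCKED or WAITING. In a TaskManager dump, a task thread parked like this during the synchronous checkpoint phase is the suspect.]

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: scans a live thread dump of the current JVM and
// reports threads that are blocked or waiting -- the states you would
// grep for in a TaskManager stuck in the sync phase of a checkpoint.
public class ThreadDumpScan {
    public static List<String> stuckThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> result = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            Thread.State s = info.getThreadState();
            if (s == Thread.State.BLOCKED
                    || s == Thread.State.WAITING
                    || s == Thread.State.TIMED_WAITING) {
                result.add(info.getThreadName() + " -> " + s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Every JVM has a few parked housekeeping threads, so this always prints something.
        stuckThreads().forEach(System.out::println);
    }
}
```

The same states show up textually in `jstack <taskmanager-pid>` output; look for task threads (not housekeeping threads) stuck in them across consecutive dumps.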
From: Deshpande, Omkar <[hidden email]>
Sent: Tuesday, September 15, 2020 10:25
To: [hidden email]
Subject: Re: flink checkpoint timeout
A few of the subtasks fail. I cannot upgrade to 1.11 yet, but I can still get a thread dump with jstack. What should I be looking for in the thread dump?
From: Yun Tang <[hidden email]>
Sent: Monday, September 14, 2020 8:52 PM
To: Deshpande, Omkar <[hidden email]>; [hidden email]
Subject: Re: flink checkpoint timeout
[Attachment: Screen Shot 2020-09-14 at 9.23.41 PM.png (307K)]
This thread seems to be stuck in a waiting state:

at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231)
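[Editor's note: the bottom frame indicates the thread is waiting for a free network buffer. Below is a minimal plain-JDK sketch (not Flink's actual implementation; `BufferPoolSketch` is a made-up name) of the pattern behind a blocking buffer request: a fixed pool where requests park the caller until a downstream consumer recycles a buffer. This is how backpressure surfaces as parked threads in a dump.]

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of a bounded buffer pool: request() blocks when the pool is
// empty, exactly the shape of the parked stack trace above.
public class BufferPoolSketch {
    private final BlockingQueue<byte[]> pool;

    public BufferPoolSketch(int numBuffers, int bufferSize) {
        pool = new ArrayBlockingQueue<>(numBuffers);
        for (int i = 0; i < numBuffers; i++) {
            pool.add(new byte[bufferSize]);
        }
    }

    // Blocks (the calling thread parks) until a buffer is free.
    public byte[] request() throws InterruptedException {
        return pool.take();
    }

    public void recycle(byte[] buffer) {
        pool.add(buffer);
    }

    public static void main(String[] args) throws Exception {
        BufferPoolSketch p = new BufferPoolSketch(1, 32768);
        byte[] b = p.request();            // pool is now empty
        Thread t = new Thread(() -> {
            try { p.request(); } catch (InterruptedException ignored) { }
        });
        t.start();
        TimeUnit.MILLISECONDS.sleep(200);  // let the second requester park
        System.out.println("requester state: " + t.getState()); // typically WAITING
        p.recycle(b);                      // recycling a buffer unblocks it
        t.join(1000);
    }
}
```

When many records sit in network buffers under backpressure, checkpoint barriers queue behind them, which is why this wait correlates with long checkpoints.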
From: Congxian Qiu <[hidden email]>
Sent: Monday, September 14, 2020 10:57 PM
To: Deshpande, Omkar <[hidden email]>
Cc: [hidden email]
Subject: Re: flink checkpoint timeout
Hi,
You can try to find out whether there is some hot method, or whether the snapshot stack is waiting for some lock.
Best,
Congxian

On Tue, Sep 15, 2020 at 12:30 PM, Deshpande, Omkar <[hidden email]> wrote:
[Attachments: Screen Shot 2020-09-16 at 5.22.09 PM.png (105K), Screen Shot 2020-09-16 at 5.21.50 PM.png (601K), 17:16:34.txt (152K)]
These are the hotspot methods. Any pointers on debugging this? The checkpoints have kept timing out since migrating from 1.9 to 1.10.
From: Deshpande, Omkar <[hidden email]>
Sent: Wednesday, September 16, 2020 5:27 PM
To: Congxian Qiu <[hidden email]>
Cc: [hidden email]; Yun Tang <[hidden email]>
Subject: Re: flink checkpoint timeout
[Attachment: Screen Shot 2020-09-17 at 6.26.24 PM.png (152K)]
I'm not 100% sure, but from the given information this might be related to FLINK-14498 [1] and partially relieved by FLINK-16645 [2]. @Omkar, could you try the 1.11.0 release out and see whether the issue disappears?

[hidden email] @yingjie, could you also take a look here? Thanks.

On Fri, 18 Sep 2020 at 09:28, Deshpande, Omkar <[hidden email]> wrote:
Hi Omkar,

I don't see anything suspicious in regards to how Flink handles checkpointing; it simply took longer than 10 minutes (the configured checkpointing timeout) to checkpoint. The usual reason for long checkpointing times is backpressure.

And indeed, looking at your thread dump, I see that you have a sleep Fn in it. Can you shed some light on this? Why do you need it?

If you want to throttle things, it's best to throttle at the source if possible. Alternatively, have the sleep as early as possible, so that it's ideally directly chained to the source. That would reduce the number of records in network buffers significantly, which speeds up checkpointing tremendously.

Lastly, you might want to reduce the number of network buffers if you indeed have backpressure (check the Web UI for that).

On Tue, Oct 6, 2020 at 6:04 AM Yu Li <[hidden email]> wrote:
--
Arvid Heise | Senior Java Developer, Ververica
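[Editor's note: a hedged sketch of the "throttle at the source" advice above. The plain-JDK limiter below spaces out emissions in the source thread itself, instead of a sleep deep in the pipeline where records pile up in network buffers. `SourceThrottle` and its method names are hypothetical, not Flink API.]

```java
// Simple pacing limiter: permits at most `permitsPerSecond` records per
// second by making the calling (source) thread wait for its next slot.
public class SourceThrottle {
    private final long nanosPerPermit;
    private long nextFreeAt = System.nanoTime();

    public SourceThrottle(double permitsPerSecond) {
        this.nanosPerPermit = (long) (1_000_000_000L / permitsPerSecond);
    }

    // Blocks until the next permit is due; call once per emitted record,
    // inside the source loop (or an operator chained directly to it).
    public synchronized void acquire() throws InterruptedException {
        long now = System.nanoTime();
        long waitNanos = nextFreeAt - now;
        nextFreeAt = Math.max(now, nextFreeAt) + nanosPerPermit;
        if (waitNanos > 0) {
            Thread.sleep(waitNanos / 1_000_000L, (int) (waitNanos % 1_000_000L));
        }
    }

    public static void main(String[] args) throws Exception {
        SourceThrottle throttle = new SourceThrottle(100.0); // ~100 records/s
        for (int i = 0; i < 5; i++) {
            throttle.acquire();
            System.out.println("emit record " + i);
        }
    }
}
```

Because the wait happens before records enter the pipeline, downstream network buffers stay near-empty and checkpoint barriers flow through quickly.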