Hi all!
I am having trouble explaining why my checkpoints take so long, even though most partitions finish their checkpoints quite quickly. We are running a 96-partition job that consumes from and produces to Kafka and checkpoints to Amazon S3.

As you can see in the screenshot below, the State Size is pretty well balanced and the Checkpoint Durations (Async and Sync) always stay under 13 minutes. However, the End-to-End Duration of subtask 4 is 1h17m, which leaves the checkpoint stuck at 99% for a very long time. For the last few checkpoints, subtask 4 has always been the one causing this slowness.

Have you ever observed such behavior? What could be the reason for a huge end-to-end time on a single subtask?

Thank you, and don't hesitate to ask if you need more information.

Attachment: Screenshot 2020-01-09 at 08.00.15.png (228K)
Hi Robin,

I noticed that I answered privately, so let me forward that to the user list. Please come back to the ML if you have more questions.

Best,
Arvid

On Thu, Jan 9, 2020 at 5:47 PM Robin Cassan <[hidden email]> wrote: