Hi folks,

I was trying to debug a job which was taking 20-30s to checkpoint data to Azure FS (compared to typically < 5s), and while doing so I noticed something that I'd like to understand better.

Our checkpoint path is as follows:
my_user/featureflow/foo-datacenter/cluster_name/my_flink_job/checkpoint/chk-1234

What I noticed was that while taking checkpoints (incremental, using RocksDB) we make a number of List calls to Azure:
my_user/featureflow/foo-datacenter/cluster_name/my_flink_job/checkpoint
my_user/featureflow/foo-datacenter/cluster_name/my_flink_job
my_user/featureflow/foo-datacenter/cluster_name
my_user/featureflow/foo-datacenter
my_user/featureflow
my_user

Each of these calls takes a few seconds, and together they add up to make our checkpoint slow. What I was hoping to understand on the Flink side is whether making these List calls for each parent ‘directory’ / blob all the way to the top is normal / expected.
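For illustration only (this is not Flink code): a minimal Python sketch that enumerates the parent prefixes of the checkpoint path, which matches the set of List calls described above. The function name `ancestor_prefixes` and the 3-second per-call latency are assumptions made up for the example, not measured values.

```python
from pathlib import PurePosixPath


def ancestor_prefixes(checkpoint_path):
    """Return every parent 'directory' of checkpoint_path, deepest first."""
    path = PurePosixPath(checkpoint_path)
    # For a relative path, .parents ends with "." (the relative root); skip it.
    return [str(parent) for parent in path.parents if str(parent) != "."]


CHECKPOINT = (
    "my_user/featureflow/foo-datacenter/cluster_name/"
    "my_flink_job/checkpoint/chk-1234"
)

prefixes = ancestor_prefixes(CHECKPOINT)
for prefix in prefixes:
    print(prefix)

# Hypothetical latency per List call, only to show how cost grows with depth.
ASSUMED_SECONDS_PER_LIST = 3.0
estimated = len(prefixes) * ASSUMED_SECONDS_PER_LIST
print(f"{len(prefixes)} List calls, roughly {estimated:.0f}s added per checkpoint")
```

Under this assumption the added time grows linearly with the depth of the path, which is why flattening the directory / blob structure would reduce it.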
We are exploring a couple of other angles on our end (potentially flattening the directory / blob structure to reduce the number of these calls, and checking whether the latency on the Azure side is expected), but along with this I was hoping to understand if this behavior on the Flink side is expected / if there’s something which we could optimize as well.

Thanks,
-- Piyush
|
Hi Piyush
Which version of Flink do you use? Since Flink 1.5, Flink no longer calls any "List" operation on the checkpoint side, thanks to FLINK-8540 [1]. The only remaining "List" operation is used when reading files in the file input format. In a nutshell, these "List" calls should not come from Flink if you're using Flink 1.5+.
Best
Yun Tang
From: Piyush Narang <[hidden email]>
Sent: Saturday, March 7, 2020 6:15
To: user <[hidden email]>
Subject: Understanding n LIST calls as part of checkpointing
|
Hi Yun,

Thanks for getting back. We’re on a fork of Flink 1.9 (basically 1.9 with some backported fixes from 1.10 and a couple of minor patches): https://github.com/criteo-forks/flink/tree/criteo-1.9

I’ll check the Jira issue and the fix to see if there’s something that was potentially missed.
-- Piyush