Hi,
First of all, happy new year! This may be a very basic question, but I have something to clarify in my head.
My flink-conf.yaml is as follows (note that I didn't specify the value of "execution-checkpointing-externalized-checkpoint-retention [1]"):

#...
execution.checkpointing.interval: 20min
execution.checkpointing.min-pause: 1min
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: hdfs:///flink-jobs/ckpts
state.checkpoints.num-retained: 10
state.savepoints.dir: hdfs:///flink-jobs/svpts
#...

And the checkpoint configuration is shown as follows in the Web UI (note that "Persist Checkpoints Externally" is "Disabled" in the final row):

According to [2],
So I thought the metadata of a checkpoint was kept only in the JobManager's memory and not stored on HDFS unless "execution-checkpointing-externalized-checkpoint-retention" is set.
However, even without setting the value, every checkpoint already contains its own metadata:

[user@devflink conf]$ hdfs dfs -ls /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/*
Found 1 items
-rw-r--r-- 3 user hdfs 163281 2021-01-04 14:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/chk-945/_metadata
Found 1 items
-rw-r--r-- 3 user hdfs 163281 2021-01-04 14:45 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/chk-946/_metadata
Found 1 items
-rw-r--r-- 3 user hdfs 163157 2021-01-04 15:05 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/chk-947/_metadata
Found 1 items
-rw-r--r-- 3 user hdfs 156684 2021-01-04 15:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/chk-948/_metadata
Found 1 items
-rw-r--r-- 3 user hdfs 147280 2021-01-04 15:45 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/chk-949/_metadata
Found 1 items
-rw-r--r-- 3 user hdfs 147280 2021-01-04 16:05 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/chk-950/_metadata
Found 1 items
-rw-r--r-- 3 user hdfs 162937 2021-01-04 16:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/chk-951/_metadata
Found 1 items
-rw-r--r-- 3 user hdfs 175089 2021-01-04 16:45 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/chk-952/_metadata
Found 1 items
-rw-r--r-- 3 user hdfs 173289 2021-01-04 17:05 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/chk-953/_metadata
Found 1 items
-rw-r--r-- 3 user hdfs 153951 2021-01-04 17:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/chk-954/_metadata
Found 21 items
-rw-r--r-- 3 user hdfs 78748 2021-01-04 14:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/05d76f4e-3d9c-420c-8b87-077fc9880d9a
-rw-r--r-- 3 user hdfs 23905 2021-01-04 15:05 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/0b9d9323-9f10-4fc2-8fcc-a9326448b07c
-rw-r--r-- 3 user hdfs 81082 2021-01-04 16:05 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/0f6779d0-3a2e-4a94-be9b-d9d6710a7ea0
-rw-r--r-- 3 user hdfs 23905 2021-01-04 16:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/107b3b74-634a-462c-bf40-1d4886117aa9
-rw-r--r-- 3 user hdfs 78748 2021-01-04 14:45 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/18a538c6-d40e-48c0-a965-d65be407a124
-rw-r--r-- 3 user hdfs 83550 2021-01-04 16:45 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/24ed9c4a-0b8e-45d4-95b8-64547cb9c541
-rw-r--r-- 3 user hdfs 23905 2021-01-04 17:05 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/35ee9665-7c1f-4407-beb5-fbb312d84907
-rw-r--r-- 3 user hdfs 47997 2021-01-04 11:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/36363172-c401-4d60-a970-cfb2b3cbf058
-rw-r--r-- 3 user hdfs 81082 2021-01-04 15:45 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/43aecc8c-145f-43ba-81a8-b0ce2c3498f4
-rw-r--r-- 3 user hdfs 79898 2021-01-04 15:05 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/5743f278-fc50-4c4a-b14e-89bfdb2139fa
-rw-r--r-- 3 user hdfs 23905 2021-01-04 16:45 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/67e16688-c48c-409b-acac-e7091a84d548
-rw-r--r-- 3 user hdfs 23905 2021-01-04 16:05 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/773ef43d-936a-4f33-9b0a-d3ff090637c7
-rw-r--r-- 3 user hdfs 82046 2021-01-04 16:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/81ac58ef-8810-4fa6-ad8f-a5ec0c0cc885
-rw-r--r-- 3 user hdfs 86089 2021-01-04 17:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/8e202c6a-f702-487b-bd00-43739a8c79a2
-rw-r--r-- 3 user hdfs 84875 2021-01-04 17:05 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/a6d4db40-2efe-495c-8e94-a9c31876e4d3
-rw-r--r-- 3 user hdfs 23905 2021-01-04 17:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/b54c5d30-b152-4fba-b0ac-dba598c93646
-rw-r--r-- 3 user hdfs 23905 2021-01-04 15:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/c36433cf-9e79-46ee-a93f-fe042e3c583f
-rw-r--r-- 3 user hdfs 23905 2021-01-04 14:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/e8a27366-4764-4ef0-ae6b-85ed936f6935
-rw-r--r-- 3 user hdfs 80747 2021-01-04 15:25 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/eb6476de-1e35-4d0c-bc6b-2f3214abfffd
-rw-r--r-- 3 user hdfs 23905 2021-01-04 15:45 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/efd13c04-cbac-4c68-a132-1f9dc9afc7b4
-rw-r--r-- 3 user hdfs 23905 2021-01-04 14:45 /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/shared/f63ba16a-6664-49b6-878f-efba342270be

And resuming from a checkpoint directory (e.g. /flink-jobs/ckpts/76fc265c44ef44ae343ab15868155de6/chk-954) works perfectly, as I wished.

So I'm wondering:
- Is every checkpoint already meant to have its metadata on HDFS even without setting the value of "execution-checkpointing-externalized-checkpoint-retention"?
- Is setting "execution-checkpointing-externalized-checkpoint-retention" only needed when I want to retain checkpoints in case a job fails or is intentionally cancelled?

Best,
Dongwon
Hi Dongwon,

Happy new year! A metadata file is stored on HDFS for each checkpoint even if externalized checkpoints are not enabled. If externalized checkpoints are not enabled, Flink deletes all the checkpoints on exit; if they are enabled, the checkpoints are kept when the job is cancelled or fails, according to the setting. So for your second question, I think the answer is yes.

Best,
Yun
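The behaviour described in this reply can be written down as a small decision table. The following is only a Python sketch of that table, not Flink code: the `Retention` enum and the `checkpoints_kept` function are hypothetical names made up for illustration, and the outcomes assume the documented meaning of the two retention values for a job that ends as cancelled or terminally failed.

```python
from enum import Enum


class Retention(Enum):
    """Hypothetical stand-in for the retention setting discussed in the thread."""
    NOT_EXTERNALIZED = "option not set"
    DELETE_ON_CANCELLATION = "DELETE_ON_CANCELLATION"
    RETAIN_ON_CANCELLATION = "RETAIN_ON_CANCELLATION"


def checkpoints_kept(retention: Retention, final_state: str) -> bool:
    """Return True if completed checkpoints are expected to stay on HDFS.

    final_state is "CANCELED" or "FAILED" (a terminal failure, not a restart).
    This models the described behaviour; it does not call into Flink.
    """
    if final_state not in ("CANCELED", "FAILED"):
        raise ValueError(f"unsupported final state: {final_state}")
    if retention is Retention.RETAIN_ON_CANCELLATION:
        # Kept both when the job is cancelled and when it fails.
        return True
    if retention is Retention.DELETE_ON_CANCELLATION:
        # Cleaned up on cancellation, kept only when the job fails.
        return final_state == "FAILED"
    # Externalized checkpoints not enabled: everything is cleaned up on exit.
    return False


if __name__ == "__main__":
    for retention in Retention:
        for state in ("CANCELED", "FAILED"):
            kept = checkpoints_kept(retention, state)
            print(f"{retention.name:24} {state:9} kept={kept}")
```

Note that in every row the chk-$id/_metadata file is still written while the job runs; the setting only decides what happens to the files once the job terminates.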
Thanks Yun for the explanation :) It really helps a lot.

A related question: how can I enable externalized checkpoints in flink-conf.yaml?
It seems that setting "execution-checkpointing-externalized-checkpoint-retention" to RETAIN_ON_CANCELLATION or DELETE_ON_CANCELLATION in flink-conf.yaml is not enough.
The final row of the Web UI shows that it is not enabled even after setting it to either one (FYI, I'm using Flink 1.12.0).
Even with RETAIN_ON_CANCELLATION, I found that a cancelled job cleans up all its checkpoints on HDFS, which contradicts the definition of RETAIN_ON_CANCELLATION.

So I have to add the following lines in my Flink application:

Now the external checkpoint seems to be enabled:

Does it have nothing to do with FLIP-59? Am I missing something, or is it a bug?

Thanks,
Dongwon

On Tue, Jan 5, 2021 at 12:04 AM Yun Gao <[hidden email]> wrote:
Hi Dongwon,
What's the actual setting of this option? Setting 'execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION' should work.
This is verified in tests, and I have also confirmed it in my own submitted jobs.
Best
Yun Tang
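One way to answer "what's the actual setting" is to read the value straight out of the flink-conf.yaml that the cluster really loads. The sketch below is a simplified, hypothetical checker in Python, not part of Flink: `read_option` is a made-up helper, it only understands flat `key: value` lines with `#` comments, and letting the last occurrence of a key win is an assumption of this sketch.

```python
from typing import Optional

# The dotted key quoted in the reply above.
KEY = "execution.checkpointing.externalized-checkpoint-retention"


def read_option(conf_text: str, key: str = KEY) -> Optional[str]:
    """Return the value configured for key, or None if the key is absent."""
    value = None
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so values such as hdfs:///... survive.
        name, _, rest = line.partition(":")
        if name.strip() == key:
            value = rest.strip()  # assumption: the last occurrence wins
    return value


if __name__ == "__main__":
    sample = (
        "execution.checkpointing.interval: 20min\n"
        "state.checkpoints.dir: hdfs:///flink-jobs/ckpts\n"
        "execution.checkpointing.externalized-checkpoint-retention: "
        "RETAIN_ON_CANCELLATION\n"
    )
    print(read_option(sample))
    print(read_option(sample, "state.checkpoints.dir"))
```

Running such a check against the conf directory the cluster was actually started with helps rule out the case where a different copy of flink-conf.yaml was edited.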
From: Dongwon Kim <[hidden email]>
Sent: Tuesday, January 5, 2021 1:46
To: Yun Gao <[hidden email]>
Cc: user <[hidden email]>
Subject: Re: Is chk-$id/_metadata created regardless of enabling externalized checkpoints?
Yun,

I just checked and it works. Sorry for the confusion (I may have modified flink-conf.yaml in a different location.. T.T)

Best,
Dongwon

On Tue, Jan 5, 2021 at 3:38 PM Yun Tang <[hidden email]> wrote: