Cleaning of state snapshot in state backend(HDFS)

Garvit Sharma
Hi,

Consider managed keyed state backed by HDFS, with checkpointing enabled. As the state grows, the state data is saved to HDFS.

Now, let's say, we clear the state. Would the state data be removed from HDFS too?

How does Flink manage to clear the state data from state backend on clearing the keyed state?

--

Garvit Sharma
github.com/garvitlnmiit/

No Body is a Scholar by birth, its only hard work and strong determination that makes him master.

Re: Cleaning of state snapshot in state backend(HDFS)

gerryzhou
Hi Garvit,

> Now, let's say, we clear the state. Would the state data be removed from HDFS too?

If you clear the state in your job, the state data is not removed from HDFS immediately. However, checkpoints that complete after the clear will no longer contain that state.

> How does Flink manage to clear the state data from state backend on clearing the keyed state?

1. You can use {{state.checkpoints.num-retained}} to set the number of completed checkpoints retained on HDFS.
2. If you set {{env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION)}}, the checkpoints on HDFS are removed once your job finishes or is cancelled. If you set {{env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)}}, the checkpoints are retained.
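
To make the effect of a retention limit concrete, here is a small plain-Java model. It is only a sketch: the class RetainedCheckpoints and its methods are hypothetical names made up for illustration, it is not Flink code, and it assumes that the oldest completed checkpoint is the one discarded once the limit is exceeded.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of a retention limit on completed checkpoints (hypothetical, not Flink code). */
class RetainedCheckpoints {
    // Plays the role of state.checkpoints.num-retained in this model.
    private final int maxRetained;
    private final Deque<Long> retained = new ArrayDeque<>();

    RetainedCheckpoints(int maxRetained) {
        if (maxRetained < 1) {
            throw new IllegalArgumentException("maxRetained must be at least 1");
        }
        this.maxRetained = maxRetained;
    }

    /** Registers a completed checkpoint and returns the ids that fall out of retention. */
    List<Long> complete(long checkpointId) {
        retained.addLast(checkpointId);
        List<Long> dropped = new ArrayList<>();
        while (retained.size() > maxRetained) {
            // Assumption of this model: the oldest checkpoint is discarded first.
            dropped.add(retained.removeFirst());
        }
        return dropped;
    }

    /** Returns the ids still retained, oldest first. */
    List<Long> retained() {
        return new ArrayList<>(retained);
    }

    public static void main(String[] args) {
        RetainedCheckpoints store = new RetainedCheckpoints(2);
        store.complete(1L);
        store.complete(2L);
        System.out.println("dropped: " + store.complete(3L));
        System.out.println("retained: " + store.retained());
    }
}
```

With a limit of 2, completing a third checkpoint pushes the first one out of retention, which is the point at which its files would become eligible for deletion in this model.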



Additionally, here is a brief overview of checkpoints on HDFS. In a nutshell, whatever you do with the state in your running job only affects the contents of the local state backend. When checkpointing, Flink takes a snapshot of the local state backend and sends it to the checkpoint target directory (in your case, HDFS). The checkpoints on HDFS are periodic snapshots of your job's state backend; they can be created or deleted, but never changed. Maybe Stefan (cc) could give you more detailed information, and please correct me if I'm wrong.
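
The relationship between the mutable local state and the immutable snapshots can be sketched in plain Java. This is a toy model under stated assumptions, not Flink's implementation: SnapshotModel, its in-memory list standing in for the HDFS directory, and the key "user-1" are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model: mutable local state plus immutable periodic snapshots (hypothetical, not Flink code). */
class SnapshotModel {
    // Stands in for the local state backend of the running job.
    private final Map<String, Long> localState = new HashMap<>();
    // Stands in for the checkpoint target directory (HDFS in the thread).
    private final List<Map<String, Long>> checkpoints = new ArrayList<>();

    void update(String key, long value) {
        localState.put(key, value);
    }

    /** Clearing only touches the local state, never an existing snapshot. */
    void clear(String key) {
        localState.remove(key);
    }

    /** Copies the current local state into a new read-only snapshot and returns its index. */
    int checkpoint() {
        checkpoints.add(Collections.unmodifiableMap(new HashMap<>(localState)));
        return checkpoints.size() - 1;
    }

    Map<String, Long> read(int checkpointIndex) {
        return checkpoints.get(checkpointIndex);
    }

    public static void main(String[] args) {
        SnapshotModel job = new SnapshotModel();
        job.update("user-1", 42L);
        int before = job.checkpoint();  // completed before the clear
        job.clear("user-1");            // affects local state only
        int after = job.checkpoint();   // completed after the clear
        System.out.println("before clear: " + job.read(before));
        System.out.println("after clear:  " + job.read(after));
    }
}
```

In this model the snapshot taken before the clear still holds the entry, while the one taken afterwards does not; the old entry only disappears from storage when the older snapshot itself is deleted.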

Best, Sihua

Re: Cleaning of state snapshot in state backend(HDFS)

Garvit Sharma
So, would it delete all the files in HDFS associated with the cleared state?

Re: Cleaning of state snapshot in state backend(HDFS)

gerryzhou
Hi Garvit,

Let's say you clear the state at timestamp t1. Checkpoints completed before t1 will still contain the data you cleared, but future checkpoints will not. I'm not sure what you mean by "the cleared state", though: currently you can only clear a key-value pair of the state; you can't clear the whole state.

Best, Sihua

Re: Cleaning of state snapshot in state backend(HDFS)

Garvit Sharma
I am maintaining state data for a key in ValueState. As per [0] I can clear() state for that key.


Please let me know.

Thanks,


Re: Cleaning of state snapshot in state backend(HDFS)

Garvit Sharma
Now, after clearing state for a key, I don't want that redundant data in the state backend. This is my concern.

Please let me know if there are any gaps.

Thanks,

Re: Cleaning of state snapshot in state backend(HDFS)

gerryzhou
Yes, you can clear the state for a key (the currently active key). Clearing it also removes it from the state backend, and future checkpoints won't contain the key anymore unless you add it again.
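
As an illustration of "clearing the currently active key", here is a minimal plain-Java model of per-key value state. It is a sketch only: KeyedValueModel and setCurrentKey are hypothetical names, this is not Flink's ValueState, and it assumes that reading a cleared key yields null.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of value state scoped to a "current key" (hypothetical, not Flink's ValueState). */
class KeyedValueModel<K, V> {
    private final Map<K, V> entries = new HashMap<>();
    private K currentKey;

    /** In a real job the runtime selects the key per record; here the caller does it by hand. */
    void setCurrentKey(K key) {
        this.currentKey = key;
    }

    /** Returns the value of the current key, or null if none is stored. */
    V value() {
        return entries.get(requireKey());
    }

    void update(V value) {
        entries.put(requireKey(), value);
    }

    /** Removes only the entry of the current key; other keys are untouched. */
    void clear() {
        entries.remove(requireKey());
    }

    /** Number of keys that currently hold a value. */
    int size() {
        return entries.size();
    }

    private K requireKey() {
        if (currentKey == null) {
            throw new IllegalStateException("no current key set");
        }
        return currentKey;
    }

    public static void main(String[] args) {
        KeyedValueModel<String, Long> state = new KeyedValueModel<>();
        state.setCurrentKey("a");
        state.update(1L);
        state.setCurrentKey("b");
        state.update(2L);
        state.setCurrentKey("a");
        state.clear();  // removes the entry for "a" only
        System.out.println("keys with state: " + state.size());
    }
}
```

The model shows why there is no single call that clears the whole state here: each clear() is scoped to whichever key is active at that moment.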

Best, Sihua


Re: Cleaning of state snapshot in state backend(HDFS)

Garvit Sharma
Thank you for the clarification. 
