How does at least once checkpointing work

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

How does at least once checkpointing work

Rex Fenley
Hello,

We're using the TableAPI and want to optimize for checkpoint alignment times. We received some advice to possibly use at-least-once. I'd like to understand how checkpointing works in at-least-once mode so I understand the caveats and can evaluate whether or not that will work for us.

Thanks!
--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US

Reply | Threaded
Open this post in threaded view
|

Re: How does at least once checkpointing work

Yuan Mei
Hey Rex,

You probably will find the link below helpful; it explains how at-least-once (does not have alignment) is different from exactly-once(needs alignment). It also explains how the alignment phase is skipped in the at-least-once mode.


In a high level, at least once mode for a task with multiple input channels
1. does NOT block processing to wait for barriers from all inputs, meaning the task keeps processing data after receiving a barrier even if it has multiple inputs.
2. but still, a task takes a snapshot after seeing the checkpoint barrier from all input channels.

In this way, a Snapshot N may contain data change coming from Epoch N+1; that's where "at least once" comes from.

On Tue, Jan 12, 2021 at 1:03 PM Rex Fenley <[hidden email]> wrote:
Hello,

We're using the TableAPI and want to optimize for checkpoint alignment times. We received some advice to possibly use at-least-once. I'd like to understand how checkpointing works in at-least-once mode so I understand the caveats and can evaluate whether or not that will work for us.

Thanks!
--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US

Reply | Threaded
Open this post in threaded view
|

Re: How does at least once checkpointing work

Rex Fenley
Thanks for the info.

It sounds like any state which does not have some form of uniqueness could end up being incorrect.

Specifically in my case, all rows passing through the execution graph have unique ids. However, any operator from groupby foreign_key then sum/count could end up with an inconsistent count. Normally a retract (-1) and then insert (+1) would keep the count correct, but with "at least once" a retract (-1) may be from epoch n+1 and therefore played twice, making the count equal less than it should actually be.

Am I understanding this correctly?

Thanks!

On Mon, Jan 11, 2021 at 10:06 PM Yuan Mei <[hidden email]> wrote:
Hey Rex,

You probably will find the link below helpful; it explains how at-least-once (does not have alignment) is different from exactly-once(needs alignment). It also explains how the alignment phase is skipped in the at-least-once mode.


In a high level, at least once mode for a task with multiple input channels
1. does NOT block processing to wait for barriers from all inputs, meaning the task keeps processing data after receiving a barrier even if it has multiple inputs.
2. but still, a task takes a snapshot after seeing the checkpoint barrier from all input channels.

In this way, a Snapshot N may contain data change coming from Epoch N+1; that's where "at least once" comes from.

On Tue, Jan 12, 2021 at 1:03 PM Rex Fenley <[hidden email]> wrote:
Hello,

We're using the TableAPI and want to optimize for checkpoint alignment times. We received some advice to possibly use at-least-once. I'd like to understand how checkpointing works in at-least-once mode so I understand the caveats and can evaluate whether or not that will work for us.

Thanks!
--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US



--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US

Reply | Threaded
Open this post in threaded view
|

Re: How does at least once checkpointing work

Yuan Mei

It sounds like any state which does not have some form of uniqueness could end up being incorrect.

at least once usually works if the use case can tolerate a certain level of duplication or the computation is idempotent.
 
Specifically in my case, all rows passing through the execution graph have unique ids. However, any operator from groupby foreign_key then sum/count could end up with an inconsistent count. Normally a retract (-1) and then insert (+1) would keep the count correct, but with "at least once" a retract (-1) may be from epoch n+1 and therefore played twice, making the count equal less than it should actually be.


Not completely sure how the "retract (-1)" and "insert (+1)" work in your case, but "input data" that leads to a state change (count/sum change) is possible to be played twice after a recovery.
 
Am I understanding this correctly?

Thanks!

On Mon, Jan 11, 2021 at 10:06 PM Yuan Mei <[hidden email]> wrote:
Hey Rex,

You probably will find the link below helpful; it explains how at-least-once (does not have alignment) is different from exactly-once(needs alignment). It also explains how the alignment phase is skipped in the at-least-once mode.


In a high level, at least once mode for a task with multiple input channels
1. does NOT block processing to wait for barriers from all inputs, meaning the task keeps processing data after receiving a barrier even if it has multiple inputs.
2. but still, a task takes a snapshot after seeing the checkpoint barrier from all input channels.

In this way, a Snapshot N may contain data change coming from Epoch N+1; that's where "at least once" comes from.

On Tue, Jan 12, 2021 at 1:03 PM Rex Fenley <[hidden email]> wrote:
Hello,

We're using the TableAPI and want to optimize for checkpoint alignment times. We received some advice to possibly use at-least-once. I'd like to understand how checkpointing works in at-least-once mode so I understand the caveats and can evaluate whether or not that will work for us.

Thanks!
--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US



--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US

Reply | Threaded
Open this post in threaded view
|

Re: How does at least once checkpointing work

Rex Fenley
Thanks!

On Tue, Jan 12, 2021 at 1:56 AM Yuan Mei <[hidden email]> wrote:

It sounds like any state which does not have some form of uniqueness could end up being incorrect.

at least once usually works if the use case can tolerate a certain level of duplication or the computation is idempotent.
 
Specifically in my case, all rows passing through the execution graph have unique ids. However, any operator from groupby foreign_key then sum/count could end up with an inconsistent count. Normally a retract (-1) and then insert (+1) would keep the count correct, but with "at least once" a retract (-1) may be from epoch n+1 and therefore played twice, making the count equal less than it should actually be.


Not completely sure how the "retract (-1)" and "insert (+1)" work in your case, but "input data" that leads to a state change (count/sum change) is possible to be played twice after a recovery.
 
Am I understanding this correctly?

Thanks!

On Mon, Jan 11, 2021 at 10:06 PM Yuan Mei <[hidden email]> wrote:
Hey Rex,

You probably will find the link below helpful; it explains how at-least-once (does not have alignment) is different from exactly-once(needs alignment). It also explains how the alignment phase is skipped in the at-least-once mode.


In a high level, at least once mode for a task with multiple input channels
1. does NOT block processing to wait for barriers from all inputs, meaning the task keeps processing data after receiving a barrier even if it has multiple inputs.
2. but still, a task takes a snapshot after seeing the checkpoint barrier from all input channels.

In this way, a Snapshot N may contain data change coming from Epoch N+1; that's where "at least once" comes from.

On Tue, Jan 12, 2021 at 1:03 PM Rex Fenley <[hidden email]> wrote:
Hello,

We're using the TableAPI and want to optimize for checkpoint alignment times. We received some advice to possibly use at-least-once. I'd like to understand how checkpointing works in at-least-once mode so I understand the caveats and can evaluate whether or not that will work for us.

Thanks!
--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US



--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US



--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US