(DEPRECATED) Apache Flink User Mailing List archive.

How does at least once checkpointing work

Classic

List

Threaded

5 messages Options

Rex Fenley

How does at least once checkpointing work

Hello,

We're using the TableAPI and want to optimize for checkpoint alignment times. We received some advice to possibly use at-least-once. I'd like to understand how checkpointing works in at-least-once mode so I understand the caveats and can evaluate whether or not that will work for us.

Thanks!

Rex Fenley | Software Engineer - Mobile and Backend

Remind.com | BLOG | FOLLOW US | LIKE US

Yuan Mei

Re: How does at least once checkpointing work

Hey Rex,

You probably will find the link below helpful; it explains how at-least-once (does not have alignment) is different from exactly-once(needs alignment). It also explains how the alignment phase is skipped in the at-least-once mode.

https://ci.apache.org/projects/flink/flink-docs-release-1.12/concepts/stateful-stream-processing.html#exactly-once-vs-at-least-once

In a high level, at least once mode for a task with multiple input channels

1. does NOT block processing to wait for barriers from all inputs, meaning the task keeps processing data after receiving a barrier even if it has multiple inputs.

2. but still, a task takes a snapshot after seeing the checkpoint barrier from all input channels.

In this way, a Snapshot N may contain data change coming from Epoch N+1; that's where "at least once" comes from.

On Tue, Jan 12, 2021 at 1:03 PM Rex Fenley <[hidden email]> wrote:

Hello,

We're using the TableAPI and want to optimize for checkpoint alignment times. We received some advice to possibly use at-least-once. I'd like to understand how checkpointing works in at-least-once mode so I understand the caveats and can evaluate whether or not that will work for us.

Thanks!
--
Rex Fenley | Software Engineer - Mobile and Backend

Remind.com | BLOG | FOLLOW US | LIKE US

Rex Fenley

Re: How does at least once checkpointing work

Thanks for the info.

It sounds like any state which does not have some form of uniqueness could end up being incorrect.

Specifically in my case, all rows passing through the execution graph have unique ids. However, any operator from groupby foreign_key then sum/count could end up with an inconsistent count. Normally a retract (-1) and then insert (+1) would keep the count correct, but with "at least once" a retract (-1) may be from epoch n+1 and therefore played twice, making the count equal less than it should actually be.

Am I understanding this correctly?

Thanks!

On Mon, Jan 11, 2021 at 10:06 PM Yuan Mei <[hidden email]> wrote:

Hey Rex,

You probably will find the link below helpful; it explains how at-least-once (does not have alignment) is different from exactly-once(needs alignment). It also explains how the alignment phase is skipped in the at-least-once mode.

https://ci.apache.org/projects/flink/flink-docs-release-1.12/concepts/stateful-stream-processing.html#exactly-once-vs-at-least-once

In a high level, at least once mode for a task with multiple input channels
1. does NOT block processing to wait for barriers from all inputs, meaning the task keeps processing data after receiving a barrier even if it has multiple inputs.
2. but still, a task takes a snapshot after seeing the checkpoint barrier from all input channels.

In this way, a Snapshot N may contain data change coming from Epoch N+1; that's where "at least once" comes from.

On Tue, Jan 12, 2021 at 1:03 PM Rex Fenley <[hidden email]> wrote:
Hello,

We're using the TableAPI and want to optimize for checkpoint alignment times. We received some advice to possibly use at-least-once. I'd like to understand how checkpointing works in at-least-once mode so I understand the caveats and can evaluate whether or not that will work for us.

Thanks!
--
Rex Fenley | Software Engineer - Mobile and Backend

Remind.com | BLOG | FOLLOW US | LIKE US

Rex Fenley | Software Engineer - Mobile and Backend

Remind.com | BLOG | FOLLOW US | LIKE US

Yuan Mei

Re: How does at least once checkpointing work

It sounds like any state which does not have some form of uniqueness could end up being incorrect.

at least once usually works if the use case can tolerate a certain level of duplication or the computation is idempotent.

Specifically in my case, all rows passing through the execution graph have unique ids. However, any operator from groupby foreign_key then sum/count could end up with an inconsistent count. Normally a retract (-1) and then insert (+1) would keep the count correct, but with "at least once" a retract (-1) may be from epoch n+1 and therefore played twice, making the count equal less than it should actually be.

Not completely sure how the "retract (-1)" and "insert (+1)" work in your case, but "input data" that leads to a state change (count/sum change) is possible to be played twice after a recovery.

Am I understanding this correctly?

Thanks!

On Mon, Jan 11, 2021 at 10:06 PM Yuan Mei <[hidden email]> wrote:
Hey Rex,

You probably will find the link below helpful; it explains how at-least-once (does not have alignment) is different from exactly-once(needs alignment). It also explains how the alignment phase is skipped in the at-least-once mode.

https://ci.apache.org/projects/flink/flink-docs-release-1.12/concepts/stateful-stream-processing.html#exactly-once-vs-at-least-once

In a high level, at least once mode for a task with multiple input channels
1. does NOT block processing to wait for barriers from all inputs, meaning the task keeps processing data after receiving a barrier even if it has multiple inputs.
2. but still, a task takes a snapshot after seeing the checkpoint barrier from all input channels.

In this way, a Snapshot N may contain data change coming from Epoch N+1; that's where "at least once" comes from.

On Tue, Jan 12, 2021 at 1:03 PM Rex Fenley <[hidden email]> wrote:
Hello,

We're using the TableAPI and want to optimize for checkpoint alignment times. We received some advice to possibly use at-least-once. I'd like to understand how checkpointing works in at-least-once mode so I understand the caveats and can evaluate whether or not that will work for us.

Thanks!
--
Rex Fenley | Software Engineer - Mobile and Backend

Remind.com | BLOG | FOLLOW US | LIKE US

--
Rex Fenley | Software Engineer - Mobile and Backend

Remind.com | BLOG | FOLLOW US | LIKE US

Rex Fenley

Re: How does at least once checkpointing work

Thanks!

On Tue, Jan 12, 2021 at 1:56 AM Yuan Mei <[hidden email]> wrote:

It sounds like any state which does not have some form of uniqueness could end up being incorrect.

at least once usually works if the use case can tolerate a certain level of duplication or the computation is idempotent.

Specifically in my case, all rows passing through the execution graph have unique ids. However, any operator from groupby foreign_key then sum/count could end up with an inconsistent count. Normally a retract (-1) and then insert (+1) would keep the count correct, but with "at least once" a retract (-1) may be from epoch n+1 and therefore played twice, making the count equal less than it should actually be.

Not completely sure how the "retract (-1)" and "insert (+1)" work in your case, but "input data" that leads to a state change (count/sum change) is possible to be played twice after a recovery.

Am I understanding this correctly?

Thanks!

On Mon, Jan 11, 2021 at 10:06 PM Yuan Mei <[hidden email]> wrote:
Hey Rex,

You probably will find the link below helpful; it explains how at-least-once (does not have alignment) is different from exactly-once(needs alignment). It also explains how the alignment phase is skipped in the at-least-once mode.

https://ci.apache.org/projects/flink/flink-docs-release-1.12/concepts/stateful-stream-processing.html#exactly-once-vs-at-least-once

In a high level, at least once mode for a task with multiple input channels
1. does NOT block processing to wait for barriers from all inputs, meaning the task keeps processing data after receiving a barrier even if it has multiple inputs.
2. but still, a task takes a snapshot after seeing the checkpoint barrier from all input channels.

In this way, a Snapshot N may contain data change coming from Epoch N+1; that's where "at least once" comes from.

On Tue, Jan 12, 2021 at 1:03 PM Rex Fenley <[hidden email]> wrote:
Hello,

We're using the TableAPI and want to optimize for checkpoint alignment times. We received some advice to possibly use at-least-once. I'd like to understand how checkpointing works in at-least-once mode so I understand the caveats and can evaluate whether or not that will work for us.

Thanks!
--
Rex Fenley | Software Engineer - Mobile and Backend

Remind.com | BLOG | FOLLOW US | LIKE US

--
Rex Fenley | Software Engineer - Mobile and Backend

Remind.com | BLOG | FOLLOW US | LIKE US

Rex Fenley | Software Engineer - Mobile and Backend

Remind.com | BLOG | FOLLOW US | LIKE US