Checkpoint for exactly-once processing

Checkpoint for exactly-once processing

Francis Aranda
Hello, 

I'm trying to understand how checkpointing works for exactly-once processing in Flink, and I have some questions.

The documentation says that when there is a failure and the state of an operator is restored, the already processed records are deleted based on their identifiers.

My question is: how are these identifiers maintained between two checkpoints? Does Flink persist every new input record that arrives at a stateful operator before taking the checkpoint? Otherwise, there may be messages to reprocess after a failure.

Thanks!
Re: Checkpoint for exactly-once processing

Stephan Ewen
Hi!

I think there is a misunderstanding. There are no identifiers maintained and no individual records deleted.

On recovery, all operators reset their state to a consistent snapshot: https://ci.apache.org/projects/flink/flink-docs-release-0.10/internals/stream_checkpointing.html


Greetings,
Stephan


In reply to Don Frascuchon's message of Wed, Jan 13, 2016 at 11:08 AM.

Re: Checkpoint for exactly-once processing

Francis Aranda
Hi Stephan,

Thanks for your quick response.

So, consider an operator task that has processed two records, with no barrier received yet. If the task fails and must be recovered, the last consistent snapshot will be used, which does not include information about the records that were processed but not yet checkpointed. What happens in this situation? Will those records be resent to the failed task afterwards, or will they be discarded? How does Flink manage information about these records to provide exactly-once guarantees? Must the user function inside the operator be idempotent? (I am thinking of some kind of persistence in a sink task.)

Thanks in advance!


In reply to Stephan Ewen's message of Wed, Jan 13, 2016 at 11:17.

Re: Checkpoint for exactly-once processing

Tzu-Li Tai
Hi Francis,

Part of every complete snapshot is the set of record positions associated with the barrier that triggered the checkpointing of this snapshot. The snapshot is completed only when all the records within the checkpoint have reached the sink. When a topology fails, the state of all operators falls back to the latest complete snapshot (incomplete snapshots are ignored). The data source also falls back to the position recorded with this snapshot, so even if some data records are read again after the restore, the restored operator state is free of the effects of those records. This way, Flink guarantees that each record affects every operator's state exactly once. The user functions in operators do not need to be idempotent.
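As an illustration only, here is a minimal Python sketch of that recovery behaviour. It is not Flink code: the `CountingJob` and `Snapshot` classes, the word-count operator, and the list-backed replayable source are all assumptions made up for this example. It shows that when the state and the source position roll back together, re-reading records after a failure does not double-count them, even though the update itself is not idempotent.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """One completed checkpoint: source position plus operator state."""
    source_offset: int
    counts: dict


@dataclass
class CountingJob:
    """Toy single-operator job counting words from a replayable source."""
    records: list
    offset: int = 0                              # next record to read
    counts: dict = field(default_factory=dict)   # operator state
    last_snapshot: Snapshot = field(default_factory=lambda: Snapshot(0, {}))

    def process_next(self):
        word = self.records[self.offset]
        self.counts[word] = self.counts.get(word, 0) + 1  # not idempotent
        self.offset += 1

    def checkpoint(self):
        # Record the source position and a copy of the state together.
        self.last_snapshot = Snapshot(self.offset, dict(self.counts))

    def fail_and_restore(self):
        # Both the state and the source position roll back to the snapshot.
        self.offset = self.last_snapshot.source_offset
        self.counts = dict(self.last_snapshot.counts)

    def run_to_end(self):
        while self.offset < len(self.records):
            self.process_next()


job = CountingJob(["a", "b", "a", "c", "a"])
job.process_next()       # "a"
job.process_next()       # "b"
job.checkpoint()         # snapshot: offset 2, counts {"a": 1, "b": 1}
job.process_next()       # "a" processed but not checkpointed
job.process_next()       # "c" processed but not checkpointed
job.fail_and_restore()   # effects of the last two records are discarded
job.run_to_end()         # "a", "c", "a" are read again from offset 2
print(job.counts)        # {'a': 3, 'b': 1, 'c': 1}
```

The final counts match a failure-free run, because the two records processed after the checkpoint were replayed against state that no longer contained their effects.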

Hope this helps answer your question!

Cheers,
Gordon
Re: Checkpoint for exactly-once processing

Stephan Ewen
Thanks, Gordon, for the nice answer!

One important thing to add: exactly-once refers to state maintained by Flink. All side effects (changes made to the "outside" world), which includes sinks, must in fact be idempotent; otherwise they will only have "at-least-once" semantics.

In practice, this often works very well, because results can be computed with "exactly-once" semantics in Flink and are then sent "one or more times" to the outside world (for example, a database). If that database simply overwrites/replaces values for keys (think of an upsert operation), this gives end-to-end exactly-once semantics.
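A small Python sketch of the difference, under stated assumptions: `AppendSink` and `UpsertSink` are hypothetical stand-ins for an external database, not Flink sink APIs, and the replay is assumed to resend the same values in the same order. The same results are delivered "one or more times" to both sinks.

```python
class AppendSink:
    """Non-idempotent sink: every delivery adds a row."""

    def __init__(self):
        self.rows = []

    def write(self, key, value):
        self.rows.append((key, value))


class UpsertSink:
    """Idempotent sink: a delivery overwrites the value stored for its key."""

    def __init__(self):
        self.table = {}

    def write(self, key, value):
        self.table[key] = value


# Results computed exactly once inside the job (hypothetical values).
results = [("a", 1), ("b", 1), ("a", 2)]

# At-least-once delivery: after a failure the last two results are sent again.
deliveries = results + results[1:]

append_sink, upsert_sink = AppendSink(), UpsertSink()
for key, value in deliveries:
    append_sink.write(key, value)
    upsert_sink.write(key, value)

print(len(append_sink.rows))  # 5
print(upsert_sink.table)      # {'a': 2, 'b': 1}
```

The append-style sink ends up with five rows for three results, while the upsert-style sink holds the same table it would after a single delivery.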

Greetings,
Stephan


In reply to Tzu-Li (Gordon) Tai's message of Wed, Jan 13, 2016 at 1:31 PM.



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-for-exact-once-proccessing-tp4261p4264.html
Sent from the Apache Flink User Mailing List archive at Nabble.com.

Re: Checkpoint for exactly-once processing

Francis Aranda
Thanks to both of you!

That helps me understand the recovery process.

In reply to Stephan Ewen's message of Wed, Jan 13, 2016 at 14:01.