Hello,
I'm trying to understand how checkpointing provides exactly-once processing in Flink, and I have some doubts. The documentation says that when there is a failure and the state of an operator is restored, the already processed records are deleted based on their identifiers. My question is: how are these identifiers maintained between two checkpoints? Does Flink persist every new input record that reaches the stateful operator before taking the checkpoint? Otherwise, there may be messages to reprocess after a failure. Thanks! |
Hi! I think there is a misunderstanding. There are no identifiers maintained and no individual records deleted. On recovery, all operators reset their state to a consistent snapshot: https://ci.apache.org/projects/flink/flink-docs-release-0.10/internals/stream_checkpointing.html Greetings, Stephan On Wed, Jan 13, 2016 at 11:08 AM, Don Frascuchon <[hidden email]> wrote:
|
Hi Stephan, So, consider an operator task that has processed two records and has not yet received a barrier. If the task fails and must be recovered, the last consistent snapshot will be used, which does not include information about the records that were processed but not checkpointed. What happens in this situation? Will those records be resent to the failed task afterwards, or will they be discarded? How does Flink manage information about these records for exactly-once guarantees? Must the user function inside the operator be idempotent? (I am thinking of some kind of persistence in a sink task.) Thanks in advance! On Wed, Jan 13, 2016 at 11:17, Stephan Ewen (<[hidden email]>) wrote:
|
Hi Francis,
A part of every complete snapshot is the record positions associated with the barrier that triggered the checkpointing of this snapshot. The snapshot is completed only when all the records within the checkpoint reach the sink. When a topology fails, the state of all operators falls back to the latest complete snapshot (incomplete snapshots are ignored). The data source also falls back to the position recorded with this snapshot, so even if some records are read again after the restore, the restored operator state does not yet contain the effects of those records. This way, Flink guarantees exactly-once effects of each record on every operator's state. The user functions in operators need not be idempotent. Hope this helps answer your question! Cheers, Gordon |
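To make the recovery described above concrete, here is a small sketch in plain Python. It is not Flink code and does not use any Flink API; the names `Snapshot`, `run`, `checkpoint_every` and `fail_at` are made up for illustration, and the single-operator word count is an assumed toy job. It only shows the idea that resetting the operator state and the source position together gives the same final state as a run without failure, even though some records are processed twice.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    source_offset: int  # input position recorded with the checkpoint
    state: dict         # copy of the operator state at that position


def run(records, checkpoint_every, fail_at=None):
    """Count words; optionally crash once after `fail_at` processed records.

    Returns the final state and the number of records processed,
    where replayed records are counted again.
    """
    state = {}
    offset = 0
    snapshot = Snapshot(0, {})  # empty snapshot before any input
    processed = 0
    failed = False
    while offset < len(records):
        word = records[offset]
        state[word] = state.get(word, 0) + 1
        offset += 1
        processed += 1
        # Periodically snapshot state together with the source position.
        if offset % checkpoint_every == 0:
            snapshot = Snapshot(offset, dict(state))
        # Simulated failure: discard current state, restore the snapshot,
        # and rewind the source to the position stored with it.
        if fail_at is not None and not failed and processed == fail_at:
            failed = True
            state = dict(snapshot.state)
            offset = snapshot.source_offset
    return state, processed


records = ["a", "b", "a", "c", "a", "b", "c"]
clean_state, clean_processed = run(records, checkpoint_every=3)
recovered_state, recovered_processed = run(records, checkpoint_every=3, fail_at=5)
```

In this sketch the failing run processes two records a second time, but because the restored state did not yet contain their effect, `recovered_state` equals `clean_state`. No per-record identifiers are needed.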
Thanks, Gordon, for the nice answer! One thing is important to add: Exactly-once refers to state maintained by Flink. All side effects (changes made to the "outside" world), which includes sinks, need in fact to be idempotent, or will only have "at-least once" semantics. In practice, this often works very well, because results can be computed with "exactly once" semantics in Flink and are then sent "one or more times" to the outside world (for example a database). If that database simply overwrites/replaces values for keys (think upsert operation), this gives end-to-end exactly-once semantics. Greetings, Stephan On Wed, Jan 13, 2016 at 1:31 PM, Tzu-Li (Gordon) Tai <[hidden email]> wrote: |
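The point about idempotent sinks can be illustrated with another plain-Python sketch (again not Flink code; `deliver_with_replay`, the sample `results` and the two sink variables are hypothetical). It assumes results are keyed and that, after a failure, everything since the last checkpoint is sent to the sink again in the same order. An appending sink ends up with duplicates, while a sink that overwrites by key ends in the same state as a single delivery.

```python
def deliver_with_replay(results, failed_after, restart_from):
    """Return the sequence a sink would see if the job fails after emitting
    `failed_after` results and then re-emits everything from `restart_from`."""
    return results[:failed_after] + results[restart_from:]


# (key, value) results computed with exactly-once state semantics.
results = [("a", 1), ("b", 1), ("a", 2), ("b", 2)]

# Failure after three emitted results; last checkpoint covered only the first.
delivered = deliver_with_replay(results, failed_after=3, restart_from=1)

append_sink = []  # non-idempotent: every delivery adds a row
upsert_sink = {}  # idempotent: a delivery overwrites the value for its key
for key, value in delivered:
    append_sink.append((key, value))
    upsert_sink[key] = value

exactly_once_view = dict(results)  # what one delivery per result would give
```

Here `append_sink` holds six rows for four results, which is at-least-once behaviour, while `upsert_sink` matches `exactly_once_view`. During the replay an upsert sink can briefly show an older value for a key before the newer one is written again.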
Thanks to both!
That helps me understand the recovery process. On Wed, Jan 13, 2016 at 14:01, Stephan Ewen (<[hidden email]>) wrote:
|