Checkpoints and event ordering

Checkpoints and event ordering

shikhar
Flinkheads,

I'm processing from a Kafka source, using event time with watermarks based on a threshold, and using tumbling time windows to perform some rollup. My sink is idempotent, and I want to ensure exactly-once processing end-to-end.

I am trying to figure out whether I can stick with memory checkpointing, and not bother with a durable checkpointing state backend or savepoints for job redeploys. It'd be great if I could just rely on the Kafka consumer's offset persistence to Zookeeper for that 'group.id' - I see that it saves the relevant offset to Zookeeper when a checkpoint has been triggered.
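
To make the setup concrete, here is a minimal sketch of such a job, assuming a Flink 1.0-style DataStream API with the Kafka 0.8 connector. The topic name, group.id, broker/Zookeeper addresses, the "timestamp,key,..." record layout, and the placeholder reduce and print() sink are illustrative stand-ins, not details from this thread:

import java.util.Properties;

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaRollupJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Kafka 0.8 consumer properties; offsets for this group.id are tracked in Zookeeper
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder addresses
        props.setProperty("zookeeper.connect", "localhost:2181");
        props.setProperty("group.id", "rollup-job");               // placeholder group id

        DataStream<String> events = env
                .addSource(new FlinkKafkaConsumer08<String>("events", new SimpleStringSchema(), props))
                // event-time timestamps plus watermarks trailing the max timestamp by a threshold
                .assignTimestampsAndWatermarks(new ThresholdWatermarks(5000L));

        events
                .keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String record) {
                        return record.split(",")[1];               // assumes "timestamp,key,..." records
                    }
                })
                .timeWindow(Time.minutes(1))                       // tumbling event-time windows
                .reduce(new ReduceFunction<String>() {             // placeholder for the actual rollup
                    @Override
                    public String reduce(String a, String b) {
                        return a;
                    }
                })
                .print();                                          // stand-in for the idempotent sink

        env.execute("kafka-rollup");
    }

    /** Periodic watermarks that lag the highest observed timestamp by a fixed threshold. */
    public static class ThresholdWatermarks implements AssignerWithPeriodicWatermarks<String> {

        private final long thresholdMillis;
        private long maxTimestamp = Long.MIN_VALUE;

        public ThresholdWatermarks(long thresholdMillis) {
            this.thresholdMillis = thresholdMillis;
        }

        @Override
        public long extractTimestamp(String record, long previousTimestamp) {
            long ts = Long.parseLong(record.split(",")[0]);        // assumes a leading epoch-millis field
            maxTimestamp = Math.max(maxTimestamp, ts);
            return ts;
        }

        @Override
        public Watermark getCurrentWatermark() {
            // watermark lags the highest observed timestamp by the threshold
            return new Watermark(maxTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : maxTimestamp - thresholdMillis);
        }
    }
}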

However, I'm concerned that there is potential for dropping events if I stick with memory checkpoints.

The documentation talks of a checkpoint completing when the relevant barrier has made it all the way to the sink -- how does that interact with windowed streams, where some events might get buffered while later ones make it through?

More concretely, when the Kafka consumer persists an offset to Zookeeper based on receiving a checkpoint trigger, can I trust that all events from before that offset are not held in any windowing intermediate state?

Thanks!

Shikhar

Re: Checkpoints and event ordering

Till Rohrmann
Hi Shikhar,

The currently open windows are also part of the operator state. Whenever a window operator receives a barrier, it checkpoints the state of the user function and, additionally, all uncompleted windows. This also means that the window operator does not buffer the barriers. Once it has taken the snapshot, the barrier is forwarded to the downstream operators.

Thus, to answer your question: no, you cannot assume that all events from before the barrier are no longer held in any windowing intermediate state once a checkpoint is completed. However, this does not matter, because the windowing intermediate state is checkpointed as well, so you won't lose elements.
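
To illustrate the checkpointing side, here is a minimal configuration sketch, again assuming a Flink 1.0-style API; the 10-second interval is an arbitrary example, and the explicit MemoryStateBackend simply spells out the default that the original post refers to as memory checkpointing:

import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSetup {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 10 seconds (arbitrary example interval). Every snapshot
        // covers all operator state: the Kafka source's current read offsets as well as the
        // window operator's uncompleted windows, as described above.
        env.enableCheckpointing(10000);

        // The memory state backend is the default; it keeps the snapshots on the JobManager
        // heap, where they are available for recovery while the job is running.
        env.setStateBackend(new MemoryStateBackend());

        // ... build the Kafka -> window -> sink pipeline and call env.execute(),
        //     as in the sketch from the original post ...
    }
}

With checkpointing enabled like this, the Kafka 0.8 consumer commits its offsets to Zookeeper only after the corresponding checkpoint completes, and that same checkpoint already contains the uncompleted windows, which is the guarantee described above.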

Cheers,
Till
