Flinkheads,
I'm processing from a Kafka source, using event time with watermarks based on an out-of-orderness threshold, and tumbling time windows to perform some rollups. My sink is idempotent, and I want to ensure exactly-once processing end-to-end. I'm trying to figure out whether I can stick with memory checkpointing and not bother with a persistent checkpointing state backend or savepoints for job redeploys. It would be great if I could just rely on the Kafka consumer's offset persistence to Zookeeper for that 'group.id' -- I see that it saves the relevant offset to Zookeeper when a checkpoint has been triggered. However, I'm concerned about the potential for dropping events if I stick with memory checkpoints.

The documentation describes a checkpoint as being triggered when the relevant barrier has made it all the way to the sink. How does that interact with windowed streams, where some events may be buffered in open windows while later ones make it through? More concretely: when the Kafka consumer persists an offset to Zookeeper on a checkpoint trigger, can I trust that no events from before that offset are still held in any intermediate windowing state?

Thanks!

Shikhar
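For reference, the setup Shikhar describes -- a threshold-based (bounded out-of-orderness) watermark driving tumbling event-time windows -- can be sketched as a toy model. This is plain Python, not Flink code; the window size, threshold, and event values below are all made up for illustration:

```python
WINDOW_SIZE_MS = 10_000   # 10-second tumbling windows (made-up value)
THRESHOLD_MS = 2_000      # allowed out-of-orderness (made-up value)

def window_start(ts):
    """A tumbling window covers [start, start + WINDOW_SIZE_MS)."""
    return ts - (ts % WINDOW_SIZE_MS)

def run(events):
    """events: list of (timestamp_ms, value) pairs; returns (fired, open)."""
    windows = {}            # open window state: window start -> running sum
    fired = {}              # results of windows the watermark has closed
    max_ts = float("-inf")
    for ts, value in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - THRESHOLD_MS   # threshold-based watermark
        start = window_start(ts)
        windows[start] = windows.get(start, 0) + value
        # fire every open window whose end is at or below the watermark
        for s in [s for s in windows if s + WINDOW_SIZE_MS <= watermark]:
            fired[s] = windows.pop(s)
    return fired, windows

fired, still_open = run([(1_000, 1), (4_000, 2), (11_000, 3), (23_000, 4)])
# the events at 1s and 4s roll up into window [0s, 10s); the event at 23s
# pushes the watermark to 21s, which closes the first two windows while
# window [20s, 30s) remains open, holding its event as intermediate state
```

The last comment is exactly the situation the question worries about: at any point in the stream, some already-consumed events exist only inside still-open windows.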
Hi Shikhar,

The currently open windows are also part of the operator state. Whenever a window operator receives a barrier, it checkpoints the state of the user function and, in addition, all uncompleted windows. This also means that the window operator does not buffer barriers: once it has taken the snapshot, the barrier is forwarded to the downstream operators.

So, to answer your question: no, you cannot assume that all events from before the barrier have left the intermediate windowing state once a checkpoint is completed. However, this does not matter, because that intermediate state is checkpointed as well, so you won't lose elements.

Cheers,
Till

On Thu, Feb 4, 2016 at 6:39 AM, shikhar <[hidden email]> wrote:
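Till's point above can be made concrete with a toy simulation (plain Python, not Flink internals; all names and values are illustrative): the checkpoint pairs the source offset with a deep copy of the open-window state, and recovery restores both and replays from the saved offset, so the result matches a failure-free run:

```python
import copy

WINDOW_SIZE = 10  # tumbling window size in event-time units (made-up value)

def process(record, windows):
    """Add a (timestamp, value) record to its tumbling window's running sum."""
    ts, value = record
    start = ts - (ts % WINDOW_SIZE)
    windows[start] = windows.get(start, 0) + value

def run_with_failure(records, checkpoint_at, crash_at=None):
    """Consume records by offset, checkpointing at one offset and optionally
    crashing at a later one. The checkpoint snapshots BOTH the offset and the
    open-window state, mirroring how uncompleted windows are checkpointed
    along with source offsets."""
    windows = {}
    snapshot = None
    offset = 0
    while offset < len(records):
        if offset == crash_at:
            # recover: restore the checkpointed window state, then replay
            # every record from the checkpointed offset onward
            offset, windows = snapshot[0], copy.deepcopy(snapshot[1])
            crash_at = None     # don't crash again on the replay
            continue
        if offset == checkpoint_at:
            snapshot = (offset, copy.deepcopy(windows))
        process(records[offset], windows)
        offset += 1
    return windows

records = [(1, 1), (2, 2), (12, 3), (25, 4)]
clean = run_with_failure(records, checkpoint_at=2)
recovered = run_with_failure(records, checkpoint_at=2, crash_at=3)
```

Here `clean == recovered`: the record `(12, 3)` was consumed after the checkpoint and sat in an unfired window at crash time, but replaying from the checkpointed offset against the restored window state reproduces it, which is why an events-in-open-windows checkpoint plus offset replay loses nothing.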