Question about Checkpoint

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Question about Checkpoint

ZalaCheung
Hi Flink Community,

I am new to Flink and now looking at checkpoint of Flink.

After reading the document, I am still confused. Here is scene:

I have a datastream finally flow to a database sink. I will update one of the field in database based on the incomming stream. I have now complete a snapshot, say t1, and snapshot t2 is on progress. After snapshot t1 complete, I update my database several times.  Then BEFORE snapshot t2 complete, my task fail. 

Based on my understanding of checkpoint on Flink, I will recover from snapshot t1 and re-execute the Streams between t1 and t2. But I've already update my database several times after t1. Does flink deal with this issue?


Best,
Desheng Zhang
E-mail: [hidden email];

Reply | Threaded
Open this post in threaded view
|

Re: Question about Checkpoint

Tzu-Li (Gordon) Tai
Hi Desheng,

Welcome to the community!

What you’re asking alludes the question: How does Flink support end-to-end (from external source to external sink, e.g. Kafka to database) exactly-once delivery?
Whether or not that is supported depends on the guarantees of the source and sink and how they work with Flink’s checkpointing; see the overview of the guarantees here [1].

I think you already understand how sources work with checkpoints for exactly-once.
For sinks to be exactly-once, the external system needs to be able to participate in the checkpointing mechanism, so it depends on what the sink is.
The participation is usually in some form of transaction, which is only committed once a job's checkpoint is fully completed.
If the sink doesn’t support this, you can also achieve effective end-to-end exactly-once by making your database writes / updates idempotent, so that replaying the stream since t1 does not affect the end results.

Cheers,
Gordon

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/connectors/guarantees.html


On 27 June 2017 at 11:36:00 AM, ZalaCheung ([hidden email]) wrote:

Hi Flink Community,

I am new to Flink and now looking at checkpoint of Flink.

After reading the document, I am still confused. Here is scene:

I have a datastream finally flow to a database sink. I will update one of the field in database based on the incomming stream. I have now complete a snapshot, say t1, and snapshot t2 is on progress. After snapshot t1 complete, I update my database several times.  Then BEFORE snapshot t2 complete, my task fail. 

Based on my understanding of checkpoint on Flink, I will recover from snapshot t1 and re-execute the Streams between t1 and t2. But I've already update my database several times after t1. Does flink deal with this issue?


Best,
Desheng Zhang
E-mail: [hidden email];