Re: Duplicated data when using Externalized Checkpoints in a Flink Highly Available cluster

Posted by rmetzger0 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Duplicated-data-when-using-Externalized-Checkpoints-in-a-Flink-Highly-Available-cluster-tp13301p13433.html

Hi Amara,
how are you validating whether your output contains duplicates?

If you are just writing the output to another Kafka topic or printing it to standard out, you will see duplicates even when exactly-once works.
Flink does not provide exactly-once delivery to external systems. Flink provides exactly-once semantics for registered state.
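
To illustrate what "exactly-once for registered state" buys you: you can deduplicate downstream on top of keyed state, which is exactly what the guarantee covers. This is only a minimal sketch, not code from your application -- the Event type and its getId() method are assumptions:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Sketch: drop events whose id has already been seen on this key. The
// "seen" flag lives in registered keyed state, so it is included in every
// checkpoint and restored consistently after a failure. Event/getId() are
// placeholders for whatever your records actually look like.
public class DeduplicateFn extends RichFlatMapFunction<Event, Event> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen", Boolean.class));
    }

    @Override
    public void flatMap(Event event, Collector<Event> out) throws Exception {
        if (seen.value() == null) {   // first time this id shows up
            seen.update(true);
            out.collect(event);       // replayed duplicates are swallowed
        }
    }
}

// usage: events.keyBy(e -> e.getId()).flatMap(new DeduplicateFn())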

This means the sink needs to cooperate with the system to achieve exactly-once results. For files, for example, you need to remove invalid data left over from previously failed checkpoints. Our bucketing sink does exactly that.
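
For reference, here is a minimal sketch of how those pieces fit together (the HDFS path and the fromElements source are just placeholders for your Kafka pipeline): exactly-once checkpointing, retained externalized checkpoints, and the BucketingSink from flink-connector-filesystem, which only finalizes files once a checkpoint completes:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

public class BucketingSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Take exactly-once checkpoints every 10 seconds and retain the
        // externalized checkpoint data when the job is cancelled.
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Stand-in for your FlinkKafkaConsumer source; the sink is the
        // point of this sketch. The BucketingSink moves files from
        // in-progress to pending to final only when a checkpoint
        // completes, so readers that consume only finalized files never
        // see data from a failed attempt.
        env.fromElements("event-1", "event-2", "event-3")
           .addSink(new BucketingSink<String>("hdfs:///tmp/flink-output"));

        env.execute("bucketing-sink-sketch");
    }
}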


On Tue, May 30, 2017 at 9:01 AM, F.Amara <[hidden email]> wrote:
Hi Gordon,

Thanks a lot for the reply.
The events are produced using a KafkaProducer, submitted to a topic, and
consumed by the Flink application through a FlinkKafkaConsumer. I
verified that during a failure-recovery scenario (of the Flink application)
the KafkaProducer was not interrupted, so no duplicated values were sent
from the data source. I observed the output of the FlinkKafkaConsumer and
noticed duplicates starting from that point onwards. Is the
FlinkKafkaConsumer capable of introducing duplicates?

How can I implement exactly-once processing for my application? Could you
please guide me on what I might have missed?

Thanks,
Amara



