Hi,
A few things to clarify first:
1. What is the sink you are using? Checkpointing in Flink gives you exactly-once state updates, but whether end-to-end exactly-once delivery can be achieved depends on the sink. For data-store sinks such as Cassandra or Elasticsearch, delivery can be made effectively exactly-once through idempotent writes (depending on the application logic); a rough sketch of that pattern is included at the end of this reply. For a Kafka topic as a sink, delivery is currently only at-least-once. You can check out [1] for an overview.
2. Also note that if there are already duplicates in the consumed Kafka topic (which can happen, since the Kafka producer does not support any kind of transactions at the moment), then they will all be consumed and processed by Flink. One way to handle that is to deduplicate within the job itself, as in the sketch right after this point.
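Here is a rough sketch of in-job deduplication using keyed state. It is only an illustration, not a drop-in solution: the String events and the self-keying KeySelector are placeholders for your own event type and its unique ID field.

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DedupExample {

    // Drops any event whose key has already been seen. The flag lives in
    // Flink keyed state, so it is checkpointed and restored on recovery.
    public static class DedupFilter extends RichFilterFunction<String> {
        private transient ValueState<Boolean> seen;

        @Override
        public void open(Configuration parameters) {
            seen = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("seen", Boolean.class));
        }

        @Override
        public boolean filter(String event) throws Exception {
            if (seen.value() != null) {
                return false; // duplicate of an already-processed event
            }
            seen.update(true);
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // "a" occurs twice in the input but is emitted only once.
        DataStream<String> events = env.fromElements("a", "b", "a", "c");

        events.keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String value) {
                        return value;
                    }
                })
                .filter(new DedupFilter())
                .print();

        env.execute("dedup-example");
    }
}

Keep in mind that the seen-flag state grows with the number of distinct keys, so for an unbounded key space you would need some form of state cleanup.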
Duplicates in the source, however, do not explain the missing data; with checkpointing enabled, data loss should not happen. So for that part, yes, I would check whether there is an issue in the application logic, or whether those events were simply never in the consumed Kafka topic in the first place.
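As for the idempotent-writes pattern from point 1, here is a minimal sketch combining exactly-once checkpointing with a Cassandra sink. The keyspace/table (demo.counts), contact point, checkpoint interval, and sample data are made-up placeholders, and it assumes the flink-connector-cassandra dependency is on the classpath.

import com.datastax.driver.core.Cluster;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.cassandra.CassandraSink;
import org.apache.flink.streaming.connectors.cassandra.ClusterBuilder;

public class IdempotentSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Exactly-once here refers to Flink's internal state; the sink
        // writes still need to be idempotent for effectively exactly-once
        // end-to-end results.
        env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE);

        DataStream<Tuple2<String, Long>> counts =
                env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 2L));

        // Cassandra INSERTs are upserts on the primary key ("id" here),
        // so records replayed after a recovery overwrite the same row
        // instead of creating duplicates.
        CassandraSink.addSink(counts)
                .setQuery("INSERT INTO demo.counts (id, cnt) VALUES (?, ?);")
                .setClusterBuilder(new ClusterBuilder() {
                    @Override
                    public Cluster buildCluster(Cluster.Builder builder) {
                        return builder.addContactPoint("127.0.0.1").build();
                    }
                })
                .build();

        env.execute("idempotent-sink-example");
    }
}

Note that this only works if the write for a given record always targets the same row; non-deterministic or aggregating writes would not be idempotent.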
Cheers,
Gordon
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html

On 17 April 2017 at 12:14:00 PM, F.Amara ([hidden email]) wrote:
Hi all,
I'm using Flink 1.2.0. I have a distributed system where the Flink High
Availability feature is activated. Data is produced through a Kafka broker,
and in a TaskManager (TM) failure scenario the cluster restarts.
Checkpointing is enabled with exactly-once processing.
The problem I encounter is that at the end of data processing I receive
duplicated data, and some data is also missing (e.g., if 2000 events are
sent, around 800 events are lost and some events arrive duplicated at the
receiving end).
Is this an issue with the Flink version, or could it be an issue in my
program logic?