Data duplication on a High Availability activated cluster after a Task Manager failure recovery

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Data duplication on a High Availability activated cluster after a Task Manager failure recovery

F.Amara
Hi all,

I'm using Flink 1.2.0. I have a distributed system where Flink High Availability feature is activated. Data is produced using a Kafka broker and on a TM failure scenario, the cluster restarts. Checkpointing is enabled with exactly once processing.
Problem encountered is, at the end of data processing I receive duplicated data and some data are also missing. (ex: if 2000 events are sent it loses around 800 events and some events are duplicated at the receiving end).

Is this an issue with the Flink version or would it be an issue from my program logic?
Reply | Threaded
Open this post in threaded view
|

Re: Data duplication on a High Availability activated cluster after a Task Manager failure recovery

Tzu-Li (Gordon) Tai
Hi,

A few things to clarify first:

1. What is the sink you are using? Checkpointing in Flink allows for exactly-once state updates. Whether or not end-to-end exactly-once delivery can be achieved depends on the sink. For data store sinks such as Cassandra / Elasticsearch, this can be made effectively exactly-once using idempotent writes (depending on the application logic). For a Kafka topic as a sink, currently the delivery is only at-least-once. You can check out [1] for an overview.

2. Also note that if there essentially is already duplicates in the consumed Kafka topic (which may occur since Kafka producing does not support any kind of transactions at the moment), then they will all be consumed and processed by Flink.

However, this does not explain missing data, as this should not happen.
So for this, yes, I would try to check if there’s an issue with the application logic or the events simply were not in the consumed Kafka topic in the first place.

Cheers,
Gordon

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html


On 17 April 2017 at 12:14:00 PM, F.Amara ([hidden email]) wrote:

Hi all,

I'm using Flink 1.2.0. I have a distributed system where Flink High
Availability feature is activated. Data is produced using a Kafka broker and
on a TM failure scenario, the cluster restarts. Checkpointing is enabled
with exactly once processing.
Problem encountered is, at the end of data processing I receive duplicated
data and some data are also missing. (ex: if 2000 events are sent it loses
around 800 events and some events are duplicated at the receiving end).

Is this an issue with the Flink version or would it be an issue from my
program logic?



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Data-duplication-on-a-High-Availability-activated-cluster-after-a-Task-Manager-failure-recovery-tp12627.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
Reply | Threaded
Open this post in threaded view
|

Re: Data duplication on a High Availability activated cluster after a Task Manager failure recovery

F.Amara
This post was updated on .
Hi Gordon,

Appreciate your prompt reply. Thanks alot for pointing that out that Kafka Producer has at least once guarantee of message delivery. That seems to be the reason why I encountered duplicated data on a flink failure recovery scenario.