Re: Duplicated data when using Externalized Checkpoints in a Flink Highly Available cluster

Posted by F.Amara on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Duplicated-data-when-using-Externalized-Checkpoints-in-a-Flink-Highly-Available-cluster-tp13301p13481.html

Hi Robert,

Thanks a lot for the reply.

To further explain how I verify the presence of duplicates: I write the stream received by the FlinkKafkaConsumer (i.e. the events originally sent by the KafkaProducer) to a CSV file.
The file is then scanned to check, first, whether we received exactly the number of events the KafkaProducer sent, and second, whether any values appear more than once, which would indicate duplicates.
In my case the total number of events received is always higher than the number sent.
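The scan over the CSV can be sketched roughly as below. This is a minimal illustration, not my actual code; it assumes the CSV holds one event value per line, already loaded into a list, and simply counts how often each value occurs:

```java
import java.util.*;

public class DuplicateCheck {
    // Returns a map of value -> occurrence count, keeping only values
    // that appeared more than once (i.e. the duplicates).
    static Map<String, Integer> findDuplicates(List<String> events) {
        Map<String, Integer> counts = new HashMap<>();
        for (String e : events) {
            counts.merge(e, 1, Integer::sum);
        }
        counts.values().removeIf(c -> c == 1);
        return counts;
    }

    public static void main(String[] args) {
        // In practice these lines would be read from the consumer's CSV output.
        List<String> received = Arrays.asList("e1", "e2", "e2", "e3");
        System.out.println("received=" + received.size()
                + " duplicates=" + findDuplicates(received));
    }
}
```

Comparing `received.size()` against the number of events the producer sent, together with a non-empty duplicates map, is what tells me replays happened.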

The following diagram illustrates the setup.

|---------------------------|     |-----------|     |----------------------|
|       KafkaProducer       |---->|   Kafka   |---->|  FlinkKafkaConsumer  |
| (A separate Java process  |     |           |     |  (Starting point of  |
|  which generates data     |     |           |     |   Flink application) |
|  and writes to Kafka)     |     |-----------|     |----------------------|
|---------------------------|


Thanks,
Amara