Duplicated data when using Externalized Checkpoints in a Flink Highly Available cluster

Posted by F.Amara on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Duplicated-data-when-using-Externalized-Checkpoints-in-a-Flink-Highly-Available-cluster-tp13301.html

Hi all,

I'm working with Flink 1.2.0, Kafka 0.10.0.1 and Hadoop 2.7.3.

I have a Flink highly available (HA) cluster that reads data from a Kafka producer and processes it within the cluster. To introduce failures, I randomly kill a TaskManager. A restart strategy is configured, and the job does resume processing after a short delay, as expected.
But when I check the output after processing completes, I see duplicates (sending 4,200 events with a 40 ms delay between them, I observed 56 duplicates). As described in [1], I have configured externalized checkpoints, but I still observe duplicates.
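
For reference, this is roughly how the job is configured (a minimal sketch; the checkpoint interval, restart parameters and directory below are placeholders, not my exact values):

import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// checkpoint every 10 s with exactly-once semantics (placeholder interval)
env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE);

// retain checkpoints externally so they survive job cancellation/failure
env.getCheckpointConfig().enableExternalizedCheckpoints(
        ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

// fixed-delay restart: 3 attempts, 10 s apart (placeholder values)
env.setRestartStrategy(
        RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

and in flink-conf.yaml (placeholder path):

state.checkpoints.dir: hdfs:///flink/ext-checkpoints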

Even when I tested with manual savepoints (cancelling the job and restarting it from the savepoint), 2 or 3 duplicates appeared!
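
(In that test I used the CLI roughly like this; the job id and paths are placeholders:)

# take a savepoint and cancel the running job
bin/flink cancel -s hdfs:///flink/savepoints <jobId>

# resubmit the job from the savepoint that was written
bin/flink run -s hdfs:///flink/savepoints/savepoint-xxxx my-job.jar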

Can someone explain how I can use the checkpoint retained through externalized checkpointing to make the application resume processing exactly from where it left off? I need the application to read the checkpoint details and recover from that point automatically, rather than my doing it by hand (see the sketch below).
Or are regular savepoints capable of providing the same functionality automatically?
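
By "doing it manually" I mean having to pass the retained checkpoint's metadata path on every resubmission, roughly (the path is a placeholder):

bin/flink run -s hdfs:///flink/ext-checkpoints/<checkpoint-metadata-path> my-job.jar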

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/checkpoints.html 

Thanks,
Amara