Re: Duplicated data when using Externalized Checkpoints in a Flink Highly Available cluster
Posted by Nico Kruber
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Duplicated-data-when-using-Externalized-Checkpoints-in-a-Flink-Highly-Available-cluster-tp13301p13581.html
Hi Amara,
please refer to [1] for details on our checkpointing mechanism. In short, for
your situation:
* checkpoints are made at certain checkpoint barriers,
* in between those barriers, processing continues and so does the output,
* in case of a failure, the state from the latest completed checkpoint is
restored,
* processing then restarts from there, and you will see the same outputs
again.
You seem not to deliver to Kafka but only to consume from it and write to a
CSV file. If this write were transactional, you would commit only at each
checkpoint barrier and would never see the "duplicate", i.e. uncommitted,
events.
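To make the replay behaviour above concrete, here is a toy Python sketch (not
Flink code; every name in it is an illustrative assumption) of a source that is
rewound to the last checkpointed offset after a failure. A non-transactional
sink shows the replayed records twice, while a sink that commits only at
checkpoint barriers shows each record exactly once:

```python
def run_pipeline(events, checkpoint_interval, fail_after, transactional):
    """Process `events`, crash once after `fail_after` emitted records,
    restore from the latest checkpoint, then finish the input."""
    output = []           # what the sink has durably written
    pending = []          # uncommitted records (transactional sink only)
    checkpointed_pos = 0  # source offset stored in the latest checkpoint
    pos = 0
    emitted = 0
    crashed = False

    while pos < len(events):
        record = events[pos]
        if transactional:
            pending.append(record)       # buffer until the next barrier
        else:
            output.append(record)        # visible immediately
        pos += 1
        emitted += 1

        if not crashed and emitted == fail_after:
            # Simulated failure: state is restored from the latest
            # checkpoint and the source is rewound to that offset.
            crashed = True
            pending = []
            pos = checkpointed_pos
            continue

        if pos % checkpoint_interval == 0:
            # Checkpoint barrier: commit buffered output, remember offset.
            if transactional:
                output.extend(pending)
                pending = []
            checkpointed_pos = pos

    if transactional:
        output.extend(pending)           # final commit at end of input
    return output

events = list(range(10))
print(run_pipeline(events, 4, 6, transactional=False))
# → [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9]  (4 and 5 replayed)
print(run_pipeline(events, 4, 6, transactional=True))
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The replayed records (4 and 5) are exactly those emitted after the last
checkpoint barrier and before the failure, which is the "duplicate" data you
observe in the CSV file.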
Regards,
Nico
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/stream_checkpointing.html
On Monday, 5 June 2017 08:55:05 CEST F.Amara wrote:
> Hi Robert,
>
> I have a few more questions to clarify.
>
> 1) Why do you say printing the values to the standard out would display
> duplicates even if exactly once works? What is the reason for this? Could
> you brief me on this?
>
> 2) I observed duplicates (by writing to a file) starting from the
> FlinkKafkaConsumer onwards. Why does this component introduce duplicates? Is
> it because Kafka guarantees only at-least-once delivery at the moment?
>
> Thanks,
> Amara
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Duplicated-data-when-using-Externalized-Checkpoints-in-a-Flink-Highly-Available-cluster-tp13301p13483.html
> Sent from the Apache Flink User Mailing List archive mailing list archive
> at Nabble.com.