Deduplicate messages from Kafka topic

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Deduplicate messages from Kafka topic

ljwagerfield
As I understand it, the Flink Kafka Producer may emit duplicates to Kafka topics.

How can I deduplicate these messages when reading them back with Flink (via the Flink Kafka Consumer)?

For example, is there any out-the-box support for deduplicating messages, i.e. by incorporating something like "idempotent producers" as proposed by Jay Krepps (which, as I understand it, involves maintaining a "high watermark" on a message-by-message level)?
Reply | Threaded
Open this post in threaded view
|

Re: Deduplicate messages from Kafka topic

Tzu-Li (Gordon) Tai
Hi,

You’re correct that the FlinkKafkaProducer may emit duplicates to Kafka topics, as it currently only provides at-least-once guarantees.
Note that this isn’t a restriction only in the FlinkKafkaProducer, but a general restriction for Kafka's message delivery.
This can definitely be improved to exactly-once (no duplicates produced into topics) once Kafka supports transactional messaging.

On the consumer side, the FlinkKafkaConsumer doesn’t have built-in support to dedupe the messages read from topics.
On the other hand this isn’t really feasible, as consumers could basically only view messages with different offsets as separate independent messages, unless identified by some user application-level logic.
So in the end, we’ll need to rely on the assumption that messages produced into Kafka topics are not duplicated, which as explained above, will hopefully be available in the near future.

Cheers,
Gordon

On January 14, 2017 at 6:12:29 PM, ljwagerfield ([hidden email]) wrote:

As I understand it, the Flink Kafka Producer may emit duplicates to Kafka
topics.

How can I deduplicate these messages when reading them back with Flink (via
the Flink Kafka Consumer)?

For example, is there any out-the-box support for deduplicating messages,
i.e. by incorporating something like "idempotent producers" as proposed by
Jay Krepps (which, as I understand it, involves maintaining a "high
watermark" on a message-by-message level)?



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Deduplicate-messages-from-Kafka-topic-tp11051.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.