Hi,
I have just started working with Flink and am building a project that reads data from Kafka and processes it. A few questions came to mind:

1. Will the Flink Kafka 0.8.x consumer run on multiple task slots or a single task slot? In other words, is it a parallel operation or a non-parallel operation?
2. If it is a parallel operation, do multiple task slots read from a single Kafka partition or from multiple Kafka partitions?
3. If data is read from multiple Kafka partitions, how is duplication avoided? Is that handled by Kafka or by Flink?

Rahul Raj
Hi Rahul!
Yes, the FlinkKafkaConsumer is a parallel consumer. Each parallel instance of a FlinkKafkaConsumer source can subscribe to multiple Kafka partitions, and each Kafka partition is handled by exactly one parallel instance.
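In case a concrete sketch helps, setting this up might look roughly like the following (using the Kafka 0.8 connector; the topic name, broker/ZooKeeper addresses, group id, parallelism, and checkpoint interval are all placeholders of my own, and the checkpointing call ties into the exactly-once point below):

```java
import java.util.Properties;

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ParallelKafkaConsumerSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Enable checkpointing: the FlinkKafkaConsumer stores the Kafka offsets it has
        // read as part of each checkpoint, so after a failure Flink rewinds to the last
        // completed checkpoint and re-reads from those offsets (exactly-once within Flink).
        env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker list
        props.setProperty("zookeeper.connect", "localhost:2181"); // required by the 0.8 consumer
        props.setProperty("group.id", "my-flink-group");          // placeholder group id

        // The source runs with 4 parallel subtasks; the partitions of "my-topic" are
        // distributed across them, and each partition is read by exactly one subtask.
        DataStream<String> stream = env
                .addSource(new FlinkKafkaConsumer08<>("my-topic", new SimpleStringSchema(), props))
                .setParallelism(4);

        stream.print();
        env.execute("Parallel Kafka consumer sketch");
    }
}
```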
I'm not sure exactly what you are referring to by "duplication" here. Do you mean duplication in the data itself in the Kafka topics, or duplicated consumption by Flink?

If it is the former: prior to Kafka 0.11, Kafka writes did not support transactions, so writes to Kafka can only be at-least-once.

If you mean the latter: the FlinkKafkaConsumer achieves exactly-once guarantees when consuming from Kafka topics by using Flink's checkpointing mechanism. You can read about that here [1][2].

Hope the pointers help!

- Gordon

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/stream_checkpointing.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/connectors/kafka.html#kafka-consumers-and-fault-tolerance