Reading bounded data from Kafka in Flink job


Reading bounded data from Kafka in Flink job

Marchant, Hayden
I have 2 datasets that I need to join together in a Flink batch job. One of the datasets needs to be created dynamically by completely 'draining' a Kafka topic over a given offset range (start and end) and creating a file containing all messages in that range. I know that in Flink streaming I can specify the start offset, but not the end offset. In my case, this preparation of the file from the Kafka topic really operates on a finite, bounded set of data, even though it comes from Kafka.
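To illustrate, this is the start-offset API I am referring to (a minimal sketch, assuming the Flink 1.4-era FlinkKafkaConsumer011; the topic name, partitions and offset values below are made up):

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;

public class StartOffsetExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "bounded-read");

        // Start offsets can be pinned per partition ...
        Map<KafkaTopicPartition, Long> startOffsets = new HashMap<>();
        startOffsets.put(new KafkaTopicPartition("my-topic", 0), 100L);
        startOffsets.put(new KafkaTopicPartition("my-topic", 1), 250L);

        FlinkKafkaConsumer011<String> consumer =
                new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), props);
        consumer.setStartFromSpecificOffsets(startOffsets);
        // ... but there is no matching setter for an end offset.

        env.addSource(consumer).print();
        env.execute("Read from specific start offsets");
    }
}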

Is there a way that I can do this in Flink (either streaming or batch)?

Thanks,
Hayden

Re: Reading bounded data from Kafka in Flink job

Fabian Hueske
Hi Hayden,

As far as I know, an end offset is not supported by Flink's Kafka consumer.
You could extend Flink's consumer: as you said, there is already code to set the starting offset (per partition), so you might be able to piggyback on that.
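
One way to piggyback without forking the consumer could be the deserialization schema: KeyedDeserializationSchema.deserialize() already receives the offset of every record, and returning true from isEndOfStream() stops the consuming subtask. Below is a rough sketch of such a schema (my own hypothetical BoundedOffsetSchema, not something the connector ships); note that it stops the whole subtask, so it is only exact when each subtask reads a single partition or all partitions share the same end offset.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.typeutils.TupleTypeInfo;
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;

// Hypothetical sketch: emits (offset, payload) tuples and signals end-of-stream
// once the configured end offset has been reached. Not part of the Kafka connector.
public class BoundedOffsetSchema implements KeyedDeserializationSchema<Tuple2<Long, String>> {

    private final long endOffset;    // exclusive: the record at this offset is not emitted
    private boolean reachedEnd;      // set by deserialize(), checked by isEndOfStream()

    public BoundedOffsetSchema(long endOffset) {
        this.endOffset = endOffset;
    }

    @Override
    public Tuple2<Long, String> deserialize(
            byte[] messageKey, byte[] message, String topic, int partition, long offset)
            throws IOException {
        if (offset >= endOffset) {
            reachedEnd = true;       // the consumer checks isEndOfStream() before emitting
        }
        return Tuple2.of(offset, new String(message, StandardCharsets.UTF_8));
    }

    @Override
    public boolean isEndOfStream(Tuple2<Long, String> nextElement) {
        return reachedEnd;
    }

    @Override
    public TypeInformation<Tuple2<Long, String>> getProducedType() {
        return new TupleTypeInfo<>(BasicTypeInfo.LONG_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);
    }
}

You would pass this schema to the consumer together with the per-partition start offsets and write the resulting stream to a file sink; once isEndOfStream() fires, the source finishes and the streaming job can terminate on its own.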

Gordon (in CC), who has worked a lot on the Kafka connector, might have a better idea.

Best, Fabian
