Flink to get historical data from kafka between timespan t1 & t2


Flink to get historical data from kafka between timespan t1 & t2

VINAY.RAICHUR

Hi Flink Community Team,

 

This is a desperate request for your help with the questions below.

 

I am new to Flink and am trying to use it with Kafka for event-based data stream processing in my project. I am struggling to find Flink solutions for the following project requirements:

 

  1. Get all Kafka topic records at a given time point ‘t’ (now or in the past). Also, how can I pull only the latest record from Kafka using Flink?
  2. Get all records from Kafka for a given time interval in the past, between times t1 and t2.
  3. Continuously get data from Kafka starting at a given time point (now or in the past); the client will actively cancel/close the data stream (example: live dashboards). How can this be done with Flink?

Please provide sample Flink code snippets for pulling data from Kafka for the above three requirements. I have been stuck for the last month without much progress, and your timely help would be a lifesaver!

Thanks & Regards,

Vinay Raichur

T-Systems India | Digital Solutions

Mail: [hidden email]

Mobile: +91 9739488992

 


Re: Flink to get historical data from kafka between timespan t1 & t2

Aljoscha Krettek
Hi,

For your point 3, you can look at
`FlinkKafkaConsumerBase.setStartFromTimestamp(...)`.
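
Roughly, using it looks something like this. This is an untested sketch; the broker address, topic, group id, and timestamp are made up:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class LiveFromTimestamp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // made-up broker address
        props.setProperty("group.id", "live-dashboard");      // made-up consumer group

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("tracking-events", new SimpleStringSchema(), props);

        // Point 3: start from a given point in time (epoch millis) and keep
        // consuming until the client cancels the job.
        consumer.setStartFromTimestamp(1609977600000L); // e.g. 2021-01-07 00:00:00 UTC

        DataStream<String> stream = env.addSource(consumer);
        stream.print();

        env.execute("live-from-timestamp");
    }
}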

Points 1 and 2 will not work with the well-established
`FlinkKafkaConsumer`. However, it should be possible with the
new `KafkaSource` that was introduced in Flink 1.12. It might be a bit
rough around the edges, though.

With the `KafkaSource` you can specify an `OffsetsInitializer` for both
the starting and the stopping offsets of the source. Take a look at
`KafkaSource`, `KafkaSourceBuilder`, and `OffsetsInitializer` in the
code.
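
For points 1 and 2, a bounded read between two timestamps would look roughly like this against the 1.12 API. Again an untested sketch; the broker address, topic, group id, and timestamps are made up, and the deserializer setup may differ slightly in other versions:

import java.util.Arrays;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.kafka.common.serialization.StringDeserializer;

public class HistoricalBetweenT1AndT2 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        long t1 = 1609459200000L; // example: 2021-01-01 00:00:00 UTC, epoch millis
        long t2 = 1609977600000L; // example: 2021-01-07 00:00:00 UTC

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")            // made-up broker address
                .setTopics(Arrays.asList("tracking-events"))  // made-up topic
                .setGroupId("historical-report")
                .setDeserializer(KafkaRecordDeserializer.valueOnly(StringDeserializer.class))
                .setStartingOffsets(OffsetsInitializer.timestamp(t1)) // start reading at t1
                .setBounded(OffsetsInitializer.timestamp(t2))         // stop at t2 and finish
                .build();

        DataStream<String> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-t1-to-t2");
        stream.print();

        env.execute("historical-between-t1-and-t2");
    }
}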

I hope this helps.

Best,
Aljoscha


RE: Flink to get historical data from kafka between timespan t1 & t2

VINAY.RAICHUR
Thanks Aljoscha for your prompt response. It means a lot to me 😊

Could you also attach the code snippets for `KafkaSource`, `KafkaSourceBuilder`, and `OffsetsInitializer` that you were referring to in your previous reply, to make it clearer for me?

Kind regards,
Vinay


Re: Flink to get historical data from kafka between timespan t1 & t2

Aljoscha Krettek
On 2021/01/08 16:55, [hidden email] wrote:
>Could you also attach the code snippets for `KafkaSource`, `KafkaSourceBuilder`, and `OffsetsInitializer` that you were referring to in your previous reply, to make it clearer for me?

Ah sorry, I was referring to the Flink code. You can start with
`KafkaSource` [1], whose Javadocs contain an example block that shows
how to use it.

[1]
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/source/KafkaSource.java

RE: Flink to get historical data from kafka between timespan t1 & t2

VINAY.RAICHUR
Hi Aljoscha,
Currently, our set-up uses the Docker container version of Flink 1.11.2 along with a containerized version of Kafka, running on the Kubernetes cluster of our Vagrant VM (see the screenshot attached to this mail).

a) As you mentioned, `KafkaSource` was introduced in Flink 1.12, so I suppose we have to upgrade to this version of Flink. Can you share a link to a stable Flink image (containerized version) to use in our set-up, keeping in mind that we want to use `KafkaSource`?

b) Please also share details of the corresponding compatible version of Kafka (containerized) to use, as I understand there is a constraint on the Kafka version required to make this work.

Kind regards,
Vinay


Attachment: Flink-KafkaService-up&running.png (158K)

Re: Flink to get historical data from kafka between timespan t1 & t2

Aljoscha Krettek
On 2021/01/11 14:12, [hidden email] wrote:
>a) As you mentioned, `KafkaSource` was introduced in Flink 1.12, so I suppose we have to upgrade to this version of Flink. Can you share a link to a stable Flink image (containerized version) to use in our set-up, keeping in mind that we want to use `KafkaSource`?

Unfortunately, there is a problem with publishing those images. You can
check out [1] to follow progress. In the meantime you can try and build
the image yourself or use the one that Robert pushed to his private
account.

>b) Please also share details of the corresponding compatible version of Kafka (containerized) to use, as I understand there is a constraint on the Kafka version required to make this work.

I believe Kafka is backwards compatible since at least version 1.0, so
any recent enough version should work.

Best,
Aljoscha

RE: Flink to get historical data from kafka between timespan t1 & t2

VINAY.RAICHUR
In reply to this post by VINAY.RAICHUR
Hi Team & Aljoscha,

Any update on my questions from 11 January (the stable Flink 1.12 image and the compatible Kafka version), please?


RE: Flink to get historical data from kafka between timespan t1 & t2

VINAY.RAICHUR
In reply to this post by Aljoscha Krettek
Hi Aljoscha

I am not sure about your proposal regarding point 3:
* Firstly, how is it ensured that the stream is closed? If I understand the docs correctly, the stream will be established starting from the latest timestamp (hmm... is that not the standard behaviour?) and will never finish (UNBOUNDED).
* Secondly, it is still not clear how to get the latest event at a given time point in the past.

Please help me with answers to the above.

Thanks & regards
Vinay


Re: Flink to get historical data from kafka between timespan t1 & t2

Aljoscha Krettek
On 2021/01/13 07:58, [hidden email] wrote:
>I am not sure about your proposal regarding point 3:
>* Firstly, how is it ensured that the stream is closed? If I understand the docs correctly, the stream will be established starting from the latest timestamp (hmm... is that not the standard behaviour?) and will never finish (UNBOUNDED).

On the first question of standard behaviour: the default is to start
from the group offsets that are available in Kafka. This uses the
configured consumer group. I think it's better to be explicit, though,
and specify something like `EARLIEST` or `LATEST`, etc.
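
For illustration, being explicit on the legacy consumer looks roughly like this (an untested sketch; the broker address, topic, and group id are made up):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class ExplicitStartPosition {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // made-up broker address
        props.setProperty("group.id", "dashboard");           // made-up consumer group

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("tracking-events", new SimpleStringSchema(), props);

        // The default is setStartFromGroupOffsets(); being explicit avoids depending
        // on whatever offsets the consumer group happens to have committed.
        consumer.setStartFromEarliest();        // read the topic from the beginning
        // consumer.setStartFromLatest();       // or: only records arriving from now on
        // consumer.setStartFromTimestamp(t);   // or: from a given point in time

        env.addSource(consumer).print();
        env.execute("explicit-start-position");
    }
}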

And yes, the stream will start but never stop with this version of the
Kafka connector. Only when you use the new `KafkaSource` can you also
specify an end timestamp that will make the Kafka source shut down
eventually.

>* Secondly, it is still not clear how to get the latest event at a given time point in the past.

You are referring to getting a single record, correct? I don't think
this is possible with Flink. All you can do is get a stream from Kafka
that is potentially bounded by a start timestamp and/or end timestamp.

Best,
Aljoscha

RE: Flink to get historical data from kafka between timespan t1 & t2

VINAY.RAICHUR
OK. Attached is a PPT of what I am attempting to achieve with respect to time.

I hope I am all set to achieve the three bullets mentioned in the attached slide for creating reports with the `KafkaSource` and `KafkaSourceBuilder` approach.

If you have any additional tips, please share them after going through the attached slide (e.g. for the live-dashboards use case).

Kind regards
Vinay


Attachment: Positioning_Use_Cases_TrackingData_Past_Now.pptx (125K)

Re: Flink to get historical data from kafka between timespan t1 & t2

Aljoscha Krettek
On 2021/01/13 12:07, [hidden email] wrote:
>OK. Attached is a PPT of what I am attempting to achieve with respect to time.
>
>I hope I am all set to achieve the three bullets mentioned in the attached slide for creating reports with the `KafkaSource` and `KafkaSourceBuilder` approach.
>
>If you have any additional tips, please share them after going through the attached slide (e.g. for the live-dashboards use case).

I think that should work with `KafkaSource`, yes. You just need to set
the correct start and end timestamps, respectively. I believe that's
all there is to it; off the top of my head I can't think of any
additional tips.
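
In case it helps, here is roughly how I would map your three bullets (assuming they match the three requirements from your first mail) onto `KafkaSource` configurations. This is an untested sketch; the broker address, topic, group id, and timestamps are made up, and only one of the three sources is wired into a pipeline to keep it short:

import java.util.Arrays;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.KafkaSourceBuilder;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TrackingReports {

    // Shared builder setup for the made-up "tracking-events" topic.
    private static KafkaSourceBuilder<String> baseBuilder() {
        return KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")            // made-up broker address
                .setTopics(Arrays.asList("tracking-events"))  // made-up topic
                .setGroupId("tracking-reports")               // made-up consumer group
                .setDeserializer(KafkaRecordDeserializer.valueOnly(StringDeserializer.class));
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        long t  = 1610496000000L; // example: 2021-01-13 00:00:00 UTC
        long t1 = 1609459200000L; // example: 2021-01-01 00:00:00 UTC
        long t2 = 1609977600000L; // example: 2021-01-07 00:00:00 UTC

        // Bullet 1: everything up to time t, then the source finishes.
        KafkaSource<String> upToT = baseBuilder()
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setBounded(OffsetsInitializer.timestamp(t))
                .build();

        // Bullet 2: the historical interval between t1 and t2.
        KafkaSource<String> betweenT1AndT2 = baseBuilder()
                .setStartingOffsets(OffsetsInitializer.timestamp(t1))
                .setBounded(OffsetsInitializer.timestamp(t2))
                .build();

        // Bullet 3: live dashboard, start at t and keep running until cancelled.
        KafkaSource<String> liveFromT = baseBuilder()
                .setStartingOffsets(OffsetsInitializer.timestamp(t))
                .build();

        env.fromSource(betweenT1AndT2, WatermarkStrategy.noWatermarks(), "t1-to-t2")
                .print();

        env.execute("tracking-reports");
    }
}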

Best,
Aljoscha