Access to Kafka Event Time


Access to Kafka Event Time

Vishal Santoshi
We have a use case where multiple topics are streamed to HDFS, and we would want to create buckets based on ingestion time (the time the events were pushed to Kafka). Our producers to Kafka will set that timestamp as the event time.


The documentation suggests that "previousElementTimestamp" will provide that timestamp, provided the "EventTime" characteristic is set. The method also provides the element itself. In our case the element will expose a setIngestionTime(long time) method. Is the element in this method
public long extractTimestamp(Long element, long previousElementTimestamp)
passed by reference, and can it be safely (losslessly) mutated for downstream operators?
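
To make the question concrete, here is a rough sketch of the assigner I have in mind. Event is a stand-in for our record type with the setIngestionTime(long) method; whether the mutation inside extractTimestamp survives downstream is exactly what I am asking.

import javax.annotation.Nullable;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Sketch only. "Event" is our hypothetical record type.
public class KafkaTimestampAssigner implements AssignerWithPeriodicWatermarks<Event> {

    private long maxTimestamp = Long.MIN_VALUE;

    @Override
    public long extractTimestamp(Event element, long previousElementTimestamp) {
        // With the EventTime characteristic set, previousElementTimestamp should carry
        // the timestamp the Kafka source attached to the record.
        element.setIngestionTime(previousElementTimestamp); // the mutation in question
        maxTimestamp = Math.max(maxTimestamp, previousElementTimestamp);
        return previousElementTimestamp;
    }

    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(maxTimestamp);
    }
}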


That said, there is another place where that record timestamp is available.


Is it possible to change the signature of the 


to add the record timestamp as the last argument?

Regards, 

Vishal







Re: Access to Kafka Event Time

Vishal Santoshi
In fact it may be available elsewhere too (for example in a ProcessFunction), but we have no need to create one; this is just a data relay (Kafka to HDFS), and any intermediate processing should be avoided if possible, IMHO.


Re: Access to Kafka Event Time

Vishal Santoshi
Any feedback?


Re: Access to Kafka Event Time

Hequn Cheng
Hi Vishal,

We have a use case where multiple topics are streamed to HDFS and we would want to create buckets based on ingestion time
If I understand correctly, you want to create buckets based on event time. Maybe you can use a window[1]. For example, a tumbling window of 5 minutes groups rows into 5-minute intervals, and you can get the window start time (TUMBLE_START(time_attr, interval)) and end time (TUMBLE_END(time_attr, interval)) when outputting data.
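
At the DataStream level the same idea looks roughly like the sketch below; Event, its getTopic() getter, and the variable names are made up for illustration, and window.getStart()/getEnd() play the role of TUMBLE_START/TUMBLE_END.

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

DataStream<Event> events = ...; // stream with event-time timestamps/watermarks already assigned

events
    .keyBy(new KeySelector<Event, String>() {
        @Override
        public String getKey(Event e) {
            return e.getTopic(); // hypothetical getter on the hypothetical Event type
        }
    })
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))   // 5-minute tumbling buckets
    .process(new ProcessWindowFunction<Event, Event, String, TimeWindow>() {
        @Override
        public void process(String key, Context ctx, Iterable<Event> elements, Collector<Event> out) {
            long bucketStart = ctx.window().getStart(); // analogous to TUMBLE_START
            long bucketEnd   = ctx.window().getEnd();   // analogous to TUMBLE_END
            for (Event e : elements) {
                out.collect(e); // emit (or tag with bucketStart/bucketEnd) per bucket
            }
        }
    });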

Best, Hequn



Re: Access to Kafka Event Time

Aljoscha Krettek
Hi Vishal,

to answer the original question: it should not be assumed that mutations of the element will be reflected downstream. For your situation this means that you have to use a ProcessFunction to put the timestamp of a record into the record itself.
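
A minimal sketch of that, reusing the hypothetical Event type and setIngestionTime(long) method from your mail:

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Sketch only. "Event" is the hypothetical record type from the question.
public class StampRecordTimestamp extends ProcessFunction<Event, Event> {

    @Override
    public void processElement(Event value, Context ctx, Collector<Event> out) {
        Long ts = ctx.timestamp(); // timestamp attached by the Kafka source; null if none is set
        if (ts != null) {
            value.setIngestionTime(ts); // copy the record timestamp into the record itself
        }
        out.collect(value);
    }
}

Applied with stream.process(new StampRecordTimestamp()) before the sink.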

Also, Flink 1.6 will come with the next version of the BucketingSink called StreamingFileSink, where the Bucketer interface was updated to allow access to the element timestamp. The new interface is now called BucketAssigner.
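
With the new interface, a bucket assigner keyed off the element timestamp could look roughly like the sketch below. This is only a sketch: Event is again the hypothetical record type, and the exact package and Context method names are assumptions that may differ slightly in the released 1.6, so please check the final API.

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

// Sketch only; identifiers are assumptions about the 1.6 API.
public class RecordTimestampBucketAssigner implements BucketAssigner<Event, String> {

    @Override
    public String getBucketId(Event element, Context context) {
        // context.timestamp() exposes the timestamp the source attached to the element;
        // fall back to processing time if no timestamp is set.
        Long ts = context.timestamp();
        long bucketTime = (ts != null) ? ts : context.currentProcessingTime();
        return new SimpleDateFormat("yyyy-MM-dd--HH").format(new Date(bucketTime));
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}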

Best,
Aljoscha


Re: Access to Kafka Event Time

Vishal Santoshi
Thanks a lot! Awesome that 1.6 will have the timestamp of the element...
