Flink - Process datastream in a bounded context (like Dataset) - Unifying stream & batch

Flink - Process datastream in a bounded context (like Dataset) - Unifying stream & batch

bastien dine
Hello everyone,

I need to join some files to perform some processing. The DataSet API is a perfect fit for this, and I am able to do it when I read the files in batch (CSV).

However, in the production environment I will receive those files as Kafka messages (one message = one line of a file).
So I am considering using a global window plus a custom trigger that fires on an end-of-file message, followed by a process window function.
But I cannot go very far with that, as process is only a single function and chaining functions would be painful. I don't think that emitting a DataStream and re-windowing / re-triggering on EOF before every process function is a good idea.
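
Roughly, what I have in mind looks like this (a sketch only; Line, isEof(), getFileId(), and JoinWholeFileFunction are placeholder names for my own types, not anything in Flink):

import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

// Fires (and purges the buffered elements) once the EOF marker arrives.
public class EofTrigger extends Trigger<Line, GlobalWindow> {

    @Override
    public TriggerResult onElement(Line element, long timestamp,
                                   GlobalWindow window, TriggerContext ctx) {
        return element.isEof() ? TriggerResult.FIRE_AND_PURGE
                               : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow window,
                                          TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow window,
                                     TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) {
        // No timers registered, so nothing to clean up.
    }
}

// lines: DataStream<Line> read from Kafka (assumed).
// Buffer all lines of a file, then process them as one batch.
lines.keyBy(line -> line.getFileId())
     .window(GlobalWindows.create())
     .trigger(new EofTrigger())
     .process(new JoinWholeFileFunction());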

However, I would like to work in a bounded way once I have received all of my elements (after the trigger fires on the global window), just like with the DataSet API, as I will join on my whole dataset.

I thought it might be a good idea to go for the Table API and a group window, but can you have a custom trigger and a global group window on a table (like the global window on a DataStream)?
The best alternative would be to create a DataSet as the result of my process window function, but I don't think this is possible, is it?

Best Regards,
Bastien

Re: Flink - Process datastream in a bounded context (like Dataset) - Unifying stream & batch

Hequn Cheng
Hi bastien,

Flink features two relational APIs, the Table API and SQL. Both are unified APIs for batch and stream processing, i.e., queries are executed with the same semantics on unbounded, real-time streams or bounded, recorded streams [1].
There is also documentation about joins [2].
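
For instance, something along these lines with the Java Table API (a minimal sketch against the 1.6-era API; the stream variables, table names, fields, and join condition are only illustrative):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

// Register the two Kafka-backed streams as tables (field names assumed).
tEnv.registerDataStream("FileA", fileAStream, "id, valueA");
tEnv.registerDataStream("FileB", fileBStream, "id, valueB");

// The same query runs unchanged on a batch TableEnvironment over DataSets.
Table joined = tEnv.sqlQuery(
    "SELECT a.id, a.valueA, b.valueB FROM FileA a JOIN FileB b ON a.id = b.id");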

Best, Hequn


Re: Flink - Process datastream in a bounded context (like Dataset) - Unifying stream & batch

bastien dine
Hi Hequn,

Thanks for your response.
Yes, I know about the Table API, but I am looking for a way to have a bounded context with a stream, somehow creating a DataSet from a buffer stored in a window of a DataStream.

Regards, Bastien


Re: Flink - Process datastream in a bounded context (like Dataset) - Unifying stream & batch

Hequn Cheng
Hi bastien,

Could you give more details about your scenario? Do you want to load another file from the same Kafka topic after the current file has been processed?
I am curious why you want to join the data in a bounded way when the input is a stream. A stream-stream join outputs the same results as a batch join, so you don't have to wait until all the data has been received before joining.
If you do want to process data this way, I think you can use a UDTF (user-defined table function) to achieve this behavior. In the UDTF, data is buffered as it arrives from Kafka, and you can perform the processing (the join) once you have received all of the elements; see the skeleton below.
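
A rough skeleton of that idea (buffering inside the function until an EOF flag; the String element type, the EOF flag, and the join logic are just placeholders, and note that state kept in a plain instance field like this is not checkpointed):

import org.apache.flink.table.functions.TableFunction;
import java.util.ArrayList;
import java.util.List;

public class BufferingJoinFunction extends TableFunction<String> {

    private final List<String> buffer = new ArrayList<>();

    public void eval(String line, boolean isEof) {
        if (!isEof) {
            buffer.add(line);   // hold lines until the file is complete
            return;
        }
        // All elements received: process the bounded buffer as a whole.
        for (String joined : joinBufferedLines(buffer)) {
            collect(joined);    // emit each result row
        }
        buffer.clear();
    }

    // Placeholder for the actual join over the complete file.
    private List<String> joinBufferedLines(List<String> lines) {
        return lines;
    }
}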

Best, Hequn
