DataStream in batch mode - handling (un)ordered bounded data


sardaesp

Hello,


Regarding the new BATCH mode of the DataStream API, I see that the documentation states that some operators will process all data for a given key before moving on to the next one. However, I don’t see how Flink is supposed to know whether the input will provide all data for a given key sequentially. In the DataSet API, an (undocumented?) feature is using SplitDataProperties (https://ci.apache.org/projects/flink/flink-docs-release-1.12/api/java/org/apache/flink/api/java/io/SplitDataProperties.html) to specify grouping/partitioning/sorting properties, so if the data is pre-sorted (e.g. when reading from a database), some operations can be optimized. Will the DataStream API get something similar?
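For context, the optimization in question can be sketched outside of Flink. The following is a minimal, self-contained Java illustration (not actual Flink API usage; class and method names here are invented for the example) of why an ordering guarantee matters: if the source already delivers records sorted by key, a per-key aggregate can be computed in a single pass, emitting each key as soon as its run of records ends, instead of first re-partitioning and sorting the whole input.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only (not Flink code): if records arrive already sorted by
// key -- e.g. read from a database with ORDER BY key -- a per-key sum needs
// only one pass and O(1) state per key, because once the key changes we know
// that key's data is complete. This is the property a hint like
// SplitDataProperties lets the DataSet optimizer rely on.
public class PreSortedSum {

    // Sums values per key in a single pass, ASSUMING entries arrive
    // grouped/sorted by key (each key forms one contiguous run).
    public static LinkedHashMap<String, Integer> sumPerKey(
            List<Map.Entry<String, Integer>> sortedByKey) {
        LinkedHashMap<String, Integer> out = new LinkedHashMap<>();
        String currentKey = null;
        int acc = 0;
        for (Map.Entry<String, Integer> e : sortedByKey) {
            if (!e.getKey().equals(currentKey)) {
                if (currentKey != null) {
                    out.put(currentKey, acc); // key's run ended: emit result
                }
                currentKey = e.getKey();
                acc = 0;
            }
            acc += e.getValue();
        }
        if (currentKey != null) {
            out.put(currentKey, acc); // emit the final key
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = List.of(
                Map.entry("a", 1), Map.entry("a", 2), Map.entry("b", 5));
        System.out.println(sumPerKey(input)); // prints {a=3, b=5}
    }
}
```

Without the ordering guarantee, producing the same result would require hashing or sorting the entire input first, which is exactly the work a source-provided data-properties hint would let the engine skip.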


Regards,

Alexis.



Re: DataStream in batch mode - handling (un)ordered bounded data

Dawid Wysakowicz-2

Hi Alexis,

As of now there is no such feature in the DataStream API. BATCH mode in the DataStream API is a new feature, and we would be interested to hear about the use cases people want to use it for, to identify potential areas to improve. What you are suggesting generally makes sense, so I think it would be nice if you could create a JIRA ticket for it.

Best,

Dawid


Re: DataStream in batch mode - handling (un)ordered bounded data

sardaesp
Hi Dawid,

I've created a ticket: https://issues.apache.org/jira/browse/FLINK-21763. Personally, I can keep using the DataSet API for now, but if it is deprecated at some point, it would be good to migrate sooner rather than later.

Regards,
Alexis.


From: Dawid Wysakowicz
Sent: Friday, March 12, 2021 4:10 PM
To: Alexis Sarda-Espinosa; [hidden email]
Subject: Re: DataStream in batch mode - handling (un)ordered bounded data
