DataStream in batch mode - handling (un)ordered bounded data


sardaesp

Hello,


Regarding the new BATCH mode of the DataStream API, I see that the documentation states that some operators will process all data for a given key before moving on to the next one. However, I don’t see how Flink is supposed to know whether the input will provide all data for a given key sequentially. In the DataSet API, an (undocumented?) feature is using SplitDataProperties (https://ci.apache.org/projects/flink/flink-docs-release-1.12/api/java/org/apache/flink/api/java/io/SplitDataProperties.html) to specify grouping/partitioning/sorting properties, so if the data is pre-sorted (e.g. when reading from a database), some operations can be optimized. Will the DataStream API get something similar?
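For context, the optimization in question can be sketched outside of Flink. The following is a minimal, self-contained Java illustration (not actual Flink API usage; class and method names here are invented for the example) of why an ordering guarantee matters: if the source already delivers records sorted by key, a per-key aggregate can be computed in a single pass, emitting each key as soon as its run of records ends, instead of first re-partitioning and sorting the whole input.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only (not Flink code): if records arrive already sorted by
// key -- e.g. read from a database with ORDER BY key -- a per-key sum needs
// only one pass and O(1) state per key, because once the key changes we know
// that key's data is complete. This is the property a hint like
// SplitDataProperties lets the DataSet optimizer rely on.
public class PreSortedSum {

    // Sums values per key in a single pass, ASSUMING entries arrive
    // grouped/sorted by key (each key forms one contiguous run).
    public static LinkedHashMap<String, Integer> sumPerKey(
            List<Map.Entry<String, Integer>> sortedByKey) {
        LinkedHashMap<String, Integer> out = new LinkedHashMap<>();
        String currentKey = null;
        int acc = 0;
        for (Map.Entry<String, Integer> e : sortedByKey) {
            if (!e.getKey().equals(currentKey)) {
                if (currentKey != null) {
                    out.put(currentKey, acc); // key's run ended: emit result
                }
                currentKey = e.getKey();
                acc = 0;
            }
            acc += e.getValue();
        }
        if (currentKey != null) {
            out.put(currentKey, acc); // emit the final key
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = List.of(
                Map.entry("a", 1), Map.entry("a", 2), Map.entry("b", 5));
        System.out.println(sumPerKey(input)); // prints {a=3, b=5}
    }
}
```

Without the ordering guarantee, producing the same result would require hashing or sorting the entire input first, which is exactly the work a source-provided data-properties hint would let the engine skip.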


Regards,

Alexis.



Re: DataStream in batch mode - handling (un)ordered bounded data

Dawid Wysakowicz-2

Hi Alexis,

As of now there is no such feature in the DataStream API. BATCH mode in the DataStream API is a new feature, and we would be interested to hear about the use cases people want to use it for, to identify potential areas to improve. What you are suggesting generally makes sense, so I think it would be nice if you could create a JIRA ticket for it.

Best,

Dawid


Re: DataStream in batch mode - handling (un)ordered bounded data

sardaesp
Hi Dawid,

I've created a ticket: https://issues.apache.org/jira/browse/FLINK-21763. Personally, I can keep using the DataSet API for now, but if it is deprecated at some point, it would be good to migrate sooner rather than later.

Regards,
Alexis.


From: Dawid Wysakowicz
Sent: Friday, March 12, 2021 4:10 PM
To: Alexis Sarda-Espinosa; [hidden email]
Subject: Re: DataStream in batch mode - handling (un)ordered bounded data
