Split a dataset

Magnus Vojbacke
I'm looking for something like DataStream.split(), but for DataSets. I'd like to split my streaming data so messages go to different parts of an execution graph, based on arbitrary logic.

DataStream.split() seems to be perfect, except that my source is a CSV file, and I have only found built-in functions for reading CSV files into a DataSet.

I've evaluated using DataSet.filter(), but as far as I can tell, that only allows me to emulate a yes/no split. This is not ideal because it's too coarse, and I would prefer a more fine-grained split.


Do you have any suggestions on how I can achieve my arbitrary splitting logic for a) DataSets in general, or b) CSV files?

Re: Split a dataset

Fabian Hueske-2
Hi Magnus,

There is no Split operator on the DataSet API.

As you said, this can be done using a FilterFunction. This also allows for non-binary splits:

DataSet<X> setToSplit = ...
DataSet<X> firstSplit = setToSplit.filter(new SplitCondition1());
DataSet<X> secondSplit = setToSplit.filter(new SplitCondition2());
DataSet<X> thirdSplit = setToSplit.filter(new SplitCondition3());

where SplitCondition1, SplitCondition2, and SplitCondition3 are FilterFunctions, each of which filters out all records that don't belong to its split.
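
To make the non-binary case concrete, here is a minimal sketch of the same idea in plain Java, with no Flink dependency. It is only an analogy: java.util.function.Predicate stands in for FilterFunction, a List stands in for the DataSet, and the class name FilterSplitSketch, the method splitByFilters, and the three range conditions are made up for illustration. As with the filter calls above, every condition is applied to the full input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for the filter-based split; not Flink code.
public class FilterSplitSketch {

    /**
     * Applies every named condition to the whole input, the same way each
     * filter(...) call reads the same setToSplit. A record lands in every
     * split whose condition accepts it, so the conditions should be
     * mutually exclusive if the splits must not overlap.
     */
    public static <T> Map<String, List<T>> splitByFilters(
            List<T> input, Map<String, Predicate<T>> conditions) {
        Map<String, List<T>> splits = new LinkedHashMap<>();
        for (Map.Entry<String, Predicate<T>> entry : conditions.entrySet()) {
            List<T> part = new ArrayList<>();
            for (T record : input) {
                if (entry.getValue().test(record)) {
                    part.add(record);
                }
            }
            splits.put(entry.getKey(), part);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Integer> setToSplit = Arrays.asList(1, 5, 12, 20, 33, 47);

        // Three illustrative, mutually exclusive conditions (a three-way split).
        Map<String, Predicate<Integer>> conditions = new LinkedHashMap<>();
        conditions.put("first", x -> x < 10);
        conditions.put("second", x -> x >= 10 && x < 30);
        conditions.put("third", x -> x >= 30);

        // prints {first=[1, 5], second=[12, 20], third=[33, 47]}
        System.out.println(splitByFilters(setToSplit, conditions));
    }
}
```

Note that this approach evaluates one condition per split for every record, so the cost grows with the number of splits, and nothing prevents a record from matching several conditions or none.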

Best, Fabian

(In reply to Magnus Vojbacke, 2017-10-17 10:42 GMT+02:00)


Re: Split a dataset

Magnus Vojbacke
Thank you, Fabian! If batch semantics are not important to my use case, is there any way to "downgrade" or convert a DataSet to a DataStream?

BR
/Magnus

(In reply to Fabian Hueske, 17 Oct 2017, 10:54)



Re: Split a dataset

Fabian Hueske-2
Unfortunately, it's not possible to bridge the gap between the DataSet and DataStream APIs.

However, you can also use a CsvInputFormat in the DataStream API. Since there's no built-in API to configure the CSV input, you would have to create (and configure) the CsvInputFormat yourself.
Once you have the CsvInputFormat, you can create a DataStream using StreamExecutionEnvironment.readFile(csvIF, filePath).

Hope this helps,
Fabian

(In reply to Magnus Vojbacke, 2017-10-17 11:05 GMT+02:00)