Performing Reduce on a group of datasets

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Performing Reduce on a group of datasets

Ritesh Kumar Singh
Hi,

How can I perform a reduce operation on a group of datasets using Flink? 
Let's say my map function gives out n datasets: d1, d2, ... dN
Now I wish to perform my reduce operation on all the N datasets at once and not on an individual level. The only way I figured out till now is using the union operator first like following:

List<Dataset<X>> dataList = Arrays.asList(d1, d2, ... dN);
Dataset<X> dFinal = null;
for(Dataset<X> ds: dataList)
{
    dFinal = dFinal.union(ds);
}
dFinal.groupBy(0).reduce(...);

Is there a more efficient way of doing the above task using java APIs? GroupReduce only works on a single dataset at a time and I can't find any other methods that take multiple datasets as an input parameter.

Thanks, 
--
Reply | Threaded
Open this post in threaded view
|

Re: Performing Reduce on a group of datasets

Fabian Hueske-2
I think union is what you are looking for.
Note that all data sets must be of the same type.

2016-05-18 16:15 GMT+02:00 Ritesh Kumar Singh <[hidden email]>:
Hi,

How can I perform a reduce operation on a group of datasets using Flink? 
Let's say my map function gives out n datasets: d1, d2, ... dN
Now I wish to perform my reduce operation on all the N datasets at once and not on an individual level. The only way I figured out till now is using the union operator first like following:

List<Dataset<X>> dataList = Arrays.asList(d1, d2, ... dN);
Dataset<X> dFinal = null;
for(Dataset<X> ds: dataList)
{
    dFinal = dFinal.union(ds);
}
dFinal.groupBy(0).reduce(...);

Is there a more efficient way of doing the above task using java APIs? GroupReduce only works on a single dataset at a time and I can't find any other methods that take multiple datasets as an input parameter.

Thanks, 
--

Reply | Threaded
Open this post in threaded view
|

Re: Performing Reduce on a group of datasets

Ritesh Kumar Singh
Thanks for the reply Fabian, 

Though here's a small thing I found on the documentation page:

If you look into the Union section, "This operation happens implicitly if more than one data set is used for a specific function input." , I'm not sure what this is supposed to mean. My initial assumption was something like: 
    
    dFinal = dFinal.union( d1, d2, ... , dN); // passing more than one dataset as function input. 

But as expected, this does not satisfy the union method signature. And so is that line supposed to mean something else? Or is it a feature not supported by flink 0.8 but works with future releases?

Thanks, 
Reply | Threaded
Open this post in threaded view
|

Re: Performing Reduce on a group of datasets

Fabian Hueske-2
I think that sentence is misleading and refers to the internals of Flink. It should be removed, IMO.
You can only union two DataSets. If you want to union more, you have to do it one by one.

Btw. union does not cause additional processing overhead.

Cheers, Fabian

2016-05-19 14:44 GMT+02:00 Ritesh Kumar Singh <[hidden email]>:
Thanks for the reply Fabian, 

Though here's a small thing I found on the documentation page:

If you look into the Union section, "This operation happens implicitly if more than one data set is used for a specific function input." , I'm not sure what this is supposed to mean. My initial assumption was something like: 
    
    dFinal = dFinal.union( d1, d2, ... , dN); // passing more than one dataset as function input. 

But as expected, this does not satisfy the union method signature. And so is that line supposed to mean something else? Or is it a feature not supported by flink 0.8 but works with future releases?

Thanks,