Flushing the result of a groupReduce to a Sink before all reduces complete

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Flushing the result of a groupReduce to a Sink before all reduces complete

Paul Wilson
Hi,

DataSet API 
Flink 1.1.3

I have an application where I'd like to perform some mapping before batching the results and passing them to the sink. I'm performing a 'composite' key selection to group the items by their natural key as well as a batch (itemCount / batchSize). When I reduce the batches and pass them to the sink, the whole flow is waiting for all reduces to complete before passing them to sink. 

Is there some way that the results of a single group reduce can be passed to the sink before all reduces are complete? 

Hope that makes sense,
Regards,
Paul
Reply | Threaded
Open this post in threaded view
|

Re: Flushing the result of a groupReduce to a Sink before all reduces complete

Fabian Hueske-2
Hi Paul,

Flink pushes the results of operators (including GroupReduce) to the next operator or sink as soon as they are computed. So what you are asking for is actually happening.
However, before the GroupReduceFunction can be applied, the whole data is sorted in order to group the data. This step is usually more expensive than applying the GroupReduceFunction. Therefore, it looks like the output is batched.
Flink does only support sort-based grouping, however also hash-based grouping would not help, because Flink would not know when to close a group until all data is consumed.

Please let me know if you have further questions.

Best, Fabian


2016-10-26 19:07 GMT+02:00 Paul Wilson <[hidden email]>:
Hi,

DataSet API 
Flink 1.1.3

I have an application where I'd like to perform some mapping before batching the results and passing them to the sink. I'm performing a 'composite' key selection to group the items by their natural key as well as a batch (itemCount / batchSize). When I reduce the batches and pass them to the sink, the whole flow is waiting for all reduces to complete before passing them to sink. 

Is there some way that the results of a single group reduce can be passed to the sink before all reduces are complete? 

Hope that makes sense,
Regards,
Paul

Reply | Threaded
Open this post in threaded view
|

Re: Flushing the result of a groupReduce to a Sink before all reduces complete

Paul Wilson
Hi Fabian,

We have reworked our execution to remove the group reduce step and replaced it with a map partition and we're seeing data passing more immediately now. 

Thanks for your quick reply, it was very useful. 

Regards,
Paul

On 26 October 2016 at 19:57, Fabian Hueske <[hidden email]> wrote:
Hi Paul,

Flink pushes the results of operators (including GroupReduce) to the next operator or sink as soon as they are computed. So what you are asking for is actually happening.
However, before the GroupReduceFunction can be applied, the whole data is sorted in order to group the data. This step is usually more expensive than applying the GroupReduceFunction. Therefore, it looks like the output is batched.
Flink does only support sort-based grouping, however also hash-based grouping would not help, because Flink would not know when to close a group until all data is consumed.

Please let me know if you have further questions.

Best, Fabian


2016-10-26 19:07 GMT+02:00 Paul Wilson <[hidden email]>:
Hi,

DataSet API 
Flink 1.1.3

I have an application where I'd like to perform some mapping before batching the results and passing them to the sink. I'm performing a 'composite' key selection to group the items by their natural key as well as a batch (itemCount / batchSize). When I reduce the batches and pass them to the sink, the whole flow is waiting for all reduces to complete before passing them to sink. 

Is there some way that the results of a single group reduce can be passed to the sink before all reduces are complete? 

Hope that makes sense,
Regards,
Paul