Merging N parallel/partitioned WindowedStreams together, one-to-one, into a global window stream

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Merging N parallel/partitioned WindowedStreams together, one-to-one, into a global window stream

AJ Heller
The goal is: 
 * to split data, random-uniformly, across N nodes, 
 * window the data identically on each node,
 * transform the windows locally on each node, and
 * merge the N parallel windows into a global window stream, such that one window from each parallel process is merged into a "global window" aggregate

I've achieved all but the last bullet point, merging one window from each partition into a globally-aggregated window output stream.

To be clear, a rolling reduce won't work because it would aggregate over all previous windows in all partitioned streams, and I only need to aggregate over one window from each partition at a time.

Similarly for a fold.

The closest I have found is ParallelMerge for ConnectedStreams, but I have not found a way to apply it to this problem. Can flink achieve this? If so, I'd greatly appreciate a point in the right direction.

Cheers,
-aj
Reply | Threaded
Open this post in threaded view
|

Re: Merging N parallel/partitioned WindowedStreams together, one-to-one, into a global window stream

Fabian Hueske-2
Maybe this can be done by assigning the same window id to each of the N local windows, and do a

.keyBy(windowId)
.countWindow(N)

This should create a new global window for each window id and collect all N windows.

Best, Fabian

2016-10-06 22:39 GMT+02:00 AJ Heller <[hidden email]>:
The goal is: 
 * to split data, random-uniformly, across N nodes, 
 * window the data identically on each node,
 * transform the windows locally on each node, and
 * merge the N parallel windows into a global window stream, such that one window from each parallel process is merged into a "global window" aggregate

I've achieved all but the last bullet point, merging one window from each partition into a globally-aggregated window output stream.

To be clear, a rolling reduce won't work because it would aggregate over all previous windows in all partitioned streams, and I only need to aggregate over one window from each partition at a time.

Similarly for a fold.

The closest I have found is ParallelMerge for ConnectedStreams, but I have not found a way to apply it to this problem. Can flink achieve this? If so, I'd greatly appreciate a point in the right direction.

Cheers,
-aj

Reply | Threaded
Open this post in threaded view
|

Re: Merging N parallel/partitioned WindowedStreams together, one-to-one, into a global window stream

AJ Heller
Thank you Fabian, I think that solves it. I'll need to rig up some tests to verify, but it looks good.

I used a RichMapFunction to assign ids incrementally to windows (mapping STREAM_OBJECT to Tuple2<Long, STREAM_OBJECT> using a private long value in the mapper that increments on every map call). It works, but by any chance is there a more succinct way to do it?

On Thu, Oct 6, 2016 at 1:50 PM, Fabian Hueske <[hidden email]> wrote:
Maybe this can be done by assigning the same window id to each of the N local windows, and do a

.keyBy(windowId)
.countWindow(N)

This should create a new global window for each window id and collect all N windows.

Best, Fabian

2016-10-06 22:39 GMT+02:00 AJ Heller <[hidden email]>:
The goal is: 
 * to split data, random-uniformly, across N nodes, 
 * window the data identically on each node,
 * transform the windows locally on each node, and
 * merge the N parallel windows into a global window stream, such that one window from each parallel process is merged into a "global window" aggregate

I've achieved all but the last bullet point, merging one window from each partition into a globally-aggregated window output stream.

To be clear, a rolling reduce won't work because it would aggregate over all previous windows in all partitioned streams, and I only need to aggregate over one window from each partition at a time.

Similarly for a fold.

The closest I have found is ParallelMerge for ConnectedStreams, but I have not found a way to apply it to this problem. Can flink achieve this? If so, I'd greatly appreciate a point in the right direction.

Cheers,
-aj


Reply | Threaded
Open this post in threaded view
|

Re: Merging N parallel/partitioned WindowedStreams together, one-to-one, into a global window stream

Fabian Hueske-2
If you are using time windows, you can access the TimeWindow parameter of the WindowFunction.apply() method.
The TimeWindow contains the start and end timestamp of a window (as Long) which can act as keys.

If you are using count windows, I think you have to use a counter as you described.


2016-10-07 1:06 GMT+02:00 AJ Heller <[hidden email]>:
Thank you Fabian, I think that solves it. I'll need to rig up some tests to verify, but it looks good.

I used a RichMapFunction to assign ids incrementally to windows (mapping STREAM_OBJECT to Tuple2<Long, STREAM_OBJECT> using a private long value in the mapper that increments on every map call). It works, but by any chance is there a more succinct way to do it?

On Thu, Oct 6, 2016 at 1:50 PM, Fabian Hueske <[hidden email]> wrote:
Maybe this can be done by assigning the same window id to each of the N local windows, and do a

.keyBy(windowId)
.countWindow(N)

This should create a new global window for each window id and collect all N windows.

Best, Fabian

2016-10-06 22:39 GMT+02:00 AJ Heller <[hidden email]>:
The goal is: 
 * to split data, random-uniformly, across N nodes, 
 * window the data identically on each node,
 * transform the windows locally on each node, and
 * merge the N parallel windows into a global window stream, such that one window from each parallel process is merged into a "global window" aggregate

I've achieved all but the last bullet point, merging one window from each partition into a globally-aggregated window output stream.

To be clear, a rolling reduce won't work because it would aggregate over all previous windows in all partitioned streams, and I only need to aggregate over one window from each partition at a time.

Similarly for a fold.

The closest I have found is ParallelMerge for ConnectedStreams, but I have not found a way to apply it to this problem. Can flink achieve this? If so, I'd greatly appreciate a point in the right direction.

Cheers,
-aj



Reply | Threaded
Open this post in threaded view
|

Re: Merging N parallel/partitioned WindowedStreams together, one-to-one, into a global window stream

Aljoscha Krettek
Hi,
are you windowing based on event time?

Cheers,
Aljoscha

On Fri, 7 Oct 2016 at 09:28 Fabian Hueske <[hidden email]> wrote:
If you are using time windows, you can access the TimeWindow parameter of the WindowFunction.apply() method.
The TimeWindow contains the start and end timestamp of a window (as Long) which can act as keys.

If you are using count windows, I think you have to use a counter as you described.


2016-10-07 1:06 GMT+02:00 AJ Heller <[hidden email]>:
Thank you Fabian, I think that solves it. I'll need to rig up some tests to verify, but it looks good.

I used a RichMapFunction to assign ids incrementally to windows (mapping STREAM_OBJECT to Tuple2<Long, STREAM_OBJECT> using a private long value in the mapper that increments on every map call). It works, but by any chance is there a more succinct way to do it?

On Thu, Oct 6, 2016 at 1:50 PM, Fabian Hueske <[hidden email]> wrote:
Maybe this can be done by assigning the same window id to each of the N local windows, and do a

.keyBy(windowId)
.countWindow(N)

This should create a new global window for each window id and collect all N windows.

Best, Fabian

2016-10-06 22:39 GMT+02:00 AJ Heller <[hidden email]>:
The goal is: 
 * to split data, random-uniformly, across N nodes, 
 * window the data identically on each node,
 * transform the windows locally on each node, and
 * merge the N parallel windows into a global window stream, such that one window from each parallel process is merged into a "global window" aggregate

I've achieved all but the last bullet point, merging one window from each partition into a globally-aggregated window output stream.

To be clear, a rolling reduce won't work because it would aggregate over all previous windows in all partitioned streams, and I only need to aggregate over one window from each partition at a time.

Similarly for a fold.

The closest I have found is ParallelMerge for ConnectedStreams, but I have not found a way to apply it to this problem. Can flink achieve this? If so, I'd greatly appreciate a point in the right direction.

Cheers,
-aj