Consistent (hashing) keyBy over multiple time or streaming windows

Consistent (hashing) keyBy over multiple time or streaming windows

Leonard Wolters
Hi,

I was wondering whether Flink already implements some sort of consistent keyBy mapping over multiple windows.
The underlying idea is to 'sessionize' incoming events over time (i.e. across multiple streaming windows) on the
same partitions. As you can understand, I want to avoid heavy shuffling over the network.

As far as I could tell from the API docs and the blogs on data-artisans, I can group events by key within one
(time) window, and they are automatically distributed over all partitions. However, since sessions often exceed
the window duration, I want some guarantee that events belonging to the same session are delivered to the
same partition in subsequent windows.

Is this possible, or are there any plans to support this soon?

Thanks in advance,

Leonard

--
Leonard Wolters
Chief Product Manager
M: +31 (0)6 55 53 04 01 | T: +31 (0)88 10 44 555
E: [hidden email] | W: sagent.io | Disclaimer | Sagent BV
Herengracht 504 | 1017CB Amsterdam | Netherlands
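
The guarantee being asked about follows from keyBy being a pure function of the key: the target partition depends only on the key, never on the window in which an element arrives. A minimal sketch of that idea in plain Java (not the actual Flink implementation, which hashes keys into key groups; the `partitionFor` helper and the hash-modulo scheme here are simplifications for illustration):

```java
import java.util.Objects;

public class KeyPartitioning {
    // Deterministic key -> partition mapping, analogous in spirit to keyBy:
    // the partition is computed from the key alone, so the same session key
    // always routes to the same partition, regardless of which window the
    // event falls into. floorMod keeps the result non-negative.
    static int partitionFor(Object key, int parallelism) {
        return Math.floorMod(Objects.hashCode(key), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        String sessionId = "session-42";
        // The same session key, arriving in two different windows:
        int inWindow1 = partitionFor(sessionId, parallelism);
        int inWindow2 = partitionFor(sessionId, parallelism);
        System.out.println(inWindow1 == inWindow2); // prints true
    }
}
```

Because the mapping is stateless and deterministic, no coordination between windows is needed to keep a session on one partition.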
Re: Consistent (hashing) keyBy over multiple time or streaming windows

Aljoscha Krettek
Hi Leonard,
I’m afraid you might be thinking of windows as they are supported by Spark Streaming; there, windows are quite limited. In Flink you don’t necessarily have to window elements by time, since Flink does not collect data in mini-batches before processing. Everything is processed continuously, and you can have arbitrary Trigger strategies that decide when you want to process windows.

The basic idea behind windowing in Flink is that elements are assigned to windows by a WindowAssigner and then a Trigger decides when to trigger computation for a specific window. This is very similar to the model employed by Google Cloud Dataflow, if you are familiar with that.
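
That assign-then-trigger model is what makes session windows possible: each element initially gets its own window, and overlapping windows are merged. A minimal sketch of the merging idea in plain Java, not the actual Flink WindowAssigner interface (the 30-second gap and the assumption of timestamp-sorted input are simplifications of this sketch):

```java
import java.util.ArrayList;
import java.util.List;

public class SessionWindows {
    static final long GAP = 30_000L; // assumed session gap in milliseconds

    // Each timestamp starts as its own [ts, ts + GAP) window; a window that
    // overlaps the previous one is merged into it. Input must be sorted by
    // timestamp for this simplified version.
    static List<long[]> assign(long[] timestamps) {
        List<long[]> windows = new ArrayList<>();
        for (long ts : timestamps) {
            long end = ts + GAP;
            long[] last = windows.isEmpty() ? null : windows.get(windows.size() - 1);
            if (last != null && last[1] >= ts) {
                // Overlap: extend the current session window.
                last[1] = Math.max(last[1], end);
            } else {
                // Gap exceeded: start a new session window.
                windows.add(new long[]{ts, end});
            }
        }
        return windows;
    }

    public static void main(String[] args) {
        // Third event arrives more than GAP after the second, so it opens
        // a second session.
        long[] ts = {0, 10_000, 70_000};
        System.out.println(assign(ts).size()); // prints 2
    }
}
```

In Flink itself, a Trigger would then decide, per merged window, when to fire the computation (e.g. when the gap has elapsed in event time).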

You could have a look at this Stack Overflow question and my answer to it: http://stackoverflow.com/questions/33451121/apache-flink-session-support. It could be similar to your use case.

Please let me know if you want a more in-depth explanation of the windowing system. It is quite a recent addition and arguably the most complex part of any streaming system.

Cheers,
Aljoscha


Re: Consistent (hashing) keyBy over multiple time or streaming windows

Leonard Wolters
Hi Aljoscha,

Thanks for the quick response.
I've seen the Google Dataflow presentation at Flink Forward and understand the
concepts behind it (which are also supported by Flink).

I will look further into the Stack Overflow thread and let you know if I have any further questions.

Once again, thanks,

Leonard

