Hi,
I was wondering if Flink already implements some sort of consistent keyBy mapping over multiple windows. The underlying idea is to 'sessionize' incoming events over time (i.e. across multiple streaming windows) on the same partitions. As you can understand, I want to avoid heavy shuffling over the network.

As far as I could read / understand from the API docs and the blogs on data-artisans, I can perfectly groupBy events within one (time) window, and those are automatically distributed over all partitions. However, since sessions often exceed the window duration, I want some guarantee that events belonging to the same session are delivered to the same partition for new windows.

Is this possible, or are there any plans to support this soon?

Thanks in advance,

Leonard
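[Editor's note: the property Leonard is asking about hinges on the partition being a pure function of the key, so that the same session key lands on the same partition in every window. The sketch below is a plain-Java illustration of that idea only; it is not Flink's actual internal partitioner (which hashes keys into key groups), and `partitionFor` is a hypothetical helper.]

```java
import java.util.Objects;

public class KeyPartitioning {
    // Hypothetical partitioner: a pure function of the key alone, so the
    // key -> partition mapping is stable across windows and across time.
    static int partitionFor(String sessionKey, int parallelism) {
        // Math.floorMod guards against negative hash codes.
        return Math.floorMod(Objects.hashCode(sessionKey), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // The same session key, partitioned during two different windows:
        int firstWindow = partitionFor("session-42", parallelism);
        int laterWindow = partitionFor("session-42", parallelism);
        // Same key -> same partition, regardless of which window the event falls into.
        System.out.println(firstWindow == laterWindow); // prints "true"
    }
}
```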
Hi Leonard,
I’m afraid you might be thinking about windows as they are supported by Spark Streaming. There windows are quite limited. In Flink you don’t necessarily have to window elements by time since Flink does not collect data in mini-batches before processing. Everything is continuously processed and you can have arbitrary Trigger strategies that decide when you want to process windows. The basic idea behind windowing in Flink is that elements are assigned to windows by a WindowAssigner and then a Trigger decides when to trigger computation for a specific window. This is very similar to the model employed by Google Cloud Dataflow, if you are familiar with that. You could have a look at this Stackoverflow question and my answer to it: http://stackoverflow.com/questions/33451121/apache-flink-session-support. It could be similar to your use case. Please let me know if you want a more in-depth explanation of the windowing system. It is a quite recent addition and arguably the most complex part in any streaming system. Cheers, Aljoscha > On 02 Nov 2015, at 09:15, Leonard Wolters <[hidden email]> wrote: > > Hi, > > I was wondering if Flink already has implemented some sort of consistent keyBy mapping over multiple windows. > The underlying idea is to 'sessionize' incoming events over time (i.e. multiple streaming windows) on the same > partitions. As one can understand I want to avoid heavy shuffling over the network. > > As far as I could read / understand from the API docs and the blogs on data-artisans, I can perfectly groupBy events > within one (time) window that are automatically distributed over all partitions. However, since sessions often exceed > the window submit duration, I want some guarantees that events belonging to the same session are delivered to the > same partition for new windows. > > Is this possible or are they any plans to support this soon? 
> > Thanks in advance, > > Leonard > > -- > Leonard Wolters > Chief Product Manager > <logo.png> > M: +31 (0)6 55 53 04 01 | T: +31 (0)88 10 44 555 > E: [hidden email] | W: sagent.io | Disclaimer | Sagent BV > Herengracht 504 | 1017CB Amsterdam | Netherlands |
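[Editor's note: to make the assigner/trigger split Aljoscha describes concrete, here is a tiny self-contained simulation in plain Java. It does not use the real Flink WindowAssigner/Trigger interfaces; `assignWindow` and `shouldFire` are hypothetical stand-ins for the two roles: the assigner maps each event timestamp to a window, and the trigger decides, given event-time progress (a watermark), when that window fires.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WindowModelSketch {
    // "Assigner" role: map an event timestamp to the start of a
    // size-unit tumbling window.
    static long assignWindow(long ts, long size) {
        return ts - (ts % size);
    }

    // "Trigger" role: fire a window once the watermark passes its end.
    static boolean shouldFire(long windowStart, long size, long watermark) {
        return watermark >= windowStart + size;
    }

    public static void main(String[] args) {
        long size = 10;
        Map<Long, List<Long>> windows = new TreeMap<>();
        long[] eventTimestamps = {1, 4, 12, 15, 23};
        // Assignment phase: bucket each event into its window.
        for (long ts : eventTimestamps) {
            windows.computeIfAbsent(assignWindow(ts, size),
                                    k -> new ArrayList<>()).add(ts);
        }
        long watermark = 20; // simulated event-time progress
        // Trigger phase: emit only the windows the watermark has passed.
        for (Map.Entry<Long, List<Long>> e : windows.entrySet()) {
            if (shouldFire(e.getKey(), size, watermark)) {
                System.out.println("fire window " + e.getKey() + " -> " + e.getValue());
            }
        }
    }
}
```

With a watermark of 20, the windows starting at 0 and 10 fire, while the window starting at 20 is still open and stays buffered.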
Hi Aljoscha,

Thanks for the quick response. I've seen the Google Dataflow presentation at Flink Forward and understand the concepts behind it (which are also supported by Flink). I will look further into the Stack Overflow thread and let you know if I have any further questions.

Once again, thanks,

Leonard