Hi all,
Recently I tried to migrate some old applications from Storm to Flink. In Storm, the window implementation (TupleWindow) provides two methods, getNew() and getExpired(), which supply the delta information of a window, and we therefore wrote some stateful caches that rely on them. However, Flink seems to deal with windows in a different way and supplies more "formalized" APIs. So, any tips on how to adapt these delta-aware caches to Flink, or how to refactor them to fit?

Thanks.

Best,
Xingcan
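For context, here is a minimal sketch of the delta-aware pattern described above, written against Storm's windowed-bolt API. The field layout and the running average are hypothetical, not the original caches:

    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseWindowedBolt;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.windowing.TupleWindow;

    public class DeltaAwareCacheBolt extends BaseWindowedBolt {

        // Hypothetical running aggregate, maintained incrementally from window deltas.
        private double runningSum = 0.0;
        private long count = 0L;

        @Override
        public void execute(TupleWindow window) {
            // Only the delta is touched: fold in the tuples that just arrived...
            for (Tuple t : window.getNew()) {
                runningSum += t.getDouble(0);
                count++;
            }
            // ...and retract the tuples that just expired from the window.
            for (Tuple t : window.getExpired()) {
                runningSum -= t.getDouble(0);
                count--;
            }
            double average = count == 0 ? 0.0 : runningSum / count;
            // Emit or cache the per-slide result here.
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Nothing declared in this sketch.
        }
    }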
I'm not aware of how windows work in Storm. If you could maybe give some details on your use case, we could figure out together how that would map to Flink windows.

Cheers,
Aljoscha

On Tue, 29 Nov 2016 at 15:47 xingcan <[hidden email]> wrote:
Hi Aljoscha,

First of all, sorry that I missed your prompt reply :( Over the past few days, I've been learning the implementation mechanism of windows in Flink. I think the main difference between windows in Storm and Flink (at the API level) is that Storm maintains only one window while Flink maintains several isolated windows. Because of that, Storm users can observe the changes to a window (tuples added and expired) and take action on each window modification (i.e., as a sliding window advances), while Flink users can only implement functions over one complete window after another, as if the windows were independent of each other (though they may actually share quite a few tuples).

Objectively speaking, the window API provided by Flink is more formalized and easier to use. However, for a sliding window with a large size and a short slide (e.g., 5m and 1s), each tuple will be computed redundantly (maybe 300 times in the example?). Though Flink provides the pane optimization, I think it's far from enough, since the optimization can only be applied to reduce functions, which restrict the input and output types to be the same. Other functions that should also avoid redundant computation cannot benefit from it, e.g., a MaxAndMin function that takes numbers as input and outputs a max & min pair, or an Average function.

Actually, I'm just wondering if a "mergeable fold function" could be added (just *like* this: https://en.wikipedia.org/wiki/Mergeable_heap). I know it may violate some principles of Flink (probably about state), but I insist that unnecessary computation should be avoided in stream processing. So, could you give some advice? I am all ears :) Or, if you think this is feasible, I'll think it through carefully and try to implement it.

Thank you and merry Christmas.

Best,
Xingcan

On Thu, Dec 1, 2016 at 7:56 PM, Aljoscha Krettek <[hidden email]> wrote:
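To make the idea above concrete, here is a rough, self-contained sketch (plain Java, not any Flink API) of the kind of "mergeable fold" being asked for: per-pane partial results that can be merged into each overlapping sliding window instead of re-folding every tuple:

    // Illustrative only: an Average accumulator whose partial results can be merged.
    public class AverageAccumulator {

        private double sum;
        private long count;

        // Fold a single input value into this accumulator.
        public void add(double value) {
            sum += value;
            count++;
        }

        // Merge another accumulator (e.g. the partial result of one pane).
        public void merge(AverageAccumulator other) {
            sum += other.sum;
            count += other.count;
        }

        // Extract the final result for a window.
        public double getAverage() {
            return count == 0 ? 0.0 : sum / count;
        }

        public static void main(String[] args) {
            // Two hypothetical 1s panes of a 5m/1s sliding window...
            AverageAccumulator pane1 = new AverageAccumulator();
            pane1.add(1.0);
            pane1.add(3.0);

            AverageAccumulator pane2 = new AverageAccumulator();
            pane2.add(5.0);

            // ...are combined once per window instead of re-processing each tuple.
            AverageAccumulator window = new AverageAccumulator();
            window.merge(pane1);
            window.merge(pane2);
            System.out.println(window.getAverage()); // 3.0
        }
    }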
Hi,

(I'm just getting back from holidays, therefore the slow response. Sorry for that.)

I think you can simulate the way Storm windows work by using a GlobalWindows assigner and having a custom Trigger and/or Evictor and also some special logic in your WindowFunction.

About mergeable state, we're actually in the process of adding something like this that would be a generalisation of reduce and fold: you can call it combine or aggregate. The idea is to have these operations:
- create accumulator
- add value to accumulator
- merge accumulators
- extract output from accumulator

You have three types: IN for incoming values, ACC for accumulators and OUT as the result of extracting output from an accumulator. This should cover most cases.

What do you think?

Cheers,
Aljoscha

On Thu, 22 Dec 2016 at 07:13 xingcan <[hidden email]> wrote:
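As an illustration of the four operations and the IN/ACC/OUT types described above, an interface along these lines might look as follows. The names and exact signatures here are illustrative, not necessarily what will ship in Flink:

    // Sketch of an aggregate interface with IN (input), ACC (accumulator), OUT (output).
    public interface Aggregator<IN, ACC, OUT> {

        // Create a fresh, empty accumulator.
        ACC createAccumulator();

        // Add a single input value to an accumulator.
        ACC add(IN value, ACC accumulator);

        // Merge two accumulators, e.g. partial results of panes or merged windows.
        ACC merge(ACC a, ACC b);

        // Extract the final output from an accumulator.
        OUT getResult(ACC accumulator);
    }

    // The MaxAndMin example mentioned earlier: numbers in, a {min, max} pair out.
    class MaxAndMin implements Aggregator<Double, double[], double[]> {

        @Override
        public double[] createAccumulator() {
            return new double[] { Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY };
        }

        @Override
        public double[] add(Double value, double[] acc) {
            acc[0] = Math.min(acc[0], value);
            acc[1] = Math.max(acc[1], value);
            return acc;
        }

        @Override
        public double[] merge(double[] a, double[] b) {
            return new double[] { Math.min(a[0], b[0]), Math.max(a[1], b[1]) };
        }

        @Override
        public double[] getResult(double[] acc) {
            return acc; // {min, max}
        }
    }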
Hi Aljoscha,

Thanks for your explanation. About the Storm windows simulation: we had tried your suggestion, but gave up due to its complexity and the feeling of "reinventing the wheel". Setting performance aside, most of our business-logic code has already been transformed to the "Flink style".

I am glad to hear that adding the accumulator is in progress. As far as I can see, the operations it supplies will adequately meet our demands. I will stay focused on this topic.

Best,
Xingcan

On Wed, Jan 11, 2017 at 7:28 PM, Aljoscha Krettek <[hidden email]> wrote:
Hi,

In fact, this was just merged: https://issues.apache.org/jira/browse/FLINK-5582. It will be released as part of Flink 1.3 in roughly 4 months. The feature is still a bit rough around the edges and needs some follow-up work, however.

Cheers,
Aljoscha

On Thu, 12 Jan 2017 at 11:10 Xingcan <[hidden email]> wrote:
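For readers following the thread, here is a minimal sketch of how the merged feature might be used for the 5m/1s sliding average discussed earlier. It is written from memory of the API around this release, so exact signatures (for instance whether add() returns the accumulator) may differ between Flink versions:

    import org.apache.flink.api.common.functions.AggregateFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class SlidingAverageSketch {

        // Average over (key, value) pairs: accumulator is (sum, count), output is the mean.
        public static class Average
                implements AggregateFunction<Tuple2<String, Double>, Tuple2<Double, Long>, Double> {

            @Override
            public Tuple2<Double, Long> createAccumulator() {
                return Tuple2.of(0.0, 0L);
            }

            @Override
            public Tuple2<Double, Long> add(Tuple2<String, Double> value, Tuple2<Double, Long> acc) {
                return Tuple2.of(acc.f0 + value.f1, acc.f1 + 1);
            }

            @Override
            public Tuple2<Double, Long> merge(Tuple2<Double, Long> a, Tuple2<Double, Long> b) {
                return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
            }

            @Override
            public Double getResult(Tuple2<Double, Long> acc) {
                return acc.f1 == 0 ? 0.0 : acc.f0 / acc.f1;
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical input stream of (key, value) pairs.
            DataStream<Tuple2<String, Double>> input =
                    env.fromElements(Tuple2.of("a", 1.0), Tuple2.of("a", 3.0), Tuple2.of("b", 5.0));

            // Incremental aggregation: each window keeps only a small (sum, count)
            // accumulator, instead of buffering all tuples and re-folding them on fire.
            DataStream<Double> averages = input
                    .keyBy(0)
                    .window(SlidingProcessingTimeWindows.of(Time.minutes(5), Time.seconds(1)))
                    .aggregate(new Average());

            averages.print();
            env.execute("Sliding average via aggregate()");
        }
    }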