Best way to wait for different events

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Best way to wait for different events

Lothium
Hi,
I have a question to a specific use case and hope that you can help me with
that. I have a streaming pipeline and receive events of different types. I
parse the events to their internal representation and do some
transformations on them. Some of these events I want to collect internally
(grouped by a transaction id) and as soon as a specific event arrives, I
want to emit for example an aggregated event to the downstream operators (so
this is not bound to time or a count).
I thought about keying the stream by some characteristic (so that all the
needed events are in the same logical partition), collect them in a stateful
process function and emit this after the specific event arrived.

Besides of the event types I also have to key the stream by a transaction
id, which all of these events are belong to (the transaction id is in all of
the events). The transaction id is unique and will only occur once, so I
will have a lot of unique short living keys.

I would clear the state of the process function after I have emitted the
aggregated event downstream, so this should hopefully release the state and
will clean it up. Is that correct? Would there be a problem, because of
these many keys that I would use (which will only be used once) or wouldn't
this be a problem and would Flink release the ressources (regarding memory
usage etc.)? Is this the right way to handle this use case?

Thanks!



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: Best way to wait for different events

Chesnay Schepler
Your approach with ProcessFunction should work in general.

Can you guarantee that no event can arrive for a transaction for which
an aggregated event
was already emitted?

On 26.11.2017 18:22, Lothium wrote:

> Hi,
> I have a question to a specific use case and hope that you can help me with
> that. I have a streaming pipeline and receive events of different types. I
> parse the events to their internal representation and do some
> transformations on them. Some of these events I want to collect internally
> (grouped by a transaction id) and as soon as a specific event arrives, I
> want to emit for example an aggregated event to the downstream operators (so
> this is not bound to time or a count).
> I thought about keying the stream by some characteristic (so that all the
> needed events are in the same logical partition), collect them in a stateful
> process function and emit this after the specific event arrived.
>
> Besides of the event types I also have to key the stream by a transaction
> id, which all of these events are belong to (the transaction id is in all of
> the events). The transaction id is unique and will only occur once, so I
> will have a lot of unique short living keys.
>
> I would clear the state of the process function after I have emitted the
> aggregated event downstream, so this should hopefully release the state and
> will clean it up. Is that correct? Would there be a problem, because of
> these many keys that I would use (which will only be used once) or wouldn't
> this be a problem and would Flink release the ressources (regarding memory
> usage etc.)? Is this the right way to handle this use case?
>
> Thanks!
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>

Reply | Threaded
Open this post in threaded view
|

Re: Best way to wait for different events

Lothium
Thanks for you response!
Yes, I think to 99.9% there shouldn't be a "late event" and I would also
implement a logic in the ProcessFunction, which checks for a specific order
of the events per transaction id.

Using the clear() function for the state should free the ressources and
using that many short running keys shouldn't be a problem, correct? As far
as I understand it right the keyby function hashes the key and every key
(with the data) is assigned to a worker / thread. There is nothing that gets
persisted here so that a high number of short living unique keys shouldn't
be a problem, correct?



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/