Hi,
I have a question about a specific use case and hope that you can help me with it. I have a streaming pipeline and receive events of different types. I parse the events into their internal representation and apply some transformations. Some of these events I want to collect internally (grouped by a transaction id), and as soon as a specific event arrives, I want to emit an aggregated event to the downstream operators (so the trigger is not bound to time or a count). My idea is to key the stream by some characteristic (so that all the needed events end up in the same logical partition), collect them in a stateful process function, and emit the result once the specific event has arrived.

Besides the event type, I also have to key the stream by a transaction id, which all of these events belong to (the transaction id is present in every event). Each transaction id is unique and will only occur once, so I will have a lot of unique, short-lived keys.

I would clear the state of the process function after emitting the aggregated event downstream, which should hopefully release the state and clean it up. Is that correct? Would the large number of keys (each used only once) be a problem, or would Flink release the resources (regarding memory usage etc.)? Is this the right way to handle this use case?

Thanks!
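PS: To make the idea more concrete, here is a rough sketch of the process function I have in mind (the Event and AggregatedEvent types, the isFinalEvent() check and the AggregatedEvent.from() factory are simplified placeholders, not real classes from my pipeline):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class TransactionAggregator extends ProcessFunction<Event, AggregatedEvent> {

    // keyed state: buffers all events seen so far for the current transaction id
    private transient ListState<Event> buffered;

    @Override
    public void open(Configuration parameters) {
        buffered = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffered-events", Event.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<AggregatedEvent> out)
            throws Exception {
        buffered.add(event);
        if (event.isFinalEvent()) { // the "specific event" that triggers the emission
            out.collect(AggregatedEvent.from(buffered.get()));
            buffered.clear(); // release the state for this transaction id
        }
    }
}

I would apply it after the keyBy, e.g. stream.keyBy(e -> e.getTransactionId()).process(new TransactionAggregator()).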
Your approach with ProcessFunction should work in general.
Can you guarantee that no event can arrive for a transaction for which an aggregated event was already emitted?
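If not, one option would be to additionally register a processing-time timer as a safety net that eventually drops leftover state for transactions whose final event never arrives. A rough sketch along the lines of what you described (the event types, the isFinalEvent() check and the one-hour timeout are placeholders):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class TransactionAggregatorWithCleanup extends ProcessFunction<Event, AggregatedEvent> {

    private static final long CLEANUP_DELAY_MS = 60 * 60 * 1000; // placeholder: 1 hour

    private transient ListState<Event> buffered;

    @Override
    public void open(Configuration parameters) {
        buffered = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffered-events", Event.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<AggregatedEvent> out)
            throws Exception {
        buffered.add(event);
        // arm a cleanup timer relative to the latest event for this key
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + CLEANUP_DELAY_MS);
        if (event.isFinalEvent()) {
            out.collect(AggregatedEvent.from(buffered.get()));
            buffered.clear();
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<AggregatedEvent> out)
            throws Exception {
        // fires once per registered timer; clearing is a no-op if the aggregate
        // was already emitted, otherwise it drops the orphaned state
        buffered.clear();
    }
}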
Thanks for your response!

Yes, I think in 99.9% of cases there shouldn't be a "late event", and I would also implement logic in the ProcessFunction that checks for a specific order of the events per transaction id.

So calling clear() on the state should free the resources, and using that many short-lived keys shouldn't be a problem, correct? As far as I understand it, keyBy hashes the key, and every key (with its data) is assigned to a worker/thread. Nothing is persisted for a key itself, so a high number of short-lived unique keys shouldn't be a problem, correct?