Find the running median from a data stream

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Find the running median from a data stream

gdibernardo
Hi guys,

I want to keep track of the running median of a keyed data stream. I was considering to apply a RichMapFunction to the stream and store in a ValueState object two heaps (PriorityQueue) in order to find the running median. However, I am not really sure if this is the best approach performance-wise. Do you have some suggestions for me or have you ever done something similar?

Thank you so much!

Best,


Gabriele
Reply | Threaded
Open this post in threaded view
|

Re: Find the running median from a data stream

Fabian Hueske-2
Hi Gabriele,

I don't think you can compute the exact running median on a stream.
This would require to collect all elements of the stream so you would basically need to put the complete stream into the ValueState.
Even if the state is backed by RocksDB, the state for a specific key needs to fit on the heap when it is accessed. Moreover, the ValueState would be completely serialized and deserialized for every access which is of course expensive if the whole stream is materialized.

So, computing an exact running median requires a lot of storage and is expensive to compute.
You might be able to implement an approximate median using adaptive histograms.

Best, Fabian

2017-07-23 13:28 GMT+02:00 Gabriele Di Bernardo <[hidden email]>:
Hi guys,

I want to keep track of the running median of a keyed data stream. I was considering to apply a RichMapFunction to the stream and store in a ValueState object two heaps (PriorityQueue) in order to find the running median. However, I am not really sure if this is the best approach performance-wise. Do you have some suggestions for me or have you ever done something similar?

Thank you so much!

Best,


Gabriele