How do I backfill time series data?
Posted by Marco Villalobos-2
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/How-do-I-backfill-time-series-data-tp35946.html
Hello Flink community. I need help. Thus far, Flink has proven very useful to me.
I am using it for stream processing of time-series data.
For the purposes of this question, let's say each time-series element has the fields: name: String, value: double, and timestamp: Instant.
I named the stream timeSeriesDataStream.
My first task was to average the time series by name within a 15-minute tumbling event-time window.
I was able to solve this with a ProcessWindowFunction (I had to use this approach because the watermark is not keyed). I named the resulting stream aggregateTimeSeriesDataStream and then wrote its values to a sink.
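To make the first task concrete, here is a minimal sketch of the per-name window average in plain Java. It is not my Flink job and does not use the Flink API; the names Sample, WindowAverages, windowStart and average are hypothetical and exist only for illustration. It assumes windows are aligned to the epoch, which is how Flink's tumbling windows behave by default.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: plain Java, no Flink classes involved.
final class WindowAverages {
    // Mirrors the fields described above: name, value, timestamp.
    record Sample(String name, double value, Instant timestamp) {}

    static final Duration WINDOW = Duration.ofMinutes(15);

    // Start of the 15-minute tumbling window containing t (epoch-aligned).
    static Instant windowStart(Instant t) {
        long size = WINDOW.toMillis();
        long ms = t.toEpochMilli();
        return Instant.ofEpochMilli(ms - Math.floorMod(ms, size));
    }

    // Returns windowStart -> name -> average value within that window.
    static Map<Instant, Map<String, Double>> average(List<Sample> samples) {
        // Accumulate {sum, count} per window and name.
        Map<Instant, Map<String, double[]>> acc = new TreeMap<>();
        for (Sample s : samples) {
            double[] sumCount = acc
                .computeIfAbsent(windowStart(s.timestamp()), k -> new TreeMap<>())
                .computeIfAbsent(s.name(), k -> new double[2]);
            sumCount[0] += s.value();
            sumCount[1] += 1;
        }
        // Turn each {sum, count} pair into an average.
        Map<Instant, Map<String, Double>> out = new TreeMap<>();
        acc.forEach((start, byName) -> {
            Map<String, Double> avgs = new TreeMap<>();
            byName.forEach((name, sc) -> avgs.put(name, sc[0] / sc[1]));
            out.put(start, avgs);
        });
        return out;
    }
}
```

In the real job the grouping by name and window is done by keyBy and the window assigner; this sketch only shows the result I expect per window.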
My next task is to backfill the per-name averages into subsequent windows. This means that if a name does not appear in a subsequent window, then its previous average value should be used for that window.
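To pin down the semantics I am after, here is a minimal batch-style sketch in plain Java. It is not a Flink solution, and the names Backfill and backfill are hypothetical. It assumes the per-window averages are already available as a map (window start -> name -> average) and that the first and last window starts are known, which is exactly what I do not have in a streaming job.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: shows the desired output, not how to get it in Flink.
final class Backfill {
    // For every window from first to last (inclusive), emit the average of each
    // name seen so far; names missing from a window reuse their previous average.
    static Map<Instant, Map<String, Double>> backfill(
            Map<Instant, Map<String, Double>> averages,
            Instant first, Instant last, Duration size) {
        Map<String, Double> lastSeen = new TreeMap<>();
        Map<Instant, Map<String, Double>> out = new TreeMap<>();
        for (Instant w = first; !w.isAfter(last); w = w.plus(size)) {
            // Fresh averages for this window overwrite the carried-forward ones.
            lastSeen.putAll(averages.getOrDefault(w, Map.of()));
            // Snapshot the state so later windows do not mutate earlier results.
            out.put(w, new TreeMap<>(lastSeen));
        }
        return out;
    }
}
```

For example, if name "a" has an average of 2.0 in the 10:00 window and no data in the 10:15 window, the 10:15 output should still contain "a" with 2.0.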
How do I do this?
I started by applying a map function to aggregateTimeSeriesDataStream to change each timestamp back 15 minutes, and named the resulting stream backfilledDataStream.
Now, I am stuck. I suspect that I should either:
1) call timeSeriesDataStream.coGroup(backfilledDataStream) and add a CoGroupWindowFunction to process the backfill, or
2) use "iterate" to somehow jury-rig a backfill.
I really don't know. That's why I am asking this group for advice.
What's the common solution for this problem? I am quite sure that this is a very common use-case.