Flink for historical time series processing


Flink for historical time series processing

Mindaugas Zickus
Hi All,



I wonder if Flink is the right tool for processing historical time series data, e.g. many small files.

Our use case: we have clickstream histories (time series) of many users. We would like to calculate user-specific sliding count window aggregates over past periods for a sample of users, to create features for training machine learning models.
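To make that concrete, here is roughly the kind of per-user sliding count window I have in mind, as a minimal sketch with the DataStream API (the tuple layout, window sizes and the simple sum are just placeholders, not our real features):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PerUserCountWindows {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // (userId, click count) events; the real source would be the stored clickstream history
    DataStream<Tuple2<String, Long>> clicks = env.fromElements(
        Tuple2.of("user-1", 1L), Tuple2.of("user-1", 1L), Tuple2.of("user-2", 1L));

    clicks
        .keyBy(0)              // partition the stream by user id
        .countWindow(100, 10)  // last 100 events per user, sliding every 10 events
        .sum(1)                // clicks per window, as a stand-in for a real feature
        .print();

    env.execute("per-user sliding count windows");
  }
}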

As I see it, Flink would load user histories from some NoSQL database (e.g. HBase), process them, and publish the aggregates for machine learning. Flink would also update the user histories with new events.

I wonder whether it is equally efficient to load and process each user history in parallel, or whether it's better to create one big dataset with multiple user histories and run a single map-reduce-style job on it?
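For the "one big dataset" option, I picture a single grouped aggregation over the histories of all sampled users, something like this sketch with the DataSet API (placeholder types and a trivial per-user sum, just to show the shape of the job):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class AllUsersInOneJob {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // one dataset containing the histories of all sampled users
    DataSet<Tuple2<String, Long>> events = env.fromElements(
        Tuple2.of("user-1", 1L), Tuple2.of("user-2", 1L), Tuple2.of("user-1", 1L));

    // a single grouped aggregation instead of one job per user
    events
        .groupBy(0)   // user id
        .sum(1)       // total clicks per user, as a stand-in feature
        .print();     // print() triggers execution of the batch job
  }
}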

The first approach is more attractive, since we could use the same event aggregation code both for processing historical user data to train models and for aggregating real-time user events into features for model execution.

thanks, Mindis




Re: Flink for historical time series processing

Jamie Grier
Hi Mindis,

This does actually sound like a good use case for Flink. Without knowing more details it's a bit hard to say which of the options you mention would be most efficient, but my gut feeling is that the "one big dataset" approach would be the way to go.

I think there is probably a simplified workflow here where you could unify both the historical and the real-time processing into a single Flink job.
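As a very rough sketch of what I mean (not a recommendation of specific connectors; the Kafka topic, file path, and line format below are purely illustrative), the windowing logic lives in one reusable function and you only swap the source between a replay of stored histories and the live clickstream:

import java.util.Properties;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class UnifiedFeatureJob {

  // Shared aggregation logic: identical for the historical backfill and for real-time features.
  static DataStream<Tuple2<String, Long>> featureWindows(DataStream<Tuple2<String, Long>> events) {
    return events
        .keyBy(0)              // user id
        .countWindow(100, 10)  // per-user sliding count window
        .sum(1);               // clicks per window, as a stand-in feature
  }

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    boolean backfill = args.length > 0 && "--backfill".equals(args[0]);

    DataStream<String> raw;
    if (backfill) {
      // replay stored user histories (could just as well be a scan over HBase)
      raw = env.readTextFile("hdfs:///clickstream/history");
    } else {
      // consume the live clickstream
      Properties props = new Properties();
      props.setProperty("bootstrap.servers", "localhost:9092");
      props.setProperty("group.id", "feature-job");
      raw = env.addSource(new FlinkKafkaConsumer09<>("clicks", new SimpleStringSchema(), props));
    }

    // parse "userId,..." lines into (userId, 1) events
    DataStream<Tuple2<String, Long>> events = raw.map(
        new MapFunction<String, Tuple2<String, Long>>() {
          @Override
          public Tuple2<String, Long> map(String line) {
            return Tuple2.of(line.split(",")[0], 1L);
          }
        });

    featureWindows(events).print();
    env.execute(backfill ? "feature-backfill" : "feature-live");
  }
}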

-Jamie


--

Jamie Grier
data Artisans, Director of Applications Engineering