Possible use case: Simulating iterative batch processing by rewinding source


Possible use case: Simulating iterative batch processing by rewinding source

Raul Kripalani
Hello,

I'm getting started with Flink for a use case that could leverage the windowing capabilities that Flink offers and Spark does not.

Basically, I have dumps of timeseries data (10 years, in ticks) from which I need to calculate many metrics based on event time. Note: I don't have the metrics defined beforehand; this is going to be an exploratory and iterative data-analytics effort.

Flink doesn't seem to support windows in batch processing, so I'm thinking about emulating batch by using the Kafka stream connector and rewinding the data stream for every new metric I calculate, so that each run processes the full timeseries as a batch.

Each metric I calculate would in turn be sent to another Kafka topic so I can use it in a subsequent processing batch, e.g.:

Iteration 1)   raw timeseries data ---> metric1
Iteration 2)   raw timeseries data + metric1 (composite) ---> metric2
Iteration 3)   metric1 + metric2 ---> metric3
Iteration 4)   raw timeseries data + metric3 ---> metric4
...
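A plain-Python sketch of how these iterations could compose (no Flink or Kafka here; the metric definitions and all names are made up purely for illustration): each pass "rewinds" to the start of the raw series and derives a new metric stream that later passes can consume.

```python
# Plain-Python model of the iterative replay idea: each "iteration"
# rewinds to the start of the raw series and derives a new metric
# stream; later iterations may consume it alongside the raw data.

raw = [(t, float(t % 5)) for t in range(10)]  # (event_time, value) ticks

def replay(*streams):
    """'Rewind': merge the given streams, sorted by event time."""
    return sorted((rec for s in streams for rec in s), key=lambda r: r[0])

# Iteration 1: raw -> metric1 (a running sum, as a stand-in metric)
metric1, acc = [], 0.0
for t, v in replay(raw):
    acc += v
    metric1.append((t, acc))

# Iteration 2: raw + metric1 -> metric2 (value minus running sum)
sums = dict(metric1)
metric2 = [(t, v - sums[t]) for t, v in replay(raw)]
```

Swapping the lists for Kafka topics and each loop for an event-time windowed Flink job would be the streaming equivalent of one pass.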

Does this sound like a use case for Flink? Could you give me some guidance on whether this is currently feasible?

Cheers,

Raúl Kripalani
PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data and Messaging Engineer
Blog: raul.io | twitter: @raulvk

Re: Possible use case: Simulating iterative batch processing by rewinding source

Christophe Salperwyck
Hi,

I am interested in this too. For my part, I was thinking of using HBase as a backend so that my data is stored sorted, which is convenient for generating the timeseries in the right order.
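A tiny sketch of the row-key trick behind this (a dict standing in for the HBase table; the key format and all names are illustrative): keys of the form `<series>#<zero-padded epoch>` sort lexicographically in time order, so a scan returns the ticks already ordered.

```python
# Dict standing in for an HBase table: Python's lexicographic key
# ordering models HBase's sorted row keys.

def row_key(series_id: str, epoch_ms: int) -> str:
    # Zero-pad so lexicographic order matches numeric (time) order.
    return f"{series_id}#{epoch_ms:013d}"

table = {}
for epoch_ms, value in [(1459975342000, 1.5), (1459975341000, 0.5)]:
    table[row_key("ticker-X", epoch_ms)] = value

# A "scan" is iteration over sorted keys: values arrive in time order.
ordered = [table[k] for k in sorted(table)]
```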

Cheers,
Christophe

2016-04-06 21:22 GMT+02:00 Raul Kripalani <[hidden email]>:

Re: Possible use case: Simulating iterative batch processing by rewinding source

Raul Kripalani
Hello,

Perhaps the description of the use case wasn't clear enough? Please let me know.

I would appreciate the community's feedback. Even if it's to tell me that this iterative, batch, windowed approach isn't currently possible, that's OK!

Cheers,

Raúl Kripalani
PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data and Messaging Engineer
Blog: raul.io | twitter: @raulvk

On Wed, Apr 6, 2016 at 9:00 PM, Christophe Salperwyck <[hidden email]> wrote:

Re: Possible use case: Simulating iterative batch processing by rewinding source

Ufuk Celebi
On Mon, Apr 11, 2016 at 10:26 AM, Raul Kripalani <[hidden email]> wrote:
> Would appreciate the feedback of the community. Even if it's to inform that
> currently this iterative, batch, windowed approach is not possible, that's
> ok!

Hey Raul!

What you describe should work with Flink. This is actually the way to
go to replace your batch processor with a stream processor ;-). Rewind
your stream and re-run the streaming job. To get accurate and
repeatable results you have to work with event time. Otherwise, the
results for windows will vary from run to run.
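Ufuk's point about event time can be illustrated with a small simulation (plain Python, hypothetical data): window assignment uses only the timestamp carried by each record, so two replays that deliver the same records in different arrival orders produce identical window results.

```python
# Event-time windows depend only on record timestamps, not arrival
# order, so replays are repeatable. Processing-time windows would not be.
from collections import defaultdict
import random

records = [(t, 1.0) for t in range(20)]  # (event_time, value)

def window_sums(stream, size=5):
    wins = defaultdict(float)
    for t, v in stream:
        wins[t // size] += v  # window assignment uses event time only
    return dict(wins)

run1 = window_sums(records)
shuffled = records[:]
random.shuffle(shuffled)       # a "re-run" with a different arrival order
run2 = window_sums(shuffled)
assert run1 == run2            # identical results across replays
```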

The only problems I see concern how to control the different
iterations of your program. Do the iterations have to proceed one
after the other, e.g. first finish 1, then start 2, etc.?

– Ufuk

Re: Possible use case: Simulating iterative batch processing by rewinding source

rmetzger0
Flink's DataStream API also allows reading files from disk (local, HDFS, etc.), so you don't have to set up Kafka to make this work (if you already have Kafka, you can of course use it).
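The file-backed replay can be modeled in plain Python as well (record format and names are illustrative): "rewinding" is simply re-reading the source from the start for each iteration.

```python
# Sketch of a file-backed replay: one timestamped record per line;
# rewinding for the next iteration is just re-opening the source.
import io

DATA = "0,1.5\n1,2.0\n2,0.5\n"  # epoch,value per line (stand-in for a file)

def read_ticks(source: str):
    for line in io.StringIO(source):
        epoch, value = line.strip().split(",")
        yield int(epoch), float(value)

# Each iteration re-reads the full series from the start:
total = sum(v for _, v in read_ticks(DATA))   # iteration 1
count = sum(1 for _ in read_ticks(DATA))      # iteration 2, "rewound"
```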

On Mon, Apr 11, 2016 at 11:08 AM, Ufuk Celebi <[hidden email]> wrote: