Re: State snapshotting when source is finite

Posted by Till Rohrmann on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/State-snapshotting-when-source-is-finite-tp16398p16423.html

Hi Flavio,

this kind of feature is indeed useful and currently not supported by Flink. I think, however, that this feature is a bit trickier to implement, because Tasks cannot currently initiate checkpoints/savepoints on their own. This would entail some changes to the lifecycle of a Task and an extra communication step with the JobManager. However, none of this is impossible to do.

Please open a JIRA issue with the description of the problem where we can continue the discussion.

Cheers,
Till

On Thu, Oct 26, 2017 at 9:58 AM, Fabian Hueske <[hidden email]> wrote:
Hi Flavio,

Thanks for bringing up this topic.
I think running periodic jobs with state that gets restored and persisted in a savepoint is a very valid use case and would fit the "stream is a superset of batch" story quite well.
I'm not sure if this behavior is already supported, but I think this would be a desirable feature.

I'm looping in Till and Aljoscha who might have some thoughts on this as well.
Depending on the discussion we should open a JIRA for this feature.

Cheers, Fabian

2017-10-25 10:31 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Hi to all,
in my current use case I'd like to improve one step of our batch pipeline.
There's one specific job that ingests a tabular dataset (of Rows) and explodes it into a set of RDF statements (as Tuples). The objects we output are containers of those Tuples (grouped by a field).
Flink stateful streaming could be a perfect fit here because we incrementally grow the state of those containers without having to spend a lot of time performing GET operations against an external key-value store.
The big problem here is that the sources are finite and the state of the job gets lost once the job ends, whereas I expected Flink to snapshot the state of its operators before exiting.

This idea was inspired by https://data-artisans.com/blog/queryable-state-use-case-demo#no-external-store, with the difference that one can resume the state of the stateful application only when required.
Do you think that it could be possible to support such a use case (that we can summarize as "periodic batch jobs that pick up where they left off")?
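
To make the intended behavior concrete, here is a minimal sketch in plain Java. It deliberately uses no Flink APIs and only mimics the desired lifecycle: restore state, consume a finite input, persist state on exit. The class name PeriodicBatchState, the tab-separated snapshot file, and the (subject, predicate, object) row layout are all assumptions made for illustration, not part of our actual job.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only: plain Java, no Flink APIs. */
public class PeriodicBatchState {

    /** Restore the per-key containers written by a previous run, or start empty. */
    public static Map<String, List<String>> restore(Path file) throws IOException {
        Map<String, List<String>> state = new TreeMap<>();
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String[] parts = line.split("\t", 2); // key <TAB> statement
                if (parts.length == 2) {
                    state.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
                }
            }
        }
        return state;
    }

    /** Explode one row (subject, predicate, object) into a statement, grouped by subject. */
    public static void process(Map<String, List<String>> state, String[] row) {
        state.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1] + " " + row[2]);
    }

    /** Persist the containers once the finite input is exhausted. */
    public static void persist(Map<String, List<String>> state, Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : state.entrySet()) {
            for (String statement : e.getValue()) {
                lines.add(e.getKey() + "\t" + statement);
            }
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    /** One periodic run: restore, consume the finite input, snapshot before exiting. */
    public static Map<String, List<String>> run(Path file, String[][] rows) throws IOException {
        Map<String, List<String>> state = restore(file);
        for (String[] row : rows) {
            process(state, row);
        }
        persist(state, file);
        return state;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("containers", ".tsv");
        // First run: the state starts empty and is persisted at the end.
        run(file, new String[][] {{"s1", "p1", "o1"}, {"s2", "p1", "o2"}});
        // Second run: picks up where the first one left off.
        Map<String, List<String>> state = run(file, new String[][] {{"s1", "p2", "o3"}});
        System.out.println(state); // prints {s1=[p1 o1, p2 o3], s2=[p1 o2]}
        Files.delete(file);
    }
}
```

In Flink terms, restore and persist would correspond to starting from a savepoint and taking one when the finite source ends; the second step is the part that seems to be missing today.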

Best,
Flavio