State snapshotting when source is finite

State snapshotting when source is finite

Flavio Pompermaier
Hi to all,
in my current use case I'd like to improve one step of our batch pipeline.
There's one specific job that ingests a tabular dataset (of Rows) and explodes it into a set of RDF statements (as Tuples). The objects we output are containers of those Tuples (grouped by a field).
Flink stateful streaming could be a perfect fit here because we incrementally grow the state of those containers without having to spend a lot of time performing GET operations against an external key-value store.
The big problem here is that the sources are finite and the state of the job gets lost once the job ends, while I was expecting Flink to snapshot the state of its operators before exiting.

This idea was inspired by https://data-artisans.com/blog/queryable-state-use-case-demo#no-external-store, with the difference that one can resume the state of the stateful application only when required.
Do you think it would be possible to support such a use case (which we can summarize as "periodic batch jobs that pick up where they left off")?
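
To make the request concrete, here is a minimal sketch of the behaviour I have in mind, written in plain Java rather than against the Flink API. Everything in it is an assumption for illustration only: the class name ResumableContainerJob, the tab-separated snapshot file, and the use of Strings for keys and statements are all hypothetical. Each run restores the containers from the previous snapshot (if any), consumes a finite input, and writes a new snapshot before exiting.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a "periodic batch job that picks up where it left off".
// Plain Java, NOT Flink API: every name here is hypothetical.
class ResumableContainerJob {

    // Keyed state: one container (list of statements) per grouping key.
    private final Map<String, List<String>> containers = new TreeMap<>();

    // Restore the state written by a previous run, if a snapshot exists.
    // Assumes keys contain no tab and statements contain no newline.
    static ResumableContainerJob restore(Path snapshot) throws IOException {
        ResumableContainerJob job = new ResumableContainerJob();
        if (Files.exists(snapshot)) {
            for (String line : Files.readAllLines(snapshot)) {
                String[] parts = line.split("\t", 2);
                job.add(parts[0], parts[1]);
            }
        }
        return job;
    }

    // Append one statement to the container of its key.
    void add(String key, String statement) {
        containers.computeIfAbsent(key, k -> new ArrayList<>()).add(statement);
    }

    // Consume a finite input, then snapshot the state before "exiting".
    void runOnce(List<Map.Entry<String, String>> finiteInput, Path snapshot)
            throws IOException {
        for (Map.Entry<String, String> record : finiteInput) {
            add(record.getKey(), record.getValue());
        }
        snapshot(snapshot);
    }

    // Write the whole state as one "key<TAB>statement" line per statement.
    void snapshot(Path snapshot) throws IOException {
        List<String> lines = new ArrayList<>();
        containers.forEach((key, statements) ->
                statements.forEach(s -> lines.add(key + "\t" + s)));
        Files.write(snapshot, lines);
    }

    Map<String, List<String>> containers() {
        return containers;
    }

    public static void main(String[] args) throws IOException {
        Path snapshot = Files.createTempDirectory("job").resolve("containers.tsv");

        // First periodic run: no snapshot yet, so the state starts empty.
        ResumableContainerJob first = restore(snapshot);
        first.runOnce(List.of(Map.entry("s1", "p1 o1"), Map.entry("s2", "p2 o2")), snapshot);

        // Second periodic run: resumes from the state left by the first one.
        ResumableContainerJob second = restore(snapshot);
        second.runOnce(List.of(Map.entry("s1", "p3 o3")), snapshot);
        System.out.println(second.containers());
    }
}
```

In the sketch the second run sees the statements accumulated by the first one without querying any external store, which is the part I was hoping Flink's operator state could cover.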

Best,
Flavio
Re: State snapshotting when source is finite

Fabian Hueske
Hi Flavio,

Thanks for bringing up this topic.
I think running periodic jobs with state that gets restored and persisted in a savepoint is a very valid use case and would fit the "stream is a superset of batch" story quite well.
I'm not sure if this behavior is already supported, but I think this would be a desirable feature.

I'm looping in Till and Aljoscha who might have some thoughts on this as well.
Depending on the discussion we should open a JIRA for this feature.

Cheers, Fabian

In reply to Flavio Pompermaier's message of 2017-10-25 10:31 GMT+02:00.

Re: State snapshotting when source is finite

Till Rohrmann
Hi Flavio,

This kind of feature is indeed useful and currently not supported by Flink. I think, however, that it is a bit trickier to implement, because Tasks cannot currently initiate checkpoints/savepoints on their own. Supporting it would entail some changes to the lifecycle of a Task and an extra communication step with the JobManager. Nothing impossible to do, though.

Please open a JIRA issue with a description of the problem, where we can continue the discussion.

Cheers,
Till

In reply to Fabian Hueske's message of Thu, Oct 26, 2017 at 9:58 AM.

Re: State snapshotting when source is finite

Flavio Pompermaier
Done: https://issues.apache.org/jira/browse/FLINK-7930

Best,
Flavio

In reply to Till Rohrmann's message of Thu, Oct 26, 2017 at 10:52 AM.