Question about concurrent checkpoints

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Question about concurrent checkpoints

Narendra Joshi
Hi,

How are concurrent snapshots taken for an operator?
Let's say an operator receives barriers for a checkpoint from all of
its inputs. It triggers the checkpoint. Now, the checkpoint starts
getting saved asynchronously. Before the checkpoint is acknowledged,
the operator receives all barriers for all inputs for the next
checkpoint. What will happen in this case if no concurrent checkpoints
are allowed (i.e. the default value is used)? What will happen if
concurrent checkpoints are allowed?

Thanks,
Narendra Joshi
Reply | Threaded
Open this post in threaded view
|

Re: Question about concurrent checkpoints

Nico Kruber
Hi Narendra,
according to [1], even with asynchronous state snapshots (see [2]), a
checkpoint is only complete after all sinks have received the barriers and all
(asynchronous) snapshots have been processed. Since, if the number of
concurrent checkpoints is 0, no checkpoint barriers will be emitted until the
previous checkpoint is complete (see [1]), you will not get into the situation
where two asynchronous snapshots are being taken concurrently.

If you enable concurrent checkpoints and asynchronous snapshots , they will
process concurrently but on different snapshots of the state, i.e. although
they are running in parallel, each stores the expected state.
If you want to know more about the details of how this is done, I can
recommend Stefan's (cc'd) talk at Flink Forward last week [4]. He may also be
able to answer in more detail in case I missed something.



Nico


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/
stream_checkpointing.html#asynchronous-state-snapshots
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.3/ops/
state_backends.html
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/
checkpointing.html
[4] https://www.youtube.com/watch?
v=dWQ24wERItM&index=36&list=PLDX4T_cnKjD0JeULl1X6iTn7VIkDeYX_X

On Thursday, 21 September 2017 13:26:17 CEST Narendra Joshi wrote:

> Hi,
>
> How are concurrent snapshots taken for an operator?
> Let's say an operator receives barriers for a checkpoint from all of
> its inputs. It triggers the checkpoint. Now, the checkpoint starts
> getting saved asynchronously. Before the checkpoint is acknowledged,
> the operator receives all barriers for all inputs for the next
> checkpoint. What will happen in this case if no concurrent checkpoints
> are allowed (i.e. the default value is used)? What will happen if
> concurrent checkpoints are allowed?
>
> Thanks,
> Narendra Joshi


signature.asc (201 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Question about concurrent checkpoints

Narendra Joshi
Nico Kruber <[hidden email]> writes:

> Hi Narendra,
> according to [1], even with asynchronous state snapshots (see [2]), a
> checkpoint is only complete after all sinks have received the barriers and all
> (asynchronous) snapshots have been processed. Since, if the number of
> concurrent checkpoints is 0, no checkpoint barriers will be emitted until the
> previous checkpoint is complete (see [1]), you will not get into the situation
> where two asynchronous snapshots are being taken concurrently.
Does this mean that operators would stop processing streams (because
they received all barriers for a new checkpoint) and wait for
the ongoing asynchronous checkpoint to complete or it means that no
barriers would be injected into sources before checkpoint finishes?

> If you enable concurrent checkpoints and asynchronous snapshots , they will
> process concurrently but on different snapshots of the state, i.e. although
> they are running in parallel, each stores the expected state.
> If you want to know more about the details of how this is done, I can
> recommend Stefan's (cc'd) talk at Flink Forward last week [4]. He may also be
> able to answer in more detail in case I missed something.
Thanks for the reference to the talk! :)

>
>
> Nico
>
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/
> stream_checkpointing.html#asynchronous-state-snapshots
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.3/ops/
> state_backends.html
> [3] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/
> checkpointing.html
> [4] https://www.youtube.com/watch?
> v=dWQ24wERItM&index=36&list=PLDX4T_cnKjD0JeULl1X6iTn7VIkDeYX_X
>
> On Thursday, 21 September 2017 13:26:17 CEST Narendra Joshi wrote:
>> Hi,
>>
>> How are concurrent snapshots taken for an operator?
>> Let's say an operator receives barriers for a checkpoint from all of
>> its inputs. It triggers the checkpoint. Now, the checkpoint starts
>> getting saved asynchronously. Before the checkpoint is acknowledged,
>> the operator receives all barriers for all inputs for the next
>> checkpoint. What will happen in this case if no concurrent checkpoints
>> are allowed (i.e. the default value is used)? What will happen if
>> concurrent checkpoints are allowed?
>>
>> Thanks,
>> Narendra Joshi
>

--
Narendra Joshi
Reply | Threaded
Open this post in threaded view
|

Re: Question about concurrent checkpoints

Nico Kruber
On Thursday, 21 September 2017 20:08:01 CEST Narendra Joshi wrote:

> Nico Kruber <[hidden email]> writes:
> > according to [1], even with asynchronous state snapshots (see [2]), a
> > checkpoint is only complete after all sinks have received the barriers and
> > all (asynchronous) snapshots have been processed. Since, if the number of
> > concurrent checkpoints is 0, no checkpoint barriers will be emitted until
> > the previous checkpoint is complete (see [1]), you will not get into the
> > situation where two asynchronous snapshots are being taken concurrently.
>
> Does this mean that operators would stop processing streams (because
> they received all barriers for a new checkpoint) and wait for
> the ongoing asynchronous checkpoint to complete or it means that no
> barriers would be injected into sources before checkpoint finishes?
The latter (as mentioned): no new barriers are injected into the sources.

The only thing that is waiting for asynchronous state snapshots to complete is
the checkpoint coordinator (in any case!) since a checkpoint is only complete
once all operators have stored their state. Operation continues as expected.


Nico

signature.asc (201 bytes) Download Attachment