Barriers at work

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Barriers at work

Srikanth
Hello,

I was reading about Flink's checkpoint and wanted to check if I correctly understood the usage of barriers for exactly once processing.
 1) Operator does alignment by buffering records coming after a barrier until it receives barrier from all upstream operators instances.
 2) Barrier is always preceded by a watermark to trigger processing all windows that are complete.
 3) Records in windows that are not triggered are also saved as part of checkpoint. These windows are repopulated when restoring from checkpoints. 

In production setups, were there any cases where alignment during checkpointing caused unacceptable latency?
If so, is there a way to indicate say wait for a MAX 100 ms? That way we have exactly-once in most situations but prefer at least once over higher latency in corner cases.

Srikanth
Reply | Threaded
Open this post in threaded view
|

Re: Barriers at work

Matthias J. Sax-2
I don't think barries can "expire" as of now. Might be a nice idea
thought -- I don't know if this might be a problem in production.

Furthermore, I want to point out, that an "expiring checkpoint" would
not break exactly-once processing, as the latest successful checkpoint
can always be used to recover correctly. Only the recovery-time would be
increase. because if a "barrier expires" and no checkpoint can be
stored, more data has to be replayed using the "old" checkpoint".


-Matthias

On 05/12/2016 09:21 PM, Srikanth wrote:

> Hello,
>
> I was reading about Flink's checkpoint and wanted to check if I
> correctly understood the usage of barriers for exactly once processing.
>  1) Operator does alignment by buffering records coming after a barrier
> until it receives barrier from all upstream operators instances.
>  2) Barrier is always preceded by a watermark to trigger processing all
> windows that are complete.
>  3) Records in windows that are not triggered are also saved as part of
> checkpoint. These windows are repopulated when restoring from checkpoints.
>
> In production setups, were there any cases where alignment during
> checkpointing caused unacceptable latency?
> If so, is there a way to indicate say wait for a MAX 100 ms? That way we
> have exactly-once in most situations but prefer at least once over
> higher latency in corner cases.
>
> Srikanth


signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Barriers at work

Stephan Ewen
Hi Srikanth!

That is an interesting idea.
I have it on my mind to create a design doc for checkpointing improvements. That could be added as a proposal there.

I hope I'll be able to start with that design doc next week.

Greetings,
Stephan


On Fri, May 13, 2016 at 1:35 PM, Matthias J. Sax <[hidden email]> wrote:
I don't think barries can "expire" as of now. Might be a nice idea
thought -- I don't know if this might be a problem in production.

Furthermore, I want to point out, that an "expiring checkpoint" would
not break exactly-once processing, as the latest successful checkpoint
can always be used to recover correctly. Only the recovery-time would be
increase. because if a "barrier expires" and no checkpoint can be
stored, more data has to be replayed using the "old" checkpoint".


-Matthias

On 05/12/2016 09:21 PM, Srikanth wrote:
> Hello,
>
> I was reading about Flink's checkpoint and wanted to check if I
> correctly understood the usage of barriers for exactly once processing.
>  1) Operator does alignment by buffering records coming after a barrier
> until it receives barrier from all upstream operators instances.
>  2) Barrier is always preceded by a watermark to trigger processing all
> windows that are complete.
>  3) Records in windows that are not triggered are also saved as part of
> checkpoint. These windows are repopulated when restoring from checkpoints.
>
> In production setups, were there any cases where alignment during
> checkpointing caused unacceptable latency?
> If so, is there a way to indicate say wait for a MAX 100 ms? That way we
> have exactly-once in most situations but prefer at least once over
> higher latency in corner cases.
>
> Srikanth


Reply | Threaded
Open this post in threaded view
|

Re: Barriers at work

Srikanth
Thanks Matthias & Stephan!

Yes, if we choose to fail checkpoint on expiry, we can restore from previous checkpoint.

Looking forward to read the new design proposal.

Srikanth


On Fri, May 13, 2016 at 8:09 AM, Stephan Ewen <[hidden email]> wrote:
Hi Srikanth!

That is an interesting idea.
I have it on my mind to create a design doc for checkpointing improvements. That could be added as a proposal there.

I hope I'll be able to start with that design doc next week.

Greetings,
Stephan


On Fri, May 13, 2016 at 1:35 PM, Matthias J. Sax <[hidden email]> wrote:
I don't think barries can "expire" as of now. Might be a nice idea
thought -- I don't know if this might be a problem in production.

Furthermore, I want to point out, that an "expiring checkpoint" would
not break exactly-once processing, as the latest successful checkpoint
can always be used to recover correctly. Only the recovery-time would be
increase. because if a "barrier expires" and no checkpoint can be
stored, more data has to be replayed using the "old" checkpoint".


-Matthias

On 05/12/2016 09:21 PM, Srikanth wrote:
> Hello,
>
> I was reading about Flink's checkpoint and wanted to check if I
> correctly understood the usage of barriers for exactly once processing.
>  1) Operator does alignment by buffering records coming after a barrier
> until it receives barrier from all upstream operators instances.
>  2) Barrier is always preceded by a watermark to trigger processing all
> windows that are complete.
>  3) Records in windows that are not triggered are also saved as part of
> checkpoint. These windows are repopulated when restoring from checkpoints.
>
> In production setups, were there any cases where alignment during
> checkpointing caused unacceptable latency?
> If so, is there a way to indicate say wait for a MAX 100 ms? That way we
> have exactly-once in most situations but prefer at least once over
> higher latency in corner cases.
>
> Srikanth



Reply | Threaded
Open this post in threaded view
|

Re: Barriers at work

Srikanth
In reply to this post by Stephan Ewen
I have a follow up. Is there a recommendation of list of knobs that can be tuned if at least once guarantee while handling failure is good enough?
For cases like alert generation, non idempotent sink, etc where the system can live with duplicates or has other mechanism to handle them.

Srikanth

On Fri, May 13, 2016 at 8:09 AM, Stephan Ewen <[hidden email]> wrote:
Hi Srikanth!

That is an interesting idea.
I have it on my mind to create a design doc for checkpointing improvements. That could be added as a proposal there.

I hope I'll be able to start with that design doc next week.

Greetings,
Stephan


On Fri, May 13, 2016 at 1:35 PM, Matthias J. Sax <[hidden email]> wrote:
I don't think barries can "expire" as of now. Might be a nice idea
thought -- I don't know if this might be a problem in production.

Furthermore, I want to point out, that an "expiring checkpoint" would
not break exactly-once processing, as the latest successful checkpoint
can always be used to recover correctly. Only the recovery-time would be
increase. because if a "barrier expires" and no checkpoint can be
stored, more data has to be replayed using the "old" checkpoint".


-Matthias

On 05/12/2016 09:21 PM, Srikanth wrote:
> Hello,
>
> I was reading about Flink's checkpoint and wanted to check if I
> correctly understood the usage of barriers for exactly once processing.
>  1) Operator does alignment by buffering records coming after a barrier
> until it receives barrier from all upstream operators instances.
>  2) Barrier is always preceded by a watermark to trigger processing all
> windows that are complete.
>  3) Records in windows that are not triggered are also saved as part of
> checkpoint. These windows are repopulated when restoring from checkpoints.
>
> In production setups, were there any cases where alignment during
> checkpointing caused unacceptable latency?
> If so, is there a way to indicate say wait for a MAX 100 ms? That way we
> have exactly-once in most situations but prefer at least once over
> higher latency in corner cases.
>
> Srikanth



Reply | Threaded
Open this post in threaded view
|

Re: Barriers at work

Stephan Ewen
You can use the checkpoint mode to "at least once".
That way, barriers never block.

On Fri, May 13, 2016 at 6:05 PM, Srikanth <[hidden email]> wrote:
I have a follow up. Is there a recommendation of list of knobs that can be tuned if at least once guarantee while handling failure is good enough?
For cases like alert generation, non idempotent sink, etc where the system can live with duplicates or has other mechanism to handle them.

Srikanth

On Fri, May 13, 2016 at 8:09 AM, Stephan Ewen <[hidden email]> wrote:
Hi Srikanth!

That is an interesting idea.
I have it on my mind to create a design doc for checkpointing improvements. That could be added as a proposal there.

I hope I'll be able to start with that design doc next week.

Greetings,
Stephan


On Fri, May 13, 2016 at 1:35 PM, Matthias J. Sax <[hidden email]> wrote:
I don't think barries can "expire" as of now. Might be a nice idea
thought -- I don't know if this might be a problem in production.

Furthermore, I want to point out, that an "expiring checkpoint" would
not break exactly-once processing, as the latest successful checkpoint
can always be used to recover correctly. Only the recovery-time would be
increase. because if a "barrier expires" and no checkpoint can be
stored, more data has to be replayed using the "old" checkpoint".


-Matthias

On 05/12/2016 09:21 PM, Srikanth wrote:
> Hello,
>
> I was reading about Flink's checkpoint and wanted to check if I
> correctly understood the usage of barriers for exactly once processing.
>  1) Operator does alignment by buffering records coming after a barrier
> until it receives barrier from all upstream operators instances.
>  2) Barrier is always preceded by a watermark to trigger processing all
> windows that are complete.
>  3) Records in windows that are not triggered are also saved as part of
> checkpoint. These windows are repopulated when restoring from checkpoints.
>
> In production setups, were there any cases where alignment during
> checkpointing caused unacceptable latency?
> If so, is there a way to indicate say wait for a MAX 100 ms? That way we
> have exactly-once in most situations but prefer at least once over
> higher latency in corner cases.
>
> Srikanth




Reply | Threaded
Open this post in threaded view
|

Re: Barriers at work

Srikanth
Thanks. I didn't know we could set that.

On Fri, May 13, 2016 at 12:44 PM, Stephan Ewen <[hidden email]> wrote:
You can use the checkpoint mode to "at least once".
That way, barriers never block.

On Fri, May 13, 2016 at 6:05 PM, Srikanth <[hidden email]> wrote:
I have a follow up. Is there a recommendation of list of knobs that can be tuned if at least once guarantee while handling failure is good enough?
For cases like alert generation, non idempotent sink, etc where the system can live with duplicates or has other mechanism to handle them.

Srikanth

On Fri, May 13, 2016 at 8:09 AM, Stephan Ewen <[hidden email]> wrote:
Hi Srikanth!

That is an interesting idea.
I have it on my mind to create a design doc for checkpointing improvements. That could be added as a proposal there.

I hope I'll be able to start with that design doc next week.

Greetings,
Stephan


On Fri, May 13, 2016 at 1:35 PM, Matthias J. Sax <[hidden email]> wrote:
I don't think barries can "expire" as of now. Might be a nice idea
thought -- I don't know if this might be a problem in production.

Furthermore, I want to point out, that an "expiring checkpoint" would
not break exactly-once processing, as the latest successful checkpoint
can always be used to recover correctly. Only the recovery-time would be
increase. because if a "barrier expires" and no checkpoint can be
stored, more data has to be replayed using the "old" checkpoint".


-Matthias

On 05/12/2016 09:21 PM, Srikanth wrote:
> Hello,
>
> I was reading about Flink's checkpoint and wanted to check if I
> correctly understood the usage of barriers for exactly once processing.
>  1) Operator does alignment by buffering records coming after a barrier
> until it receives barrier from all upstream operators instances.
>  2) Barrier is always preceded by a watermark to trigger processing all
> windows that are complete.
>  3) Records in windows that are not triggered are also saved as part of
> checkpoint. These windows are repopulated when restoring from checkpoints.
>
> In production setups, were there any cases where alignment during
> checkpointing caused unacceptable latency?
> If so, is there a way to indicate say wait for a MAX 100 ms? That way we
> have exactly-once in most situations but prefer at least once over
> higher latency in corner cases.
>
> Srikanth