Questions in sink exactly once implementation

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Questions in sink exactly once implementation

徐涛
Hi
        I am reading the book “Introduction to Apache Flink”, and in the book there mentions two ways to achieve sink exactly once:
        1. The first way is to buffer all output at the sink and commit this atomically when the sink receives a checkpoint record.
        2. The second way is to eagerly write data to the output, keeping in mind that some of this data might be “dirty” and replayed after a failure. If there is a failure, then we need to roll back the output, thus overwriting the dirty data and effectively deleting dirty data that has already been written to the output.

        I read the code of Elasticsearch sink, and find there is a flushOnCheckpoint option, if set to true, the change will accumulate until checkpoint is made. I guess it will guarantee at-least-once delivery, because although it use batch flush, but the flush is not a atomic action, so it can not guarantee exactly-once delivery.

        My question is :
        1. As many sinks do not support transaction, at this case I have to choose 2 to achieve exactly once. At this case, I have to buffer all the records between checkpoints and delete them, it is a bit heavy action.
        2. I guess mysql sink should support exactly once delivery, because it supports transaction, but at this case I have to execute batch according to the number of actions between checkpoints but not a specific number, 100 for example. When there are a lot of items between checkpoints, it is a heavy action either.

Best
Henry
Reply | Threaded
Open this post in threaded view
|

Re: Questions in sink exactly once implementation

Hequn Cheng
Hi Henry,

Yes, exactly once using atomic way is heavy for mysql. However, you don't have to buffer data if you choose option 2. You can simply overwrite old records with new ones if result data is idempotent and this way can also achieve exactly once. 
There is a document about End-to-End Exactly-Once Processing in Apache Flink[1], which may be helpful for you.

Best, Hequn




On Fri, Oct 12, 2018 at 5:21 PM 徐涛 <[hidden email]> wrote:
Hi
        I am reading the book “Introduction to Apache Flink”, and in the book there mentions two ways to achieve sink exactly once:
        1. The first way is to buffer all output at the sink and commit this atomically when the sink receives a checkpoint record.
        2. The second way is to eagerly write data to the output, keeping in mind that some of this data might be “dirty” and replayed after a failure. If there is a failure, then we need to roll back the output, thus overwriting the dirty data and effectively deleting dirty data that has already been written to the output.

        I read the code of Elasticsearch sink, and find there is a flushOnCheckpoint option, if set to true, the change will accumulate until checkpoint is made. I guess it will guarantee at-least-once delivery, because although it use batch flush, but the flush is not a atomic action, so it can not guarantee exactly-once delivery.

        My question is :
        1. As many sinks do not support transaction, at this case I have to choose 2 to achieve exactly once. At this case, I have to buffer all the records between checkpoints and delete them, it is a bit heavy action.
        2. I guess mysql sink should support exactly once delivery, because it supports transaction, but at this case I have to execute batch according to the number of actions between checkpoints but not a specific number, 100 for example. When there are a lot of items between checkpoints, it is a heavy action either.

Best
Henry
Reply | Threaded
Open this post in threaded view
|

Re: Questions in sink exactly once implementation

徐涛
Hi Hequn,
Thanks a lot for your response. I have a few questions about this topic. Would you please help me about it?
1. I have heard a idempotent way but I do not know how to implement it, would you please enlighten me about it by a example?
2. If dirty data are added but not updated, then only overwrite is not enough I think.
3. If using two-phase commit, the sink must support transaction.
3.1 If the sink does not support transaction, for example, elasticsearch, do I have to use idempotent to implement exactly-once?
3.2 If the sink support transaction, for example, mysql, idempotent and two-phase commit is both OK. But like you say, if there are a lot of items between checkpoints, the batch insert is a heavy action, I still have to use idempotent way to implement exactly-once.


Best
Hequn

在 2018年10月13日,上午11:43,Hequn Cheng <[hidden email]> 写道:

Hi Henry,

Yes, exactly once using atomic way is heavy for mysql. However, you don't have to buffer data if you choose option 2. You can simply overwrite old records with new ones if result data is idempotent and this way can also achieve exactly once. 
There is a document about End-to-End Exactly-Once Processing in Apache Flink[1], which may be helpful for you.

Best, Hequn




On Fri, Oct 12, 2018 at 5:21 PM 徐涛 <[hidden email]> wrote:
Hi
        I am reading the book “Introduction to Apache Flink”, and in the book there mentions two ways to achieve sink exactly once:
        1. The first way is to buffer all output at the sink and commit this atomically when the sink receives a checkpoint record.
        2. The second way is to eagerly write data to the output, keeping in mind that some of this data might be “dirty” and replayed after a failure. If there is a failure, then we need to roll back the output, thus overwriting the dirty data and effectively deleting dirty data that has already been written to the output.

        I read the code of Elasticsearch sink, and find there is a flushOnCheckpoint option, if set to true, the change will accumulate until checkpoint is made. I guess it will guarantee at-least-once delivery, because although it use batch flush, but the flush is not a atomic action, so it can not guarantee exactly-once delivery.

        My question is :
        1. As many sinks do not support transaction, at this case I have to choose 2 to achieve exactly once. At this case, I have to buffer all the records between checkpoints and delete them, it is a bit heavy action.
        2. I guess mysql sink should support exactly once delivery, because it supports transaction, but at this case I have to execute batch according to the number of actions between checkpoints but not a specific number, 100 for example. When there are a lot of items between checkpoints, it is a heavy action either.

Best
Henry

Reply | Threaded
Open this post in threaded view
|

Re: Questions in sink exactly once implementation

Hequn Cheng
Hi Henry,

> 1. I have heard a idempotent way but I do not know how to implement it, would you please enlighten me about it by a example?
It's a property of the result data. For example, you can overwrite old values with new ones using a primary key.

> 2. If dirty data are added but not updated
This against idempotent. Idempotent ensure that the result is consistent in the end.

> 3. If using two-phase commit, the sink must support transaction.
I think the answer is yes.

Best, Hequn


On Sat, Oct 13, 2018 at 8:49 PM 徐涛 <[hidden email]> wrote:
Hi Hequn,
Thanks a lot for your response. I have a few questions about this topic. Would you please help me about it?
1. I have heard a idempotent way but I do not know how to implement it, would you please enlighten me about it by a example?
2. If dirty data are added but not updated, then only overwrite is not enough I think.
3. If using two-phase commit, the sink must support transaction.
3.1 If the sink does not support transaction, for example, elasticsearch, do I have to use idempotent to implement exactly-once?
3.2 If the sink support transaction, for example, mysql, idempotent and two-phase commit is both OK. But like you say, if there are a lot of items between checkpoints, the batch insert is a heavy action, I still have to use idempotent way to implement exactly-once.


Best
Hequn

在 2018年10月13日,上午11:43,Hequn Cheng <[hidden email]> 写道:

Hi Henry,

Yes, exactly once using atomic way is heavy for mysql. However, you don't have to buffer data if you choose option 2. You can simply overwrite old records with new ones if result data is idempotent and this way can also achieve exactly once. 
There is a document about End-to-End Exactly-Once Processing in Apache Flink[1], which may be helpful for you.

Best, Hequn




On Fri, Oct 12, 2018 at 5:21 PM 徐涛 <[hidden email]> wrote:
Hi
        I am reading the book “Introduction to Apache Flink”, and in the book there mentions two ways to achieve sink exactly once:
        1. The first way is to buffer all output at the sink and commit this atomically when the sink receives a checkpoint record.
        2. The second way is to eagerly write data to the output, keeping in mind that some of this data might be “dirty” and replayed after a failure. If there is a failure, then we need to roll back the output, thus overwriting the dirty data and effectively deleting dirty data that has already been written to the output.

        I read the code of Elasticsearch sink, and find there is a flushOnCheckpoint option, if set to true, the change will accumulate until checkpoint is made. I guess it will guarantee at-least-once delivery, because although it use batch flush, but the flush is not a atomic action, so it can not guarantee exactly-once delivery.

        My question is :
        1. As many sinks do not support transaction, at this case I have to choose 2 to achieve exactly once. At this case, I have to buffer all the records between checkpoints and delete them, it is a bit heavy action.
        2. I guess mysql sink should support exactly once delivery, because it supports transaction, but at this case I have to execute batch according to the number of actions between checkpoints but not a specific number, 100 for example. When there are a lot of items between checkpoints, it is a heavy action either.

Best
Henry