Last batch of stream data is not written to the sink when data arrives very slowly


徐涛
Hi Experts,
When we implement a sink, we usually batch the writes, flushing either when a record-count threshold is reached or after a time interval. However, this can leave the last batch unwritten, because the flush is only triggered by an incoming record.
I also tested the JDBCOutputFormat provided by Flink and found that it has the same problem: if the batch size is 50 and 49 items arrive, but the 50th item only comes an hour later, the 49 items are not written to the sink during that hour. This can cause data delay or data loss.
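To make the delay concrete, here is a minimal, purely illustrative sketch (the class name is made up, and this is not the actual JDBCOutputFormat code) of a sink whose only flush trigger is the record count; with a batch size of 50, the 49 buffered records just sit in memory until a 50th record arrives:

```java
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: the flush is triggered solely by the record
// count, so a partial batch can stay buffered indefinitely on a slow stream.
public class CountOnlyBatchSink<T> extends RichSinkFunction<T> {

    private static final int BATCH_SIZE = 50;
    private final List<T> buffer = new ArrayList<>();

    @Override
    public void invoke(T value, Context context) throws Exception {
        buffer.add(value);
        if (buffer.size() >= BATCH_SIZE) {  // the ONLY flush trigger
            flush();
        }
    }

    private void flush() {
        // write the buffered batch to the external system, then clear it
        buffer.clear();
    }
}
```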
Could anyone propose a solution to this problem?
Thanks a lot.

Best
Henry 
Re: Last batch of stream data is not written to the sink when data arrives very slowly

Tzu-Li (Gordon) Tai
Hi Henry,

Flushing of buffered data in sinks should occur on two occasions: 1) when a buffer size limit is reached or a fixed flush interval fires, and 2) on checkpoints.

Flushing any pending data before completing a checkpoint gives the sink at-least-once guarantees, so that should answer your question about data loss.
For the data delay caused by buffering, my only suggestion is a time-interval-based flushing configuration.
That is what currently happens, for example, in the Kafka / Kinesis producer sinks: records are buffered and flushed at fixed intervals or when the buffer is full, and they are also flushed on every checkpoint.
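For a custom sink, that pattern can be sketched roughly as below. The class and field names are illustrative (this is not an actual Flink-provided sink): buffer records, flush when the buffer is full, flush on a fixed interval, and flush on every checkpoint by implementing CheckpointedFunction.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a buffering sink with three flush triggers:
// buffer full, fixed interval, and checkpoint.
public class BufferingSink<T> extends RichSinkFunction<T> implements CheckpointedFunction {

    private final int batchSize;
    private final long flushIntervalMillis;

    private transient List<T> buffer;
    private transient ScheduledExecutorService scheduler;

    public BufferingSink(int batchSize, long flushIntervalMillis) {
        this.batchSize = batchSize;
        this.flushIntervalMillis = flushIntervalMillis;
    }

    @Override
    public void open(Configuration parameters) {
        buffer = new ArrayList<>();
        // fixed-interval flush so a slow stream never holds a partial batch for long
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::flushSafely,
                flushIntervalMillis, flushIntervalMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void invoke(T value, Context context) throws Exception {
        synchronized (this) {
            buffer.add(value);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // flushing here ensures no buffered record is lost once a checkpoint
        // completes (at-least-once)
        synchronized (this) {
            flush();
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // nothing to restore in this sketch; a production sink might also keep
        // the buffer in operator state
    }

    @Override
    public void close() throws Exception {
        if (scheduler != null) {
            scheduler.shutdown();
        }
        synchronized (this) {
            flush();
        }
    }

    private synchronized void flushSafely() {
        try {
            flush();
        } catch (Exception e) {
            throw new RuntimeException("periodic flush failed", e);
        }
    }

    private void flush() throws Exception {
        if (buffer.isEmpty()) {
            return;
        }
        // write the batch to the external system here (JDBC, HTTP, ...)
        buffer.clear();
    }
}
```

The flush in snapshotState() is what upgrades the sink to at-least-once: a checkpoint only completes after the pending records have been pushed out.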

Cheers,
Gordon

Re: Last batch of stream data is not written to the sink when data arrives very slowly

徐涛
Hi Gordon,
Later I looked at the implementation of the Elasticsearch sink provided by Flink and found that it also flushes its buffered data when a checkpoint happens, exactly the method you described. I applied the same approach and the problem is now solved. Thanks a lot for your help.
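For completeness, a hypothetical job wiring for a checkpoint-aware buffering sink like the sketch earlier in the thread; enabling checkpointing is what provides the periodic "flush everything" signal:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferingSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);              // checkpoint (and thus flush) every minute

        env.fromElements("a", "b", "c")
           .addSink(new BufferingSink<>(50, 30_000)); // batch of 50, or flush every 30 s
        env.execute("buffering-sink-example");
    }
}
```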

Best
Henry
