Hi Experts,
When we implement a sink, we usually write in batches, flushing either when a record-count threshold is reached or when a time interval elapses. However, this can leave the data of the last batch unwritten, because the flush is only triggered by incoming records. I also tested the JDBCOutputFormat provided by Flink and found that it has the same problem: if the batch size is 50 and 49 items arrive, but the 50th item only comes an hour later, then the 49 items will not be written to the sink during that hour. This may cause data delay or data loss. Could anyone propose a solution to this problem?
Thanks a lot.
Best,
Henry
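(For illustration, a minimal sketch of this kind of size-only buffering, not Flink's actual JDBCOutputFormat code; the class name and the doInsert() helper are hypothetical:)

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Flushes only when the buffer is full, so a partial batch (e.g. 49 of 50
// records) stays in memory until more records arrive.
public class NaiveBatchSink extends RichSinkFunction<String> {
    private static final int BATCH_SIZE = 50;
    private final List<String> buffer = new ArrayList<>();

    @Override
    public void invoke(String value, Context context) throws Exception {
        buffer.add(value);
        if (buffer.size() >= BATCH_SIZE) {
            doInsert(buffer);  // hypothetical batch write to the external system
            buffer.clear();
        }
    }

    private void doInsert(List<String> batch) {
        // write the batch via JDBC (details omitted)
    }
}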
|
Hi Henry,
Flushing of buffered data in sinks should occur on two occasions: 1) when a buffer size limit is reached or a fixed flush interval fires, and 2) on checkpoints. Flushing any pending data before a checkpoint completes gives the sink at-least-once guarantees, so that should answer your concern about data loss. For the data delay caused by buffering, my only suggestion is to add a time-interval-based flushing configuration. That is what currently happens, for example, in the Kafka / Kinesis producer sinks: records are buffered and flushed at fixed intervals or when the buffer is full. They are also flushed on every checkpoint.
Cheers,
Gordon
On 13 November 2018 at 5:07:32 PM, 徐涛 ([hidden email]) wrote:
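(A minimal sketch of that pattern, assuming a hand-written sink rather than the actual Kafka/Kinesis producer code; the class name and the doInsert() helper are hypothetical. The idea is to flush when the buffer is full and also in snapshotState(), so every checkpoint carries the pending data:)

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Buffering sink that flushes on a size limit AND on every checkpoint,
// so a partial batch never has to wait for the next incoming record.
public class CheckpointedBatchSink extends RichSinkFunction<String>
        implements CheckpointedFunction {

    private static final int BATCH_SIZE = 50;
    private final List<String> buffer = new ArrayList<>();

    @Override
    public void invoke(String value, Context context) throws Exception {
        buffer.add(value);
        if (buffer.size() >= BATCH_SIZE) {
            flush();  // size-based flush
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        // Flush pending records before the checkpoint completes; this is what
        // gives the sink at-least-once guarantees for buffered data.
        flush();
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        // Nothing to restore in this sketch; a production sink could also
        // start a scheduled task here for interval-based flushing.
    }

    private void flush() {
        if (!buffer.isEmpty()) {
            doInsert(buffer);  // hypothetical batch write (e.g. JDBC executeBatch)
            buffer.clear();
        }
    }

    private void doInsert(List<String> batch) {
        // write the batch to the external system (details omitted)
    }
}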
|
Hi Gordon,
Later I looked at the implementation of the Elasticsearch sink provided by Flink and found that it also flushes its buffered data when a checkpoint happens. I applied the same method and the problem is now solved. It uses exactly the mechanism you described.
Thanks a lot for your help.
Best,
Henry
|