End-to-end exactly once from kafka source to S3 sink

End-to-end exactly once from kafka source to S3 sink

chris snow
I’m working with a Kafka environment where I’m limited to 100 partitions @ 1 GB log.retention.bytes each. I’m looking to implement exactly-once processing from this Kafka source to an S3 sink.

If I have understood correctly, Flink will only commit the Kafka offsets when the data has been saved to S3.

Have I understood correctly that, for Flink checkpoints and exactly-once to work, the assumption is that the number and size of partitions (log.retention.bytes) in Kafka are sufficient that, should a checkpoint need to be rolled back, the data still exists in Kafka (i.e. it hasn’t been overwritten by new data)?

If the above is true, and I am using a DateTimeBucketer, will the bucket sizes directly influence how big the partitions should be in Kafka, because larger buckets result in less frequent commits of the offsets?
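
For reference, below is a minimal sketch of the pipeline I have in mind, assuming the FlinkKafkaConsumer011 connector and a BucketingSink with a DateTimeBucketer (the topic name, bootstrap servers, S3 path and checkpoint interval are placeholders):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class KafkaToS3Job {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Periodic checkpoints snapshot operator state together with the
        // Kafka offsets of the source (interval is a placeholder).
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");  // placeholder
        props.setProperty("group.id", "flink-s3-sink");        // placeholder

        FlinkKafkaConsumer011<String> consumer =
                new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), props);

        // Bucket output by hour; the bucket path pattern only controls the
        // directory layout on S3.
        BucketingSink<String> sink = new BucketingSink<>("s3://my-bucket/flink-output");
        sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HH"));

        env.addSource(consumer).addSink(sink);
        env.execute("kafka-to-s3");
    }
}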

Many thanks,

Chris

Re: End-to-end exactly once from kafka source to S3 sink

Hung
"Flink will only commit the kafka offsets when the data has been saved to S3"
-> no, you can check the BucketingSink code, and it would mean BucketingSink
depends on Kafka which is not reasonable.

Flink stores checkpoints on the disks of its workers, not in Kafka.
(Kafka Streams, the other streaming API provided by Kafka, stores its
checkpoints back in Kafka.)

So, bucket size doesn't affect the commit frequency.
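
For example, the offsets are stored in (and committed from) the periodic checkpoints, so the commit frequency follows the checkpoint interval you configure. A minimal sketch, assuming the FlinkKafkaConsumer011 connector (topic, group id and interval are placeholders):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class OffsetCommitExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The commit back to Kafka happens when a checkpoint completes, so the
        // commit frequency follows this interval, not the bucket size.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);  // placeholder interval

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");  // placeholder
        props.setProperty("group.id", "offset-demo");          // placeholder

        FlinkKafkaConsumer011<String> consumer =
                new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), props);
        // The offsets committed to Kafka are only used for monitoring the group;
        // on recovery Flink restores the offsets stored in its own checkpoint.
        consumer.setCommitOffsetsOnCheckpoints(true);

        env.addSource(consumer).print();
        env.execute("offset-commit-demo");
    }
}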

Best,

Sendoh