I’m working with a Kafka environment where I’m limited to 100 partitions, each with log.retention.bytes of 1 GB. I’m looking to implement exactly-once processing from this Kafka source to an S3 sink.
If I have understood correctly, Flink will only commit the Kafka offsets once the data has been saved to S3, i.e. when a checkpoint completes.
Have I understood correctly that, for Flink checkpoints and exactly-once to work, the assumption is that the number and size of the Kafka partitions (log.retention.bytes) are sufficient that, should the job need to roll back to a checkpoint, the data to be replayed still exists in Kafka, i.e. it hasn’t been removed by retention to make room for new data?
If the above is true, and I am using a DateTimeBucketer, will the bucket sizes directly influence how big the Kafka partitions need to be, because larger buckets result in less frequent commits of the offsets?
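
For reference, the kind of job I have in mind looks roughly like the sketch below. The topic name, bootstrap servers, group id, bucket format and checkpoint interval are just placeholders on my side, and the exact consumer/sink class names will depend on the Flink and connector versions:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class KafkaToS3Job {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Offsets are only committed back to Kafka when a checkpoint completes,
        // so the checkpoint interval effectively sets the commit frequency.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "s3-sink");             // placeholder

        FlinkKafkaConsumer011<String> source =
                new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), props);

        // One bucket per hour; pending files are finalized when a checkpoint completes.
        BucketingSink<String> sink = new BucketingSink<>("s3://my-bucket/output");
        sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HH"));

        env.addSource(source).addSink(sink);

        env.execute("kafka-to-s3");
    }
}
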
Many thanks,
Chris