Re: Kafka batches
Posted by vprabhu@gmail.com
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Kakfa-batches-tp8296p8304.html
Thanks for the reply.
Yeah, I mean a manual commit in #3; that way the committed offsets would accurately reflect the messages that were actually processed.
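
For concreteness, here is a minimal sketch of what I mean by a manual commit, using the plain Kafka 0.9 consumer API; the bootstrap servers, topic, group, and processing logic are illustrative:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ManualCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group");
        props.put("enable.auto.commit", "false"); // we commit explicitly
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // fully process before committing
                    // Commit the offset right after the processed message, so
                    // the committed position reflects exactly what was processed.
                    consumer.commitSync(Collections.singletonMap(
                            new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1)));
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}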
My understanding is that the current checkpointing process commits the state of all the operators separately: the Kafka connector will commit offset X while the window operator will checkpoint the messages in the offset range (X - N up to X). If the job fails at this point (the YARN application failed due to some network issue, someone accidentally killed my YARN job, etc.), restarting the job will start processing from offset X, and the messages that were checkpointed by the window operator are lost.
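
To illustrate the job shape I'm describing, here is a minimal sketch with checkpointing enabled; the topic, group, and window size are placeholders for our actual setup:

import java.util.Properties;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaWindowJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // The Kafka source snapshots its offsets and the window operator
        // snapshots its buffered elements at each checkpoint.
        env.enableCheckpointing(60_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "my-group");

        env.addSource(new FlinkKafkaConsumer09<>(
                        "my-topic", new SimpleStringSchema(), props))
           .timeWindowAll(Time.minutes(15)) // buffers messages between offsets
           .reduce((a, b) -> a + "\n" + b)
           .print();

        env.execute("kafka-window-job");
    }
}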
The reason we need this accuracy is that the downstream processes are batch oriented (typically processing a 15-minute bucket of data) and are triggered when message timestamps exceed a certain watermark. We have a separate store that maintains the relation between partition offset and message timestamp (both are ever-increasing values). A check happens against this store to see whether the offsets processed by the Flink job have crossed a certain timestamp.
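
Roughly, that check looks like the sketch below; OffsetTimestampStore and its lookup method are made-up names standing in for our actual store, not a real API:

import java.util.Map;

public class BatchTrigger {

    /** Illustrative store interface: (partition, offset) -> message timestamp. */
    interface OffsetTimestampStore {
        long timestampForOffset(int partition, long offset);
    }

    /**
     * Returns true when every partition's processed offset maps to a
     * timestamp at or beyond the watermark, i.e. the 15-minute bucket
     * is complete and the downstream batch can run.
     */
    static boolean bucketReady(OffsetTimestampStore store,
                               Map<Integer, Long> processedOffsets,
                               long watermarkMillis) {
        for (Map.Entry<Integer, Long> e : processedOffsets.entrySet()) {
            long ts = store.timestampForOffset(e.getKey(), e.getValue());
            if (ts < watermarkMillis) {
                return false; // this partition hasn't passed the bucket yet
            }
        }
        return true;
    }
}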