FlinkKafkaConsumer - Broadcast - Initial Load



Sandeep khanzode
Hi,

I have master/reference data that needs to come in through a FlinkKafkaConsumer, be broadcast to all nodes, and subsequently be joined with the actual stream to enrich its content.

The Kafka consumer gets CDC-type records from database changes. All this works well.


My question is: how do I initialise the pipeline with the first set of records already in the database, i.e. those that are not CDC events?

When I deploy for the first time, I would need all the DB records to be sent to the FlinkKafkaConsumer before any CDC updates happen.

Is there a hook that allows a first-time initial load of the records in the Kafka topic to be broadcast?



Also, about the broadcast state: since we persist the state in the RocksDB backend, I am assuming the state backend will hold the latest records, and that even if the task manager crashes and restarts, the correct Kafka consumer group topic offsets will have been recorded, so that next time we do not startFromEarliest(). Is that right?

Will the state always maintain the updates to the records as well as the Kafka topic offsets?


Thanks,
Sandeep

Re: FlinkKafkaConsumer - Broadcast - Initial Load

rmetzger0
Hey Sandeep,


> My question is: how do I initialise the pipeline with the first set of records already in the database, i.e. those that are not CDC events?

I would use the open() method of your process function to load the initial data in all operator instances.
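
For example (a minimal sketch, assuming the reference data is reachable over JDBC; the class name, table, tuple schemas, and connection URL below are all illustrative, not part of your setup):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

// Main events and CDC records are modelled as (key, payload) tuples here;
// substitute your real types.
public class ReferenceEnricher
        extends BroadcastProcessFunction<Tuple2<String, String>, Tuple2<String, String>, String> {

    public static final MapStateDescriptor<String, String> REF_STATE =
            new MapStateDescriptor<>("reference-data", Types.STRING, Types.STRING);

    private final String jdbcUrl; // hypothetical connection string
    private transient Map<String, String> initialLoad;

    public ReferenceEnricher(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // Runs once per parallel instance, before any record is processed:
        // pull the current DB snapshot into a plain, non-checkpointed map.
        initialLoad = new HashMap<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, payload FROM ref_table")) {
            while (rs.next()) {
                initialLoad.put(rs.getString("id"), rs.getString("payload"));
            }
        }
    }

    @Override
    public void processElement(Tuple2<String, String> event, ReadOnlyContext ctx,
                               Collector<String> out) throws Exception {
        // Prefer the CDC-maintained broadcast state; fall back to the snapshot.
        String ref = ctx.getBroadcastState(REF_STATE).get(event.f0);
        if (ref == null) {
            ref = initialLoad.get(event.f0);
        }
        out.collect(event.f1 + " / " + ref);
    }

    @Override
    public void processBroadcastElement(Tuple2<String, String> cdc, Context ctx,
                                        Collector<String> out) throws Exception {
        // CDC updates land in the broadcast state, which IS checkpointed.
        ctx.getBroadcastState(REF_STATE).put(cdc.f0, cdc.f1);
    }
}

Wiring it up would then look roughly like this (cdcStream and mainStream being your existing streams):

BroadcastStream<Tuple2<String, String>> refBroadcast =
        cdcStream.broadcast(ReferenceEnricher.REF_STATE);
mainStream.connect(refBroadcast)
        .process(new ReferenceEnricher("jdbc:postgresql://db/ref")); // placeholder URL

Note that the map loaded in open() is rebuilt from the database on every restart, while the broadcast state is restored from the checkpoint.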

> I am assuming the state backend will hold the latest records, and that even if the task manager crashes and restarts, the correct Kafka consumer group topic offsets will have been recorded, so that next time we do not startFromEarliest(). Is that right?

Correct. The consumer offsets are stored in Flink state and checkpointed along with all other state. On failure, the latest completed checkpoint is restored, which includes the Kafka offsets from which processing resumes.
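
A minimal sketch of the setup this relies on (the topic name, group id, broker address, and checkpoint path are placeholders):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class EnrichmentJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot operator state (including Kafka offsets) every 60s.
        env.enableCheckpointing(60_000);
        env.setStateBackend(new RocksDBStateBackend("file:///tmp/checkpoints"));

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder address
        props.setProperty("group.id", "enrichment-job");       // placeholder group id

        FlinkKafkaConsumer<String> cdcSource =
                new FlinkKafkaConsumer<>("cdc-topic", new SimpleStringSchema(), props);

        // Start positions only apply to a fresh start with no checkpoint or
        // savepoint; on restore, the offsets in Flink state take precedence.
        cdcSource.setStartFromEarliest();

        env.addSource(cdcSource).print();
        env.execute("offset-restore-demo");
    }
}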


