How to process recent events from Kafka and older ones from another storage?

Svend
Hi everyone,

What is the typical architectural approach with Flink SQL for processing recent events from Kafka and older events from a separate, cheaper storage layer?

I currently have the following situation in mind:

* events are appearing in Kafka and retained there for, say, 1 month
* events are also regularly forwarded to AWS S3 into Parquet files
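
For concreteness, here is a minimal sketch in Flink SQL DDL of how I picture the two sources being declared. The table names, columns, topic and S3 path are just placeholders for illustration:

-- recent events, as they arrive in Kafka (retained ~1 month)
CREATE TABLE events_kafka (
  user_id    STRING,
  payload    STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'events',
  'properties.bootstrap.servers' = 'broker:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- the same events, periodically copied to S3 as Parquet files
CREATE TABLE events_s3 (
  user_id    STRING,
  payload    STRING,
  event_time TIMESTAMP(3)
) WITH (
  'connector' = 'filesystem',
  'path' = 's3://my-bucket/events/',
  'format' = 'parquet'
);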

Given a use case, or a bugfix, that requires (re)processing events older than 1 month, how do we design a Flink SQL pipeline that produces correct results?

* one possibility I envisaged is to first start the Flink SQL application with a FileSystem connector to read the historical events from Parquet, then shut it down while triggering a savepoint, and finally resume from that savepoint while now using the Kafka connector (a rough sketch of this flow follows after this list). I saw presentations where engineers from Lyft discussed that approach. From what I understand, though, the FileSystem connector currently does not emit watermarks (I understand https://issues.apache.org/jira/browse/FLINK-21871 is addressing exactly that), so if I get things right, none of my Flink SQL code could depend on event time, which seems very restrictive.

* another option is to use infinite retention in Kafka, which is expensive, or to copy old events from S3 back to Kafka when we need to process them.
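
To make the first option more concrete, here is roughly how I imagine the switch-over, expressed as SQL Client statements. The sink, query and savepoint path are placeholders, I am assuming 'execution.savepoint.path' can be set from the SQL Client to restore the resumed job, and I have not verified that the state written by the backfill job would actually be compatible with the resumed Kafka job:

-- phase 1: backfill from the Parquet files on S3
INSERT INTO results_sink
SELECT user_id, COUNT(*) AS cnt
FROM events_s3
GROUP BY user_id;

-- ...once the backfill has caught up, stop the job with a savepoint,
-- e.g. via `flink stop --savepointPath s3://my-bucket/savepoints <job-id>`...

-- phase 2: resume from that savepoint, now reading the same events from Kafka
SET 'execution.savepoint.path' = 's3://my-bucket/savepoints/savepoint-xxxx';

INSERT INTO results_sink
SELECT user_id, COUNT(*) AS cnt
FROM events_kafka
GROUP BY user_id;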

Since streaming with Flink, limited data retention in Kafka and pipeline backfilling are such common concerns, I imagine many teams have already addressed the situation described above.

What is the usual way of approaching this?

Thanks a lot in advance