Hi Chengzhi,
Yes, generally speaking, you would launch a separate job to do the backfill, and then shut that job down once the backfill has completed.
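For this to work, you’ll also have to keep in mind that writes to the external sink must be idempotent; otherwise the records re-processed by the backfill job would end up as duplicates next to what the original job already wrote.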
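To illustrate what idempotent means here, below is a minimal sketch in plain Java. It is not Flink code: `IdempotentStore` and `keyFor` are made-up names, and the in-memory map only stands in for an external store such as S3. The idea is that the output key depends only on the data (event date and partition), never on when the job ran, so a backfill run overwrites earlier output instead of adding to it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for an external store such as S3 (not a Flink API). */
final class IdempotentStore {
    private final Map<String, String> objects = new HashMap<>();

    /** Builds a key from the data itself, so a re-run produces exactly the same key. */
    static String keyFor(String eventDate, int partition) {
        return "events/dt=" + eventDate + "/part-" + partition;
    }

    /** Upsert: writing the same key again replaces the old value, no duplicate is created. */
    void write(String eventDate, int partition, String payload) {
        objects.put(keyFor(eventDate, partition), payload);
    }

    /** Returns the payload stored for the given date and partition, or null if absent. */
    String read(String eventDate, int partition) {
        return objects.get(keyFor(eventDate, partition));
    }

    /** Number of distinct objects currently stored. */
    int size() {
        return objects.size();
    }
}
```

With this layout, writing `("2018-02-01", 0)` once from the original job and once more from the backfill job still leaves a single object, holding the backfilled payload. Whether your actual sink and the load into Redshift behave this way is something you would need to check for your setup.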
Are you using Kafka as the data source?
If so, Flink 1.5.0 will support specifying the startup position of the Flink Kafka Consumer by timestamp.
You will not, however, be able to set an ending timestamp for the consumption. What you could do instead is monitor whether the backfill job has reached the head of the stream, and then shut it down.
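As a rough sketch of the two pieces involved (plain Java, no Flink or Kafka dependencies; `BackfillProgress`, `startTimestampMillis` and `caughtUp` are hypothetical names), the first helper turns the first day to backfill into an epoch-millisecond timestamp, assuming your record timestamps are UTC-based. The second helper assumes you captured the end offset of every partition when the backfill job started, and that you can obtain the job's current read positions from somewhere, for example from metrics. It then reports whether every partition has reached its captured end.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Map;

/** Hypothetical helpers for a backfill job; not part of Flink or the Kafka client. */
final class BackfillProgress {

    /** Epoch milliseconds for midnight UTC on the given day (assumes UTC timestamps). */
    static long startTimestampMillis(LocalDate firstDay) {
        return firstDay.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    /**
     * Returns true once every partition's current position has reached the end
     * position that was captured when the backfill started. A partition with no
     * known current position counts as not caught up.
     */
    static boolean caughtUp(Map<Integer, Long> current, Map<Integer, Long> targetEnd) {
        for (Map.Entry<Integer, Long> end : targetEnd.entrySet()) {
            Long position = current.get(end.getKey());
            if (position == null || position < end.getValue()) {
                return false;
            }
        }
        return true;
    }
}
```

An external script could poll `caughtUp` periodically and cancel the backfill job once it returns true. Comparing against end positions captured at startup, rather than the live head, keeps the target from moving while new data continues to arrive.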
If you are using Kinesis as the data source, then the Flink Kinesis connector already supports starting from a timestamp (but again, you cannot specify an ending timestamp).
I have been thinking about allowing users to specify a set of ending partition offsets or an ending timestamp, so that it is easier to consume only a bounded, static set of data with the Kinesis / Kafka consumers.
That would have been useful in your case, but as of now it isn’t on the roadmap.
Cheers,
Gordon
On 27 February 2018 at 7:07:37 AM, Chengzhi Zhao ([hidden email]) wrote:
Hey, flink community,
I have a question about backfilling data and would like to hear
how people approach it.
I have a stream of data that is written with BucketingSink to S3
and then loaded into Redshift. Suppose the logic in Flink changes
and I need to backfill some dates: for example, we are streaming
data for today but also need to backfill the data for
02/01/2018 - 02/10/2018.
What's the suggested way to implement this? Should I have a
separate process that backfills the data and is then shut down?
Thanks,
Chengzhi