|
Hi Guys,
For historical reprocessing , I am reading the avro data from S3 and passing these records to the same pipeline for processing.
I have the following queries:
1. I am running this pipeline as a stream application with checkpointing enabled, the records are successfully written to S3, however they remain in the pending state as checkpointing is not triggered when I doing re-processing. Why does this happen ? (kept the checkpointing interval to 1 minute, pipeline ran for 10 minutes)
this is the code I am using for reading avro data from S3
AvroInputFormat<SomeAvroClass> avroInputFormat = new AvroInputFormat<>(
new org.apache.flink.core.fs.Path(s3Path), SomeAvroClass.class);
sourceStream = env.createInput(avroInputFormat).map(...);
2. For the source stream Flink sets the parallelism as 1 , and for the rest of the operators the user specified parallelism is set. How does Flink reads the data ? does it bring the entire file from S3 one at a time and then Split it according to parallelism ?
3. I am reading from two different S3 folders and treating them as separate sourceStreams, how does Flink reads data in this case ? does it pick one file from each S3 folder , split the data and pass it downstream ? Does Flink reads the data sequentially ? I am confused here as only one Task Manager is reading the data from S3 and then all TM's are getting the data.
4. Although I am running this as as stream application, the operators goes into FINISHED state after processing , is this because Flink treats the S3 source as finite data ? What will happen if the data is continuously written to S3 from one pipeline and from the second pipeline I am doing historical re-processing ?
Regards,
Vinay Patil
|