Reading files from multiple subdirectories

Lorenzo Nicora
Hi,

This is related to the same case I am discussing in another thread, but it is not about Avro this time :)

I need to ingest files that an S3 Sink Kafka Connector periodically adds to an S3 bucket.
The files are bucketed by date and time, as is often the case.

Is there any way, using Flink only, to monitor a base path and detect new files in any of its subdirectories?
Or do I need to use something external to move new files into a single directory?

I am currently using
env.readFile(inputFormat, path, PROCESS_CONTINUOUSLY, 60000)
with AvroInputFormat, but it seems it can only monitor a single directory.


Cheers
Lorenzo

Re: Reading files from multiple subdirectories

Yun Gao
Hi Lorenzo,

    Based on a previous thread [1] and the source code, I think you can call inputFormat.setNestedFileEnumeration(true) so that files in nested directories are scanned as well.
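
To picture what enabling nested enumeration changes, here is a minimal plain-Java sketch that runs without Flink on the classpath. It is not Flink's implementation. The class name, the method names (listFiles, demoTree) and the date/hour directory layout are all made up for illustration. It only contrasts listing a base path flat with listing it recursively.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only: contrasts flat and nested file listing.
final class NestedEnumerationSketch {

    // Returns the regular files under dir. With nested == false only the
    // direct children are considered; with nested == true subdirectories
    // are descended into recursively.
    static List<Path> listFiles(Path dir, boolean nested) {
        List<Path> children;
        try (Stream<Path> entries = Files.list(dir)) {
            children = entries.sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<Path> files = new ArrayList<>();
        for (Path child : children) {
            if (Files.isRegularFile(child)) {
                files.add(child);
            } else if (nested && Files.isDirectory(child)) {
                files.addAll(listFiles(child, true));
            }
        }
        return files;
    }

    // Builds a temporary tree that mimics a date/hour bucketed layout
    // (assumed layout, not taken from the original post):
    //   <base>/top-level.avro
    //   <base>/2020-06-11/21/part-0.avro
    static Path demoTree() {
        try {
            Path base = Files.createTempDirectory("bucket");
            Path hourDir = Files.createDirectories(base.resolve("2020-06-11").resolve("21"));
            Files.createFile(base.resolve("top-level.avro"));
            Files.createFile(hourDir.resolve("part-0.avro"));
            return base;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path base = demoTree();
        System.out.println("flat:   " + listFiles(base, false).size() + " file(s)");
        System.out.println("nested: " + listFiles(base, true).size() + " file(s)");
    }
}
```

On the demo tree the flat listing finds 1 file and the nested listing finds 2. In the job from the question, the corresponding change would be to call inputFormat.setNestedFileEnumeration(true) on the AvroInputFormat before passing it to env.readFile(...).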

Best,
Yun




------------------------------------------------------------------
Sender: Lorenzo Nicora <[hidden email]>
Date: 2020/06/11 21:31:20
Recipient: user <[hidden email]>
Subject: Reading files from multiple subdirectories
