Reading multiple files from S3 source in parallel

Flink Developer
Hello,
I'd like to build a Flink batch application that reads multiple files from an S3 source in parallel. Suppose the bucket has the following structure and the application runs with a parallelism of 3:
     s3://bucket/data-1/worker-1/file-1.txt
     s3://bucket/data-1/worker-1/file-2.txt
     s3://bucket/data-1/worker-2/file-1.txt
     s3://bucket/data-1/worker-2/file-2.txt
     s3://bucket/data-1/worker-3/file-1.txt
     s3://bucket/data-1/worker-3/file-2.txt

     s3://bucket/data-2/worker-1/file-1.txt
     s3://bucket/data-2/worker-1/file-2.txt
     s3://bucket/data-2/worker-2/file-1.txt
     s3://bucket/data-2/worker-2/file-2.txt
     s3://bucket/data-2/worker-3/file-1.txt
     s3://bucket/data-2/worker-3/file-2.txt

     s3://bucket/data-3/worker-1/file-1.txt
     s3://bucket/data-3/worker-1/file-2.txt
     s3://bucket/data-3/worker-2/file-1.txt
     s3://bucket/data-3/worker-2/file-2.txt
     s3://bucket/data-3/worker-3/file-1.txt
     s3://bucket/data-3/worker-3/file-2.txt

I'd like each parallel Flink worker to process only its own files. For example, Flink worker #1 should process only these files, in this order:
     s3://bucket/data-1/worker-1/file-1.txt
     s3://bucket/data-1/worker-1/file-2.txt
     s3://bucket/data-2/worker-1/file-1.txt
     s3://bucket/data-2/worker-1/file-2.txt
     s3://bucket/data-3/worker-1/file-1.txt
     s3://bucket/data-3/worker-1/file-2.txt
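
To make the desired assignment concrete, here is a minimal sketch in plain Python. It is not Flink code and uses no Flink API; it only illustrates the grouping and ordering I have in mind, assuming the paths always follow the data-N/worker-N/file-N.txt layout shown above. The names `PATHS`, `PATTERN` and `assign_by_worker` are made up for illustration.

```python
import re
from collections import defaultdict

# The 18 paths from the listing above, generated rather than typed out.
PATHS = [
    f"s3://bucket/data-{d}/worker-{w}/file-{f}.txt"
    for d in (1, 2, 3)
    for w in (1, 2, 3)
    for f in (1, 2)
]

# Assumed layout: s3://bucket/data-<n>/worker-<n>/file-<n>.txt
PATTERN = re.compile(r"^s3://bucket/data-(\d+)/worker-(\d+)/file-(\d+)\.txt$")


def assign_by_worker(paths):
    """Group paths by worker number; order each group by (data, file) number."""
    groups = defaultdict(list)
    for path in paths:
        match = PATTERN.match(path)
        if match is None:
            raise ValueError(f"unexpected path layout: {path}")
        data_no, worker_no, file_no = (int(g) for g in match.groups())
        groups[worker_no].append((data_no, file_no, path))
    # Sort numerically so that, e.g., data-10 would come after data-2.
    return {
        worker: [path for _, _, path in sorted(entries)]
        for worker, entries in sorted(groups.items())
    }


assignment = assign_by_worker(PATHS)

# Print the files intended for worker #1, in processing order.
for path in assignment[1]:
    print(path)
```

Each of the three groups holds six files, and the group for worker 1 matches the list above.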

How can I configure the Flink application's data source to achieve this? Thank you for your help.