Reading files from an S3 folder

Alex Reid
Hi, I've been playing around with Apache Flink to process some data, and I'm starting out with the batch DataSet API.

To start, I read in some data from files in an S3 folder:

DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");

Within the folder, there are 20 gzipped files, and I run the job with 20 nodes/tasks (so parallelism 20). It looks like each node is reading in ALL the files (the whole folder), but what I really want is for each node/task to read in one file and process only the data in that file.
Is this expected behavior? Am I supposed to be doing something different here to get the results I want?
Thanks.

Re: Reading files from an S3 folder

rmetzger0
Hi,
This is not the expected behavior.
Each parallel instance should read only one file. The files should not be read multiple times by the different parallel instances.
How did you check / find out that each node is reading all the data?

Regards,
Robert
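
One way to answer that question, offered as a sketch rather than something from the thread: count the lines in each gzipped file outside of Flink, then compare those counts with whatever per-subtask record count the web UI reports for the source. If each subtask reads one file, its count should match one file's line count, not the sum over all files. The class name GzipLineCount is hypothetical, and the sketch assumes the files have been copied locally and are UTF-8 text with one record per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper: counts text lines in gzipped files using only the JDK. */
final class GzipLineCount {

    /** Decompresses the file as a stream and counts its lines (UTF-8 is an assumption). */
    static long countLines(Path gzipFile) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzipFile)), StandardCharsets.UTF_8))) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }

    /** Prints "path<TAB>lineCount" for every local file path given as an argument. */
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + "\t" + countLines(Paths.get(arg)));
        }
    }
}
```

The file is streamed, so memory use stays small even for multi-gigabyte inputs.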

Re: Reading files from an S3 folder

Alex Reid
Each file is ~1.8G compressed (and about 15G uncompressed, so a little over 300G total for all the files).

In the Web Client UI, when I look at the Plan and click on the subtask that reads in the files, I see a line for each host, and the Bytes Sent for each host is about 350G.

The job takes longer than I'd expect, so I'm just trying to track down where the time is spent and whether it is doing what I expect.
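
As a rough sanity check on the figures above (a sketch, not a diagnosis): with 20 files of about 15G uncompressed each and parallelism 20, a source subtask that reads exactly one file would handle roughly 15G, while the whole data set is roughly 300G. The reported ~350G per host is on the order of the total, not of a single file. Bytes Sent measures serialized records, so it will not equal the raw text size exactly. The class and method names below are hypothetical.

```java
/** Hypothetical back-of-the-envelope check using the figures reported in this thread. */
final class ReadVolumeEstimate {

    /** Total uncompressed input if every file is read exactly once. */
    static long totalUncompressedG(int fileCount, long uncompressedGPerFile) {
        return fileCount * uncompressedGPerFile;
    }

    /** Uncompressed volume per subtask if the files are spread evenly over the subtasks. */
    static long expectedGPerSubtask(int fileCount, long uncompressedGPerFile, int parallelism) {
        return totalUncompressedG(fileCount, uncompressedGPerFile) / parallelism;
    }

    public static void main(String[] args) {
        int fileCount = 20;             // from the thread
        int parallelism = 20;           // from the thread
        long uncompressedGPerFile = 15; // "about 15G uncompressed"
        long observedGPerHost = 350;    // "Bytes Sent" per host as reported

        long total = totalUncompressedG(fileCount, uncompressedGPerFile);
        long perSubtask = expectedGPerSubtask(fileCount, uncompressedGPerFile, parallelism);

        System.out.println("Total uncompressed input: ~" + total + "G");
        System.out.println("Expected per subtask (one file each): ~" + perSubtask + "G");
        System.out.println("Observed per host: ~" + observedGPerHost + "G");
    }
}
```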

Re: Reading files from an S3 folder

Steve Morin-2
In reply to this post by Alex Reid
Alex,
  We are working on the same thing, for the exact same reason. We are trying to avoid the complexity of running HDFS just for file storage. We are also okay with the S3 limitations this introduces. We'll try to update the group if we find solutions for parallelizing file consumption.

Scott,
  What problems did you encounter trying to save Flink state and savepoint data?

Flink checkpoint & savepoint data. Like you, I tried using S3 to persist Flink state, but encountered AWS SDK issues and felt like I was going down an ill-advised path

-Steve