Streaming Files to S3

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Streaming Files to S3

Li Peng-2
Hey folks, I'm trying to stream large volume data and write them as csv files to S3, and one of the restrictions is to try and keep the files to below 100MB (compressed) and write one file per minute. I wanted to verify with you guys regarding my understanding of StreamingFileSink:

1. From the docs, StreamingFileSink will use multipart upload with s3, so even with many workers writing to s3, it will still output only one file for all of them for each time window, right?
2. StreamingFileSink.forRowFormat can be configured to write individual rows and then commit to disk as per the above rules, by specifying a RollingPolicy with the file size limit and the rollover interval, correct? And the limit and the interval applies to the entire file, not to each part file?
3. To write to s3, is it enough to just add flink-s3-fs-hadoop as a dependency and specify the file path as "s3://file"?

Thanks,
Li
Reply | Threaded
Open this post in threaded view
|

Re: Streaming Files to S3

Arvid Heise-3
Hi Li,

S3 file sink will write data into prefixes, with as many part-files as the degree of parallelism. This structure comes from the good ol' Hadoop days, where an output folder also contained part-files and is independent of S3. However, each of the part-files will be uploaded in a multipart fashion, which is S3 specific.

For your questions:
1. It will create one part-file for each parallel instance of the window operator. If you run the window operator and sink with parallelism of 1, you will receive exactly 1 file as you wished.
2. RollingPolicy can indeed be used, but again it's on part-file level. So you would need to use parallelism of 1 again. Also RollingPolicy will signal when the next part is started. So you probably need to roll when the file is something like 99 MB and hope that the last record will not go over 100 MB. Alternatively, you live with some files being a tad larger than 100 MB.
3. Yes, exactly. If you also want to use presto s3 system (in the future) for checkpoints, it's safer to specify "s3a://file".

Best,

Arvid

On Tue, Nov 26, 2019 at 2:59 AM Li Peng <[hidden email]> wrote:
Hey folks, I'm trying to stream large volume data and write them as csv files to S3, and one of the restrictions is to try and keep the files to below 100MB (compressed) and write one file per minute. I wanted to verify with you guys regarding my understanding of StreamingFileSink:

1. From the docs, StreamingFileSink will use multipart upload with s3, so even with many workers writing to s3, it will still output only one file for all of them for each time window, right?
2. StreamingFileSink.forRowFormat can be configured to write individual rows and then commit to disk as per the above rules, by specifying a RollingPolicy with the file size limit and the rollover interval, correct? And the limit and the interval applies to the entire file, not to each part file?
3. To write to s3, is it enough to just add flink-s3-fs-hadoop as a dependency and specify the file path as "s3://file"?

Thanks,
Li