Parallelisation of S3 write sink

Parallelisation of S3 write sink

David Magalhães
I have a scenario where multiple small files need to be written to S3. I'm using a TwoPhaseCommit sink, since I have a specific scenario where I can't use StreamingFileSink.

I've noticed that, because of the way the S3 writes are done (sequentially), the checkpoint is timing out (10 minutes), because it takes too much time to write multiple files to S3. I've searched around a bit and found this documentation: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/asyncio.html

Is this the best way to write multiple files to S3 without waiting for one file to complete before starting the next?

Thanks!

Re: Parallelisation of S3 write sink

Till Rohrmann
Hi David,

I assume that you have written your own TwoPhaseCommitSink which writes to S3, right? If that is the case, then it is mainly up to your implementation how it writes files to S3. If your S3 client supports uploading multiple files concurrently, then you should go for it.

Async I/O won't help you much in this scenario if you need strict exactly-once guarantees. If you can tolerate at-least-once guarantees, then you could try to build an async operator which writes files to S3. But you could do the same in your custom TwoPhaseCommitSink implementation by spawning a ThreadPool and submitting multiple write operations.

Cheers,
Till
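
A minimal sketch of the ThreadPool approach Till describes, assuming the AWS SDK for Java v1 (AmazonS3.putObject) as the S3 client; the class name, pool size, and method names are illustrative, not taken from the thread:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative helper: submit every upload to a thread pool and then wait
// for all of them, so the pre-commit phase no longer pays for each file
// sequentially.
public class ParallelS3Uploader implements AutoCloseable {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    // Pool size is a tuning knob; 8 is just an example value.
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    // Intended to be called from preCommit(): every file must be durably in
    // S3 before the transaction may pre-commit, so we block on each Future.
    public void uploadAll(String bucket, List<File> files) throws Exception {
        List<Future<?>> pending = new ArrayList<>();
        for (File file : files) {
            pending.add(pool.submit(() -> s3.putObject(bucket, file.getName(), file)));
        }
        for (Future<?> f : pending) {
            f.get(); // rethrows any upload failure, which then fails the checkpoint
        }
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}

Blocking on the Futures preserves the exactly-once semantics of the two-phase commit: the uploads run concurrently, but the checkpoint still only completes once all of them have succeeded.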

Re: Parallelisation of S3 write sink

David Magalhães
Thanks for your feedback, Till. I think in this scenario the best approach is to go with the ThreadPool.
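
For completeness, a hypothetical preCommit() in a custom TwoPhaseCommitSinkFunction could hand the files buffered during the checkpoint interval to such an uploader; FileTransaction, pendingFiles(), uploader, and the bucket name are all illustrative:

// Hand every file buffered in this transaction to the parallel uploader.
@Override
protected void preCommit(FileTransaction txn) throws Exception {
    uploader.uploadAll("my-bucket", txn.pendingFiles());
}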
