I have a scenario where multiple small files need to be written to S3. I'm using a TwoPhaseCommit sink since I have a specific scenario where I can't use StreamingFileSink.
I've noticed that because of the way the S3 write is done (sequentially), the checkpoint is timing out (10 minutes), because it takes too much time to write multiple files to S3. I searched for a bit and found this documentation: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/asyncio.html Would this be the best way to write multiple files to S3 without waiting for one file to complete before writing the next one? Thanks!
Hi David, I assume that you have written your own TwoPhaseCommitSink which writes to S3, right? If that is the case, then it is mainly up to your implementation how it writes files to S3. If your S3 client supports uploading multiple files concurrently, then you should go for it. Async I/O won't help you much in this scenario if you have strict exactly-once guarantees. If you can tolerate at-least-once guarantees, then you could try to build an async operator which writes files to S3. But you could do the same in your custom TwoPhaseCommitSink implementation by spawning a ThreadPool and submitting multiple write operations. Cheers, Till On Fri, Apr 3, 2020 at 2:21 PM David Magalhães <[hidden email]> wrote:
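To illustrate the ThreadPool approach suggested above, here is a minimal sketch: all uploads are submitted to a fixed-size `ExecutorService` and then joined before the commit returns, so a failed upload still fails the checkpoint. The class name `ConcurrentS3Writer` and the `writeToS3` stub are hypothetical; in a real sink the stub would call your actual S3 client, and this code would live inside your TwoPhaseCommitSink implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentS3Writer {

    // Hypothetical placeholder: a real sink would invoke the S3 client here.
    static String writeToS3(String key) {
        // ... upload the file for this key ...
        return key;
    }

    // Submit all writes to a fixed-size pool and block until every upload
    // finishes, so the sink only acknowledges the checkpoint when all files
    // are durably written.
    static int writeAllConcurrently(List<String> keys, int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String key : keys) {
                futures.add(pool.submit(() -> writeToS3(key)));
            }
            int completed = 0;
            for (Future<String> f : futures) {
                f.get(); // rethrows if an upload failed, failing the checkpoint
                completed++;
            }
            return completed;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> keys = List.of("part-0", "part-1", "part-2");
        System.out.println(writeAllConcurrently(keys, 2));
    }
}
```

Because the writes still complete before the checkpoint is acknowledged, this keeps the exactly-once guarantee of the two-phase commit while parallelizing the slow S3 round trips.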
Thanks for your feedback, Till. I think in this scenario the best approach is to go with the ThreadPool. On Fri, Apr 3, 2020 at 1:47 PM Till Rohrmann <[hidden email]> wrote: