If you are using both the Hadoop S3 and Presto S3 filesystems, you should use s3p:// for Presto and s3a:// for Hadoop to distinguish between the two.
The Presto implementation is recommended for checkpointing because the Hadoop implementation has very high latency when creating files and hits request rate limits very quickly. Both problems come from the Hadoop S3 filesystem trying to imitate a normal filesystem on top of S3:
- before writing a key it checks if the "parent directory" exists by checking for a key with the prefix up to the last "/"
- it creates empty marker files to mark the existence of such a parent directory
- all these existence requests are S3 HEAD requests, which have very low request rate limits
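
To make the extra requests concrete, here is a toy model of the behaviour described in the list above. It is only a sketch: CountingStore, parent_marker, write_like_a_filesystem and write_plain are hypothetical names made up for this illustration, and none of this is the actual Hadoop or Presto code. It just counts how many existence checks are issued when a few checkpoint files are written.

```python
from typing import Optional


class CountingStore:
    """In-memory stand-in for a bucket that counts existence checks (hypothetical)."""

    def __init__(self) -> None:
        self.keys = set()
        self.head_requests = 0

    def head(self, key: str) -> bool:
        # Models one HEAD request: does this key exist?
        self.head_requests += 1
        return key in self.keys

    def put(self, key: str) -> None:
        self.keys.add(key)


def parent_marker(key: str) -> Optional[str]:
    """Return the prefix up to and including the last "/", or None if there is none."""
    head, sep, _ = key.rpartition("/")
    return head + sep if sep else None


def write_like_a_filesystem(store: CountingStore, key: str) -> None:
    """Check for the "parent directory" first and create an empty marker if missing."""
    marker = parent_marker(key)
    if marker is not None and not store.head(marker):
        store.put(marker)  # empty marker object standing in for the directory
    store.put(key)


def write_plain(store: CountingStore, key: str) -> None:
    """Write the object directly, with no existence check."""
    store.put(key)


if __name__ == "__main__":
    files = [f"checkpoints/chk-1/part-{i}" for i in range(3)]

    emulated = CountingStore()
    for f in files:
        write_like_a_filesystem(emulated, f)

    plain = CountingStore()
    for f in files:
        write_plain(plain, f)

    print("existence checks with directory emulation:", emulated.head_requests)
    print("existence checks with plain writes:", plain.head_requests)
```

In this model every file written through the emulating path costs one extra existence check, so the number of such checks grows with the number of checkpoint files.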
David Anderson | Training Coordinator