Re: HDFS append
Posted by rmetzger0
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/HDFS-append-tp530p547.html
Vasia is working on support for reading directories recursively, but I thought that this would also allow you to simulate something like an append.
Did you notice an issue when reading many small files with Flink? Flink handles file reading differently than Spark.
Spark basically starts a task for each file / file split. So if you have millions of small files in your HDFS, Spark will start millions of tasks (queued, however). You need to coalesce in Spark to reduce the number of partitions; by default, operators re-use the partitions of the preceding operator.
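To make the difference concrete, here is a toy sketch (plain Python, not Spark code; the function name `plan_spark_tasks` is made up) of how the one-partition-per-split model translates into task counts, and how coalesce caps them:

```python
# Toy model of Spark-style scheduling: one partition (and hence one task)
# per file/split, optionally capped by coalesce(). Hypothetical code,
# for illustration only -- not the actual Spark scheduler.

def plan_spark_tasks(files, coalesce_to=None):
    """Return the number of tasks queued for these input files."""
    partitions = len(files)          # one partition per file/split
    if coalesce_to is not None:
        # coalesce() merges partitions without a shuffle,
        # reducing the number of downstream tasks
        partitions = min(partitions, coalesce_to)
    return partitions

files = [f"part-{i:05d}" for i in range(1_000_000)]
print(plan_spark_tasks(files))                    # one task per small file
print(plan_spark_tasks(files, coalesce_to=200))   # capped after coalesce(200)
```

So without an explicit coalesce, a directory of a million small files turns into a million queued tasks.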
Flink, on the other hand, starts a fixed number of tasks, each of which reads multiple input splits; the splits are lazily assigned to the tasks as soon as they are ready to process new splits.
Flink will not create a partition for each (small) input file. I expect Flink to handle that case a bit better than Spark (I haven't tested it, though).
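A rough sketch of that lazy-assignment idea (again plain Python, not Flink code; `assign_splits` and the round-robin hand-out are simplifications of the real "whichever task is idle pulls the next split" behavior):

```python
# Toy model of Flink-style split assignment: a FIXED number of source
# tasks pull input splits one at a time from a shared pending list.
# Hypothetical illustration, not the actual Flink split assigner.

def assign_splits(splits, parallelism):
    """Hand out splits lazily to a fixed pool of tasks."""
    pending = list(splits)
    assignments = [[] for _ in range(parallelism)]
    task = 0
    while pending:
        # In reality, whichever task finishes its current split asks for
        # the next one; round-robin stands in for that here.
        assignments[task].append(pending.pop(0))
        task = (task + 1) % parallelism
    return assignments

splits = [f"file-{i}" for i in range(10_000)]
plan = assign_splits(splits, parallelism=8)
print(len(plan))   # still 8 tasks, no matter how many files there are
```

The point is that the task count is fixed by the parallelism, not by the number of files, so a million small files never means a million tasks.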