Hey all,
I'm using the bucketing sink with a bucketer that creates a partition per customer per day, and I sink the files to S3. According to my partitioning it should work on around 500 files at the same time, but I have a critical problem of 'Too many open files'.

I've deployed two TaskManagers, each with 16 slots. I checked how many open files (file descriptors) exist with 'lsof | wc -l' and it had reached over a million on each TaskManager! After that, I decreased the number of task slots to 8 (4 in each TaskManager) and the concurrency dropped; 'lsof | wc -l' then gave around 250k files on each machine. I also checked how many actual files exist in my tmp dir (the sink works on the files there before uploading them to S3) - around 3,000.

I think that each task slot works with several threads (maybe 16?), and each thread holds a fd for the actual file, and that's how the numbers get so high. Is that a known problem? Is there anything I can do? For now, I filter just 10 customers and it works great, but I have to find a real solution so I can stream all the data. Maybe I could also work with a single task slot per machine, but I'm not sure that's a good idea.

Thank you very much,
Alon

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
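[For reference, a minimal sketch of the kind of setup described above: Flink's BucketingSink with a per-customer-per-day bucketer. The CustomerEvent type, its getCustomerId() accessor, and the S3 path are hypothetical, and the inactive-bucket settings shown are just one way to make the sink close idle part files sooner rather than holding their file descriptors open.]

import java.io.Serializable;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.hadoop.fs.Path;

public class CustomerDayBucketingSketch {

    // Hypothetical event type -- stands in for whatever record the job actually sinks.
    public static class CustomerEvent implements Serializable {
        public String customerId;
        public String payload;

        public String getCustomerId() {
            return customerId;
        }
    }

    // One bucket per customer per day. Each parallel sink subtask keeps one open
    // part file per bucket it is currently writing, so the number of open files
    // scales roughly with (active customer/day buckets) x (sink parallelism).
    public static class CustomerDayBucketer implements Bucketer<CustomerEvent> {

        private static final DateTimeFormatter DAY =
                DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

        @Override
        public Path getBucketPath(Clock clock, Path basePath, CustomerEvent element) {
            String day = DAY.format(Instant.ofEpochMilli(clock.currentTimeMillis()));
            return new Path(basePath, "customer=" + element.getCustomerId() + "/day=" + day);
        }
    }

    public static BucketingSink<CustomerEvent> buildSink() {
        // Hypothetical bucket/path.
        BucketingSink<CustomerEvent> sink = new BucketingSink<>("s3://my-bucket/events");

        sink.setBucketer(new CustomerDayBucketer());

        // Close part files for buckets that have not received data for a while,
        // so idle customers do not keep a file descriptor open indefinitely.
        sink.setInactiveBucketThreshold(60 * 1000L);      // close after 1 min of inactivity
        sink.setInactiveBucketCheckInterval(60 * 1000L);  // check once a minute

        return sink;
    }
}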
I have seen this before as well.
My workaround was to limit the parallelism, but it has the unfortunate effect of also limiting the number of processing tasks (and so slowing things down).
Another alternative is to have bigger buckets (and a smaller number of buckets); a sketch of both options follows below.
Not sure if there is a good solution.
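[To make the two workarounds above concrete, a hedged sketch. The stream variable, the S3 path, and the parallelism value are hypothetical. A coarser bucketer such as Flink's DateTimeBucketer reduces the number of simultaneously open buckets, and setting the parallelism on the sink operator alone caps the number of concurrent writers without throttling the upstream processing operators.]

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

public class SinkWorkaroundsSketch {

    // `events` stands in for whatever stream feeds the sink; the S3 path is hypothetical.
    public static void attachSink(DataStream<String> events) {
        BucketingSink<String> sink = new BucketingSink<>("s3://my-bucket/events");

        // Bigger buckets: one bucket per day instead of one per customer per day,
        // so far fewer part files are open at the same time.
        sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd"));

        // Limited parallelism, but only on the sink operator: the number of concurrent
        // writers (and open part files) drops while upstream operators keep the
        // job-wide parallelism.
        events.addSink(sink)
              .setParallelism(4)
              .name("bucketing-s3-sink");
    }
}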
Hi,
There is an open, similar issue: https://issues.apache.org/jira/browse/FLINK-8707

It's still under investigation, and it would be helpful if you could follow up on the discussion there and run the same diagnostic commands as Alexander Gardner did (mainly attaching the output of the lsof command for the TaskManagers).

The last time I looked into it, most of the open files came from loading dependency jars for the operators. It seemed like each task/task slot was executed in a separate classloader, so the same dependency was being loaded over and over again.

Thanks,
Piotrek