Hey all,
I'm using the bucketing sink with a bucketer that creates a partition per customer per day, and I sink the files to S3. According to my partitioning it should work on around 500 files at the same time, but I have a critical problem of 'Too many open files'.

I've deployed two TaskManagers, each with 16 slots. I checked how many open files (file descriptors) exist with 'lsof | wc -l' and it had reached over a million on each TaskManager! After that, I decreased the number of task slots to 8 (4 in each TaskManager) and the concurrency dropped; 'lsof | wc -l' then gave around 250k files on each machine. I also checked how many actual files exist in my tmp dir (the sink works on the files there before uploading them to S3) - around 3,000.

I think that each task slot works with several threads (maybe 16?), and each thread holds a fd for the actual file, and that's how the numbers get so high. Is that a known problem? Is there anything I can do? For now, I filter just 10 customers and it works great, but I have to find a real solution so I can stream all the data. Maybe I could also work with a single task slot per machine, but I'm not sure that's a good idea.

Thank you very much,
Alon

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
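[For reference, a minimal sketch of the kind of setup described above: Flink's BucketingSink with a per-customer-per-day bucketer. The CustomerEvent type, its getCustomerId() accessor, and the S3 path are hypothetical, and the inactive-bucket settings shown are just one way to make the sink close idle part files sooner rather than holding their file descriptors open.]

import java.io.Serializable;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.hadoop.fs.Path;

public class CustomerDayBucketingSketch {

    // Hypothetical event type -- stands in for whatever record the job actually sinks.
    public static class CustomerEvent implements Serializable {
        public String customerId;
        public String payload;

        public String getCustomerId() {
            return customerId;
        }
    }

    // One bucket per customer per day. Each parallel sink subtask keeps one open
    // part file per bucket it is currently writing, so the number of open files
    // scales roughly with (active customer/day buckets) x (sink parallelism).
    public static class CustomerDayBucketer implements Bucketer<CustomerEvent> {

        private static final DateTimeFormatter DAY =
                DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

        @Override
        public Path getBucketPath(Clock clock, Path basePath, CustomerEvent element) {
            String day = DAY.format(Instant.ofEpochMilli(clock.currentTimeMillis()));
            return new Path(basePath, "customer=" + element.getCustomerId() + "/day=" + day);
        }
    }

    public static BucketingSink<CustomerEvent> buildSink() {
        // Hypothetical bucket/path.
        BucketingSink<CustomerEvent> sink = new BucketingSink<>("s3://my-bucket/events");

        sink.setBucketer(new CustomerDayBucketer());

        // Close part files for buckets that have not received data for a while,
        // so idle customers do not keep a file descriptor open indefinitely.
        sink.setInactiveBucketThreshold(60 * 1000L);      // close after 1 min of inactivity
        sink.setInactiveBucketCheckInterval(60 * 1000L);  // check once a minute

        return sink;
    }
}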
I have seen this before as well.
My workaround was to limit the parallelism, but it has the unfortunate effect of also limiting the number of processing tasks (and so slowing things down).
Another alternative is to have bigger buckets (and a smaller number of buckets); a sketch of both options follows below.
Not sure if there is a good solution.
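[To make the two workarounds above concrete, a hedged sketch. The stream variable, the S3 path, and the parallelism value are hypothetical. A coarser bucketer such as Flink's DateTimeBucketer reduces the number of simultaneously open buckets, and setting the parallelism on the sink operator alone caps the number of concurrent writers without throttling the upstream processing operators.]

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

public class SinkWorkaroundsSketch {

    // `events` stands in for whatever stream feeds the sink; the S3 path is hypothetical.
    public static void attachSink(DataStream<String> events) {
        BucketingSink<String> sink = new BucketingSink<>("s3://my-bucket/events");

        // Bigger buckets: one bucket per day instead of one per customer per day,
        // so far fewer part files are open at the same time.
        sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd"));

        // Limited parallelism, but only on the sink operator: the number of concurrent
        // writers (and open part files) drops while upstream operators keep the
        // job-wide parallelism.
        events.addSink(sink)
              .setParallelism(4)
              .name("bucketing-s3-sink");
    }
}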
Hi,
There is an open, similar issue: https://issues.apache.org/jira/browse/FLINK-8707

It's still under investigation, and it would be helpful if you could follow up on the discussion there and run the same diagnostic commands as Alexander Gardner did (mainly attaching the output of the lsof command for the TaskManagers).

The last time I looked into it, most of the open files came from loading dependency jars for the operators. It seemed like each task/task slot was executed in a separate classloader, so the same dependency was being loaded over and over again.

Thanks,
Piotrek