flink on my cluster gets stuck

flink on my cluster gets stuck

Attila Bernáth
Dear Developers,

I am running some experiments on my cluster. I submit the same job a
couple of times; it finishes on the first 5-6 occasions, but the next
one fails and gets stuck (the web dashboard stops updating).

I use Flink 0.7, compiled from source.

In the log file of one of my task managers I find the following (a
similar message is written every second, I copy only the last two):

10:58:21,540 WARN  io.netty.channel.DefaultChannelPipeline
          - An exceptionCaught() event was fired, and it reached at
the tail of the pipeline. It usually means the last handler in the
pipeline did not handle the exception.
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
        at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135)
        at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:68)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
        at java.lang.Thread.run(Thread.java:745)
10:58:22,541 WARN  io.netty.channel.DefaultChannelPipeline
          - An exceptionCaught() event was fired, and it reached at
the tail of the pipeline. It usually means the last handler in the
pipeline did not handle the exception.
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
        at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135)
        at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:68)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
        at java.lang.Thread.run(Thread.java:745)

Any ideas what this can be?

Attila

Re: flink on my cluster gets stuck

Ufuk Celebi
Hey Attila,

this means that your system is running out of file handles. Can you execute "ulimit -n" on your machines and report the value back? You will have to increase that value.
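
For reference, raising the limit looks roughly like this on most Linux
distributions (a sketch only; the exact file and syntax depend on your
system, and the value 65536 and the user name "flink" are just
placeholders for whatever fits your setup):

        # check the current per-process limit for the user running the TaskManagers
        ulimit -n

        # raise it for the current shell session only
        ulimit -n 65536

        # make it permanent by adding lines like these to /etc/security/limits.conf,
        # then logging in again and restarting the TaskManagers
        flink  soft  nofile  65536
        flink  hard  nofile  65536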

We actually multiplex multiple logical channels over the same TCP connection in order to reduce the number of concurrently open file handles. The problem, which leads to "too many open files", is that channels are not closed. Let me look into that and get back to you.
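
In the meantime, if you want to check whether handles are actually leaking (rather than the limit just being too low for your job), you could watch the TaskManager's open file descriptors between job runs, for example like this on Linux (with <taskmanager-pid> standing in for the PID of the TaskManager JVM):

        # count open file descriptors (including sockets) of the TaskManager process
        lsof -p <taskmanager-pid> | wc -l

        # or, without lsof:
        ls /proc/<taskmanager-pid>/fd | wc -l

If that number keeps growing after every finished job, the channels are not being released.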

– Ufuk

Re: flink on my cluster gets stuck

Attila Bernáth
Dear Ufuk,

"ulimit -n" says 8192.

It seems that some of the task managers do not report a heartbeat
(this is what I find in the job manager's log), and the job manager
fails to cancel the job.

Attila


Re: flink on my cluster gets stuck

rmetzger0
Were you able to increase the number of file handles in your cluster?

I think the TaskManager is not reporting any heartbeats because it basically crashed once the "Too many open files" exception occurred.
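
If you do raise the limit, it is probably worth verifying that the running TaskManager JVM actually picked up the new value (changes to /etc/security/limits.conf only apply to processes started after a fresh login). On Linux you could check roughly like this, with <taskmanager-pid> standing in for the TaskManager's PID:

        # shows the soft and hard "Max open files" limits of the running process
        grep 'open files' /proc/<taskmanager-pid>/limits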

Re: flink on my cluster gets stuck

Attila Bernáth
Dear Robert,

I have not had this problem recently. If I run into it again I will
get back to it.

2014-10-31 6:41 GMT+01:00 Robert Metzger <[hidden email]>:
> Were you able to increase the number of file handles in your cluster?
Do you think that this is the solution? 8192 is already quite a lot.

Attila