Flink 1.3.2 Netty Exception

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink 1.3.2 Netty Exception

Flavio Pompermaier
Hi to all,
we wrote a small JUnit test to reproduce a memory issue we have in a Flink job (that seems related to Netty) . At some point, usually around the 28th loop, the job fails with the following exception (actually we never faced that in production but maybe is related to the memory issue somehow...):

Caused by: java.lang.IllegalAccessError: org/apache/flink/runtime/io/network/netty/NettyMessage
at io.netty.util.internal.__matchers__.org.apache.flink.runtime.io.network.netty.NettyMessageMatcher.match(NoOpTypeParameterMatcher.java)
at io.netty.channel.SimpleChannelInboundHandler.acceptInboundMessage(SimpleChannelInboundHandler.java:95)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:102)
... 16 more

The github project is https://github.com/okkam-it/flink-memory-leak and the JUnit test is contained in the MemoryLeakTest class (within src/main/test).

Thanks in advance for any support,
Flavio
Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.3.2 Netty Exception

Chesnay Schepler
I can confirm that the issue is reproducible with the given test, from the command-line and IDE.

While cutting down the test case, by replacing the outputformat with a DiscardingOutputFormat and the JDBCInputFormat with a simple collection, i stumbled onto a new Exception after ~200 iterations:

org.apache.flink.runtime.client.JobExecutionException: Job execution failed.

Caused by: java.io.IOException: Insufficient number of network buffers: required 4, but only 1 available. The total number of network buffers is currently set to 5691 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.network.memory.fraction', 'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:195)
	at org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:186)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:602)
	at java.lang.Thread.run(Thread.java:745)

On 11.10.2017 12:48, Flavio Pompermaier wrote:
Hi to all,
we wrote a small JUnit test to reproduce a memory issue we have in a Flink job (that seems related to Netty) . At some point, usually around the 28th loop, the job fails with the following exception (actually we never faced that in production but maybe is related to the memory issue somehow...):

Caused by: java.lang.IllegalAccessError: org/apache/flink/runtime/io/network/netty/NettyMessage
at io.netty.util.internal.__matchers__.org.apache.flink.runtime.io.network.netty.NettyMessageMatcher.match(NoOpTypeParameterMatcher.java)
at io.netty.channel.SimpleChannelInboundHandler.acceptInboundMessage(SimpleChannelInboundHandler.java:95)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:102)
... 16 more

The github project is https://github.com/okkam-it/flink-memory-leak and the JUnit test is contained in the MemoryLeakTest class (within src/main/test).

Thanks in advance for any support,
Flavio


Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.3.2 Netty Exception

Ufuk Celebi
@Chesnay: Recycling of network resources happens after the tasks go
into state FINISHED. Since we are submitting new jobs in a local loop
here it can easily happen that the new job is submitted before enough
buffers are available again. At least, previously that was the case.

I'm CC'ing Nico who refactored the network buffer distribution
recently and who might have more details about this specific error
message.

@Nico: another question is why there seem to be more buffers available
but we don't assign them. I'm referring to this part of the error
message "5691 of 32768 bytes...".

On Wed, Oct 11, 2017 at 2:54 PM, Chesnay Schepler <[hidden email]> wrote:

> I can confirm that the issue is reproducible with the given test, from the
> command-line and IDE.
>
> While cutting down the test case, by replacing the outputformat with a
> DiscardingOutputFormat and the JDBCInputFormat with a simple collection, i
> stumbled onto a new Exception after ~200 iterations:
>
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>
> Caused by: java.io.IOException: Insufficient number of network buffers:
> required 4, but only 1 available. The total number of network buffers is
> currently set to 5691 of 32768 bytes each. You can increase this number by
> setting the configuration keys 'taskmanager.network.memory.fraction',
> 'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
> at
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:195)
> at
> org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:186)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:602)
> at java.lang.Thread.run(Thread.java:745)
>
>
> On 11.10.2017 12:48, Flavio Pompermaier wrote:
>
> Hi to all,
> we wrote a small JUnit test to reproduce a memory issue we have in a Flink
> job (that seems related to Netty) . At some point, usually around the 28th
> loop, the job fails with the following exception (actually we never faced
> that in production but maybe is related to the memory issue somehow...):
>
> Caused by: java.lang.IllegalAccessError:
> org/apache/flink/runtime/io/network/netty/NettyMessage
> at
> io.netty.util.internal.__matchers__.org.apache.flink.runtime.io.network.netty.NettyMessageMatcher.match(NoOpTypeParameterMatcher.java)
> at
> io.netty.channel.SimpleChannelInboundHandler.acceptInboundMessage(SimpleChannelInboundHandler.java:95)
> at
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:102)
> ... 16 more
>
> The github project is https://github.com/okkam-it/flink-memory-leak and the
> JUnit test is contained in the MemoryLeakTest class (within src/main/test).
>
> Thanks in advance for any support,
> Flavio
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.3.2 Netty Exception

Flavio Pompermaier
Any update on this? Do you want me to create a JIRA issue for this bug?

On 11 Oct 2017 17:14, "Ufuk Celebi" <[hidden email]> wrote:
@Chesnay: Recycling of network resources happens after the tasks go
into state FINISHED. Since we are submitting new jobs in a local loop
here it can easily happen that the new job is submitted before enough
buffers are available again. At least, previously that was the case.

I'm CC'ing Nico who refactored the network buffer distribution
recently and who might have more details about this specific error
message.

@Nico: another question is why there seem to be more buffers available
but we don't assign them. I'm referring to this part of the error
message "5691 of 32768 bytes...".

On Wed, Oct 11, 2017 at 2:54 PM, Chesnay Schepler <[hidden email]> wrote:
> I can confirm that the issue is reproducible with the given test, from the
> command-line and IDE.
>
> While cutting down the test case, by replacing the outputformat with a
> DiscardingOutputFormat and the JDBCInputFormat with a simple collection, i
> stumbled onto a new Exception after ~200 iterations:
>
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>
> Caused by: java.io.IOException: Insufficient number of network buffers:
> required 4, but only 1 available. The total number of network buffers is
> currently set to 5691 of 32768 bytes each. You can increase this number by
> setting the configuration keys 'taskmanager.network.memory.fraction',
> 'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
>       at
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:195)
>       at
> org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:186)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:602)
>       at java.lang.Thread.run(Thread.java:745)
>
>
> On 11.10.2017 12:48, Flavio Pompermaier wrote:
>
> Hi to all,
> we wrote a small JUnit test to reproduce a memory issue we have in a Flink
> job (that seems related to Netty) . At some point, usually around the 28th
> loop, the job fails with the following exception (actually we never faced
> that in production but maybe is related to the memory issue somehow...):
>
> Caused by: java.lang.IllegalAccessError:
> org/apache/flink/runtime/io/network/netty/NettyMessage
> at
> io.netty.util.internal.__matchers__.org.apache.flink.runtime.io.network.netty.NettyMessageMatcher.match(NoOpTypeParameterMatcher.java)
> at
> io.netty.channel.SimpleChannelInboundHandler.acceptInboundMessage(SimpleChannelInboundHandler.java:95)
> at
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:102)
> ... 16 more
>
> The github project is https://github.com/okkam-it/flink-memory-leak and the
> JUnit test is contained in the MemoryLeakTest class (within src/main/test).
>
> Thanks in advance for any support,
> Flavio
>
>