(DEPRECATED) Apache Flink User Mailing List archive.

flink job exception analysis (netty related, readAddress failed. connection timed out)

yidan zhao

flink job exception analysis (netty related, readAddress failed. connection timed out)

The attachment is the exception stack from Flink's web UI. Has anyone
else run into this problem?

Flink 1.12 to Flink 1.13.1. Standalone cluster with 30 containers,
28 GB of memory each.

rmetzger0

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Hi Yidan,

It seems that the attachment did not make it through the mailing list. Can you copy-paste the text of the exception here or upload the log somewhere?

yidan zhao

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Hi, here is the exception stack as text:

org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
readAddress(..) failed: Connection timed out (connection to
'10.35.215.18/10.35.215.18:2045')
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:201)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
readAddress(..) failed: Connection timed out
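
When this exception shows up repeatedly, it can help to tally which remote endpoints the timeouts point at, since a single recurring address would hint at one problematic node rather than a cluster-wide issue. The following is a minimal sketch, assuming the logs contain the same "(connection to 'host/ip:port')" suffix as the message above and that addresses are IPv4; the function name `count_remote_endpoints` and the regular expression are illustrative and not part of Flink.

```python
import re
from collections import Counter

# Matches the "(connection to 'host/ip:port')" suffix of the exception message.
# \s+ tolerates a line wrap between "connection to" and the quoted address.
REMOTE_RE = re.compile(r"connection to\s+'([^'/]*)/([^':]+):(\d+)'")


def count_remote_endpoints(log_text: str) -> Counter:
    """Count how often each remote (ip, port) appears in transport exceptions."""
    counts = Counter()
    for _host, ip, port in REMOTE_RE.findall(log_text):
        counts[(ip, int(port))] += 1
    return counts


# The first lines of the exception from this thread, wrapped as in the email.
sample = (
    "org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:\n"
    "readAddress(..) failed: Connection timed out (connection to\n"
    "'10.35.215.18/10.35.215.18:2045')\n"
)

endpoint_counts = count_remote_endpoints(sample)
for (ip, port), n in endpoint_counts.most_common():
    print(f"{ip}:{port} -> {n} timeout(s)")
```

Running the same function over the concatenated TaskManager logs would show whether the timeouts cluster on a few addresses.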

Yingjie Cao

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

In reply to this post by yidan zhao

Hi yidan,

1. Is the network stable?

2. Is there any GC problem?

3. Is it a batch job? If so, please use sort-shuffle; see [1] for more information.

4. You could try configuring these two options: taskmanager.network.retries and taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in the 'Data Transport Network Stack' section of [2].

5. If none of the above applies, it may be related to [3]; you may need to check the number of TCP connections per TM and per node.

Hope this helps.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/

[2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/

[3] https://issues.apache.org/jira/browse/FLINK-22643

Best,

Yingjie
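
For point 5, one way to get a rough node-level view of TCP connections on Linux is to read /proc/net/tcp and /proc/net/tcp6 and group the entries by state. This is only a sketch under assumptions: the function name `count_tcp_states` and the sample rows are made up for illustration, the counts cover the whole network namespace rather than a single TaskManager, and per-process numbers would need an extra step such as `ss -tanp` or inspecting /proc/<pid>/fd.

```python
from collections import Counter
from pathlib import Path

# Linux kernel TCP state codes as they appear (in hex) in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_tcp_states(proc_net_tcp: str) -> Counter:
    """Count sockets per TCP state from the text of /proc/net/tcp or tcp6."""
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[3] is the "st" column; unknown codes are grouped as OTHER.
        counts[TCP_STATES.get(fields[3], "OTHER")] += 1
    return counts


# Hypothetical sample rows in the /proc/net/tcp layout (values are made up).
sample = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 20001
   1: 12D7230A:C350 12D7230A:07FD 01 00000000:00000000 00:00000000 00000000  1000        0 20002
   2: 12D7230A:C351 13D7230A:07FD 01 00000000:00000000 00:00000000 00000000  1000        0 20003
"""
sample_counts = count_tcp_states(sample)
print(dict(sample_counts))

# On a Linux host, also report the live counts. JVM sockets often show up in
# tcp6 because Java may open dual-stack sockets.
for name in ("/proc/net/tcp", "/proc/net/tcp6"):
    path = Path(name)
    if path.exists():
        print(name, dict(count_tcp_states(path.read_text())))
```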

yidan zhao

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2: I use G1, and no full GC has occurred. Young GC count: 422, time:
142892, so it is not bad.
3: It is a streaming job.
4: I will try configuring taskmanager.network.retries, which defaults to
0, and taskmanager.network.netty.client.connectTimeoutSec, which defaults
to 120s.
5: I checked the number of network file descriptors of the TaskManager. It
is about 1000+, so I think that is a reasonable value.

1: I cannot be sure.
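
As a quick sanity check on the GC numbers in point 2, the figures can be turned into an average time per young collection and, given the process uptime, a GC overhead fraction. This is a sketch under a stated assumption: the thread does not give the unit of the GC time, so milliseconds is assumed here, and the helper `gc_overhead` and the 24-hour uptime are hypothetical.

```python
# Figures quoted in the message above.
young_gc_count = 422
young_gc_time_ms = 142_892  # assumption: the reported time is in milliseconds

# Average time spent per young collection.
avg_young_gc_ms = young_gc_time_ms / young_gc_count
print(f"average time per young GC: {avg_young_gc_ms:.1f} ms")


def gc_overhead(gc_time_ms: float, uptime_ms: float) -> float:
    """Fraction of the process uptime that was spent in garbage collection."""
    return gc_time_ms / uptime_ms


# Hypothetical uptime of 24 hours; the real uptime is not given in the thread.
uptime_ms = 24 * 60 * 60 * 1000
print(f"GC overhead over 24h: {gc_overhead(young_gc_time_ms, uptime_ms):.4%}")
```

A low overall overhead does not rule out occasional long pauses, so the GC log is still the place to check individual pause times.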

yidan zhao

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Hi, Yingjie.
If the network is not stable, which config parameter should I adjust?

yidan zhao

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

I also searched the internet and found many results. Some of them report a
related exception, org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException,
but in my case it is
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException.
The difference is 'LocalTransportException' versus
'RemoteTransportException'.

Yingjie Cao

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

In reply to this post by yidan zhao

Maybe you can try increasing taskmanager.network.retries, taskmanager.network.netty.server.backlog and taskmanager.network.netty.sendReceiveBufferSize. These options have been useful for our jobs.
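
To make the suggestion concrete, the three options can be written as flat `key: value` lines in flink-conf.yaml. The sketch below only renders such lines; the values are placeholders chosen for illustration, not recommendations from this thread, and the helper name `render_flink_conf` is hypothetical.

```python
# Placeholder values for illustration only; tune them for your own cluster.
overrides = {
    "taskmanager.network.retries": 3,
    "taskmanager.network.netty.server.backlog": 1024,
    # Assumed to be a size in bytes (4 MiB here).
    "taskmanager.network.netty.sendReceiveBufferSize": 4 * 1024 * 1024,
}


def render_flink_conf(options: dict) -> str:
    """Render options as flat 'key: value' lines for flink-conf.yaml."""
    return "\n".join(f"{key}: {value}" for key, value in options.items()) + "\n"


conf_text = render_flink_conf(overrides)
print(conf_text, end="")
```

The printed lines would be appended to flink-conf.yaml on each TaskManager, followed by a restart so the new settings take effect.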

yidan zhao

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Ok, I will try.
