The attachment is the exception stack from Flink's web UI. Has anyone else run into this problem?

Flink 1.12 - Flink 1.13.1. Standalone cluster with 30 containers, each with 28 GB of memory.

Hi Yidan,
it seems that the attachment did not make it through the mailing list. Can you copy-paste the text of the exception here or upload the log somewhere?

On Wed, Jun 16, 2021 at 9:36 AM yidan zhao <[hidden email]> wrote:
> The attachment is the exception stack from Flink's web UI. Has anyone
> else run into this problem?
>
> Flink 1.12 - Flink 1.13.1. Standalone cluster with 30 containers,
> each with 28 GB of memory.

Hi, here is the text of the exception stack:

org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection timed out (connection to '10.35.215.18/10.35.215.18:2045')
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:201)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out

On Wed, Jun 16, 2021 at 4:26 PM Robert Metzger <[hidden email]> wrote:
> Hi Yidan,
> it seems that the attachment did not make it through the mailing list. Can
> you copy-paste the text of the exception here or upload the log somewhere?

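Hi yidan,

1. Is the network stable?
2. Is there any GC problem?
3. Is it a batch job? If so, please use sort-shuffle; see [1] for more information.
4. You may try to configure these two options: taskmanager.network.retries and taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in the 'Data Transport Network Stack' section of [2].
5. If none of the above applies, it may be related to [3]; you may need to check the number of TCP connections per TM and per node.

Hope this helps.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
[3] https://issues.apache.org/jira/browse/FLINK-22643

Best,
Yingjie

On Wed, Jun 16, 2021 at 3:36 PM yidan zhao <[hidden email]> wrote:
> The attachment is the exception stack from Flink's web UI. Has anyone
> else run into this problem?
>
> Flink 1.12 - Flink 1.13.1. Standalone cluster with 30 containers,
> each with 28 GB of memory.
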
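One way to act on point 5 above is to count TCP connections by state on a node. The following is a minimal sketch, assuming a Linux host where /proc/net/tcp and /proc/net/tcp6 are readable. The function name `count_tcp_states` is made up for illustration, and the counts cover the whole network namespace rather than a single TaskManager process.

```python
from collections import Counter
from pathlib import Path

# Hex state codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text: str) -> Counter:
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed rows
        state_code = fields[3].upper()
        counts[TCP_STATES.get(state_code, state_code)] += 1
    return counts


if __name__ == "__main__":
    total = Counter()
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():
            total += count_tcp_states(path.read_text())
    for state, n in total.most_common():
        print(f"{state:12s} {n}")
```

Running this on each node at the time the timeouts occur gives a rough picture of how many connections are established versus stuck in other states. A per-process view would need the sockets under /proc/<pid>/fd, which this sketch does not cover.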
2: I use G1, and no full GC occurred; young GC count: 422, time: 142892, so it is not bad.
3: It is a streaming job.
4: I will try to configure taskmanager.network.retries, which defaults to 0; the default of taskmanager.network.netty.client.connectTimeoutSec is 120s.
5: I checked the number of network fds of the taskmanager; it is about 1000+, so I think it is a reasonable value.

1: I cannot be sure.

On Wed, Jun 16, 2021 at 4:34 PM Yingjie Cao <[hidden email]> wrote:
> Hi yidan,
>
> 1. Is the network stable?
> 2. Is there any GC problem?
> 3. Is it a batch job? If so, please use sort-shuffle; see [1] for more information.
> 4. You may try to configure these two options: taskmanager.network.retries and taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in the 'Data Transport Network Stack' section of [2].
> 5. If none of the above applies, it may be related to [3]; you may need to check the number of TCP connections per TM and per node.
>
> Hope this helps.
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> [3] https://issues.apache.org/jira/browse/FLINK-22643
>
> Best,
> Yingjie

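For scale, the young-GC figures reported above can be turned into an average pause per collection. This is only a sketch: it assumes the reported time is a cumulative total in milliseconds, which the thread does not state, and the variable names are chosen for illustration.

```python
# Figures quoted in the message above.
young_gc_count = 422
young_gc_time_ms = 142892  # assumption: cumulative time in milliseconds

# Average time spent per young collection under that assumption.
avg_pause_ms = young_gc_time_ms / young_gc_count
print(f"average young GC pause: {avg_pause_ms:.1f} ms")
```

Under that assumption the average is roughly 339 ms per collection. An average says nothing about the longest individual pause, so the GC log would still be needed to rule out outliers.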
Hi, yingjie.
If the network is not stable, which config parameter should I adjust?

On Wed, Jun 16, 2021 at 6:56 PM yidan zhao <[hidden email]> wrote:
> 2: I use G1, and no full GC occurred; young GC count: 422, time: 142892, so it is not bad.
> 3: It is a streaming job.
> 4: I will try to configure taskmanager.network.retries, which defaults to 0; the default of taskmanager.network.netty.client.connectTimeoutSec is 120s.
> 5: I checked the number of network fds of the taskmanager; it is about 1000+, so I think it is a reasonable value.
>
> 1: I cannot be sure.

I also searched the internet and found many results. Some of them report a related exception, org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException, but in my case it is org.apache.flink.runtime.io.network.netty.exception.LocalTransportException. 'LocalTransportException' and 'RemoteTransportException' are different exceptions.

On Wed, Jun 16, 2021 at 7:10 PM yidan zhao <[hidden email]> wrote:
> Hi, yingjie.
> If the network is not stable, which config parameter should I adjust?

Maybe you can try increasing taskmanager.network.retries, taskmanager.network.netty.server.backlog and taskmanager.network.netty.sendReceiveBufferSize. These options have been useful for our jobs.

On Wed, Jun 16, 2021 at 7:10 PM yidan zhao <[hidden email]> wrote:
> Hi, yingjie.
> If the network is not stable, which config parameter should I adjust?

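The three options named in the reply above would be set in flink-conf.yaml roughly as follows. This is a sketch only: the option keys come from the thread, but every value here is a placeholder chosen for illustration, not a recommendation made by anyone in the thread.

```yaml
# flink-conf.yaml (fragment); all values are illustrative assumptions.

# The thread states the default is 0; any value above that enables retries.
taskmanager.network.retries: 3

# Placeholder value; check the configuration reference for the default.
taskmanager.network.netty.server.backlog: 1024

# Placeholder value; check the configuration reference for the unit and default.
taskmanager.network.netty.sendReceiveBufferSize: 4194304
```

On a standalone cluster these settings are read at startup, so the TaskManagers would need a restart before a change takes effect.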
Ok, I will try.

On Wed, Jun 16, 2021 at 8:00 PM Yingjie Cao <[hidden email]> wrote:
> Maybe you can try increasing taskmanager.network.retries, taskmanager.network.netty.server.backlog and taskmanager.network.netty.sendReceiveBufferSize. These options have been useful for our jobs.