Netty channel closed at AKKA gated status

Netty channel closed at AKKA gated status

Wenrui Meng
We encountered a Netty channel inactive issue while AKKA had gated that task manager. I'm wondering whether the channel was closed because of the AKKA gated status, since all messages to the task manager are dropped at that moment, which might cause the Netty channel exception. If so, should there be coordination between AKKA and Netty? The gated status is not intended to fail the system. Here is the stack trace for the exception:

2019-04-12 12:46:38.413 [flink-akka.actor.default-dispatcher-90] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 3758 (3788228399 bytes in 5967 ms).
2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-25 - Association with remote system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-25 - Association with remote system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
2019-04-12 12:49:14.230 [flink-akka.actor.default-dispatcher-65] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - id (14/96) (93fcbfc535a190e1edcfd913d5f304fe) switched from RUNNING to FAILED.
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'athena592-phx2/10.80.118.166:44177'. This might indicate that the remote task manager was lost.
        at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:117)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
        at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
        at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:294)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
        at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:829)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:610)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:748)
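
For readers following the timeline, here is a minimal sketch (not part of the original post) that measures the gap between two of the log events above. It assumes only the `yyyy-MM-dd HH:mm:ss.SSS` timestamp prefix visible in the log lines; `parse_ts` and `gap_ms` are hypothetical helper names.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS_LENGTH = len("2019-04-12 12:49:14.175")  # fixed-width timestamp prefix


def parse_ts(log_line: str) -> datetime:
    """Parse the leading timestamp of a log line in the format shown above."""
    return datetime.strptime(log_line[:TS_LENGTH], TS_FORMAT)


def gap_ms(earlier_line: str, later_line: str) -> int:
    """Return the whole number of milliseconds between two log lines."""
    delta = parse_ts(later_line) - parse_ts(earlier_line)
    return delta // timedelta(milliseconds=1)


# Timestamps copied from the log above; the message text is shortened here.
gated = "2019-04-12 12:49:14.175 WARN  akka.remote.ReliableDeliverySupervisor"
failed = "2019-04-12 12:49:14.230 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph"

print(gap_ms(gated, failed))  # 55
```

The gating warning and the task failure are 55 ms apart. That is compatible with both being symptoms of one underlying event, although the timing alone does not prove it.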

Re: Netty channel closed at AKKA gated status

Zhijiang(wangzhijiang999)
Hi Wenrui,

I think the akka gated issue and the inactive netty channel are both caused by a task manager exiting or being killed. You should double-check the status of task manager `'athena592-phx2/10.80.118.166:44177'` and the reason it went down.

Best,
Zhijiang
------------------------------------------------------------------
From:Wenrui Meng <[hidden email]>
Send Time:2019-04-13 (Saturday) 01:01
To:user <[hidden email]>
Cc:tzulitai <[hidden email]>
Subject:Netty channel closed at AKKA gated status

Re: Netty channel closed at AKKA gated status

Wenrui Meng
There is no exception or warning in the log of task manager `'athena592-phx2/10.80.118.166:44177'`. In addition, according to the cluster monitoring dashboard, the host was not shut down either. It probably requires turning on DEBUG logging to get more useful information. If the task manager had been killed, I assume there would be a termination message in its log. If not, I don't know how to figure out whether it was due to the task manager being killed or just a connection timeout.



On Sun, Apr 14, 2019 at 7:22 PM zhijiang <[hidden email]> wrote:
Hi Wenrui,

I think the akka gated issue and the inactive netty channel are both caused by a task manager exiting or being killed. You should double-check the status of task manager `'athena592-phx2/10.80.118.166:44177'` and the reason it went down.

Best,
Zhijiang

Re: Netty channel closed at AKKA gated status

Biao Liu
Hi Wenrui,
If a task manager is killed (kill -9), it has no chance to log anything. If the task manager exits because of a connection timeout, there would be something in the log file. So it was probably killed by another user or by the operating system. Please check the operating system logs. BTW, I don't think the DEBUG log level would help.
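
One way to check the operating system log for this is sketched below. It is only an illustration under assumptions: it looks for Linux OOM-killer messages of the form `Killed process <pid> (<name>)` or `Kill process <pid> (<name>)`, whose exact wording varies between kernel versions, and `find_killed_processes` is a hypothetical helper name. A manual `kill -9` from another user normally leaves no kernel log entry, so an empty result does not rule that case out.

```python
import re

# Matches OOM-killer style kernel messages, e.g.
# "Out of memory: Kill process 1234 (java) ..." or "Killed process 1234 (java) ...".
# Assumption: the kernel in use logs one of these wordings.
KILL_PATTERN = re.compile(r"(?:Killed|Kill) process (\d+) \(([^)]+)\)")


def find_killed_processes(lines, name="java"):
    """Return (pid, line) pairs for kernel log lines reporting a killed process.

    `lines` is any iterable of strings, such as an open log file.
    `name` is the process name as the kernel reports it; a task manager JVM
    usually shows up as "java".
    """
    hits = []
    for line in lines:
        match = KILL_PATTERN.search(line)
        if match and match.group(2) == name:
            hits.append((int(match.group(1)), line.rstrip("\n")))
    return hits
```

It could be fed a saved copy of the kernel log from the affected host, for example `find_killed_processes(open("kernel.log"))`, where `kernel.log` is a placeholder for wherever the `dmesg` output or kernel log was saved.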

On Tue, Apr 16, 2019 at 9:16 AM Wenrui Meng <[hidden email]> wrote:
There is no exception or warning in the log of task manager `'athena592-phx2/10.80.118.166:44177'`. In addition, according to the cluster monitoring dashboard, the host was not shut down either. It probably requires turning on DEBUG logging to get more useful information. If the task manager had been killed, I assume there would be a termination message in its log. If not, I don't know how to figure out whether it was due to the task manager being killed or just a connection timeout.



Re: Netty channel closed at AKKA gated status

Zhijiang(wangzhijiang999)
Hi Wenrui,

If you have confirmed that the target task executor is still alive, you might further check whether there is a network connection issue between the job master and the target task executor.

Best,
Zhijiang
------------------------------------------------------------------
From:Biao Liu <[hidden email]>
Send Time:2019-04-16 (Tuesday) 10:14
To:Wenrui Meng <[hidden email]>
Cc:zhijiang <[hidden email]>; user <[hidden email]>; tzulitai <[hidden email]>
Subject:Re: Netty channel closed at AKKA gated status

Hi Wenrui,
If a task manager is killed (kill -9), it has no chance to log anything. If the task manager exits because of a connection timeout, there would be something in the log file. So it was probably killed by another user or by the operating system. Please check the operating system logs. BTW, I don't think the DEBUG log level would help.

Re: Netty channel closed at AKKA gated status

Wenrui Meng
I looked at a few similar instances. The lost task manager was indeed no longer active, since no log was printed for that task manager after the timestamp of the connection issue. I guess the task manager somehow died silently, without logging any exception or termination-related information. I double-checked the lost task manager's host: GC, CPU, memory, network, and disk I/O all look good, without any spikes. Is there any other way the task manager could have been terminated? We run our jobs in a YARN cluster.

On Mon, Apr 15, 2019 at 10:47 PM zhijiang <[hidden email]> wrote:
Hi Wenrui,

If you have confirmed that the target task executor is still alive, you might further check whether there is a network connection issue between the job master and the target task executor.

Best,
Zhijiang

Re: Netty channel closed at AKKA gated status

Wenrui Meng
I've attached the last 10000 lines of the lost task manager's log. Can anyone help take a look?

Thanks,
Wenrui

On Fri, Apr 19, 2019 at 6:32 PM Wenrui Meng <[hidden email]> wrote:
I looked at a few similar instances. The lost task manager was indeed no longer active, since no log was printed for that task manager after the timestamp of the connection issue. I guess the task manager somehow died silently, without logging any exception or termination-related information. I double-checked the lost task manager's host: GC, CPU, memory, network, and disk I/O all look good, without any spikes. Is there any other way the task manager could have been terminated? We run our jobs in a YARN cluster.

On Mon, Apr 15, 2019 at 10:47 PM zhijiang <[hidden email]> wrote:
Hi Wenrui,

You might further check whether there exists network connection issue between job master and target task executor if you confirm the target task executor is still alive.

Best,
Zhijiang
------------------------------------------------------------------
From:Biao Liu <[hidden email]>
Send Time:2019年4月16日(星期二) 10:14
To:Wenrui Meng <[hidden email]>
Cc:zhijiang <[hidden email]>; user <[hidden email]>; tzulitai <[hidden email]>
Subject:Re: Netty channel closed at AKKA gated status

Hi Wenrui,
If a task manager is killed (kill -9), it would have no chance to log anything. If the task manager exits since connection timeout, there would be something in log file. So it is probably killed by other user or operating system. Please check the log of operating system. BTW, I don't think "DEBUG log level" would help.

Wenrui Meng <[hidden email]> 于2019年4月16日周二 上午9:16写道:
There is no exception or any warning in the task manager `'athena592-phx2/10.80.118.166:44177'` log. In addition, the host was not shut down either in cluster monitor dashboard. It probably requires to turn on DEBUG log to get more useful information. If the task manager gets killed, I assume there will be terminating log in the task manager log. If not, I don't know how to figure out whether it's due to task manager gets killed or just a connection timeout.



On Sun, Apr 14, 2019 at 7:22 PM zhijiang <[hidden email]> wrote:
Hi Wenrui,

I think the Akka gated state and the inactive Netty channel are both caused by the task manager exiting or being killed. You should double-check the status of task manager `'athena592-phx2/10.80.118.166:44177'` and the reason it went down.

Best,
Zhijiang

log_tail_10000 (989K) Download Attachment

Re: Netty channel closed at AKKA gated status

Zhijiang(wangzhijiang999)
Hi Wenrui,

I think you could check the NodeManager log, which records the lifecycle of this task executor's container. The task executor may have been killed by the NodeManager for using too much memory.

Best,
Zhijiang
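
As a rough aid for the NodeManager log check suggested above, here is a minimal sketch that scans log lines for memory-related container kills. The message wording it matches ("beyond ... memory limit") and the sample lines are assumptions for illustration; the exact text differs between Hadoop versions, so adjust `MEMORY_HINT` to what your NodeManager actually prints. The names `find_memory_kills` and `SAMPLE_LINES` are made up for this example.

```python
import re
import sys

# Container IDs in YARN logs start with "container_".
CONTAINER_ID = re.compile(r"container_[0-9A-Za-z_]+")
# Assumed wording of the kill message; adapt it to your Hadoop version.
MEMORY_HINT = re.compile(r"beyond .*memory limit", re.IGNORECASE)

# Made-up sample lines in the style of a NodeManager log, for demonstration only.
SAMPLE_LINES = [
    "2019-04-12 12:49:10,101 INFO Container container_1554000000000_0001_01_000014 is RUNNING",
    "2019-04-12 12:49:13,900 WARN Container [pid=1234,containerID=container_1554000000000_0001_01_000014] "
    "is running beyond physical memory limits. Killing container.",
]


def find_memory_kills(lines):
    """Return (container id or None, stripped line) for each line that looks
    like a memory-limit kill."""
    hits = []
    for line in lines:
        if MEMORY_HINT.search(line):
            match = CONTAINER_ID.search(line)
            hits.append((match.group(0) if match else None, line.strip()))
    return hits


if __name__ == "__main__":
    # Pass a NodeManager log path as the first argument, or run without
    # arguments to scan the built-in sample lines.
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log_file:
            results = find_memory_kills(log_file)
    else:
        results = find_memory_kills(SAMPLE_LINES)
    for container, line in results:
        print(container, "|", line)
```

Matching the reported container ID against the container that hosted the lost task manager then shows whether YARN killed it.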

Re: Netty channel closed at AKKA gated status

Wenrui Meng
Thanks. We found the relevant NodeManager log and figured out that the lost task manager was killed by YARN for exceeding its memory limit. [hidden email] [hidden email] Thanks for your help.
