Hello,
I launched a job with a larger load on a Hadoop YARN cluster. The job finished after running for 5 hours. I didn't find any errors in the JobManager log besides this connection exception:

2021-02-20 13:20:14,110 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [/10.1.57.146:48368] failed with java.io.IOException: Connection reset by peer
2021-02-20 13:20:14,110 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@host:35241] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2021-02-20 13:20:14,110 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host:39493] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2021-02-20 13:20:14,110 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@host:38481] has failed, address is now gated for [50] ms. Reason: [Disassociated]

Any idea what caused the job to finish and how to resolve it? Any suggestions are appreciated.

Thanks
Best regards
Rainie
Hi Rainie,

there are two probable causes:
* Network instabilities
* A TaskManager died; in that case you can dig further into the TaskManager logs for errors right before that time.

In both cases, Flink should restart the job, provided an appropriate restart strategy is configured.

On Sat, Feb 20, 2021 at 10:07 PM Rainie Li <[hidden email]> wrote:
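
For anyone following the same route, here is a minimal sketch of the TaskManager log check suggested above. It assumes the TaskManager log uses the same line layout as the JobManager excerpt in the original post (timestamp, level, logger, message). The function names, the 5-minute window, and the choice of WARN/ERROR levels are illustrative assumptions, not anything prescribed by Flink.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, taken from the JobManager excerpt in this thread:
# "2021-02-20 13:20:14,110 WARN some.Logger - message"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+(.*)$"
)


def parse_ts(text):
    """Parse a timestamp such as '2021-02-20 13:20:14,110'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def suspicious_lines(lines, failure_time, window_minutes=5,
                     levels=("WARN", "ERROR")):
    """Return lines at the given levels logged in the window
    [failure_time - window_minutes, failure_time]."""
    end = parse_ts(failure_time)
    start = end - timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            # Lines without a leading timestamp (e.g. stack-trace
            # continuation lines) are skipped in this sketch.
            continue
        ts = parse_ts(match.group(1))
        level = match.group(2)
        if start <= ts <= end and level in levels:
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, open the TaskManager log file, pass the file object as `lines`, and pass the time of the disconnect from the JobManager log (here `"2021-02-20 13:20:14,110"`) as `failure_time`. Widening `window_minutes` may help if the TaskManager started failing well before the JobManager noticed.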
I see, I will check the TM log. Thank you, Arvid.

Best regards
Rainie

On Wed, Feb 24, 2021 at 5:27 AM Arvid Heise <[hidden email]> wrote: