Flink job finished unexpected

rainieli
Hello,

I launched a job with a larger load on a Hadoop YARN cluster.
The job finished after running for 5 hours. I didn't find any errors in the JobManager log besides this connection exception:

2021-02-20 13:20:14,110 WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [/10.1.57.146:48368] failed with java.io.IOException: Connection reset by peer
2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink-metrics@host:35241] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@host:39493] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink-metrics@host:38481] has failed, address is now gated for [50] ms. Reason: [Disassociated] 

Any idea what caused the job to finish and how to resolve it?
Any suggestions are appreciated.

Thanks
Best regards
Rainie

Re: Flink job finished unexpected

Arvid Heise-4
Hi Rainie,

there are two probable causes:
* Network instabilities
* A TaskManager died; in that case, dig further into the TaskManager logs for errors right before that time.

In both cases, Flink should restart the job, provided a suitable restart strategy is configured.
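
For digging through a TaskManager log around the time of the disconnect, a small script along these lines might help. It is only a rough sketch: it assumes the log lines have the same layout as the JobManager excerpt in this thread (timestamp, level, logger, " - ", message), and the names `parse_line` and `entries_before` are made up for illustration, not part of Flink.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, taken from the excerpt in this thread:
# "2021-02-20 13:20:14,110 WARN  some.Logger   - message"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+(\S+)\s+-\s+(.*)$"
)


def parse_line(line):
    """Return (timestamp, level, logger, message), or None if the line
    does not match the assumed layout (e.g. a stack trace line)."""
    match = LINE_RE.match(line.rstrip())
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return timestamp, match.group(2), match.group(3), match.group(4)


def entries_before(lines, cutoff, window_seconds=300, levels=("WARN", "ERROR")):
    """Collect entries at the given levels logged within window_seconds
    before the cutoff time (the cutoff itself is included)."""
    start = cutoff - timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip continuation lines and anything unparseable
        timestamp, level, _logger, _message = parsed
        if level in levels and start <= timestamp <= cutoff:
            hits.append(parsed)
    return hits
```

To use it, open the TaskManager log file, pass the file object as `lines`, and set `cutoff` to the time of the first "Disassociated" warning in the JobManager log (2021-02-20 13:20:14,110 in the excerpt). Stack trace lines are skipped, so the output only points to where to start reading in the full log.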

(In reply to Rainie Li's message of Sat, Feb 20, 2021 at 10:07 PM.)

Re: Flink job finished unexpected

rainieli
I see, I will check the TaskManager log.
Thank you Arvid.

Best regards
Rainie

(In reply to Arvid Heise's message of Wed, Feb 24, 2021 at 5:27 AM.)