(DEPRECATED) Apache Flink User Mailing List archive.

Re: Task manager not able to rejoin job manager after network hicup

Posted by jelmer on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Task-manager-not-able-to-rejoin-job-manager-after-network-hicup-tp18525p18534.html

We found out there's a taskmanager.exit-on-fatal-akka-error property that makes the task manager exit in this situation, so that it can be restarted and register again. It is not enabled by default, though, and it feels like a rather blunt tool. I would expect a system like this to be more resilient to a temporary network disruption.
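
For anyone who wants to try it, this is roughly what enabling the property looks like. It is only a sketch: it assumes the setting goes into conf/flink-conf.yaml on each task manager, and that something outside Flink (a Docker restart policy, YARN, a process supervisor) brings the process back up after it exits.

```yaml
# conf/flink-conf.yaml on the task manager (file location is an assumption about your setup)
# Terminate the task manager process on a fatal Akka error such as a quarantine event.
# Off by default; Flink only exits here, so an external supervisor has to restart the process.
taskmanager.exit-on-fatal-akka-error: true
```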

On 23 February 2018 at 14:42, Aljoscha Krettek <[hidden email]> wrote:

@Till Is this the expected behaviour or do you suspect something could be going wrong?
On 23. Feb 2018, at 08:59, jelmer <[hidden email]> wrote:
We've observed on our Flink 1.4.0 setup that if the network connection between the task manager and the job manager is disrupted for any reason, the task manager is never able to reconnect.

You'll end up with messages like these being printed to the log repeatedly:
Trying to register at JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 17, timeout: 30000 milliseconds)
Quarantined address [akka.tcp://flink@jobmanager:6123] is still unreachable or has not been restarted. Keeping it quarantined.
Or alternatively:
Tried to associate with unreachable remote address [akka.tcp://flink@jobmanager:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
It never recovers until you restart either the job manager or the task manager.

I was able to reproduce this behaviour with two Docker containers here:

https://github.com/jelmerk/flink-worker-not-rejoining

Has anyone else seen this problem?