Hi Till,
Many thanks for your reply and don't worry. We understand this is tricky and you are busy.
We have been experiencing several issues, and a couple of them have since been addressed, so those logs are probably no longer relevant.
About losing jobs on restart -> it seems that YARN was killing the container for the master because it was not passing the liveness probe. Since around Flink 1.1 we had been using very stringent liveness probe timeouts in YARN to detect very quickly when a node in the cluster went out of service. That timeout (30 seconds) was probably killing the job manager before it was able to recover the ~40 streaming jobs that we run in session mode. I wonder why we had not seen this in 1.6, though; probably because of the legacy mode?
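For reference, I think these are the yarn-site.xml properties behind that liveness probe, although I am quoting the keys from memory, so treat this as a sketch rather than our exact config. We will probably move them back towards the YARN defaults so the job manager has enough time to recover everything:

<!-- How long the ResourceManager waits for a NodeManager heartbeat before declaring the node lost. Default is 600000 ms (10 min); our 30 s timeout would be 30000 here. -->
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>600000</value>
</property>

<!-- How long the ResourceManager waits for an ApplicationMaster heartbeat before failing the attempt; if this is the one set to 30 s, it would explain the job manager being killed mid-recovery. -->
<property>
  <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
  <value>600000</value>
</property>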
Extremely high instability -> that was caused by running in DEBUG mode to capture logs; the sheer volume of log output (especially from AsyncFunctions) filled the disks and led YARN to decommission the nodes. We do process many thousands of messages per second in some of our jobs.
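If we ever need that level of detail again, the plan is to scope DEBUG to individual loggers instead of the root logger in Flink's log4j.properties, roughly like this (the logger names below are only examples, not the exact packages involved):

# Keep everything at INFO by default ("file" is Flink's default file appender).
log4j.rootLogger=INFO, file
# Only the component we are actually investigating gets DEBUG.
log4j.logger.org.apache.flink.runtime.jobmanager=DEBUG
# Quiet down the async-heavy job code that was flooding the disks (hypothetical package name).
log4j.logger.com.example.jobs.async=WARN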
We still have a few instances of the Job Managers losing leadership every few hours (all of them in the cluster). One of our jobs also restarts more often than the rest, but the "Exceptions" tab in the UI for that job just tells us that "The assigned slot XXX was removed". It would be helpful to see why the slot was removed, though.
I am currently looking into those, but the logs don't tell me much (and I can no longer run this environment at such a verbose log level). There is only one entry at ERROR level for when the more unstable job restarts:
java.util.concurrent.TimeoutException: Remote system has been silent for too long. (more than 48.0 hours)
    at akka.remote.ReliableDeliverySupervisor$$anonfun$idle$1.applyOrElse(Endpoint.scala:376)
    at akka.actor.Actor.aroundReceive(Actor.scala:502)
    at akka.actor.Actor.aroundReceive$(Actor.scala:500)
    at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
And for the other issue, all the job managers losing leadership, there are just some warnings about the association with the remote system failing, i.e.:
Remote connection to [null] failed with java.net.ConnectException: Connection refused: ip-10-10-56-193.eu-west-1.compute.internal/10.10.56.193:43041

Association with remote system [akka.tcp://[hidden email]:43041] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[hidden email]:43041]] Caused by: [Connection refused: ip-10-10-56-193.eu-west-1.compute.internal/10.10.56.193:43041]
over and over again...
Thanks for any insight.
Bruno