Hi,
I wrote recently about our problems with 1.7.2, for which we still haven't found a solution; the cluster remains very unstable. I would now like to point out a different problem that may be related somehow and that we don't understand.
When we restart a Flink session on YARN, it takes a few attempts before the container running the JobManager (JM) becomes stable. The following Gist contains the logs from the 4 failed attempts before the 5th, successful one:
We fail to see why the JM fails. In the first case, I can see a SIGTERM (signal 15), so I assume the cluster manager or something similar is killing it, but I am not sure what happens in the other cases, or why the manager would kill that container. We run 38 streaming jobs, and we are using the same resources that we used before with Flink 1.6 (where we ran in legacy mode).
Thanks for any insights. We are losing a lot of hair with 1.7.2...
Cheers,
Bruno