1) A leader change can occur for a number of
reasons, such as processes crashing, network issues between ZK and
Flink, or the JobManager being stuck in a blocking operation
for a long time. You will need to look at the ZK and Flink logs
to narrow it down.
2) For FLINK-14091 the issue was not
just a ZK leader change but that the ZooKeeper connection was
suspended, i.e., the connection broke down. I'd think the best way
to replicate that is to shut down ZK for a bit, or otherwise make
it unreachable. To trigger a plain leader change, the
easiest way would be to kill the leading JobManager.
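As a rough sketch of the "shut down ZK for a bit" approach, something like the Python script below could be run on the ZK host. It is only an illustration under some assumptions: that ZK is managed through the stock zkServer.sh script and that it is on the PATH, that stopping this one server is enough to make ZK unreachable for Flink (with a multi-node ensemble you would have to stop enough servers to lose the quorum), and that 30 seconds is long enough for your setup, which is an arbitrary choice. The function names are made up for this example.

```python
import subprocess
import time


def zk_outage_commands(zk_server_script="zkServer.sh"):
    """Return the (stop, start) commands for a local ZooKeeper server.

    Assumes the stock zkServer.sh script; adjust the path for your install.
    """
    return [zk_server_script, "stop"], [zk_server_script, "start"]


def simulate_zk_outage(outage_seconds,
                       run=subprocess.check_call,
                       sleep=time.sleep,
                       zk_server_script="zkServer.sh"):
    """Stop ZooKeeper, wait, then start it again.

    `run` and `sleep` are injectable so the sequence can be dry-run
    without touching a real ZooKeeper process.
    """
    stop_cmd, start_cmd = zk_outage_commands(zk_server_script)
    run(stop_cmd)
    try:
        # While ZK is down, Flink's ZK client should lose its connection.
        sleep(outage_seconds)
    finally:
        # Bring ZK back even if the wait is interrupted.
        run(start_cmd)


if __name__ == "__main__":
    # 30 seconds is an arbitrary outage length for this sketch.
    simulate_zk_outage(30)
```

While ZK is down, the JobManager log should show the connection state changes, which you can compare before and after the upgrade.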
On 3/3/2021 7:26 AM, Varun Chakravarthy
Senthilnathan wrote:
Hi,
We are using Flink version 1.9.1 and, in a
long-running environment, we encountered the specific issue
described in https://issues.apache.org/jira/browse/FLINK-14091
While we are working on upgrading our version, we have two questions:
- Why does ZooKeeper go for a leader change? As far as we
  checked, there was no scaling in our cluster at all, and the
  load was minimal. Is there any reason for the ZooKeeper
  leader change to happen?
- Is there a way to reproduce the ZooKeeper leader change
  manually, to verify whether the version upgrade helped us?
Regards,
Varun.