Two things:

1. It would be beneficial, I think, to drop a line somewhere in the docs (probably on the production readiness checklist as well as the HA page) explaining that enabling ZooKeeper high availability allows your jobs to restart automatically after a jobmanager crash or restart. We had spent some cycles trying to implement job restarting and watchdogs (poorly) before I discovered this from a Flink Forward presentation on YouTube.

2. I seem to have found some odd behavior with HA, and then found
something that works, but I can't explain why. The CliffsNotes version: I took an existing standalone cluster with a single JM and switched it to high-availability ZooKeeper mode. The same flink-conf.yaml file is used on all nodes (including the JM). This seemed to work fine; I restarted the JM (jm0) and the jobs relaunched when it came back. Easy!

Then I deployed a second JM (jm1). I modified `masters`, set the HA RPC port range, and opened those ports on the firewall for both jobmanagers, but left `jobmanager.rpc.address` at the original value, `jm0`, on all nodes (roughly the configuration sketched below).
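For illustration, a minimal sketch of what that setup could look like; the hostnames (jm0, jm1, zk1-zk3), ports, and storage path here are placeholder assumptions, not the actual values from this cluster:

```yaml
# conf/masters -- one JobManager per line (hostname:webui-port); hosts assumed
jm0:8081
jm1:8081

# conf/flink-conf.yaml -- identical copy on every node; values are illustrative
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha/   # any shared storage; path assumed
# HA RPC port range, opened on the firewall for both JobManagers
high-availability.jobmanager.port: 50000-50025
# left at the original value on all nodes
jobmanager.rpc.address: jm0
```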
I then observed that jm0 worked fine: taskmanagers connected to it and jobs ran. jm1 did not 301 me over to jm0, however; it displayed a dashboard of its own (no jobs, no TMs). When I stopped jm0, the jobs showed up on jm1 as RESTARTING, but the taskmanagers never attached to jm1. In the logs, all nodes, including jm1, had messages about trying to reach jm0.

From the documentation and various comments I've seen, `jobmanager.rpc.address` should be ignored in HA mode. However, commenting it out entirely led to the jobmanagers crashing at boot, and setting it to `localhost` caused all the taskmanagers to log messages about trying to connect to the jobmanager at localhost. What finally worked was to set the value individually, on each node, to the hostname of the machine where that flink-conf.yaml lives, even on the taskmanagers (sketched below).

Does this seem like a bug? Just a hunch, but is there something called an "akka leader" that is different from the jobmanager leader, and could it somehow be defaulting its value over to `jobmanager.rpc.address`?
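A sketch of that workaround, assuming a node whose hostname is `tm3` (the hostname is a placeholder; each node would substitute its own):

```yaml
# flink-conf.yaml on node tm3 -- everything else stays identical across nodes,
# but jobmanager.rpc.address is set to this machine's own hostname
jobmanager.rpc.address: tm3
```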
|
Hi Derek,

1. I've created a JIRA issue to improve the docs as you recommended [1].

2018-05-05 15:34 GMT+02:00 Derek VerLee <[hidden email]>:
|
Hi Derek,

given that you've started the different Flink cluster components all with the same HA-enabled configuration, the TMs should be able to connect to jm1 after you've killed jm0. The jobmanager.rpc.address should not be used when HA mode is enabled.

In order to get to the bottom of the described problem, it would be tremendously helpful to get access to the logs of all components (jm0, jm1 and the TMs). Additionally, it would be good to know which Flink version you're using.

Cheers,
Till

On Mon, May 7, 2018 at 2:38 PM, Fabian Hueske <[hidden email]> wrote:
|
Alright, try to grab the logs if you see this problem reoccurring. I would be interested in understanding why this happens.

Cheers,
Till

On Fri, May 18, 2018 at 9:45 PM, Derek VerLee <[hidden email]> wrote:
|