There is a need for better documentation on what connects to what over which ports in a Flink cluster to allow users to configure network access control rules.
I was under the impression that in a ZK HA configuration the Job Managers were essentially independent and only coordinated via ZK. But starting multiple JMs in HA with the JM RPC port blocked between JMs shows that the second JM's Akka subsystem is trying to connect to the leading JM:

INFO akka.remote.transport.ProtocolStateActor - No response from remote for outbound association. Associate timed out after [20000 ms].
WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@10.210.210.127:6123] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@10.210.210.127:6123]] Caused by: [No response from remote for outbound association. Associate timed out after [20000 ms].]
WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: /10.210.210.127:6123
Hey Elias,
Thanks for opening a ticket (for reference: https://issues.apache.org/jira/browse/FLINK-8311). I fully agree with adding docs for this. I will try to write something down this week.

Your point about JobManagers only coordinating via ZK is correct though. I had a look into the JobManager code (as of 1.4) and the leader election service only reads and writes leader information into ZK, which is then picked up by the TaskManagers.

What you are seeing here is related to the web UI which is attached to every JM. The UI tries to connect to the leading JM in order to access runtime information of the leading JM. This is not documented anywhere as far as I can tell and might have changed between 1.3 and 1.4. The port should not be critical to the functioning of your Flink cluster, but only for accessing the web UI on a non-leading JM.

– Ufuk

On Fri, Dec 22, 2017 at 8:36 PM, Elias Levy <[hidden email]> wrote:
> There is a need for better documentation on what connects to what over which
> ports in a Flink cluster to allow users to configure network access control
> rules.
>
> I was under the impression that in a ZK HA configuration the Job Managers
> were essentially independent and only coordinated via ZK. But starting
> multiple JMs in HA with the JM RPC port blocked between JMs shows that the
> second JM's Akka subsystem is trying to connect to the leading JM:
>
> INFO akka.remote.transport.ProtocolStateActor - No response from remote for
> outbound association. Associate timed out after [20000 ms].
> WARN akka.remote.ReliableDeliverySupervisor - Association with remote system
> [akka.tcp://flink@10.210.210.127:6123] has failed, address is now gated for
> [5000] ms. Reason: [Association failed with
> [akka.tcp://flink@10.210.210.127:6123]] Caused by: [No response from remote
> for outbound association. Associate timed out after [20000 ms].]
> WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null]
> failed with
> org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException:
> connection timed out: /10.210.210.127:6123
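
For anyone reproducing this, a plain TCP probe run from the standby JM host may help confirm whether the leading JM's RPC port is blocked by a network rule, independently of Akka. This is only a rough sketch: the helper name can_connect is made up for illustration, the address and port are simply the ones from the log lines above, and a successful TCP connect says nothing about whether Akka itself would associate.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: address and port copied from the log output in this thread;
    # replace them with the leading JM of your own cluster.
    host, port = "10.210.210.127", 6123
    state = "reachable" if can_connect(host, port) else "blocked or unreachable"
    print(f"{host}:{port} is {state}")
```

Running it from each JM host against the other JMs gives a quick picture of which directions the access control rules actually allow.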