Flink network access control documentation

Flink network access control documentation

Elias Levy
There is a need for better documentation on what connects to what over which ports in a Flink cluster to allow users to configure network access control rules.

I was under the impression that in a ZK HA configuration the Job Managers were essentially independent and only coordinated via ZK. However, when I start multiple JMs in HA mode with the JM RPC port blocked between them, the second JM's Akka subsystem tries to connect to the leading JM:

INFO  akka.remote.transport.ProtocolStateActor                      - No response from remote for outbound association. Associate timed out after [20000 ms].
WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@10.210.210.127:6123] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@10.210.210.127:6123]] Caused by: [No response from remote for outbound association. Associate timed out after [20000 ms].]
WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: /10.210.210.127:6123


Re: Flink network access control documentation

Ufuk Celebi
Hey Elias,

thanks for opening a ticket (for reference:
https://issues.apache.org/jira/browse/FLINK-8311). I fully agree with
adding docs for this. I will try to write something down this week.

Your point about JobManagers coordinating only via ZK is correct,
though. I had a look at the JobManager code (as of 1.4): the leader
election service only reads and writes leader information in ZK,
which is then picked up by the TaskManagers.

What you are seeing here is related to the web UI, which is attached to
every JM. The UI tries to connect to the leading JM in order to access
its runtime information. This is not documented anywhere as far as I
can tell and might have changed between 1.3 and 1.4. Access to that
port between JMs should not be critical to the functioning of your
Flink cluster; it is only needed for accessing the web UI on a
non-leading JM.
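
To check whether an access control rule is what keeps a non-leading JM from reaching the leader, a plain TCP probe run from the non-leading host can help. The following is a minimal Python sketch and not part of Flink: the helper name can_connect and the 3-second default timeout are illustrative assumptions, and the address in the usage comment is simply the one from the log output quoted in this thread.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; it only tests the TCP handshake.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example, run on the non-leading JM host (address taken from the logs in this thread):
#   can_connect("10.210.210.127", 6123)
```

A False result only says that the TCP handshake did not complete within the timeout; it does not identify which rule dropped the traffic. A True result says nothing about whether Akka would accept the association.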

– Ufuk


On Fri, Dec 22, 2017 at 8:36 PM, Elias Levy <[hidden email]> wrote:

> There is a need for better documentation on what connects to what over which
> ports in a Flink cluster to allow users to configure network access control
> rules.
>
> I was under the impression that in a ZK HA configuration the Job Managers
> were essentially independent and only coordinated via ZK.  But starting
> multiple JMs in HA with the JM RPC port blocked between JMs shows that the
> second JM's Akka subsystem is trying to connect to the leading JM:
>
> INFO  akka.remote.transport.ProtocolStateActor                      - No
> response from remote for outbound association. Associate timed out after
> [20000 ms].
> WARN  akka.remote.ReliableDeliverySupervisor                        -
> Association with remote system [akka.tcp://flink@10.210.210.127:6123] has
> failed, address is now gated for [5000] ms. Reason: [Association failed with
> [akka.tcp://flink@10.210.210.127:6123]] Caused by: [No response from remote
> for outbound association. Associate timed out after [20000 ms].]
> WARN  akka.remote.transport.netty.NettyTransport                    - Remote
> connection to [null] failed with
> org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException:
> connection timed out: /10.210.210.127:6123
>