Hi, I'm running Flink 1.8. The cluster seems to be OK, but I see these warnings in the logs:
2019-10-03 14:57:25,152 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /xxx.xxx.xxx.65:46167
2019-10-03 14:57:25,156 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[hidden email].65:46167] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[hidden email].65:46167]] Caused by: [Connection refused: /xxx.xxx.xxx.65:46167]
Hi John, could you provide some details, such as which mode you are running in (standalone/YARN) and the related configuration (jobmanager.address, jobmanager.port, and so on)?
Best, tison.
On Thu, Oct 3, 2019 at 11:02 PM, John Smith <[hidden email]> wrote:
|
I'm running a standalone cluster with ZooKeeper. It seems it was trying to connect to an older node. I rebooted the Job node that was complaining. It seems to be OK now... I have 3 ZooKeeper nodes, 3 Job nodes, and 3 Task nodes.
On Thu, 3 Oct 2019 at 11:15, Zili Chen <[hidden email]> wrote:
|
So I guess it had some older state?
On Thu., Oct. 3, 2019, 11:29 a.m. John Smith, <[hidden email]> wrote:
|
Does the log you attached above come from a TaskManager node? If so, what state is the Job node it tried to connect to in? Did it crash? BTW, it would be helpful if you could attach more of the TM and JM logs, beyond the two lines reporting the Akka connection refusal.
On Fri, Oct 4, 2019 at 2:08 AM, John Smith <[hidden email]> wrote:
|
Sorry, been away on leave. I'll check ASAP.
On Thu, 3 Oct 2019 at 20:52, Zili Chen <[hidden email]> wrote:
|
We see a very similar (if not the same) error running version 1.9 on Kubernetes. So far we have discovered that a TaskManager gets killed and a new one is created, but the JM still thinks it needs to connect to the old (now dead) TM. I was even able to see a TaskManager on the same host and port but with different TM instance IDs in the Flink UI. The issue seems to be persistent (i.e. it doesn't clear after a few minutes). FWIW, the TM was dying due to the liveness probe in K8s. We have increased that, but the issue above is still a concern. Any ideas?
Tim
On Wed, Oct 9, 2019, 3:15 PM John Smith <[hidden email]> wrote:
|
Ok so it seems there was some sort of network issue. Then leader election. But it seems it had some old state and kept trying to connect to the same task machine over and over...?

2019-09-19 22:26:14,841 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0xXXXXXX, likely server has closed socket, closing socket connection and attempting reconnect
2019-09-19 22:26:14,946 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2019-09-19 22:26:14,947 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender http://XXXXXX-2:8081 no longer participates in the leader election.
2019-09-19 22:26:14,947 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2019-09-19 22:26:14,947 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2019-09-19 22:26:14,948 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@XXXXXX-2:37697/user/resourcemanager no longer participates in the leader election.
2019-09-19 22:26:14,948 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@fXXXXXX-2:37697/user/dispatcher no longer participates in the leader election.
2019-09-19 22:26:14,949 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
2019-09-19 22:26:25,185 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-2423287132287811787.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2019-09-19 22:26:25,186 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server XXXXXX.71/XXXXXX.71:2181
2019-09-19 22:26:25,186 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
2019-09-19 22:26:25,192 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to XXXXXX.71/XXXXXX.71:2181, initiating session
2019-09-19 22:26:25,199 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x3017fc1a6660000 has expired
2019-09-19 22:26:25,199 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: LOST
2019-09-19 22:26:25,199 WARN org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Session expired event received
2019-09-19 22:26:25,199 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper lost. The contender http://XXXXXX-2:8081 no longer participates in the leader election.
2019-09-19 22:26:25,199 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper lost. Can no longer retrieve the leader from ZooKeeper.
2019-09-19 22:26:25,200 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper lost. Can no longer retrieve the leader from ZooKeeper.
2019-09-19 22:26:25,199 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=XXXXXX-1.XXXXXX:2181,XXXXXX-2.XXXXXX:2181,XXXXXX-3.XXXXXX:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@2bec854f
2019-09-19 22:26:25,200 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper lost. The contender akka.tcp://flink@XXXXXX-2:37697/user/resourcemanager no longer participates in the leader election.
2019-09-19 22:26:25,200 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper lost. The contender akka.tcp://flink@XXXXXX-2:37697/user/dispatcher no longer participates in the leader election.
2019-09-19 22:26:25,201 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x3017fc1a6660000 has expired, closing socket connection
2019-09-19 22:26:25,201 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection LOST. Changes to the submitted job graphs are not monitored (permanently).
2019-09-19 22:26:25,220 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x3017fc1a6660000
2019-09-19 22:26:25,231 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-2423287132287811787.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2019-09-19 22:26:25,232 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server XXXXXX.33/XXXXXX.33:2181
2019-09-19 22:26:25,232 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
2019-09-19 22:26:25,233 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to XXXXXX.33/XXXXXX.33:2181, initiating session
2019-09-19 22:26:25,247 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server XXXXXX.33/XXXXXX.33:2181, sessionid = 0x301db1787060000, negotiated timeout = 40000
2019-09-19 22:26:25,247 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
2019-09-19 22:26:25,248 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-09-19 22:26:25,253 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-09-19 22:26:25,253 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-09-19 22:26:25,253 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-09-19 22:26:25,253 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-09-19 22:26:25,253 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
2019-09-19 22:26:34,376 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2019-09-19 22:26:34,376 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@XXXXXX.11:38091] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2019-09-19 22:26:35,147 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:26:35,149 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] Caused by: [Connection refused: /XXXXXX.11:46167]
2019-09-19 22:26:45,167 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:26:45,168 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] Caused by: [Connection refused: /XXXXXX.11:46167]
2019-09-19 22:26:55,151 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:26:55,153 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] Caused by: [Connection refused: /XXXXXX.11:46167]
2019-09-19 22:27:05,159 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:27:05,160 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] Caused by: [Connection refused: /XXXXXX.11:46167]
2019-09-19 22:27:15,157 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:27:15,161 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] Caused by: [Connection refused: /XXXXXX.11:46167]
2019-09-19 22:27:25,152 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:27:25,160 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] Caused by: [Connection refused: /XXXXXX.11:46167]
2019-09-19 22:27:35,161 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:27:35,165 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]] Caused by: [Connection refused: /XXXXXX.11:46167]

On Wed, 9 Oct 2019 at 19:44, Timothy Victor <[hidden email]> wrote:
|
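To check whether retries like those in the log above all target one stale endpoint, the warnings can be tallied per remote address. The following is a minimal sketch and not part of Flink or Akka: the function name `summarize_refused` and the regular expression are illustrative assumptions based only on the log format shown in this thread.

```python
import re
from datetime import datetime

# Assumed format: the NettyTransport "Connection refused" warnings quoted in this thread.
REFUSED = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+WARN\s+"
    r"akka\.remote\.transport\.netty\.NettyTransport\s+-\s.*?"
    r"Connection refused: /(\S+)"
)

def summarize_refused(log_text):
    """Map each refused address to the gaps (in seconds) between consecutive attempts."""
    attempts = {}
    for stamp, address in REFUSED.findall(log_text):
        # %f accepts the three-digit millisecond field and pads it to microseconds.
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
        attempts.setdefault(address, []).append(when)
    return {
        address: [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]
        for address, times in attempts.items()
    }
```

A single address with evenly spaced gaps suggests one sender repeatedly retrying one dead endpoint, rather than a wider network problem affecting several nodes.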
Hi John, the reason you are seeing these warnings is that Akka tries to re-establish the connection to a lost endpoint (here, a dead TaskExecutor). This continues until either the connection is quarantined or the underlying ActorRef to the remote endpoint has been garbage collected. The former should not really happen, and the latter should happen once Flink has realized that the TaskExecutor has died. Flink uses its own heartbeats to detect this. Depending on the configuration (the default heartbeat timeout is 50s), this can take a while. However, the warnings should eventually stop appearing. I admit that this is not ideal in a scenario where TaskExecutors die regularly, but it helps when debugging problematic scenarios. One way to suppress these statements is to set the logger for akka.remote to ERROR. But then one would not see when Akka has lost the connection and is trying to reconnect.
Cheers, Till
On Thu, Oct 10, 2019 at 5:31 PM John Smith <[hidden email]> wrote:
|
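The heartbeat-based detection described above can be illustrated with a toy timeout monitor. This is a sketch of the general technique only, not Flink's actual implementation: the class `HeartbeatMonitor`, its method names, and the target label `"tm-1"` are hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Toy timeout-based failure detector (illustrative, not Flink code)."""

    def __init__(self, timeout_s=50.0):
        # 50 s mirrors the default value mentioned in the message above.
        self.timeout_s = timeout_s
        self.last_seen = {}

    def report_heartbeat(self, target, now_s):
        """Record that `target` was heard from at time `now_s` (seconds)."""
        self.last_seen[target] = now_s

    def expire(self, now_s):
        """Forget and return targets whose last heartbeat is older than the timeout."""
        dead = [t for t, seen in self.last_seen.items()
                if now_s - seen > self.timeout_s]
        for target in dead:
            del self.last_seen[target]
        return dead


monitor = HeartbeatMonitor()
monitor.report_heartbeat("tm-1", now_s=0.0)
still_alive = monitor.expire(now_s=30.0)   # within the timeout, nothing expires
timed_out = monitor.expire(now_s=51.0)     # past the timeout, "tm-1" is dropped
```

Until the timeout elapses, the monitor still treats the target as alive, which is why reconnect attempts to a dead endpoint can continue for a while before they stop.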
Oh that's fine. I was just wondering why it happened. It seems to have gone away since the reboot.
On Fri, 18 Oct 2019 at 10:43, Till Rohrmann <[hidden email]> wrote:
|