Hi Sathya,
have you checked this yet?
https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/jobmanager_high_availability.html

I'm no expert on the HA setup, have you also tried Flink 1.3 just in case?

Nico

On Wednesday, 8 November 2017 04:02:47 CET Sathya Hariesh Prakash (sathypra) wrote:
> Hi – We’re currently testing Flink HA and running into a zookeeper timeout
> issue. Error log below.
>
> Is there a production checklist or any information on parameters that are
> related to flink HA that I need to pay attention to?
>
> Any pointers would really help. Please let me know if any additional
> information is needed. Thanks!
>
> NOTE: I see multiple connection timeout messages. With different elapsed
> times.
>
> {
>   "timeMillis":1510095254557,
>   "thread":"Curator-Framework-0",
>   "level":"ERROR",
>   "loggerName":"org.apache.flink.shaded.org.apache.curator.ConnectionState",
>   "message":"Connection timed out for connection string (zookeeper.system.svc.cluster.local:2181) and timeout (15000) / elapsed (15004)",
>   "thrown":{
>     "commonElementCount":0,
>     "localizedMessage":"KeeperErrorCode = ConnectionLoss",
>     "message":"KeeperErrorCode = ConnectionLoss",
>     "name":"org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException",
>       {
>         "class":"org.apache.flink.shaded.org.apache.curator.ConnectionState",
>         "method":"checkTimeouts",
>         "file":"ConnectionState.java",
>         "line":197,
>         "exact":true,
>         "location":"flink-runtime_2.10-1.2.jar",
>         "version":"1.2"
>       },
>       {
>         "class":"org.apache.flink.shaded.org.apache.curator.ConnectionState",
>         "method":"getZooKeeper",
>         "file":"ConnectionState.java",
>         "line":87,
>         "exact":true,
>         "location":"flink-runtime_2.10-1.2.jar",
>         "version":"1.2"
>       },
>       {
>         "class":"org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient",
>         "file":"CuratorZookeeperClient.java",
>         "line":115,
>         "exact":true,
>         "location":"flink-runtime_2.10-1.2.jar",
>         "version":"1.2"
>       },
>       {
>         "class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl",
>         "file":"CuratorFrameworkImpl.java",
>         "line":806,
>         "exact":true,
>         "location":"flink-runtime_2.10-1.2.jar",
>         "version":"1.2"
>       },
>       {
>         "class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl",
>         "file":"CuratorFrameworkImpl.java",
>         "line":792,
>         "exact":true,
>         "location":"flink-runtime_2.10-1.2.jar",
>         "version":"1.2"
>       },
>       {
>         "class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl",
>         "file":"CuratorFrameworkImpl.java",
>         "line":62,
>         "exact":true,
>         "location":"flink-runtime_2.10-1.2.jar",
>         "version":"1.2"
>       },
>       {
>         "class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4",
>         "file":"CuratorFrameworkImpl.java",
>         "line":257,
>         "exact":true,
>         "location":"flink-runtime_2.10-1.2.jar",
>         "version":"1.2"
>       },
>       {
>         "class":"java.util.concurrent.FutureTask",
>         "method":"run",
>         "file":"FutureTask.java",
>         "line":266,
>         "exact":true,
>         "location":"?",
>         "version":"1.8.0_66"
>       },
>       {
>         "class":"java.util.concurrent.ThreadPoolExecutor",
>         "method":"runWorker",
>         "file":"ThreadPoolExecutor.java",
>         "line":1142,
>         "exact":true,
>         "location":"?",
>         "version":"1.8.0_66"
>       },
>       {
>         "class":"java.util.concurrent.ThreadPoolExecutor$Worker",
>         "method":"run",
>         "file":"ThreadPoolExecutor.java",
>         "line":617,
>         "exact":true,
>         "location":"?",
>         "version":"1.8.0_66"
>       },
>       {
>         "class":"java.lang.Thread",
>         "method":"run",
>         "file":"Thread.java",
>         "line":745,
>         "exact":true,
>         "location":"?",
>         "version":"1.8.0_66"
>       }
>     ]
>   },
>   "endOfBatch":false,
>   "loggerFqcn":"org.apache.logging.slf4j.Log4jLogger",
>   "threadId":258,
>   "threadPriority":5
> }
In reply to this post by Sathya Hariesh Prakash (sathypra)
Hi Sathya,
Here are two JIRA issues that may be related: FLINK-5996, FLINK-7021

Are there any logs from your ZK cluster that may be of use?

Since you're on Kubernetes, do you have Liveness/ReadinessChecks on ZK, and if so, do they show any problems? For example, a failed ReadinessCheck could result in the node temporarily being dropped from the K8s Service, resulting in a timeout from Flink.

Actually, it's probably a good idea to avoid using a Service altogether with ZooKeeper in Kubernetes and address the pods directly. For this you could use a StatefulSet, which gives you hostnames like zookeeper-0, zookeeper-1, etc., avoiding the indirection of a Service and allowing the client library to do its own failure resolution, since it knows where to find each ZooKeeper.

--
Patrick Lucas
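To illustrate the StatefulSet suggestion, here is a minimal sketch of how the per-pod connection string could be built. The helper name `zookeeper_quorum`, the replica count of 3, and the optional headless-service suffix are assumptions for illustration only; port 2181 is taken from the error log, and `high-availability.zookeeper.quorum` is the Flink setting that takes the comma-separated host list.

```python
def zookeeper_quorum(replicas, statefulset="zookeeper", headless_service=None, port=2181):
    """Build a comma-separated ZooKeeper connection string from StatefulSet pod names.

    Hypothetical helper: the names and defaults are illustrative assumptions,
    not values taken from the original cluster.
    """
    hosts = []
    for ordinal in range(replicas):
        # StatefulSet pods get stable names of the form <statefulset>-<ordinal>.
        host = f"{statefulset}-{ordinal}"
        if headless_service:
            # Assumption: pods are resolved through a governing headless service.
            host = f"{host}.{headless_service}"
        hosts.append(f"{host}:{port}")
    return ",".join(hosts)


if __name__ == "__main__":
    # Line to place in flink-conf.yaml, assuming three ZooKeeper replicas.
    print("high-availability.zookeeper.quorum: " + zookeeper_quorum(3))
```

With every pod listed explicitly, the client can try the next server itself when one is unreachable, rather than depending on the Service to route around it.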