Flink HA Zookeeper Connection Timeout

Sathya Hariesh Prakash (sathypra)
Hi – We’re currently testing Flink HA and running into a ZooKeeper timeout issue. The error log is below.

Is there a production checklist, or any information on the Flink HA parameters that I need to pay attention to?

Any pointers would really help. Please let me know if any additional information is needed. Thanks!

NOTE: I see multiple connection timeout messages, with different elapsed times.

 
{
   "timeMillis":1510095254557,
   "thread":"Curator-Framework-0",
   "level":"ERROR",
   "loggerName":"org.apache.flink.shaded.org.apache.curator.ConnectionState",
   "message":"Connection timed out for connection string (zookeeper.system.svc.cluster.local:2181) and timeout (15000) / elapsed (15004)",
   "thrown":{
      "commonElementCount":0,
      "localizedMessage":"KeeperErrorCode = ConnectionLoss",
      "message":"KeeperErrorCode = ConnectionLoss",
      "name":"org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException",
      "extendedStackTrace":[
         {
            "class":"org.apache.flink.shaded.org.apache.curator.ConnectionState",
            "method":"checkTimeouts",
            "file":"ConnectionState.java",
            "line":197,
            "exact":true,
            "location":"flink-runtime_2.10-1.2.jar",
            "version":"1.2"
         },
         {
            "class":"org.apache.flink.shaded.org.apache.curator.ConnectionState",
            "method":"getZooKeeper",
            "file":"ConnectionState.java",
            "line":87,
            "exact":true,
            "location":"flink-runtime_2.10-1.2.jar",
            "version":"1.2"
         },
         {
            "class":"org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient",
            "method":"getZooKeeper",
            "file":"CuratorZookeeperClient.java",
            "line":115,
            "exact":true,
            "location":"flink-runtime_2.10-1.2.jar",
            "version":"1.2"
         },
         {
            "class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl",
            "method":"performBackgroundOperation",
            "file":"CuratorFrameworkImpl.java",
            "line":806,
            "exact":true,
            "location":"flink-runtime_2.10-1.2.jar",
            "version":"1.2"
         },
         {
            "class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl",
            "method":"backgroundOperationsLoop",
            "file":"CuratorFrameworkImpl.java",
            "line":792,
            "exact":true,
            "location":"flink-runtime_2.10-1.2.jar",
            "version":"1.2"
         },
         {
            "class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl",
            "method":"access$300",
            "file":"CuratorFrameworkImpl.java",
            "line":62,
            "exact":true,
            "location":"flink-runtime_2.10-1.2.jar",
            "version":"1.2"
         },
         {
            "class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4",
            "method":"call",
            "file":"CuratorFrameworkImpl.java",
            "line":257,
            "exact":true,
            "location":"flink-runtime_2.10-1.2.jar",
            "version":"1.2"
         },
         {
            "class":"java.util.concurrent.FutureTask",
            "method":"run",
            "file":"FutureTask.java",
            "line":266,
            "exact":true,
            "location":"?",
            "version":"1.8.0_66"
         },
         {
            "class":"java.util.concurrent.ThreadPoolExecutor",
            "method":"runWorker",
            "file":"ThreadPoolExecutor.java",
            "line":1142,
            "exact":true,
            "location":"?",
            "version":"1.8.0_66"
         },
         {
            "class":"java.util.concurrent.ThreadPoolExecutor$Worker",
            "method":"run",
            "file":"ThreadPoolExecutor.java",
            "line":617,
            "exact":true,
            "location":"?",
            "version":"1.8.0_66"
         },
         {
            "class":"java.lang.Thread",
            "method":"run",
            "file":"Thread.java",
            "line":745,
            "exact":true,
            "location":"?",
            "version":"1.8.0_66"
         }
      ]
   },
   "endOfBatch":false,
   "loggerFqcn":"org.apache.logging.slf4j.Log4jLogger",
   "threadId":258,
   "threadPriority":5
}
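
Since several of these messages show up with different elapsed times, it may help to pull the numbers out of each one and compare them. The following is only a rough sketch in Python: the regular expression is written against the single message format shown in the log above, and the names `CuratorTimeout` and `parse_curator_timeout` are made up for illustration, not part of Flink or Curator.

```python
import re
from typing import NamedTuple, Optional

# Matches the message format seen in the log above; other Curator versions
# may word the message differently (assumption).
_TIMEOUT_RE = re.compile(
    r"Connection timed out for connection string \((?P<conn>[^)]*)\) "
    r"and timeout \((?P<timeout>\d+)\) / elapsed \((?P<elapsed>\d+)\)"
)


class CuratorTimeout(NamedTuple):
    """Hypothetical container for the values embedded in one log message."""
    connection_string: str
    timeout_ms: int
    elapsed_ms: int

    @property
    def overshoot_ms(self) -> int:
        # How far past the configured timeout the connection attempt ran.
        return self.elapsed_ms - self.timeout_ms


def parse_curator_timeout(message: str) -> Optional[CuratorTimeout]:
    """Return the parsed values, or None if the message has another shape."""
    match = _TIMEOUT_RE.search(message)
    if match is None:
        return None
    return CuratorTimeout(
        connection_string=match.group("conn"),
        timeout_ms=int(match.group("timeout")),
        elapsed_ms=int(match.group("elapsed")),
    )


if __name__ == "__main__":
    msg = (
        "Connection timed out for connection string "
        "(zookeeper.system.svc.cluster.local:2181) "
        "and timeout (15000) / elapsed (15004)"
    )
    print(parse_curator_timeout(msg))
```

Feeding the "message" field of each log record through this gives one row per timeout, which makes it easier to see whether the elapsed times cluster just above the 15000 ms timeout or keep growing.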

Re: Flink HA Zookeeper Connection Timeout

Nico Kruber
Hi Sathya,
have you checked this yet?
https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/jobmanager_high_availability.html

I'm no expert on the HA setup, but have you also tried Flink 1.3, just in case?


Nico

On Wednesday, 8 November 2017 04:02:47 CET Sathya Hariesh Prakash (sathypra)
wrote:
> Hi – We’re currently testing Flink HA and running into a zookeeper timeout
> issue. Error log below.
 
> Is there a production checklist or any information on parameters that are
> related to flink HA that I need to pay attention to?
 

Re: Flink HA Zookeeper Connection Timeout

Patrick Lucas
In reply to this post by Sathya Hariesh Prakash (sathypra)
Hi Sathya,

Here are two JIRA issues that may be related: FLINK-5996 and FLINK-7021.

Are there any logs from your ZK cluster that may be of use? Since you're on Kubernetes, do you have liveness/readiness probes on ZK, and if so, do they show any problems? For example, a failed readiness probe could result in the ZK pod temporarily being dropped from the K8s Service, resulting in a timeout from Flink.
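
To check by hand whether a given ZooKeeper server responds at all, one option is ZooKeeper's "ruok" four-letter command, which a healthy server answers with "imok". Below is a minimal Python sketch of such a check. The function name `zk_ruok` and the default timeout are my own choices for illustration, and on newer ZooKeeper versions the four-letter commands may need to be enabled through `4lw.commands.whitelist`, so treat this as an assumption about your setup.

```python
import socket


def zk_ruok(host: str, port: int = 2181, timeout_s: float = 5.0) -> bool:
    """Return True if the server answers 'ruok' with 'imok' within the timeout.

    Any connection error, DNS failure, or timeout is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)
            sock.sendall(b"ruok")
            reply = b""
            # Read until we have the 4-byte answer or the server closes.
            while len(reply) < 4:
                data = sock.recv(4 - len(reply))
                if not data:
                    break
                reply += data
            return reply == b"imok"
    except OSError:
        return False


# Example call (hostname taken from the log in the original post):
#   zk_ruok("zookeeper.system.svc.cluster.local")
```

Running this in a loop from inside the Flink pod, against both the Service name and the individual ZK pods, should show whether the timeouts line up with one server dropping out.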

Actually, it's probably a good idea to avoid going through a regular (load-balanced) Service altogether with ZooKeeper in Kubernetes and address the pods directly. For this you could use a StatefulSet, which gives you stable hostnames like zookeeper-0, zookeeper-1, etc., avoiding the indirection of a Service and allowing the client library to do its own failure resolution, since it knows where to find each ZooKeeper.
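
As a rough illustration of what addressing the pods directly could look like, here is a small Python sketch that builds the ZooKeeper connection string from per-pod DNS names. Everything specific in it is an assumption: three replicas, a StatefulSet named `zookeeper`, a headless Service named `zookeeper-headless` (hypothetical), and the `system` namespace that appears in the hostname in your log.

```python
from typing import List


def zk_quorum(
    replicas: int,
    statefulset: str = "zookeeper",
    headless_service: str = "zookeeper-headless",  # hypothetical name
    namespace: str = "system",  # taken from the hostname in the log
    port: int = 2181,
) -> str:
    """Build a comma-separated host:port list, one entry per StatefulSet pod."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    hosts = [
        f"{statefulset}-{i}.{headless_service}.{namespace}.svc.cluster.local:{port}"
        for i in range(replicas)
    ]
    return ",".join(hosts)


def flink_ha_lines(quorum: str) -> List[str]:
    """Render the two flink-conf.yaml lines that select ZooKeeper HA."""
    return [
        "high-availability: zookeeper",
        f"high-availability.zookeeper.quorum: {quorum}",
    ]


if __name__ == "__main__":
    for line in flink_ha_lines(zk_quorum(3)):
        print(line)
```

With a list of individual servers in the quorum setting, the ZooKeeper client can move on to another server when one is unavailable, rather than depending on which endpoint the Service currently routes to.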

--
Patrick Lucas

On Wed, Nov 8, 2017 at 4:02 AM, Sathya Hariesh Prakash (sathypra) <[hidden email]> wrote:
Hi – We’re currently testing Flink HA and running into a zookeeper timeout issue. Error log below.

Is there a production checklist or any information on parameters that are related to flink HA that I need to pay attention to? 
