(DEPRECATED) Apache Flink User Mailing List archive.

flink list and flink run commands timeout

Jason Kania

flink list and flink run commands timeout

I have upgraded from Flink 1.4.0 to Flink 1.5.3 on a three-node cluster configured with HA. Since the upgrade, the flink command-line operations time out. The exception is uninformative: it reports only a timeout, not the reason or the operation that was being attempted:

>./flink list -f
Waiting for response...
------------------------------------------------------------
The program finished with the following exception:

org.apache.flink.util.FlinkException: Failed to retrieve job list.
	at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:433)
	at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:416)
	at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
	at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:413)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1028)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
	at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:793)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException

The web interface shows the 2 job managers and 3 task managers that are talking with one another.

I have looked at the zookeeper data and it is all present.

I have tried running the command on multiple nodes and they all give the same error.

I looked for a verbose or debug option for the commands but found nothing.

Suggestions on this?

Thanks,

Jason
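
For readers following the trace in this message, the chain of causes can be pulled out mechanically. The snippet below is only a sketch: `cause_chain` is a hypothetical helper, and `TRACE` is an abbreviated copy of the trace above (one frame kept per exception). It shows that the innermost cause is a bare `java.util.concurrent.TimeoutException` with no message, which is why the output says so little.

```python
from typing import List

# Abbreviated copy of the stack trace from the message above.
TRACE = """\
org.apache.flink.util.FlinkException: Failed to retrieve job list.
    at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:433)
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
    at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:793)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
    ... 10 more
Caused by: java.util.concurrent.TimeoutException
"""


def cause_chain(trace: str) -> List[str]:
    """Return the exception class names in a Java stack trace, outermost first."""
    chain = []
    for raw in trace.splitlines():
        line = raw.strip()
        # Skip blank lines, stack frames, and "... N more" markers.
        if not line or line.startswith("at ") or line.startswith("..."):
            continue
        if line.startswith("Caused by: "):
            line = line[len("Caused by: "):]
        # The class name is everything before the first colon, if there is one.
        chain.append(line.split(":", 1)[0].strip())
    return chain


for depth, name in enumerate(cause_chain(TRACE)):
    print("  " * depth + name)
```

The helper assumes it is given only the trace itself, without the surrounding console output.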

Chesnay Schepler

Re: flink list and flink run commands timeout

Please enable DEBUG logging for the client and TRACE logging for the cluster.

For the client, look for log messages starting with "Sending request of"; these contain the host and port that the client sends requests to. Verify that they are correct and match the host/port that you use when accessing the web UI.

For the server, look for log messages starting with "Received request", so we can figure out whether the request at least arrives.
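
One way to check for the two messages mentioned above is to filter the log files for the quoted phrases. This is a minimal sketch, not a Flink tool: `matching_lines` and `grep_log` are hypothetical helper names, the log path in the usage note is a placeholder, and the sketch matches the phrase anywhere in a line because the exact layout of these log lines is not shown in this thread.

```python
from typing import Iterable, List


def matching_lines(lines: Iterable[str], marker: str) -> List[str]:
    """Return the lines that contain `marker`, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if marker in line]


def grep_log(path: str, marker: str) -> List[str]:
    """Read a log file and return the lines that contain `marker`."""
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return matching_lines(handle, marker)
```

For example, `grep_log("/path/to/client.log", "Sending request of")` would list the client-side requests, and the same call with `"Received request"` on a cluster log would show whether they arrived. An empty result on the client side suggests that DEBUG logging is not actually in effect for the file being read.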

On 05.09.2018 00:53, Jason Kania wrote:


Jason Kania

Re: flink list and flink run commands timeout

Hello,

Thanks for the response. I had already tried setting the log level to debug in log4j-cli.properties, logback-console.xml, and log4j-console.properties but no additional relevant information comes out. On the server, all that comes out are zookeeper ping responses:

2018-09-05 15:16:56,786 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x3659b60bcb50076 after 1ms

The client log indicates only the following (but we are not using hadoop):

2018-09-05 15:19:53,339 WARN org.apache.flink.client.cli.CliFrontend - Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1208)
	at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1164)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1090)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 5 more

and

2018-09-05 15:19:53,881 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed

despite the zookeeper being configured as 'open' and latest logs showing data being read from zookeeper.

2018-09-05 15:19:54,274 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Reading reply sessionid:0x265a12437df0074, packet:: clientPath:null serverPath:null finished:false header:: 1,3 replyHeader:: 1,47244656277,0 request:: '/flink,F response:: s{47244656196,47244656196,1536110417531,1536110417531,0,1,0,0,0,1,47244656197}

Much like the basic log output, the detailed trace shows no additional information, just a gap after waiting for the response:

2018-09-05 15:19:54,313 INFO org.apache.flink.client.cli.CliFrontend - Waiting for response...
2018-09-05 15:20:07,635 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x265a12437df0074 after 1ms
2018-09-05 15:20:20,976 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x265a12437df0074 after 1ms
2018-09-05 15:20:24,311 INFO org.apache.flink.runtime.rest.RestClient - Shutting down rest endpoint.
2018-09-05 15:20:24,317 INFO org.apache.flink.runtime.rest.RestClient - Rest endpoint shutdown complete.
2018-09-05 15:20:24,318 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
2018-09-05 15:20:24,320 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-09-05 15:20:24,320 DEBUG org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - Closing
2018-09-05 15:20:24,321 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2018-09-05 15:20:24,322 DEBUG org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient - Closing
2018-09-05 15:20:24,322 DEBUG org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Closing
2018-09-05 15:20:24,323 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Closing session: 0x265a12437df0074
2018-09-05 15:20:24,323 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Closing client for session: 0x265a12437df0074
2018-09-05 15:20:24,329 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Reading reply sessionid:0x265a12437df0074, packet:: clientPath:null serverPath:null finished:false header:: 11,-11 replyHeader:: 11,47244656278,0 request:: null response:: null
2018-09-05 15:20:24,329 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Disconnecting client for session: 0x265a12437df0074
2018-09-05 15:20:24,330 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Session: 0x265a12437df0074 closed
2018-09-05 15:20:24,330 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x265a12437df0074
2018-09-05 15:20:24,330 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.
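
The size of the gap in this log can be worked out from the two timestamps that bracket it: "Waiting for response..." and "Shutting down rest endpoint." The sketch below uses a hypothetical helper, `seconds_between`, and assumes the log4j-style timestamp layout seen in these lines. The result is just under 30 seconds, which is consistent with the client giving up on a fixed timeout, although nothing in this thread confirms which setting governs it.

```python
from datetime import datetime

# Timestamp layout used by the log lines above: date, time, comma, milliseconds.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds from `start` to `end`."""
    begin = datetime.strptime(start, LOG_TIME_FORMAT)
    finish = datetime.strptime(end, LOG_TIME_FORMAT)
    return (finish - begin).total_seconds()


# "Waiting for response..." and "Shutting down rest endpoint." respectively.
gap = seconds_between("2018-09-05 15:19:54,313", "2018-09-05 15:20:24,311")
print(f"{gap:.3f} s")  # 29.998 s
```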

On Wednesday, September 5, 2018, 3:41:29 a.m. EDT, Chesnay Schepler <[hidden email]> wrote:


Gary Yao-2

Re: flink list and flink run commands timeout

Hi Jason,

From the stacktrace it seems that you are using the 1.4.0 client to list jobs
on a 1.5.x cluster. This will not work. You have to use the 1.5.x client.

Best,
Gary
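
The rule stated in this reply (a 1.5.x cluster needs a 1.5.x client) can be written as a simple pre-flight check. This is an illustrative sketch only: `same_release_line` is a hypothetical helper, it assumes plain `major.minor.patch` version strings, and it does not query Flink for the versions.

```python
def same_release_line(client_version: str, cluster_version: str) -> bool:
    """Return True when both versions share the same major.minor prefix."""
    client = client_version.split(".")[:2]
    cluster = cluster_version.split(".")[:2]
    return client == cluster


print(same_release_line("1.4.0", "1.5.3"))  # False
print(same_release_line("1.5.0", "1.5.3"))  # True
```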

On Wed, Sep 5, 2018 at 5:35 PM, Jason Kania <[hidden email]> wrote:


Aneesha Kaushal-2

Re: flink list and flink run commands timeout

Hello,

I am facing the same timeout exception with the flink run and flink list commands when I try to deploy jobs on Flink 1.6 in "legacy" mode.

We are planning to run in legacy mode because, after upgrading from Flink 1.3 to Flink 1.6, the Flink job was not getting distributed across task managers.

In "new" mode, jobs are working fine.

Any suggestions?

org.apache.flink.client.program.ProgramInvocationException: Could not submit job (JobID: d6686e184897e8799d71008488ccf80e)
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:249)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
	at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:432)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:379)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:793)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException

Thanks,

Aneesha Kaushal

On 06-Sep-2018, at 10:45 AM, Gary Yao <[hidden email]> wrote:

Hi Jason,

From the stacktrace it seems that you are using the 1.4.0 client to list jobs
on a 1.5.x cluster. This will not work. You have to use the 1.5.x client.

Best,
Gary

On Wed, Sep 5, 2018 at 5:35 PM, Jason Kania <[hidden email]> wrote:
Hello,

Thanks for the response. I had already tried setting the log level to debug in log4j-cli.properties, logback-console.xml, and log4j-console.properties but no additional relevant information comes out. On the server, all that comes out are zookeeper ping responses:

2018-09-05 15:16:56,786 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x3659b60bcb50076 after 1ms

The client log indicates only the following (but we are not using hadoop):

2018-09-05 15:19:53,339 WARN org.apache.flink.client.cli.CliFrontend - Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1208)
at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1164)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1090)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more

and

2018-09-05 15:19:53,881 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed

despite the zookeeper being configured as 'open' and latest logs showing data being read from zookeeper.

2018-09-05 15:19:54,274 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Reading reply sessionid:0x265a12437df0074, packet:: clientPath:null serverPath:null finished:false header:: 1,3 replyHeader:: 1,47244656277,0 request:: '/flink,F response:: s{47244656196,47244656196,1536110417531,1536110417531,0,1,0,0,0,1,47244656197}

Much like the basic log output, the detailed trace shows no additional information, just a gap after waiting for the response:

2018-09-05 15:19:54,313 INFO org.apache.flink.client.cli.CliFrontend - Waiting for response...
2018-09-05 15:20:07,635 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x265a12437df0074 after 1ms
2018-09-05 15:20:20,976 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x265a12437df0074 after 1ms
2018-09-05 15:20:24,311 INFO org.apache.flink.runtime.rest.RestClient - Shutting down rest endpoint.
2018-09-05 15:20:24,317 INFO org.apache.flink.runtime.rest.RestClient - Rest endpoint shutdown complete.
2018-09-05 15:20:24,318 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
2018-09-05 15:20:24,320 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-09-05 15:20:24,320 DEBUG org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - Closing
2018-09-05 15:20:24,321 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2018-09-05 15:20:24,322 DEBUG org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient - Closing
2018-09-05 15:20:24,322 DEBUG org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Closing
2018-09-05 15:20:24,323 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Closing session: 0x265a12437df0074
2018-09-05 15:20:24,323 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Closing client for session: 0x265a12437df0074
2018-09-05 15:20:24,329 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Reading reply sessionid:0x265a12437df0074, packet:: clientPath:null serverPath:null finished:false header:: 11,-11 replyHeader:: 11,47244656278,0 request:: null response:: null
2018-09-05 15:20:24,329 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Disconnecting client for session: 0x265a12437df0074
2018-09-05 15:20:24,330 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Session: 0x265a12437df0074 closed
2018-09-05 15:20:24,330 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x265a12437df0074
2018-09-05 15:20:24,330 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.
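
A note on the gap in the log above: the time between "Waiting for response..." and the REST client shutting down can be measured directly from the two timestamps. The following is a minimal sketch that uses only the two log lines quoted above; the function names are made up for illustration. A gap this close to a round number would be consistent with a fixed client-side timeout expiring, though the log itself does not say which setting is involved.

```python
from datetime import datetime

# Two lines copied verbatim from the client log quoted above.
LOG_LINES = [
    "2018-09-05 15:19:54,313 INFO org.apache.flink.client.cli.CliFrontend - Waiting for response...",
    "2018-09-05 15:20:24,311 INFO org.apache.flink.runtime.rest.RestClient - Shutting down rest endpoint.",
]


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def gap_seconds(first, second):
    """Return the number of seconds elapsed between two log lines."""
    return (parse_timestamp(second) - parse_timestamp(first)).total_seconds()


if __name__ == "__main__":
    print(f"{gap_seconds(*LOG_LINES):.3f} s between the request and the shutdown")
```

The same two helpers can be pointed at any pair of lines from a log4j-style log that uses this timestamp layout.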

On Wednesday, September 5, 2018, 3:41:29 a.m. EDT, Chesnay Schepler <[hidden email]> wrote:

Please enable DEBUG logging for the client and TRACE logging for the cluster.

For the client, look for log messages starting with "Sending request of"; these contain the host and port that the client sends requests to. Verify that they are correct and match the host/port that you use when accessing the web UI.

For the server, look for log messages starting with "Received request", so we can figure out whether the request at least arrives.
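
The two checks described above can be sketched as a small log filter. This is only an illustration: the marker strings are the ones quoted in this message, while the script name, the log file paths passed on the command line, and the helper name `find_request_lines` are assumptions. It presumes DEBUG/TRACE logging has already been enabled, otherwise nothing will match.

```python
import sys

# Marker strings as quoted in the message above.
CLIENT_MARKER = "Sending request of"
SERVER_MARKER = "Received request"


def find_request_lines(log_lines, marker):
    """Return every log line that contains the given marker text."""
    return [line.rstrip("\n") for line in log_lines if marker in line]


if __name__ == "__main__":
    # Hypothetical usage: python find_requests.py <client-log> <jobmanager-log>
    client_log, server_log = sys.argv[1], sys.argv[2]
    for path, marker in ((client_log, CLIENT_MARKER), (server_log, SERVER_MARKER)):
        with open(path, encoding="utf-8", errors="replace") as handle:
            matches = find_request_lines(handle, marker)
        print(f"{path}: {len(matches)} line(s) containing '{marker}'")
        for line in matches:
            print("  " + line)
```

If the client log shows matches but the server log shows none, the request is probably not reaching the host and port the cluster is listening on.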

On 05.09.2018 00:53, Jason Kania wrote:

I have upgraded from Flink 1.4.0 to Flink 1.5.3 with a three-node cluster configured with HA. Now I am encountering an issue where the flink command line operations time out. The exception generated is very poor because it only indicates a timeout, not the reason or what the command was trying to do:

>./flink list -f

Waiting for response...

------------------------------------------------------------

The program finished with the following exception:

org.apache.flink.util.FlinkException: Failed to retrieve job list.
at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:433)
at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:416)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:413)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1028)
at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:793)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
... 10 more
Caused by: java.util.concurrent.TimeoutException

The web interface shows the 2 job managers and 3 task managers that are talking with one another.

I have looked at the zookeeper data and it is all present.

I have tried running the command on multiple nodes and they all give the same error.

I looked for a verbose or debug option for the commands but found nothing.

Suggestions on this?

Thanks,

Jason

Chesnay Schepler

Re: flink list and flink run commands timeout

Based on the stacktrace the client is not running in legacy mode; please check the client flink-conf.yaml.
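
For readers who want to check this quickly, here is a minimal sketch that reports which mode a client-side flink-conf.yaml selects. It assumes the setting is the top-level `mode` key with the values `legacy` and `new` (as used in Flink 1.5/1.6), and that `new` applies when the key is absent; the helper names and the file path are hypothetical. The parser handles only flat `key: value` lines and treats everything after a `#` as a comment, which is a simplification.

```python
import sys


def read_flink_conf(text):
    """Parse flat 'key: value' lines of a flink-conf.yaml into a dict."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # simplification: strip comments
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def client_mode(text):
    """Return the configured mode; 'new' is the assumed default."""
    return read_flink_conf(text).get("mode", "new")


if __name__ == "__main__":
    # Hypothetical usage: python check_mode.py /path/to/client/conf/flink-conf.yaml
    with open(sys.argv[1], encoding="utf-8") as handle:
        print("client mode:", client_mode(handle.read()))
```

Running it against both the client's and the cluster's configuration files makes a mismatch between the two easy to spot.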

On 03.12.2018 12:10, Aneesha Kaushal wrote:

Hello,
I am facing the same timeout exception with the flink run and flink list commands when trying to deploy jobs on Flink 1.6 in "legacy" mode.

We are planning to run in legacy mode because, after upgrading from Flink 1.3 to Flink 1.6, the Flink job was not being distributed across task managers.

In "new" mode, jobs work fine.

Any suggestions?
org.apache.flink.client.program.ProgramInvocationException: Could not submit job (JobID: d6686e184897e8799d71008488ccf80e)
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:249)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
	at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:432)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:379)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:793)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException
Thanks,

Aneesha Kaushal

On 06-Sep-2018, at 10:45 AM, Gary Yao <[hidden email]> wrote:

Hi Jason,

From the stacktrace it seems that you are using the 1.4.0 client to list jobs
on a 1.5.x cluster. This will not work. You have to use the 1.5.x client.

Best,
Gary
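
A quick way to rule out a client/cluster version mismatch is to compare the version embedded in the client's flink-dist jar name with the version the cluster reports. The sketch below assumes the usual `flink-dist_<scala-version>-<flink-version>.jar` naming in the client's lib directory; the helper names are made up, and the cluster version is typed in by hand (for example, as read from the web UI).

```python
import re


def version_from_dist_jar(jar_name):
    """Extract the Flink version from a name like 'flink-dist_2.11-1.5.3.jar'."""
    match = re.search(r"flink-dist_[\d.]+-(\d+\.\d+(?:\.\d+)?)", jar_name)
    return match.group(1) if match else None


def same_minor(version_a, version_b):
    """True when both versions share the same major.minor prefix."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


if __name__ == "__main__":
    client = version_from_dist_jar("flink-dist_2.11-1.4.0.jar")  # example jar name
    cluster = "1.5.3"  # entered by hand for this example
    if same_minor(client, cluster):
        print(f"client {client} and cluster {cluster} are on the same minor version")
    else:
        print(f"client {client} does not match cluster {cluster}")
```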


Aneesha Kaushal-2

Re: flink list and flink run commands timeout

Thanks Chesnay! The exception is gone now.

On 03-Dec-2018, at 5:22 PM, Chesnay Schepler <[hidden email]> wrote:

Based on the stacktrace the client is not running in legacy mode; please check the client flink-conf.yaml.


sen

Re: flink list and flink run commands timeout

Hi Aneesha:

I am facing the same problem. When I turn on HA on YARN, I get the same
exception; when I turn off the HA configuration, it works fine.
What did you do to resolve the problem?

Thanks!
Sen Sun

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Gary Yao-4

Re: flink list and flink run commands timeout

Hi Sen Sun,

The question is already resolved. You can find the entire email thread here:

http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/flink-list-and-flink-run-commands-timeout-td22826.html

Best,
Gary
