Re: Interacting with flink-jobmanager via CLI in separate pod

Posted by Robert Cullen on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Interacting-with-flink-jobmanager-via-CLI-in-separate-pod-tp43456p43488.html

Thanks for the reply. Here is an updated exception with DEBUG on. It appears to be timing out:

2021-05-05 16:56:19,700 DEBUG org.apache.flink.kubernetes.kubeclient.FlinkKubeClientFactory [] - Setting namespace of Kubernetes client to cmdaa
2021-05-05 16:56:19,700 DEBUG org.apache.flink.kubernetes.kubeclient.FlinkKubeClientFactory [] - Setting max concurrent requests of Kubernetes client to 64
2021-05-05 16:56:20,176 INFO  org.apache.flink.kubernetes.KubernetesClusterDescriptor      [] - Retrieve flink cluster flink-jobmanager successfully, JobManager Web Interface: http://10.43.0.1:30081
2021-05-05 16:56:20,239 INFO  org.apache.flink.client.cli.CliFrontend                      [] - Waiting for response...
2021-05-05 17:02:09,605 ERROR org.apache.flink.client.cli.CliFrontend                      [] - Error while running the command.
org.apache.flink.util.FlinkException: Failed to retrieve job list.
        at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:449) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1060) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132) [flink-dist_2.12-1.13.0.jar:1.13.0]
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
        at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:386) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_292]
        at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:263) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: /10.43.0.1:30081
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:957) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_292]
        at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:263) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: /10.43.0.1:30081
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
ngleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]

On Wed, May 5, 2021 at 6:59 AM Robert Metzger <[hidden email]> wrote:
Hi,
can you check the client log in the "log/" directory?
The Flink client will try to access the K8s API server to retrieve the endpoint of the jobmanager. For that, the pod needs to have permissions (through a service account) to make such calls to K8s. My hope is that the logs or previous messages are giving an indication into what Flink is trying to do.
Can you also try running on DEBUG log level? (should be the log4j-cli.properties file).



On Tue, May 4, 2021 at 3:17 PM Robert Cullen <[hidden email]> wrote:

I have a flink cluster running in kubernetes, just the basic installation with one JobManager and two TaskManagers. I want to interact with it via command line from a separate container ie:

root@flink-client:/opt/flink# ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=job-manager

How do you interact in the same kubernetes instance via CLI (Not from the desktop)?  This is the exception:

------------------------------------------------------------
 The program finished with the following exception:

java.lang.RuntimeException: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.lambda$createClusterClientProvider$0(KubernetesClusterDescriptor.java:103)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:145)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:67)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1001)
        at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427)
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1060)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        ... 9 more
root@flink-client:/opt/flink#
--
Robert Cullen
240-475-4490


--
Robert Cullen
240-475-4490