Interacting with flink-jobmanager via CLI in separate pod

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Interacting with flink-jobmanager via CLI in separate pod

Robert Cullen

I have a flink cluster running in kubernetes, just the basic installation with one JobManager and two TaskManagers. I want to interact with it via command line from a separate container ie:

root@flink-client:/opt/flink# ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=job-manager

How do you interact in the same kubernetes instance via CLI (Not from the desktop)?  This is the exception:

------------------------------------------------------------
 The program finished with the following exception:

java.lang.RuntimeException: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.lambda$createClusterClientProvider$0(KubernetesClusterDescriptor.java:103)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:145)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:67)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1001)
        at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427)
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1060)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        ... 9 more
root@flink-client:/opt/flink#
--
Robert Cullen
240-475-4490
Reply | Threaded
Open this post in threaded view
|

Re: Interacting with flink-jobmanager via CLI in separate pod

rmetzger0
Hi,
can you check the client log in the "log/" directory?
The Flink client will try to access the K8s API server to retrieve the endpoint of the jobmanager. For that, the pod needs to have permissions (through a service account) to make such calls to K8s. My hope is that the logs or previous messages are giving an indication into what Flink is trying to do.
Can you also try running on DEBUG log level? (should be the log4j-cli.properties file).



On Tue, May 4, 2021 at 3:17 PM Robert Cullen <[hidden email]> wrote:

I have a flink cluster running in kubernetes, just the basic installation with one JobManager and two TaskManagers. I want to interact with it via command line from a separate container ie:

root@flink-client:/opt/flink# ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=job-manager

How do you interact in the same kubernetes instance via CLI (Not from the desktop)?  This is the exception:

------------------------------------------------------------
 The program finished with the following exception:

java.lang.RuntimeException: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.lambda$createClusterClientProvider$0(KubernetesClusterDescriptor.java:103)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:145)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:67)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1001)
        at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427)
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1060)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        ... 9 more
root@flink-client:/opt/flink#
--
Robert Cullen
240-475-4490
Reply | Threaded
Open this post in threaded view
|

Re: Interacting with flink-jobmanager via CLI in separate pod

Robert Cullen

Thanks for the reply. Here is an updated exception with DEBUG on. It appears to be timing out:

2021-05-05 16:56:19,700 DEBUG org.apache.flink.kubernetes.kubeclient.FlinkKubeClientFactory [] - Setting namespace of Kubernetes client to cmdaa
2021-05-05 16:56:19,700 DEBUG org.apache.flink.kubernetes.kubeclient.FlinkKubeClientFactory [] - Setting max concurrent requests of Kubernetes client to 64
2021-05-05 16:56:20,176 INFO  org.apache.flink.kubernetes.KubernetesClusterDescriptor      [] - Retrieve flink cluster flink-jobmanager successfully, JobManager Web Interface: http://10.43.0.1:30081
2021-05-05 16:56:20,239 INFO  org.apache.flink.client.cli.CliFrontend                      [] - Waiting for response...
2021-05-05 17:02:09,605 ERROR org.apache.flink.client.cli.CliFrontend                      [] - Error while running the command.
org.apache.flink.util.FlinkException: Failed to retrieve job list.
        at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:449) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1060) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132) [flink-dist_2.12-1.13.0.jar:1.13.0]
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
        at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:386) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_292]
        at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:263) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: /10.43.0.1:30081
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:957) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_292]
        at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:263) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: /10.43.0.1:30081
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
ngleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]

On Wed, May 5, 2021 at 6:59 AM Robert Metzger <[hidden email]> wrote:
Hi,
can you check the client log in the "log/" directory?
The Flink client will try to access the K8s API server to retrieve the endpoint of the jobmanager. For that, the pod needs to have permissions (through a service account) to make such calls to K8s. My hope is that the logs or previous messages are giving an indication into what Flink is trying to do.
Can you also try running on DEBUG log level? (should be the log4j-cli.properties file).



On Tue, May 4, 2021 at 3:17 PM Robert Cullen <[hidden email]> wrote:

I have a flink cluster running in kubernetes, just the basic installation with one JobManager and two TaskManagers. I want to interact with it via command line from a separate container ie:

root@flink-client:/opt/flink# ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=job-manager

How do you interact in the same kubernetes instance via CLI (Not from the desktop)?  This is the exception:

------------------------------------------------------------
 The program finished with the following exception:

java.lang.RuntimeException: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.lambda$createClusterClientProvider$0(KubernetesClusterDescriptor.java:103)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:145)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:67)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1001)
        at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427)
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1060)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        ... 9 more
root@flink-client:/opt/flink#
--
Robert Cullen
240-475-4490


--
Robert Cullen
240-475-4490
Reply | Threaded
Open this post in threaded view
|

Re: Interacting with flink-jobmanager via CLI in separate pod

rmetzger0
Okay, it appears to have resolved 10.43.0.1:30081 as the address of the JobManager. Most likely, the container can not access this address. Can you validate this from within the container?

If I understand the Flink documentation correctly, you should be able to manually specify rest.addressrest.port for the JobManager address. If you can manually figure out an address to the JobManager service, and pass it to Flink, the submission should work.

On Wed, May 5, 2021 at 7:15 PM Robert Cullen <[hidden email]> wrote:

Thanks for the reply. Here is an updated exception with DEBUG on. It appears to be timing out:

2021-05-05 16:56:19,700 DEBUG org.apache.flink.kubernetes.kubeclient.FlinkKubeClientFactory [] - Setting namespace of Kubernetes client to cmdaa
2021-05-05 16:56:19,700 DEBUG org.apache.flink.kubernetes.kubeclient.FlinkKubeClientFactory [] - Setting max concurrent requests of Kubernetes client to 64
2021-05-05 16:56:20,176 INFO  org.apache.flink.kubernetes.KubernetesClusterDescriptor      [] - Retrieve flink cluster flink-jobmanager successfully, JobManager Web Interface: http://10.43.0.1:30081
2021-05-05 16:56:20,239 INFO  org.apache.flink.client.cli.CliFrontend                      [] - Waiting for response...
2021-05-05 17:02:09,605 ERROR org.apache.flink.client.cli.CliFrontend                      [] - Error while running the command.
org.apache.flink.util.FlinkException: Failed to retrieve job list.
        at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:449) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1060) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132) [flink-dist_2.12-1.13.0.jar:1.13.0]
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
        at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:386) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_292]
        at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:263) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: /10.43.0.1:30081
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:957) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_292]
        at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:263) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: /10.43.0.1:30081
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
ngleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]

On Wed, May 5, 2021 at 6:59 AM Robert Metzger <[hidden email]> wrote:
Hi,
can you check the client log in the "log/" directory?
The Flink client will try to access the K8s API server to retrieve the endpoint of the jobmanager. For that, the pod needs to have permissions (through a service account) to make such calls to K8s. My hope is that the logs or previous messages are giving an indication into what Flink is trying to do.
Can you also try running on DEBUG log level? (should be the log4j-cli.properties file).



On Tue, May 4, 2021 at 3:17 PM Robert Cullen <[hidden email]> wrote:

I have a flink cluster running in kubernetes, just the basic installation with one JobManager and two TaskManagers. I want to interact with it via command line from a separate container ie:

root@flink-client:/opt/flink# ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=job-manager

How do you interact in the same kubernetes instance via CLI (Not from the desktop)?  This is the exception:

------------------------------------------------------------
 The program finished with the following exception:

java.lang.RuntimeException: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.lambda$createClusterClientProvider$0(KubernetesClusterDescriptor.java:103)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:145)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:67)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1001)
        at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427)
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1060)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        ... 9 more
root@flink-client:/opt/flink#
--
Robert Cullen
240-475-4490


--
Robert Cullen
240-475-4490
Reply | Threaded
Open this post in threaded view
|

Re: Interacting with flink-jobmanager via CLI in separate pod

Yang Wang
It seems that you are using the NodePort to expose the rest service. If you only want to access the Flink UI/rest in the K8s cluster,
then I would suggest to set "kubernetes.rest-service.exposed.type" to "ClusterIP". Because we are using the K8s master node to
construct the JobManager rest endpoint when using NodePort. Sometime, it is not accessible due to firewall.

Best,
Yang

Robert Metzger <[hidden email]> 于2021年5月6日周四 上午2:08写道:
Okay, it appears to have resolved 10.43.0.1:30081 as the address of the JobManager. Most likely, the container can not access this address. Can you validate this from within the container?

If I understand the Flink documentation correctly, you should be able to manually specify rest.addressrest.port for the JobManager address. If you can manually figure out an address to the JobManager service, and pass it to Flink, the submission should work.

On Wed, May 5, 2021 at 7:15 PM Robert Cullen <[hidden email]> wrote:

Thanks for the reply. Here is an updated exception with DEBUG on. It appears to be timing out:

2021-05-05 16:56:19,700 DEBUG org.apache.flink.kubernetes.kubeclient.FlinkKubeClientFactory [] - Setting namespace of Kubernetes client to cmdaa
2021-05-05 16:56:19,700 DEBUG org.apache.flink.kubernetes.kubeclient.FlinkKubeClientFactory [] - Setting max concurrent requests of Kubernetes client to 64
2021-05-05 16:56:20,176 INFO  org.apache.flink.kubernetes.KubernetesClusterDescriptor      [] - Retrieve flink cluster flink-jobmanager successfully, JobManager Web Interface: http://10.43.0.1:30081
2021-05-05 16:56:20,239 INFO  org.apache.flink.client.cli.CliFrontend                      [] - Waiting for response...
2021-05-05 17:02:09,605 ERROR org.apache.flink.client.cli.CliFrontend                      [] - Error while running the command.
org.apache.flink.util.FlinkException: Failed to retrieve job list.
        at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:449) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1060) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132) [flink-dist_2.12-1.13.0.jar:1.13.0]
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
        at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:386) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_292]
        at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:263) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: /10.43.0.1:30081
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:957) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_292]
        at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:263) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: /10.43.0.1:30081
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
ngleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]

On Wed, May 5, 2021 at 6:59 AM Robert Metzger <[hidden email]> wrote:
Hi,
can you check the client log in the "log/" directory?
The Flink client will try to access the K8s API server to retrieve the endpoint of the jobmanager. For that, the pod needs to have permissions (through a service account) to make such calls to K8s. My hope is that the logs or previous messages are giving an indication into what Flink is trying to do.
Can you also try running on DEBUG log level? (should be the log4j-cli.properties file).



On Tue, May 4, 2021 at 3:17 PM Robert Cullen <[hidden email]> wrote:

I have a flink cluster running in kubernetes, just the basic installation with one JobManager and two TaskManagers. I want to interact with it via command line from a separate container ie:

root@flink-client:/opt/flink# ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=job-manager

How do you interact in the same kubernetes instance via CLI (Not from the desktop)?  This is the exception:

------------------------------------------------------------
 The program finished with the following exception:

java.lang.RuntimeException: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.lambda$createClusterClientProvider$0(KubernetesClusterDescriptor.java:103)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:145)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:67)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1001)
        at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427)
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1060)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        ... 9 more
root@flink-client:/opt/flink#
--
Robert Cullen
240-475-4490


--
Robert Cullen
240-475-4490
Reply | Threaded
Open this post in threaded view
|

Re: Interacting with flink-jobmanager via CLI in separate pod

Robert Cullen

I resolved this by changing the jobmanager-rest-service.yaml (Changed type to ClusterIP and removed nodePort

apiVersion: v1 
kind: Service 
metadata: 
  name: flink-jobmanager-rest 
spec: 
  type: ClusterIP 
  ports: 
  - name: rest 
    port: 8081 
    targetPort: 8081 
    #nodePort: 30081 
  selector: 
    app: flink 
    component: jobmanager

On Wed, May 5, 2021 at 10:28 PM Yang Wang <[hidden email]> wrote:
It seems that you are using the NodePort to expose the rest service. If you only want to access the Flink UI/rest in the K8s cluster,
then I would suggest to set "kubernetes.rest-service.exposed.type" to "ClusterIP". Because we are using the K8s master node to
construct the JobManager rest endpoint when using NodePort. Sometime, it is not accessible due to firewall.

Best,
Yang

Robert Metzger <[hidden email]> 于2021年5月6日周四 上午2:08写道:
Okay, it appears to have resolved 10.43.0.1:30081 as the address of the JobManager. Most likely, the container can not access this address. Can you validate this from within the container?

If I understand the Flink documentation correctly, you should be able to manually specify rest.addressrest.port for the JobManager address. If you can manually figure out an address to the JobManager service, and pass it to Flink, the submission should work.

On Wed, May 5, 2021 at 7:15 PM Robert Cullen <[hidden email]> wrote:

Thanks for the reply. Here is an updated exception with DEBUG on. It appears to be timing out:

2021-05-05 16:56:19,700 DEBUG org.apache.flink.kubernetes.kubeclient.FlinkKubeClientFactory [] - Setting namespace of Kubernetes client to cmdaa
2021-05-05 16:56:19,700 DEBUG org.apache.flink.kubernetes.kubeclient.FlinkKubeClientFactory [] - Setting max concurrent requests of Kubernetes client to 64
2021-05-05 16:56:20,176 INFO  org.apache.flink.kubernetes.KubernetesClusterDescriptor      [] - Retrieve flink cluster flink-jobmanager successfully, JobManager Web Interface: http://10.43.0.1:30081
2021-05-05 16:56:20,239 INFO  org.apache.flink.client.cli.CliFrontend                      [] - Waiting for response...
2021-05-05 17:02:09,605 ERROR org.apache.flink.client.cli.CliFrontend                      [] - Error while running the command.
org.apache.flink.util.FlinkException: Failed to retrieve job list.
        at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:449) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1060) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132) [flink-dist_2.12-1.13.0.jar:1.13.0]
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
        at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:386) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_292]
        at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:263) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: /10.43.0.1:30081
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:957) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_292]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_292]
        at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:430) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:263) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: /10.43.0.1:30081
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
ngleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.12-1.13.0.jar:1.13.0]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]

On Wed, May 5, 2021 at 6:59 AM Robert Metzger <[hidden email]> wrote:
Hi,
can you check the client log in the "log/" directory?
The Flink client will try to access the K8s API server to retrieve the endpoint of the jobmanager. For that, the pod needs to have permissions (through a service account) to make such calls to K8s. My hope is that the logs or previous messages are giving an indication into what Flink is trying to do.
Can you also try running on DEBUG log level? (should be the log4j-cli.properties file).



On Tue, May 4, 2021 at 3:17 PM Robert Cullen <[hidden email]> wrote:

I have a flink cluster running in kubernetes, just the basic installation with one JobManager and two TaskManagers. I want to interact with it via command line from a separate container ie:

root@flink-client:/opt/flink# ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=job-manager

How do you interact in the same kubernetes instance via CLI (Not from the desktop)?  This is the exception:

------------------------------------------------------------
 The program finished with the following exception:

java.lang.RuntimeException: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.lambda$createClusterClientProvider$0(KubernetesClusterDescriptor.java:103)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:145)
        at org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:67)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1001)
        at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427)
        at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1060)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
        at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the rest endpoint of job-manager
        ... 9 more
root@flink-client:/opt/flink#
--
Robert Cullen
240-475-4490


--
Robert Cullen
240-475-4490


--
Robert Cullen
240-475-4490