Hi
We are using Flink 1.10.1 with native K8s. The service appears to be created, and I can submit a job and see it in the Web UI, but the TM pods are never created, so the jobs never start.

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.

Is there somewhere I could see the pod creation logs?

Thanks

-- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Hi, Kevin,
Regarding logs, you could follow this guide [1].

BTW, you could execute "kubectl get pod" to list the current pods. If there is something like "flink-taskmanager-1-1", you could execute "kubectl describe pod flink-taskmanager-1-1" to see its status.

[1] https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/native_kubernetes.html#log-files

Best,
Yangze Guo
Amend: for release 1.10.1, please refer to this guide [1].
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/native_kubernetes.html#log-files

Best,
Yangze Guo
I second Yangze's suggestion. You need to get the jobmanager log first; then it will be easier to find the root cause. I know that it is not convenient for users to access the log via kubectl, and we already have a ticket for this [1].

Usually, the reason the Flink resourcemanager cannot allocate taskmanagers from K8s is that the service account is not configured correctly. You could check the RBAC configuration here [2].

Best,
Yang
Thanks!
I do not see any pods of the form `flink-taskmanager-1-1`, so I tried the exec suggestion. The logs are attached below. Is there a quick RBAC check I could perform? I followed the command on the docs page linked (kubectl create clusterrolebinding flink-role-binding-default --clusterrole=edit --serviceaccount=default:default).

2020-06-04 15:34:04,711 INFO org.apache.flink.kubernetes.KubernetesResourceManager - Requesting new TaskManager pod with <1728,1.0>. Number pending requests 1.
2020-06-04 15:34:04,712 INFO org.apache.flink.kubernetes.KubernetesResourceManager - TaskManager flink-cluster-e07a6f7a-8bd1-4306-89f1-a1ff7ea17bf6-taskmanager-1-5994 will be started with TaskExecutorProcessSpec {cpuCores=1.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=384.000mb (402653174 bytes), taskOffHeapSize=0 bytes, networkMemSize=128.000mb (134217730 bytes), managedMemorySize=512.000mb (536870920 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=192.000mb (201326592 bytes)}.
2020-06-04 15:34:14,713 ERROR org.apache.flink.kubernetes.KubernetesResourceManager - Could not start TaskManager in pod flink-cluster-e07a6f7a-8bd1-4306-89f1-a1ff7ea17bf6-taskmanager-1-5994.
java.util.concurrent.CompletionException: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [default] failed.
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1643)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [default] failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:331)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:324)
    at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$0(Fabric8FlinkKubeClient.java:184)
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
    ... 3 more
Caused by: java.net.SocketTimeoutException: timeout
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:656)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:664)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.java:153)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:131)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at org.apache.flink.kubernetes.shadded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at org.apache.flink.kubernetes.shadded.okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:254)
    at org.apache.flink.kubernetes.shadded.okhttp3.RealCall.execute(RealCall.java:92)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:411)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:372)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:241)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:798)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:328)
    ... 6 more
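When reading a nested Java stack trace like the one above, the innermost "Caused by:" entry is usually the most informative. The following is only a small sketch of how one might pull that chain out of a pasted log; the function names and the abbreviated SAMPLE string are made up for illustration and are not part of Flink or the fabric8 client.

```python
import re
from typing import List, Optional

# Matches a fully qualified Java class name ending in Exception or Error.
_EXC = r"[\w.$]+(?:Exception|Error)"

def caused_by_chain(trace: str) -> List[str]:
    """Return the exception classes named in 'Caused by:' clauses, in order."""
    return re.findall(r"Caused by:\s+(" + _EXC + r")", trace)

def root_cause(trace: str) -> Optional[str]:
    """Return the last 'Caused by:' class, or None if there is none."""
    chain = caused_by_chain(trace)
    return chain[-1] if chain else None

# Abbreviated from the log in this thread (stack frames omitted).
SAMPLE = (
    "java.util.concurrent.CompletionException: "
    "io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] "
    "for kind: [Pod] with name: [null] in namespace: [default] failed. "
    "Caused by: io.fabric8.kubernetes.client.KubernetesClientException: "
    "Operation: [create] for kind: [Pod] with name: [null] in namespace: [default] failed. "
    "Caused by: java.net.SocketTimeoutException: timeout"
)

print(root_cause(SAMPLE))
```

For this trace the innermost cause is java.net.SocketTimeoutException, which points at the HTTP call to the API server timing out rather than at the request being rejected.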
If you have created the role binding "flink-role-binding-default" successfully, then it should not be an RBAC issue. It is more likely that the client fails to reach the K8s apiserver because of an okhttp issue with Java 8u252. Could you add the following config option to disable http2? You could find more information here [1].

kubernetes-session.sh ... -Dcontainerized.master.env.HTTP2_DISABLE=true

Best,
Yang
Thanks Yang for the suggestion. I have tried it and I'm still getting the same exception. Is it possible it's due to the null pod name?

Operation: [create] for kind: [Pod] with name: [null] in namespace: [default] failed.

Best,
kevin
Hi Kevin,
It may be because of the K8s character length limit (no more than 63) [1], so the pod name cannot be too long. I notice that you are using the cluster-id automatically generated by the client. That may cause problems; could you set a meaningful cluster-id for your Flink session? For example:

kubernetes-session.sh ... -Dkubernetes.cluster-id=my-flink-k8s-session

This behavior has been improved in Flink 1.11, which checks the length on the client side before submission.

If it still does not work, could you share your full command and jobmanager logs? That will help a lot in finding the root cause.

Best,
Yang
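To make the length argument concrete, here is a rough sketch that measures a TaskManager pod name against the 63-character limit cited in this reply. The naming pattern `<cluster-id>-taskmanager-<attempt>-<index>` is only inferred from the log line earlier in the thread, and the helper names are hypothetical; this is not Flink's own validation code.

```python
# Limit as cited in the reply above; treated here as an assumption.
K8S_NAME_LIMIT = 63

def taskmanager_pod_name(cluster_id: str, attempt: int, index: int) -> str:
    """Build a pod name using the pattern seen in the jobmanager log (assumed)."""
    return f"{cluster_id}-taskmanager-{attempt}-{index}"

def fits_limit(name: str, limit: int = K8S_NAME_LIMIT) -> bool:
    """Return True if the name is no longer than the limit."""
    return len(name) <= limit

# Client-generated cluster-id taken from the log in this thread.
generated = taskmanager_pod_name(
    "flink-cluster-e07a6f7a-8bd1-4306-89f1-a1ff7ea17bf6", 1, 5994
)
# Shorter cluster-id from the example command in this reply.
custom = taskmanager_pod_name("my-flink-k8s-session", 1, 5994)

print(len(generated), fits_limit(generated))
print(len(custom), fits_limit(custom))
```

The generated name comes out at 69 characters, while the name built from the shorter cluster-id is 39 characters.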
Hi Kevin,

Sorry for not noticing your last response. Could you share your full DEBUG level jobmanager logs? I will try to figure out whether it is an issue with Flink or K8s, because I could not reproduce your situation with my local K8s cluster.

Best,
Yang
Thanks for sharing the DEBUG level log.

I carefully checked the logs and found that the kubernetes-client discovered the api server address and token successfully. However, it could not contact the api server (10.100.0.1:443). Could you check whether your api server is configured to allow access from within the cluster? I think you could start any pod and tunnel in to run the following command.

BTW, what's your Kubernetes version? I am also not sure whether increasing the timeout could help:

-Dcontainerized.master.env.KUBERNETES_REQUEST_TIMEOUT=60000
-Dcontainerized.master.env.KUBERNETES_CONNECTION_TIMEOUT=60000

Best,
Yang
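The exact command referred to in this reply is not preserved in the archive. As a stand-in, and only as a sketch, the following checks whether a plain TCP connection to a host and port can be opened within a timeout. The function name `can_connect` is made up for illustration; it tests reachability only and says nothing about TLS or authentication.

```python
import socket

def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

Run from a pod inside the cluster, `can_connect("10.100.0.1", 443)` would indicate whether the api server address mentioned above is reachable at the network level.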