I created a Flink cluster on Kubernetes following this guide: https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html The JobManager was running. When a job was submitted to the JobManager, it spawned a TaskManager pod, but the TaskManager failed to connect to the JobManager, and the TaskManager does not show up in the JobManager web UI. This error is suspicious: org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
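One way to read the number in that exception is to turn it back into the bytes the frame decoder took for a length prefix. The sketch below is only an illustration and rests on an assumption that is not confirmed anywhere in this thread: that the reported value is the raw 4-byte length field plus the 4 bytes of the field itself. The function name `decode_frame_prefix` is made up for this example.

```shell
# Print the four bytes (as hex) that were read as the frame length prefix.
# Assumption: reported value = raw 4-byte length field + 4 (the field size).
decode_frame_prefix() {
  adjusted=$1
  raw=$((adjusted - 4))
  printf '%08x\n' "$raw"
}

decode_frame_prefix 352518404   # prints 15030100
```

Under that assumption the first bytes on the wire were 15 03 01 00. In the TLS record format, 0x15 is the alert content type and 03 01 is the TLS 1.0 version field, so this would be consistent with something sending TLS to a plaintext RPC port, for example an SSL setting that differs between the two sides or a proxy in between. That is a guess to be checked against the logs, not a diagnosis.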
|
Hi Liangde, I am pulling in Yang Wang, who is the expert for Flink on K8s. Best, Yun
|
Could you share the JobManager logs so that we can check whether it received the registration from the TaskManager? In a non-HA Flink cluster, the TaskManager uses the service to talk to the JobManager. Currently, Flink creates a headless service for the JobManager. You can use `kubectl get svc` to find it, and then start a busybox pod to check the network connectivity. It would also help if you could share more information about the environment; I could not reproduce your issue in a typical K8s cluster. Best, Yang Yun Gao <[hidden email]> wrote on Fri, Oct 30, 2020 at 11:53 AM:
|
Please find the logs attached. The Kubernetes cluster is an AWS EKS cluster, but it is managed by our infra team. I created a service account "flink" for it, which has permission to create, list, and delete pods, along with some other types of resources, in the "team-anti-cheat" namespace. The command below was used to create the Flink cluster:

./bin/kubernetes-session.sh \
  -Dexecution.attached=true \
  -Dkubernetes.cluster-id=detection-engine-dev \
  -Dkubernetes.namespace=team-anti-cheat \
  -Dkubernetes.container-start-command-template="%java% %classpath% %jvmmem% %jvmopts% %logging% %class% %args%" \
  -Dkubernetes.jobmanager.service-account=flink

Thanks
Liangde Chen

On Mon, 2 Nov 2020 at 08:20, Yang Wang <[hidden email]> wrote:
Attachments: kube-svc.txt (730 bytes), jobmanager.log (66K), taskmanager.log (42K)
|
Hi Liangde Chen, Thanks for providing the logs. After checking them, I am afraid that there is something wrong with your K8s cluster, since detection-engine-dev-taskmanager-1-2 was started and registered with the JobManager successfully. I suggest finding which K8s node detection-engine-dev-taskmanager-1-1 is running on and disabling scheduling on it. Then restart the Flink K8s session and try again. Best, Yang Chen Liangde <[hidden email]> wrote on Mon, Nov 2, 2020 at 3:55 PM:
|
Sorry, I overlooked the logs for detection-engine-dev-taskmanager-1-1. Could you start a busybox pod to check the connectivity to the K8s service "detection-engine-dev"? It seems that the TaskManager tries to connect and gets the response "Connection reset by peer". Best, Yang Yang Wang <[hidden email]> wrote on Mon, Nov 2, 2020 at 5:41 PM:
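The connectivity check suggested above could look roughly like the following. This is a minimal sketch, not a verified procedure: the namespace and service name are taken from the session command earlier in the thread, while the pod name `net-debug` is made up and port 6123 is assumed to be the JobManager RPC port because it is Flink's default for `jobmanager.rpc.port`.

```shell
# Names taken from the thread; the pod name and port are assumptions.
NAMESPACE="team-anti-cheat"
SERVICE="detection-engine-dev"   # same as kubernetes.cluster-id
RPC_PORT=6123                    # assumed: Flink default jobmanager.rpc.port
DEBUG_POD="net-debug"            # hypothetical name for the throwaway pod

# List the services so the JobManager service can be confirmed.
list_services() {
  kubectl get svc -n "$NAMESPACE"
}

# Start a temporary busybox pod, resolve the service name, and try a TCP
# connection to the RPC port. The pod is deleted when the command exits.
probe_jobmanager() {
  kubectl run "$DEBUG_POD" --rm -i --restart=Never --image=busybox \
    -n "$NAMESPACE" -- \
    sh -c "nslookup $SERVICE; nc -w 3 $SERVICE $RPC_PORT </dev/null && echo reachable"
}
```

Running `list_services` first and then `probe_jobmanager` should show whether the name resolves and whether a plain TCP connection is accepted. If the probe also reports a reset, the problem is likely in the cluster network rather than in Flink itself.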
|