Native kubernetes setup failed to start job

Chen Liangde

I created a Flink cluster on Kubernetes following this guide: https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html

The JobManager was running. When a job was submitted to it, it spawned a TaskManager pod, but the TaskManager failed to connect to the JobManager, and the TaskManager does not show up in the JobManager web UI.

This error looks suspicious: org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded

2020-10-29 13:22:51,069 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Connecting to ResourceManager akka.tcp://[hidden email]-anti-cheat:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
2020-10-29 13:22:51,180 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://[hidden email]-anti-cheat:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[hidden email]-anti-cheat:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
2020-10-29 13:22:51,183 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Could not resolve ResourceManager address akka.tcp://[hidden email]-anti-cheat:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://[hidden email]-anti-cheat:6123/user/rpc/resourcemanager_*.
2020-10-29 13:23:01,203 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
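
One way to read the number in the TooLongFrameException is to turn it back into the four bytes that were parsed as a frame length. The sketch below assumes that the Akka Netty transport prefixes each frame with a 4-byte big-endian length and that the reported "adjusted" length is that raw prefix plus those 4 bytes; neither detail is stated in this thread, and `adjusted_to_prefix_hex` is a hypothetical helper name.

```shell
# Assumption: adjusted frame length = raw 4-byte length prefix + 4.
# Prints the raw prefix as 8 hex digits, i.e. the first four bytes received.
adjusted_to_prefix_hex() {
    adjusted=$1
    printf '%08x\n' $(( adjusted - 4 ))
}

adjusted_to_prefix_hex 352518404   # prints 15030100
```

Under that assumption the connection started with the bytes 15 03 01 00. A TLS record carrying an alert begins with content type 0x15 followed by a version such as 03 01, so one possible explanation is that something on the path to port 6123 answered with TLS while the TaskManager was speaking plain TCP. That is only a hypothesis; the logs here do not confirm it.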

Re: Native kubernetes setup failed to start job

Yun Gao
Hi Liangde,

I am pulling in Yang Wang, who is the expert on Flink on K8s.

Best,
 Yun

Re: Native kubernetes setup failed to start job

Yang Wang
Could you share the JobManager logs so that we can check whether it received the
registration from the TaskManager?

In a non-HA Flink cluster, the TaskManager talks to the JobManager through a K8s service.
Currently, Flink creates a headless service for the JobManager. You can use `kubectl get svc`
to find it, and then start a busybox pod to check the network connectivity.
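
The check described above could be sketched roughly as follows. `check_jobmanager_service` and the pod name `flink-net-check` are hypothetical; the namespace, service name, and port in the example are taken from the TaskManager log lines earlier in the thread, and the sketch assumes the `nc` in the busybox image supports `-z` and `-w`.

```shell
# Hypothetical helper: check_jobmanager_service NAMESPACE SERVICE [PORT]
check_jobmanager_service() {
    namespace=$1
    service=$2
    port=${3:-6123}   # 6123 is the RPC port that appears in the TaskManager log

    # List the services in the namespace; the JobManager service should be among them.
    kubectl get svc -n "$namespace"

    # Run a throwaway busybox pod and attempt a TCP connection to the service.
    # Assumes the busybox image's nc supports -z (connect only) and -w (timeout).
    kubectl run flink-net-check --rm -i --restart=Never --image=busybox \
        -n "$namespace" -- nc -zv -w 5 "$service.$namespace" "$port"
}

# Example with the names from the log lines in this thread:
# check_jobmanager_service team-anti-cheat detection-engine-dev
```

A successful connect only shows that the port is reachable from a pod in that namespace; it does not prove that the Akka handshake itself would succeed.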

Could you also share more information about the environment? I could not reproduce
your issue in a typical K8s cluster.

Best,
Yang


Re: Native kubernetes setup failed to start job

Chen Liangde
Please find the logs attached.

The Kubernetes cluster is an AWS EKS cluster, but it is managed by our infra team.
I created a service account "flink" for it, with permission to create, list, and delete pods along with some other types of resources in the "team-anti-cheat" namespace.
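
For anyone who wants to confirm such permissions from the command line, a minimal sketch using `kubectl auth can-i` with impersonation might look like this. `check_sa_permissions` is a hypothetical helper, it only covers the pod verbs named above, and it assumes the caller is allowed to impersonate the service account.

```shell
# Hypothetical helper: check_sa_permissions NAMESPACE SERVICE_ACCOUNT
# Asks the API server whether the service account may create, list, and delete pods.
check_sa_permissions() {
    for verb in create list delete; do
        printf '%s pods: ' "$verb"
        # Prints "yes" or "no" for each verb.
        kubectl auth can-i "$verb" pods \
            --as="system:serviceaccount:$1:$2" -n "$1"
    done
}

# Example with the names from this message:
# check_sa_permissions team-anti-cheat flink
```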

The following command was used to create the Flink cluster:
./bin/kubernetes-session.sh \
        -Dexecution.attached=true \
        -Dkubernetes.cluster-id=detection-engine-dev \
        -Dkubernetes.namespace=team-anti-cheat \
        -Dkubernetes.container-start-command-template="%java% %classpath% %jvmmem% %jvmopts% %logging% %class% %args%" \
        -Dkubernetes.jobmanager.service-account=flink

Thanks
Liangde Chen


Attachments:
kube-svc.txt (730 bytes)
jobmanager.log (66K)
taskmanager.log (42K)

Re: Native kubernetes setup failed to start job

Yang Wang
Hi Liangde Chen,

Thanks for providing the logs. After checking them, I am afraid that something is wrong with
your K8s cluster, since detection-engine-dev-taskmanager-1-2 started and registered with the
JobManager successfully.

I suggest finding which K8s node detection-engine-dev-taskmanager-1-1 is running on and disabling
scheduling on it. Then restart the Flink K8s session and try again.
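
A rough sketch of that suggestion, reading "disable scheduling" as `kubectl cordon` (an interpretation, not something stated in the thread). `node_of_pod` and `cordon_node_of_pod` are hypothetical helper names, and cordoning a node normally requires cluster-level permissions.

```shell
# Hypothetical helper: print the node a pod is scheduled on.
node_of_pod() {
    kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}'
}

# Hypothetical helper: mark that node unschedulable so new pods land elsewhere.
cordon_node_of_pod() {
    node=$(node_of_pod "$1" "$2") || return 1
    [ -n "$node" ] || { echo "no node found for pod $1" >&2; return 1; }
    kubectl cordon "$node"
}

# Example with the names from this thread:
# cordon_node_of_pod detection-engine-dev-taskmanager-1-1 team-anti-cheat
```

`kubectl uncordon <node>` reverses the change once the test is done.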

Best,
Yang


Re: Native kubernetes setup failed to start job

Yang Wang
Sorry, I overlooked the logs for detection-engine-dev-taskmanager-1-1.

Could you start a busybox pod to check the connectivity of the K8s service "detection-engine-dev"?
It seems that the TaskManager tries to connect and gets "Connection reset by peer" in response.
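
Alongside the connectivity test, it may help to see what actually sits behind the service name. The sketch below prints the service, its endpoints, and the pod IPs so they can be compared with the address in the TaskManager WARN lines (10.123.155.112:6123). `show_service_backends` is a hypothetical helper, and the comparison is only a diagnostic idea, not a step confirmed in this thread.

```shell
# Hypothetical helper: show_service_backends SERVICE NAMESPACE
show_service_backends() {
    # Service type, cluster IP (or None for a headless service), and ports.
    kubectl get svc "$1" -n "$2" -o wide
    # Pod IPs and ports the service currently routes to.
    kubectl get endpoints "$1" -n "$2"
    # Pod IPs and nodes, to identify which pod owns the logged address.
    kubectl get pods -n "$2" -o wide
}

# Example with the names from this thread:
# show_service_backends detection-engine-dev team-anti-cheat
```

If the logged address does not belong to the JobManager pod, the reset is coming from something other than Flink.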

Best,
Yang
