Hi Ivan Yang,
#1. If a TaskManager crashes while tasks are running on it, those tasks cannot
rejoin gracefully; they fail and must be restarted. Whether the whole job is
restarted, or only the affected part of it, depends on the configured failover
strategy[1].
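For reference, the failover strategy is a setting in flink-conf.yaml. A minimal sketch (the `region` value is available since Flink 1.9; the comment describes the documented behavior of each value):

```yaml
# flink-conf.yaml
# "full"   : any task failure restarts the entire job
#            (matches the behavior Ivan observed)
# "region" : only the pipelined region containing the
#            failed task is restarted
jobmanager.execution.failover-strategy: region
```

Note that how much `region` actually limits the restart depends on the job topology; a fully connected streaming job may still restart everything.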
#2. Currently, when new TaskManagers join the Flink cluster, a running Flink
job cannot rescale automatically. You need to stop the job with a savepoint and
restart it manually with a higher parallelism. The community is still working
on automatic rescaling; you can find more information in this ticket[2].
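The manual rescale described above can be sketched with the Flink CLI. The job ID, savepoint paths, jar name, and parallelism below are placeholders, not values from this thread:

```shell
# Stop the job and take a savepoint atomically
# ("flink stop" with -p is available since Flink 1.9)
bin/flink stop -p s3://my-bucket/savepoints <jobId>

# Resubmit from the savepoint with a higher parallelism
# so the job spreads onto the newly added TaskManagers
bin/flink run -s s3://my-bucket/savepoints/savepoint-xxxx -p 20 my-job.jar
```

The new parallelism must not exceed the job's configured max parallelism, or the restore from the savepoint will fail.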
Best,
Yang
Hi,
I have set up Flink 1.9.1 on Kubernetes on AWS EKS with one job manager pod and 10 task manager pods, one pod per EC2 instance. The job runs fine. After a while, for some reason, one task manager pod crashed and then restarted. After that, the job got into a bad state: all the parallel subtasks are showing different colors (orange, purple) on the console. I basically had to stop the entire job. My question is: should a task manager restart affect the entire cluster/job, or should it join back gracefully?
My second question is regarding auto scaling a Flink cluster on Kubernetes. If I add more nodes/pods (task manager containers) to the cluster, will a running Flink job redistribute load to the additional resources, or do I have to stop with a savepoint and restart the job?
Thanks and regards.
Ivan