(DEPRECATED) Apache Flink User Mailing List archive.

Re: Task-manager kubernetes pods take a long time to terminate

Posted by Andrey Zagrebin-5 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Task-manager-kubernetes-pods-take-a-long-time-to-terminate-tp32479p32632.html

Hi guys,

It looks suspicious that TM pod termination may be delayed by attempts to reconnect to a JM that has already been killed.

I created an issue to investigate this:
https://issues.apache.org/jira/browse/FLINK-15946
Let's continue the discussion there.

Best,

Andrey

On Wed, Feb 5, 2020 at 11:49 AM Yang Wang <[hidden email]> wrote:

You may need to check the kubelet logs to see why the pod gets stuck in the "Terminating" state
for so long. Even if it needs to clean up the ephemeral storage, that should not take this
long.
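
Before going through the kubelet logs, it may help to confirm which pods are actually in "Terminating" and for how long. Below is a rough Python sketch, not a definitive tool. It assumes kubectl is on the PATH and that the pods live in a namespace you pass in; the function names terminating_pods and fetch_pods are made up for illustration. It relies on a pod under deletion having metadata.deletionTimestamp set. As far as I understand, that field holds the expected deletion time including the grace period, so a positive value means the pod is overdue.

```python
import json
import subprocess
from datetime import datetime, timezone


def terminating_pods(pod_list, now=None):
    """Return (pod name, seconds past deletionTimestamp) for pods being deleted.

    pod_list is the parsed output of `kubectl get pods -o json`.
    A negative number means the pod is still inside its grace period.
    """
    now = now or datetime.now(timezone.utc)
    result = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        ts = meta.get("deletionTimestamp")
        if ts is None:
            continue  # pod is not being deleted
        # Kubernetes serializes timestamps as RFC 3339 UTC, e.g. "2020-02-05T10:42:00Z".
        deadline = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        result.append((meta.get("name"), (now - deadline).total_seconds()))
    return result


def fetch_pods(namespace):
    """Hypothetical helper: load the pod list for a namespace via kubectl."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Calling terminating_pods(fetch_pods("prod")) would list the pods to look at first; "prod" is only a placeholder namespace here. For each overdue pod, kubectl describe pod shows the node it runs on, which tells you whose kubelet logs to read.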

Best,
Yang

On Wed, Feb 5, 2020 at 10:42 AM Li Peng <[hidden email]> wrote:
My yml files follow most of the instructions here:

http://shzhangji.com/blog/2019/08/24/deploy-flink-job-cluster-on-kubernetes/

What command did you use to delete the deployments? I use: helm --tiller-namespace prod delete --purge my-deployment

I noticed that in environments without much data (like staging), this works flawlessly, but in production, with a high volume of data, it gets stuck in a loop. I suspect that the extra time needed to clean up the task managers under high traffic delays their shutdown until after the job manager has terminated, and a task manager then gets stuck in a loop once it detects that the job manager is dead.

Thanks,
Li