stuck job on failover

stuck job on failover

Nick Toker
Hi,
I have a standalone cluster with 3 nodes and a RocksDB backend.
When one task manager fails (the process is killed), it takes a very long time until the job is fully cancelled and resubmitted.
I see that all slots on all nodes are cancelled except the slots of the dead
task manager; it takes about 30-40 seconds for the job to shut down completely.
Is there something I can do to reduce this time, or is there a plan for a fix (and if so, when)?

regards,
nick

Re: stuck job on failover

Biao Liu
Hi Nick,

I guess the reason is that your Flink job manager doesn't detect that the task manager is lost until the heartbeat times out.
You could check the job manager log to verify that.

A more graceful way of shutting down the task manager may help, for example "taskmanager.sh stop" or a "kill" command without the -9 signal, so the process gets a chance to shut down cleanly instead of vanishing silently.
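
A minimal sketch of what I mean, assuming a standalone cluster started with the bundled scripts (the PID is a placeholder):

    # SIGTERM lets the TaskManager shut down cleanly and deregister,
    # instead of disappearing until the heartbeat times out.
    ./bin/taskmanager.sh stop
    # or send SIGTERM directly (kill's default signal); avoid kill -9 (SIGKILL)
    kill <taskmanager-pid>
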
Alternatively, you could reduce the heartbeat interval and timeout via the configuration options "heartbeat.interval" and "heartbeat.timeout".
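
Both options go in conf/flink-conf.yaml and take values in milliseconds. For reference, the defaults are heartbeat.interval: 10000 and heartbeat.timeout: 50000, which lines up with the 30-40 seconds you observed. The numbers below are only an illustration, not a tuning recommendation:

    # conf/flink-conf.yaml (values in milliseconds; illustrative only)
    heartbeat.interval: 5000     # how often heartbeats are sent (default 10000)
    heartbeat.timeout: 20000     # declare an unresponsive task manager lost after this (default 50000)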

Thanks,
Biao /'bɪ.aʊ/


Re: stuck job on failover

Biao Liu
Hi Nick,

Yes, reducing the heartbeat timeout is not a perfect solution; it just eases the pain a bit.

I'm wondering whether my guess was right. Was the delay caused by heartbeat detection, and did the graceful shutdown help?

Thanks,
Biao /'bɪ.aʊ/



On Tue, 26 Nov 2019 at 20:22, Nick Toker <[hidden email]> wrote:
Thanks,
it did the trick.


regards,
nick