keep-alive job strategy

Rob

Hello,
I have set up a cluster and added TaskManagers manually with bin/taskmanager.sh start.
I noticed that if I have 5 TaskManagers with one slot each and start a job with -p5, the job fails as soon as I stop one TaskManager, even though 4 TaskManagers remain.

Is this expected? (I turned off the restart strategy.)
So is the way to ensure continuous operation of a single "job" to have, e.g., 10 TMs and deploy 10 instances of the job to fill all 10 slots?
Or, if a job requires -p3, should I always keep at least 3 TMs alive?

Many thanks!
-Rob
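
For reference, the restart behaviour mentioned above ("I turned off restart policy") is typically controlled in conf/flink-conf.yaml. A sketch of the two common settings (key names as in Flink 1.x; not Rob's actual config):

```yaml
# No automatic restarts: any task failure fails the whole job
restart-strategy: none

# Alternative: retry a few times before giving up
# restart-strategy: fixed-delay
# restart-strategy.fixed-delay.attempts: 3
# restart-strategy.fixed-delay.delay: 10 s
```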


Re: keep-alive job strategy

Fabian Hueske-2
Hi Rob,

yes, this behavior is expected. Flink does not automatically scale down a job in case of a failure.
You have to ensure that enough resources are available to continue processing.
In Flink's cluster mode, the common practice is to keep stand-by TMs available (the same is true for JMs if you need an HA setup).

Best, Fabian
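
Fabian's stand-by advice comes down to simple slot arithmetic: a job with parallelism p needs p free slots, so with s slots per TaskManager you need ceil(p/s) TMs just to run it, plus one spare TM per failure you want to survive without a restart. A minimal sketch (the class and method names are made up for illustration):

```java
// Sketch: minimum TaskManagers for a given parallelism and failure budget.
// Assumes one job and evenly sized TMs; names are illustrative only.
public class SlotMath {
    static int minTaskManagers(int parallelism, int slotsPerTm, int toleratedFailures) {
        int required = (parallelism + slotsPerTm - 1) / slotsPerTm; // ceil(p / s)
        return required + toleratedFailures; // spare TMs act as stand-by capacity
    }

    public static void main(String[] args) {
        // Rob's case: -p5, one slot per TM, survive one TM loss -> 6 TMs
        System.out.println(minTaskManagers(5, 1, 1)); // prints 6
    }
}
```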


2017-10-06 13:56 GMT+02:00 r. r. <[hidden email]>: