Re: Not enough free slots to run the job
Posted by Ovidiu-Cristian MARCU
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Not-enough-free-slots-to-run-the-job-tp5630p5676.html
Hi Robert,
I am not sure I follow, so please confirm whether I have understood your suggestions correctly:
- Use fewer slots than the available slot capacity, to avoid issues such as a TaskManager not offering its slots because it had problems registering.
(This means I will lose some performance by not using all the available capacity.)
- If a job fails because it lost a TaskManager (and its slots), the job will not restart even if free slots are available.
(In this case the 'spare slots' will not help, right? Losing a TM means the job fails with no recovery.)
Thanks!
Best,
Ovidiu
Hi Ovidiu,
right now the scheduler in Flink will not use more slots than requested.
To avoid issues on recovery, we usually recommend that users keep some spare slots (for example, run a job with p=15 on a cluster with slots=20). I agree that it would make sense to add a flag that allows a job to grab more slots if they are available. The problem, however, is that jobs currently cannot change their parallelism. So if a job fails, it cannot scale down to restart on the remaining slots.
That's why the spare slots approach is currently the only way to go.
Regards,
Robert
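
As a rough illustration of the spare-slots advice above, the sketch below checks whether a job could still restart after losing TaskManagers. It is plain arithmetic, not Flink API: the function names and the figure of 4 slots per TaskManager are assumptions made for the example. It also assumes that every TaskManager offers the same number of slots and that the job needs exactly as many slots as its parallelism.

```python
# Back-of-the-envelope model of the "spare slots" recommendation.
# Assumptions (not from the thread): uniform slots per TaskManager,
# and a job that needs exactly `parallelism` slots to (re)start.

def remaining_slots(total_slots, slots_per_tm, lost_tms):
    """Slots left after `lost_tms` TaskManagers disappear (never below zero)."""
    return max(total_slots - lost_tms * slots_per_tm, 0)


def can_restart(parallelism, total_slots, slots_per_tm, lost_tms):
    """True if the surviving slots still cover the job's fixed parallelism.

    The job cannot scale down, so it needs the full `parallelism` again.
    """
    return remaining_slots(total_slots, slots_per_tm, lost_tms) >= parallelism


def tolerated_tm_failures(parallelism, total_slots, slots_per_tm):
    """How many TaskManagers can be lost while the job can still restart."""
    spare = total_slots - parallelism
    if spare < 0:
        raise ValueError("job does not fit on the cluster even without failures")
    return spare // slots_per_tm


if __name__ == "__main__":
    # p=15 on 20 slots, as in the thread; 4 slots per TaskManager is hypothetical.
    print(can_restart(parallelism=15, total_slots=20, slots_per_tm=4, lost_tms=1))
    # Using every slot (p=20) leaves no room to recover from a lost TaskManager.
    print(can_restart(parallelism=20, total_slots=20, slots_per_tm=4, lost_tms=1))
    print(tolerated_tm_failures(parallelism=15, total_slots=20, slots_per_tm=4))
```

Under these assumptions, running with p=15 on 20 slots leaves 5 spare slots, which covers the loss of one 4-slot TaskManager but not two, while running with p=20 leaves nothing to restart on.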