Hi,
Is there a way for a script to be called whenever a job gets restarted? My scenario: let's say there are 20 slots and the job runs on all 20. After a while a task manager goes down, leaving only 14 slots, and I need to readjust the parallelism of my job so that it keeps running until the lost TM comes back up. It would be great to know how others are handling this situation.

Thanks,
Navneeth
Hi Navneeth,
If I understand correctly, you have a job with parallelism p=20, a TM goes down (e.g. one with 4 slots), and until that TM comes back up you want to run the job with p=16, then run it with p=20 again once the TM is back.

If this is the case, one important thing to keep in mind is that when a TM fails, the whole job restarts, not only the tasks that were running on that TM. Given this, and assuming that the lost TM will not take long to come back up, I am not sure you save anything by starting a job with parallelism 20, restarting it with parallelism 16 (in your example) until the TM comes up, and then taking a savepoint, stopping it, and restarting it with parallelism 20 again.

If you still want to do it, one way is to use the REST API to get the necessary information about your cluster and the state of your job, and write a script that takes the necessary actions, e.g. resubmitting the job with a different parallelism.

I hope this helps,
Kostas

> On Mar 29, 2018, at 8:02 PM, Navneeth Krishnan <[hidden email]> wrote:
>
> Hi,
>
> Is there a way for a script to be called whenever a job gets restarted? My scenario is lets say there are 20 slots and the job runs on all 20 slots. After a while a task manager goes down and now there are only 14 slots and I need to readjust the parallelism of my job to ensure the job runs until the lost TM comes up again. It would be great to know how others are handling this situation.
>
> Thanks,
> Navneeth
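
For anyone who wants to try the REST-based script Kostas describes, here is a minimal sketch in Python. It is not an official recipe. It assumes the JobManager REST endpoint is reachable at `http://localhost:8081` and that `GET /overview` returns a JSON object with a `slots-total` field; please check both against your Flink version. The helper names (`fetch_total_slots`, `plan_action`) are made up for illustration.

```python
import json
import urllib.request

# Assumption: address of the JobManager REST endpoint; adjust for your cluster.
FLINK_REST = "http://localhost:8081"


def fetch_total_slots(base_url=FLINK_REST):
    """Read the cluster-wide slot count.

    Assumes GET /overview returns JSON containing "slots-total".
    """
    with urllib.request.urlopen(base_url + "/overview", timeout=10) as resp:
        return json.load(resp)["slots-total"]


def plan_action(current_parallelism, total_slots, desired_parallelism, minimum=1):
    """Decide what a watchdog script should do next.

    Returns a tuple (action, parallelism) where action is one of:
      "wait"    - fewer slots than `minimum`, nothing sensible to run
      "keep"    - the job already runs at the best achievable parallelism
      "rescale" - resubmit the job with the returned parallelism
    """
    if total_slots < minimum:
        return ("wait", None)
    # Never ask for more than the cluster has, never more than we want.
    target = min(desired_parallelism, total_slots)
    if target == current_parallelism:
        return ("keep", current_parallelism)
    return ("rescale", target)
```

A cron job or a small loop could call `plan_action(current, fetch_total_slots(), 20)` and, on `"rescale"`, resubmit the job with the returned parallelism. With the numbers from this thread, a job at p=20 on a cluster that dropped to 16 slots gives `("rescale", 16)`, and once the TM is back it gives `("rescale", 20)`. Keep in mind the point above: each rescale is a full job restart.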
Hi Navneeth,

I am sending the answer to the user mailing list so that we keep the discussion public; other users may also be interested in the question.

The answer is that you cannot restart from an externalized checkpoint with a different parallelism. To do that, you have to take a savepoint. You can find more on this in [1].

Thanks,
Kostas
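
As a rough illustration of the savepoint route mentioned above, the sketch below only builds the `flink` CLI command lines a script might run: cancel the job with a savepoint, then resume from that savepoint with a new parallelism. It assumes the `flink` binary is on the `PATH` and that your version supports `flink cancel -s <targetDirectory> <jobId>` and `flink run -s <savepointPath> -p <parallelism> <jar>`; newer releases may prefer a different stop command. The function names and the example paths are hypothetical.

```python
def cancel_with_savepoint_cmd(job_id, savepoint_dir):
    """Command line that cancels the job and writes a savepoint first."""
    return ["flink", "cancel", "-s", savepoint_dir, job_id]


def resume_cmd(savepoint_path, parallelism, jar_path):
    """Command line that resubmits the jar from a savepoint with a new parallelism."""
    return ["flink", "run", "-s", savepoint_path, "-p", str(parallelism), jar_path]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. The script has to pick up the savepoint path reported by the cancel step and hand it to `resume_cmd`.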