Hi Guys,
We are having a slight issue using Flink 1.1.3 (we also observed the problem with 1.0.2) on YARN 2.4.0. Whenever a TaskManager restarts, YARN seems to reserve memory during the restart and not free that memory again. We are using the CapacityScheduler with 2 queues, where the queue in which our Flink YARN session runs has a guaranteed capacity of 75%. What we are seeing is that the amount of reserved memory is exactly the amount of memory still available in the queue after the TaskManager has crashed.

On our test system, further TaskManager restarts were able to get rid of the reserved memory again. When trying to replicate this on our production system I was not successful; one difference being that I killed a TaskManager with no used slots in prod, while on the test system jobs were restarted.

Nothing enlightening in the logs, unfortunately.

Is this something that anyone has experienced so far?

Cheers,

Michael

--
Michael Pisula * [hidden email] * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
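One way to keep an eye on the stuck reservation while reproducing this is to poll the ResourceManager's REST API. The following is only a sketch of that idea, not something taken from the thread: the ResourceManager address is made up, and the field names are the ones documented for the Hadoop RM cluster-metrics endpoint (/ws/v1/cluster/metrics).

```python
# Minimal sketch (assumptions noted below) for watching YARN's reserved memory
# while restarting a TaskManager. It polls the ResourceManager's cluster
# metrics REST endpoint; the RM address is an assumption, and the field names
# (reservedMB, allocatedMB, availableMB) follow the Hadoop RM REST API docs.
import json
import time
from urllib.request import urlopen

RM_ADDRESS = "http://resourcemanager.example.com:8088"  # assumed RM web address


def cluster_metrics():
    """Fetch the clusterMetrics object from the ResourceManager REST API."""
    with urlopen(RM_ADDRESS + "/ws/v1/cluster/metrics") as resp:
        return json.load(resp)["clusterMetrics"]


if __name__ == "__main__":
    # Poll once a minute; if reservedMB stays pinned at the queue's remaining
    # capacity long after the TaskManager container is back, the reservation
    # is most likely stuck, which matches the behaviour described above.
    while True:
        m = cluster_metrics()
        print("reservedMB={reservedMB} allocatedMB={allocatedMB} "
              "availableMB={availableMB}".format(**m))
        time.sleep(60)
```

The per-queue picture (including the 75% guaranteed capacity mentioned above) should also be visible from the same web server under /ws/v1/cluster/scheduler, or directly in the ResourceManager web UI.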
Hi,
did you observe the problem only under YARN 2.4.0? IIRC this version of YARN has some problems that can also lead to issues with Flink's HA mode, and I would encourage you to upgrade YARN to 2.5 or higher. On a different note, there have been several improvements that we will release in Flink 1.1.4; I am not entirely sure whether this is a known problem that the upcoming bugfix release covers. I will add Till to the discussion, who has worked a lot in this direction.

Best,
Stefan

> On 07.12.2016 at 09:19, Michael Pisula <[hidden email]> wrote:
> [...]
Hi Stefan,
thanks for the fast feedback. Updating to a newer YARN version is most certainly something that would benefit us in many different areas (the issues with the HA mode being the most important of them); however, at the moment we are not able to update to a newer version. If this is another of those cases where our outdated YARN version is the cause of a problem, that would at least give us more arguments to prioritize the upgrade ;-)

Cheers,
Michael

On 07.12.2016 11:33, Stefan Richter wrote:
> [...]

--
Michael Pisula * [hidden email] * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082