YARN Reserved Memory



Michael Pisula
Hi Guys,

We are having a slight issue running Flink 1.1.3 (we also observed the
problem with 1.0.2) on YARN 2.4.0. Whenever a TaskManager restarts, YARN
seems to reserve memory during the restart and never free it again. We
are using the CapacityScheduler with 2 queues, where the queue in which
our Flink YARN session runs has a guaranteed capacity of 75%. What we
are seeing is that the amount of reserved memory is exactly the amount
of memory available in the queue after the TaskManager has crashed.
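
For anyone who wants to check this on their own cluster, here is a minimal sketch that compares two snapshots of the ResourceManager's cluster metrics (REST endpoint `/ws/v1/cluster/metrics`, fields `reservedMB` and `availableMB`), one taken before and one after a TaskManager restart. The helper names and the ResourceManager URL in the usage comment are hypothetical, and these metrics are cluster-wide rather than per queue, so treat it as a rough diagnostic only.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_url):
    """Fetch cluster-wide metrics from the ResourceManager REST API.

    rm_url is the base URL of the ResourceManager web UI (assumption:
    something like "http://resourcemanager:8088").
    """
    url = rm_url.rstrip("/") + "/ws/v1/cluster/metrics"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)["clusterMetrics"]


def reservation_delta(before, after):
    """Return how many MB the reserved memory grew between two snapshots."""
    return after["reservedMB"] - before["reservedMB"]


def reserved_matches_available(metrics):
    """True if a non-zero reservation equals the memory still available."""
    return metrics["reservedMB"] > 0 and metrics["reservedMB"] == metrics["availableMB"]


# Usage (needs a reachable ResourceManager, so it is left commented out):
# before = fetch_cluster_metrics("http://resourcemanager:8088")
# ... restart or kill a TaskManager, wait for YARN to react ...
# after = fetch_cluster_metrics("http://resourcemanager:8088")
# print(reservation_delta(before, after), reserved_matches_available(after))
```

If `reservation_delta` stays positive long after the replacement container is running, the reservation has probably not been released.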

On our test system, further TaskManager restarts were able to get rid
of the reservation again. When I tried to replicate this on our
production system, I was not successful; one difference is that I
killed a TaskManager with no used slots in prod, while on the test
system jobs were restarted.

Nothing enlightening in the logs, unfortunately.

Is this something that anyone has experienced so far?

Cheers,

Michael


--
Michael Pisula * [hidden email] * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082




Re: YARN Reserved Memory

Stefan Richter
Hi,

Did you observe the problem only under YARN 2.4.0? IIRC, this version of YARN has some problems that can also lead to issues with Flink’s HA mode, and I would encourage you to upgrade YARN to 2.5 or higher. On a different note, there are several improvements that we will release in Flink 1.1.4; I am not entirely sure whether this is a known problem covered by the upcoming bugfix release. I will add Till, who has worked a lot in this area, to the discussion.

Best,
Stefan

> On 07.12.2016 at 09:19, Michael Pisula <[hidden email]> wrote:
>
> [...]


Re: YARN Reserved Memory

Michael Pisula
Hi Stefan,

Thanks for the fast feedback. Updating to a newer YARN version would
certainly benefit us in many different areas (the issues with the HA
mode being the most important of them); however, at the moment we are
not able to upgrade. If this is another of those cases where our
outdated YARN version is the cause of a problem, that would at least
give us more arguments to prioritize the upgrade ;-)

Cheers,

Michael


On 07.12.2016 11:33, Stefan Richter wrote:

> [...]