(DEPRECATED) Apache Flink User Mailing List archive.

Clarification about Flink's managed memory and metric monitoring

Classic

List

Threaded

4 messages Options

sardaesp

Clarification about Flink's managed memory and metric monitoring

Hello,

I have a Flink TM configured with taskmanager.memory.managed.size: 1372m. There is a streaming job using RocksDB for checkpoints, so I assume some of this memory will indeed be used.

I was looking at the metrics exposed through the REST interface, and I queried some of them:

/taskmanagers/c3c960d79c1eb2341806bfa2b2d66328/metrics?get=Status.JVM.Memory.Heap.Committed,Status.JVM.Memory.NonHeap.Committed,Status.JVM.Memory.Direct.MemoryUsed | jq

[

{

"id": "Status.JVM.Memory.Heap.Committed",

"value": "1652031488"

{

"id": "Status.JVM.Memory.NonHeap.Committed",

"value": "234291200" 223 MiB

{

"id": "Status.JVM.Memory.Direct.MemoryUsed",

"value": "375015427" 358 MiB

{

"id": "Status.JVM.Memory.Direct.TotalCapacity",

"value": "375063552" 358 MiB

}

]

I presume direct memory is being used by Flink and its networking stack, as well as by the JVM itself. To be sure:

Increasing "taskmanager.memory.framework.off-heap.size" or "taskmanager.memory.task.off-heap.size" should increase Status.JVM.Memory.Direct.TotalCapacity, right?
I presume the native memory used by RocksDB cannot be tracked with these JVM metrics even if "state.backend.rocksdb.memory.managed" is true, right?

Based on this question: https://stackoverflow.com/questions/30622818/off-heap-native-heap-direct-memory-and-native-memory, I imagine Flink/RocksDB either allocates memory completely independently of the JVM, or it uses unsafe. Since the documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_setup_tm.html#managed-memory) states that "Managed memory is managed by Flink and is allocated as native memory (off-heap)", I thought this native memory might show up as part of direct memory tracking, but I guess it doesn’t.

Regards,

Alexis.

Xintong Song

Re: Clarification about Flink's managed memory and metric monitoring

Hi Alexis,

First of all, I strongly recommend not to look into the JVM metrics. These metrics are fetched directly from JVM and do not well correspond to Flink's memory configurations. They were introduced a long time ago and are preserved mostly for compatibility. IMO, they bring more confusion than convenience. In Flink-1.12, there is a newly designed TM metrics page in the web ui, which clearly shows how the metrics correspond to Flink's memory configurations (if any).

Concerning your questions.

1. Yes, increasing framework/task off-heap memory sizes should increase the direct memory capacity. Increasing the network memory size should also do that.

2. When 'state.backend.rocksdb.memory.managed' is true, RocksDB uses managed memory. Managed memory is not measured by any JVM metrics. It's not managed by JVM, meaning that it's not limited by '-XX:MaxDirectMemorySize' and is not controlled by the garbage collectors.

Thank you~

Xintong Song

On Tue, Apr 13, 2021 at 7:53 PM Alexis Sarda-Espinosa <[hidden email]> wrote:

Hello,

I have a Flink TM configured with taskmanager.memory.managed.size: 1372m. There is a streaming job using RocksDB for checkpoints, so I assume some of this memory will indeed be used.

I was looking at the metrics exposed through the REST interface, and I queried some of them:

/taskmanagers/c3c960d79c1eb2341806bfa2b2d66328/metrics?get=Status.JVM.Memory.Heap.Committed,Status.JVM.Memory.NonHeap.Committed,Status.JVM.Memory.Direct.MemoryUsed | jq

[

{

    "id": "Status.JVM.Memory.Heap.Committed",

    "value": "1652031488"

},

{

    "id": "Status.JVM.Memory.NonHeap.Committed",

    "value": "234291200"                                                             223 MiB

},

{

    "id": "Status.JVM.Memory.Direct.MemoryUsed",

    "value": "375015427"                                                            358 MiB

},

{

    "id": "Status.JVM.Memory.Direct.TotalCapacity",

    "value": "375063552"                                                            358 MiB

}

]

I presume direct memory is being used by Flink and its networking stack, as well as by the JVM itself. To be sure:

Increasing "taskmanager.memory.framework.off-heap.size" or "taskmanager.memory.task.off-heap.size" should increase Status.JVM.Memory.Direct.TotalCapacity, right?
I presume the native memory used by RocksDB cannot be tracked with these JVM metrics even if "state.backend.rocksdb.memory.managed" is true, right?

Based on this question: https://stackoverflow.com/questions/30622818/off-heap-native-heap-direct-memory-and-native-memory, I imagine Flink/RocksDB either allocates memory completely independently of the JVM, or it uses unsafe. Since the documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_setup_tm.html#managed-memory) states that "Managed memory is managed by Flink and is allocated as native memory (off-heap)", I thought this native memory might show up as part of direct memory tracking, but I guess it doesn’t.

Regards,

Alexis.

sardaesp

RE: Clarification about Flink's managed memory and metric monitoring

Hi Xintong,

Thanks for the info. Is there any way to access these metrics outside of the UI? I suppose Flink’s reporters might provide them, but will they also be available through the REST interface (or another interface)?

Regards,

Alexis.

From: Xintong Song <[hidden email]>
Sent: Tuesday, 13 April 2021 14:30
To: Alexis Sarda-Espinosa <[hidden email]>
Cc: [hidden email]
Subject: Re: Clarification about Flink's managed memory and metric monitoring

Hi Alexis,

Concerning your questions.

1. Yes, increasing framework/task off-heap memory sizes should increase the direct memory capacity. Increasing the network memory size should also do that.

Thank you~

Xintong Song

On Tue, Apr 13, 2021 at 7:53 PM Alexis Sarda-Espinosa <[hidden email]> wrote:

Hello,

I have a Flink TM configured with taskmanager.memory.managed.size: 1372m. There is a streaming job using RocksDB for checkpoints, so I assume some of this memory will indeed be used.

I was looking at the metrics exposed through the REST interface, and I queried some of them:

/taskmanagers/c3c960d79c1eb2341806bfa2b2d66328/metrics?get=Status.JVM.Memory.Heap.Committed,Status.JVM.Memory.NonHeap.Committed,Status.JVM.Memory.Direct.MemoryUsed | jq

[

{

    "id": "Status.JVM.Memory.Heap.Committed",

    "value": "1652031488"

},

{

    "id": "Status.JVM.Memory.NonHeap.Committed",

    "value": "234291200"                                                             223 MiB

},

{

    "id": "Status.JVM.Memory.Direct.MemoryUsed",

    "value": "375015427"                                                            358 MiB

},

{

    "id": "Status.JVM.Memory.Direct.TotalCapacity",

    "value": "375063552"                                                            358 MiB

}

]

I presume direct memory is being used by Flink and its networking stack, as well as by the JVM itself. To be sure:

Increasing "taskmanager.memory.framework.off-heap.size" or "taskmanager.memory.task.off-heap.size" should increase Status.JVM.Memory.Direct.TotalCapacity, right?
I presume the native memory used by RocksDB cannot be tracked with these JVM metrics even if "state.backend.rocksdb.memory.managed" is true, right?

Based on this question: https://stackoverflow.com/questions/30622818/off-heap-native-heap-direct-memory-and-native-memory, I imagine Flink/RocksDB either allocates memory completely independently of the JVM, or it uses unsafe. Since the documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_setup_tm.html#managed-memory) states that "Managed memory is managed by Flink and is allocated as native memory (off-heap)", I thought this native memory might show up as part of direct memory tracking, but I guess it doesn’t.

Regards,

Alexis.

Xintong Song

Re: Clarification about Flink's managed memory and metric monitoring

These metrics should also be available via REST.

You can check the original design doc [1] for which metrics the UI is using.

Thank you~

Xintong Song

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager

On Tue, Apr 13, 2021 at 9:08 PM Alexis Sarda-Espinosa <[hidden email]> wrote:

Hi Xintong,

Thanks for the info. Is there any way to access these metrics outside of the UI? I suppose Flink’s reporters might provide them, but will they also be available through the REST interface (or another interface)?

Regards,

Alexis.

From: Xintong Song <[hidden email]>
Sent: Tuesday, 13 April 2021 14:30
To: Alexis Sarda-Espinosa <[hidden email]>
Cc: [hidden email]
Subject: Re: Clarification about Flink's managed memory and metric monitoring

Hi Alexis,

First of all, I strongly recommend not to look into the JVM metrics. These metrics are fetched directly from JVM and do not well correspond to Flink's memory configurations. They were introduced a long time ago and are preserved mostly for compatibility. IMO, they bring more confusion than convenience. In Flink-1.12, there is a newly designed TM metrics page in the web ui, which clearly shows how the metrics correspond to Flink's memory configurations (if any).

Concerning your questions.

1. Yes, increasing framework/task off-heap memory sizes should increase the direct memory capacity. Increasing the network memory size should also do that.

2. When 'state.backend.rocksdb.memory.managed' is true, RocksDB uses managed memory. Managed memory is not measured by any JVM metrics. It's not managed by JVM, meaning that it's not limited by '-XX:MaxDirectMemorySize' and is not controlled by the garbage collectors.

Thank you~

Xintong Song

On Tue, Apr 13, 2021 at 7:53 PM Alexis Sarda-Espinosa <[hidden email]> wrote:

Hello,

I have a Flink TM configured with taskmanager.memory.managed.size: 1372m. There is a streaming job using RocksDB for checkpoints, so I assume some of this memory will indeed be used.

I was looking at the metrics exposed through the REST interface, and I queried some of them:

/taskmanagers/c3c960d79c1eb2341806bfa2b2d66328/metrics?get=Status.JVM.Memory.Heap.Committed,Status.JVM.Memory.NonHeap.Committed,Status.JVM.Memory.Direct.MemoryUsed | jq

[

{

    "id": "Status.JVM.Memory.Heap.Committed",

    "value": "1652031488"

},

{

    "id": "Status.JVM.Memory.NonHeap.Committed",

    "value": "234291200"                                                             223 MiB

},

{

    "id": "Status.JVM.Memory.Direct.MemoryUsed",

    "value": "375015427"                                                            358 MiB

},

{

    "id": "Status.JVM.Memory.Direct.TotalCapacity",

    "value": "375063552"                                                            358 MiB

}

]

I presume direct memory is being used by Flink and its networking stack, as well as by the JVM itself. To be sure:

Increasing "taskmanager.memory.framework.off-heap.size" or "taskmanager.memory.task.off-heap.size" should increase Status.JVM.Memory.Direct.TotalCapacity, right?
I presume the native memory used by RocksDB cannot be tracked with these JVM metrics even if "state.backend.rocksdb.memory.managed" is true, right?

Based on this question: https://stackoverflow.com/questions/30622818/off-heap-native-heap-direct-memory-and-native-memory, I imagine Flink/RocksDB either allocates memory completely independently of the JVM, or it uses unsafe. Since the documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_setup_tm.html#managed-memory) states that "Managed memory is managed by Flink and is allocated as native memory (off-heap)", I thought this native memory might show up as part of direct memory tracking, but I guess it doesn’t.

Regards,

Alexis.