Flink 1.10 container memory configuration with Mesos.

Alexander Kasyanenko
Hi folks,

I have a question about the new memory configuration introduced in Flink 1.10. Has anyone encountered a similar problem?
I'm trying to use the taskmanager.memory.process.size configuration key with a Mesos session cluster, but I get an error like this:
2020-03-11 11:44:09,771 [main] ERROR org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - Error while starting the TaskManager
org.apache.flink.configuration.IllegalConfigurationException: Failed to create TaskExecutorResourceSpec
	at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:72)
	at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:356)
	at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.<init>(TaskManagerRunner.java:152)
	at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:308)
	at org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.lambda$main$0(MesosTaskExecutorRunner.java:106)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.main(MesosTaskExecutorRunner.java:105)
Caused by: org.apache.flink.configuration.IllegalConfigurationException: The required configuration option Key: 'taskmanager.memory.task.heap.size' , default: null (fallback keys: []) is not set
	at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkConfigOptionIsSet(TaskExecutorResourceUtils.java:90)
	at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.lambda$checkTaskExecutorResourceConfigSet$0(TaskExecutorResourceUtils.java:84)
	at java.base/java.util.Arrays$ArrayList.forEach(Arrays.java:4390)
	at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkTaskExecutorResourceConfigSet(TaskExecutorResourceUtils.java:84)
	at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:70)
	... 9 more
But when the task manager is launched, it correctly parses the process memory key:
2020-03-11 11:43:55,376 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - --------------------------------------------------------------------------------
2020-03-11 11:43:55,377 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Starting MesosTaskExecutorRunner (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 19:18:19 CET)
2020-03-11 11:43:55,377 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  OS current user: root
2020-03-11 11:43:57,347 [main] WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JVM: OpenJDK 64-Bit Server VM - AdoptOpenJDK - 11/11.0.2+9
2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Maximum heap size: 746 MiBytes
2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JAVA_HOME: (not set)
2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Hadoop version: 2.6.5
2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JVM Options:
2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Xmx781818251
2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Xms781818251
2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -XX:MaxDirectMemorySize=317424929
2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -XX:MaxMetaspaceSize=100663296
2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlog.file=/var/log/flink-session-cluster/taskmanager.log
2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlog4j.configuration=file:/opt/flink/conf/log4j.properties
2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlogback.configurationFile=file:/opt/flink/conf/logback.xml
2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Program Arguments: (none)
2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Classpath: /opt/flink/lib/apache-log4j-extras-1.2.17.jar:/opt/flink/lib/flink-metrics-graphite-1.10.0.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.6.5-8.0.jar:/opt/flink/lib/flink-table-blink_2.12-1.10.0.jar:/opt/flink/lib/flink-table_2.12-1.10.0.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.12-1.10.0.jar:
2020-03-11 11:43:57,541 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - --------------------------------------------------------------------------------
2020-03-11 11:43:57,542 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - Registered UNIX signal handlers for [TERM, HUP, INT]
2020-03-11 11:43:57,550 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.process.size, 2g
2020-03-11 11:43:57,550 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.cpu.cores, 2
2020-03-11 11:43:57,551 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2020-03-11 11:43:57,551 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: parallelism.default, 1
...
Judging by the docs, specifying the taskmanager.memory.process.size key should be enough to launch the job, but it seems this value is ignored.
I would appreciate any suggestions.

Regards and thanks in advance,
Alex.

Re: Flink 1.10 container memory configuration with Mesos.

Yangze Guo
Hi Alexander,

I could not reproduce it in my local environment. Normally, the Mesos RM
calculates all the memory configuration and adds it to the launch command.
Unfortunately, all the log output I could find for this command is at the
DEBUG level. Would you mind changing the log level to DEBUG, or sharing
anything about the TaskManager launch command that you can find in the
current log?


Best,
Yangze Guo

Re: Flink 1.10 container memory configuration with Mesos.

Xintong Song
Hi Alex,

Could you check and post your TM launch command? I suspect there are some unrecognized arguments that prevent the rest of the arguments from being parsed.

The TM memory configuration process works as follows:
  1. The resource manager parses the configuration, checks which options are configured and which are not, and calculates the size of each memory component. (This is where 'taskmanager.memory.process.size' is used.)
  2. After deriving the memory component sizes, the resource manager generates the launch command for the task managers, with dynamic configurations "-D <key=value>" that overwrite the memory component sizes. Therefore, even if you have not configured 'taskmanager.memory.task.heap.size', this config option is expected to be available by the time the TM is launched.
  3. When a task manager starts, it does not do the calculations again; it directly reads the memory component sizes calculated by the resource manager from the dynamic configurations. That means it does not read 'taskmanager.memory.process.size' and derive the memory component sizes from it again.

One thing that might have caused your problem: when MesosTaskExecutorRunner parses the command line arguments (that's where the dynamic configurations are passed in), it stops parsing the rest of the arguments if it meets an unrecognized token. That could be why 'taskmanager.memory.task.heap.size' is missing. Take a look at the launch command and see whether there is anything unexpected before the memory dynamic configurations.
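
To make the effect of a stray token concrete, here is a toy model in shell. It is not Flink's parser; it only assumes the behavior described above (options arrive as "-D key=value" pairs and parsing stops at the first token that does not fit). The function name list_leading_dynamic_keys and the sample values are hypothetical.

```shell
# Toy model (not Flink's actual parser): print the keys of the leading
# "-D key=value" pairs and stop at the first token that does not fit.
list_leading_dynamic_keys() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -D)
        # A bare -D must be followed by a key=value token.
        [ "$#" -ge 2 ] || return 0
        case "$2" in
          *=*) printf '%s\n' "${2%%=*}" ;;
          *) return 0 ;;
        esac
        shift 2
        ;;
      *)
        # Unrecognized token: everything after it is ignored.
        return 0
        ;;
    esac
  done
}

# Both keys are seen when nothing unexpected precedes them.
list_leading_dynamic_keys \
  -D taskmanager.memory.task.heap.size=100m \
  -D taskmanager.memory.managed.size=200m

# Only the first key is seen; the stray token hides the second one.
list_leading_dynamic_keys \
  -D taskmanager.memory.task.heap.size=100m \
  stray-token \
  -D taskmanager.memory.managed.size=200m
```

Under this model, an option that is present on the command line but sits after an unexpected token looks exactly like an option that was never set, which matches the "is not set" error in the stack trace.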

Thank you~

Xintong Song



Re: Flink 1.10 container memory configuration with Mesos.

Alexander Kasyanenko
Hi Yangze, Xintong,

Thank you for the quick response.

And big thanks for the hint about the TM launch command. It was indeed the problem. I added my own custom mesos-taskmanager.sh to echo the launch command (I had switched logging to the DEBUG level, but it didn't display anything useful). May I suggest adding something like this in future releases?

As for my particular case, the issue was in this mesos-appmaster.sh option:
-Dmesos.resourcemanager.tasks.taskmanager-cmd="/opt/job/custom_launch_tm.sh"
My custom launch script was slicing the argument array incorrectly.
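
For illustration only, a minimal sketch of wrapper logic that echoes the launch command and forwards every argument intact. This is not the actual custom_launch_tm.sh; the function name run_forwarding_all is hypothetical, and printf stands in for the real TaskManager launcher so the example can run anywhere.

```shell
# Hypothetical wrapper logic for a custom TaskManager launch script.
# It logs the full command line to stderr, then runs the target with
# every argument passed through unchanged.
run_forwarding_all() {
  rfa_target="$1"
  shift
  # Log one argument per line so stray or missing tokens are easy to spot.
  printf 'launch arg: %s\n' "$rfa_target" "$@" >&2
  # Quoted "$@" preserves the number and contents of the arguments;
  # unquoted $* or a slice taken at the wrong offset would not.
  "$rfa_target" "$@"
}

# Stand-in target: printf echoes each forwarded argument on its own line.
run_forwarding_all printf '%s\n' -D 'taskmanager.memory.task.heap.size=100m'
```

In a real wrapper script the last step would typically be exec "$target" "$@" rather than a plain call, so that the launcher replaces the wrapper process.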

Thanks for the help and regards,
Alex.

чт, 12 мар. 2020 г. в 15:46, Xintong Song <[hidden email]>:
Hi Alex,

Could you try to check and post your TM launch command? I suspect that there might be some unrecognized arguments that prevent the rest of arguments being parsed.

The TM memory configuration process works as follow:
  1. The resource manager will parse the configurations, checking which options are configured and which are not, and calculate the size of each memory component. (This is where ‘taskmanager.memory.process.size’ is used.)
  2. After deriving the memory component sizes, the resource manager will generate launch command for the task managers, with dynamic configurations "-D <key=value>" overwriting the memory component sizes. Therefore, even you have not configured 'taskmanager.memory.task.heap.size', it is expected that before when the TM is launched this config option should be available.
  3. When a task manager is started, it will not do the calculations again, and will directly read the memory component sizes calculated by resource manager from the dynamic configurations. That means it is not reading ‘taskmanager.memory.process.size’ and deriving memory component sizes from it again.
One thing that might have caused your problem is that, when MesosTaskExecutorRunner parses the command line arguments (that's where the dynamic configurations are passed in), if it meets an unrecognized token it will stop parsing the rest of the arguments. That could be the reason that 'taskmanager.memory.task.heap.size' is missing. You can take a look at the launching command, see if there's anything unexpected before the memory dynamic configurations.

Thank you~

Xintong Song



On Thu, Mar 12, 2020 at 2:26 PM Yangze Guo <[hidden email]> wrote:
Hi, Alexander

I could not reproduce it in my local environment. Normally, Mesos RM
will calculate all the mem config and add it to the launch command.
Unfortunately, all the log I could found for this command is at the
DEBUG level. Would you mind changing the log level to DEBUG or sharing
anything about the taskmanager launch command you could found in the
current log?


Best,
Yangze Guo

On Thu, Mar 12, 2020 at 1:38 PM Alexander Kasyanenko
<[hidden email]> wrote:
>
> Hi folks,
>
> I have a question related configuration for new memory introduced in flink 1.10. Has anyone encountered similar problem?
> I'm trying to make use of taskmanager.memory.process.size configuration key in combination with mesos session cluster, but I get an error like this:
>
> 2020-03-11 11:44:09,771 [main] ERROR org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - Error while starting the TaskManager
> org.apache.flink.configuration.IllegalConfigurationException: Failed to create TaskExecutorResourceSpec
> at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:72)
> at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:356)
> at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.<init>(TaskManagerRunner.java:152)
> at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:308)
> at org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.lambda$main$0(MesosTaskExecutorRunner.java:106)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
> at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.main(MesosTaskExecutorRunner.java:105)
> Caused by: org.apache.flink.configuration.IllegalConfigurationException: The required configuration option Key: 'taskmanager.memory.task.heap.size' , default: null (fallback keys: []) is not set
> at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkConfigOptionIsSet(TaskExecutorResourceUtils.java:90)
> at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.lambda$checkTaskExecutorResourceConfigSet$0(TaskExecutorResourceUtils.java:84)
> at java.base/java.util.Arrays$ArrayList.forEach(Arrays.java:4390)
> at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkTaskExecutorResourceConfigSet(TaskExecutorResourceUtils.java:84)
> at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:70)
> ... 9 more
>
> But when task manager is launched, it correctly parses process memory key:
>
> 2020-03-11 11:43:55,376 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - --------------------------------------------------------------------------------
> 2020-03-11 11:43:55,377 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Starting MesosTaskExecutorRunner (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 19:18:19 CET)
> 2020-03-11 11:43:55,377 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  OS current user: root
> 2020-03-11 11:43:57,347 [main] WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JVM: OpenJDK 64-Bit Server VM - AdoptOpenJDK - 11/11.0.2+9
> 2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Maximum heap size: 746 MiBytes
> 2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JAVA_HOME: (not set)
> 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Hadoop version: 2.6.5
> 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JVM Options:
> 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Xmx781818251
> 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Xms781818251
> 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -XX:MaxDirectMemorySize=317424929
> 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -XX:MaxMetaspaceSize=100663296
> 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlog.file=/var/log/flink-session-cluster/taskmanager.log
> 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlog4j.configuration=file:/opt/flink/conf/log4j.properties
> 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlogback.configurationFile=file:/opt/flink/conf/logback.xml
> 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Program Arguments: (none)
> 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Classpath: /opt/flink/lib/apache-log4j-extras-1.2.17.jar:/opt/flink/lib/flink-metrics-graphite-1.10.0.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.6.5-8.0.jar:/opt/flink/lib/flink-table-blink_2.12-1.10.0.jar:/opt/flink/lib/flink-table_2.12-1.10.0.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.12-1.10.0.jar:
> 2020-03-11 11:43:57,541 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - --------------------------------------------------------------------------------
> 2020-03-11 11:43:57,542 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - Registered UNIX signal handlers for [TERM, HUP, INT]
> 2020-03-11 11:43:57,550 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.process.size, 2g
> 2020-03-11 11:43:57,550 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.cpu.cores, 2
> 2020-03-11 11:43:57,551 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
> 2020-03-11 11:43:57,551 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: parallelism.default, 1
> ...
>
> Judging by the docs, specifying the taskmanager.memory.process.size key should be enough to launch the job, but it seems this value is ignored.
> I would appreciate any suggestion.
>
> Regards and thanks in advance,
> Alex.

Re: Flink 1.10 container memory configuration with Mesos.

Yangze Guo
Glad to hear that your issue is fixed.
I'm not sure what you are suggesting we add. Could you be more specific
or create a Jira ticket?

Best,
Yangze Guo


On Thu, Mar 12, 2020 at 3:51 PM Alexander Kasyanenko
<[hidden email]> wrote:

>
> Hi Yangze, Xintong,
>
> Thank you for the instant response.
>
> And big thanks for the hint about the TM launch command. That was indeed the problem. I added my own custom mesos-taskmanager.sh to echo the launch command (I had switched logging to DEBUG level, but it didn't display anything useful). May I suggest adding something like this in future releases?
>
> As for my particular case, the issue was in this mesos-appmaster.sh option:
>
> -Dmesos.resourcemanager.tasks.taskmanager-cmd="/opt/job/custom_launch_tm.sh"
>
> My custom launch script was slicing the argument array incorrectly.
>
> Thanks for the help and regards,
> Alex.
>
> On Thu, Mar 12, 2020 at 3:46 PM, Xintong Song <[hidden email]> wrote:
>>
>> Hi Alex,
>>
>> Could you check and post your TM launch command? I suspect there are some unrecognized arguments that prevent the rest of the arguments from being parsed.
>>
>> The TM memory configuration process works as follows:
>>
>> 1. The resource manager parses the configuration, checks which options are set and which are not, and calculates the size of each memory component. (This is where 'taskmanager.memory.process.size' is used.)
>> 2. After deriving the memory component sizes, the resource manager generates the launch command for the task managers, with dynamic configurations "-D <key=value>" overwriting the memory component sizes. Therefore, even if you have not configured 'taskmanager.memory.task.heap.size', this option is expected to be available by the time the TM is launched.
>> 3. When a task manager starts, it does not repeat the calculation; it directly reads the memory component sizes calculated by the resource manager from the dynamic configurations. That means it does not read 'taskmanager.memory.process.size' and derive the memory component sizes from it again.
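The first step described above (the resource manager calculating the component sizes from 'taskmanager.memory.process.size') can be sketched roughly as follows. This is only an illustration, not Flink's code: the default values used here (96 MiB metaspace, JVM overhead fraction 0.1 clamped to 192 MiB..1 GiB, network fraction 0.1 clamped to 64 MiB..1 GiB, managed fraction 0.4, 128 MiB framework heap and off-heap, no task off-heap) are assumptions about Flink 1.10.0 defaults and are not stated anywhere in this thread. All function and dictionary names are hypothetical.

```python
MIB = 1024 ** 2


def clamp(value, low, high):
    """Restrict value to the inclusive range [low, high]."""
    return max(low, min(value, high))


def derive_components(process_size_bytes):
    """Split the total process size into memory components.

    All constants below are ASSUMED Flink 1.10.0 defaults, not values
    taken from the thread.
    """
    metaspace = 96 * MIB
    jvm_overhead = clamp(0.1 * process_size_bytes, 192 * MIB, 1024 * MIB)
    total_flink = process_size_bytes - metaspace - jvm_overhead

    framework_heap = 128 * MIB
    framework_off_heap = 128 * MIB
    task_off_heap = 0
    network = clamp(0.1 * total_flink, 64 * MIB, 1024 * MIB)
    managed = 0.4 * total_flink

    # Task heap is whatever remains of the Flink memory.
    task_heap = (total_flink - framework_heap - framework_off_heap
                 - task_off_heap - network - managed)
    return {
        "metaspace": metaspace,
        "framework_heap": framework_heap,
        "framework_off_heap": framework_off_heap,
        "task_off_heap": task_off_heap,
        "network": network,
        "managed": managed,
        "task_heap": task_heap,
    }


def jvm_flags(components):
    """Return (-Xmx, -XX:MaxDirectMemorySize, -XX:MaxMetaspaceSize) in bytes."""
    xmx = components["framework_heap"] + components["task_heap"]
    direct = (components["framework_off_heap"] + components["task_off_heap"]
              + components["network"])
    return round(xmx), round(direct), round(components["metaspace"])


xmx, direct, metaspace = jvm_flags(derive_components(2 * 1024 ** 3))  # 2g
```

Under these assumptions, a 2g process size gives values very close to the -Xmx781818251 and -XX:MaxDirectMemorySize=317424929 flags in the TaskManager log quoted in this thread (the small differences would come from rounding), and the metaspace value matches -XX:MaxMetaspaceSize=100663296. That is consistent with the resource manager having done the calculation, while the task heap size itself never reached the TM as a dynamic configuration.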
>>
>> One thing that might have caused your problem: when MesosTaskExecutorRunner parses the command-line arguments (that's where the dynamic configurations are passed in), it stops parsing at the first unrecognized token and ignores the rest of the arguments. That could be why 'taskmanager.memory.task.heap.size' is missing. Take a look at the launch command and see if there's anything unexpected before the dynamic memory configurations.
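The stop-at-first-unrecognized-token behaviour described above can be modelled with a toy parser. This is a simplified sketch, not the parser Flink actually uses; the function name, the stray token, and the "617m" value are hypothetical and chosen only for illustration.

```python
REQUIRED_KEY = "taskmanager.memory.task.heap.size"


def parse_dynamic_properties(args):
    """Collect '-D key=value' (or '-Dkey=value') pairs from args.

    Parsing stops at the first token that is not recognized, so every
    dynamic property that follows such a token is silently dropped.
    """
    props = {}
    i = 0
    while i < len(args):
        token = args[i]
        if token == "-D" and i + 1 < len(args) and "=" in args[i + 1]:
            key, value = args[i + 1].split("=", 1)
            props[key] = value
            i += 2
        elif token.startswith("-D") and "=" in token:
            key, value = token[2:].split("=", 1)
            props[key] = value
            i += 1
        else:
            break  # unrecognized token: the remaining arguments are ignored
    return props


# Launch arguments as the resource manager might generate them (illustrative).
good_args = ["-D", REQUIRED_KEY + "=617m"]
# The same arguments after a wrapper script leaves a stray token in front.
bad_args = ["stray-token"] + good_args

good = parse_dynamic_properties(good_args)
bad = parse_dynamic_properties(bad_args)
```

Here `good` contains the task heap key while `bad` is empty, which mirrors the reported symptom: the option was generated by the resource manager but never seen by the task manager.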
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Thu, Mar 12, 2020 at 2:26 PM Yangze Guo <[hidden email]> wrote:
>>>
>>> Hi, Alexander
>>>
>>> I could not reproduce it in my local environment. Normally, the Mesos RM
>>> calculates all the memory configs and adds them to the launch command.
>>> Unfortunately, all the logs I could find for this command are at the
>>> DEBUG level. Would you mind changing the log level to DEBUG or sharing
>>> anything about the taskmanager launch command you can find in the
>>> current log?
>>>
>>>
>>> Best,
>>> Yangze Guo
>>>
>>> On Thu, Mar 12, 2020 at 1:38 PM Alexander Kasyanenko
>>> <[hidden email]> wrote:
>>> >
>>> > Hi folks,
>>> >
>>> > I have a question about the new memory configuration introduced in Flink 1.10. Has anyone encountered a similar problem?
>>> > I'm trying to use the taskmanager.memory.process.size configuration key with a Mesos session cluster, but I get an error like this:
>>> >
>>> > 2020-03-11 11:44:09,771 [main] ERROR org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - Error while starting the TaskManager
>>> > org.apache.flink.configuration.IllegalConfigurationException: Failed to create TaskExecutorResourceSpec
>>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:72)
>>> > at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:356)
>>> > at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.<init>(TaskManagerRunner.java:152)
>>> > at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:308)
>>> > at org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.lambda$main$0(MesosTaskExecutorRunner.java:106)
>>> > at java.base/java.security.AccessController.doPrivileged(Native Method)
>>> > at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>>> > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
>>> > at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>> > at org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.main(MesosTaskExecutorRunner.java:105)
>>> > Caused by: org.apache.flink.configuration.IllegalConfigurationException: The required configuration option Key: 'taskmanager.memory.task.heap.size' , default: null (fallback keys: []) is not set
>>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkConfigOptionIsSet(TaskExecutorResourceUtils.java:90)
>>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.lambda$checkTaskExecutorResourceConfigSet$0(TaskExecutorResourceUtils.java:84)
>>> > at java.base/java.util.Arrays$ArrayList.forEach(Arrays.java:4390)
>>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkTaskExecutorResourceConfigSet(TaskExecutorResourceUtils.java:84)
>>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:70)
>>> > ... 9 more
>>> >
>>> > But when the task manager is launched, it correctly parses the process memory key:
>>> >
>>> > 2020-03-11 11:43:55,376 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - --------------------------------------------------------------------------------
>>> > 2020-03-11 11:43:55,377 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Starting MesosTaskExecutorRunner (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 19:18:19 CET)
>>> > 2020-03-11 11:43:55,377 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  OS current user: root
>>> > 2020-03-11 11:43:57,347 [main] WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> > 2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JVM: OpenJDK 64-Bit Server VM - AdoptOpenJDK - 11/11.0.2+9
>>> > 2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Maximum heap size: 746 MiBytes
>>> > 2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JAVA_HOME: (not set)
>>> > 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Hadoop version: 2.6.5
>>> > 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JVM Options:
>>> > 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Xmx781818251
>>> > 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Xms781818251
>>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -XX:MaxDirectMemorySize=317424929
>>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -XX:MaxMetaspaceSize=100663296
>>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlog.file=/var/log/flink-session-cluster/taskmanager.log
>>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlog4j.configuration=file:/opt/flink/conf/log4j.properties
>>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlogback.configurationFile=file:/opt/flink/conf/logback.xml
>>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Program Arguments: (none)
>>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Classpath: /opt/flink/lib/apache-log4j-extras-1.2.17.jar:/opt/flink/lib/flink-metrics-graphite-1.10.0.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.6.5-8.0.jar:/opt/flink/lib/flink-table-blink_2.12-1.10.0.jar:/opt/flink/lib/flink-table_2.12-1.10.0.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.12-1.10.0.jar:
>>> > 2020-03-11 11:43:57,541 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - --------------------------------------------------------------------------------
>>> > 2020-03-11 11:43:57,542 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - Registered UNIX signal handlers for [TERM, HUP, INT]
>>> > 2020-03-11 11:43:57,550 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.process.size, 2g
>>> > 2020-03-11 11:43:57,550 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.cpu.cores, 2
>>> > 2020-03-11 11:43:57,551 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
>>> > 2020-03-11 11:43:57,551 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: parallelism.default, 1
>>> > ...
>>> >
>>> > Judging by the docs, specifying the taskmanager.memory.process.size key should be enough to launch the job, but it seems this value is ignored.
>>> > I would appreciate any suggestion.
>>> >
>>> > Regards and thanks in advance,
>>> > Alex.

Re: Flink 1.10 container memory configuration with Mesos.

Alexander Kasyanenko
Instead of just launching the TM, as happens right now, I suggest logging the launch command first and then launching the TM. But that might be unnecessary, since the use case is rather specific.
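As an illustration of the "log first, then launch" idea, here is a minimal sketch. It is written in Python only for the sake of a compact example (the wrapper mentioned in this thread was a shell script), and the logger name and function names are hypothetical.

```python
import logging
import shlex
import subprocess
import sys

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
LOG = logging.getLogger("tm-launcher")  # hypothetical logger name


def describe_launch(argv):
    """Render the argument list as one shell-quoted command line."""
    return shlex.join(argv)


def launch(argv):
    """Log the full launch command at INFO level, then run it.

    The argument list is passed through unchanged, so nothing is
    dropped or re-split between logging and launching.
    """
    LOG.info("Launching TaskManager with command: %s", describe_launch(argv))
    return subprocess.run(argv, check=False).returncode
```

Logging the exact argument list before starting the process makes a stray or missing argument visible without switching the whole cluster to DEBUG logging.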

Regards,
Alex.


Re: Flink 1.10 container memory configuration with Mesos.

Yangze Guo
It seems we already have such logs in [1]. If that is the case, +1 for
changing it to INFO level.

[1] https://github.com/apache/flink/blob/663af45c7f403eb6724852915bf2078241927258/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/LaunchableMesosWorker.java#L341
Best,
Yangze Guo

On Thu, Mar 12, 2020 at 4:03 PM Alexander Kasyanenko
<[hidden email]> wrote:

>
> Instead of just launching TM as it works right now, I suggest to log launch command first, and then launch TM. But that might be unnecessary, since the use case is rather specific.
>
> Regards,
> Alex.
>
> чт, 12 мар. 2020 г. в 16:58, Yangze Guo <[hidden email]>:
>>
>> Glad to hear that your issue is fixed.
>> I'm not sure what you suggest to add. Could you tell it more specific
>> or create a Jira ticket?
>>
>> Best,
>> Yangze Guo
>>
>>
>> On Thu, Mar 12, 2020 at 3:51 PM Alexander Kasyanenko
>> <[hidden email]> wrote:
>> >
>> > Hi Yangze, Xintong,
>> >
>> > Thank you for instant response.
>> >
>> > And big thanks for the hint on TM launch command. It indeed was the problem. I've added my own custom mesos-taskmanager.sh to echo the launch command (I've switched to DEBUG level on logging, but it didn't really display anything useful). May I suggest to add something like this in the future releases?
>> >
>> > As for my particular case, the issue was in mesos-appmaster.sh option:
>> >
>> > -Dmesos.resourcemanager.tasks.taskmanager-cmd="/opt/job/custom_launch_tm.sh"
>> >
>> > My custom launch script was slicing argument array incorrectly.
>> >
>> > Thanks for the help and regards,
>> > Alex.
>> >
>> > чт, 12 мар. 2020 г. в 15:46, Xintong Song <[hidden email]>:
>> >>
>> >> Hi Alex,
>> >>
>> >> Could you try to check and post your TM launch command? I suspect that there might be some unrecognized arguments that prevent the rest of arguments being parsed.
>> >>
>> >> The TM memory configuration process works as follow:
>> >>
>> >> The resource manager will parse the configurations, checking which options are configured and which are not, and calculate the size of each memory component. (This is where ‘taskmanager.memory.process.size’ is used.)
>> >> After deriving the memory component sizes, the resource manager will generate launch command for the task managers, with dynamic configurations "-D <key=value>" overwriting the memory component sizes. Therefore, even you have not configured 'taskmanager.memory.task.heap.size', it is expected that before when the TM is launched this config option should be available.
>> >> When a task manager is started, it will not do the calculations again, and will directly read the memory component sizes calculated by resource manager from the dynamic configurations. That means it is not reading ‘taskmanager.memory.process.size’ and deriving memory component sizes from it again.
>> >>
>> >> One thing that might have caused your problem is that, when MesosTaskExecutorRunner parses the command line arguments (that's where the dynamic configurations are passed in), if it meets an unrecognized token it will stop parsing the rest of the arguments. That could be the reason that 'taskmanager.memory.task.heap.size' is missing. You can take a look at the launching command, see if there's anything unexpected before the memory dynamic configurations.
>> >>
>> >> Thank you~
>> >>
>> >> Xintong Song
>> >>
>> >>
>> >>
>> >> On Thu, Mar 12, 2020 at 2:26 PM Yangze Guo <[hidden email]> wrote:
>> >>>
>> >>> Hi, Alexander
>> >>>
>> >>> I could not reproduce it in my local environment. Normally, Mesos RM
>> >>> will calculate all the mem config and add it to the launch command.
>> >>> Unfortunately, all the log I could found for this command is at the
>> >>> DEBUG level. Would you mind changing the log level to DEBUG or sharing
>> >>> anything about the taskmanager launch command you could found in the
>> >>> current log?
>> >>>
>> >>>
>> >>> Best,
>> >>> Yangze Guo
>> >>>
>> >>> On Thu, Mar 12, 2020 at 1:38 PM Alexander Kasyanenko
>> >>> <[hidden email]> wrote:
>> >>> >
>> >>> > Hi folks,
>> >>> >
>> >>> > I have a question related configuration for new memory introduced in flink 1.10. Has anyone encountered similar problem?
>> >>> > I'm trying to make use of taskmanager.memory.process.size configuration key in combination with mesos session cluster, but I get an error like this:
>> >>> >
>> >>> > 2020-03-11 11:44:09,771 [main] ERROR org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - Error while starting the TaskManager
>> >>> > org.apache.flink.configuration.IllegalConfigurationException: Failed to create TaskExecutorResourceSpec
>> >>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:72)
>> >>> > at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:356)
>> >>> > at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.<init>(TaskManagerRunner.java:152)
>> >>> > at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:308)
>> >>> > at org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.lambda$main$0(MesosTaskExecutorRunner.java:106)
>> >>> > at java.base/java.security.AccessController.doPrivileged(Native Method)
>> >>> > at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>> >>> > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
>> >>> > at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>> >>> > at org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.main(MesosTaskExecutorRunner.java:105)
>> >>> > Caused by: org.apache.flink.configuration.IllegalConfigurationException: The required configuration option Key: 'taskmanager.memory.task.heap.size' , default: null (fallback keys: []) is not set
>> >>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkConfigOptionIsSet(TaskExecutorResourceUtils.java:90)
>> >>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.lambda$checkTaskExecutorResourceConfigSet$0(TaskExecutorResourceUtils.java:84)
>> >>> > at java.base/java.util.Arrays$ArrayList.forEach(Arrays.java:4390)
>> >>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkTaskExecutorResourceConfigSet(TaskExecutorResourceUtils.java:84)
>> >>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:70)
>> >>> > ... 9 more
>> >>> >
>> >>> > But when task manager is launched, it correctly parses process memory key:
>> >>> >
>> >>> > 2020-03-11 11:43:55,376 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - --------------------------------------------------------------------------------
>> >>> > 2020-03-11 11:43:55,377 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Starting MesosTaskExecutorRunner (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 19:18:19 CET)
>> >>> > 2020-03-11 11:43:55,377 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  OS current user: root
>> >>> > 2020-03-11 11:43:57,347 [main] WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> >>> > 2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JVM: OpenJDK 64-Bit Server VM - AdoptOpenJDK - 11/11.0.2+9
>> >>> > 2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Maximum heap size: 746 MiBytes
>> >>> > 2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JAVA_HOME: (not set)
>> >>> > 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Hadoop version: 2.6.5
>> >>> > 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JVM Options:
>> >>> > 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Xmx781818251
>> >>> > 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Xms781818251
>> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -XX:MaxDirectMemorySize=317424929
>> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -XX:MaxMetaspaceSize=100663296
>> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlog.file=/var/log/flink-session-cluster/taskmanager.log
>> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlog4j.configuration=file:/opt/flink/conf/log4j.properties
>> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlogback.configurationFile=file:/opt/flink/conf/logback.xml
>> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Program Arguments: (none)
>> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Classpath: /opt/flink/lib/apache-log4j-extras-1.2.17.jar:/opt/flink/lib/flink-metrics-graphite-1.10.0.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.6.5-8.0.jar:/opt/flink/lib/flink-table-blink_2.12-1.10.0.jar:/opt/flink/lib/flink-table_2.12-1.10.0.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.12-1.10.0.jar:
>> >>> > 2020-03-11 11:43:57,541 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - --------------------------------------------------------------------------------
>> >>> > 2020-03-11 11:43:57,542 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - Registered UNIX signal handlers for [TERM, HUP, INT]
>> >>> > 2020-03-11 11:43:57,550 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.process.size, 2g
>> >>> > 2020-03-11 11:43:57,550 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.cpu.cores, 2
>> >>> > 2020-03-11 11:43:57,551 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
>> >>> > 2020-03-11 11:43:57,551 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: parallelism.default, 1
>> >>> > ...
>> >>> >
>> >>> > Judging by the docs, specifying the taskmanager.memory.process.size key should be enough to launch the job, but it seems like this value is ignored.
>> >>> > I would appreciate any suggestions.
>> >>> >
>> >>> > Regards and thanks in advance,
>> >>> > Alex.

Re: Flink 1.10 container memory configuration with Mesos.

Yangze Guo
BTW, the dynamic config will also appear in the TM-side logs [1]. It would
be good to print it at INFO level as well.

[1] https://github.com/apache/flink/blob/663af45c7f403eb6724852915bf2078241927258/flink-mesos/src/main/java/org/apache/flink/mesos/entrypoint/MesosTaskExecutorRunner.java#L77

Best,
Yangze Guo

On Thu, Mar 12, 2020 at 4:06 PM Yangze Guo <[hidden email]> wrote:

>
> It seems we already have such logs in [1]. If that is the case, +1 for
> changing it to INFO level.
>
> [1] https://github.com/apache/flink/blob/663af45c7f403eb6724852915bf2078241927258/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/LaunchableMesosWorker.java#L341
> Best,
> Yangze Guo
>
> On Thu, Mar 12, 2020 at 4:03 PM Alexander Kasyanenko
> <[hidden email]> wrote:
> >
> > Instead of just launching the TM, as it works right now, I suggest logging the launch command first and then launching the TM. But that might be unnecessary, since the use case is rather specific.
> >
> > Regards,
> > Alex.
> >
> > On Thu, Mar 12, 2020 at 4:58 PM Yangze Guo <[hidden email]> wrote:
> >>
> >> Glad to hear that your issue is fixed.
> >> I'm not sure what you are suggesting to add. Could you be more specific
> >> or create a Jira ticket?
> >>
> >> Best,
> >> Yangze Guo
> >>
> >>
> >> On Thu, Mar 12, 2020 at 3:51 PM Alexander Kasyanenko
> >> <[hidden email]> wrote:
> >> >
> >> > Hi Yangze, Xintong,
> >> >
> >> > Thank you for the instant response.
> >> >
> >> > And big thanks for the hint about the TM launch command. That was indeed the problem. I added my own custom mesos-taskmanager.sh to echo the launch command (I had switched logging to DEBUG level, but it didn't display anything useful). May I suggest adding something like this in future releases?
> >> >
> >> > As for my particular case, the issue was in this mesos-appmaster.sh option:
> >> >
> >> > -Dmesos.resourcemanager.tasks.taskmanager-cmd="/opt/job/custom_launch_tm.sh"
> >> >
> >> > My custom launch script was slicing the argument array incorrectly.
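
A minimal sketch of a launch wrapper that avoids this class of bug, assuming a custom script sits between Mesos and the real TaskManager start script. It logs the command first and then forwards every argument unchanged. The function names `log_launch_command` and `launch_with_logging` are hypothetical, `echo` stands in for the real start script, and the value `512m` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper sketch; names and values are illustrative only.

# Print the command line to stderr with each argument in brackets,
# so it is obvious where every argument begins and ends.
log_launch_command() {
  printf 'Launching TM:' >&2
  printf ' [%s]' "$@" >&2
  printf '\n' >&2
}

# Log the command, then run it with all arguments forwarded unchanged.
# Quoted "$@" preserves argument boundaries; an unquoted $* or $@ would
# re-split any argument that contains whitespace.
launch_with_logging() {
  log_launch_command "$@"
  "$@"
}

# Example: 'echo' stands in for the real TaskManager start script.
launch_with_logging echo -D taskmanager.memory.task.heap.size=512m
```

Forwarding with a quoted "$@" rather than slicing the argument list by hand keeps the dynamic "-D" options intact, and the log line on stderr shows exactly what the TaskManager process receives.
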
> >> >
> >> > Thanks for the help and regards,
> >> > Alex.
> >> >
> >> > On Thu, Mar 12, 2020 at 3:46 PM Xintong Song <[hidden email]> wrote:
> >> >>
> >> >> Hi Alex,
> >> >>
> >> >> Could you check and post your TM launch command? I suspect that there might be some unrecognized arguments that prevent the rest of the arguments from being parsed.
> >> >>
> >> >> The TM memory configuration process works as follows:
> >> >>
> >> >> 1. The resource manager parses the configuration, checks which options are configured and which are not, and calculates the size of each memory component. (This is where 'taskmanager.memory.process.size' is used.)
> >> >> 2. After deriving the memory component sizes, the resource manager generates the launch command for the task managers, with dynamic configurations "-D <key=value>" overwriting the memory component sizes. Therefore, even if you have not configured 'taskmanager.memory.task.heap.size', this config option is expected to be available by the time the TM is launched.
> >> >> 3. When a task manager starts, it does not repeat the calculations; it directly reads the memory component sizes calculated by the resource manager from the dynamic configurations. That means it does not read 'taskmanager.memory.process.size' and derive the memory component sizes from it again.
> >> >>
> >> >> One thing that might have caused your problem is that, when MesosTaskExecutorRunner parses the command line arguments (that's where the dynamic configurations are passed in), it stops parsing the rest of the arguments as soon as it encounters an unrecognized token. That could be the reason 'taskmanager.memory.task.heap.size' is missing. You can take a look at the launch command to see if there's anything unexpected before the memory dynamic configurations.
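
A toy sketch of the failure mode described above: a parser that stops at the first unrecognized token never sees the dynamic properties that follow it. This is plain shell written for illustration, not Flink's actual argument parser. The function name `parse_dynamic_props`, the token `stray-token`, and the value `512m` are assumptions made up for the example.

```shell
#!/bin/sh
# Toy model only: NOT Flink's parser. It accepts "-D key=value" and
# "-Dkey=value" tokens and stops at the first token it does not
# recognize, silently ignoring everything after it.
parse_dynamic_props() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -D)
        # "-D key=value": the property is the next token, if present.
        [ "$#" -ge 2 ] || break
        printf '%s\n' "$2"
        shift 2
        ;;
      -D?*)
        # "-Dkey=value": strip the leading "-D".
        printf '%s\n' "${1#-D}"
        shift
        ;;
      *)
        # Unrecognized token: stop here; the remaining arguments are dropped.
        break
        ;;
    esac
  done
}

# Well-formed command line: both properties are printed.
parse_dynamic_props -D taskmanager.memory.task.heap.size=512m -D taskmanager.cpu.cores=2

# A stray token in front hides every dynamic property after it,
# so this call prints nothing.
parse_dynamic_props stray-token -D taskmanager.memory.task.heap.size=512m
```

Under this model, a single unexpected argument placed ahead of the memory options is enough to make 'taskmanager.memory.task.heap.size' look unset, which would match the exception in the original report.
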
> >> >>
> >> >> Thank you~
> >> >>
> >> >> Xintong Song
> >> >>
> >> >>
> >> >>
> >> >> On Thu, Mar 12, 2020 at 2:26 PM Yangze Guo <[hidden email]> wrote:
> >> >>>
> >> >>> Hi, Alexander
> >> >>>
> >> >>> I could not reproduce it in my local environment. Normally, the Mesos RM
> >> >>> calculates all the memory configs and adds them to the launch command.
> >> >>> Unfortunately, all the logs I could find for this command are at the
> >> >>> DEBUG level. Would you mind changing the log level to DEBUG or sharing
> >> >>> anything about the taskmanager launch command that you can find in the
> >> >>> current log?
> >> >>>
> >> >>>
> >> >>> Best,
> >> >>> Yangze Guo
> >> >>>
> >> >>> On Thu, Mar 12, 2020 at 1:38 PM Alexander Kasyanenko
> >> >>> <[hidden email]> wrote:
> >> >>> >
> >> >>> > Hi folks,
> >> >>> >
> >> >>> > I have a question about the new memory configuration introduced in Flink 1.10. Has anyone encountered a similar problem?
> >> >>> > I'm trying to use the taskmanager.memory.process.size configuration key in combination with a Mesos session cluster, but I get an error like this:
> >> >>> >
> >> >>> > 2020-03-11 11:44:09,771 [main] ERROR org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - Error while starting the TaskManager
> >> >>> > org.apache.flink.configuration.IllegalConfigurationException: Failed to create TaskExecutorResourceSpec
> >> >>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:72)
> >> >>> > at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:356)
> >> >>> > at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.<init>(TaskManagerRunner.java:152)
> >> >>> > at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:308)
> >> >>> > at org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.lambda$main$0(MesosTaskExecutorRunner.java:106)
> >> >>> > at java.base/java.security.AccessController.doPrivileged(Native Method)
> >> >>> > at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> >> >>> > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
> >> >>> > at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> >> >>> > at org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.main(MesosTaskExecutorRunner.java:105)
> >> >>> > Caused by: org.apache.flink.configuration.IllegalConfigurationException: The required configuration option Key: 'taskmanager.memory.task.heap.size' , default: null (fallback keys: []) is not set
> >> >>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkConfigOptionIsSet(TaskExecutorResourceUtils.java:90)
> >> >>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.lambda$checkTaskExecutorResourceConfigSet$0(TaskExecutorResourceUtils.java:84)
> >> >>> > at java.base/java.util.Arrays$ArrayList.forEach(Arrays.java:4390)
> >> >>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkTaskExecutorResourceConfigSet(TaskExecutorResourceUtils.java:84)
> >> >>> > at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:70)
> >> >>> > ... 9 more
> >> >>> >
> >> >>> > But when the task manager is launched, it correctly parses the process memory key:
> >> >>> >
> >> >>> > 2020-03-11 11:43:55,376 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - --------------------------------------------------------------------------------
> >> >>> > 2020-03-11 11:43:55,377 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Starting MesosTaskExecutorRunner (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 19:18:19 CET)
> >> >>> > 2020-03-11 11:43:55,377 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  OS current user: root
> >> >>> > 2020-03-11 11:43:57,347 [main] WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >> >>> > 2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JVM: OpenJDK 64-Bit Server VM - AdoptOpenJDK - 11/11.0.2+9
> >> >>> > 2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Maximum heap size: 746 MiBytes
> >> >>> > 2020-03-11 11:43:57,535 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JAVA_HOME: (not set)
> >> >>> > 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Hadoop version: 2.6.5
> >> >>> > 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JVM Options:
> >> >>> > 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Xmx781818251
> >> >>> > 2020-03-11 11:43:57,539 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Xms781818251
> >> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -XX:MaxDirectMemorySize=317424929
> >> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -XX:MaxMetaspaceSize=100663296
> >> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlog.file=/var/log/flink-session-cluster/taskmanager.log
> >> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlog4j.configuration=file:/opt/flink/conf/log4j.properties
> >> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -     -Dlogback.configurationFile=file:/opt/flink/conf/logback.xml
> >> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Program Arguments: (none)
> >> >>> > 2020-03-11 11:43:57,540 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Classpath: /opt/flink/lib/apache-log4j-extras-1.2.17.jar:/opt/flink/lib/flink-metrics-graphite-1.10.0.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.6.5-8.0.jar:/opt/flink/lib/flink-table-blink_2.12-1.10.0.jar:/opt/flink/lib/flink-table_2.12-1.10.0.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.12-1.10.0.jar:
> >> >>> > 2020-03-11 11:43:57,541 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - --------------------------------------------------------------------------------
> >> >>> > 2020-03-11 11:43:57,542 [main] INFO  org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - Registered UNIX signal handlers for [TERM, HUP, INT]
> >> >>> > 2020-03-11 11:43:57,550 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.process.size, 2g
> >> >>> > 2020-03-11 11:43:57,550 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.cpu.cores, 2
> >> >>> > 2020-03-11 11:43:57,551 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
> >> >>> > 2020-03-11 11:43:57,551 [main] INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: parallelism.default, 1
> >> >>> > ...
> >> >>> >
> >> >>> > Judging by the docs, specifying the taskmanager.memory.process.size key should be enough to launch the job, but it seems like this value is ignored.
> >> >>> > I would appreciate any suggestions.
> >> >>> >
> >> >>> > Regards and thanks in advance,
> >> >>> > Alex.