Hi,

I've recently migrated from Flink 1.8 to Flink 1.10. When I start the job using the YarnClusterDescriptor.deployJobCluster method, everything works fine. However, when I start the job from a shell script, the job fails with the messages below.

The shell script reports:

Cluster specification: ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=12000, slotsPerTaskManager=3}
YarnClusterDescriptor Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster

1. The server has twice as much as that, and on Flink 1.8 this configuration works. Why are the resources not enough after switching to 1.10?

The YARN log of the job reports:

2020-04-08 14:31:02,558 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnJobClusterEntrypoint (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 19:18:19 CET)
2020-04-08 14:31:02,558 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: yarn
2020-04-08 14:31:03,092 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: erm_user
2020-04-08 14:31:03,092 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.112-b15
2020-04-08 14:31:03,092 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 406 MiBytes
2020-04-08 14:31:03,092 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: /usr/jdk64/jdk1.8.0_112
2020-04-08 14:31:03,093 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Hadoop version: 2.7.5
2020-04-08 14:31:03,093 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
2020-04-08 14:31:03,093 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xms424m
2020-04-08 14:31:03,094 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xmx424m
2020-04-08 14:31:03,094 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog.file=/hadoop/yarn/log/application_1586286375485_0025/container_e82_1586286375485_0025_01_000001/jobmanager.log
2020-04-08 14:31:03,094 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:log4j.properties
2020-04-08 14:31:03,094 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments: (none)

2. Why does ClusterEntrypoint report -Xmx424m?

3. When I start the job with YarnClusterDescriptor.deployJobCluster, it reports the amount of memory assigned to the task manager. What is the -Xmx424m reported by ClusterEntrypoint responsible for?

The second suspicious message in the log is:

2020-04-08 16:28:50,840 WARN org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/hadoop/yarn/local/usercache/erm_user/appcache/application_1586286375485_0026/jaas-1348005722200054084.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.

4. What leads to this exception, and how am I supposed to configure the JAAS section named 'Client'?

The third suspicious message, though most likely it is a consequence of something being incorrectly configured:

2020-04-08 16:28:52,115 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{7a4da93f1f0bed92ccdbd707dfb47b7f}]
2020-04-08 16:28:52,120 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{3642e93314a78205854b7dfee80ea1a7}]
2020-04-08 16:28:52,121 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{c632edc6775f52762cb5a981a4109b89}]
2020-04-08 16:28:52,125 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://[hidden email]:39331/user/resourcemanager(00000000000000000000000000000000)
2020-04-08 16:28:52,130 INFO org.apache.flink.runtime.jobmaster.JobMaster - Could not resolve ResourceManager address akka.tcp://[hidden email]-central-1.compute.internal:39331/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://[hidden email]:39331/user/resourcemanager..

5. What can be the reason for the failures to connect to the ResourceManager? (With Flink 1.8 the job didn't have such issues, so it's not a firewall issue or a lack of resources.)

PS: The whole YARN log is attached.

flink-job.log (116K)
I will try to answer your questions inline.

> The server has twice as much as that, and on Flink 1.8 this configuration works. Why are the resources not enough after switching to 1.10?

Starting with 1.10, the TaskManager resource configuration has changed, and the default values are bigger than before, so you may find that the same application consumes more resources. You could check the migration guide [1] for more information.
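As a side note on the memory figures in the question, the -Xmx424m in the log looks consistent with the 1024 MB master container minus the legacy containerized heap cutoff. The sketch below is only a back-of-the-envelope check, assuming that Flink 1.10 still applies that cutoff to the JobManager and that the defaults (containerized.heap-cutoff-ratio = 0.25, containerized.heap-cutoff-min = 600 MB) were not overridden. The function name is made up for illustration.

```python
# Assumption: Flink 1.10 defaults for the legacy JobManager heap cutoff on YARN.
HEAP_CUTOFF_RATIO = 0.25    # containerized.heap-cutoff-ratio
HEAP_CUTOFF_MIN_MB = 600    # containerized.heap-cutoff-min


def jobmanager_heap_mb(container_mb,
                       ratio=HEAP_CUTOFF_RATIO,
                       min_cutoff_mb=HEAP_CUTOFF_MIN_MB):
    """Estimate the JVM heap (MB) left after the cutoff is subtracted.

    The cutoff is the larger of the minimum cutoff and
    ratio * container size.
    """
    cutoff_mb = max(min_cutoff_mb, int(container_mb * ratio))
    return container_mb - cutoff_mb


if __name__ == "__main__":
    # masterMemoryMB=1024 comes from the ClusterSpecification in the question.
    print(jobmanager_heap_mb(1024))  # 424, matching -Xms424m / -Xmx424m
```

With a 1024 MB container the 600 MB minimum dominates, which is why the heap comes out so small. The TaskManager in 1.10 follows the new memory model instead, so this arithmetic does not apply to it.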
> When I start the job with YarnClusterDescriptor.deployJobCluster, it reports the amount of memory assigned to the task manager. What is the -Xmx424m reported by ClusterEntrypoint responsible for?

The "-Xmx424m" is just the JobManager heap size. You need to check the TaskManager logs to see whether the memory setting there is as expected.

> What leads to this exception, and how am I supposed to configure the JAAS section named 'Client'?

> What can be the reason for the failures to connect to the ResourceManager? (With Flink 1.8 the job didn't have such issues, so it's not a firewall issue or a lack of resources.)

This is quite strange: the JobMaster and the Flink ResourceManager run in the same process, yet the JobMaster could not connect to the address "ip-172-31-65-130.eu-central-1.compute.internal:39331". When you see such logs, could you log in to the YARN NodeManager host to check whether the JobManager process is listening on the specified port, and then use `telnet` to check the network connectivity? I also think you could try configuring the JobManager RPC port as a fixed port or a port range (configured via "yarn.application-master.port").

I hope this helps somewhat.

[1]. https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html
[2]. https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/security-kerberos.html#jaas-security-module

Best,
Yang

Vitaliy Semochkin <[hidden email]> wrote on Thursday, April 9, 2020 at 2:02 AM:
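If `telnet` is not installed on the node, the connectivity check suggested in the reply can be approximated with a few lines of Python. This is only a rough sketch: the helper name `can_connect` is made up for illustration, and the host and port are simply the ones from the log in this thread, which will differ on every run unless the RPC port is fixed.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Host and port taken from the JobMaster log above; adjust to your run.
    host = "ip-172-31-65-130.eu-central-1.compute.internal"
    port = 39331
    state = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
```

Running it both on the NodeManager host itself and from another machine in the cluster helps to tell "nothing is listening on that port" apart from "the port is blocked or the hostname does not resolve".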