job doesn't start via cli after migrating Flink from 1.8 to 1.10

Vitaliy Semochkin
Hi,

I've recently migrated from Flink 1.8 to Flink 1.10. When I start the job using the YarnClusterDescriptor.deployJobCluster method, everything works fine.

However, when I start the job from a shell script, the job fails. The shell script reports:
Cluster specification: ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=12000, slotsPerTaskManager=3}
YarnClusterDescriptor  Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster

1. The server has more than twice that amount of resources, and on Flink 1.8 this configuration works. Why are the resources not enough after switching to 1.10?

The YARN log of the job reports:
2020-04-08 14:31:02,558 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting YarnJobClusterEntrypoint (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 19:18:19 CET)
2020-04-08 14:31:02,558 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: yarn
2020-04-08 14:31:03,092 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: erm_user
2020-04-08 14:31:03,092 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.112-b15
2020-04-08 14:31:03,092 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 406 MiBytes
2020-04-08 14:31:03,092 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /usr/jdk64/jdk1.8.0_112
2020-04-08 14:31:03,093 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2020-04-08 14:31:03,093 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2020-04-08 14:31:03,093 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Xms424m
2020-04-08 14:31:03,094 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Xmx424m

2020-04-08 14:31:03,094 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog.file=/hadoop/yarn/log/application_1586286375485_0025/container_e82_1586286375485_0025_01_000001/jobmanager.log
2020-04-08 14:31:03,094 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:log4j.properties
2020-04-08 14:31:03,094 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments: (none)

2. Why does ClusterEntrypoint report -Xmx424m?
3. When I start the job via YarnClusterDescriptor.deployJobCluster, it reports the amount of memory assigned to the task manager. What is the -Xmx424m reported by ClusterEntrypoint responsible for?

The second suspicious message in the log is:
2020-04-08 16:28:50,840 WARN  org.apache.zookeeper.ClientCnxn                               - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/hadoop/yarn/local/usercache/erm_user/appcache/application_1586286375485_0026/jaas-1348005722200054084.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
4. What leads to this exception, and how am I supposed to configure a JAAS section named Client?

The third suspicious message, though most likely a consequence of something being misconfigured:
2020-04-08 16:28:52,115 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl      - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{7a4da93f1f0bed92ccdbd707dfb47b7f}]
2020-04-08 16:28:52,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl      - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{3642e93314a78205854b7dfee80ea1a7}]
2020-04-08 16:28:52,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl      - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{c632edc6775f52762cb5a981a4109b89}]
2020-04-08 16:28:52,125 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Connecting to ResourceManager akka.tcp://[hidden email]:39331/user/resourcemanager(00000000000000000000000000000000)
2020-04-08 16:28:52,130 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Could not resolve ResourceManager address akka.tcp://[hidden email]-central-1.compute.internal:39331/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://[hidden email]:39331/user/resourcemanager..

5. What could cause the failure to connect to the ResourceManager? (With Flink 1.8 the job had no such issues, so it is not a firewall issue or a lack of resources.)

PS: The whole YARN log is attached.


Attachment: flink-job.log (116K)

Re: job doesn't start via cli after migrating Flink from 1.8 to 1.10

Yang Wang
I will try to answer your questions inline.

> The server has more than twice that amount of resources, and on Flink 1.8 this configuration works. Why are the resources not enough after switching to 1.10?

Starting with 1.10, the taskmanager resource-related configuration has changed, and the default values are larger than before. As a result, the same application may consume more resources. You could check out the migration guide [1] for more information.


> Why does ClusterEntrypoint report -Xmx424m?
 
The default minimum cut-off is 600m (configured via “containerized.heap-cutoff-min”), and the total memory of the jobmanager container is 1024m. After subtracting the cut-off, only 424m is left for the jobmanager heap.
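
The arithmetic behind the 424m figure can be sketched as follows. This is only an illustration of the calculation, not Flink's actual code: the function name `jobmanager_heap_mb` is made up, and the 0.25 default for the cut-off ratio (“containerized.heap-cutoff-ratio”) is an assumption based on my reading of the Flink 1.10 configuration docs; only the 600m minimum and the 1024m container size come from this thread.

```python
def jobmanager_heap_mb(container_mb, cutoff_ratio=0.25, cutoff_min_mb=600):
    """Estimate the JVM heap left for the jobmanager in a YARN container.

    cutoff_ratio is an assumed default; cutoff_min_mb is the 600m minimum
    mentioned in this thread (containerized.heap-cutoff-min).
    """
    # The cut-off is the larger of the fixed minimum and the ratio-based value.
    cutoff_mb = max(cutoff_min_mb, int(container_mb * cutoff_ratio))
    if cutoff_mb >= container_mb:
        raise ValueError("cut-off leaves no memory for the heap")
    return container_mb - cutoff_mb


# masterMemoryMB=1024 from the cluster specification in the question
print(jobmanager_heap_mb(1024))  # 424, matching -Xms424m / -Xmx424m in the log
```

With a 1024m container the ratio-based cut-off (256m) is smaller than the minimum, so the 600m minimum applies and 1024 - 600 = 424.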

> When I start the job via YarnClusterDescriptor.deployJobCluster, it reports the amount of memory assigned to the task manager. What is the -Xmx424m reported by ClusterEntrypoint responsible for?

The “-Xmx424m” only sets the jobmanager heap size. You need to check the taskmanager logs to see whether the memory settings there are what you expect.

> What leads to this exception, and how am I supposed to configure a JAAS section named Client?

I am not an expert in security. However, it seems that Flink creates a default empty JAAS file and the ZooKeeper client tries to load it, which causes this warning. I think it is unrelated to your problem: I have tried this on my YARN cluster, the same log lines show up, and the Flink job runs fine. If you really want to connect to ZooKeeper with JAAS, I think you need to specify your own valid JAAS file. [2]

> What could cause the failure to connect to the ResourceManager? (With Flink 1.8 the job had no such issues, so it is not a firewall issue or a lack of resources.)

It is quite strange: the JobMaster and the Flink ResourceManager run in the same process, yet the JobMaster could not connect to the address “ip-172-31-65-130.eu-central-1.compute.internal:39331”. When you see such logs, could you log in to the YARN NodeManager host to check whether the JobManager process is listening on the specified port, and then use `telnet` to check the network connectivity?
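
If `telnet` is not installed on the node, the same connectivity check could be scripted. The following is a minimal sketch using only the Python standard library; the function name `can_connect` and the 3-second timeout are made up for illustration, and the host and port in the commented example are the ones from the log in this thread.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example, run from the node hosting the JobManager container:
# can_connect("ip-172-31-65-130.eu-central-1.compute.internal", 39331)
```

A result of False does not say why the connection failed. It only shows that nothing could be reached at that address, so it should be combined with checking on the node itself whether the process is listening on that port.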

Also, I think you could try configuring the jobmanager RPC port as a fixed port or a port range (via “yarn.application-master.port”).

I hope this helps.

[1]. https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html
[2]. https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/security-kerberos.html#jaas-security-module


Best,
Yang

On Thu, Apr 9, 2020 at 2:02 AM, Vitaliy Semochkin <[hidden email]> wrote:
> [original message, quoted in full above]