(DEPRECATED) Apache Flink User Mailing List archive.

job doesn't start via cli after migrating Flink from 1.8 to 1.10


Vitaliy Semochkin

job doesn't start via cli after migrating Flink from 1.8 to 1.10

Hi,

I've recently migrated from Flink 1.8 to Flink 1.10.

When I start the job using the YarnClusterDescriptor.deployJobCluster method, everything works fine.

However, when I start the job from a shell script, the job fails with the following messages:

Shell script reports:

Cluster specification: ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=12000, slotsPerTaskManager=3}
YarnClusterDescriptor Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster

1. The server has twice as much as that, and this configuration works on Flink 1.8. Why are the resources not enough after switching to 1.10?

yarn log content of the job reports:

2020-04-08 14:31:02,558 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnJobClusterEntrypoint (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 19:18:19 CET)
2020-04-08 14:31:02,558 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: yarn
2020-04-08 14:31:03,092 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: erm_user
2020-04-08 14:31:03,092 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.112-b15
2020-04-08 14:31:03,092 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 406 MiBytes
2020-04-08 14:31:03,092 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: /usr/jdk64/jdk1.8.0_112
2020-04-08 14:31:03,093 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Hadoop version: 2.7.5
2020-04-08 14:31:03,093 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
2020-04-08 14:31:03,093 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xms424m
2020-04-08 14:31:03,094 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xmx424m
2020-04-08 14:31:03,094 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog.file=/hadoop/yarn/log/application_1586286375485_0025/container_e82_1586286375485_0025_01_000001/jobmanager.log
2020-04-08 14:31:03,094 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:log4j.properties

2020-04-08 14:31:03,094 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments: (none)

2. Why does ClusterEntrypoint report -Xmx424m?

3. When I start the job via YarnClusterDescriptor.deployJobCluster, it reports the amount of memory assigned to the task manager. What is the -Xmx424m reported by ClusterEntrypoint responsible for?

Second suspicious message in log is:

2020-04-08 16:28:50,840 WARN org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/hadoop/yarn/local/usercache/erm_user/appcache/application_1586286375485_0026/jaas-1348005722200054084.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.

4. What leads to this exception, and how am I supposed to configure the JAAS section named Client?

Third suspicious message, though most likely it is an outcome of something being incorrectly configured:

2020-04-08 16:28:52,115 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{7a4da93f1f0bed92ccdbd707dfb47b7f}]
2020-04-08 16:28:52,120 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{3642e93314a78205854b7dfee80ea1a7}]
2020-04-08 16:28:52,121 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{c632edc6775f52762cb5a981a4109b89}]
2020-04-08 16:28:52,125 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://[hidden email]:39331/user/resourcemanager(00000000000000000000000000000000)
2020-04-08 16:28:52,130 INFO org.apache.flink.runtime.jobmaster.JobMaster - Could not resolve ResourceManager address akka.tcp://[hidden email]-central-1.compute.internal:39331/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://[hidden email]:39331/user/resourcemanager..

5. What could cause the failure to connect to the ResourceManager? (With Flink 1.8 the job didn't have such issues, so it is not a firewall issue or a lack of resources.)

The whole yarn log is attached.

Attachment: flink-job.log (116K)

Yang Wang

Re: job doesn't start via cli after migrating Flink from 1.8 to 1.10

I will try to answer your questions inline.

> The server has twice as much as that, and this configuration works on Flink 1.8. Why are the resources not enough after switching to 1.10?

Since 1.10, the taskmanager resource-related configuration has changed and the default values are larger than before, so the same application may need more resources. You can check out the migration guide [1] for more information.

> Why does ClusterEntrypoint report -Xmx424m?

The default cut-off is 600m (configured via “containerized.heap-cutoff-min”) and the jobmanager container memory is 1024m, so only 424m is left for the jobmanager heap.
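The arithmetic above can be sketched as follows. This is only an illustration, not Flink's actual code: the function name `jobmanager_heap_mb` is hypothetical, and the 0.25 ratio is an assumption about the default of “containerized.heap-cutoff-ratio”, which is not mentioned in this thread.

```python
def jobmanager_heap_mb(container_mb, cutoff_min_mb=600, cutoff_ratio=0.25):
    """Heap (in MB) left after the containerized cut-off is subtracted.

    cutoff_min_mb mirrors "containerized.heap-cutoff-min" (600m, as stated above).
    cutoff_ratio is an ASSUMED default for "containerized.heap-cutoff-ratio".
    """
    # The cut-off is the larger of the fixed minimum and the ratio-based value.
    cutoff_mb = max(cutoff_min_mb, int(container_mb * cutoff_ratio))
    return container_mb - cutoff_mb


# masterMemoryMB=1024 from the cluster specification in the original message
print(jobmanager_heap_mb(1024))  # 424
```

With a 1024m container the ratio-based cut-off would only be 256m, so the 600m minimum applies and the heap ends up at 424m, which matches the -Xmx424m in the log.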

> When I start the job via YarnClusterDescriptor.deployJobCluster, it reports the amount of memory assigned to the task manager. What is the -Xmx424m reported by ClusterEntrypoint responsible for?

The “-Xmx424m” is only the jobmanager heap size. You need to check the taskmanager logs to see whether the memory setting is as expected.

> What leads to this exception and how am I supposed to configure JAAS section named Client?

I am not an expert in security. However, it seems that Flink creates a default empty JAAS file and the ZooKeeper client tries to load it, which causes this warning. I think it is unrelated to your problem: I have tried it on my YARN cluster, where the same log shows up and the Flink job runs fine. If you really want to connect to ZooKeeper with JAAS, I think you need to specify your own valid JAAS file [2].

> What can be the reason of failures to connect ResourceManager (if with flink 1.8 the job didn't have such issues, it's not a firewall issue or lack of resources)?

It is quite strange, because the JobMaster and the Flink ResourceManager are running in the same process, yet the JobMaster could not connect to the address “ip-172-31-65-130.eu-central-1.compute.internal:39331”. When you see such logs, could you log in to the YARN nodemanager host to check whether the JobManager process is listening on the specified port, and then use `telnet` to check the network connectivity?
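If `telnet` is not installed on the node, the same connectivity check can be approximated with a few lines of Python. This is a minimal sketch: the helper name `can_connect` is hypothetical, and it only tells you whether a TCP connection can be opened, not whether the Flink RPC endpoint behind the port is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call, using the host and port from the log line quoted above:
# can_connect("ip-172-31-65-130.eu-central-1.compute.internal", 39331)
```

A False result from the nodemanager host itself would point to the process not listening on that port or to a name resolution problem, rather than to a firewall between machines.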

Also, you could try configuring the jobmanager RPC port as a fixed port or a port range (via “yarn.application-master.port”).

Hope this helps.

[1]. https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html
[2]. https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/security-kerberos.html#jaas-security-module

Best,

Yang

Vitaliy Semochkin <[hidden email]> wrote on Thu, Apr 9, 2020 at 2:02 AM:
