Did anything change in 1.1.4 with regard to YARN & vcores?
I'm getting this error when deploying 1.1.4 to my test cluster. Only the Flink version changed.
java.lang.RuntimeException: Couldn't deploy Yarn cluster
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:384)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:591)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:465)
Caused by: org.apache.flink.configuration.IllegalConfigurationException: The number of virtual cores per node were configured with 8 but Yarn only has 4 virtual cores available. Please note that the number of virtual cores is set to the number of task slots by default unless configured in the Flink config with 'yarn.containers.vcores.'
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.isReadyForDeployment(AbstractYarnClusterDescriptor.java:273)
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:393)
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:381)
    ... 2 more

When I run:

./bin/yarn-session.sh -q
It shows 8 vCores on each machine:
NodeManagers in the ClusterClient 3
|Property       |Value
+---------------------------------------+
|NodeID         |ip-10-2-…:8041
|Memory         |12288 MB
|vCores         |8
|HealthReport   |
|Containers     |0
+---------------------------------------+
|NodeID         |ip-10-2-…:8041
|Memory         |12288 MB
|vCores         |8
|HealthReport   |
|Containers     |0
+---------------------------------------+
|NodeID         |ip-10-2-…:8041
|Memory         |12288 MB
|vCores         |8
|HealthReport   |
|Containers     |0
+---------------------------------------+
Summary: totalMemory 36864, totalCores 24
Queue: default, Current Capacity: 0.0 Max Capacity: 1.0 Applications: 0

I'm running:
./bin/yarn-session.sh -n 3 --jobManagerMemory 1504 --taskManagerMemory 10764 --slots 8 --detached
I have not specified any value for "yarn.containers.vcores" in my config.
I switched to -n 5 and --slots 4, and halved the taskManagerMemory, which allowed the cluster to start.
However, in the YARN "Nodes" UI I see "VCores Used: 2" and "VCores Avail: 6" on all three nodes. And if I look at one of the Containers, it says, "Resource: 5408 Memory, 1 VCores". I don't understand what's happening here.
Thanks…
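(For anyone hitting the same exception: the override it names is a single entry in conf/flink-conf.yaml on the machine that runs yarn-session.sh. A minimal sketch, assuming you want the per-container request capped at the 4 vcores the client-side config reports; the value is illustrative and should match what your nodes actually offer:

# cap the vcores requested per TaskManager container (value here is illustrative)
yarn.containers.vcores: 4

With that set, the per-container vcore request no longer defaults to the number of task slots.)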
Hi Shannon,

Flink is reading the number of available vcores from the local YARN configuration. Is it possible that the YARN / Hadoop config on the machine where you are submitting your job from sets the number of vcores as 4?
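(A quick way to check what the client-side config says; the path is the usual EMR location mentioned later in this thread, so adjust it for your setup:

grep -A1 'yarn.nodemanager.resource.cpu-vcores' /etc/hadoop/conf/yarn-site.xml

The line printed after the property name should show the <value> the client will report to Flink.)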
On Fri, Jan 13, 2017 at 9:57 AM, Robert Metzger <[hidden email]> wrote:
> Flink is reading the number of available vcores from the local YARN
> configuration. Is it possible that the YARN / Hadoop config on the machine
> where you are submitting your job from sets the number of vcores as 4?

Shouldn't we retrieve this number from the cluster instead?
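(For reference, the per-node capacities are available from the ResourceManager through the stock YARN client API, which is presumably how "yarn-session.sh -q" produces the listing above. A rough sketch of such a cluster-side lookup, illustrative only and not Flink's actual code:

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterVcores {
    public static void main(String[] args) throws Exception {
        // Ask the ResourceManager, not the local yarn-site.xml, for node capacities.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();
        try {
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId() + ": "
                        + node.getCapability().getVirtualCores() + " vcores, "
                        + node.getCapability().getMemory() + " MB");
            }
        } finally {
            yarnClient.stop();
        }
    }
})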
Ufuk & Robert,
There's a good chance you're right! On the EMR master node, where yarn-session.sh is run, /etc/hadoop/conf/yarn-site.xml says that "yarn.nodemanager.resource.cpu-vcores" is 4. Meanwhile, on the core nodes, the value in that file is 8.

Shall I submit a JIRA? This might be pretty easy to fix, given that "yarn-session.sh -q" already knows how to get the vcore count on the nodes. I can try to make a PR for it too. I'm still not sure why the containers are showing up as only using one vcore, though... or whether that is expected. Meanwhile, it seems like overriding yarn.containers.vcores would be a successful workaround. Let me know if you disagree.

The other slightly annoying thing that I have to deal with is leaving enough memory for the JobManager. Since all task managers are the same size, I either need to reduce the size of every task manager (wasting resources), or I have to double the task managers (and halve the memory) & subtract one (basically doubling the number of separate JVMs & halving the slot density within the JVMs) in order to leave room for the JobManager.

What do you guys think of the following change in approach? The user specifies:

- number of taskmanagers
- memory per slot (not per taskmanager)
- total number of slots (not slots per taskmanager)

Then, Flink would decide how to organize the task managers & slots in order to also leave room for the JobManager. This should be straightforward compared to bin packing because all slots are the same size. Maybe I'm oversimplifying... it might be a little tougher if the nodes are different sizes and we don't know on which node the ApplicationMaster/JobManager will run.

-Shannon

On 1/13/17, 2:59 AM, "Ufuk Celebi" <[hidden email]> wrote:

>On Fri, Jan 13, 2017 at 9:57 AM, Robert Metzger <[hidden email]> wrote:
>> Flink is reading the number of available vcores from the local YARN
>> configuration. Is it possible that the YARN / Hadoop config on the machine
>> where you are submitting your job from sets the number of vcores as 4?
>
>Shouldn't we retrieve this number from the cluster instead?
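(To make the memory trade-off described above concrete, here is rough arithmetic with the numbers from this thread; YARN's rounding of container requests up to its minimum allocation is ignored, so treat it as illustrative only:

Each NodeManager offers 12288 MB.
Original request: 3 TaskManagers x 10764 MB leaves 12288 - 10764 = 1524 MB per node,
  barely enough for a 1504 MB JobManager container and possibly too little once
  YARN rounds the request up.
"Double, halve, subtract one": 5 TaskManagers x ~5382 MB means two nodes each hold
  two TaskManagers (10764 MB) and the third holds one TaskManager plus the 1504 MB
  JobManager (~6886 MB), at the cost of five JVMs instead of three and half the
  slots per JVM.)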