1.1.4 on YARN - vcores change?


Shannon Carey
Did anything change in 1.1.4 with regard to YARN & vcores?

I'm getting this error when deploying 1.1.4 to my test cluster. Only the Flink version changed.
java.lang.RuntimeException: Couldn't deploy Yarn cluster
	at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:384)
	at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:591)
	at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:465)
Caused by: org.apache.flink.configuration.IllegalConfigurationException: The number of virtual cores per node were configured with 8 but Yarn only has 4 virtual cores available. Please note that the number of virtual cores is set to the number of task slots by default unless configured in the Flink config with 'yarn.containers.vcores.'
	at org.apache.flink.yarn.AbstractYarnClusterDescriptor.isReadyForDeployment(AbstractYarnClusterDescriptor.java:273)
	at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:393)
	at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:381)
	... 2 more

When I run: ./bin/yarn-session.sh -q
It shows 8 vCores on each machine:

NodeManagers in the ClusterClient 3
|Property         |Value
+---------------------------------------+
|NodeID           |ip-10-2-…:8041
|Memory           |12288 MB
|vCores           |8
|HealthReport     |
|Containers       |0
+---------------------------------------+
|NodeID           |ip-10-2-…:8041
|Memory           |12288 MB
|vCores           |8
|HealthReport     |
|Containers       |0
+---------------------------------------+
|NodeID           |ip-10-2-…:8041
|Memory           |12288 MB
|vCores           |8
|HealthReport     |
|Containers       |0
+---------------------------------------+
Summary: totalMemory 36864 totalCores 24
Queue: default, Current Capacity: 0.0 Max Capacity: 1.0 Applications: 0


I'm running:
./bin/yarn-session.sh -n 3 --jobManagerMemory 1504 --taskManagerMemory 10764 --slots 8 --detached

I have not specified any value for "yarn.containers.vcores" in my config.

I switched to -n 5 and --slots 4, and halved the taskManagerMemory, which allowed the cluster to start.

However, in the YARN "Nodes" UI I see "VCores Used: 2" and "VCores Avail: 6" on all three nodes. And if I look at one of the Containers, it says, "Resource: 5408 Memory, 1 VCores". I don't understand what's happening here.

Thanks…

Re: 1.1.4 on YARN - vcores change?

rmetzger0
Hi Shannon,

Flink reads the number of available vcores from the local YARN configuration. Is it possible that the YARN/Hadoop config on the machine you are submitting the job from sets the number of vcores to 4?


On Fri, Jan 13, 2017 at 12:51 AM, Shannon Carey <[hidden email]> wrote:
Did anything change in 1.1.4 with regard to YARN & vcores?



Re: 1.1.4 on YARN - vcores change?

Ufuk Celebi
On Fri, Jan 13, 2017 at 9:57 AM, Robert Metzger <[hidden email]> wrote:
> Flink is reading the number of available vcores from the local YARN
> configuration. Is it possible that the YARN / Hadoop config on the machine
> where you are submitting your job from sets the number of vcores as 4 ?

Shouldn't we retrieve this number from the cluster instead?

Re: 1.1.4 on YARN - vcores change?

Shannon Carey
Ufuk & Robert,

There's a good chance you're right! On the EMR master node, where yarn-session.sh is run, /etc/hadoop/conf/yarn-site.xml says that "yarn.nodemanager.resource.cpu-vcores" is 4. Meanwhile, on the core nodes, the value in that file is 8.

Shall I submit a JIRA? This might be pretty easy to fix given that "yarn-session.sh -q" already knows how to get the vcore count on the nodes. I can try to make a PR for it too. I'm still not sure why the containers are showing up as only using one vcore though... or if that is expected.
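
One possible reading of the "VCores Used: 2" and "5408 Memory, 1 VCores" figures, as a rough arithmetic sketch rather than a confirmed diagnosis. It assumes the scheduler accounts every container as 1 vcore (which is what the UI reports; YARN schedulers configured to consider memory only are commonly seen doing this, but the thread does not confirm that is the setup here), that the six containers end up two per node, and that container memory is rounded up to a 32 MB increment. The 32 MB increment is a hypothetical value chosen only because it reproduces the 5408 figure.

```python
import math

def round_up(value_mb, increment_mb):
    """Round a memory request up to the next multiple of the increment."""
    return math.ceil(value_mb / increment_mb) * increment_mb

# Figures taken from the thread.
task_managers = 5            # -n 5
job_managers = 1             # one ApplicationMaster/JobManager container
nodes = 3
vcores_per_node = 8
vcores_per_container = 1     # as shown in the container view of the YARN UI

# Assumption: the 6 containers are spread two per node.
containers = task_managers + job_managers
containers_per_node = containers // nodes

vcores_used = containers_per_node * vcores_per_container
vcores_avail = vcores_per_node - vcores_used

# Halved taskManagerMemory, rounded up to an assumed 32 MB increment.
tm_request_mb = 10764 // 2
tm_container_mb = round_up(tm_request_mb, 32)

print(vcores_used, vcores_avail, tm_container_mb)  # prints: 2 6 5408
```

Under these assumptions the UI numbers are consistent with each other: the vcore count shown per container would then be an accounting artifact rather than the number of slots in the task manager.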

Meanwhile, it seems like overriding yarn.containers.vcores would be a successful workaround. Let me know if you disagree.
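
To make the mismatch concrete, here is a minimal sketch that models the check described by the error message. It is not Flink's actual code: the function names and the inline XML are made up for illustration, and the only behavior taken from the thread is that the requested vcores default to the number of slots unless yarn.containers.vcores is set, and that the limit is read from the local yarn-site.xml.

```python
import xml.etree.ElementTree as ET

VCORES_KEY = "yarn.nodemanager.resource.cpu-vcores"

def local_vcores(yarn_site_xml, default=8):
    """Return the vcore count a client would read from this yarn-site.xml text."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == VCORES_KEY:
            return int(prop.findtext("value"))
    return default  # 8 is Hadoop's documented default for this key

def requested_vcores(slots, vcores_override=None):
    """vcores per container: the slot count unless yarn.containers.vcores is set."""
    return slots if vcores_override is None else vcores_override

def is_deployable(slots, yarn_site_xml, vcores_override=None):
    """Model of the client-side check: request must not exceed the local limit."""
    return requested_vcores(slots, vcores_override) <= local_vcores(yarn_site_xml)

def yarn_site(vcores):
    """Build a tiny stand-in for /etc/hadoop/conf/yarn-site.xml (hypothetical content)."""
    return (
        "<configuration><property>"
        f"<name>{VCORES_KEY}</name><value>{vcores}</value>"
        "</property></configuration>"
    )

MASTER = yarn_site(4)  # what the EMR master node's file says
CORE = yarn_site(8)    # what the core nodes' files say

print(is_deployable(8, MASTER))                     # False: 8 slots vs. local limit 4
print(is_deployable(8, CORE))                       # True: would pass with the core nodes' value
print(is_deployable(8, MASTER, vcores_override=4))  # True: override fits the local limit
```

In this model, --slots 8 only passes on the master node if yarn.containers.vcores is set to 4 or less, which matches the proposed workaround; whether that vcore count is what you actually want YARN to allocate is a separate question.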

The other slightly annoying thing I have to deal with is leaving enough memory for the JobManager. Since all task managers are the same size, I either need to reduce the size of every task manager (wasting resources), or double the number of task managers, halve their memory, and subtract one (basically doubling the number of separate JVMs and halving the slot density within each JVM) in order to leave room for the JobManager. What do you guys think of the following change in approach?

User specifies:
number of taskmanagers
memory per slot (not per taskmanager)
total number of slots (not slots per taskmanager)

Then, Flink would decide how to organize the task managers & slots in order to also leave room for the JobManager. This should be straightforward compared to bin packing because all slots are the same size. Maybe I'm oversimplifying... might be a little tougher if the nodes are different sizes and we don't know on what node the ApplicationMaster/JobManager will run.
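
A rough sketch of what such a planner could look like, under simplifying assumptions that are not part of the proposal: all nodes are the same size, the JobManager is assumed to land on node 0, container memory is exactly slots times memory per slot (no rounding or overhead), and task managers are placed greedily on the node with the most free memory. The function name and parameters are hypothetical.

```python
def plan_task_managers(num_tms, total_slots, slot_mb, node_mb, num_nodes, jm_mb):
    """Return a list of (node_index, slots) pairs, one per task manager.

    Raises ValueError if the task managers do not fit next to the JobManager.
    """
    # Spread the slots over the task managers as evenly as possible.
    base, extra = divmod(total_slots, num_tms)
    slots_per_tm = [base + 1] * extra + [base] * (num_tms - extra)

    # Free memory per node; reserve room for the JobManager on node 0 (assumption).
    free = [node_mb] * num_nodes
    free[0] -= jm_mb
    if free[0] < 0:
        raise ValueError("JobManager does not fit on a node")

    placement = []
    for slots in slots_per_tm:  # largest task managers first
        need = slots * slot_mb
        # Greedy choice: the node with the most free memory.
        node = max(range(num_nodes), key=lambda i: free[i])
        if free[node] < need:
            raise ValueError(f"no node has {need} MB free for a {slots}-slot task manager")
        free[node] -= need
        placement.append((node, slots))
    return placement

# Numbers from the thread: 3 nodes with 12288 MB, JobManager 1504 MB,
# 24 slots at about 10764 / 8 = 1345 MB each.
layout = plan_task_managers(num_tms=3, total_slots=24, slot_mb=1345,
                            node_mb=12288, num_nodes=3, jm_mb=1504)
print(sorted(layout))  # prints: [(0, 8), (1, 8), (2, 8)]
```

The greedy placement is not optimal in general, and the harder case mentioned above (different node sizes, unknown JobManager node) would need the reservation to be made against every candidate node rather than just node 0.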

-Shannon

On 1/13/17, 2:59 AM, "Ufuk Celebi" <[hidden email]> wrote:

>Shouldn't we retrieve this number from the cluster instead?
>