Posted by Shannon Carey
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/1-1-4-on-YARN-vcores-change-tp11016p11039.html
Ufuk & Robert,
There's a good chance you're right! On the EMR master node, where yarn-session.sh is run, /etc/hadoop/conf/yarn-site.xml says that "yarn.nodemanager.resource.cpu-vcores" is 4.
Meanwhile, on the core nodes, the value in that file is 8.
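For reference, the relevant snippet from the master node's config looks like this (the same property on the core nodes is set to 8):

    <!-- /etc/hadoop/conf/yarn-site.xml on the EMR master node -->
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>4</value>
    </property>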
Shall I submit a JIRA? This might be pretty easy to fix given that "yarn-session.sh -q" already knows how to get the vcore count on the nodes. I can try to make a PR for it too. I'm still not sure why the containers are showing up as only using one vcore though... or if that is expected.
Meanwhile, it seems like overriding yarn.containers.vcores would be a successful workaround. Let me know if you disagree.
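Concretely, I'm thinking of something like this, either in flink-conf.yaml or as a dynamic property when the session is started (the -n/-tm values below are just placeholders, and I haven't verified the dynamic-property form against 1.1.4):

    # flink-conf.yaml: number of vcores each TaskManager container requests
    yarn.containers.vcores: 8

    # or as a dynamic property when starting the YARN session
    ./bin/yarn-session.sh -n 4 -tm 4096 -Dyarn.containers.vcores=8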
The other slightly annoying thing I have to deal with is leaving enough memory for the JobManager. Since all task managers are the same size, I either have to shrink every task manager (wasting resources), or double the number of task managers while halving their memory and then subtract one (basically doubling the number of separate JVMs & halving the slot density within each JVM) just to leave room for the JobManager. What do you guys think of the following change in approach?
User specifies:
number of taskmanagers
memory per slot (not per taskmanager)
total number of slots (not slots per taskmanager)
Then, Flink would decide how to organize the task managers & slots in order to also leave room for the JobManager. This should be straightforward compared to bin packing because all slots are the same size. Maybe I'm oversimplifying... might be a little tougher if the nodes are different sizes and we don't know on what node the ApplicationMaster/JobManager will run.
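To make the idea concrete, here's a rough sketch of the arithmetic I'm imagining Flink could do (plain Java, all names and numbers are illustrative, not an actual Flink API):

    // Rough sketch of the proposed sizing logic (illustrative only, not Flink code).
    public class SlotLayoutSketch {

        /**
         * Given the user's request (total slots, memory per slot, number of TMs)
         * and the memory reserved for the JobManager, decide how many slots each
         * TaskManager gets and how big each TaskManager container must be.
         */
        static void layout(int totalSlots, int slotMemoryMb, int numTaskManagers,
                           int jobManagerMemoryMb, int nodeMemoryMb) {
            // Spread slots as evenly as possible, rounding up so no slot is lost.
            int slotsPerTm = (totalSlots + numTaskManagers - 1) / numTaskManagers;
            int tmMemoryMb = slotsPerTm * slotMemoryMb;

            // Simplified check (assumes uniform node sizes): the node that hosts the
            // ApplicationMaster/JobManager must fit it next to a TaskManager container.
            if (tmMemoryMb + jobManagerMemoryMb > nodeMemoryMb) {
                throw new IllegalArgumentException(
                        "Layout leaves no room for the JobManager; shrink or split the TMs");
            }

            System.out.printf("%d TaskManagers x %d slots x %d MB = %d MB per TM%n",
                    numTaskManagers, slotsPerTm, slotMemoryMb, tmMemoryMb);
        }

        public static void main(String[] args) {
            // Example: 16 slots of 2 GB each across 4 TMs, 1 GB JobManager, 12 GB nodes.
            layout(16, 2048, 4, 1024, 12 * 1024);
        }
    }

Today the user has to do that last check by hand by fiddling with the task manager count and memory; the proposal is just to let Flink do it when it sets up the YARN session.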
-Shannon
On 1/13/17, 2:59 AM, "Ufuk Celebi" <[hidden email]> wrote:
>On Fri, Jan 13, 2017 at 9:57 AM, Robert Metzger <[hidden email]> wrote:
>> Flink is reading the number of available vcores from the local YARN
>> configuration. Is it possible that the YARN / Hadoop config on the machine
>> where you are submitting your job from sets the number of vcores as 4 ?
>
>Shouldn't we retrieve this number from the cluster instead?
>