ClusterSpecification and Configuration questions

ClusterSpecification and Configuration questions

Vitaliy Semochkin
Hi,

I create a job with the following parameters:
org.apache.flink.configuration.Configuration{
yarn.containers.vcores=2
yarn.appmaster.vcores=1
}

ClusterSpecification{
taskManagerMemoryMB=1024
slotsPerTaskManager=1
}
After I launch the job programmatically, I get:
yarn node -list -showDetails
Configured Resources : <memory:8192, vCores:8>
Allocated Resources : <memory:1250, vCores:1> - I suppose this was allocated for the JobManager

But in the logs I see 3 requests of the form "Requesting new TaskExecutor container with resources <memory:2048, vCores:2>".
  
Here is a log fragment:
 JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
 org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:2048, vCores:2>. Number pending requests 1.
 org.apache.flink.yarn.YarnResourceManager                     - Request slot with profile ResourceProfile{UNKNOWN} for job 64080d7889797133215e501e72b23a74 with allocation id a1c9ff2b7ec9ad662108b8a2b2301fcf.
 org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:2048, vCores:2>. Number pending requests 2.
 org.apache.flink.yarn.YarnResourceManager                     - Request slot with profile ResourceProfile{UNKNOWN} for job 64080d7889797133215e501e72b23a74 with allocation id 21f57b4324bdd50dd293547bc4b19ce2.
 org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:2048, vCores:2>. Number pending requests 3.
Close ResourceManager connection
Shut down cluster because application is in FAILED, diagnostics null.

Here are things I would like to clarify:
Why are there 3 requests to create a TaskExecutor instead of 1?
Why is no TaskExecutor created even though I have 7 cores and 7 GB of free RAM?
What is ResourceProfile{UNKNOWN}?
What does "diagnostics null" mean?

When I change ClusterSpecification.slotsPerTaskManager to 1, I get:
"Cannot serve slot request, no ResourceManager connected"
"Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources"
Why isn't the ResourceManager created even though I request even fewer resources in this case?


Regards,
Vitaliy


Re: ClusterSpecification and Configuration questions

Xintong Song
Hi Vitaliy,

Do you mean you are modifying the code of ClusterSpecification? I believe this is an internal class and is not meant to be modified by users. Changing the internal code directly might lead to internal inconsistency and unpredictable problems. If you want to modify JM/TM memory and slots per TM, please use the configuration options.
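
As a rough illustration of "use the configuration options", the sketch below collects the settings from the original post as plain key/value pairs and renders them in flink-conf.yaml style. It uses only the Java standard library; the class and method names (FlinkOptionsSketch, options, toFlinkConfYaml) are hypothetical. The keys taskmanager.numberOfTaskSlots and yarn.containers.vcores are Flink options; the memory key is an assumption that depends on the Flink version (taskmanager.memory.process.size from 1.10 on, taskmanager.heap.size before that), and the thread does not say which version is in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Flink: it only assembles option keys and values.
class FlinkOptionsSketch {

    // Collect slots per TaskManager, TaskManager memory and container vcores
    // as configuration options instead of touching ClusterSpecification.
    static Map<String, String> options(int slotsPerTaskManager,
                                       String taskManagerMemory,
                                       int containerVcores) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("taskmanager.numberOfTaskSlots", Integer.toString(slotsPerTaskManager));
        // Assumption: Flink 1.10+ memory key; older versions use taskmanager.heap.size.
        conf.put("taskmanager.memory.process.size", taskManagerMemory);
        conf.put("yarn.containers.vcores", Integer.toString(containerVcores));
        return conf;
    }

    // Render the options as "key: value" lines, the layout used by flink-conf.yaml.
    static String toFlinkConfYaml(Map<String, String> conf) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values taken from the original post: 1 slot, 1024 MB, 2 vcores.
        System.out.print(toFlinkConfYaml(options(1, "1024m", 2)));
    }
}
```

When the job is launched programmatically with Flink on the classpath, the same pairs could instead be set on org.apache.flink.configuration.Configuration via setString and setInteger.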

I think the major problem in your case is that the TaskExecutor cannot be started. Would you mind posting the complete log file? That should help people understand what has caused the problem. The posted log fragments are not very helpful to that end.

In addition, would you be able to check the YARN logs, to see whether the container requests are received and containers are allocated?
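
As a quick sanity check on the numbers in the original post, here is a minimal sketch of the raw capacity arithmetic. It assumes a single node with the reported configured and allocated resources, and it deliberately ignores anything the YARN scheduler may add on top (minimum-allocation rounding, queue limits), so it is only a first approximation. The class and method names are hypothetical.

```java
// Hypothetical helper: plain arithmetic on the figures from the post, no YARN API involved.
class YarnCapacitySketch {

    // True if `requests` containers of the given size fit into what is left
    // on the node after subtracting the already allocated resources.
    static boolean fits(int configuredMb, int configuredVcores,
                        int allocatedMb, int allocatedVcores,
                        int requests, int requestMb, int requestVcores) {
        int freeMb = configuredMb - allocatedMb;
        int freeVcores = configuredVcores - allocatedVcores;
        return requests * requestMb <= freeMb
                && requests * requestVcores <= freeVcores;
    }

    public static void main(String[] args) {
        // Configured <memory:8192, vCores:8>, allocated <memory:1250, vCores:1>,
        // three pending requests of <memory:2048, vCores:2> each.
        System.out.println(fits(8192, 8, 1250, 1, 3, 2048, 2)); // prints "true"
    }
}
```

By this arithmetic alone the three requests (6144 MB, 6 vcores) fit into the free 6942 MB and 7 vcores, so raw node capacity does not by itself explain the failure; the complete Flink and YARN logs are still needed to find the cause.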

Thank you~

Xintong Song



On Tue, Mar 24, 2020 at 6:45 AM Vitaliy Semochkin <[hidden email]> wrote:
[quoted text of the original message, identical to the first post above]