Hi,
I am using Amazon EMR (emr-5.20.0, Hadoop: Amazon 2.8.5, Flink: 1.6.2) to run a Flink cluster on YARN. My setup consists of m4.large instances: 1 master and 3 core nodes. I start the Flink cluster on YARN with:

    flink-yarn-session -tm 5120 -s 3 -jm 1024

This setup should ideally support 1 JM and 3 TMs with 3 slots per TM. I have 3 long-running Flink jobs that need to be running at all times, and I start each of them with a parallelism of 3. Two TMs get allocated, but the third TM fails to come up. On the YARN UI, I can see that 5 GB of memory and 1 vCore are reserved. Because of this container reservation, the third job never starts.

Is there any workaround or a way to disable container reservation? Any help would be much appreciated! I've attached the Flink JM logs and the YARN RM logs.

*JM log contains:*

INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{62845f24ec53185319bcd56d2a4abe8a}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.

*The following logs repeat every second in the YARN RM log file:*

2019-01-17 10:09:56,091 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Trying to fulfill reservation for application application_1547647237123_0001 on node: ip-30-5-114-236.ec2.internal:8041
2019-01-17 10:09:56,091 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp (ResourceManager Event Processor): Application application_1547647237123_0001 unreserved on node host: ip-30-5-114-236.ec2.internal:8041 #containers=1 available=<memory:1024, vCores:2> used=<memory:5120, vCores:1>, currently has 0 at priority 0; currentReservation <memory:0, vCores:0> on node-label=CORE
2019-01-17 10:09:56,091 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_1547647237123_0001_01_002151 Container Transitioned from NEW to RESERVED
2019-01-17 10:09:56,091 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator (ResourceManager Event Processor): Reserved container application=application_1547647237123_0001 resource=<memory:5120, vCores:1> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@3e6641a8 cluster=<memory:18432, vCores:9>
2019-01-17 10:09:56,091 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue (ResourceManager Event Processor): assignedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:18432, vCores:9>

Thanks,
Suraj

yarn-yarn-resourcemanager-ip-30-5-113-161.log <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1706/yarn-yarn-resourcemanager-ip-30-5-113-161.log>
job_manager.log <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1706/job_manager.log>
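A minimal sketch of the resource arithmetic behind the stuck reservation, based only on the flags above and the RM log line "cluster=<memory:18432, vCores:9>"; the 6144 MB per-node figure is inferred by dividing the cluster total across the 3 core nodes, not read from a config dump:

    # Rough resource arithmetic for the reported setup (values taken from the
    # flink-yarn-session flags and the RM log "cluster=<memory:18432, vCores:9>").
    CLUSTER_MB = 18432                   # total YARN memory over 3 core nodes
    NODE_MB = CLUSTER_MB // 3            # inferred: 6144 MB of YARN memory per core node
    JM_MB, TM_MB, TMS = 1024, 5120, 3    # -jm 1024, -tm 5120, three TaskManagers

    requested = JM_MB + TMS * TM_MB
    print(f"requested={requested} MB of cluster={CLUSTER_MB} MB")   # 16384 <= 18432

    # The aggregate request fits, but placement is per node: the RM log shows a
    # node with used=<memory:5120> and available=<memory:1024>, i.e. a node that
    # already hosts one 5120 MB TaskManager. The third 5120 MB container is
    # reserved on that node, and the CapacityScheduler retries the reservation
    # every second instead of the container being allocated elsewhere.
    leftover_on_busy_node = NODE_MB - TM_MB
    print(f"free on the node holding the reservation: {leftover_on_busy_node} MB")  # 1024 < 5120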
Hi,
Sharing new findings: the issue mentioned above seems to happen only with the latest EMR release (emr-5.20.0, Hadoop: Amazon 2.8.5, Flink: 1.6.2), and it is reproducible with our setup every time. I have verified that the same setup works and scales without any issues on an older EMR release (emr-5.16.0, Hadoop: Amazon 2.8.4, Flink: 1.5.0). Hope these details help in resolving the issue and help others who are facing it.

Regards,
Suraj