YARN reserved container prevents new Flink TMs

suraj7
Hi,

I am using Amazon EMR (emr-5.20.0, Hadoop: Amazon 2.8.5, Flink: 1.6.2) to run
a Flink cluster on YARN. My setup consists of m4.large instances: 1 master
and 3 core nodes. I started the Flink cluster on YARN with the command:
flink-yarn-session -tm 5120 -s 3 -jm 1024. This setup should ideally support
1 JM and 3 TMs with 3 slots per TM.
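
As a rough sanity check of that expectation, the arithmetic can be sketched in Python. The memory figures come from the command line above and from the cluster=<memory:18432, vCores:9> entry in the RM log. The even split of memory across the 3 core nodes, the one-slot-per-parallel-task assumption (default slot sharing), and the helper names are assumptions made for illustration. The first-fit placement is only a feasibility check, not how the CapacityScheduler actually places containers, and it ignores vCores and YARN's rounding of container sizes.

```python
# Figures taken from the post; everything else is an illustrative assumption.
JM_MB = 1024          # -jm 1024
TM_MB = 5120          # -tm 5120
SLOTS_PER_TM = 3      # -s 3
JOBS = 3
PARALLELISM = 3
CLUSTER_MB = 18432    # cluster=<memory:18432, vCores:9> from the RM log
NODES = 3             # core nodes


def task_managers_needed(jobs, parallelism, slots_per_tm):
    """TMs required if each job needs `parallelism` slots (slot sharing assumed)."""
    slots = jobs * parallelism
    return -(-slots // slots_per_tm)  # ceiling division


def fits_per_node(container_sizes_mb, node_mb, nodes):
    """First-fit-decreasing check: can every container be placed on some node?"""
    free = [node_mb] * nodes
    for size in sorted(container_sizes_mb, reverse=True):
        for i, room in enumerate(free):
            if room >= size:
                free[i] -= size
                break
        else:
            return False  # no node has enough free memory for this container
    return True


tms = task_managers_needed(JOBS, PARALLELISM, SLOTS_PER_TM)
requested_mb = JM_MB + tms * TM_MB
node_mb = CLUSTER_MB // NODES  # assumption: memory is split evenly across nodes

print(tms, requested_mb, node_mb)
print(fits_per_node([JM_MB] + [TM_MB] * tms, node_mb, NODES))
```

Under these assumptions the request totals 16384 MB against 18432 MB in the cluster, and one JM plus three TMs also fit node by node, which is consistent with the expectation that all 3 TMs should come up.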

I have 3 long-running Flink jobs that need to be running all the time. I
start each of these 3 jobs with a parallelism of 3. 2 TMs get allocated
and the 3rd TM fails to come up. On the YARN UI, I can see that 5 GB of memory
and 1 vCore are reserved. Due to this container reservation, the 3rd job
never starts. Is there any workaround or a way to disable container
reservation?

Any help would be much appreciated!

I've attached Flink JM logs and YARN RM logs.

*JM Log contains:*
INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
Requesting new slot [SlotRequestId{62845f24ec53185319bcd56d2a4abe8a}] and
profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource
manager.

*The following logs repeat every second in YARN RM log file:*
2019-01-17 10:09:56,091 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
(ResourceManager Event Processor): Trying to fulfill reservation for
application application_1547647237123_0001 on node:
ip-30-5-114-236.ec2.internal:8041
2019-01-17 10:09:56,091 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp
(ResourceManager Event Processor): Application
application_1547647237123_0001 unreserved  on node host:
ip-30-5-114-236.ec2.internal:8041 #containers=1 available=<memory:1024,
vCores:2> used=<memory:5120, vCores:1>, currently has 0 at priority 0;
currentReservation <memory:0, vCores:0> on node-label=CORE
2019-01-17 10:09:56,091 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl
(ResourceManager Event Processor): container_1547647237123_0001_01_002151
Container Transitioned from NEW to RESERVED
2019-01-17 10:09:56,091 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator
(ResourceManager Event Processor): Reserved container
application=application_1547647237123_0001 resource=<memory:5120, vCores:1>
queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@3e6641a8
cluster=<memory:18432, vCores:9>
2019-01-17 10:09:56,091 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue
(ResourceManager Event Processor): assignedContainer queue=root
usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0>
cluster=<memory:18432, vCores:9>
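
To gauge how often the reservation cycle above repeats, the RM log can be scanned for the NEW to RESERVED transition. The following is a minimal sketch that assumes the message format is exactly as quoted in this post; the function name and the optional epoch part of the container ID are illustrative assumptions.

```python
import re
from collections import Counter

# Matches container IDs such as container_1547647237123_0001_01_002151
# (optionally with an epoch part like container_e17_...) followed by the
# NEW -> RESERVED transition message quoted in the post.
RESERVED = re.compile(
    r"container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+ "
    r"Container Transitioned from NEW to RESERVED"
)


def count_reservations(log_text):
    """Count NEW -> RESERVED transitions per application ID."""
    # Collapse all whitespace so a match survives wrapped lines.
    flat = " ".join(log_text.split())
    return Counter(
        f"application_{m.group(1)}_{m.group(2)}" for m in RESERVED.finditer(flat)
    )


sample = """2019-01-17 10:09:56,091 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl
(ResourceManager Event Processor): container_1547647237123_0001_01_002151
Container Transitioned from NEW to RESERVED"""

print(count_reservations(sample))
```

The same function can be run over the full contents of the attached ResourceManager log; a count that grows by roughly one per second for a single application would match the repetition described above.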

Thanks,
Suraj

yarn-yarn-resourcemanager-ip-30-5-113-161.log
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1706/yarn-yarn-resourcemanager-ip-30-5-113-161.log>  
job_manager.log
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1706/job_manager.log>  



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: YARN reserved container prevents new Flink TMs

suraj7
Hi,

Sharing new findings.
The issue I mentioned above seems to happen only with the latest
version of EMR (emr-5.20.0, Hadoop: Amazon 2.8.5, Flink: 1.6.2), and it is
reproducible with our setup every time. I have verified that the same setup
works and scales without any issues on an older EMR version (emr-5.16.0,
Hadoop: Amazon 2.8.4, Flink: 1.5.0).

I hope the above details help in resolving the issue and help others facing
it.

Regards,
Suraj


