Hi all,
We have a Flink 1.6 streaming application running on Amazon EMR, with a YARN session in detached mode configured with 20 GB for the Task Manager, 2 GB for the Job Manager, and 4 slots (the number of vCPUs). Each Core Node has 4 vCores, 32 GB of memory, and 32 GB of disk, and each Task Node has 4 vCores, 8 GB of memory, and 32 GB of disk. We auto-scale the Core Nodes based on HDFS Utilization and Capacity Remaining GB, and the Task Nodes based on YARN Available Memory and the number of Pending Containers. We've got Log Aggregation turned on as well.

This runs well under normal pressure for about a week, whereupon YARN can no longer allocate the resource requests from Flink, causing container requests to build up. Even when the cluster is scaled up, the container requests don't seem to be fulfilled. I've seen that it seems to start.

Does anyone have a good guide to setting up a streaming application on EMR with YARN?

Thank you,
Austin Cawley-Edwards
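As a rough sanity check on the sizes above, the sketch below computes how many 20 GB Task Manager containers could fit on each node type. It assumes, purely for illustration, that YARN can hand out all of a node's physical memory. In practice the usable amount (yarn.nodemanager.resource.memory-mb) is lower, so real counts can only be smaller. The names in the code are hypothetical and not part of Flink or YARN.

```python
# Node and container sizes taken from the post, in GB.
NODE_MEMORY_GB = {"core": 32, "task": 8}
TASK_MANAGER_GB = 20
JOB_MANAGER_GB = 2


def containers_per_node(node_memory_gb: int, container_gb: int) -> int:
    """Upper bound on same-sized containers that fit on a single node.

    Assumption: the whole node memory is available to YARN, which
    overstates real capacity.
    """
    return node_memory_gb // container_gb


# Upper bound of Task Manager containers per node type.
fits = {
    node: containers_per_node(mem, TASK_MANAGER_GB)
    for node, mem in NODE_MEMORY_GB.items()
}

for node, count in fits.items():
    print(f"{node} node: {count} Task Manager container(s) of {TASK_MANAGER_GB} GB")
```

Under these assumptions a 20 GB request cannot be placed on an 8 GB Task Node, which may be one reason why adding Task Nodes does not clear the pending Task Manager requests.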
On Tue, Dec 4, 2018 at 11:24 AM Austin Cawley-Edwards <[hidden email]> wrote:
Perhaps related to this, one of my Tasks does not seem to be checkpointing or restoring correctly. It hangs during the checkpoint process, which causes a timeout, and then reports "Checkpoint Coordinator is suspended." I have increased "slot.idle.timeout" as was recommended here, and though it lasted longer, the checkpoint still failed.
Thanks,
Austin
On Tue, Dec 4, 2018 at 12:24 PM Austin Cawley-Edwards <[hidden email]> wrote: