Flink 1.6.0 not allocating specified TMs in Yarn


Subramanya Suresh
Hi, 
It was suggested here that we migrate to 1.6.0 in light of the Akka/TM-lost issues we were facing with 1.4.2. I set up our Yarn cluster and launched our job with the command shown below.

Symptoms:
  • The CLI logs say the job is submitted, but the Yarn ResourceManager shows only 1 container allocated; the count goes up on refresh, and a subsequent refresh shows it back at 1 container allocated. 
  • The UI consistently shows 0 TMs and 0 Slots (see attached). 
  • The exceptions view in the UI shows the NoResourceAvailableException below. 
  • The JobManager logs are also included below. 
So I am not sure what is going on. I was able to launch the same job in 1.4.2, immediately get the requested TMs, and have the job work as it should. 
 
Job Submit Parameters:
nohup $FLINK_BINARY run \
    -m yarn-cluster \
    -c $FLINK_JOB_CLASSNAME \
    -yst \
    -ys 5 \
    -yn 145 \
    -yjm 20000 \
    -ytm 20000 \
    -ynm $YARN_APPLICATION_NAME \
    -d $FLINK_JOB_JAR \
            > $FLINK_JOB_LOGS/stdout.log \
            2> $FLINK_JOB_LOGS/stderr.log \
            & echo $! > $FLINK_JOB_LOGS/current-run.pid

Exception:
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0

Yarn JobManager Logs:

2018-09-17 06:53:18,041 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Connecting to ResourceManager akka.tcp://flink@...:41135/user/resourcemanager(9a62f56ce988f5499dbe1d09bd894b8a)
2018-09-17 06:53:18,045 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Resolved ResourceManager address, beginning registration
2018-09-17 06:53:18,046 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e1678524024c0d8e7f18b917ad854418}]
2018-09-17 06:53:18,046 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Registration at ResourceManager attempt 1 (timeout=100ms)
2018-09-17 06:53:18,048 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/31462809fd71ae1c92a11a58dd2f4d24/job_manager_lock.
2018-09-17 06:53:18,048 INFO  org.apache.flink.yarn.YarnResourceManager                     - Registering job manager [hidden email]://flink@...:41135/user/jobmanager_0 for job 31462809fd71ae1c92a11a58dd2f4d24.
2018-09-17 06:53:18,060 INFO  org.apache.flink.yarn.YarnResourceManager                     - Registered job manager [hidden email]://flink@...:41135/user/jobmanager_0 for job 31462809fd71ae1c92a11a58dd2f4d24.
2018-09-17 06:53:18,062 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - JobManager successfully registered at ResourceManager, leader id: 9a62f56ce988f5499dbe1d09bd894b8a.
2018-09-17 06:53:18,062 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Requesting new slot [SlotRequestId{e1678524024c0d8e7f18b917ad854418}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2018-09-17 06:53:18,064 INFO  org.apache.flink.yarn.YarnResourceManager                     - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 31462809fd71ae1c92a11a58dd2f4d24 with allocation id AllocationID{8976aac24593aa0d9854fdb569c1d0ac}.
2018-09-17 06:53:18,071 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20000, vCores:5>. Number pending requests 1.
2018-09-17 06:53:23,191 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000005 - Remaining pending container requests: 1
2018-09-17 06:53:23,602 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:53:23,603 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
2018-09-17 06:53:34,193 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:53:39,696 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000006 - Remaining pending container requests: 1
2018-09-17 06:53:40,269 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:53:40,270 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
2018-09-17 06:53:45,703 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:53:51,209 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000007 - Remaining pending container requests: 1
2018-09-17 06:53:51,365 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:53:51,365 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
2018-09-17 06:53:51,383 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000009 - Remaining pending container requests: 0
2018-09-17 06:53:51,385 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000009.
2018-09-17 06:54:01,714 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:54:07,217 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000011 - Remaining pending container requests: 1
2018-09-17 06:54:07,263 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:54:07,266 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
2018-09-17 06:54:07,276 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000012 - Remaining pending container requests: 0
2018-09-17 06:54:07,276 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000012.
2018-09-17 06:54:07,276 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000013 - Remaining pending container requests: 0
2018-09-17 06:54:07,276 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000013.
2018-09-17 06:54:12,720 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:54:18,221 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000016 - Remaining pending container requests: 1
2018-09-17 06:54:18,256 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:54:18,257 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
2018-09-17 06:54:18,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000017 - Remaining pending container requests: 0
2018-09-17 06:54:18,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000017.
2018-09-17 06:54:18,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000018 - Remaining pending container requests: 0
2018-09-17 06:54:18,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000018.
2018-09-17 06:54:18,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000020 - Remaining pending container requests: 0
2018-09-17 06:54:18,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000020.
2018-09-17 06:54:28,726 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:54:34,229 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000021 - Remaining pending container requests: 1
2018-09-17 06:54:34,268 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:54:34,269 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000022 - Remaining pending container requests: 0
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000022.
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000024 - Remaining pending container requests: 0
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000024.
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000025 - Remaining pending container requests: 0
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000025.
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000028 - Remaining pending container requests: 0
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000028.
2018-09-17 06:54:39,731 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:54:45,236 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000042 - Remaining pending container requests: 1
2018-09-17 06:54:45,281 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:54:45,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers




2018-09-17 06:58:08,291 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000595.
2018-09-17 06:58:13,403 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:58:18,045 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Pending slot request [SlotRequestId{e1678524024c0d8e7f18b917ad854418}] timed out.
2018-09-17 06:58:18,047 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job streaming-searches-test (31462809fd71ae1c92a11a58dd2f4d24) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0
at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$3(ExecutionGraph.java:984)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:534)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

Sincerely, 

--


Attachment: Screen Shot 2018-09-17 at 12.06.49 AM.png (179K)
Re: Flink 1.6.0 not allocating specified TMs in Yarn

Till Rohrmann
With Flink 1.6.0 it is no longer necessary to specify the number of containers to start (-yn 145). Flink will allocate containers dynamically. That's also the reason why you don't see registered TMs without a running job. Moreover, it is recommended to start every container with a single slot (no -ys 5). The parallelism should be controlled via the -p option or by the default parallelism configured in flink-conf.yaml.
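
A minimal sketch of how the original submit command might look after applying this advice, assuming the same environment variables as in the original post. The -yn and -ys flags are dropped and -p is added. The parallelism of 10 and the fallback values (com.example.Job, my-job, job.jar) are hypothetical placeholders, not values from this thread. The script only prints the command as a dry run, so it can be inspected before being used for real.

```shell
#!/usr/bin/env bash
# Dry-run sketch: builds the revised submit command and prints it.
# PARALLELISM=10 is a hypothetical placeholder; choose a value for your job.
PARALLELISM="${PARALLELISM:-10}"

build_submit_cmd() {
    # No -yn: containers are allocated on demand in 1.6.0.
    # No -ys: one slot per container, as recommended above.
    # The ":-" fallbacks are illustrative defaults, not real names.
    printf '%s ' \
        "${FLINK_BINARY:-flink}" run \
        -m yarn-cluster \
        -c "${FLINK_JOB_CLASSNAME:-com.example.Job}" \
        -yst \
        -p "$PARALLELISM" \
        -yjm 20000 \
        -ytm 20000 \
        -ynm "${YARN_APPLICATION_NAME:-my-job}" \
        -d "${FLINK_JOB_JAR:-job.jar}"
    printf '\n'
}

build_submit_cmd
```

Once the printed command looks right, it can be run with the same nohup and output redirections as in the original post.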

The log snippet says that Flink started the TaskManagers. But it seems as if they either could not register at the ResourceManager or could never be started. Could you check the TM logs to see what they say? If there is nothing suspicious, it would be helpful if you could share the complete logs with us.
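
One possible way to get at the TM logs when no TM appears in the Flink UI is YARN's aggregated logs. The sketch below derives the YARN application ID from a container ID copied from the JobManager log in this thread and prints the matching `yarn logs` command. It assumes log aggregation is enabled on the cluster; the helper name app_id_from_container is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a YARN container ID to its application ID.
# Expected layout:
#   container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>
app_id_from_container() {
    echo "$1" | sed -E \
        's/^container_(e[0-9]+_)?([0-9]+)_([0-9]+)_[0-9]+_[0-9]+$/application_\2_\3/'
}

# Container ID taken from the JobManager log posted in this thread.
CONTAINER_ID="container_e31_1536964973951_0247_01_000005"
APP_ID="$(app_id_from_container "$CONTAINER_ID")"

# Print the command instead of running it, so the sketch works off-cluster.
echo "yarn logs -applicationId $APP_ID"
# prints: yarn logs -applicationId application_1536964973951_0247
```

The output of the printed command can be large, so redirecting it to a file and searching for the container IDs that were started is usually easier than reading it in the terminal.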

Cheers,
Till



Re: Flink 1.6.0 not allocating specified TMs in Yarn

Subramanya Suresh
Thanks Till, 

"That's also the reason why you don't registered TMs without a running job." 
> I am not sure what you mean. We see 0 TMs in Flink (in the screenshot attached earlier and also under the TaskManagers link) despite having submitted the job (the RM does seem to show a lot of containers, though; see attached). 
> I am also not sure where to get the logs from when no TM/container shows up as running. 

How do I restrict the number of containers and the cores per container? It seems like -ytm is just a suggestion. I assume parallelism applies within a single container, so I would use 5 to say I want 5 cores within one TM? Is that again only a suggestion?
I see maxParallelism (set in code only), but that could be 8 even if the parallelism I specify is 5. 

Sincerely, 

Attachment: Screen Shot 2018-09-17 at 12.03.44 PM.png (33K)

Re: Flink 1.6.0 not allocating specified TMs in Yarn

Subramanya Suresh
I got the lines below from one of the Yarn container logs. I am not sure what changed in 1.6.0; I couldn't find anything relevant in the release notes.
Looking through the code, I am not convinced the JVM heap size is < 8GB. We start the TM with 20GB, so after the cutoff we should have totalJavaMemorySizeMB = 20GB - 5GB, i.e. 15GB, which is greater than 8GB.

2018-09-17 16:06:13,728 ERROR org.apache.flink.yarn.YarnTaskExecutorRunner                  - YARN TaskManager initialization failed.
org.apache.flink.configuration.IllegalConfigurationException: Invalid configuration value for (taskmanager.network.memory.fraction, taskmanager.network.memory.min, taskmanager.network.memory.max) : (0.1, 8000000000, 12000000000) - Network buffer memory size too large: 8000000000 >= 7769948160(maximum JVM heap size)
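
A rough arithmetic sketch of how the 15GB estimate and the number in the error could fit together. It assumes (this is not confirmed anywhere in the thread) that the 25% cutoff is taken first and that the network buffer minimum is then subtracted from the remainder to size the JVM heap. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
container_mb=20000                          # -ytm 20000
cutoff_mb=$(( container_mb / 4 ))           # assumed 25% cutoff, i.e. the 5GB mentioned above
java_mb=$(( container_mb - cutoff_mb ))     # 15000 MB left after the cutoff
network_min_bytes=8000000000                # taskmanager.network.memory.min from the error
# Assumption: the heap is what remains once the network buffers are set aside.
heap_bytes=$(( java_mb * 1024 * 1024 - network_min_bytes ))
reported_heap_bytes=7769948160              # "maximum JVM heap size" from the error

echo "estimated heap: ${heap_bytes} bytes, reported: ${reported_heap_bytes} bytes"
if [ "$network_min_bytes" -ge "$reported_heap_bytes" ]; then
  echo "network minimum >= heap, which matches the IllegalConfigurationException"
fi
```

Under that assumption the estimate lands close to the reported 7769948160 bytes, which would point at the 8000000000 minimum for taskmanager.network.memory.min rather than at -ytm being ignored.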

Please also see my questions above. 

Cheers, 


Re: Flink 1.6.0 not allocating specified TMs in Yarn

Till Rohrmann
Hi Subramanya,

You can get the logs from Yarn if you have log aggregation enabled. If they do not contain any TM logs, then the TMs were not started.

If Yarn started containers but you don't see them connected to Flink's ResourceManager, then the TaskManagers either did not start up or they have problems connecting to the ResourceManager. In order to debug this problem, the logs would be helpful.
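
A small sketch of fetching those aggregated logs. It derives the Yarn application id from one of the container ids in the JobManager log quoted earlier in the thread, then prints the `yarn logs` command as a dry run. The variable names are illustrative, and the command only returns anything if log aggregation is enabled on the cluster.

```shell
#!/usr/bin/env bash
# Container id copied from the "Received new container" log lines.
container_id="container_e31_1536964973951_0247_01_000005"

# container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<container>
#   maps to application_<clusterTimestamp>_<appSeq>
app_id="application_$(echo "$container_id" \
  | sed -E 's/^container_(e[0-9]+_)?([0-9]+_[0-9]+)_[0-9]+_[0-9]+$/\2/')"

# Dry run: print the command instead of executing it.
echo "yarn logs -applicationId $app_id"
```

Redirecting the output of the real command to a file and searching it for the TaskManager ERROR lines is usually the quickest way to see why the containers exit.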

You can configure the cores per container by setting `yarn.containers.vcores` in your flink-conf.yaml. If this value is not specified, then it will use the number of slots per TM.
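
As a sketch, the same key can also be passed on the command line as a dynamic property with -yD instead of editing flink-conf.yaml. The value 2 is only an illustrative choice, and the command is trimmed from the original submit command and printed as a dry run.

```shell
#!/usr/bin/env bash
# Illustrative value; without this override the container request falls back
# to the slots per TM, which is the vCores:5 seen in the JobManager log.
vcores_opt="-yD yarn.containers.vcores=2"

# Dry run: print the trimmed submit command instead of executing it.
echo "$FLINK_BINARY" run -m yarn-cluster -ys 5 -ytm 20000 $vcores_opt -d "$FLINK_JOB_JAR"
```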

In order to debug the memory settings problem it would be helpful to either get the full logs or the configuration and the command with which you started the Flink cluster. From the log snippet it looks as if Flink only got 8GB of memory assigned.

Cheers,
Till

On Mon, Sep 17, 2018 at 11:34 PM Subramanya Suresh <[hidden email]> wrote:
I got these logs from one of the Yarn logs. Not sure what changed in 1.6.0, couldn't find anything relevant in the release notes. 
Looking through the code i am not sure the JVM Heap Size is < 8GB. We start the TM with 20GB, so with the cutoff we should have totalJavaMemorySizeMB = 20GB - 5GB i.e. 15GB which is greater than the 8GB.

2018-09-17 16:06:13,728 ERROR org.apache.flink.yarn.YarnTaskExecutorRunner                  - YARN TaskManager initialization failed.
org.apache.flink.configuration.IllegalConfigurationException: Invalid configuration value for (taskmanager.network.memory.fraction, taskmanager.network.memory.min, taskmanager.network.memory.max) : (0.1, 8000000000, 12000000000) - Network buffer memory size too large: 8000000000 >= 7769948160(maximum JVM heap size)

Please also see my questions above. 

Cheers, 

On Mon, Sep 17, 2018 at 12:19 PM, Subramanya Suresh <[hidden email]> wrote:
Thanks Till,

"That's also the reason why you don't see registered TMs without a running job."
> I am not sure what you mean. We see 0 TMs in Flink (attached earlier, and also in the TaskManagers link) despite submitting and running the job (the RM seems to show a lot of containers though, attached).
> Also, I am not sure where to get the logs from without seeing a running TM/container.

How do I restrict the number of containers and the cores per container? It seems like -yn is just a suggestion. I assume parallelism is within the realm of a single container, so I would use 5 to say I want 5 cores within one TM? Is that again only a suggestion?
I see maxParallelism (set in code only), but that could be 8 if the parallelism I specify is 5.

Sincerely, 

On Mon, Sep 17, 2018 at 1:01 AM, Till Rohrmann <[hidden email]> wrote:
With Flink 1.6.0 it is no longer necessary to specify the number of started containers (-yn 145). Flink will dynamically allocate containers. That's also the reason why you don't see registered TMs without a running job. Moreover, it is recommended to start every container with a single slot (no -ys 5). The parallelism should be controlled via the -p option or by the default parallelism configured in flink-conf.yaml.
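
Applied to the submit command from the original post, that advice might look roughly like the sketch below. It only assembles and prints the command (a dry run). The parallelism of 10 and the fallback values for the environment variables are placeholders, and -ys 1 is one way to get a single slot per container when flink-conf.yaml sets a higher taskmanager.numberOfTaskSlots; none of these values come from this thread.

```shell
# Placeholders so the sketch runs on its own; real values come from the environment.
FLINK_BINARY=${FLINK_BINARY:-flink}
FLINK_JOB_CLASSNAME=${FLINK_JOB_CLASSNAME:-com.example.MyJob}
YARN_APPLICATION_NAME=${YARN_APPLICATION_NAME:-my-job}
FLINK_JOB_JAR=${FLINK_JOB_JAR:-my-job.jar}
PARALLELISM=10   # placeholder, not a value from this thread

# -yn is dropped (containers are allocated on demand in 1.6.0),
# -ys 1 requests a single slot per container, -p sets the job parallelism.
submit_cmd="$FLINK_BINARY run -m yarn-cluster -c $FLINK_JOB_CLASSNAME"
submit_cmd="$submit_cmd -yst -ys 1 -p $PARALLELISM -yjm 20000 -ytm 20000"
submit_cmd="$submit_cmd -ynm $YARN_APPLICATION_NAME -d $FLINK_JOB_JAR"

# Dry run: print the command for inspection instead of executing it.
echo "$submit_cmd"
```

Once the printed command looks right, it can be run with the same nohup and log redirection as in the original post.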

The log snippet says that Flink started the TaskManagers. But it seems as if they could not register at the ResourceManager or could never be started. Could you check the TM logs to see what they say? If there is nothing suspicious, then it would be helpful if you could share the complete logs with us.

Cheers,
Till




Re: Flink 1.6.0 not allocating specified TMs in Yarn

Subramanya Suresh
Hi Till, 
How do we limit the number of TMs/containers allocated? It seems like the number of TaskManagers I specify with -yn is just a suggestion and Flink allocates TMs/containers dynamically. Thanks for your answer on limiting the number of slots. The startup command that was used is below.

Here are all the details on the job. I am curious to understand the numbers below. I do see Maximum heap size: 7410 MiBytes for the TM below, so the error makes sense, but the question is why it is 7410, and what changed in 1.6.0 from 1.4.2.

nohup $FLINK_BINARY run \
    -m yarn-cluster \
    -c $FLINK_JOB_CLASSNAME \
    -yst \
    -yn 145 \
    -yjm 20000 \
    -ytm 20000 \
    -ynm $YARN_APPLICATION_NAME \
    -d $FLINK_JOB_JAR \
            > $FLINK_JOB_LOGS/stdout.log \
            2> $FLINK_JOB_LOGS/stderr.log \
            & echo $! > $FLINK_JOB_LOGS/current-run.pid

Relevant config in flink-conf.yaml
taskmanager.memory.preallocate: false
taskmanager.network.memory.min: 8000000000
taskmanager.network.memory.max: 12000000000
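
These settings may also explain the small TaskManager heap. The sketch below is a back-of-the-envelope reconstruction, not Flink's code. It assumes the container was sized at 20480 MB (the figure in most of the JobManager's container requests), that the default containerized.heap-cutoff-ratio of 0.25 applies, and that the network buffer memory is subtracted from what remains after the cutoff. Under those assumptions it reproduces the -Xmx7731m and -XX:MaxDirectMemorySize=12749m values in the TaskManager log below; the variable names are made up for the example.

```shell
container_mb=20480               # ASSUMPTION: container size as requested from YARN
cutoff_pct=25                    # ASSUMPTION: default containerized.heap-cutoff-ratio (0.25)
network_fraction_pct=10          # taskmanager.network.memory.fraction (0.1)
network_min_bytes=8000000000     # taskmanager.network.memory.min
network_max_bytes=12000000000    # taskmanager.network.memory.max

# Memory left for the JVM after the container cutoff.
cutoff_mb=$(( container_mb * cutoff_pct / 100 ))
java_mb=$(( container_mb - cutoff_mb ))

# Network buffers: fraction of the JVM memory, clamped to [min, max].
network_bytes=$(( java_mb * 1024 * 1024 * network_fraction_pct / 100 ))
if [ "$network_bytes" -lt "$network_min_bytes" ]; then
    network_bytes=$network_min_bytes
fi
if [ "$network_bytes" -gt "$network_max_bytes" ]; then
    network_bytes=$network_max_bytes
fi
network_mb=$(( network_bytes / 1024 / 1024 ))

# Heap is what remains; everything else in the container counts as direct memory.
heap_mb=$(( java_mb - network_mb ))
direct_mb=$(( container_mb - heap_mb ))

echo "java_mb=$java_mb network_mb=$network_mb heap_mb=$heap_mb direct_mb=$direct_mb"
```

If this reconstruction is right, the 0.1 fraction alone would have yielded about 1.5 GB of network buffers, but the configured minimum of 8000000000 bytes forces roughly 7629 MB, leaving 7731 MB for the heap. Lowering taskmanager.network.memory.min would presumably bring the heap back toward the expected 15 GB.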



The JobManager Logs
2018-09-17 06:53:14,197 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting YarnJobClusterEntrypoint (Version: 1.6.0, Rev:<unknown>, Date:<unknown>)
2018-09-17 06:53:14,197 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: yarn
2018-09-17 06:53:14,594 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: hello-world-app
2018-09-17 06:53:14,594 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-09-17 06:53:14,594 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 13333 MiBytes
2018-09-17 06:53:14,594 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /usr/java/jdk1.8.0_181-amd64
2018-09-17 06:53:14,596 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.6.0-cdh5.11.2
2018-09-17 06:53:14,596 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-09-17 06:53:14,596 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Xmx15000m
2018-09-17 06:53:14,596 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog.file=/data-1/yarn/container-logs/application_1536964973951_0247/container_e31_1536964973951_0247_01_000003/jobmanager.log
2018-09-17 06:53:14,596 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:logback.xml
2018-09-17 06:53:14,597 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:log4j.properties
2018-09-17 06:53:14,597 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments: (none)

2018-09-17 06:53:14,601 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - YARN daemon is running as: hello-world-app Yarn client user obtainer: hello-world-app
2018-09-17 06:53:14,603 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: historyserver.web.address, hello-world9-1-crz
2018-09-17 06:53:14,604 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: historyserver.web.port, 8082
2018-09-17 06:53:14,604 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.lookup.timeout, 600s
2018-09-17 06:53:14,604 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.cluster-id, application_1536964973951_0247
2018-09-17 06:53:14,604 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, localhost
2018-09-17 06:53:14,604 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.network.memory.max, 12000000000
2018-09-17 06:53:14,604 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.network.request-backoff.max, 30000
2018-09-17 06:53:14,604 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink/test
2018-09-17 06:53:14,605 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs:///streaming-searches/test/recovery
2018-09-17 06:53:14,605 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.watch.heartbeat.pause, 120s
2018-09-17 06:53:14,605 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: yarn.application-attempts, 10
2018-09-17 06:53:14,605 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporters, FlinkArgusReporter
2018-09-17 06:53:14,605 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: yarn.reallocate-failed, true
2018-09-17 06:53:14,605 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 5
2018-09-17 06:53:14,605 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.ask.timeout, 600s
2018-09-17 06:53:14,605 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: historyserver.archive.fs.dir, hdfs:///streaming-searches/test/completed-jobs/
2018-09-17 06:53:14,605 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.network.memory.min, 8000000000
2018-09-17 06:53:14,606 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 20000m
2018-09-17 06:53:14,606 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.archive.fs.dir, hdfs:///streaming-searches/test/completed-jobs/
2018-09-17 06:53:14,606 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.port, 8081
2018-09-17 06:53:14,606 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.superpod, DEV
2018-09-17 06:53:14,606 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: historyserver.archive.fs.refresh-interval, 10000
2018-09-17 06:53:14,607 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-09-17 06:53:14,607 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.preallocate, false
2018-09-17 06:53:14,607 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.service_name, flink-argus-service-test
2018-09-17 06:53:14,607 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.whitelist, _Custom_Source.,.Sink-_Unnamed.,JVM,Network,jobmanager
2018-09-17 06:53:14,607 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.framesize, 2000000000b
2018-09-17 06:53:14,607 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, hello-world3-2-ops.net:2181,hello-world4-1-ops.net:2181,hello-world7-1-ops.net:2181,hello-world8-1-ops.net:2181,hello-world9-1-ops.net:2181
2018-09-17 06:53:14,607 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.funnel_url, http://ajna0-funnel1-0-prd.data.sfdc.net:80
2018-09-17 06:53:14,608 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: internal.cluster.execution-mode, DETACHED
2018-09-17 06:53:14,608 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-09-17 06:53:14,608 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.exit-on-fatal-akka-error, true
2018-09-17 06:53:14,608 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.class, com.salesforce.sde.flinkargusreporter.FlinkArgusReporter
2018-09-17 06:53:14,608 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.datacenter, CRZ
2018-09-17 06:53:14,608 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.tcp.timeout, 60s
2018-09-17 06:53:14,608 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.pod, na1
2018-09-17 06:53:14,609 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 20000m
2018-09-17 06:53:14,609 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.client.timeout, 600s


The TaskManager Logs 
2018-09-17 17:29:10,919 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -  Starting YARN TaskExecutor runner (Version: 1.6.0, Rev:<unknown>, Date:<unknown>)
2018-09-17 17:29:10,919 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -  OS current user: yarn
2018-09-17 17:29:11,312 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -  Current Hadoop/Kerberos user: hello-world-app
2018-09-17 17:29:11,313 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-09-17 17:29:11,313 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -  Maximum heap size: 7410 MiBytes
2018-09-17 17:29:11,313 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -  JAVA_HOME: /usr/java/jdk1.8.0_181-amd64
2018-09-17 17:29:11,315 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -  Hadoop version: 2.6.0-cdh5.11.2
2018-09-17 17:29:11,315 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -  JVM Options:
2018-09-17 17:29:11,315 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -     -Xms7731m
2018-09-17 17:29:11,315 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -     -Xmx7731m
2018-09-17 17:29:11,316 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -     -XX:MaxDirectMemorySize=12749m
2018-09-17 17:29:11,316 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -     -Dlog.file=/fastdata-0/yarn/container-logs/application_1536964973951_0247/container_e31_1536964973951_0247_01_1497943/taskmanager.log
2018-09-17 17:29:11,316 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -     -Dlogback.configurationFile=file:./logback.xml
2018-09-17 17:29:11,316 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -     -Dlog4j.configuration=file:./log4j.properties


2018-09-17 17:29:11,320 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  - Current working Directory: /fastdata-0/yarn/nm/usercache/hello-world-app/appcache/application_1536964973951_0247/container_e31_1536964973951_0247_01_1497943
2018-09-17 17:29:11,320 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  - TM: remote keytab path obtained null
2018-09-17 17:29:11,320 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  - TM: remote keytab principal obtained null
2018-09-17 17:29:11,323 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: historyserver.web.address, hello-world9-1-crz
2018-09-17 17:29:11,323 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: historyserver.web.port, 8082
2018-09-17 17:29:11,323 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.lookup.timeout, 600s
2018-09-17 17:29:11,323 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.cluster-id, application_1536964973951_0247
2018-09-17 17:29:11,323 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, hello-world4-30-ops.net
2018-09-17 17:29:11,323 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.network.memory.max, 12000000000
2018-09-17 17:29:11,324 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.network.request-backoff.max, 30000
2018-09-17 17:29:11,324 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink/test
2018-09-17 17:29:11,324 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs:///streaming-searches/test/recovery
2018-09-17 17:29:11,324 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.watch.heartbeat.pause, 120s
2018-09-17 17:29:11,324 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: yarn.application-attempts, 10
2018-09-17 17:29:11,324 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporters, FlinkArgusReporter
2018-09-17 17:29:11,324 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: yarn.reallocate-failed, true
2018-09-17 17:29:11,325 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 5
2018-09-17 17:29:11,325 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.ask.timeout, 600s
2018-09-17 17:29:11,325 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: historyserver.archive.fs.dir, hdfs:///streaming-searches/test/completed-jobs/
2018-09-17 17:29:11,325 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.network.memory.min, 8000000000
2018-09-17 17:29:11,325 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 20000m
2018-09-17 17:29:11,326 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.archive.fs.dir, hdfs:///streaming-searches/test/completed-jobs/
2018-09-17 17:29:11,326 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.port, 0
2018-09-17 17:29:11,326 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.superpod, DEV
2018-09-17 17:29:11,326 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.tmpdir, /tmp/flink-web-7f966eeb-f7b2-4d5b-bbb7-12a0c1e9c2fd
2018-09-17 17:29:11,326 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: historyserver.archive.fs.refresh-interval, 10000
2018-09-17 17:29:11,326 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 41135
2018-09-17 17:29:11,326 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.preallocate, false
2018-09-17 17:29:11,327 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.service_name, flink-argus-service-test
2018-09-17 17:29:11,327 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 0
2018-09-17 17:29:11,327 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.whitelist, _Custom_Source.,.Sink-_Unnamed.,JVM,Network,jobmanager
2018-09-17 17:29:11,327 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.framesize, 2000000000b
2018-09-17 17:29:11,327 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, hello-world3-2-ops.net:2181,hello-world4-1-ops.net:2181,hello-world7-1-ops.net:2181,hello-world8-1-ops.net:2181,hello-world9-1-ops.net:2181
2018-09-17 17:29:11,327 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.funnel_url, http://ajna0-funnel1-0-prd.data.sfdc.net:80
2018-09-17 17:29:11,327 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: internal.cluster.execution-mode, DETACHED
2018-09-17 17:29:11,327 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-09-17 17:29:11,327 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.exit-on-fatal-akka-error, true
2018-09-17 17:29:11,328 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.class, com.salesforce.sde.flinkargusreporter.FlinkArgusReporter
2018-09-17 17:29:11,328 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.datacenter, CRZ
2018-09-17 17:29:11,328 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.address, hello-world4-30-ops.net
2018-09-17 17:29:11,328 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.tcp.timeout, 60s
2018-09-17 17:29:11,328 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: metrics.reporter.FlinkArgusReporter.pod, na1
2018-09-17 17:29:11,328 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 20000m
2018-09-17 17:29:11,328 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.client.timeout, 600s








On Tue, Sep 18, 2018 at 1:24 AM, Till Rohrmann <[hidden email]> wrote:
Hi Subramanya,

you can get the logs from Yarn if you enabled log aggregation. If they do not contain any TM logs, then the TMs were not started.
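
As a rough sketch of how the aggregated logs could be fetched: the YARN application ID can be derived from any container ID in the JobManager log from this thread (for example container_e31_1536964973951_0247_01_000005) and passed to the standard `yarn logs` command. The helper name `app_id_from_container` is made up for illustration, and the sketch assumes the usual container ID layout (optional epoch, cluster timestamp, application number, attempt, container number).

```shell
#!/bin/sh
# Hypothetical helper (not part of YARN or Flink): derive the application ID
# from a container ID of the form
#   container_[e<epoch>_]<clusterTimestamp>_<appNumber>_<attempt>_<container>
app_id_from_container() {
  rest="${1#container_}"
  # Drop the optional epoch prefix such as "e31_".
  case "$rest" in
    e[0-9]*_*) rest="${rest#e*_}" ;;
  esac
  cluster_ts="${rest%%_*}"      # e.g. 1536964973951
  rest="${rest#*_}"
  app_number="${rest%%_*}"      # e.g. 0247
  echo "application_${cluster_ts}_${app_number}"
}

app_id="$(app_id_from_container container_e31_1536964973951_0247_01_000005)"
echo "$app_id"

# On a machine with the Hadoop client installed and log aggregation enabled:
#   yarn logs -applicationId "$app_id" > application.log
```

If the resulting file contains no TaskManager sections, that would match the case described above where the TMs never started.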

If Yarn started containers but you don't see them connected to Flink's ResourceManager, then the TaskManagers either did not start up or they have problems connecting to the ResourceManager. In order to debug this problem, the logs would be helpful.

You can configure the cores per container by setting `yarn.containers.vcores` in your flink-conf.yaml. If this value is not specified, then it will use the number of slots per TM.
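
The defaulting rule described here can be sketched in a few lines of shell. This only models the rule as stated in this message; `effective_vcores` is a made-up helper, not a Flink function, and the value 2 is an arbitrary example.

```shell
#!/bin/sh
# Hypothetical helper: which vcore count a container request would carry.
#   $1 = slots per TM (-ys / taskmanager.numberOfTaskSlots)
#   $2 = yarn.containers.vcores from flink-conf.yaml (optional)
effective_vcores() {
  slots="$1"
  configured="${2:-}"
  if [ -n "$configured" ]; then
    echo "$configured"
  else
    # Not configured: fall back to the number of slots per TM.
    echo "$slots"
  fi
}

effective_vcores 5      # nothing configured, as in the vCores:5 requests in the log
effective_vcores 5 2    # with "yarn.containers.vcores: 2" in flink-conf.yaml
```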

In order to debug the memory settings problem it would be helpful to either get the full logs or the configuration and the command with which you started the Flink cluster. From the log snippet it looks as if Flink only got 8GB of memory assigned.

Cheers,
Till

On Mon, Sep 17, 2018 at 11:34 PM Subramanya Suresh <[hidden email]> wrote:
I got these logs from one of the Yarn logs. I am not sure what changed in 1.6.0; I couldn't find anything relevant in the release notes. 
Looking through the code, I am not sure why the JVM heap size would be < 8GB. We start the TM with 20GB, so with the cutoff we should have totalJavaMemorySizeMB = 20GB - 5GB, i.e. 15GB, which is greater than 8GB.

2018-09-17 16:06:13,728 ERROR org.apache.flink.yarn.YarnTaskExecutorRunner                  - YARN TaskManager initialization failed.
org.apache.flink.configuration.IllegalConfigurationException: Invalid configuration value for (taskmanager.network.memory.fraction, taskmanager.network.memory.min, taskmanager.network.memory.max) : (0.1, 8000000000, 12000000000) - Network buffer memory size too large: 8000000000 >= 7769948160(maximum JVM heap size)
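
The comparison that fails in this exception can be reproduced with plain shell arithmetic. This is a minimal sketch that uses only the two byte values printed in the message; `check_network_memory` and `to_mib` are made-up helper names, not Flink APIs, and the sketch does not model how Flink arrives at the heap size.

```shell
#!/bin/sh
# Values copied from the IllegalConfigurationException above.
NETWORK_MIN_BYTES=8000000000   # taskmanager.network.memory.min
MAX_HEAP_BYTES=7769948160      # maximum JVM heap size reported in the error

# Hypothetical helper: mirrors the "network >= heap" comparison in the error text.
check_network_memory() {
  network_bytes="$1"
  heap_bytes="$2"
  if [ "$network_bytes" -ge "$heap_bytes" ]; then
    echo "too large"
  else
    echo "ok"
  fi
}

# Convert bytes to whole MiB (integer division).
to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

echo "network min: $(to_mib "$NETWORK_MIN_BYTES") MiB"
echo "max heap:    $(to_mib "$MAX_HEAP_BYTES") MiB"
check_network_memory "$NETWORK_MIN_BYTES" "$MAX_HEAP_BYTES"
```

With these numbers the configured minimum alone (about 7629 MiB) exceeds the reported heap (7410 MiB), so the check can only pass with a smaller taskmanager.network.memory.min or a larger heap.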

Please also see my questions above. 

Cheers, 

On Mon, Sep 17, 2018 at 12:19 PM, Subramanya Suresh <[hidden email]> wrote:
Thanks Till, 

"That's also the reason why you don't see registered TMs without a running job." 
> I am not sure what you mean. We see 0 TMs in Flink (attached earlier and also in the TaskManagers link) despite running/submitting the job (the RM seems to show a lot of containers though, attached). 
> I am also not sure where to get the logs from without seeing a running TM/container. 

How do I restrict the number of containers/cores per container? It seems like -ytm is just a suggestion. I assume parallelism is within the realm of a single container, so I would use 5 to say I want 5 cores within one TM? Is that again only a suggestion?
I see maxParallelism (set in code only) but that could be 8, if the parallelism I specify is 5. 

Sincerely, 

On Mon, Sep 17, 2018 at 1:01 AM, Till Rohrmann <[hidden email]> wrote:
With Flink 1.6.0 it is no longer necessary to specify the number of started containers (-yn 145). Flink will dynamically allocate containers. That's also the reason why you don't see registered TMs without a running job. Moreover, it is recommended to start every container with a single slot (no -ys 5). The parallelism should be controlled via the -p option or by the default parallelism configured in flink-conf.yaml.
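
Applied to the submit command from the original post, this advice might look like the sketch below. It drops -yn and -ys and adds -p. The parallelism value and the fallback values for the environment variables are placeholders chosen for illustration, not values from this thread. The script only prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Placeholder defaults (assumptions) so the sketch can be dry-run without a cluster.
FLINK_BINARY="${FLINK_BINARY:-flink}"
FLINK_JOB_CLASSNAME="${FLINK_JOB_CLASSNAME:-com.example.MyJob}"
FLINK_JOB_JAR="${FLINK_JOB_JAR:-my-job.jar}"
YARN_APPLICATION_NAME="${YARN_APPLICATION_NAME:-my-flink-job}"
PARALLELISM="${PARALLELISM:-10}"   # hypothetical value, set it to what the job needs

# Print the submit command without -yn (containers are allocated dynamically)
# and without -ys (one slot per container); parallelism is set with -p.
build_submit_command() {
  printf '%s run -m yarn-cluster -p %s -c %s -yst -yjm 20000 -ytm 20000 -ynm %s -d %s\n' \
    "$FLINK_BINARY" "$PARALLELISM" "$FLINK_JOB_CLASSNAME" \
    "$YARN_APPLICATION_NAME" "$FLINK_JOB_JAR"
}

build_submit_command
```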

The log snippet says that Flink started the TaskManagers. But it seems as if they could not register at the ResourceManager or could never be started. Could you check the TM logs to see what they say? If there is nothing suspicious, then it would be helpful if you could share the complete logs with us.

Cheers,
Till



On Mon, Sep 17, 2018 at 9:16 AM Subramanya Suresh <[hidden email]> wrote:
Hi, 
It was suggested here that we migrate to 1.6.0 because of the Akka/TM lost issues we were facing with 1.4.2. I got our Yarn cluster set up and launched our job with the command mentioned below.

Symptoms:
  • The CLI logs say the job is submitted, but the Yarn ResourceManager shows only 1 container allocated. The count goes up on refresh, and a subsequent refresh shows it back at 1 container. 
  • The UI consistently shows 0 TMs and 0 Slots (see attached). 
  • The exceptions tab in the UI shows the NoResourceAvailableException below. 
  • Also see the JobManager logs below. 
So I am not sure what is going on. I was able to launch the same job in 1.4.2 and immediately get the requested TMs and have the job working as it should. 
 
Job Submit Parameters:
nohup $FLINK_BINARY run \
    -m yarn-cluster \
    -c $FLINK_JOB_CLASSNAME \
    -yst \
    -ys 5 \
    -yn 145 \
    -yjm 20000 \
    -ytm 20000 \
    -ynm $YARN_APPLICATION_NAME \
    -d $FLINK_JOB_JAR \
            > $FLINK_JOB_LOGS/stdout.log \
            2> $FLINK_JOB_LOGS/stderr.log \
            & echo $! > $FLINK_JOB_LOGS/current-run.pid

Exception:
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0

Yarn JobManager Logs:

2018-09-17 06:53:18,041 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Connecting to ResourceManager akka.tcp://flink@hello-world4-30-crz.ops.sfdc.net:41135/user/resourcemanager(9a62f56ce988f5499dbe1d09bd894b8a)
2018-09-17 06:53:18,045 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Resolved ResourceManager address, beginning registration
2018-09-17 06:53:18,046 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e1678524024c0d8e7f18b917ad854418}]
2018-09-17 06:53:18,046 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Registration at ResourceManager attempt 1 (timeout=100ms)
2018-09-17 06:53:18,048 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/31462809fd71ae1c92a11a58dd2f4d24/job_manager_lock.
2018-09-17 06:53:18,048 INFO  org.apache.flink.yarn.YarnResourceManager                     - Registering job manager 8a7f0e49aa68e867ef8f058c46414d[hidden email]://flink@hello-world4-30-crz.ops.sfdc.net:41135/user/jobmanager_0 for job 31462809fd71ae1c92a11a58dd2f4d24.
2018-09-17 06:53:18,060 INFO  org.apache.flink.yarn.YarnResourceManager                     - Registered job manager 8a7f0e49aa68e867ef8f058c46414d[hidden email]://flink@hello-world4-30-crz.ops.sfdc.net:41135/user/jobmanager_0 for job 31462809fd71ae1c92a11a58dd2f4d24.
2018-09-17 06:53:18,062 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - JobManager successfully registered at ResourceManager, leader id: 9a62f56ce988f5499dbe1d09bd894b8a.
2018-09-17 06:53:18,062 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Requesting new slot [SlotRequestId{e1678524024c0d8e7f18b917ad854418}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2018-09-17 06:53:18,064 INFO  org.apache.flink.yarn.YarnResourceManager                     - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 31462809fd71ae1c92a11a58dd2f4d24 with allocation id AllocationID{8976aac24593aa0d9854fdb569c1d0ac}.
2018-09-17 06:53:18,071 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20000, vCores:5>. Number pending requests 1.
2018-09-17 06:53:23,191 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000005 - Remaining pending container requests: 1
2018-09-17 06:53:23,602 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:53:23,603 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
2018-09-17 06:53:34,193 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:53:39,696 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000006 - Remaining pending container requests: 1
2018-09-17 06:53:40,269 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:53:40,270 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
2018-09-17 06:53:45,703 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:53:51,209 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000007 - Remaining pending container requests: 1
2018-09-17 06:53:51,365 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:53:51,365 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
2018-09-17 06:53:51,383 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000009 - Remaining pending container requests: 0
2018-09-17 06:53:51,385 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000009.
2018-09-17 06:54:01,714 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:54:07,217 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000011 - Remaining pending container requests: 1
2018-09-17 06:54:07,263 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:54:07,266 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
2018-09-17 06:54:07,276 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000012 - Remaining pending container requests: 0
2018-09-17 06:54:07,276 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000012.
2018-09-17 06:54:07,276 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000013 - Remaining pending container requests: 0
2018-09-17 06:54:07,276 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000013.
2018-09-17 06:54:12,720 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:54:18,221 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000016 - Remaining pending container requests: 1
2018-09-17 06:54:18,256 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:54:18,257 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
2018-09-17 06:54:18,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000017 - Remaining pending container requests: 0
2018-09-17 06:54:18,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000017.
2018-09-17 06:54:18,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000018 - Remaining pending container requests: 0
2018-09-17 06:54:18,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000018.
2018-09-17 06:54:18,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000020 - Remaining pending container requests: 0
2018-09-17 06:54:18,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000020.
2018-09-17 06:54:28,726 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:54:34,229 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000021 - Remaining pending container requests: 1
2018-09-17 06:54:34,268 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:54:34,269 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000022 - Remaining pending container requests: 0
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000022.
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000024 - Remaining pending container requests: 0
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000024.
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000025 - Remaining pending container requests: 0
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000025.
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000028 - Remaining pending container requests: 0
2018-09-17 06:54:34,285 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000028.
2018-09-17 06:54:39,731 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:54:45,236 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_e31_1536964973951_0247_01_000042 - Remaining pending container requests: 1
2018-09-17 06:54:45,281 INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating container launch context for TaskManagers
2018-09-17 06:54:45,282 INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting TaskManagers




2018-09-17 06:58:08,291 INFO  org.apache.flink.yarn.YarnResourceManager                     - Returning excess container container_e31_1536964973951_0247_01_000595.
2018-09-17 06:58:13,403 INFO  org.apache.flink.yarn.YarnResourceManager                     - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
2018-09-17 06:58:18,045 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Pending slot request [SlotRequestId{e1678524024c0d8e7f18b917ad854418}] timed out.
2018-09-17 06:58:18,047 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job streaming-searches-test (31462809fd71ae1c92a11a58dd2f4d24) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$3(ExecutionGraph.java:984)
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:534)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
    at akka.dispatch.OnComplete.internal(Future.scala:258)
    at akka.dispatch.OnComplete.internal(Future.scala:256)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

Sincerely, 
