All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

Juho Autio
I built a new Flink distribution from release-1.5 branch yesterday.

The first time I tried to run a job with it ended up in some stalled state so that the job didn't manage to (re)start but what makes it worse is that it didn't exit as failed either.

Next time I tried running the same job (but new EMR cluster & all from scratch) it just worked normally.

On the problematic run, The YARN job was started and Flink UI was being served, but Flink UI kept showing status CREATED for all sub-tasks and nothing seemed to be happening.

I found this in Job manager log first (could be unrelated) :

2018-03-28 15:26:17,449 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:748)


After this there was:

2018-03-28 15:26:17,521 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).


And some time after that:

2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - GET operation failed
java.io.EOFException: Premature end of GET request
at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)


Task manager logs didn't have any errors.


Is that error about BlobServerConnection severe enough to make the job get stuck like this? Seems like a Flink bug, at least that it just gets stuck and doesn't either restart or make the YARN app exit as failed?



My launch command is basically:

flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT} -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} -yst -yD restart-strategy=fixed-delay -yD restart-strategy.fixed-delay.attempts=3 -yD "restart-strategy.fixed-delay.delay=30 s" -p ${PARALLELISM} $@


I'm also setting this to fix some classloading error (with the previous build that still works)
-yD.classloader.resolve-order=parent-first


Cluster was AWS EMR, release 5.12.0.

Thanks.
Reply | Threaded
Open this post in threaded view
|

Re: All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

Gary Yao-2
Hi Juho,

The log message

  Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8

indicates that you do not have enough resources in your cluster left. Can you
verify that after you started the job submission the YARN cluster does not reach
its maximum capacity? You can also try submitting the job with a lower
parallelism.

I think the reason why the YARN application is not immediately shown as failed
is that your restart strategy attempts to start the job 3 times. On every
attempt the job is blocked on the slot allocation timeout for at least 300000 ms
(5 minutes). I have tried submitting examples/streaming/WordCount.jar with the
same restart strategy on EMR, and the CLI only returns after around 20 minutes.

As a side note, beginning from Flink 1.5, you do not need to specify -yn -ys
because resource allocations are dynamic by default (FLIP-6). The parameter -yst
is deprecated and should not be needed either.

Best,
Gary

On Thu, Mar 29, 2018 at 8:59 AM, Juho Autio <[hidden email]> wrote:
I built a new Flink distribution from release-1.5 branch yesterday.

The first time I tried to run a job with it ended up in some stalled state so that the job didn't manage to (re)start but what makes it worse is that it didn't exit as failed either.

Next time I tried running the same job (but new EMR cluster & all from scratch) it just worked normally.

On the problematic run, The YARN job was started and Flink UI was being served, but Flink UI kept showing status CREATED for all sub-tasks and nothing seemed to be happening.

I found this in Job manager log first (could be unrelated) :

2018-03-28 15:26:17,449 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:748)


After this there was:

2018-03-28 15:26:17,521 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).


And some time after that:

2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - GET operation failed
java.io.EOFException: Premature end of GET request
at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)


Task manager logs didn't have any errors.


Is that error about BlobServerConnection severe enough to make the job get stuck like this? Seems like a Flink bug, at least that it just gets stuck and doesn't either restart or make the YARN app exit as failed?



My launch command is basically:

flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT} -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} -yst -yD restart-strategy=fixed-delay -yD restart-strategy.fixed-delay.attempts=3 -yD "restart-strategy.fixed-delay.delay=30 s" -p ${PARALLELISM} $@


I'm also setting this to fix some classloading error (with the previous build that still works)
-yD.classloader.resolve-order=parent-first


Cluster was AWS EMR, release 5.12.0.

Thanks.

Reply | Threaded
Open this post in threaded view
|

Re: All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

Juho Autio
Thanks again, Gary.

It's true that I only let the job remain in the stuck state for something between 10-15 minutes. Then I shut down the cluster.

But: if restart strategy is being applied, shouldn't I have seen those messages in job manager log? In my case it kept all quiet since ~2018-03-28 15:27 and I terminated it at ~28-03-2018 15:36.

Do you happen to know about what that BlobServerConnection error means in the code? If it may lead into some unrecoverable state (where neither restart is attempted, nor job is failed for good)..

On Thu, Mar 29, 2018 at 4:39 PM, Gary Yao <[hidden email]> wrote:
Hi Juho,

The log message

  Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8

indicates that you do not have enough resources in your cluster left. Can you
verify that after you started the job submission the YARN cluster does not reach
its maximum capacity? You can also try submitting the job with a lower
parallelism.

I think the reason why the YARN application is not immediately shown as failed
is that your restart strategy attempts to start the job 3 times. On every
attempt the job is blocked on the slot allocation timeout for at least 300000 ms
(5 minutes). I have tried submitting examples/streaming/WordCount.jar with the
same restart strategy on EMR, and the CLI only returns after around 20 minutes.

As a side note, beginning from Flink 1.5, you do not need to specify -yn -ys
because resource allocations are dynamic by default (FLIP-6). The parameter -yst
is deprecated and should not be needed either.

Best,
Gary

On Thu, Mar 29, 2018 at 8:59 AM, Juho Autio <[hidden email]> wrote:
I built a new Flink distribution from release-1.5 branch yesterday.

The first time I tried to run a job with it ended up in some stalled state so that the job didn't manage to (re)start but what makes it worse is that it didn't exit as failed either.

Next time I tried running the same job (but new EMR cluster & all from scratch) it just worked normally.

On the problematic run, The YARN job was started and Flink UI was being served, but Flink UI kept showing status CREATED for all sub-tasks and nothing seemed to be happening.

I found this in Job manager log first (could be unrelated) :

2018-03-28 15:26:17,449 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:748)


After this there was:

2018-03-28 15:26:17,521 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).


And some time after that:

2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - GET operation failed
java.io.EOFException: Premature end of GET request
at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)


Task manager logs didn't have any errors.


Is that error about BlobServerConnection severe enough to make the job get stuck like this? Seems like a Flink bug, at least that it just gets stuck and doesn't either restart or make the YARN app exit as failed?



My launch command is basically:

flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT} -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} -yst -yD restart-strategy=fixed-delay -yD restart-strategy.fixed-delay.attempts=3 -yD "restart-strategy.fixed-delay.delay=30 s" -p ${PARALLELISM} $@


I'm also setting this to fix some classloading error (with the previous build that still works)
-yD.classloader.resolve-order=parent-first


Cluster was AWS EMR, release 5.12.0.

Thanks.


Reply | Threaded
Open this post in threaded view
|

Re: All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

Juho Autio
Sorry, my bad. I checked the persisted jobmanager logs and can see that job was still being restarted at 15:31 and then at 15:36. If I wouldn't have terminated the cluster, I believe the flink job / yarn app would've eventually exited as failed.

On Thu, Mar 29, 2018 at 4:49 PM, Juho Autio <[hidden email]> wrote:
Thanks again, Gary.

It's true that I only let the job remain in the stuck state for something between 10-15 minutes. Then I shut down the cluster.

But: if restart strategy is being applied, shouldn't I have seen those messages in job manager log? In my case it kept all quiet since ~2018-03-28 15:27 and I terminated it at ~28-03-2018 15:36.

Do you happen to know about what that BlobServerConnection error means in the code? If it may lead into some unrecoverable state (where neither restart is attempted, nor job is failed for good)..

On Thu, Mar 29, 2018 at 4:39 PM, Gary Yao <[hidden email]> wrote:
Hi Juho,

The log message

  Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8

indicates that you do not have enough resources in your cluster left. Can you
verify that after you started the job submission the YARN cluster does not reach
its maximum capacity? You can also try submitting the job with a lower
parallelism.

I think the reason why the YARN application is not immediately shown as failed
is that your restart strategy attempts to start the job 3 times. On every
attempt the job is blocked on the slot allocation timeout for at least 300000 ms
(5 minutes). I have tried submitting examples/streaming/WordCount.jar with the
same restart strategy on EMR, and the CLI only returns after around 20 minutes.

As a side note, beginning from Flink 1.5, you do not need to specify -yn -ys
because resource allocations are dynamic by default (FLIP-6). The parameter -yst
is deprecated and should not be needed either.

Best,
Gary

On Thu, Mar 29, 2018 at 8:59 AM, Juho Autio <[hidden email]> wrote:
I built a new Flink distribution from release-1.5 branch yesterday.

The first time I tried to run a job with it ended up in some stalled state so that the job didn't manage to (re)start but what makes it worse is that it didn't exit as failed either.

Next time I tried running the same job (but new EMR cluster & all from scratch) it just worked normally.

On the problematic run, The YARN job was started and Flink UI was being served, but Flink UI kept showing status CREATED for all sub-tasks and nothing seemed to be happening.

I found this in Job manager log first (could be unrelated) :

2018-03-28 15:26:17,449 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:748)


After this there was:

2018-03-28 15:26:17,521 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).


And some time after that:

2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - GET operation failed
java.io.EOFException: Premature end of GET request
at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)


Task manager logs didn't have any errors.


Is that error about BlobServerConnection severe enough to make the job get stuck like this? Seems like a Flink bug, at least that it just gets stuck and doesn't either restart or make the YARN app exit as failed?



My launch command is basically:

flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT} -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} -yst -yD restart-strategy=fixed-delay -yD restart-strategy.fixed-delay.attempts=3 -yD "restart-strategy.fixed-delay.delay=30 s" -p ${PARALLELISM} $@


I'm also setting this to fix some classloading error (with the previous build that still works)
-yD.classloader.resolve-order=parent-first


Cluster was AWS EMR, release 5.12.0.

Thanks.



Reply | Threaded
Open this post in threaded view
|

Re: All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

Gary Yao-2
Hi Juho,

Thanks for the follow up. Regarding the BlobServerConnection error, Nico (cc'ed)
might have an idea. 

Best,
Gary

On Thu, Mar 29, 2018 at 4:08 PM, Juho Autio <[hidden email]> wrote:
Sorry, my bad. I checked the persisted jobmanager logs and can see that job was still being restarted at 15:31 and then at 15:36. If I wouldn't have terminated the cluster, I believe the flink job / yarn app would've eventually exited as failed.


On Thu, Mar 29, 2018 at 4:49 PM, Juho Autio <[hidden email]> wrote:
Thanks again, Gary.

It's true that I only let the job remain in the stuck state for something between 10-15 minutes. Then I shut down the cluster.

But: if restart strategy is being applied, shouldn't I have seen those messages in job manager log? In my case it kept all quiet since ~2018-03-28 15:27 and I terminated it at ~28-03-2018 15:36.

Do you happen to know about what that BlobServerConnection error means in the code? If it may lead into some unrecoverable state (where neither restart is attempted, nor job is failed for good)..

On Thu, Mar 29, 2018 at 4:39 PM, Gary Yao <[hidden email]> wrote:
Hi Juho,

The log message

  Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8

indicates that you do not have enough resources in your cluster left. Can you
verify that after you started the job submission the YARN cluster does not reach
its maximum capacity? You can also try submitting the job with a lower
parallelism.

I think the reason why the YARN application is not immediately shown as failed
is that your restart strategy attempts to start the job 3 times. On every
attempt the job is blocked on the slot allocation timeout for at least 300000 ms
(5 minutes). I have tried submitting examples/streaming/WordCount.jar with the
same restart strategy on EMR, and the CLI only returns after around 20 minutes.

As a side note, beginning from Flink 1.5, you do not need to specify -yn -ys
because resource allocations are dynamic by default (FLIP-6). The parameter -yst
is deprecated and should not be needed either.

Best,
Gary

On Thu, Mar 29, 2018 at 8:59 AM, Juho Autio <[hidden email]> wrote:
I built a new Flink distribution from release-1.5 branch yesterday.

The first time I tried to run a job with it ended up in some stalled state so that the job didn't manage to (re)start but what makes it worse is that it didn't exit as failed either.

Next time I tried running the same job (but new EMR cluster & all from scratch) it just worked normally.

On the problematic run, The YARN job was started and Flink UI was being served, but Flink UI kept showing status CREATED for all sub-tasks and nothing seemed to be happening.

I found this in Job manager log first (could be unrelated) :

2018-03-28 15:26:17,449 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:748)


After this there was:

2018-03-28 15:26:17,521 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).


And some time after that:

2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - GET operation failed
java.io.EOFException: Premature end of GET request
at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)


Task manager logs didn't have any errors.


Is that error about BlobServerConnection severe enough to make the job get stuck like this? Seems like a Flink bug, at least that it just gets stuck and doesn't either restart or make the YARN app exit as failed?



My launch command is basically:

flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT} -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} -yst -yD restart-strategy=fixed-delay -yD restart-strategy.fixed-delay.attempts=3 -yD "restart-strategy.fixed-delay.delay=30 s" -p ${PARALLELISM} $@


I'm also setting this to fix some classloading error (with the previous build that still works)
-yD.classloader.resolve-order=parent-first


Cluster was AWS EMR, release 5.12.0.

Thanks.




Reply | Threaded
Open this post in threaded view
|

Re: All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

Nico Kruber
That BlobServerConnection error is caused by a TaskManager which
requested a BLOB (a jar file) but then closed the connection. I guess
that may happen when the job is cancelled and the TaskManager processes
are terminated.

If this is not happening during that scenario, then your TaskManager
probably died from something else, but since you didn't see anything in
the logs there, I don't think this is an issue.


Nico

On 29/03/18 16:24, Gary Yao wrote:

> Hi Juho,
>
> Thanks for the follow up. Regarding the BlobServerConnection error, Nico
> (cc'ed)
> might have an idea. 
>
> Best,
> Gary
>
> On Thu, Mar 29, 2018 at 4:08 PM, Juho Autio <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Sorry, my bad. I checked the persisted jobmanager logs and can see
>     that job was still being restarted at 15:31 and then at 15:36. If I
>     wouldn't have terminated the cluster, I believe the flink job / yarn
>     app would've eventually exited as failed.
>
>
>     On Thu, Mar 29, 2018 at 4:49 PM, Juho Autio <[hidden email]
>     <mailto:[hidden email]>> wrote:
>
>         Thanks again, Gary.
>
>         It's true that I only let the job remain in the stuck state for
>         something between 10-15 minutes. Then I shut down the cluster.
>
>         But: if restart strategy is being applied, shouldn't I have seen
>         those messages in job manager log? In my case it kept all quiet
>         since ~2018-03-28 15:27 and I terminated it at ~28-03-2018 15:36.
>
>         Do you happen to know about what that BlobServerConnection error
>         means in the code? If it may lead into some unrecoverable state
>         (where neither restart is attempted, nor job is failed for good)..
>
>         On Thu, Mar 29, 2018 at 4:39 PM, Gary Yao
>         <[hidden email] <mailto:[hidden email]>> wrote:
>
>             Hi Juho,
>
>             The log message
>
>               Could not allocate all requires slots within timeout of
>             300000 ms. Slots required: 20, slots allocated: 8
>
>             indicates that you do not have enough resources in your
>             cluster left. Can you
>             verify that after you started the job submission the YARN
>             cluster does not reach
>             its maximum capacity? You can also try submitting the job
>             with a lower
>             parallelism.
>
>             I think the reason why the YARN application is not
>             immediately shown as failed
>             is that your restart strategy attempts to start the job 3
>             times. On every
>             attempt the job is blocked on the slot allocation timeout
>             for at least 300000 ms
>             (5 minutes). I have tried submitting
>             examples/streaming/WordCount.jar with the
>             same restart strategy on EMR, and the CLI only returns after
>             around 20 minutes.
>
>             As a side note, beginning from Flink 1.5, you do not need to
>             specify -yn -ys
>             because resource allocations are dynamic by default
>             (FLIP-6). The parameter -yst
>             is deprecated and should not be needed either.
>
>             Best,
>             Gary
>
>             On Thu, Mar 29, 2018 at 8:59 AM, Juho Autio
>             <[hidden email] <mailto:[hidden email]>> wrote:
>
>                 I built a new Flink distribution from release-1.5 branch
>                 yesterday.
>
>                 The first time I tried to run a job with it ended up in
>                 some stalled state so that the job didn't manage to
>                 (re)start but what makes it worse is that it didn't exit
>                 as failed either.
>
>                 Next time I tried running the same job (but new EMR
>                 cluster & all from scratch) it just worked normally.
>
>                 On the problematic run, The YARN job was started and
>                 Flink UI was being served, but Flink UI kept showing
>                 status CREATED for all sub-tasks and nothing seemed to
>                 be happening.
>
>                 I found this in Job manager log first (could be unrelated) :
>
>                 2018-03-28 15:26:17,449 INFO 
>                 org.apache.flink.runtime.executiongraph.ExecutionGraph 
>                       - Job UniqueIdStream
>                 (43ed4ace55974d3c486452a45ee5db93) switched from state
>                 RUNNING to FAILING.
>                 org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>                 Could not allocate all requires slots within timeout of
>                 300000 ms. Slots required: 20, slots allocated: 8
>                 at
>                 org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
>                 at
>                 java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>                 at
>                 java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>                 at
>                 java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>                 at
>                 java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>                 at
>                 org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
>                 at
>                 java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>                 at
>                 java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>                 at
>                 java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>                 at
>                 java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>                 at
>                 org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
>                 at akka.dispatch.OnComplete.internal(Future.scala:258)
>                 at akka.dispatch.OnComplete.internal(Future.scala:256)
>                 at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>                 at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>                 at
>                 scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>                 at
>                 org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>                 at
>                 scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>                 at
>                 scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>                 at
>                 akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>                 at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>                 at
>                 scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>                 at
>                 scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>                 at
>                 scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>                 at
>                 akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>                 at
>                 akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>                 at
>                 akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>                 at
>                 akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>                 at java.lang.Thread.run(Thread.java:748)
>
>
>                 After this there was:
>
>                 2018-03-28 15:26:17,521 INFO 
>                 org.apache.flink.runtime.executiongraph.ExecutionGraph 
>                       - Restarting the job UniqueIdStream
>                 (43ed4ace55974d3c486452a45ee5db93).
>
>
>                 And some time after that:
>
>                 2018-03-28 15:27:39,125 ERROR
>                 org.apache.flink.runtime.blob.BlobServerConnection     
>                       - GET operation failed
>                 java.io.EOFException: Premature end of GET request
>                 at
>                 org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
>                 at
>                 org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)
>
>
>                 Task manager logs didn't have any errors.
>
>
>                 Is that error about BlobServerConnection severe enough
>                 to make the job get stuck like this? Seems like a Flink
>                 bug, at least that it just gets stuck and doesn't either
>                 restart or make the YARN app exit as failed?
>
>
>
>                 My launch command is basically:
>
>                 flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn
>                 ${NODE_COUNT} -ys ${SLOT_COUNT} -yjm
>                 ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} -yst
>                 -yD restart-strategy=fixed-delay -yD
>                 restart-strategy.fixed-delay.attempts=3 -yD
>                 "restart-strategy.fixed-delay.delay=30 s" -p
>                 ${PARALLELISM} $@
>
>
>                 I'm also setting this to fix some classloading error
>                 (with the previous build that still works)
>                 -yD.classloader.resolve-order=parent-first
>
>
>                 Cluster was AWS EMR, release 5.12.0.
>
>                 Thanks.
>
>
>
>
>
--
Nico Kruber | Software Engineer
data Artisans

Follow us @dataArtisans
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany
data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen


signature.asc (201 bytes) Download Attachment