DataStream API in Batch Mode job is timing out, please advise on how to adjust the parameters.

Marco Villalobos-2
I am running with one job manager and three task managers.

Each task manager receives at most 8 GB of data, but the job is timing out.

What parameters must I adjust?

Sink: back fill db sink) (15/32) (50626268d1f0d4c0833c5fa548863abd) switched from SCHEDULED to FAILED on [unassigned resource].
java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_282]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_282]
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_282]
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_282]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_282]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_282]
    at org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:223) ~[feature-LUM-3882-toledo--850a6747.jar:?]
    at org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:168) ~[feature-LUM-3882-toledo--850a6747.jar:?]
    at org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:86) ~[feature-LUM-3882-toledo--850a6747.jar:?]
    at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66) ~[feature-LUM-3882-toledo--850a6747.jar:?]
    at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:91) ~[feature-LUM-3882-toledo--850a6747.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_282]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_282]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:442) ~[feature-LUM-3882-toledo--850a6747.jar:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:209) ~[feature-LUM-3882-toledo--850a6747.jar:?]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) ~[feature-LUM-3882-toledo--850a6747.jar:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:159) ~[feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [feature-LUM-3882-toledo--850a6747.jar:?]
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [feature-LUM-3882-toledo--850a6747.jar:?]
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [feature-LUM-3882-toledo--850a6747.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [feature-LUM-3882-toledo--850a6747.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [feature-LUM-3882-toledo--850a6747.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.actor.Actor.aroundReceive(Actor.scala:517) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.actor.Actor.aroundReceive$(Actor.scala:515) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.actor.ActorCell.invoke(ActorCell.scala:561) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.dispatch.Mailbox.run(Mailbox.scala:225) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [feature-LUM-3882-toledo--850a6747.jar:?]
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [feature-LUM-3882-toledo--850a6747.jar:?]
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
    at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:86) ~[feature-LUM-3882-toledo--850a6747.jar:?]
    ... 26 more
Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 300000 ms
    at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0

Re: DataStream API in Batch Mode job is timing out, please advise on how to adjust the parameters.

Yangze Guo
Hi, Marco,

The root cause is the NoResourceAvailableException. Could you provide the
following information?
- How many slots does each TM have?
- Your job's topology. It would also help to share the job manager log.
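
As a rough way to see why the slot count matters, the sketch below compares the slots a cluster offers with the parallelism a job asks for. The three task managers come from the original post and the parallelism of 32 from the "(15/32)" in the log line. The 8 slots per task manager is a made-up value, since the thread does not state it, and the class and method names are illustrative only. The sketch also assumes all 32 subtasks need a slot at the same time, which may overestimate what a batch job actually needs.

```java
// Rough capacity check: does the cluster offer enough slots for the job?
final class SlotCapacityCheck {

    /** Slots offered by the cluster: task managers times slots per task manager. */
    static int totalSlots(int taskManagers, int slotsPerTaskManager) {
        return taskManagers * slotsPerTaskManager;
    }

    /** True when the offered slots cover the subtasks assumed to run concurrently. */
    static boolean fits(int taskManagers, int slotsPerTaskManager, int parallelism) {
        return totalSlots(taskManagers, slotsPerTaskManager) >= parallelism;
    }

    public static void main(String[] args) {
        int taskManagers = 3;        // stated in the original post
        int parallelism = 32;        // taken from "(15/32)" in the log line
        int slotsPerTaskManager = 8; // hypothetical: not stated anywhere in the thread

        System.out.println("offered slots: "
                + totalSlots(taskManagers, slotsPerTaskManager));
        System.out.println("enough for parallelism " + parallelism + ": "
                + fits(taskManagers, slotsPerTaskManager, parallelism));
    }
}
```

With these assumed numbers the cluster offers 24 slots, fewer than the 32 requested, which would be consistent with the slot request never being fulfilled.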

Best,
Yangze Guo

On Tue, May 25, 2021 at 12:10 PM Marco Villalobos
<[hidden email]> wrote:

>
> I am running with one job manager and three task managers.
>
> Each task manager is receiving at most 8 gb of data, but the job is timing out.
>
> What parameters must I adjust?

Re: DataStream API in Batch Mode job is timing out, please advise on how to adjust the parameters.

Piotr Nowojski-4
Hi Marco,

How are you starting the job? For example, are you using YARN as the resource manager? It looks like there are just not enough resources in the cluster to run this job. Assuming the cluster is correctly configured and the Task Managers are able to connect to the Job Manager (can you share the full JM/TM logs?), I would say your job is simply too large (parallelism of 32?) for the given configuration.
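
If the cluster really is too small, two ways out are to lower the job's parallelism or to give each task manager more slots (the taskmanager.numberOfTaskSlots option). The sketch below works out both numbers. The 3 task managers and the parallelism of 32 come from this thread; the 8 slots per task manager is a hypothetical value, the helper names are made up for illustration, and the arithmetic assumes every subtask needs its own slot at the same time.

```java
// Two possible remedies for a job that asks for more slots than the cluster has.
final class ParallelismOptions {

    /** Largest parallelism that fits into the slots the cluster offers. */
    static int cappedParallelism(int desired, int taskManagers, int slotsPerTaskManager) {
        return Math.min(desired, taskManagers * slotsPerTaskManager);
    }

    /** Smallest slot count per task manager that covers the desired parallelism. */
    static int minSlotsPerTaskManager(int desired, int taskManagers) {
        // Integer ceiling of desired / taskManagers.
        return (desired + taskManagers - 1) / taskManagers;
    }

    public static void main(String[] args) {
        int taskManagers = 3;        // stated in the original post
        int desired = 32;            // taken from "(15/32)" in the log line
        int slotsPerTaskManager = 8; // hypothetical: not stated anywhere in the thread

        System.out.println("parallelism that fits today: "
                + cappedParallelism(desired, taskManagers, slotsPerTaskManager));
        System.out.println("slots per task manager needed for " + desired + ": "
                + minSlotsPerTaskManager(desired, taskManagers));
    }
}
```

Under these assumptions the job would fit at a parallelism of 24, or at 32 with 11 slots on each task manager. The 300000 ms in the stack trace matches the default of Flink's slot.request.timeout option (5 minutes); raising it is unlikely to help if the slots simply do not exist.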

Best,
Piotrek

On Tue, 25 May 2021 at 06:10, Marco Villalobos <[hidden email]> wrote:
I am running with one job manager and three task managers.

Each task manager is receiving at most 8 gb of data, but the job is timing out.

What parameters must I adjust?
