Hi, I am running Mesos without DC/OS [1] and Flink on it. Whe I start my cluster I receive some messages that I suppose everything was started. However, I see 0 slats available on the Flink web dashboard. But I suppose that Mesos will allocate Slots and Task Managers dynamically. Is that right? $ ./bin/mesos-appmaster.sh & [1] 16723 flink@r03:~/flink-1.9.0$ I0906 10:22:45.080328 16943 sched.cpp:239] Version: 1.9.0 I0906 10:22:45.082672 16996 sched.cpp:343] New master detected at [hidden email]:5050 I0906 10:22:45.083276 16996 sched.cpp:363] No credentials provided. Attempting to register without authentication I0906 10:22:45.086840 16997 sched.cpp:751] Framework registered with 22f6a553-e8ac-42d4-9a90-96a8d5f002f0-0003 Then I deploy my Flink application. When I use the first command to deploy the application starts. However, the tasks remain CREATED until Flink throws a timeout exception. In other words, it never turns to RUNNING. When I use the second comman to deploy the application it does not start and I receive the exception of "Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2". The full stacktrace is below. $ /home/flink/flink-1.9.0/bin/flink run /home/flink/hello-flink-mesos/target/hello-flink-mesos.jar & $ ./bin/mesos-appmaster-job.sh run /home/flink/hello-flink-mesos/target/hello-flink-mesos.jar & [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/mesos.html#mesos-without-dcos ps.: my application runs normally on a standalone Flink cluster.------------------------------------------------------------ The program finished with the following exception: org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: 7ad8d71faaceb1ac469353452c43dc2a) at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:262) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338) at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60) at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507) at org.hello_flink_mesos.App.<init>(App.java:35) at org.hello_flink_mesos.App.main(App.java:285) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576) at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274) at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746) at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273) at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010) at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083) Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed. at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259) ... 22 more Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0, previous allocation IDs: [], execution status: completed exceptionally: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException/java.util.concurrent.CompletableFuture@b520de3[Completed exceptionally], incomplete: java.util.concurrent.CompletableFuture@36f3d30c[Not completed, 1 dependents] at org.apache.flink.runtime.executiongraph.SchedulingUtils.lambda$scheduleEager$1(SchedulingUtils.java:194) at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:633) at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.lambda$new$0(FutureUtils.java:656) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:700) at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:484) at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:380) at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:998) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at akka.actor.Actor$class.aroundReceive(Actor.scala:517) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) at akka.actor.ActorCell.invoke(ActorCell.scala:561) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) at akka.dispatch.Mailbox.run(Mailbox.scala:225) at akka.dispatch.Mailbox.exec(Mailbox.scala:235) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Thanks, Felipe |
I managed to find what was going wrong. I will write here just for the record. First, the master machine was not login automatically at itself. So I had to give permission for it. chmod og-wx ~/.ssh/authorized_keys chmod 750 $HOME Then I put the number of "mesos.resourcemanager.tasks.cpus" to be equal or less the available cores on a single node of the cluster. I am not sure about this parameter, but only after this configuration it worked. Felipe On Fri, Sep 6, 2019 at 10:36 AM Felipe Gutierrez <[hidden email]> wrote:
|
Hi Felipe, I am glad that you were able to fix the problem yourself. > But I suppose that Mesos will allocate Slots and Task Managers dynamically. > Is that right? Yes, that is the case since Flink 1.5 [1]. > Then I put the number of "mesos.resourcemanager.tasks.cpus" to be equal or > less the available cores on a single node of the cluster. I am not sure about > this parameter, but only after this configuration it worked. I would need to see JobManager and Mesos logs to understand why this resolved your issue. If you do not set mesos.resourcemanager.tasks.cpus explicitly, Flink will request CPU resources equal to the number of TaskManager slots (taskmanager.numberOfTaskSlots) [2]. Maybe this value was too high in your configuration? Best, Gary [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077 [2] https://github.com/apache/flink/blob/0a405251b297109fde1f9a155eff14be4d943887/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosTaskManagerParameters.java#L344 On Tue, Sep 10, 2019 at 10:41 AM Felipe Gutierrez <[hidden email]> wrote:
|
Thanks Gary, I am compiling a new version of Mesos and when I test it again I will reply here if I found an error. On Wed, 11 Sep 2019, 09:22 Gary Yao, <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |