Running Flink 1.0.0 on YARN


Ashutosh Kumar
I have a YARN setup with 1 master and 2 slaves.
When I start a YARN session with bin/yarn-session.sh -n 2 -jm 1024 -tm 1024 and submit a job with bin/flink run examples/batch/WordCount.jar, the job succeeds. Its status shows up in the YARN UI at http://x.x.x.x:8088/cluster. However, nothing shows up in the Flink UI at http://x.x.x.x:8081/#/overview.

Is this expected behavior?

If I instead run bin/flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096 examples/batch/WordCount.jar, the job fails with the following error.

      java.lang.IllegalStateException: Update task on instance 451105022ff3b4cd6e2c307e239d1595 @ slave2 - 2 slots - URL: akka.tcp://flink@x.x.x.x:43272/user/taskmanager failed due to:
        at org.apache.flink.runtime.executiongraph.Execution$6.onFailure(Execution.java:954)
        at akka.dispatch.OnFailure.internal(Future.scala:228)
        at akka.dispatch.OnFailure.internal(Future.scala:227)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
        at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25)
        at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
        at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@x.x.x.x:43272/user/taskmanager#1361901425]] after [10000 ms]
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
        at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
        at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
        at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
        at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
        at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
        at java.lang.Thread.run(Thread.java:745)

03/10/2016 08:36:46     Job execution switched to status FAILING.


Thanks
Ashutosh

Re: Running Flink 1.0.0 on YARN

rmetzger0
Hi,

The first issue you are describing is expected. Flink starts the web interface in the container running the JobManager, not on the ResourceManager host.
Also, the port is allocated dynamically to avoid port collisions, so it is not started on 8081.
However, you can access the web interface through the proxy link provided in the YARN application overview.
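
As a rough sketch of where that proxy link usually points: the ResourceManager's web proxy conventionally serves an application's tracking UI under /proxy/<application id>/ on the same port as the YARN UI (8088 in the setup described). The application ID below is a made-up placeholder, and the exact URL may differ on your cluster, so treat this as an assumption and prefer the link shown in the YARN UI.

```shell
# Host of the YARN ResourceManager UI (masked in this thread as x.x.x.x).
RM_HOST="x.x.x.x"

# Hypothetical placeholder: take the real ID from the YARN UI or `yarn application -list`.
APP_ID="application_0000000000000_0001"

# Assumed conventional proxy path: /proxy/<application id>/ on the ResourceManager web port.
PROXY_URL="http://${RM_HOST}:8088/proxy/${APP_ID}/"

# Print the URL to open in a browser.
echo "$PROXY_URL"
```

Opening the printed URL in a browser should forward to the Flink web interface of that session, wherever the JobManager container happens to run.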

Regarding the second error, can you check the log files of the TaskManager that failed (the one running on x.x.x.x:43272)?
I'm pretty sure they contain some information about why it didn't respond.
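
One possible way to get at those logs, sketched under a few assumptions: the cluster has YARN log aggregation enabled, the application has already finished, and the application ID below is replaced with the real one (the value here is a made-up placeholder). The grep pattern is only a starting point for spotting the failure.

```shell
# Hypothetical placeholder: take the real ID from the YARN UI or `yarn application -list`.
APP_ID="application_0000000000000_0001"

# `yarn logs -applicationId <id>` prints the aggregated container logs,
# which include the TaskManager logs of the Flink job.
LOG_CMD="yarn logs -applicationId ${APP_ID}"

if command -v yarn >/dev/null 2>&1; then
  # Show the first lines that look like failures, with their line numbers.
  $LOG_CMD | grep -n -i -E "exception|error" | head -n 50
else
  # No YARN client on this machine: just show the command to run on a cluster node.
  echo "yarn CLI not found; would run: $LOG_CMD"
fi
```

If log aggregation is not enabled, the same files can usually be found in the NodeManager's local log directory on the host that ran the failing container.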


On Thu, Mar 10, 2016 at 9:45 AM, Ashutosh Kumar <[hidden email]> wrote:
> [...]