I am trying to run Flink using Yarn on MapR. My previous issue got resolved and I have updated the original post accordingly so.
Accordingly, I modified pom.xml to change the zookeeper version to mapr zookeeper jar version which in my case was: 3.4.5-mapr-1604 I then built flink (flink-1.3-SNAPSHOT) as follows: mvn clean install -DskipTests -Pvendor-repos -Dhadoop.version=2.7.0-mapr-1607 The build is successfull. Then I try to run ./bin/yarn-session.sh -n 3 and get the following error: 2017-02-02 16:11:10,717 INFO org.apache.flink.yarn.YarnClusterDescriptor - Using values: 2017-02-02 16:11:10,718 INFO org.apache.flink.yarn.YarnClusterDescriptor - TaskManager count = 3 2017-02-02 16:11:10,718 INFO org.apache.flink.yarn.YarnClusterDescriptor - JobManager memory = 1024 2017-02-02 16:11:10,718 INFO org.apache.flink.yarn.YarnClusterDescriptor - TaskManager memory = 1024 2017-02-02 16:11:10,928 INFO com.mapr.util.zookeeper.ZKDataRetrieval - Process path: null. Event state: SyncConnected. Event type: None 2017-02-02 16:11:10,928 INFO com.mapr.util.zookeeper.ZKDataRetrieval - Connected to ZK: ip-10-101-2-111.ec2.internal:5181,ip-10-101-2-112.ec2.internal:5181,ip-10-101-2-113.ec2.internal:5181 2017-02-02 16:11:10,929 INFO com.mapr.util.zookeeper.ZKDataRetrieval - Getting serviceData for master node of resourcemanager 2017-02-02 16:11:10,935 INFO com.mapr.util.zookeeper.ZKDataRetrieval - Process path: null. Event state: SaslAuthenticated. Event type: None 2017-02-02 16:11:10,948 INFO org.apache.hadoop.yarn.client.MapRZKBasedRMFailoverProxyProvider - Updated RM address to ip-10-101-2-111.ec2.internal/10.101.2.111:8032 2017-02-02 16:11:11,216 WARN org.apache.flink.yarn.YarnClusterDescriptor - The configuration directory ('/home/ubuntu/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them. 2017-02-02 16:11:11,225 INFO org.apache.flink.yarn.Utils - Copying from file:///home/ubuntu/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf/log4j.properties to maprfs:/user/ubuntu/.flink/application_1485984594262_0007/log4j.properties 2017-02-02 16:11:11,249 INFO org.apache.flink.yarn.Utils - Copying from file:///home/ubuntu/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/lib to maprfs:/user/ubuntu/.flink/application_1485984594262_0007/lib 2017-02-02 16:11:11,680 INFO org.apache.flink.yarn.Utils - Copying from file:///home/ubuntu/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf/logback.xml to maprfs:/user/ubuntu/.flink/application_1485984594262_0007/logback.xml 2017-02-02 16:11:11,685 INFO org.apache.flink.yarn.Utils - Copying from file:///home/ubuntu/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/lib/flink-dist_2.10-1.3-SNAPSHOT.jar to maprfs:/user/ubuntu/.flink/application_1485984594262_0007/flink-dist_2.10-1.3-SNAPSHOT.jar 2017-02-02 16:11:12,932 INFO org.apache.flink.yarn.Utils - Copying from /home/ubuntu/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf/flink-conf.yaml to maprfs:/user/ubuntu/.flink/application_1485984594262_0007/flink-conf.yaml 2017-02-02 16:11:12,949 INFO org.apache.flink.yarn.YarnClusterDescriptor - Submitting application master application_1485984594262_0007 2017-02-02 16:11:12,977 INFO org.apache.hadoop.yarn.security.ExternalTokenManagerFactory - Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager 2017-02-02 16:11:13,195 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1485984594262_0007 2017-02-02 16:11:13,195 INFO org.apache.flink.yarn.YarnClusterDescriptor - Waiting for the cluster to be allocated 2017-02-02 16:11:13,196 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deploying cluster, current state ACCEPTED Error while deploying YARN cluster: Couldn't deploy Yarn cluster java.lang.RuntimeException: Couldn't deploy Yarn cluster at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:425) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:620) at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:476) at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:473) at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:473) Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment. Diagnostics from YARN: Application application_1485984594262_0007 failed 1 times due to AM Container for appattempt_1485984594262_0007_000001 exited with exitCode: 255 For more detailed output, check application tracking page:http://ip-10-101-2-111.ec2.internal:8088/cluster/app/application_1485984594262_0007Then, click on links to logs of each attempt. Diagnostics: Exception from container-launch. Container id: container_e05_1485984594262_0007_01_000001 Exit code: 255 Stack trace: ExitCodeException exitCode=255: at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) at org.apache.hadoop.util.Shell.run(Shell.java:456) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:304) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:354) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:87) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Shell output: main : command provided 1 main : user is ubuntu main : requested yarn user is ubuntu Container exited with a non-zero exit code 255 Failing this attempt. Failing the application. If log aggregation is enabled on your cluster, use this command to further investigate the issue: yarn logs -applicationId application_1485984594262_0007 at org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster(AbstractYarnClusterDescriptor.java:888) at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:557) at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:423) ... 9 more 2017-02-02 16:11:18,231 INFO org.apache.flink.yarn.YarnClusterDescriptor - Cancelling deployment from Deployment Failure Hook 2017-02-02 16:11:18,231 INFO org.apache.flink.yarn.YarnClusterDescriptor - Killing YARN application 2017-02-02 16:11:18,235 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Killed application application_1485984594262_0007 2017-02-02 16:11:18,336 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deleting files in maprfs:/user/ubuntu/.flink/application_1485984594262_0007 So i went ahead and checked the yarn container logs and they have the following error: Uncaught error from thread [flink-akka.remote.default-remote-dispatcher-5] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[flink] java.lang.VerifyError: (class: org/jboss/netty/channel/socket/nio/NioWorkerPool, method: createWorker signature: (Ljava/util/concurrent/Executor;)Lorg/jboss/netty/channel/socket/nio/AbstractNioWorker;) Wrong return type in function at akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:295) at akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:251) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78) at scala.util.Try$.apply(Try.scala:161) at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73) at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84) at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84) at scala.util.Success.flatMap(Try.scala:200) at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84) at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:749) at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:741) at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721) at akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:741) at akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:500) at akka.actor.Actor$class.aroundReceive(Actor.scala:467) at akka.remote.EndpointManager.aroundReceive(Remoting.scala:404) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) So, I figured this might be because of clashing netty versions between flink and MapR’s zookeeper jar. And indeed, MapR’s zookeeper jar has following version of netty <groupId>org.jboss.netty</groupId> <artifactId>netty</artifactId> <version>3.2.2.Final</version> And Flink has following: <groupId>io.netty</groupId> <artifactId>netty-all</artifactId> <version>4.0.27.Final</version> So, I changed Flink’s pom.xml to exclude netty from zookeeper dependency. <exclusion> <groupId>org.jboss.netty</groupId> <artifactId>netty</artifactId> </exclusion> Then I again ran ./bin/yarn-session.sh -n 3 and got the following error: 2017-02-02 15:44:03,540 INFO org.apache.flink.yarn.YarnClusterDescriptor - Using values: 2017-02-02 15:44:03,541 INFO org.apache.flink.yarn.YarnClusterDescriptor - TaskManager count = 3 2017-02-02 15:44:03,541 INFO org.apache.flink.yarn.YarnClusterDescriptor - JobManager memory = 1024 2017-02-02 15:44:03,541 INFO org.apache.flink.yarn.YarnClusterDescriptor - TaskManager memory = 1024 2017-02-02 15:44:03,728 INFO com.mapr.util.zookeeper.ZKDataRetrieval - Process path: null. Event state: SyncConnected. Event type: None 2017-02-02 15:44:03,728 INFO com.mapr.util.zookeeper.ZKDataRetrieval - Connected to ZK: ip-10-101-2-111.ec2.internal:5181,ip-10-101-2-112.ec2.internal:5181,ip-10-101-2-113.ec2.internal:5181 2017-02-02 15:44:03,729 INFO com.mapr.util.zookeeper.ZKDataRetrieval - Getting serviceData for master node of resourcemanager 2017-02-02 15:44:03,733 INFO com.mapr.util.zookeeper.ZKDataRetrieval - Process path: null. Event state: SaslAuthenticated. Event type: None 2017-02-02 15:44:03,745 INFO org.apache.hadoop.yarn.client.MapRZKBasedRMFailoverProxyProvider - Updated RM address to ip-10-101-2-111.ec2.internal/10.101.2.111:8032 2017-02-02 15:44:04,016 WARN org.apache.flink.yarn.YarnClusterDescriptor - The configuration directory ('/home/ubuntu/yarn_flink/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them. 2017-02-02 15:44:04,025 INFO org.apache.flink.yarn.Utils - Copying from file:///home/ubuntu/yarn_flink/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/lib to maprfs:/user/ubuntu/.flink/application_1485984594262_0006/lib 2017-02-02 15:44:04,446 INFO org.apache.flink.yarn.Utils - Copying from file:///home/ubuntu/yarn_flink/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf/logback.xml to maprfs:/user/ubuntu/.flink/application_1485984594262_0006/logback.xml 2017-02-02 15:44:04,452 INFO org.apache.flink.yarn.Utils - Copying from file:///home/ubuntu/yarn_flink/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf/log4j.properties to maprfs:/user/ubuntu/.flink/application_1485984594262_0006/log4j.properties 2017-02-02 15:44:04,457 INFO org.apache.flink.yarn.Utils - Copying from file:///home/ubuntu/yarn_flink/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/lib/flink-dist_2.10-1.3-SNAPSHOT.jar to maprfs:/user/ubuntu/.flink/application_1485984594262_0006/flink-dist_2.10-1.3-SNAPSHOT.jar 2017-02-02 15:44:05,826 INFO org.apache.flink.yarn.Utils - Copying from /home/ubuntu/yarn_flink/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf/flink-conf.yaml to maprfs:/user/ubuntu/.flink/application_1485984594262_0006/flink-conf.yaml 2017-02-02 15:44:05,842 INFO org.apache.flink.yarn.YarnClusterDescriptor - Submitting application master application_1485984594262_0006 2017-02-02 15:44:05,870 INFO org.apache.hadoop.yarn.security.ExternalTokenManagerFactory - Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager 2017-02-02 15:44:06,088 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1485984594262_0006 2017-02-02 15:44:06,089 INFO org.apache.flink.yarn.YarnClusterDescriptor - Waiting for the cluster to be allocated 2017-02-02 15:44:06,090 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deploying cluster, current state ACCEPTED Error while deploying YARN cluster: Couldn't deploy Yarn cluster java.lang.RuntimeException: Couldn't deploy Yarn cluster at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:428) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:620) at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:476) at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:473) at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:473) Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment. Diagnostics from YARN: Application application_1485984594262_0006 failed 1 times due to AM Container for appattempt_1485984594262_0006_000001 exited with exitCode: 31 For more detailed output, check application tracking page:http://ip-10-101-2-111.ec2.internal:8088/cluster/app/application_1485984594262_0006Then, click on links to logs of each attempt. Diagnostics: Exception from container-launch. Container id: container_e05_1485984594262_0006_01_000001 Exit code: 31 Stack trace: ExitCodeException exitCode=31: at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) at org.apache.hadoop.util.Shell.run(Shell.java:456) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:304) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:354) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:87) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Shell output: main : command provided 1 main : user is ubuntu main : requested yarn user is ubuntu Container exited with a non-zero exit code 31 Failing this attempt. Failing the application. If log aggregation is enabled on your cluster, use this command to further investigate the issue: yarn logs -applicationId application_1485984594262_0006 at org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster(AbstractYarnClusterDescriptor.java:891) at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:560) at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:426) ... 9 more 2017-02-02 15:44:12,635 INFO org.apache.flink.yarn.YarnClusterDescriptor - Cancelling deployment from Deployment Failure Hook 2017-02-02 15:44:12,635 INFO org.apache.flink.yarn.YarnClusterDescriptor - Killing YARN application 2017-02-02 15:44:12,641 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Killed application application_1485984594262_0006 2017-02-02 15:44:12,742 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deleting files in maprfs:/user/ubuntu/.flink/application_1485984594262_0006 So i went ahead and checked the yarn container logs again and they have the following error: 2017-02-02 15:44:11,3521 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:580 Thread: 19306 Client initialization failed due to mismatch of libraries. Please make sure that the java library version matches the native build version 5.2.0.39122.GA and native patch version $Id: mapr-version: 5.2.0.39122.GA 40967:64c8e3c8ee67 $ So, is there any way I can resolve this netty conflict to make Flink work on Yarn with MapR? Meanwhile, I am also raising this issue with MapR community. Thanks in advance |
In case anyone is having similar issues with Flink on Yarn on MapR, I managed to solve this issue with help from the MapR community.
|
Thanks for reporting the solution.
@Robert: Is that a general issue we have with Flink YARN on MapR? On 7 February 2017 at 15:50:17, ani.desh1512 ([hidden email]) wrote: > In case anyone is having similar issues with Flink on Yarn on MapR, I > managed to solve > > this issue with help from the MapR community. > > > > -- > View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Netty-issues-while-deploying-Flink-with-Yarn-on-MapR-tp11411p11486.html > Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com. > |
Hi Aniket, great analysis of the problem! Thank you for looking so well into it! Would you be interested in trying to solve the problem for Flink? We could try to provide a maven build profile that sets the correct versions and excludes. We could maybe also provide a MapR specific release of Flink in the future. I've seen many problems with Flink on MapR recently, and it would be good to fix them "forever". On Tue, Feb 7, 2017 at 4:41 PM, Ufuk Celebi <[hidden email]> wrote: Thanks for reporting the solution. |
Thanks Robert.
I would love to try to solve this problem so that future MapR and Flink users do not face these issues. Should I create a JIRA for it? Let me know how I can be of help. |
Hi, cool! Yes, creating a JIRA for the problem is a good idea. Once you've found a way to fix the issue, you can open a pull request referencing the issue. Regards, Robert On Tue, Feb 7, 2017 at 6:20 PM, ani.desh1512 <[hidden email]> wrote: Thanks Robert. |
Just FYI: There is now a documentation page on how to use Flink on MapR (Thanks to Gordon): https://ci.apache.org/
On Tue, Feb 7, 2017 at 6:34 PM, Robert Metzger <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |