I built a new Flink distribution from release-1.5 branch today. I tried running a job but get this error:

java.lang.NoClassDefFoundError: com/sun/jersey/core/util/FeaturesAndProperties

I use yarn-cluster mode.

The jersey-core jar is found in the hadoop lib on my EMR cluster, but seems like it's not used any more.

I checked that jersey-core classes are not included in the new distribution, but they were not included in my previously built flink 1.5-SNAPSHOT either, which works. Has something changed recently to cause this?

Is this a Flink bug or should I fix this by somehow explicitly telling Flink YARN app to use the hadoop lib now?

More details below if needed.

Thanks,
Juho

My launch command is basically:

flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT} -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} -yst -yD restart-strategy=fixed-delay -yD restart-strategy.fixed-delay.attempts=3 -yD "restart-strategy.fixed-delay.delay=30 s" -p ${PARALLELISM} $@

I'm also setting this to fix some classloading error (with the previous build that still works):

-yD.classloader.resolve-order=parent-first

Error stack trace:

java.lang.NoClassDefFoundError: com/sun/jersey/core/util/FeaturesAndProperties
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.getClusterDescriptor(FlinkYarnSessionCli.java:971)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createDescriptor(FlinkYarnSessionCli.java:273)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createClusterDescriptor(FlinkYarnSessionCli.java:449)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createClusterDescriptor(FlinkYarnSessionCli.java:92)
    at org.apache.fli
Command exiting with ret '31'
Hi Juho,

Can you try submitting with HADOOP_CLASSPATH=`hadoop classpath` set? [1]

For example:

HADOOP_CLASSPATH=`hadoop classpath` flink-${FLINK_VERSION}/bin/flink run [...]

Best,
Gary

On Wed, Mar 28, 2018 at 4:26 PM, Juho Autio <[hidden email]> wrote:
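Before submitting with HADOOP_CLASSPATH set as suggested above, it can help to confirm that the classpath reported by `hadoop classpath` really provides the jersey-core jar. The following is only a minimal sketch: `classpath_has_jar` is a hypothetical helper, not part of Flink or Hadoop, and it assumes the classpath is colon-separated, may contain `dir/*` wildcard entries, and that the jar's file name starts with `jersey-core`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Flink or Hadoop): succeed if any entry of a
# colon-separated classpath provides a jar whose file name starts with PREFIX.
# Entries of the form dir/* are treated as "all jars in dir".
classpath_has_jar() {
  local classpath="$1" prefix="$2" entry jar dir
  local -a entries
  # read splits on ':' without glob-expanding wildcard entries.
  IFS=':' read -r -a entries <<< "$classpath"
  for entry in "${entries[@]}"; do
    case "$entry" in
      */\*)
        # Wildcard entry: look for matching jars inside the directory.
        dir="${entry%/\*}"
        for jar in "$dir"/"$prefix"*.jar; do
          [ -e "$jar" ] && return 0
        done
        ;;
      *)
        # Plain entry: match on the file name of an existing jar.
        case "$(basename "$entry")" in
          "$prefix"*.jar) [ -e "$entry" ] && return 0 ;;
        esac
        ;;
    esac
  done
  return 1
}

# Only runs where the hadoop command is installed (e.g. on the EMR master).
if command -v hadoop >/dev/null 2>&1; then
  if classpath_has_jar "$(hadoop classpath)" jersey-core; then
    echo "jersey-core found on the Hadoop classpath"
  else
    echo "jersey-core NOT found on the Hadoop classpath"
  fi
fi
```

If the check reports that the jar is missing, exporting HADOOP_CLASSPATH alone is unlikely to resolve the NoClassDefFoundError, and the Hadoop installation itself would be the next place to look.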
Thank you. The YARN job was started now, but the Flink job itself is in some bad state.

Flink UI keeps showing status CREATED for all sub-tasks and nothing seems to be happening.

(For the record, this is what I did: export HADOOP_CLASSPATH=`hadoop classpath` – as found at https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/hadoop.html)

I found this in Job manager log:

2018-03-28 15:26:17,449 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
    at akka.dispatch.OnComplete.internal(Future.scala:258)
    at akka.dispatch.OnComplete.internal(Future.scala:256)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
    at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)

After this there was:

2018-03-28 15:26:17,521 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).

And some time after that:

2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection - GET operation failed
java.io.EOFException: Premature end of GET request
    at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
    at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)

Task manager logs don't have any errors.

Is that error about BlobServerConnection severe enough to make the job get stuck like this? How to debug this further?

Thanks!

On Wed, Mar 28, 2018 at 5:56 PM, Gary Yao <[hidden email]> wrote:
Never mind, I'll post this new problem as a new thread.
On Wed, Mar 28, 2018 at 6:35 PM, Juho Autio <[hidden email]> wrote: