On the JobManager instance, everything works fine until the job switches from DEPLOYING to RUNNING. After that, once the Akka timeout expires, I see the following stack trace:
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-177004106]] after [100000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:745)
I went through the Flink code on GitHub, and all the steps required to execute a job seem to run fine. However, when the JobManager has to send the job submission acknowledgement to the Flink client that triggered the job, the JobSubmitHandler times out on the Akka dispatcher, which, as far as I understand, takes care of communicating with the job client.
The Flink job consists of 1 source (Kafka), 2 operators, and 1 sink (a custom sink). The JobManager logs are available here: https://pastebin.com/raw/3GaTtNrG
Once the dispatcher times out, all other Flink UI calls also time out with the same exception.
The following are the logs of the Flink client used to submit the job via the command line:
2019-09-28 19:34:21,321 INFO org.apache.flink.client.cli.CliFrontend - --------------------------------------------------------------------------------
2019-09-28 19:34:21,322 INFO org.apache.flink.client.cli.CliFrontend - Starting Command Line Client (Version: 1.8.0, Rev:<unknown>, Date:<unknown>)
2019-09-28 19:34:21,322 INFO org.apache.flink.client.cli.CliFrontend - OS current user: root
2019-09-28 19:34:21,322 INFO org.apache.flink.client.cli.CliFrontend - Current Hadoop/Kerberos user: <no hadoop dependency found>
2019-09-28 19:34:21,322 INFO org.apache.flink.client.cli.CliFrontend - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.5-b02
2019-09-28 19:34:21,323 INFO org.apache.flink.client.cli.CliFrontend - Maximum heap size: 2677 MiBytes
2019-09-28 19:34:21,323 INFO org.apache.flink.client.cli.CliFrontend - JAVA_HOME: (not set)
2019-09-28 19:34:21,323 INFO org.apache.flink.client.cli.CliFrontend - No Hadoop Dependency available
2019-09-28 19:34:21,323 INFO org.apache.flink.client.cli.CliFrontend - JVM Options:
2019-09-28 19:34:21,323 INFO org.apache.flink.client.cli.CliFrontend - -Dlog.file=/var/lib/fulfillment-stream-processor/flink-executables/flink-executables/log/flink-root-client-fulfillment-stream-processor-flink-task-manager-2-8047357.log
2019-09-28 19:34:21,323 INFO org.apache.flink.client.cli.CliFrontend - -Dlog4j.configuration=file:/var/lib/fulfillment-stream-processor/flink-executables/flink-executables/conf/log4j-cli.properties
2019-09-28 19:34:21,323 INFO org.apache.flink.client.cli.CliFrontend - -Dlogback.configurationFile=file:/var/lib/fulfillment-stream-processor/flink-executables/flink-executables/conf/logback.xml
2019-09-28 19:34:21,323 INFO org.apache.flink.client.cli.CliFrontend - Program Arguments:
2019-09-28 19:34:21,323 INFO org.apache.flink.client.cli.CliFrontend - run
2019-09-28 19:34:21,323 INFO org.apache.flink.client.cli.CliFrontend - -d
2019-09-28 19:34:21,323 INFO org.apache.flink.client.cli.CliFrontend - -c
2019-09-28 19:34:21,324 INFO org.apache.flink.client.cli.CliFrontend - /home/fse/flink-kafka-relayer-0.2.jar
2019-09-28 19:34:21,324 INFO org.apache.flink.client.cli.CliFrontend - Classpath: /var/lib/fulfillment-stream-processor/flink-executables/flink-executables/lib/log4j-1.2.17.jar:/var/lib/fulfillment-stream-processor/flink-executables/flink-executables/lib/slf4j-log4j12-1.7.15.jar:/var/lib/fulfillment-stream-processor/flink-executables/flink-executables/lib/flink-dist_2.11-1.8.0.jar:::
2019-09-28 19:34:21,324 INFO org.apache.flink.client.cli.CliFrontend - --------------------------------------------------------------------------------
2019-09-28 19:34:21,328 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, <job-manager-ip>
2019-09-28 19:34:21,328 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2019-09-28 19:34:21,328 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
2019-09-28 19:34:21,329 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
2019-09-28 19:34:21,329 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2019-09-28 19:34:21,329 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2019-09-28 19:34:21,329 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.jmx.class, org.apache.flink.metrics.jmx.JMXReporter
2019-09-28 19:34:21,329 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.jmx.port, 8789
2019-09-28 19:34:21,333 WARN org.apache.flink.client.cli.CliFrontend - Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/exceptions/YarnException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:259)
at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1230)
at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1190)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1115)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.exceptions.YarnException
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more
2019-09-28 19:34:21,343 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2019-09-28 19:34:21,545 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2019-09-28 19:34:21,560 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2019-09-28 19:34:21,561 INFO org.apache.flink.client.cli.CliFrontend - Running 'run' command.
2019-09-28 19:34:21,566 INFO org.apache.flink.client.cli.CliFrontend - Building program from JAR file
2019-09-28 19:34:21,744 INFO org.apache.flink.configuration.Configuration - Config uses fallback configuration key 'jobmanager.rpc.address' instead of key 'rest.address'
2019-09-28 19:34:21,896 INFO org.apache.flink.runtime.rest.RestClient - Rest client endpoint started.
2019-09-28 19:34:21,898 INFO org.apache.flink.client.cli.CliFrontend - Starting execution of program
2019-09-28 19:34:21,898 INFO org.apache.flink.client.program.rest.RestClusterClient - Starting program in interactive mode (detached: true)
2019-09-28 19:34:22,594 WARN org.apache.flink.streaming.api.environment.StreamContextEnvironment - Job was executed in detached mode, the results will be available on completion.
2019-09-28 19:34:22,632 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, <job-manager-ip>
2019-09-28 19:34:22,632 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2019-09-28 19:34:22,632 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
2019-09-28 19:34:22,633 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
2019-09-28 19:34:22,633 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2019-09-28 19:34:22,633 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2019-09-28 19:34:22,633 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.jmx.class, org.apache.flink.metrics.jmx.JMXReporter
2019-09-28 19:34:22,633 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.jmx.port, 8789
2019-09-28 19:34:22,635 INFO org.apache.flink.client.program.rest.RestClusterClient - Submitting job f839aefee74aa4483ce8f8fd2e49b69e (detached: true).
2019-09-28 19:36:04,341 INFO org.apache.flink.runtime.rest.RestClient - Shutting down rest endpoint.
2019-09-28 19:36:04,343 INFO org.apache.flink.runtime.rest.RestClient - Rest endpoint shutdown complete.
2019-09-28 19:36:04,343 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: Could not submit job (JobID: f839aefee74aa4483ce8f8fd2e49b69e)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:250)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:483)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:429)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813)
at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
at org.apache.flink.client.cli.CliFrontend$$Lambda$5/1971851377.call(Unknown Source)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:388)
at org.apache.flink.client.program.rest.RestClusterClient$$Lambda$17/788892554.apply(Unknown Source)
at java.util.concurrent.CompletableFuture$ExceptionCompletion.run(CompletableFuture.java:1246)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:193)
at java.util.concurrent.CompletableFuture.internalComplete(CompletableFuture.java:210)
at java.util.concurrent.CompletableFuture$ThenApply.run(CompletableFuture.java:723)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:193)
at java.util.concurrent.CompletableFuture.internalComplete(CompletableFuture.java:210)
at java.util.concurrent.CompletableFuture$ThenCopy.run(CompletableFuture.java:1333)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:193)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2361)
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:207)
at org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$34/1092254958.accept(Unknown Source)
at java.util.concurrent.CompletableFuture$WhenCompleteCompletion.run(CompletableFuture.java:1298)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:193)
at java.util.concurrent.CompletableFuture.internalComplete(CompletableFuture.java:210)
at java.util.concurrent.CompletableFuture$AsyncCompose.exec(CompletableFuture.java:626)
at java.util.concurrent.CompletableFuture$Async.run(CompletableFuture.java:428)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Internal server error., <Exception on server side:
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-177004106]] after [100000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:745)
End of exception on server side>]
at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)
at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)
at org.apache.flink.runtime.rest.RestClient$$Lambda$33/1155836850.apply(Unknown Source)
at java.util.concurrent.CompletableFuture$AsyncCompose.exec(CompletableFuture.java:604)
... 4 more
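As a quick sanity check on the timing, the gap between the "Submitting job" line and the "Shutting down rest endpoint" line in the client log above can be compared with the 100000 ms reported in the AskTimeoutException. This is only a minimal sketch: the two timestamps and the timeout value are copied from the logs, while the variable names and the `parse` helper are my own and not part of Flink.

```python
from datetime import datetime, timedelta

# Timestamps copied from the client log above (log4j format: comma before millis).
SUBMITTED = "2019-09-28 19:34:22,635"  # "Submitting job f839aefee74aa4483ce8f8fd2e49b69e"
FAILED = "2019-09-28 19:36:04,341"     # "Shutting down rest endpoint."

# Timeout value copied from the AskTimeoutException message.
ASK_TIMEOUT_MS = 100_000


def parse(ts: str) -> datetime:
    """Parse a log4j-style timestamp (hypothetical helper, not a Flink API)."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


# Whole milliseconds between submission and the client giving up.
elapsed_ms = (parse(FAILED) - parse(SUBMITTED)) // timedelta(milliseconds=1)
overhead_ms = elapsed_ms - ASK_TIMEOUT_MS

print(f"elapsed: {elapsed_ms} ms, beyond ask timeout: {overhead_ms} ms")
```

The client fails roughly 1.7 seconds after the 100000 ms mark, which is consistent with the submission request waiting on the dispatcher for the full timeout and never receiving a reply, rather than failing early for some other reason.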
I have turned on debug logs for Flink, Akka, and Kafka, but I cannot figure out what is going wrong, partly because I have only a very basic understanding of Akka. Can someone help me with this? I am running Flink 1.8.0.