On Tue, Feb 26, 2019 at 7:13 PM Richard Deurwaarder <[hidden email]> wrote:
Hello Gary,
Thank you for your response.
I'd like to use the new mode, but it does not work for me; I seem to be running into a firewall issue.
The rest.port is chosen at random when running on YARN [1]. As a result, the machine I use to deploy the job can start the Flink cluster, but it cannot submit the job on the randomly chosen port because our firewall blocks it.
Do you know if this is still the case on 1.7 and if there is any way to work around this?
Richard
On Mon, Feb 18, 2019 at 12:00 PM Richard Deurwaarder <[hidden email]> wrote:
Hello,
I am trying to upgrade our job from Flink 1.4.2 to 1.7.1, but I keep running into timeouts after submitting the job.
The Flink job runs on our Hadoop cluster and is started via YARN.
Relevant config options seem to be:
jobmanager.rpc.port: 55501
recovery.jobmanager.port: 55502
yarn.application-master.port: 55503
blob.server.port: 55504
I've seen the following behavior:
- Using the same flink-conf.yaml as we used in 1.4.2, versions 1.5.6, 1.6.3, and 1.7.1 all time out, while 1.4.2 works.
- Using 1.5.6 with "mode: legacy" (to switch off FLIP-6) works.
When the timeout happens, I get the following stack trace:
INFO | class java.time.Instant does not contain a getter for field seconds | 2019-02-18T10:16:56.815+01:00
INFO | class com.bol.fin_hdp.cm1.domain.Cm1Transportable does not contain a getter for field globalId | 2019-02-18T10:16:56.815+01:00
INFO | Submitting job 5af931bcef395a78b5af2b97e92dcffe (detached: false). | 2019-02-18T10:16:57.182+01:00
INFO | ------------------------------------------------------------ | 2019-02-18T10:29:27.527+01:00
INFO | The program finished with the following exception: | 2019-02-18T10:29:27.564+01:00
INFO | org.apache.flink.client.program.ProgramInvocationException: The main method caused an error. | 2019-02-18T10:29:27.601+01:00
INFO | at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:545) | 2019-02-18T10:29:27.638+01:00
INFO | at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420) | 2019-02-18T10:29:27.675+01:00
INFO | at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404) | 2019-02-18T10:29:27.711+01:00
INFO | at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:798) | 2019-02-18T10:29:27.747+01:00
INFO | at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:289) | 2019-02-18T10:29:27.784+01:00
INFO | at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215) | 2019-02-18T10:29:27.820+01:00
INFO | at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1035) | 2019-02-18T10:29:27.857+01:00
INFO | at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1111) | 2019-02-18T10:29:27.893+01:00
INFO | at java.security.AccessController.doPrivileged(Native Method) | 2019-02-18T10:29:27.929+01:00
INFO | at javax.security.auth.Subject.doAs(Subject.java:422) | 2019-02-18T10:29:27.968+01:00
INFO | at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) | 2019-02-18T10:29:28.004+01:00
INFO | at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) | 2019-02-18T10:29:28.040+01:00
INFO | at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1111) | 2019-02-18T10:29:28.075+01:00
INFO | Caused by: java.lang.RuntimeException: org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. | 2019-02-18T10:29:28.110+01:00
INFO | at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJob(IntervalJobStarter.java:43) | 2019-02-18T10:29:28.146+01:00
INFO | at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJobWithConfig(IntervalJobStarter.java:32) | 2019-02-18T10:29:28.182+01:00
INFO | at com.bol.fin_hdp.Main.main(Main.java:8) | 2019-02-18T10:29:28.217+01:00
INFO | at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) | 2019-02-18T10:29:28.253+01:00
INFO | at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) | 2019-02-18T10:29:28.289+01:00
INFO | at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) | 2019-02-18T10:29:28.325+01:00
INFO | at java.lang.reflect.Method.invoke(Method.java:498) | 2019-02-18T10:29:28.363+01:00
INFO | at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528) | 2019-02-18T10:29:28.400+01:00
INFO | ... 12 more | 2019-02-18T10:29:28.436+01:00
INFO | Caused by: org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. | 2019-02-18T10:29:28.473+01:00
INFO | at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:258) | 2019-02-18T10:29:28.509+01:00
INFO | at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464) | 2019-02-18T10:29:28.544+01:00
INFO | at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66) | 2019-02-18T10:29:28.581+01:00
INFO | at com.bol.fin_hdp.cm1.job.Job.execute(Job.java:54) | 2019-02-18T10:29:28.617+01:00
INFO | at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJob(IntervalJobStarter.java:41) | 2019-02-18T10:29:28.654+01:00
INFO | ... 19 more | 2019-02-18T10:29:28.693+01:00
INFO | Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph. | 2019-02-18T10:29:28.730+01:00
INFO | at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:371) | 2019-02-18T10:29:28.766+01:00
INFO | at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) | 2019-02-18T10:29:28.803+01:00
INFO | at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) | 2019-02-18T10:29:28.839+01:00
INFO | at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) | 2019-02-18T10:29:28.876+01:00
INFO | at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) | 2019-02-18T10:29:28.912+01:00
INFO | at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:216) | 2019-02-18T10:29:28.948+01:00
INFO | at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) | 2019-02-18T10:29:28.986+01:00
INFO | at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) | 2019-02-18T10:29:29.023+01:00
INFO | at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) | 2019-02-18T10:29:29.060+01:00
INFO | at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) | 2019-02-18T10:29:29.096+01:00
INFO | at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:301) | 2019-02-18T10:29:29.133+01:00
INFO | at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) | 2019-02-18T10:29:29.169+01:00
INFO | at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603) | 2019-02-18T10:29:29.206+01:00
INFO | at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563) | 2019-02-18T10:29:29.242+01:00
INFO | at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424) | 2019-02-18T10:29:29.278+01:00
INFO | at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:214) | 2019-02-18T10:29:29.315+01:00
INFO | at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) | 2019-02-18T10:29:29.352+01:00
INFO | at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) | 2019-02-18T10:29:29.388+01:00
INFO | at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) | 2019-02-18T10:29:29.424+01:00
INFO | at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) | 2019-02-18T10:29:29.460+01:00
INFO | at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) | 2019-02-18T10:29:29.496+01:00
INFO | at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) | 2019-02-18T10:29:29.532+01:00
INFO | at java.lang.Thread.run(Thread.java:748) | 2019-02-18T10:29:29.569+01:00
INFO | Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted. | 2019-02-18T10:29:29.606+01:00
INFO | at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213) | 2019-02-18T10:29:29.643+01:00
INFO | ... 17 more | 2019-02-18T10:29:29.680+01:00
INFO | Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: shd-hdp-b-slave-01... | 2019-02-18T10:29:29.717+01:00
INFO | at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) | 2019-02-18T10:29:29.753+01:00
INFO | at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) | 2019-02-18T10:29:29.789+01:00
INFO | at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943) | 2019-02-18T10:29:29.826+01:00
INFO | at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926) | 2019-02-18T10:29:29.862+01:00
INFO | ... 15 more | 2019-02-18T10:29:29.898+01:00
INFO | Caused by: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: shd-hdp-b-slave-017.example.com/some.ip.address:46500 | 2019-02-18T10:29:29.934+01:00
INFO | at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:212) | 2019-02-18T10:29:29.970+01:00
INFO | ... 7 more
Does anyone have tips on how to debug this, or on what configuration changes I need to make?
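One way to narrow this down: the innermost exception is a ConnectTimeoutException against port 46500, which is not one of the fixed ports (55501 to 55504) listed in the config above. A quick reachability check, run from the machine that submits the job, can help confirm whether that port is blocked while the configured ones are open. The following is only a minimal sketch using the plain JDK socket API; the class name PortProbe, the method canConnect, and the default host and timeout are hypothetical and not part of Flink.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Flink: checks whether a TCP port
// can be reached from the machine this program runs on.
class PortProbe {

    // Returns true if a TCP connection to host:port is established within
    // timeoutMillis; returns false on timeout, refusal, or unknown host.
    public static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // SocketTimeoutException, ConnectException and
            // UnknownHostException are all IOExceptions.
            return false;
        }
    }

    public static void main(String[] args) {
        // Usage: java PortProbe <host> <port> [<port> ...]
        // With no arguments, probe localhost on the port from the stack trace.
        String host = args.length > 0 ? args[0] : "localhost";
        int[] ports = {46500};
        if (args.length > 1) {
            ports = new int[args.length - 1];
            for (int i = 1; i < args.length; i++) {
                ports[i - 1] = Integer.parseInt(args[i]);
            }
        }
        for (int port : ports) {
            boolean ok = canConnect(host, port, 3000);
            System.out.println(host + ":" + port + " -> "
                    + (ok ? "reachable" : "not reachable"));
        }
    }
}
```

Running it against the host named in the exception with both 46500 and 55501 as arguments would show whether only the fixed ports pass the firewall. A port counts as reachable only if something is listening on it, so a "not reachable" result for a port nobody has bound does not by itself prove the firewall is at fault.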