Job submission timeout with no error info.

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Job submission timeout with no error info.

Fakrudeen Ali Ahmed

Hi,

 

We are submitting a Flink topology [YARN] and it fails during upload of the jar with no error info.

 

[main] INFO org.apache.flink.runtime.client.JobClient - Checking and uploading JAR files

[main] ERROR org.apache.flink.client.CliFrontend - Error while running the command.

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms

 

 

Flink UI says:

“Could not retrieve the redirect address of the current leader. Please try to refresh.”

 

We tried increasing job manager memory to 8GB and still it has the same issue. Jar size is around 190 MB but we were able to successfully run this size before. Zookeeper in Hadoop YARN cluster is healthy.

 

How do we start debugging this? Is it some dependency jar issue in our uber jar or something else?

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))

 

Reply | Threaded
Open this post in threaded view
|

Re: Job submission timeout with no error info.

Andrey Zagrebin-3
Hi Fakrudeen,

which Flink version do you use? could you share full client and job manager logs?

Best,
Andrey

On Fri, Jul 19, 2019 at 7:00 PM Fakrudeen Ali Ahmed <[hidden email]> wrote:

Hi,

 

We are submitting a Flink topology [YARN] and it fails during upload of the jar with no error info.

 

[main] INFO org.apache.flink.runtime.client.JobClient - Checking and uploading JAR files

[main] ERROR org.apache.flink.client.CliFrontend - Error while running the command.

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms

 

 

Flink UI says:

“Could not retrieve the redirect address of the current leader. Please try to refresh.”

 

We tried increasing job manager memory to 8GB and still it has the same issue. Jar size is around 190 MB but we were able to successfully run this size before. Zookeeper in Hadoop YARN cluster is healthy.

 

How do we start debugging this? Is it some dependency jar issue in our uber jar or something else?

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))

 

Reply | Threaded
Open this post in threaded view
|

Re: Job submission timeout with no error info.

Fakrudeen Ali Ahmed

Hi Andrey,

 

 

Flink  version: 1.4.2

Please find the client log attached and job manager log is at: job manager log.

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))

 

 

From: Andrey Zagrebin <[hidden email]>
Date: Friday, July 19, 2019 at 10:36 AM
To: Fakrudeen Ali Ahmed <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: Job submission timeout with no error info.

 

Hi Fakrudeen,

 

which Flink version do you use? could you share full client and job manager logs?

 

Best,

Andrey

 

On Fri, Jul 19, 2019 at 7:00 PM Fakrudeen Ali Ahmed <[hidden email]> wrote:

Hi,

 

We are submitting a Flink topology [YARN] and it fails during upload of the jar with no error info.

 

[main] INFO org.apache.flink.runtime.client.JobClient - Checking and uploading JAR files

[main] ERROR org.apache.flink.client.CliFrontend - Error while running the command.

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms

 

 

Flink UI says:

“Could not retrieve the redirect address of the current leader. Please try to refresh.”

 

We tried increasing job manager memory to 8GB and still it has the same issue. Jar size is around 190 MB but we were able to successfully run this size before. Zookeeper in Hadoop YARN cluster is healthy.

 

How do we start debugging this? Is it some dependency jar issue in our uber jar or something else?

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))

 


quality-eval-topology.txt (84K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Job submission timeout with no error info.

Andrey Zagrebin-3
Hi Fakrudeen,

Thanks for sharing the logs. Could you also try it with Flink 1.8?

Best,
Andrey

On Sat, Jul 20, 2019 at 12:44 AM Fakrudeen Ali Ahmed <[hidden email]> wrote:

Hi Andrey,

 

 

Flink  version: 1.4.2

Please find the client log attached and job manager log is at: job manager log.

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))

 

 

From: Andrey Zagrebin <[hidden email]>
Date: Friday, July 19, 2019 at 10:36 AM
To: Fakrudeen Ali Ahmed <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: Job submission timeout with no error info.

 

Hi Fakrudeen,

 

which Flink version do you use? could you share full client and job manager logs?

 

Best,

Andrey

 

On Fri, Jul 19, 2019 at 7:00 PM Fakrudeen Ali Ahmed <[hidden email]> wrote:

Hi,

 

We are submitting a Flink topology [YARN] and it fails during upload of the jar with no error info.

 

[main] INFO org.apache.flink.runtime.client.JobClient - Checking and uploading JAR files

[main] ERROR org.apache.flink.client.CliFrontend - Error while running the command.

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms

 

 

Flink UI says:

“Could not retrieve the redirect address of the current leader. Please try to refresh.”

 

We tried increasing job manager memory to 8GB and still it has the same issue. Jar size is around 190 MB but we were able to successfully run this size before. Zookeeper in Hadoop YARN cluster is healthy.

 

How do we start debugging this? Is it some dependency jar issue in our uber jar or something else?

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))

 

Reply | Threaded
Open this post in threaded view
|

Re: Job submission timeout with no error info.

Fakrudeen Ali Ahmed

Thanks Andrey.

The environment we run [Azure HD insight cluster] only supports Flink 1.4.2 now. So I can’t run with 1.8 in this environment.

I can run in a different environment with 1.8 [on Kubernetes not YARN though] and report the results.

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))

 

 

From: Andrey Zagrebin <[hidden email]>
Date: Monday, July 22, 2019 at 8:52 AM
To: Fakrudeen Ali Ahmed <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: Job submission timeout with no error info.

 

Hi Fakrudeen,

 

Thanks for sharing the logs. Could you also try it with Flink 1.8?

 

Best,

Andrey

 

On Sat, Jul 20, 2019 at 12:44 AM Fakrudeen Ali Ahmed <[hidden email]> wrote:

Hi Andrey,

 

 

Flink  version: 1.4.2

Please find the client log attached and job manager log is at: job manager log.

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))

 

 

From: Andrey Zagrebin <[hidden email]>
Date: Friday, July 19, 2019 at 10:36 AM
To: Fakrudeen Ali Ahmed <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: Job submission timeout with no error info.

 

Hi Fakrudeen,

 

which Flink version do you use? could you share full client and job manager logs?

 

Best,

Andrey

 

On Fri, Jul 19, 2019 at 7:00 PM Fakrudeen Ali Ahmed <[hidden email]> wrote:

Hi,

 

We are submitting a Flink topology [YARN] and it fails during upload of the jar with no error info.

 

[main] INFO org.apache.flink.runtime.client.JobClient - Checking and uploading JAR files

[main] ERROR org.apache.flink.client.CliFrontend - Error while running the command.

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms

 

 

Flink UI says:

“Could not retrieve the redirect address of the current leader. Please try to refresh.”

 

We tried increasing job manager memory to 8GB and still it has the same issue. Jar size is around 190 MB but we were able to successfully run this size before. Zookeeper in Hadoop YARN cluster is healthy.

 

How do we start debugging this? Is it some dependency jar issue in our uber jar or something else?

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))

 

Reply | Threaded
Open this post in threaded view
|

Re: Job submission timeout with no error info.

Fakrudeen Ali Ahmed

It turns out the actual issue was a configuration issue and we just had to pore over job manager log carefully. We were using HDFS  [really API on top of windows blob] as source and we didn’t provide the server location and it took the path prefix as the server.

Only thing here would have been Flink returning better error message instead of simply timing out.

 

Thanks Andrey for the help!

 

[flink-akka.actor.default-dispatcher-4] ERROR org.apache.hadoop.fs.azure.AzureNativeFileSystemStore - Service returned StorageException when checking existence of container $root in account s3[flink-akka.actor.default-dispatcher-4] ERROR org.apache.hadoop.fs.azure.AzureNativeFileSystemStore - Service returned StorageException when checking existence of container $root in account s3com.microsoft.azure.storage.StorageException:  at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:87) at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:209) at com.microsoft.azure.storage.blob.CloudBlobContainer.exists(CloudBlobContainer.java:769) at com.microsoft.azure.storage.blob.CloudBlobContainer.exists(CloudBlobContainer.java:756) at org.apache.hadoop.fs.azure.StorageInterfaceImpl$CloudBlobContainerWrapperImpl.exists(StorageInterfaceImpl.java:233) at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.connectUsingAnonymousCredentials(AzureNativeFileSystemStore.java:860) at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1085) at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:540) at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1354) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2796) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2830) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2812) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:390) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:265) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236) at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322) at org.apache.flink.api.java.hadoop.mapred.HadoopInputFormatBase.createInputSplits(HadoopInputFormatBase.java:150) at org.apache.flink.api.java.hadoop.mapred.HadoopInputFormatBase.createInputSplits(HadoopInputFormatBase.java:58) at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:248) at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:810) at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:180) at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1277) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:447) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.flink.runtime.clusterframework.ContaineredJobManager$$anonfun$handleContainerMessage$1.applyOrElse(ContaineredJobManager.scala:107) at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:38) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) at akka.actor.Actor$class.aroundReceive(Actor.scala:502) at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) at akka.actor.ActorCell.invoke(ActorCell.scala:495) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at akka.dispatch.Mailbox.run(Mailbox.scala:224) at akka.dispatch.Mailbox.exec(Mailbox.scala:234) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: java.net.UnknownHostException: s3 at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) at sun.net.www.http.HttpClient.<init>(HttpClient.java:242) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:115) ... 46 more

 

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))

 

 

From: Fakrudeen Ali Ahmed <[hidden email]>
Date: Monday, July 22, 2019 at 9:08 AM
To: Andrey Zagrebin <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: Job submission timeout with no error info.

 

Thanks Andrey.

The environment we run [Azure HD insight cluster] only supports Flink 1.4.2 now. So I can’t run with 1.8 in this environment.

I can run in a different environment with 1.8 [on Kubernetes not YARN though] and report the results.

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))

 

 

From: Andrey Zagrebin <[hidden email]>
Date: Monday, July 22, 2019 at 8:52 AM
To: Fakrudeen Ali Ahmed <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: Job submission timeout with no error info.

 

Hi Fakrudeen,

 

Thanks for sharing the logs. Could you also try it with Flink 1.8?

 

Best,

Andrey

 

On Sat, Jul 20, 2019 at 12:44 AM Fakrudeen Ali Ahmed <[hidden email]> wrote:

Hi Andrey,

 

 

Flink  version: 1.4.2

Please find the client log attached and job manager log is at: job manager log.

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))

 

 

From: Andrey Zagrebin <[hidden email]>
Date: Friday, July 19, 2019 at 10:36 AM
To: Fakrudeen Ali Ahmed <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: Job submission timeout with no error info.

 

Hi Fakrudeen,

 

which Flink version do you use? could you share full client and job manager logs?

 

Best,

Andrey

 

On Fri, Jul 19, 2019 at 7:00 PM Fakrudeen Ali Ahmed <[hidden email]> wrote:

Hi,

 

We are submitting a Flink topology [YARN] and it fails during upload of the jar with no error info.

 

[main] INFO org.apache.flink.runtime.client.JobClient - Checking and uploading JAR files

[main] ERROR org.apache.flink.client.CliFrontend - Error while running the command.

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms

 

 

Flink UI says:

“Could not retrieve the redirect address of the current leader. Please try to refresh.”

 

We tried increasing job manager memory to 8GB and still it has the same issue. Jar size is around 190 MB but we were able to successfully run this size before. Zookeeper in Hadoop YARN cluster is healthy.

 

How do we start debugging this? Is it some dependency jar issue in our uber jar or something else?

 

Thanks,

-Fakrudeen

(define (sqrte n xn eph) (if (> eph (abs (- n (* xn xn)))) xn (sqrte n (/ (+ xn (/ n xn)) 2) eph)))