ProgramInvocationException: Could not upload the jar files to the job manager / No space left on device

ProgramInvocationException: Could not upload the jar files to the job manager / No space left on device

Chan, Regina

Hi,

 

I’m currently submitting 50 separate jobs to a setup with 50 TaskManagers (TMs), 1 slot each. Each job has a parallelism of 1. There’s plenty of space left in my cluster and on that node, so it’s not clear to me what’s happening. Any pointers?

 

On the client side, when I try to execute, I see the following:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Could not upload the jar files to the job manager.
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
        at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:101)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)
        at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
        at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:926)
        at com.gs.ep.da.lake.refinerlib.flink.FlowData.execute(FlowData.java:143)
        at com.gs.ep.da.lake.refinerlib.flink.FlowData.flowPartialIngestionHalf(FlowData.java:107)
        at com.gs.ep.da.lake.refinerlib.flink.FlowData.call(FlowData.java:72)
        at com.gs.ep.da.lake.refinerlib.flink.FlowData.call(FlowData.java:39)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Could not upload the jar files to the job manager.
        at org.apache.flink.runtime.client.JobSubmissionClientActor$1.call(JobSubmissionClientActor.java:150)
        at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:95)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.IOException: Could not retrieve the JobManager's blob port.
        at org.apache.flink.runtime.blob.BlobClient.uploadJarFiles(BlobClient.java:745)
        at org.apache.flink.runtime.jobgraph.JobGraph.uploadUserJars(JobGraph.java:565)
        at org.apache.flink.runtime.client.JobSubmissionClientActor$1.call(JobSubmissionClientActor.java:148)
        ... 9 more
Caused by: java.io.IOException: PUT operation failed: Connection reset
        at org.apache.flink.runtime.blob.BlobClient.putInputStream(BlobClient.java:512)
        at org.apache.flink.runtime.blob.BlobClient.put(BlobClient.java:374)
        at org.apache.flink.runtime.blob.BlobClient.uploadJarFiles(BlobClient.java:771)
        at org.apache.flink.runtime.blob.BlobClient.uploadJarFiles(BlobClient.java:740)
        ... 11 more
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.flink.runtime.blob.BlobClient.putInputStream(BlobClient.java:499)
        ... 14 more

 

 

On the job manager logs I see this:

 

2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - PUT operation failed
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:345)
        at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:314)
        at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:113)
2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - PUT operation failed
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:345)
        at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:314)
        at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:113)
2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - PUT operation failed
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:345)
        at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:314)
        at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:113)
2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - PUT operation failed
java.io.IOException: No space left on device

Regina Chan

Goldman Sachs Enterprise Platforms, Data Architecture

30 Hudson Street, 37th floor | Jersey City, NY 07302 | (212) 902-5697

 

RE: ProgramInvocationException: Could not upload the jar files to the job manager / No space left on device

Chan, Regina

And if it helps, I’m running Flink 1.2.1. I also saw ticket https://issues.apache.org/jira/browse/FLINK-5828 and the failure only started happening once I ran all 50 flows at the same time. However, it looks like the issue is not with creating the cache directory but with running out of space there? But what’s in there is also tiny.

 

bash-4.1$ hdfs dfs -du -h hdfs://d191291/user/delp/.flink/application_1510733430616_2098853
1.1 K    hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/5c71e4b6-2567-4d34-98dc-73b29c502736-taskmanager-conf.yaml
1.4 K    hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/flink-conf.yaml
93.5 M   hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/flink-dist_2.10-1.2.1.jar
264.8 M  hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/lib
1.9 K    hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/log4j.properties

 

 

From: Chan, Regina [Tech]
Sent: Tuesday, December 12, 2017 1:56 AM
To: '[hidden email]'
Subject: ProgramInvocationException: Could not upload the jar files to the job manager / No space left on device


Re: ProgramInvocationException: Could not upload the jar files to the job manager / No space left on device

Nico Kruber
Hi Regina,
judging from the exception you posted, this is not about storing the
file in HDFS, but about an earlier step in which the BlobServer first
writes the incoming file to its local file system, in the directory given
by the `blob.storage.directory` configuration property. If this property
is not set or is empty, it falls back to `java.io.tmpdir`. The BlobServer
creates a subdirectory `blobStore-<UUID>` and puts incoming files into
`<storage-dir>/blobStore-<UUID>/incoming` with file names of the form
`temp-12345678` (using an atomic file counter). It seems that there is
no space left on the file system that holds this directory.
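
To make the lookup order above concrete, here is a minimal Java sketch of it. This is not Flink's actual BlobServer code: the class `BlobStorageDirSketch` and the methods `resolveStorageDir` and `incomingDir` are hypothetical names chosen for illustration, and the sketch only mirrors the fallback and path layout described in the previous paragraph.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

// Illustrative only: mirrors the described lookup, not Flink's implementation.
public class BlobStorageDirSketch {

    /**
     * Returns the configured storage directory, or the fallback
     * (typically the value of java.io.tmpdir) when the configured
     * value is null or blank.
     */
    public static Path resolveStorageDir(String configuredDir, String tmpDir) {
        if (configuredDir == null || configuredDir.trim().isEmpty()) {
            return Paths.get(tmpDir);
        }
        return Paths.get(configuredDir);
    }

    /** Builds <storage-dir>/blobStore-<UUID>/incoming for a given UUID string. */
    public static Path incomingDir(Path storageDir, String uuid) {
        return storageDir.resolve("blobStore-" + uuid).resolve("incoming");
    }

    public static void main(String[] args) {
        // Assumption: the first argument stands in for blob.storage.directory.
        String configured = args.length > 0 ? args[0] : null;
        Path storageDir = resolveStorageDir(configured, System.getProperty("java.io.tmpdir"));
        System.out.println(incomingDir(storageDir, UUID.randomUUID().toString()));
    }
}
```

Run without arguments on the JobManager host, this prints a path under that JVM's `java.io.tmpdir`, which is where uploads would land if `blob.storage.directory` is unset.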

If you set the log level to INFO, you should see a message like "Created
BLOB server storage directory ..." with the path. Can you double check
whether there is really no space left there?
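
One way to double check is a small Java program run on the JobManager host, as the same user that runs the JobManager. The sketch below uses only the standard `java.nio.file` API; the class name `BlobDirSpaceCheck` and its methods are hypothetical, and the directory to pass in is whatever path the "Created BLOB server storage directory ..." log line reports (it defaults to `java.io.tmpdir` here only as an assumption).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative helper: reports usable space on the file system holding a directory.
public class BlobDirSpaceCheck {

    /** Usable bytes on the file store that contains the given directory. */
    public static long usableBytes(Path dir) {
        try {
            FileStore store = Files.getFileStore(dir);
            return store.getUsableSpace();
        } catch (IOException e) {
            // Wrapped so callers do not have to declare a checked exception.
            throw new UncheckedIOException(e);
        }
    }

    /** True if at least requiredBytes are usable where dir lives. */
    public static boolean hasRoomFor(Path dir, long requiredBytes) {
        return usableBytes(dir) >= requiredBytes;
    }

    public static void main(String[] args) {
        // Assumption: pass the BLOB storage directory from the JobManager log.
        Path dir = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        long usableMiB = usableBytes(dir) / (1024L * 1024L);
        System.out.printf("%s: %d MiB usable%n", dir, usableMiB);
    }
}
```

Note that this only reports free bytes. "No space left on device" can also be raised when a file system runs out of inodes, which this API does not show; `df -i` on the same path would cover that case.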


Nico

On 12/12/17 08:02, Chan, Regina wrote:

> And if it helps, I’m running on flink 1.2.1. I saw this ticket:
> https://issues.apache.org/jira/browse/FLINK-5828 It only started
> happening when I was running all 50 flows at the same time. However, it
> looks like it’s not an issue with creating the cache directory but with
> running out of space there? But what’s in there is also tiny.

