Hi,

I’m currently submitting 50 separate jobs to a 50 TM, 1 slot setup. Each job has a parallelism of 1. There’s plenty of space left in my cluster and on that node. It’s not clear to me what’s happening. Any pointers?

On the client side, when I try to execute, I see the following:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Could not upload the jar files to the job manager.
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
    at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:101)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)
    at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
    at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:926)
    at com.gs.ep.da.lake.refinerlib.flink.FlowData.execute(FlowData.java:143)
    at com.gs.ep.da.lake.refinerlib.flink.FlowData.flowPartialIngestionHalf(FlowData.java:107)
    at com.gs.ep.da.lake.refinerlib.flink.FlowData.call(FlowData.java:72)
    at com.gs.ep.da.lake.refinerlib.flink.FlowData.call(FlowData.java:39)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Could not upload the jar files to the job manager.
    at org.apache.flink.runtime.client.JobSubmissionClientActor$1.call(JobSubmissionClientActor.java:150)
    at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:95)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.IOException: Could not retrieve the JobManager's blob port.
    at org.apache.flink.runtime.blob.BlobClient.uploadJarFiles(BlobClient.java:745)
    at org.apache.flink.runtime.jobgraph.JobGraph.uploadUserJars(JobGraph.java:565)
    at org.apache.flink.runtime.client.JobSubmissionClientActor$1.call(JobSubmissionClientActor.java:148)
    ... 9 more
Caused by: java.io.IOException: PUT operation failed: Connection reset
    at org.apache.flink.runtime.blob.BlobClient.putInputStream(BlobClient.java:512)
    at org.apache.flink.runtime.blob.BlobClient.put(BlobClient.java:374)
    at org.apache.flink.runtime.blob.BlobClient.uploadJarFiles(BlobClient.java:771)
    at org.apache.flink.runtime.blob.BlobClient.uploadJarFiles(BlobClient.java:740)
    ... 11 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.flink.runtime.blob.BlobClient.putInputStream(BlobClient.java:499)
    ... 14 more

In the job manager logs I see this:

2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:314)
    at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:113)
2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:314)
    at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:113)
2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:314)
    at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:113)
2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.IOException: No space left on device

Regina Chan
Goldman Sachs - Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302
(212) 902-5697
And if it helps, I’m running on Flink 1.2.1. I saw this ticket: https://issues.apache.org/jira/browse/FLINK-5828

It only started happening when I was running all 50 flows at the same time. However, it looks like it’s not an issue with creating the cache directory but with running out of space there? But what’s in there is also tiny.

bash-4.1$ hdfs dfs -du -h hdfs://d191291/user/delp/.flink/application_1510733430616_2098853
1.1 K    hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/5c71e4b6-2567-4d34-98dc-73b29c502736-taskmanager-conf.yaml
1.4 K    hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/flink-conf.yaml
93.5 M   hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/flink-dist_2.10-1.2.1.jar
264.8 M  hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/lib
1.9 K    hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/log4j.properties
Hi Regina,

judging from the exception you posted, this is not about storing the file in HDFS, but about a step before that, where the BlobServer first puts the incoming file into its local file system in the directory given by the `blob.storage.directory` configuration property. If this property is not set or empty, it will fall back to `java.io.tmpdir`.

The BlobServer creates a subdirectory `blobStore-<UUID>` and puts incoming files into `<storage-dir>/blobStore-<UUID>/incoming` with file names like `temp-12345678` (using an atomic file counter).

It seems that there is no space left in the filesystem that holds this directory. If you set the log level to INFO, you should see a message like "Created BLOB server storage directory ..." with the path. Can you double-check whether there is really no space left there?

Nico

On 12/12/17 08:02, Chan, Regina wrote:
> And if it helps, I’m running on flink 1.2.1. I saw this ticket:
> https://issues.apache.org/jira/browse/FLINK-5828
> [...]
>
> From: Chan, Regina [Tech]
> Sent: Tuesday, December 12, 2017 1:56 AM
> To: '[hidden email]'
> Subject: ProgramInvocationException: Could not upload the jar files to
> the job manager / No space left on device
> [...]
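For the check asked about in the reply, here is a minimal sketch in Java (8 or later) that could be run on the JobManager host, as the user the JobManager runs as. It is not part of Flink: the class and method names (`BlobDirSpaceCheck`, `resolveStorageDir`, `directorySize`) are made up for illustration, and it does not read `flink-conf.yaml`, so the value of `blob.storage.directory` has to be passed as the first argument. It only assumes the layout described above, i.e. `blobStore-<UUID>` subdirectories under the storage directory, with `java.io.tmpdir` as the fallback.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not a Flink class.
final class BlobDirSpaceCheck {

    // Mirrors the fallback described in the reply: use the configured
    // directory if it is set and non-empty, otherwise java.io.tmpdir.
    static Path resolveStorageDir(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return Paths.get(System.getProperty("java.io.tmpdir"));
        }
        return Paths.get(configured.trim());
    }

    // Sums the sizes of all regular files below dir (recursively).
    static long directorySize(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0], if given, is assumed to be the blob.storage.directory value.
        Path storageDir = resolveStorageDir(args.length > 0 ? args[0] : null);

        // Free space of the file system that actually holds the directory.
        FileStore store = Files.getFileStore(storageDir);
        long mb = 1024L * 1024L;
        System.out.printf("%s is on %s: %d MB usable of %d MB%n",
                storageDir, store, store.getUsableSpace() / mb, store.getTotalSpace() / mb);

        // Size of each blobStore-<UUID> directory found below the storage directory.
        try (DirectoryStream<Path> blobDirs =
                     Files.newDirectoryStream(storageDir, "blobStore-*")) {
            for (Path blobDir : blobDirs) {
                if (Files.isDirectory(blobDir)) {
                    System.out.printf("%s: %d MB%n", blobDir, directorySize(blobDir) / mb);
                }
            }
        }
    }
}
```

Running it without arguments inspects `java.io.tmpdir` of that JVM, which may differ from the JobManager's if the JobManager was started with a different `-Djava.io.tmpdir`; the path in the "Created BLOB server storage directory ..." log message is the authoritative one. Note that this looks at the local file system, which is why the `hdfs dfs -du` output can look tiny while the uploads still fail.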