No space left on device exception


Vishwas Siravara
Hi guys,
I have a deduplication job that runs on Flink 1.7 and keeps some state using the FsStateBackend. My TM heap size is 16 GB. I see the error below while trying to checkpoint state of about 2 GB. There is enough space available in S3; I tried uploading larger files and they were all successful. There is also enough disk space on the local file system, and the disk utility tool does not show anything suspicious. Whenever I try to start my job from the last successful checkpoint, it runs into the same error. Can someone tell me what the cause of this issue is? Many thanks.


Note: This error goes away when I delete io.tmp.dirs and restart the job from the last checkpoint, but the disk utility tool does not show much usage before deletion, so I am not able to figure out what the problem is.

2020-08-19 21:12:01,909 WARN org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory - Could not close the state stream for s3p://featuretoolkit.checkpoints/dev_dedup/9b64aafadcd6d367cfedef84706abcba/chk-189/f8e668dd-8019-4830-ab12-d48940ff5353.
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:326)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.FilterOutputStream.flush(FilterOutputStream.java:140)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
    at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3OutputStream.close(PrestoS3FileSystem.java:986)
    at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
    at org.apache.flink.fs.s3.common.hadoop.HadoopDataOutputStream.close(HadoopDataOutputStream.java:52)
    at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
    at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.close(FsCheckpointStreamFactory.java:269)
    at org.apache.flink.runtime.state.CheckpointStreamWithResultProvider.close(CheckpointStreamWithResultProvider.java:58)
    at org.apache.flink.util.IOUtils.closeQuietly(IOUtils.java:263)
    at org.apache.flink.util.IOUtils.closeAllQuietly(IOUtils.java:250)
    at org.apache.flink.util.AbstractCloseableRegistry.close(AbstractCloseableRegistry.java:122)
    at org.apache.flink.runtime.state.AsyncSnapshotCallable.closeSnapshotIO(AsyncSnapshotCallable.java:185)
    at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:84)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
    at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:47)
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Suppressed: java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:326)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
        at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
        ... 21 more


Thanks,
Vishwas 

Re: No space left on device exception

Piotr Nowojski
Hi,

As far as I know, when uploading a file to S3 the writer first needs to create some temporary files on the local disks. I would suggest double-checking all of the partitions on the local machine and monitoring the available disk space continuously while the job is running. If you are just checking the free space manually, you can easily miss the point in time when those temporary files grow too big and approach the disk's capacity, as I'm pretty sure those temporary files are cleaned up immediately after the exception you see is thrown.
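A minimal sketch of that continuous monitoring, assuming the staging files land on the /tmp partition (the path and sample count are placeholders; adjust for your setup):

```shell
#!/bin/sh
# Sample the free space (in KB) on the partition holding the presto-s3-*.tmp
# staging files. DIR=/tmp is an assumption; point it at whatever partition
# the temporary files actually land on. In practice, run this in a
# `while true; do ...; sleep 5; done` loop alongside the job.
DIR=/tmp

free_kb() {
  # POSIX df -P: second line, fourth column = available space in KB
  df -P "$1" | awk 'NR==2 {print $4}'
}

for _ in 1 2 3; do
  printf '%s free on %s: %s KB\n' "$(date +%T)" "$DIR" "$(free_kb "$DIR")"
done
```

Logging the samples to a file makes it easy to correlate a sudden drop with a checkpoint timestamp in the Flink logs.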

Piotrek

On Thu, 20 Aug 2020 at 00:56, Vishwas Siravara <[hidden email]> wrote:

Re: No space left on device exception

Vishwas Siravara
Hi Piotr,
Thank you for your suggestion. I will try that. Are the temporary files created in the directory set by io.tmp.dirs in flink-conf.yaml? Would these files be the same size as the checkpoints?


Thanks,
Vishwas 

On Thu, Aug 20, 2020 at 8:35 AM Piotr Nowojski <[hidden email]> wrote:

Re: No space left on device exception

Vishwas Siravara
Hi Piotr,
I did some analysis and realised that the temporary files for S3 checkpoints are staged in /tmp even though io.tmp.dirs is set to a different directory.

ls -lrth
drwxr-xr-x. 2 was      was     32 Aug 20 17:52 hsperfdata_was
-rw-------. 1 was      was   505M Aug 20 18:45 presto-s3-8158855975833379228.tmp
-rw-------. 1 was      was   505M Aug 20 18:45 presto-s3-7048419193714606532.tmp
drwxr-xr--. 2 root     root     6 Aug 20 18:46 hsperfdata_root
[was@sl73rspapd031 tmp]$ 
flink-conf.yaml configuration 
io.tmp.dirs: /usr/mware/flink/tmp

The /tmp partition has only 2 GB; is it possible to change the staging directory for S3 checkpoints?

Best,
Vishwas 

On Thu, Aug 20, 2020 at 10:27 AM Vishwas Siravara <[hidden email]> wrote:

Re: No space left on device exception

Chesnay Schepler
Could you try adding this to your flink-conf.yaml?

s3.staging-directory: /usr/mware/flink/tmp

On 20/08/2020 20:50, Vishwas Siravara wrote:



Re: No space left on device exception

Vishwas Siravara
Thank you Chesnay.
Yes, but I was able to change the staging directory by adding -Djava.io.tmpdir=/data/flink-1.7.2/tmp to env.java.opts in flink-conf.yaml. Do you see any problem with that?

Best,
Vishwas 

On Thu, Aug 20, 2020 at 2:01 PM Chesnay Schepler <[hidden email]> wrote:
Could you try adding this to your flink-conf.yaml?

s3.staging-directory: /usr/mware/flink/tmp

On 20/08/2020 20:50, Vishwas Siravara wrote:
Hi Piotr,
I did some analysis and realised that the temp files for s3 checkpoints are staged in /tmp although the  io.tmp.dirs is set to a different directory. 

ls -lrth
drwxr-xr-x. 2 was      was     32 Aug 20 17:52 hsperfdata_was
-rw-------. 1 was      was   505M Aug 20 18:45 presto-s3-8158855975833379228.tmp
-rw-------. 1 was      was   505M Aug 20 18:45 presto-s3-7048419193714606532.tmp
drwxr-xr--. 2 root     root     6 Aug 20 18:46 hsperfdata_root
[was@sl73rspapd031 tmp]$ 
flink-conf.yaml configuration 
io.tmp.dirs: /usr/mware/flink/tmp
The /tmp has only 2GB, is it possible to change the staging directory for s3 checkpoints ?

Best,
Vishwas 

On Thu, Aug 20, 2020 at 10:27 AM Vishwas Siravara <[hidden email]> wrote:
Hi Piotr,
Thank you for your suggestion. I will try that, are the temporary files created in the directory set in io.tmp.dirs in the flink-conf.yaml ? Would these files be the same size as checkpoints ?


Thanks,
Vishwas 

On Thu, Aug 20, 2020 at 8:35 AM Piotr Nowojski <[hidden email]> wrote:
Hi,

As far as I know when uploading a file to S3, the writer needs to first create some temporary files on the local disks. I would suggest to double check all of the partitions on the local machine and monitor available disk space continuously while the job is running. If you are just checking the free space manually, you can easily miss a point of time when you those temporary files are too big and approaching the available disk space usage, as I'm pretty sure those temporary files are cleaned up immediately after throwing this exception that you see.

Piotrek

czw., 20 sie 2020 o 00:56 Vishwas Siravara <[hidden email]> napisał(a):
Hi guys,
I have a deduplication job that runs on flink 1.7, that has some state which uses FsState backend. My TM heap size is 16 GB. I see the below error while trying to checkpoint a state of size 2GB. There is enough space available in s3, I tried to upload larger files and they were all successful. There is also enough disk space in the local file system, the disk utility tool does not show anything suspicious. Whenever I try to start my job from the last successful checkpoint , it runs into the same error. Can someone tell me what is the cause of this issue? Many thanks. 


Note: This error goes away when I delete io.tmp.dirs and restart the job from last checkpoint , but the disk utility tool does not show much usage before deletion, so I am not able to figure out what the problem is. 

2020-08-19 21:12:01,909 WARN org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory - Could not close the state stream for s3p://featuretoolkit.c heckpoints/dev_dedup/9b64aafadcd6d367cfedef84706abcba/chk-189/f8e668dd-8019-4830-ab12-d48940ff5353.
1363 java.io.IOException: No space left on device
1364 at java.io.FileOutputStream.writeBytes(Native Method)
1365 at java.io.FileOutputStream.write(FileOutputStream.java:326)
1366 at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
1367 at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
1368 at java.io.FilterOutputStream.flush(FilterOutputStream.java:140)
1369 at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3OutputStream.close(PrestoS3FileSystem.java:986)
at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
at org.apache.flink.fs.s3.common.hadoop.HadoopDataOutputStream.close(HadoopDataOutputStream.java:52)
at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.close(FsCheckpointStreamFactory.java:269)
at org.apache.flink.runtime.state.CheckpointStreamWithResultProvider.close(CheckpointStreamWithResultProvider.java:58)
at org.apache.flink.util.IOUtils.closeQuietly(IOUtils.java:263)
at org.apache.flink.util.IOUtils.closeAllQuietly(IOUtils.java:250)
at org.apache.flink.util.AbstractCloseableRegistry.close(AbstractCloseableRegistry.java:122)
at org.apache.flink.runtime.state.AsyncSnapshotCallable.closeSnapshotIO(AsyncSnapshotCallable.java:185)
at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:47)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
... 21 more


Thanks,
Vishwas 



Re: No space left on device exception

Chesnay Schepler
That should work as well.

On 20/08/2020 22:46, Vishwas Siravara wrote:
Thank you Chesnay.
Yes, but I was able to change the staging directory by adding -Djava.io.tmpdir=/data/flink-1.7.2/tmp to env.java.opts in the flink-conf.yaml file. Do you see any problem with that?

Best,
Vishwas 
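
[Editor's note: a minimal sketch of the flink-conf.yaml override described in this message. The path is the one from the thread; the underlying assumption, consistent with the observations below, is that the Presto S3 writer stages its uploads under java.io.tmpdir rather than io.tmp.dirs.]

```yaml
# Sketch only: redirect java.io.tmpdir for the Flink JVMs so that S3
# staging files land on a partition with enough free space.
# Note: env.java.opts applies to both JobManager and TaskManager JVMs.
env.java.opts: -Djava.io.tmpdir=/data/flink-1.7.2/tmp
```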

On Thu, Aug 20, 2020 at 2:01 PM Chesnay Schepler <[hidden email]> wrote:
Could you try adding this to your flink-conf.yaml?

s3.staging-directory: /usr/mware/flink/tmp
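
[Editor's note: a hedged sketch combining the two tmp-directory options discussed in this thread, for Flink 1.7 with the flink-s3-fs-presto filesystem. They are separate knobs: io.tmp.dirs covers Flink's own temporary files, while s3.staging-directory is forwarded to the Presto S3 filesystem and controls where uploads are staged locally before being sent to S3.]

```yaml
# Sketch only; paths are the ones used in this thread.
io.tmp.dirs: /usr/mware/flink/tmp
s3.staging-directory: /usr/mware/flink/tmp
```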

On 20/08/2020 20:50, Vishwas Siravara wrote:
Hi Piotr,
I did some analysis and realised that the temp files for S3 checkpoints are staged in /tmp, although io.tmp.dirs is set to a different directory.

ls -lrth
drwxr-xr-x. 2 was      was     32 Aug 20 17:52 hsperfdata_was
-rw-------. 1 was      was   505M Aug 20 18:45 presto-s3-8158855975833379228.tmp
-rw-------. 1 was      was   505M Aug 20 18:45 presto-s3-7048419193714606532.tmp
drwxr-xr--. 2 root     root     6 Aug 20 18:46 hsperfdata_root
[was@sl73rspapd031 tmp]$ 
flink-conf.yaml configuration:
io.tmp.dirs: /usr/mware/flink/tmp
The /tmp partition has only 2 GB. Is it possible to change the staging directory for S3 checkpoints?

Best,
Vishwas 

On Thu, Aug 20, 2020 at 10:27 AM Vishwas Siravara <[hidden email]> wrote:
Hi Piotr,
Thank you for your suggestion. I will try that. Are the temporary files created in the directory set by io.tmp.dirs in flink-conf.yaml? Would these files be the same size as the checkpoints?


Thanks,
Vishwas 

On Thu, Aug 20, 2020 at 8:35 AM Piotr Nowojski <[hidden email]> wrote:
Hi,

As far as I know, when uploading a file to S3 the writer first needs to create some temporary files on the local disks. I would suggest double-checking all of the partitions on the local machine and monitoring available disk space continuously while the job is running. If you only check the free space manually, you can easily miss the point in time when those temporary files grow large enough to approach the available disk space, as I'm pretty sure they are cleaned up immediately after the exception you see is thrown.
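
[Editor's note: a minimal sketch of the continuous monitoring suggested above, using only the Python standard library. The paths and interval are assumptions taken from this thread; adjust them to the partitions holding io.tmp.dirs and java.io.tmpdir.]

```python
# Poll free space on the partitions that matter while the Flink job runs,
# so a transient spike from S3 staging files is not missed.
import shutil
import time


def free_gb(path: str) -> float:
    """Return free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / (1024 ** 3)


def monitor(paths, interval_s=5.0, iterations=3):
    """Print free space for each path, `iterations` times."""
    for _ in range(iterations):
        for p in paths:
            print(f"{p}: {free_gb(p):.1f} GiB free")
        time.sleep(interval_s)


if __name__ == "__main__":
    # /tmp is where the Presto S3 writer staged its files in this thread.
    monitor(["/tmp"], interval_s=1.0, iterations=1)
```

In practice one would run something like this in a loop (or simply `watch df -h`) on every TaskManager host while a checkpoint is in progress.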

Piotrek

On Thu, 20 Aug 2020 at 00:56, Vishwas Siravara <[hidden email]> wrote:
Hi guys,
I have a deduplication job that runs on Flink 1.7 and has some state, using the FsStateBackend. My TM heap size is 16 GB. I see the below error while trying to checkpoint state of size 2 GB. There is enough space available in S3; I tried to upload larger files and they were all successful. There is also enough disk space on the local file system, and the disk utility tool does not show anything suspicious. Whenever I try to start my job from the last successful checkpoint, it runs into the same error. Can someone tell me the cause of this issue? Many thanks.


Note: This error goes away when I delete io.tmp.dirs and restart the job from the last checkpoint, but the disk utility tool did not show much usage before deletion, so I am not able to figure out what the problem is.

2020-08-19 21:12:01,909 WARN org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory - Could not close the state stream for s3p://featuretoolkit.checkpoints/dev_dedup/9b64aafadcd6d367cfedef84706abcba/chk-189/f8e668dd-8019-4830-ab12-d48940ff5353.
java.io.IOException: No space left on device
[snip: full stack trace, identical to the one quoted earlier in the thread]


Thanks,
Vishwas