Checkpointing opening too many files


ysn2233
Hi everyone
We have a Flink job that writes files to many different HDFS directories. It opens a lot of files because of its high parallelism, and I also found that with the RocksDB state backend even more files are open during checkpointing. We use YARN to schedule the Flink job, but YARN keeps scheduling the TaskManagers onto the same machine and I cannot control it, so that DataNode comes under very high pressure and keeps throwing a "bad link" error. We have already increased the HDFS xcievers limit (dfs.datanode.max.xcievers) to 16384.

Any idea how to solve this, either by reducing the number of open files or by controlling YARN scheduling so the TaskManagers land on different machines?

Thank you very much!
regards

Shengnan


Re: Checkpointing opening too many files

Congxian Qiu
Hi
If there really are that many files that need to be uploaded to HDFS, then there is currently no way to limit the number of open files. There is an open issue[1] that aims to fix this, and a PR attached to it; you could try applying that PR to see whether it solves your problem.


ysnakie <[hidden email]> wrote on Friday, April 24, 2020 at 11:30 PM:


Re: Checkpointing opening too many files

Congxian Qiu
Hi

Yes, for your use case, if your state size is not large, you can try the FsStateBackend.
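For reference, switching to the FsStateBackend can be done in flink-conf.yaml; this is a minimal sketch, and the checkpoint path below is a placeholder you would replace with your own HDFS directory:

```yaml
# Sketch: use the filesystem state backend instead of RocksDB.
# The checkpoint directory is a placeholder; point it at your own HDFS path.
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/checkpoints
```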
Best,
Congxian


ysnakie <[hidden email]> wrote on Monday, April 27, 2020 at 3:42 PM:
Hi
If I use the FsStateBackend instead of the RocksDB state backend, will the number of open files decrease significantly? I don't have a large state size.

thanks
On 4/25/2020 13:48, [hidden email] wrote:


Re: Checkpointing opening too many files

David Anderson-3
With the FsStateBackend you could also try increasing the value of state.backend.fs.memory-threshold [1]. Only state chunks larger than this value are stored in separate files; smaller chunks go into the checkpoint metadata file. The default is 1 KB; increasing it should reduce filesystem stress for jobs with many small pieces of state.
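As a sketch, raising that threshold is a one-line change in flink-conf.yaml (the 1mb value below is illustrative; if I recall correctly, Flink versions of that era capped this option at around 1 MB):

```yaml
# Sketch: inline state chunks up to 1 MB into the checkpoint metadata file
# instead of writing each one as a separate file on HDFS.
state.backend.fs.memory-threshold: 1mb
```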


On Wed, May 6, 2020 at 12:36 PM Congxian Qiu <[hidden email]> wrote: