Checkpointing opening too many files


ysn2233
Hi everyone
We have a Flink job that writes files to many different HDFS directories. It opens a lot of files because of its high parallelism, and I also found that with the RocksDB state backend even more files are open during checkpointing. We use YARN to schedule the Flink job, but YARN keeps scheduling the TaskManagers onto the same machine and I cannot control it, so that DataNode comes under very high pressure and keeps throwing a "bad link" error. We have already increased the HDFS xcievers limit (dfs.datanode.max.xcievers) to 16384.

Any idea how to solve this, either by reducing the number of open files or by controlling YARN scheduling so the TaskManagers land on different machines?

Thank you very much!
regards

Shengnan


Re: Checkpointing opening too many files

Congxian Qiu
Hi
If there really are that many files that need to be uploaded to HDFS, then there is currently no way to limit the number of open files. There is an open issue[1] that aims to fix this, and a PR attached to it; you could try applying that PR to see whether it solves your problem.


ysnakie <[hidden email]> wrote on Friday, April 24, 2020 at 11:30 PM:


Re: Checkpointing opening too many files

Congxian Qiu
Hi

Yes, for your use case, if your state size is not large, you can try the FsStateBackend.
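For reference, switching to the FsStateBackend can be done in flink-conf.yaml; this is a minimal sketch, and the checkpoint path below is a placeholder you would replace with your own HDFS directory:

```yaml
# Sketch: use the filesystem state backend instead of RocksDB.
# The checkpoint directory is a placeholder; point it at your own HDFS path.
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/checkpoints
```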
Best,
Congxian


ysnakie <[hidden email]> wrote on Monday, April 27, 2020 at 3:42 PM:
Hi
If I use the FsStateBackend instead of the RocksDB state backend, will the number of open files decrease significantly? I don't have a large state size.

thanks
On 4/25/2020 13:48, [hidden email] wrote:


Re: Checkpointing opening too many files

David Anderson-3
With the FsStateBackend you could also try increasing the value of state.backend.fs.memory-threshold [1]. Only state chunks larger than this value are stored in separate files; smaller chunks go into the checkpoint metadata file. The default is 1 KB; increasing it should reduce filesystem stress for jobs with many small pieces of state.
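As a sketch, raising that threshold is a one-line change in flink-conf.yaml (the 1mb value below is illustrative; if I recall correctly, Flink versions of that era capped this option at around 1 MB):

```yaml
# Sketch: inline state chunks up to 1 MB into the checkpoint metadata file
# instead of writing each one as a separate file on HDFS.
state.backend.fs.memory-threshold: 1mb
```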


On Wed, May 6, 2020 at 12:36 PM Congxian Qiu <[hidden email]> wrote: