My Flink app is crashing with a "too many open files" error. The app currently has 300 operators and a state size of 60 GB. It has suddenly started opening around 35k files, up from 20k a few weeks ago, and that is why it crashes. I have raised the limit on the machines as well as in YARN to 60k, hoping it will not crash again.

Please suggest if there is any alternative solution for this.

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
|
Hi,

From the description, it seems you are using RocksDBStateBackend, the job normally keeps about 20k files open on one machine, and it crashed after suddenly opening 35k files.
Could you please share which files are open, and what the exception is? The full taskmanager.log may be helpful.

Best,
Congxian
|
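One way to answer the "which files are open" question is to group the TaskManager's open file descriptors by type. The following is only a minimal sketch, assuming a Linux host with /proc and permission to read the target process's fd directory. The category names and the function names (classify, summarize, open_fd_targets) are made up for illustration and are not part of Flink or RocksDB.

```python
import os
import sys
from collections import Counter

# Linux appends this marker to the link target of an unlinked-but-open file.
DELETED_SUFFIX = " (deleted)"


def classify(target):
    """Map one /proc/<pid>/fd link target to a coarse, illustrative category."""
    if target.endswith(DELETED_SUFFIX):
        target = target[: -len(DELETED_SUFFIX)]
    for prefix, label in (
        ("socket:", "socket"),
        ("pipe:", "pipe"),
        ("anon_inode:", "anon_inode"),
    ):
        if target.startswith(prefix):
            return label
    ext = os.path.splitext(target)[1].lower()
    if ext == ".sst":
        return "rocksdb_sst"  # RocksDB table files
    if ext == ".jar":
        return "jar"
    return "other"


def summarize(targets):
    """Count link targets per category."""
    return Counter(classify(t) for t in targets)


def open_fd_targets(pid):
    """Return the link targets of all file descriptors currently open in `pid`."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python fd_summary.py <taskmanager-pid>
    for label, count in summarize(open_fd_targets(int(sys.argv[1]))).most_common():
        print(f"{count:8d}  {label}")
```

If most of the descriptors fall into the .sst bucket, the growth is coming from RocksDB table files rather than from sockets or jars, which narrows down where to look next.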
Hi,

Below is the error I am getting:

2020-02-08 05:40:24,543 INFO org.apache.flink.runtime.taskmanager.Task - order-steamBy-api-order-ip (3/6) (34c7b05d5a75dbbcc5718a1111cf6b18) switched from RUNNING to CANCELING.
2020-02-08 05:40:24,543 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code order-steamBy-api-order-ip (3/6) (34c7b05d5a75dbbcc5718a1111cf6b18).
2020-02-08 05:40:24,543 ERROR org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder - Caught unexpected exception.
java.io.IOException: Error while opening RocksDB instance.
    at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:74)
    at org.apache.flink.contrib.streaming.state.restore.AbstractRocksDBRestoreOperation.openDB(AbstractRocksDBRestoreOperation.java:131)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOperation.java:214)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:188)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:162)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:148)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:268)
    at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:520)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:291)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:307)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:740)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:291)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.rocksdb.RocksDBException: While open directory: /hadoop/yarn/local/usercache/flink/appcache/application_1580464300238_0045/flink-io-d947dea6-270b-44c0-94ca-4a49dbf02f52/job_97167effbb11a8e9ffcba36be7e4da80_op_CoStreamFlatMap_51abbbda2947171827fd9e53509c2fb4__4_6__uuid_3f8c7b20-6d17-43ad-a016-8d08f7ed9d50/db: Too many open files
    at org.rocksdb.RocksDB.open(Native Method)
    at org.rocksdb.RocksDB.open(RocksDB.java:286)
    at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:66)
    ... 17 more
2020-02-08 05:40:24,544 INFO org.apache.flink.runtime.taskmanager.Task - order-status-mapping-join (4/6) (4409b4e2d93f0441100f0f1575a1dcb9) switched from CANCELING to CANCELED.
2020-02-08 05:40:24,544 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for order-status-mapping-join (4/6) (4409b4e2d93f0441100f0f1575a1dcb9).
2020-02-08 05:40:24,543 ERROR org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder - Caught unexpected exception.
java.io.IOException: Error while opening RocksDB instance.
    at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:74)
    at org.apache.flink.contrib.streaming.state.restore.AbstractRocksDBRestoreOperation.openDB(AbstractRocksDBRestoreOperation.java:131)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOperation.java:214)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:188)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:162)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:148)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:268)
    at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:520)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:291)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:307)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:740)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:291)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.rocksdb.RocksDBException: While opendir: /hadoop/yarn/local/usercache/flink/appcache/application_1580464300238_0045/flink-io-d947dea6-270b-44c0-94ca-4a49dbf02f52/job_97167effbb11a8e9ffcba36be7e4da80_op_CoStreamFlatMap_069308bcb6f685b62dae685c4647854e__5_6__uuid_146bf5c2-cbc9-4ae2-8fea-9f8b021b8dac/db: Too many open files
    at org.rocksdb.RocksDB.open(Native Method)
    at org.rocksdb.RocksDB.open(RocksDB.java:286)
    at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:66)
    ... 17 more
2020-02-08 05:40:24,544 INFO org.apache.flink.runtime.taskmanager.Task - order-status-mapping-join (5/6) (e768888bd12b78d79e7d03d7cce315be) switched from CANCELING to CANCELED.

The number of open files has now increased to 46.9k. I have set the ulimit to 60k on all the machines, but I am afraid it will exceed this in some time.

Regards
|
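Since the limit was raised both on the machines and in YARN, it may be worth confirming that the running TaskManager process actually picked up the new value, and how close the current count is to it. The sketch below is an illustration under assumptions: it expects a Linux host where /proc/<pid>/limits and /proc/<pid>/fd are readable, and the helper names (parse_max_open_files, usage_ratio) are hypothetical. The 46.9k and 60k figures in the comment are the ones reported in this thread.

```python
import os
import sys


def _to_limit(field):
    """Convert one limits field to an int, or None when it reads 'unlimited'."""
    return None if field == "unlimited" else int(field)


def parse_max_open_files(limits_text):
    """Return (soft, hard) for the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files", soft limit, hard limit, units.
            soft, hard = line.split()[3:5]
            return _to_limit(soft), _to_limit(hard)
    raise ValueError("no 'Max open files' line found")


def usage_ratio(open_count, soft_limit):
    """Fraction of the soft limit in use, e.g. 46900 open of 60000 allowed."""
    if soft_limit is None:
        return 0.0
    return open_count / soft_limit


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python fd_headroom.py <taskmanager-pid>
    pid = sys.argv[1]
    with open(f"/proc/{pid}/limits") as f:
        soft, hard = parse_max_open_files(f.read())
    open_count = len(os.listdir(f"/proc/{pid}/fd"))
    print(f"open={open_count} soft={soft} hard={hard} "
          f"used={usage_ratio(open_count, soft):.1%}")
```

If the soft limit printed here is lower than the value configured on the host, the container is probably inheriting its limit from somewhere else, and raising the host ulimit alone would not help.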
Hi Apoorv,
I am not so familiar with the internals of RocksDB or with how the number of open files correlates with the number of (keyed) states and the parallelism you have, but as a starting point you can have a look at [1] for recommendations on how to tune RocksDB for large state. I am also cc'ing Andrey, who may have more knowledge on the topic.

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#incremental-checkpoints

Cheers,
Kostas
|