rocksdb max open file descriptor issue crashed application

ApoorvK
My Flink app is crashing with a "Too many open files" error. The app currently
has 300 operators and 60 GB of state. It now opens around 35k files, up from
about 20k a few weeks ago, and that is why it crashes. I have raised the limit
on the machines and the YARN limit to 60k, hoping it will not crash again.
Please suggest an alternative solution, if there is one.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: rocksdb max open file descriptor issue crashed application

Congxian Qiu
Hi
From your description, you are using RocksDBStateBackend. The job normally keeps about 20k files open on one machine, and it crashed after suddenly opening 35k files.
Could you please share which files are open, and what the exception is? The full taskmanager.log may be helpful.

Best,
Congxian
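
One way to answer "which files are open" on Linux is to read the symlinks under /proc/<pid>/fd for the TaskManager process and bucket them by type. The sketch below is not part of Flink or RocksDB; the class name OpenFileSummary and the category names are made up for illustration, and the categories are deliberately coarse (only .sst files are counted as RocksDB files here).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a Flink or RocksDB API.
public class OpenFileSummary {

    /** Buckets one descriptor target into a coarse, illustrative category. */
    static String categorize(String target) {
        if (target.endsWith(".sst")) return "rocksdb-sst";
        if (target.startsWith("socket:")) return "socket";
        if (target.startsWith("pipe:") || target.startsWith("anon_inode:")) return "pipe-or-anon";
        if (target.endsWith(".jar")) return "jar";
        return "other";
    }

    /** Counts how many descriptor targets fall into each category. */
    static Map<String, Integer> summarize(List<String> targets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String target : targets) {
            counts.merge(categorize(target), 1, Integer::sum);
        }
        return counts;
    }

    /** Linux only: resolves every symlink under /proc/<pid>/fd. */
    static List<String> readTargets(String pid) throws IOException {
        List<String> targets = new ArrayList<>();
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(Paths.get("/proc", pid, "fd"))) {
            for (Path fd : fds) {
                try {
                    targets.add(Files.readSymbolicLink(fd).toString());
                } catch (IOException e) {
                    // The descriptor was closed between listing and reading; skip it.
                }
            }
        }
        return targets;
    }

    public static void main(String[] args) throws IOException {
        // Pass the TaskManager pid as the first argument; "self" inspects this JVM.
        String pid = args.length > 0 ? args[0] : "self";
        System.out.println(summarize(readTargets(pid)));
    }
}
```

Run it as the same user as the TaskManager (or as root), otherwise /proc/<pid>/fd is not readable. If the rocksdb-sst bucket dominates, the growth comes from state files rather than from sockets or jars.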


Re: rocksdb max open file descriptor issue crashed application

ApoorvK
Hi,

Below is the error I am getting:

2020-02-08 05:40:24,543 INFO  org.apache.flink.runtime.taskmanager.Task                    - order-steamBy-api-order-ip (3/6) (34c7b05d5a75dbbcc5718a1111cf6b18) switched from RUNNING to CANCELING.
2020-02-08 05:40:24,543 INFO  org.apache.flink.runtime.taskmanager.Task                    - Triggering cancellation of task code order-steamBy-api-order-ip (3/6) (34c7b05d5a75dbbcc5718a1111cf6b18).
2020-02-08 05:40:24,543 ERROR org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder  - Caught unexpected exception.
java.io.IOException: Error while opening RocksDB instance.
at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:74)
at org.apache.flink.contrib.streaming.state.restore.AbstractRocksDBRestoreOperation.openDB(AbstractRocksDBRestoreOperation.java:131)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOperation.java:214)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:188)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:162)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:148)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:268)
at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:520)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:291)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:307)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:740)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:291)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.rocksdb.RocksDBException: While open directory: /hadoop/yarn/local/usercache/flink/appcache/application_1580464300238_0045/flink-io-d947dea6-270b-44c0-94ca-4a49dbf02f52/job_97167effbb11a8e9ffcba36be7e4da80_op_CoStreamFlatMap_51abbbda2947171827fd9e53509c2fb4__4_6__uuid_3f8c7b20-6d17-43ad-a016-8d08f7ed9d50/db: Too many open files
at org.rocksdb.RocksDB.open(Native Method)
at org.rocksdb.RocksDB.open(RocksDB.java:286)
at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:66)
... 17 more
2020-02-08 05:40:24,544 INFO  org.apache.flink.runtime.taskmanager.Task                    - order-status-mapping-join (4/6) (4409b4e2d93f0441100f0f1575a1dcb9) switched from CANCELING to CANCELED.
2020-02-08 05:40:24,544 INFO  org.apache.flink.runtime.taskmanager.Task                    - Freeing task resources for order-status-mapping-join (4/6) (4409b4e2d93f0441100f0f1575a1dcb9).
2020-02-08 05:40:24,543 ERROR org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder  - Caught unexpected exception.
java.io.IOException: Error while opening RocksDB instance.
at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:74)
at org.apache.flink.contrib.streaming.state.restore.AbstractRocksDBRestoreOperation.openDB(AbstractRocksDBRestoreOperation.java:131)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOperation.java:214)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:188)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:162)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:148)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:268)
at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:520)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:291)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:307)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:740)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:291)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.rocksdb.RocksDBException: While opendir: /hadoop/yarn/local/usercache/flink/appcache/application_1580464300238_0045/flink-io-d947dea6-270b-44c0-94ca-4a49dbf02f52/job_97167effbb11a8e9ffcba36be7e4da80_op_CoStreamFlatMap_069308bcb6f685b62dae685c4647854e__5_6__uuid_146bf5c2-cbc9-4ae2-8fea-9f8b021b8dac/db: Too many open files
at org.rocksdb.RocksDB.open(Native Method)
at org.rocksdb.RocksDB.open(RocksDB.java:286)
at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:66)
... 17 more
2020-02-08 05:40:24,544 INFO  org.apache.flink.runtime.taskmanager.Task                    - order-status-mapping-join (5/6) (e768888bd12b78d79e7d03d7cce315be) switched from CANCELING to CANCELED.




The number of open files has now increased to 46.9k. I have set the ulimit to 60k on all the machines, but I am afraid it will exceed this before long.

Regards
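
Raising the ulimit only moves the ceiling. An alternative is to cap how many files each RocksDB instance keeps open, via RocksDB's max_open_files option, which Flink exposes as state.backend.rocksdb.files.open (please check the configuration docs for your Flink version). The sketch below only does the arithmetic for picking such a cap. The class name FdBudget, the 10k reserve and the instance count of 100 are assumptions for illustration; only the 60k limit comes from this thread.

```java
// Hypothetical helper, not a Flink or RocksDB API.
public class FdBudget {

    /**
     * Splits the descriptors left after a fixed reserve evenly across the
     * RocksDB instances in one process and returns the per-instance cap.
     */
    static int perInstanceCap(int processLimit, int reserved, int instances) {
        if (instances <= 0) {
            throw new IllegalArgumentException("instances must be positive");
        }
        int available = processLimit - reserved;
        if (available <= 0) {
            throw new IllegalArgumentException("reserve exceeds the process limit");
        }
        return available / instances;
    }

    public static void main(String[] args) {
        int processLimit = 60_000; // from the thread: ulimit raised to 60k
        int reserved = 10_000;     // assumption: sockets, jars, logs and other descriptors
        int instances = 100;       // assumption: keyed-state subtasks in one TaskManager
        int cap = perInstanceCap(processLimit, reserved, instances);
        // Prints a line in flink-conf.yaml form for the assumed numbers.
        System.out.println("state.backend.rocksdb.files.open: " + cap);
    }
}
```

The trade-off is that a capped instance has to close and reopen SST files as it goes, so reads can get slower. Count the real number of stateful subtasks per TaskManager before choosing a value.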

Re: rocksdb max open file descriptor issue crashed application

Kostas Kloudas-2
Hi Apoorv,

I am not so familiar with the internals of RocksDB or with how the number
of open files correlates with the number of (keyed) states and the
parallelism you have, but as a starting point you can have a look at
[1] for recommendations on how to tune RocksDB for large state. I am
also cc'ing Andrey, who may have more knowledge on the topic.

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#incremental-checkpoints

Cheers,
Kostas
