Flink RocksDB logs filling up disk space

hassahma
Hello,

In our production systems, the Flink RocksDB checkpoint I/O logs are filling up disk space very quickly, on the order of gigabytes, because the logging is so verbose. How can we disable or suppress these logs? The RocksDB file checkpoint.cc emits a huge number of checkpoint log lines such as:

Log(db_options.info_log, "Hard Linking %s", src_fname.c_str());

Best Regards,



Re: Flink RocksDB logs filling up disk space

Chesnay Schepler

In reply to Ahmad Hassan's message of 27/01/2020 12:22.


Re: Flink RocksDB logs filling up disk space

hassahma

Thanks Chesnay!

In reply to Chesnay Schepler's message of Mon, 27 Jan 2020 at 11:29.


Re: Flink RocksDB logs filling up disk space

Yun Tang
Hi Ahmad

Apart from setting the logger level of RocksDB, I also wonder why the RocksDB checkpoint I/O logs fill up disk space so quickly in your case. How large is the local checkpoint state, and how long is the checkpoint interval? I suspect the checkpoint interval is too short. Even if you manage to avoid writing so many logs, I don't think the current checkpoint configuration is appropriate.

Best
Yun Tang

In reply to Ahmad Hassan's message of Monday, January 27, 2020 20:22.
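
For readers looking for the logger-level setting Yun mentions: recent Flink releases document configuration keys for the RocksDB info log. A minimal sketch of a flink-conf.yaml fragment follows, assuming a Flink version that supports these keys. They are not mentioned in this thread, so check the configuration reference for your release. The size and file-count values are illustrative only.

```yaml
# flink-conf.yaml (sketch; verify these keys against the docs for your Flink version)
# Log only the RocksDB header instead of INFO-level lines such as "Hard Linking ...".
state.backend.rocksdb.log.level: HEADER_LEVEL
# Cap the size and number of RocksDB info log files (illustrative values).
state.backend.rocksdb.log.max-file-size: 25mb
state.backend.rocksdb.log.file-num: 4
```

On versions without these keys, the same RocksDB setting (the info log level on DBOptions) would typically have to be applied through a custom RocksDB options factory.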


Re: Flink RocksDB logs filling up disk space

hassahma
Hi Yun,

Thank you for pointing that out. In our production landscapes with live customers, we use a 10-second checkpoint interval, and the average checkpoint size is 7 MB. We use incremental checkpoints. If we make the checkpoint interval longer (e.g. 1 minute), the Kafka consumer lag starts increasing: over a 1-minute period the checkpoint grows, the job takes a long time to complete it, and as a result the Kafka consumer lag for our live traffic goes up. To keep the checkpoint size small, we tried a 10-second interval, which is working well; our Kafka lag does not exceed 20 messages on average. But I agree with you that a 10-second interval does not feel right and is too frequent.

Do you have any recommendations for the checkpoint interval?

Best Regards,


In reply to Yun Tang's message of Tue, 28 Jan 2020 at 07:46.
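
To make the effect of the interval on log volume concrete, here is a back-of-the-envelope sketch in Java. It compares the 10-second and 1-minute intervals from the message above with the three-minute interval suggested later in the thread. The log volume per checkpoint is a hypothetical placeholder, not a figure measured by anyone in this thread, and the class and method names are made up for illustration.

```java
import java.time.Duration;

public class CheckpointMath {
    // Number of checkpoints triggered per day for a given checkpoint interval.
    public static long checkpointsPerDay(Duration interval) {
        return Duration.ofDays(1).getSeconds() / interval.getSeconds();
    }

    // Estimated bytes of log output per day, given an assumed log volume per checkpoint.
    public static long logBytesPerDay(Duration interval, long logBytesPerCheckpoint) {
        return checkpointsPerDay(interval) * logBytesPerCheckpoint;
    }

    public static void main(String[] args) {
        // Hypothetical assumption: 100 KiB of RocksDB info log per checkpoint.
        long assumedLogBytesPerCheckpoint = 100L * 1024L;
        Duration[] intervals = {
            Duration.ofSeconds(10), Duration.ofMinutes(1), Duration.ofMinutes(3)
        };
        for (Duration interval : intervals) {
            System.out.printf("interval=%ds checkpoints/day=%d log/day=%d MiB%n",
                    interval.getSeconds(),
                    checkpointsPerDay(interval),
                    logBytesPerDay(interval, assumedLogBytesPerCheckpoint) / (1024L * 1024L));
        }
    }
}
```

Whatever the real per-checkpoint figure is, a 10-second interval triggers 18 times as many checkpoints per day as a three-minute interval, so per-checkpoint log lines scale accordingly.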


Re: Flink RocksDB logs filling up disk space

Yun Tang
Hi Ahmad

We generally recommend that users set the checkpoint interval to three minutes.
If you don't rely on keyed state being persisted, you could also disable checkpointing and let the Kafka client commit offsets automatically [1], which might be the most lightweight solution.



Best
Yun Tang

In reply to Ahmad Hassan's message of Tuesday, January 28, 2020 17:43.
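
As a rough illustration of the auto-commit option Yun describes: with checkpointing disabled, the Flink Kafka consumer falls back on the Kafka client's own periodic offset commits, which are controlled by the standard consumer settings enable.auto.commit and auto.commit.interval.ms. The sketch below only builds such a properties object. The class name, broker address, group id, and the 5-second commit interval are hypothetical, and wiring the properties into a Flink job is left out.

```java
import java.util.Properties;

public class AutoCommitConsumerConfig {
    // Builds Kafka consumer properties that let the Kafka client commit offsets periodically.
    public static Properties autoCommitProperties(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Standard Kafka consumer settings for periodic automatic offset commits.
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "5000");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker address and consumer group, for illustration only.
        Properties props = autoCommitProperties("localhost:9092", "example-group");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Offsets committed this way are not tied to any Flink state snapshot, so this only fits jobs that can tolerate that.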


Re: Flink RocksDB logs filling up disk space

hassahma
Hello Yun,

Without checkpointing the problem is even bigger: if we rely on Flink's auto-commit and a commit fails once because of an outage or a Kafka rebalance, it is never retried, which means a full outage on our live systems.

We definitely need checkpointing for other reasons too, namely high availability and state recovery.

Best,

In reply to Yun Tang's message of Tue, 28 Jan 2020 at 14:22.