(DEPRECATED) Apache Flink User Mailing List archive.

Checkpointing very large state in RocksDB?

Daniel Li

Checkpointing very large state in RocksDB?

When RocksDB holds very large state, is there a concern about the time it takes to checkpoint the RocksDB data to HDFS? Is asynchronous checkpointing a recommended practice here?

https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/state_backends.html

"The RocksDBStateBackend holds in-flight data in a RocksDB data base that is (per default) stored in the TaskManager data directories. Upon checkpointing, the whole RocksDB data base will be checkpointed into the configured file system and directory. Minimal metadata is stored in the JobManager’s memory (or, in high-availability mode, in the metadata checkpoint).

The RocksDBStateBackend is encouraged for:

Jobs with very large state, long windows, large key/value states.
All high-availability setups."

thx

Daniel

Aljoscha Krettek

Re: Checkpointing very large state in RocksDB?

Hi,

Are you talking about enableFullyAsyncSnapshots() in the RocksDB backend? If not, there is this switch, which is described in the JavaDoc:

/**
 * Enables fully asynchronous snapshotting of the partitioned state held in RocksDB.
 *
 * <p>By default, this is disabled. This means that RocksDB state is copied in a synchronous
 * step, during which normal processing of elements pauses, followed by an asynchronous step
 * of copying the RocksDB backup to the final checkpoint location. Fully asynchronous
 * snapshots take longer (linear time requirement with respect to number of unique keys)
 * but normal processing of elements is not paused.
 */
public void enableFullyAsyncSnapshots()

The JavaDoc also describes the implications for checkpointing time, but please let me know if I should provide more details. We should probably also expand the documentation on this.
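For reference, here is a minimal sketch of how the switch might be enabled on a job. It assumes the Flink 1.1-era API and the RocksDB state backend dependency on the classpath; the class name, the method name configure, and both parameters are hypothetical, and the API may differ in later releases.

```java
import java.io.IOException;

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

final class FullyAsyncRocksDbSetup {

    /**
     * Hypothetical helper: configures checkpointing with a RocksDB backend
     * that uses fully asynchronous snapshots.
     *
     * @param env            the job's execution environment
     * @param checkpointUri  where checkpoints are written, e.g. an HDFS URI (assumed to be supplied by the caller)
     * @param intervalMillis checkpoint interval in milliseconds
     */
    static void configure(StreamExecutionEnvironment env,
                          String checkpointUri,
                          long intervalMillis) throws IOException {
        RocksDBStateBackend backend = new RocksDBStateBackend(checkpointUri);

        // Disabled by default: without this call the state is first copied in a
        // synchronous step, during which processing of elements pauses.
        backend.enableFullyAsyncSnapshots();

        env.setStateBackend(backend);
        env.enableCheckpointing(intervalMillis);
    }
}
```

The helper would be called once while building the job, before env.execute().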

Cheers,

Aljoscha

In reply to Daniel Li's post of Wed, 29 Jun 2016 at 23:04.

Daniel Li

Re: Checkpointing very large state in RocksDB?

Thanks Aljoscha. Yes - that is exactly what I am looking for.

In reply to Aljoscha Krettek's post of Thu, Jun 30, 2016 at 5:07 AM.

vishnuviswanath

Re: Checkpointing very large state in RocksDB?

In reply to this post by Aljoscha Krettek

Hi,

Is there any disadvantage to using fully asynchronous snapshots other than that they are slower? And would the slowness really matter, since it is async anyway?

Thanks and Regards,

Vishnu Viswanath

Aljoscha Krettek

Re: Checkpointing very large state in RocksDB?

Hi,

I think there is no disadvantage other than the fact that in the JobManager dashboard the checkpoint will be shown as "taking longer". Some people might be confused by this if they don't know that during the whole time the job keeps processing data.

I think async snapshotting might be promoted to the default behavior for the RocksDB state backend in a future release.
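To make the difference between "reported checkpoint duration" and "time the job is paused" concrete, here is a toy cost model of the two modes as the JavaDoc describes them. It is only a sketch: the class, its methods, and every rate passed to it are hypothetical illustration values, not Flink internals or measurements.

```java
/** Toy cost model; all rates are hypothetical parameters, not Flink measurements. */
final class SnapshotCostModel {

    /** Result for one checkpoint: processing pause vs. duration shown for the checkpoint. */
    static final class Cost {
        final double pauseSeconds;
        final double totalSeconds;

        Cost(double pauseSeconds, double totalSeconds) {
            this.pauseSeconds = pauseSeconds;
            this.totalSeconds = totalSeconds;
        }
    }

    /** Default mode: synchronous local copy (processing pauses), then asynchronous upload. */
    static Cost semiAsync(double stateBytes, double localCopyBytesPerSec, double uploadBytesPerSec) {
        double pause = stateBytes / localCopyBytesPerSec;
        double upload = stateBytes / uploadBytesPerSec;
        return new Cost(pause, pause + upload);
    }

    /** Fully async mode: modeled with no pause; work grows linearly with the number of unique keys. */
    static Cost fullyAsync(long uniqueKeys, double secondsPerKey) {
        return new Cost(0.0, uniqueKeys * secondsPerKey);
    }

    public static void main(String[] args) {
        // Hypothetical inputs: 10 GB of state, 1 GB/s local copy, 100 MB/s upload,
        // 100 million unique keys at 2 microseconds per key.
        Cost semi = semiAsync(10e9, 1e9, 100e6);
        Cost full = fullyAsync(100_000_000L, 2e-6);
        System.out.printf("semi-async:  pause=%.1fs total=%.1fs%n", semi.pauseSeconds, semi.totalSeconds);
        System.out.printf("fully async: pause=%.1fs total=%.1fs%n", full.pauseSeconds, full.totalSeconds);
    }
}
```

With these made-up inputs the fully asynchronous checkpoint has the larger total duration, which is what the dashboard would show, while the pause seen by the job is the smaller of the two.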

Cheers,

Aljoscha

In reply to Vishnu Viswanath's post of Tue, 5 Jul 2016 at 21:08.