Hi All,
I read the docs, but I still have the following question: for stateful stream processing, is HDFS mandatory? In some places I see that it is required, while in other places I see that RocksDB can be used. I just want to know whether HDFS is mandatory for stateful stream processing. Thanks!
Hi Kant,
Jumping in here; I'd welcome corrections if I'm wrong about any of this.

The short answer is no, HDFS is not necessary for stateful stream processing. In the minimal case, you can use the MemoryStateBackend, which backs up your state onto the JobManager. In any production scenario, though, you will want more durability for your checkpoints and support for larger state, so you should use either the FsStateBackend or the RocksDBStateBackend. With either of these, you will need a checkpoint directory on a filesystem that is accessible by all TaskManagers. The filesystem for this checkpoint directory is configured via the state.backend.* options; see the Hadoop Compatible File Systems documentation for alternatives to HDFS (S3, for example).

Choosing between RocksDBStateBackend and FsStateBackend is a separate decision. FsStateBackend keeps in-flight state in memory and writes it to your durable filesystem only when a checkpoint is initiated. RocksDBStateBackend instead stores in-flight state on local disk (in RocksDB); when a checkpoint is initiated, the appropriate state is then written to the durable filesystem. Because it keeps state on disk, RocksDBStateBackend can handle much larger state than FsStateBackend on equivalent hardware.

I'm drawing most of this from this page:

Does that make sense?

Cheers,
Wolfe

~ Brian Wolfe

On Fri, Apr 7, 2017 at 2:32 AM, kant kodali <[hidden email]> wrote:
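To make the backend choice concrete, here is a sketch of the relevant flink-conf.yaml entries. The key names follow the Flink docs of that era and the hostnames/paths are placeholders, so check the documentation for your Flink version before copying:

```yaml
# flink-conf.yaml (illustrative sketch; verify keys against your Flink version's docs)

# Choose the state backend: "jobmanager" (MemoryStateBackend),
# "filesystem" (FsStateBackend), or "rocksdb" (RocksDBStateBackend).
state.backend: rocksdb

# Durable checkpoint directory, reachable by all TaskManagers.
# Any supported filesystem works here: hdfs://..., s3://..., or file://... on a shared mount.
state.backend.fs.checkpointdir: hdfs://namenode:9000/flink/checkpoints
```

The same choice can also be made per job in the DataStream API via env.setStateBackend(...), which overrides the cluster-wide default.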
Hi Wolfe, that's all correct. Thank you!

2017-04-07 16:34 GMT+02:00 Brian Wolfe <[hidden email]>: