Hi


Hi

kant kodali
Hi All,

I read the docs, but I still have a question: is HDFS mandatory for stateful stream processing? In some places I see that it is required, and in other places I see that RocksDB can be used instead. I just want to know whether HDFS is mandatory for stateful stream processing.

Thanks!

Re: Hi

brian.wolfe
Hi Kant,

Jumping in here, would love corrections if I'm wrong about any of this.

The short answer is no: HDFS is not necessary to run stateful stream processing. In the minimal case, you can use the MemoryStateBackend to back up your state onto the JobManager.
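
For concreteness, here is a minimal sketch of what that looks like with Flink's Java API (the checkpoint interval is an arbitrary placeholder):

    import org.apache.flink.runtime.state.memory.MemoryStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class MemoryBackendExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 10 seconds (interval chosen arbitrarily for the example).
            env.enableCheckpointing(10_000);

            // Keep checkpointed state on the JobManager's heap -- fine for local
            // testing and small state, but not durable across JobManager failures.
            env.setStateBackend(new MemoryStateBackend());

            // ... define the job and call env.execute() ...
        }
    }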

In any production scenario, you will want more durability for your checkpoints and support for larger state. For that, you should use either the RocksDBStateBackend or the FsStateBackend. Either way, you will need a checkpoint directory on a filesystem that is accessible by all TaskManagers. The filesystem for this checkpoint directory (state.backend.*.checkpointdir) can be a shared drive or anything supported by Hadoop's filesystem abstraction; see the section on Hadoop Compatible File Systems in the docs for other alternatives (S3, for example).
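
As a sketch (the URIs below are placeholders, not real endpoints), configuring the FsStateBackend with such a checkpoint directory looks roughly like this:

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FsBackendExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);  // checkpoint interval, arbitrary

            // The checkpoint directory only needs to be reachable by all TaskManagers.
            // Placeholder URIs -- any Hadoop-compatible filesystem should work, e.g.:
            //   hdfs://namenode:8020/flink/checkpoints
            //   s3://my-bucket/flink/checkpoints
            //   file:///mnt/shared/flink/checkpoints   (shared drive)
            env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));

            // ... define the job and call env.execute() ...
        }
    }

The same default can also be set cluster-wide in flink-conf.yaml via the state.backend and state.backend.fs.checkpointdir keys.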

Choosing between the RocksDBStateBackend and the FsStateBackend is a separate decision. The FsStateBackend keeps in-flight state in memory and writes it to your durable filesystem only when a checkpoint is triggered. The RocksDBStateBackend keeps in-flight state on local disk (in RocksDB) instead of in memory; when a checkpoint is triggered, the relevant state is then written to the durable filesystem. Because it stores state on disk, the RocksDBStateBackend can handle much larger state than the FsStateBackend on equivalent hardware.
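
And the RocksDB variant, again only as a sketch (it needs the flink-statebackend-rocksdb dependency on the classpath; the paths and URIs are placeholders):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RocksDBBackendExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);

            // In-flight state lives in RocksDB on the TaskManager's local disk;
            // on each checkpoint it is copied to the durable checkpoint URI below.
            RocksDBStateBackend backend =
                    new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints");

            // Optionally pin RocksDB's local working directory (placeholder path).
            backend.setDbStoragePath("/tmp/flink/rocksdb");

            env.setStateBackend(backend);

            // ... define the job and call env.execute() ...
        }
    }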

I'm drawing most of this from the Flink documentation on state backends.

Does that make sense?

Cheers,
Wolfe

~
Brian Wolfe




Re: Hi

Fabian Hueske-2
Hi Wolfe,

that's all correct. Thank you!

I'd like to emphasize that the FsStateBackend stores all state on the heap of the worker JVM, so you might run into OutOfMemoryErrors if your state grows too large.
Therefore, the RocksDBStateBackend is the recommended choice for most production use cases.

Best, Fabian
