Hi all,

I am trying to think about the essential differences between operators in Flink and Spark, especially when I use a keyed window followed by a reduce operation. In Flink we can develop an application that logically separates these two operators: after a keyed window I can use the .reduce/aggregate/fold/apply() functions [1]. In Spark we have the window/reduceByKeyAndWindow functions, which appear to me less flexible in the options they offer for a keyed window operation [2].

Moreover, when these two applications are deployed on a Flink and a Spark cluster respectively, what are the differences between their physical operators running in the cluster?

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#windows

Thanks,
Felipe
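P.S. To make the comparison concrete, below is roughly the shape of the two programs I have in mind. This is just a minimal sketch, not tested code; the socket source, host/port, and window sizes are placeholders.

    // Flink: the keyed window and the reduction are two separate, composable operators
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time

    object FlinkWindowedCount {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.socketTextStream("localhost", 9999)      // placeholder source
          .map(w => (w, 1))
          .keyBy(_._1)                               // 1) key the stream
          .timeWindow(Time.seconds(10))              // 2) keyed window
          .reduce((a, b) => (a._1, a._2 + b._2))     // 3) reduce (or aggregate/fold/apply)
          .print()
        env.execute("flink-windowed-count")
      }
    }

    // Spark DStreams: key, window, and reduction are fused into a single call
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SparkWindowedCount {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("spark-windowed-count"), Seconds(1))
        ssc.socketTextStream("localhost", 9999)      // placeholder source
          .map(w => (w, 1))
          .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(10))
          .print()
        ssc.start()
        ssc.awaitTermination()
      }
    }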
Hi Felipe
Generally speaking, the key difference that impacts performance is where each engine stores the data within windows.

Flink stores the windowed data and its related timestamps in the state backend [1]. Once a window is triggered, it pulls all the stored timers with their coupled record keys, and then uses those record keys to query the state backend for the next actions.
For Spark, first of all, we should talk about Structured Streaming [2], as it is better than the previous Spark Streaming, especially in window scenarios. Unlike Flink with its built-in RocksDB state backend, Spark has only one implementation of the state store provider: HDFSBackedStateStoreProvider, which stores all of the data in memory. That is a very memory-consuming approach and might run into OOM errors [3][4][5]. To avoid this, the Databricks Runtime offers a 'RocksDBStateStoreProvider', but it is not open source. We are lucky that open-source Flink already offers a built-in RocksDB state backend to avoid the OOM problem. Moreover, the Flink community is currently developing a spill-able memory state backend [7].
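For example, switching Flink to the RocksDB state backend is essentially a one-liner. A minimal sketch, assuming the flink-statebackend-rocksdb dependency is on the classpath; the checkpoint URI is a placeholder:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Window state now lives in RocksDB on local disk instead of the JVM heap,
    // so large windows no longer risk OOM; 'true' enables incremental checkpoints.
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true))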
Best
Yun Tang
Hi Yun,

that is a very complete answer. Thanks!

I was also wondering about the mini-batches that Spark creates when we have to create a Spark streaming context. They still remain in all versions of stream processing in Spark, don't they? And because of that, Spark shuffles data [1] to wide-dependency operators every time a mini-batch ends, doesn't it?

Flink, in this sense, does not have mini-batches, hence it will shuffle data to wide-dependency operators only when a window is triggered. Am I correct?

Thanks,
Felipe
Hi Felipe
From what I remember, Spark still uses micro-batches to shuffle data in Structured Streaming.
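For example, even in Structured Streaming you still configure a micro-batch trigger. A minimal sketch using the built-in 'rate' test source and console sink; the interval is a placeholder:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()
    spark.readStream
      .format("rate")                                 // built-in test source
      .load()
      .writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))  // a new micro-batch (and its shuffles) every 10 seconds
      .start()
      .awaitTermination()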
Flink, on the other hand, actually processes elements per record; there is no actual disk-I/O shuffle in Flink streaming. A record is emitted to the downstream operator by selecting a specific channel through the network [1]. That is why we need to call "keyBy" before using windows: the "KeyGroupStreamPartitioner" is then used to select the target channel based on the key-group index. Data is first stored in the local state backend and waits to be polled out once a window triggers; it is not "shuffled" until the window is triggered.
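If it helps, the routing works roughly as follows. This is a simplified sketch of what org.apache.flink.runtime.state.KeyGroupRangeAssignment does; Scala's MurmurHash3 finalizer stands in here for Flink's internal MathUtils.murmurHash:

    import scala.util.hashing.MurmurHash3

    // Which key group a record's key belongs to:
    def assignToKeyGroup(key: Any, maxParallelism: Int): Int =
      (MurmurHash3.finalizeHash(key.hashCode, 0) & Int.MaxValue) % maxParallelism

    // Which downstream subtask (network channel) owns that key group:
    def operatorIndexForKeyGroup(maxParallelism: Int, parallelism: Int, keyGroupId: Int): Int =
      keyGroupId * parallelism / maxParallelism

For example, with maxParallelism = 128 and parallelism = 4, every record whose key falls into key groups 0..31 is sent over the channel to subtask 0, and so on.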
Best
Yun Tang
That is nice. So, only because of this, Flink already shuffles less data than Spark. Now I need to plug in Prometheus and Grafana to show it.

Thanks Yun for your help!
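P.S. For anyone who finds this thread later: my plan is to expose the numbers with Flink's Prometheus metrics reporter. If I read the docs correctly, it only needs the flink-metrics-prometheus jar in lib/ and something like this in flink-conf.yaml (the port is a placeholder):

    metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
    metrics.reporter.prom.port: 9249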