Difference between windows in Spark and Flink


Difference between windows in Spark and Flink

Felipe Gutierrez
Hi all,

I am trying to think about the essential differences between operators in Flink and Spark, especially when I use a keyed window followed by a reduce operation.
In Flink, we develop an application that can logically separate these two operators; after a keyed window I can use the .reduce/aggregate/fold/apply() functions [1].
In Spark, we have the window/reduceByKeyAndWindow functions, which to me appear less flexible in the options they offer for a keyed window operation [2].
Moreover, when these two applications are deployed in a Flink and a Spark cluster respectively, what are the differences between their physical operators running in the cluster?
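To make the comparison concrete, here is a toy pure-Python simulation of what a keyed tumbling window followed by a reduce does logically. This is not Flink or Spark API code; the function name, the events, and the window size are all made up for illustration:

```python
from collections import defaultdict
from functools import reduce

def keyed_tumbling_window_reduce(events, window_size, reduce_fn):
    """events: iterable of (key, value, timestamp) tuples.
    Groups values into per-key tumbling windows of window_size
    time units, then reduces each window's values with reduce_fn."""
    windows = defaultdict(list)
    for key, value, ts in events:
        # Assign each record to the tumbling window its timestamp falls in.
        window_start = (ts // window_size) * window_size
        windows[(key, window_start)].append(value)
    # "Trigger" every window: apply the reduce per (key, window) pair.
    return {kw: reduce(reduce_fn, vals) for kw, vals in windows.items()}

events = [("a", 1, 0), ("a", 2, 3), ("b", 5, 4), ("a", 7, 6)]
result = keyed_tumbling_window_reduce(events, 5, lambda x, y: x + y)
# result: {("a", 0): 3, ("b", 0): 5, ("a", 5): 7}
```

In Flink the keyed window and the reduce are two separately configurable steps (keyBy(...).window(...).reduce(...)), while Spark's DStream reduceByKeyAndWindow fuses them into a single call.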


Thanks,
Felipe
--
-- Felipe Gutierrez
-- skype: felipe.o.gutierrez

Re: Difference between windows in Spark and Flink

Yun Tang
Hi Felipe

Generally speaking, the key difference that impacts performance is where each engine stores the data within windows.
Flink stores the windowed data and its associated timestamps in the state backend [1]. Once a window is triggered, Flink pulls all the stored timers together with their record keys, and then uses those keys to query the state backend for the next actions.
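A toy pure-Python sketch of that mechanism, under the assumption that a plain dict stands in for the state backend; all names here (ToyWindowOperator, advance_watermark) are invented, not Flink's API:

```python
class ToyWindowOperator:
    """Toy model of Flink's keyed windowing: per-(key, window) state
    in a dict standing in for the state backend, plus event-time
    timers that fire once the watermark passes the window end."""
    def __init__(self, window_size, reduce_fn):
        self.window_size = window_size
        self.reduce_fn = reduce_fn
        self.state = {}      # (key, window_end) -> reduced value so far
        self.timers = set()  # (window_end, key)

    def process(self, key, value, ts):
        window_end = (ts // self.window_size + 1) * self.window_size
        slot = (key, window_end)
        # Reduce incrementally into state instead of buffering raw records.
        self.state[slot] = (value if slot not in self.state
                            else self.reduce_fn(self.state[slot], value))
        self.timers.add((window_end, key))

    def advance_watermark(self, watermark):
        """Fire every timer at or before the watermark: use the timer's
        record key to pull the window state and emit the reduced result."""
        fired = sorted(t for t in self.timers if t[0] <= watermark)
        results = []
        for window_end, key in fired:
            self.timers.discard((window_end, key))
            results.append((key, window_end, self.state.pop((key, window_end))))
        return results
```

For example, after processing ("a", 1) at time 0 and ("a", 2) at time 3 with a window size of 5, advancing the watermark to 5 emits the reduced result 3 for key "a".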

For Spark, we should first talk about Structured Streaming [2], since it is better than the earlier Spark Streaming, especially for window scenarios. Unlike Flink, which ships with a built-in RocksDB state backend, Spark has only one implementation of the state store provider: HDFSBackedStateStoreProvider, which keeps all of the data in memory. This is a very memory-consuming approach and can run into OOM errors [3][4][5].
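A heavily simplified analogue of that design in pure Python, modelling only the "every committed version keeps the full map in memory" aspect; the class and method names are invented, not Spark's actual implementation:

```python
import copy

class ToyMemoryStateStore:
    """Toy analogue of a memory-resident, versioned state store: each
    commit materializes the FULL key-value map for the new version in
    memory (the real design also writes delta files to HDFS), so state
    larger than the executor heap eventually leads to OOM."""
    def __init__(self):
        self.versions = {0: {}}  # version -> full state map
        self.current = 0

    def commit(self, updates):
        new_map = copy.deepcopy(self.versions[self.current])
        new_map.update(updates)
        self.current += 1
        # Older versions linger in memory for recovery/retention.
        self.versions[self.current] = new_map
        return self.current
```

The point of the sketch: memory grows with both the number of keys and the number of retained versions, which is why a disk-backed store like RocksDB is attractive for large state.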

To avoid this, the Databricks Runtime offers a 'RocksDBStateStoreProvider', but it is not open source. We are lucky that open-source Flink already offers a built-in RocksDB state backend to avoid the OOM problem. Moreover, the Flink community is currently developing a spillable heap state backend [7].


Best
Yun Tang





Re: Difference between windows in Spark and Flink

Felipe Gutierrez
Hi Yun,

that is a very complete answer. Thanks!

I was also wondering about the mini-batches that Spark creates when we have to create a Spark Streaming context. They still exist in all versions of stream processing in Spark, don't they? And because of that, Spark shuffles data [1] to wide-dependency operators every time a mini-batch ends, doesn't it?
Flink, in contrast, does not have mini-batches, so it shuffles data to wide-dependency operators only when a window is triggered. Am I correct?


Thanks,
Felipe
--
-- Felipe Gutierrez
-- skype: felipe.o.gutierrez



Re: Difference between windows in Spark and Flink

Yun Tang
Hi Felipe

From what I remember, Spark still uses micro-batches to shuffle data in Structured Streaming.
Flink, on the other hand, processes elements record by record; there is no actual disk-I/O shuffle in Flink streaming. Each record is emitted to a downstream operator by selecting a specific channel through the network [1]. That is why we need to call "keyBy" before using windows: a "KeyGroupStreamPartitioner" is then used to select the target channel based on the key-group index. Data is first stored in the local state backend and pulled out once a window is triggered; it is not "shuffled" again at trigger time.
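The key-group routing can be sketched in a few lines of Python. Python's built-in hash() stands in for the murmur hash Flink actually uses, so treat this as a model of the idea rather than the real implementation:

```python
def key_group_index(key, max_parallelism):
    # Hash the key into a fixed space of max_parallelism key groups
    # (Flink uses a murmur hash; hash() is a stand-in here).
    return hash(key) % max_parallelism

def operator_index(key_group, max_parallelism, parallelism):
    # Key groups are mapped in contiguous ranges onto the parallel
    # operator instances, which also keeps state rescaling cheap.
    return key_group * parallelism // max_parallelism

def select_channel(key, max_parallelism, parallelism):
    """What a KeyGroupStreamPartitioner-like step computes per record:
    the downstream channel this key's records are sent to."""
    return operator_index(key_group_index(key, max_parallelism),
                          max_parallelism, parallelism)
```

Every record with the same key is deterministically routed to the same channel, which is what lets all window state for that key live in a single task's local state backend.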



Best
Yun Tang



Re: Difference between windows in Spark and Flink

Felipe Gutierrez
that is nice. So, just because of this, Flink shuffles less data than Spark.
Now I need to plug in Prometheus and Grafana to show it.

Thanks Yun for your help!

--
-- Felipe Gutierrez
-- skype: felipe.o.gutierrez

