Does Flink operators synchronize states?

Does Flink operators synchronize states?

Yuta Morisawa
Hello,

I am wondering whether Flink operators synchronize their execution
state the way Apache Spark does. In Apache Spark, the master decides
everything: for example, it schedules jobs and assigns tasks to
Executors so that each job is executed in a synchronized way. Flink
looks different. It appears that each TaskManager is dedicated to
specific operators and that they execute tasks asynchronously. Is this
understanding correct?

In short, I want to know how Flink assigns tasks to TaskManagers and
how it manages them, because I think this is important for performance
tuning. Could you point me to any detailed documentation?

Regards,
Yuta
--

Re: Does Flink operators synchronize states?

Arvid Heise-3
Hi Yuta,

there are indeed a few important differences between Spark and Flink. However, please also note that different APIs behave differently on both systems. So it would be good if you could clarify what you are doing, so that I can go into more detail.

As a starting point, you can always check the architecture overview page [1] of Flink.

Then keep in mind that Flink approaches scheduling from a streaming perspective and Spark from a batch perspective. In Flink, most tasks are always running, with a few exceptions (the pure batch API, which behaves like Spark's default), whereas in Spark tasks are usually scheduled in waves, with a few exceptions (continuous processing in Structured Streaming, which behaves like Flink's default).
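To make the contrast concrete, here is a toy sketch in plain Python. It uses neither the Flink nor the Spark API, and the names `run_in_waves`, `run_pipelined`, and `OPS` are made up for illustration. It only records the order in which two operators touch each record: wave-style scheduling finishes one stage for the whole input before the next stage starts, while pipelined, always-running operators pass each record through the whole chain as it arrives.

```python
from typing import Callable, Iterable, List, Tuple

Op = Tuple[str, Callable[[int], int]]   # (operator name, function)
Event = Tuple[str, int]                 # (operator name, record it was applied to)


def run_in_waves(records: Iterable[int], ops: List[Op]) -> Tuple[List[int], List[Event]]:
    """Wave-style: a stage processes the whole input before the next stage starts."""
    trace: List[Event] = []
    data = list(records)
    for name, fn in ops:
        out = []
        for r in data:
            trace.append((name, r))
            out.append(fn(r))
        data = out  # stage barrier: the next stage only sees the finished output
    return data, trace


def run_pipelined(records: Iterable[int], ops: List[Op]) -> Tuple[List[int], List[Event]]:
    """Pipelined: all operators are 'running'; each record flows through the chain."""
    trace: List[Event] = []
    results: List[int] = []
    for r in records:
        for name, fn in ops:
            trace.append((name, r))
            r = fn(r)
        results.append(r)
    return results, trace


# Hypothetical two-operator job: add one, then double.
OPS: List[Op] = [("map", lambda x: x + 1), ("double", lambda x: x * 2)]

if __name__ == "__main__":
    print(run_in_waves([1, 2, 3], OPS)[1])
    print(run_pipelined([1, 2, 3], OPS)[1])
```

Both functions compute the same results; only the execution order differs. In the wave trace all `map` events come before any `double` event, whereas in the pipelined trace they alternate record by record. This is a simplification that ignores parallelism, network buffers, and failure handling.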

Note that quite a bit is changing in both systems. In Flink, we are trying to get rid of the old batch subsystem and fully integrate it into streaming, such that the actual scheduling mode is determined more dynamically for parts of the whole application. Think of a job where you need to do some batch preprocessing to build up a dictionary and then use it to enrich streaming data. Over the next year, Flink should be able to fully exploit the data properties of streaming and batch tasks within the same application. In Spark, they also seem to be working towards supporting more complex applications in continuous processing mode (so beyond the current embarrassingly parallel operations), for which they may also need to revise their scheduling model.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html
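The dictionary example can be sketched in a few lines of plain Python. This is an illustration of the two phases only, not Flink code, and the function names and sample data are hypothetical: a bounded input is consumed completely to build the lookup table, and afterwards an unbounded stream is enriched one event at a time.

```python
from typing import Dict, Iterable, Iterator, Tuple


def build_dictionary(bounded_input: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Batch phase: the whole bounded input is read before anything is emitted."""
    return dict(bounded_input)  # a later entry for the same key overwrites an earlier one


def enrich(
    stream: Iterable[Tuple[str, float]],
    lookup: Dict[str, str],
    default: str = "unknown",
) -> Iterator[Tuple[str, float, str]]:
    """Streaming phase: each event is enriched and emitted as soon as it arrives."""
    for key, value in stream:
        yield key, value, lookup.get(key, default)


if __name__ == "__main__":
    # Hypothetical data: user id -> country, then (user id, amount) events.
    countries = build_dictionary([("u1", "DE"), ("u2", "JP")])
    for event in enrich([("u1", 9.5), ("u3", 1.0)], countries):
        print(event)
```

The point of the sketch is that the first phase has a natural end (so it can be scheduled like a batch stage), while the second never has to finish, which is why a single scheduling mode for the whole application is a poor fit.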


On Fri, Oct 30, 2020 at 10:05 AM Yuta Morisawa <[hidden email]> wrote:
> [...]


--

Arvid Heise | Senior Java Developer


Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng   

Re: Does Flink operators synchronize states?

Yuta Morisawa
Hi Arvid,

Thank you for your detailed answer. After reading it, I realized that I
did not fully understand the difference between the micro-batch model
and the continuous (one-by-one) processing model. I am familiar with the
micro-batch model but not with the continuous one, so I will look for
documentation on it. Thank you again for your answer.

Regards,
Yuta
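For readers with the same question, the difference between the two models can be illustrated with a small, idealized latency calculation in plain Python. The function names and the timestamps are assumptions for illustration, and processing cost is taken to be zero unless stated: in the micro-batch model a record waits until the end of the interval it arrived in, while in the continuous model it is handled as soon as it arrives.

```python
from typing import List


def micro_batch_emit_times(arrivals_ms: List[int], interval_ms: int) -> List[int]:
    """A record is emitted at the end of the batch interval in which it arrived."""
    return [(t // interval_ms + 1) * interval_ms for t in arrivals_ms]


def continuous_emit_times(arrivals_ms: List[int], cost_ms: int = 0) -> List[int]:
    """A record is emitted right after arrival, plus an assumed per-record cost."""
    return [t + cost_ms for t in arrivals_ms]


def latencies(arrivals_ms: List[int], emit_ms: List[int]) -> List[int]:
    """Per-record delay between arrival and emission."""
    return [e - a for a, e in zip(arrivals_ms, emit_ms)]


if __name__ == "__main__":
    arrivals = [200, 900, 1100, 2500]  # hypothetical arrival times in milliseconds
    print(latencies(arrivals, micro_batch_emit_times(arrivals, 1000)))
    print(latencies(arrivals, continuous_emit_times(arrivals)))
```

With a 1000 ms interval, the micro-batch latencies for these arrivals are 800, 100, 900, and 500 ms, while the idealized continuous latencies are all 0. Real systems add processing and scheduling overhead to both, so this only shows where the waiting time in the micro-batch model comes from.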

On 2020/11/02 1:07, Arvid Heise wrote:

> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html