Flink Streaming : PartitionBy vs GroupBy differences

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink Streaming : PartitionBy vs GroupBy differences

tambunanw
Hi All,

I'm trying to digest what's the difference between this two. From my experience in Spark GroupBy will cause shuffling on the network. Is that the same case in Flink ?

I've watch videos and read a couple docs about Flink that's actually Flink will compile the user code into it's own optimized graph structure so i think Flink engine will take care of this one ?

From the docs for Partitioning

http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#partitioning

Is that true that GroupBy is more advanced than PartitionBy ? Can someone elaborate ?

I think this one is really confusing for me that come from Spark world. Any help would be really appreciated.

Cheers

Reply | Threaded
Open this post in threaded view
|

Re: Flink Streaming : PartitionBy vs GroupBy differences

Gyula Fóra
Hey!

Both groupBy and partitionBy will trigger a shuffle over the network based on some key, assuring that elements with the same keys end up on the same downstream processing operator.

The difference between the two is that groupBy in addition to this returns a GroupedDataStream which lets you execute some special operations, such as key based rolling aggregates.

PartitionBy is useful when you are using simple operators but still want to control the messages received by parallel instances (in a mapper for example).

Cheers,
Gyula

tambunanw <[hidden email]> ezt írta (időpont: 2015. júl. 3., P, 10:32):
Hi All,

I'm trying to digest what's the difference between this two. From my
experience in Spark GroupBy will cause shuffling on the network. Is that the
same case in Flink ?

I've watch videos and read a couple docs about Flink that's actually Flink
will compile the user code into it's own optimized graph structure so i
think Flink engine will take care of this one ?

From the docs for Partitioning

http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#partitioning

Is that true that GroupBy is more advanced than PartitionBy ? Can someone
elaborate ?

I think this one is really confusing for me that come from Spark world. Any
help would be really appreciated.

Cheers





--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Streaming-PartitionBy-vs-GroupBy-differences-tp1927.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
Reply | Threaded
Open this post in threaded view
|

Re: Flink Streaming : PartitionBy vs GroupBy differences

tambunanw
Hi Gyula, 

Thanks for your response. 

So if i use partitionBy then data point with the same will receive exactly by the same instance of operator ? 


Another question is if i execute reduce() operator on after partitionBy, will that reduce operator guarantee ordering within the same key ?


Cheers

On Fri, Jul 3, 2015 at 4:14 PM, Gyula Fóra <[hidden email]> wrote:
Hey!

Both groupBy and partitionBy will trigger a shuffle over the network based on some key, assuring that elements with the same keys end up on the same downstream processing operator.

The difference between the two is that groupBy in addition to this returns a GroupedDataStream which lets you execute some special operations, such as key based rolling aggregates.

PartitionBy is useful when you are using simple operators but still want to control the messages received by parallel instances (in a mapper for example).

Cheers,
Gyula

tambunanw <[hidden email]> ezt írta (időpont: 2015. júl. 3., P, 10:32):
Hi All,

I'm trying to digest what's the difference between this two. From my
experience in Spark GroupBy will cause shuffling on the network. Is that the
same case in Flink ?

I've watch videos and read a couple docs about Flink that's actually Flink
will compile the user code into it's own optimized graph structure so i
think Flink engine will take care of this one ?

From the docs for Partitioning

http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#partitioning

Is that true that GroupBy is more advanced than PartitionBy ? Can someone
elaborate ?

I think this one is really confusing for me that come from Spark world. Any
help would be really appreciated.

Cheers





--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Streaming-PartitionBy-vs-GroupBy-differences-tp1927.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.



--
Reply | Threaded
Open this post in threaded view
|

Re: Flink Streaming : PartitionBy vs GroupBy differences

Gyula Fóra
Hey,

1.
Yes, if you use partitionBy the same key will always go to the same downstream operator instance.

2.
There is only partial ordering guarantee, meaning that data received from one input is FIFO. This means that if the same key is coming from multiple inputs than there is no ordering guarantee there, only inside one input.

Gyula

Welly Tambunan <[hidden email]> ezt írta (időpont: 2015. júl. 3., P, 11:51):
Hi Gyula, 

Thanks for your response. 

So if i use partitionBy then data point with the same will receive exactly by the same instance of operator ? 


Another question is if i execute reduce() operator on after partitionBy, will that reduce operator guarantee ordering within the same key ?


Cheers

On Fri, Jul 3, 2015 at 4:14 PM, Gyula Fóra <[hidden email]> wrote:
Hey!

Both groupBy and partitionBy will trigger a shuffle over the network based on some key, assuring that elements with the same keys end up on the same downstream processing operator.

The difference between the two is that groupBy in addition to this returns a GroupedDataStream which lets you execute some special operations, such as key based rolling aggregates.

PartitionBy is useful when you are using simple operators but still want to control the messages received by parallel instances (in a mapper for example).

Cheers,
Gyula

tambunanw <[hidden email]> ezt írta (időpont: 2015. júl. 3., P, 10:32):
Hi All,

I'm trying to digest what's the difference between this two. From my
experience in Spark GroupBy will cause shuffling on the network. Is that the
same case in Flink ?

I've watch videos and read a couple docs about Flink that's actually Flink
will compile the user code into it's own optimized graph structure so i
think Flink engine will take care of this one ?

From the docs for Partitioning

http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#partitioning

Is that true that GroupBy is more advanced than PartitionBy ? Can someone
elaborate ?

I think this one is really confusing for me that come from Spark world. Any
help would be really appreciated.

Cheers





--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Streaming-PartitionBy-vs-GroupBy-differences-tp1927.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.



Reply | Threaded
Open this post in threaded view
|

Re: Flink Streaming : PartitionBy vs GroupBy differences

tambunanw
Hi Gyula, 

Thanks a lot. That's enough for my case. 

I do really love Flink Streaming model compare to Spark Streaming. 

So is that true that i can think that Operator as an Actor model in this system ? Is that a right way to put it ? 



Cheers

On Fri, Jul 3, 2015 at 5:29 PM, Gyula Fóra <[hidden email]> wrote:
Hey,

1.
Yes, if you use partitionBy the same key will always go to the same downstream operator instance.

2.
There is only partial ordering guarantee, meaning that data received from one input is FIFO. This means that if the same key is coming from multiple inputs than there is no ordering guarantee there, only inside one input.

Gyula

Welly Tambunan <[hidden email]> ezt írta (időpont: 2015. júl. 3., P, 11:51):
Hi Gyula, 

Thanks for your response. 

So if i use partitionBy then data point with the same will receive exactly by the same instance of operator ? 


Another question is if i execute reduce() operator on after partitionBy, will that reduce operator guarantee ordering within the same key ?


Cheers

On Fri, Jul 3, 2015 at 4:14 PM, Gyula Fóra <[hidden email]> wrote:
Hey!

Both groupBy and partitionBy will trigger a shuffle over the network based on some key, assuring that elements with the same keys end up on the same downstream processing operator.

The difference between the two is that groupBy in addition to this returns a GroupedDataStream which lets you execute some special operations, such as key based rolling aggregates.

PartitionBy is useful when you are using simple operators but still want to control the messages received by parallel instances (in a mapper for example).

Cheers,
Gyula

tambunanw <[hidden email]> ezt írta (időpont: 2015. júl. 3., P, 10:32):
Hi All,

I'm trying to digest what's the difference between this two. From my
experience in Spark GroupBy will cause shuffling on the network. Is that the
same case in Flink ?

I've watch videos and read a couple docs about Flink that's actually Flink
will compile the user code into it's own optimized graph structure so i
think Flink engine will take care of this one ?

From the docs for Partitioning

http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#partitioning

Is that true that GroupBy is more advanced than PartitionBy ? Can someone
elaborate ?

I think this one is really confusing for me that come from Spark world. Any
help would be really appreciated.

Cheers





--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Streaming-PartitionBy-vs-GroupBy-differences-tp1927.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.






--
Reply | Threaded
Open this post in threaded view
|

Re: Flink Streaming : PartitionBy vs GroupBy differences

Gyula Fóra
Yes, you can think of it that way. Each Operator has parallel instances and each parallel instance receives input from multiple channels (FIFO from each) and produces output.

Welly Tambunan <[hidden email]> ezt írta (időpont: 2015. júl. 3., P, 13:02):
Hi Gyula, 

Thanks a lot. That's enough for my case. 

I do really love Flink Streaming model compare to Spark Streaming. 

So is that true that i can think that Operator as an Actor model in this system ? Is that a right way to put it ? 



Cheers

On Fri, Jul 3, 2015 at 5:29 PM, Gyula Fóra <[hidden email]> wrote:
Hey,

1.
Yes, if you use partitionBy the same key will always go to the same downstream operator instance.

2.
There is only partial ordering guarantee, meaning that data received from one input is FIFO. This means that if the same key is coming from multiple inputs than there is no ordering guarantee there, only inside one input.

Gyula

Welly Tambunan <[hidden email]> ezt írta (időpont: 2015. júl. 3., P, 11:51):
Hi Gyula, 

Thanks for your response. 

So if i use partitionBy then data point with the same will receive exactly by the same instance of operator ? 


Another question is if i execute reduce() operator on after partitionBy, will that reduce operator guarantee ordering within the same key ?


Cheers

On Fri, Jul 3, 2015 at 4:14 PM, Gyula Fóra <[hidden email]> wrote:
Hey!

Both groupBy and partitionBy will trigger a shuffle over the network based on some key, assuring that elements with the same keys end up on the same downstream processing operator.

The difference between the two is that groupBy in addition to this returns a GroupedDataStream which lets you execute some special operations, such as key based rolling aggregates.

PartitionBy is useful when you are using simple operators but still want to control the messages received by parallel instances (in a mapper for example).

Cheers,
Gyula

tambunanw <[hidden email]> ezt írta (időpont: 2015. júl. 3., P, 10:32):
Hi All,

I'm trying to digest what's the difference between this two. From my
experience in Spark GroupBy will cause shuffling on the network. Is that the
same case in Flink ?

I've watch videos and read a couple docs about Flink that's actually Flink
will compile the user code into it's own optimized graph structure so i
think Flink engine will take care of this one ?

From the docs for Partitioning

http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#partitioning

Is that true that GroupBy is more advanced than PartitionBy ? Can someone
elaborate ?

I think this one is really confusing for me that come from Spark world. Any
help would be really appreciated.

Cheers





--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Streaming-PartitionBy-vs-GroupBy-differences-tp1927.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.






--
Reply | Threaded
Open this post in threaded view
|

Re: Flink Streaming : PartitionBy vs GroupBy differences

tambunanw
Thanks Gyula


Cheers

On Fri, Jul 3, 2015 at 6:19 PM, Gyula Fóra <[hidden email]> wrote:
Yes, you can think of it that way. Each Operator has parallel instances and each parallel instance receives input from multiple channels (FIFO from each) and produces output.

Welly Tambunan <[hidden email]> ezt írta (időpont: 2015. júl. 3., P, 13:02):
Hi Gyula, 

Thanks a lot. That's enough for my case. 

I do really love Flink Streaming model compare to Spark Streaming. 

So is that true that i can think that Operator as an Actor model in this system ? Is that a right way to put it ? 



Cheers

On Fri, Jul 3, 2015 at 5:29 PM, Gyula Fóra <[hidden email]> wrote:
Hey,

1.
Yes, if you use partitionBy the same key will always go to the same downstream operator instance.

2.
There is only partial ordering guarantee, meaning that data received from one input is FIFO. This means that if the same key is coming from multiple inputs than there is no ordering guarantee there, only inside one input.

Gyula

Welly Tambunan <[hidden email]> ezt írta (időpont: 2015. júl. 3., P, 11:51):
Hi Gyula, 

Thanks for your response. 

So if i use partitionBy then data point with the same will receive exactly by the same instance of operator ? 


Another question is if i execute reduce() operator on after partitionBy, will that reduce operator guarantee ordering within the same key ?


Cheers

On Fri, Jul 3, 2015 at 4:14 PM, Gyula Fóra <[hidden email]> wrote:
Hey!

Both groupBy and partitionBy will trigger a shuffle over the network based on some key, assuring that elements with the same keys end up on the same downstream processing operator.

The difference between the two is that groupBy in addition to this returns a GroupedDataStream which lets you execute some special operations, such as key based rolling aggregates.

PartitionBy is useful when you are using simple operators but still want to control the messages received by parallel instances (in a mapper for example).

Cheers,
Gyula

tambunanw <[hidden email]> ezt írta (időpont: 2015. júl. 3., P, 10:32):
Hi All,

I'm trying to digest what's the difference between this two. From my
experience in Spark GroupBy will cause shuffling on the network. Is that the
same case in Flink ?

I've watch videos and read a couple docs about Flink that's actually Flink
will compile the user code into it's own optimized graph structure so i
think Flink engine will take care of this one ?

From the docs for Partitioning

http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#partitioning

Is that true that GroupBy is more advanced than PartitionBy ? Can someone
elaborate ?

I think this one is really confusing for me that come from Spark world. Any
help would be really appreciated.

Cheers





--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Streaming-PartitionBy-vs-GroupBy-differences-tp1927.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.






--



--