kafka partitions, data locality

Smirnov Sergey Vladimirovich (39833)

Hello,

 

We are planning to use Apache Flink as a core component of our new streaming system for internal processes (finance, banking business) based on Apache Kafka.

So we have started some research on Apache Flink, and one of the questions that arose during that work is how Flink handles data locality.

I'll try to explain: suppose we have a Kafka topic with some kind of events, and these events are grouped by topic partition so that the handler (or job worker) consuming messages from a partition has all the information necessary for further processing.

As an example, say we have clients' payment transactions in a Kafka topic. We group by clientId (transactions with the same clientId go to the same Kafka topic partition), and the task is to find the max transaction per client in sliding windows. In map/reduce terms there is no need to shuffle data between all topic consumers; maybe it is worth shuffling only within each consumer, to gain some speedup by increasing the number of executors working on each partition's data.

And my question is how Flink will work in this case. Does it shuffle all the data, or does it have some settings to avoid this extra, unnecessary shuffle/sorting?
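
For illustration, the per-partition computation described above can be sketched in plain Python (this is not Flink code). The sketch assumes integer timestamps and events shaped as (client_id, timestamp, amount); the function names, the sample events and the window parameters are all hypothetical. Because every event of a client sits in the same partition, each consumer can produce its own results without exchanging data with the others.

```python
def sliding_window_starts(ts, size, slide):
    """Start times of every window [start, start + size) that contains ts."""
    last_start = ts - (ts % slide)
    return list(range(last_start, ts - size, -slide))


def max_per_client(partition_events, size, slide):
    """Max amount per (client_id, window_start) for the events of ONE partition."""
    best = {}
    for client_id, ts, amount in partition_events:
        for start in sliding_window_starts(ts, size, slide):
            key = (client_id, start)
            if key not in best or amount > best[key]:
                best[key] = amount
    return best


# Hypothetical events from a single Kafka partition: (client_id, timestamp, amount)
events = [("a", 5, 100), ("a", 12, 250), ("b", 14, 40), ("a", 31, 70)]
result = max_per_client(events, size=20, slide=10)
for (client_id, start), amount in sorted(result.items()):
    print(client_id, start, amount)
```

With a window size of 20 and a slide of 10, each event falls into two windows, so client "a" gets a max of 250 for the window starting at 0 and a max of 70 for the windows starting at 20 and 30.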

Thanks in advance!

 

 

With best regards,

Sergey Smirnov

RE: kafka partitions, data locality

Smirnov Sergey Vladimirovich (39833)

Hi Ken,

 

It’s a bad story for us: even for a small window we have tens of thousands of events per job, with 10x that or even more at peaks. And the number of jobs is known to be high. So instead of N operations (our producer/consumer mechanism), with shuffling/re-sorting (the current Flink implementation) it will be N*ln(N): a tenfold loss of execution speed!
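
To make the estimate above concrete: if the shuffle really did cost N*ln(N) instead of N, the slowdown factor would be ln(N). A minimal Python sketch of that arithmetic follows; the function name is made up for illustration, and it models the estimate only, not measured Flink behavior.

```python
import math


def slowdown_factor(n_events):
    """Ratio (N * ln N) / N, which simplifies to ln N."""
    return math.log(n_events)


for n in (10_000, 20_000):
    print(f"N={n}: ln(N) is about {slowdown_factor(n):.1f}")
```

For N between 10,000 and 20,000, ln(N) lies between roughly 9.2 and 9.9, which is where the "tenfold" figure comes from.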

To everyone: what should my next step be? Contribute to Apache Flink? Look through the issue backlog?

 

 

With best regards,

Sergey

From: Ken Krugler [mailto:[hidden email]]
Sent: Wednesday, April 17, 2019 9:23 PM
To: Smirnov Sergey Vladimirovich (39833) <[hidden email]>
Subject: Re: kafka partitions, data locality

 

Hi Sergey,

 

As you surmised, once you do a keyBy/max on the Kafka topic, to group by clientId and find the max, then the topology will have a partition/shuffle to it.

 

This is because Flink doesn’t know that client ids don’t span Kafka partitions.

 

I don’t know of any way to tell Flink that the data doesn’t need to be shuffled. There was a discussion about adding a keyByWithoutPartitioning a while back, but I don’t think that support was ever added.

 

A simple ProcessFunction with MapState (clientId -> max) should allow you to do the same thing without too much custom code. In order to support windowing, you’d use triggers to flush state/emit results.
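
As a rough illustration of that idea (plain Python, not the Flink API): one parallel task keeps a map from clientId to the running max, then emits and clears it whenever a flush boundary is crossed. The class and method names are hypothetical, timestamps are assumed to arrive in order, and the flush is a simple tumbling one rather than a true sliding window.

```python
class MaxPerClientTask:
    """Stand-in for one parallel task holding clientId -> max in local state."""

    def __init__(self, flush_interval):
        self.flush_interval = flush_interval
        self.max_by_client = {}
        self.next_flush = None

    def _boundary_after(self, ts):
        # First multiple of flush_interval that is strictly greater than ts.
        return (ts // self.flush_interval + 1) * self.flush_interval

    def on_event(self, client_id, ts, amount):
        """Update state; return the flushed (client_id, max) pairs, if any."""
        emitted = []
        if self.next_flush is None:
            self.next_flush = self._boundary_after(ts)
        if ts >= self.next_flush:
            emitted = self.flush()
            self.next_flush = self._boundary_after(ts)
        current = self.max_by_client.get(client_id)
        if current is None or amount > current:
            self.max_by_client[client_id] = amount
        return emitted

    def flush(self):
        """Emit the current maxima and clear the state."""
        out = sorted(self.max_by_client.items())
        self.max_by_client.clear()
        return out


task = MaxPerClientTask(flush_interval=10)
outputs = [
    task.on_event("a", 1, 5),
    task.on_event("a", 4, 9),
    task.on_event("b", 7, 3),
    task.on_event("a", 12, 1),  # crosses the boundary at t=10 and triggers a flush
]
print(outputs[-1])
```

No data leaves the task in this model, which is the point of the suggestion: the state lives next to the consumer that already holds all events for its clients.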

 

— Ken

 

 

--------------------------

Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra

 

Re: kafka partitions, data locality

Dawid Wysakowicz-2

Hi Smirnov,

Actually, there is a way to tell Flink that the data is already partitioned. You can try the reinterpretAsKeyedStream [1] method. I must warn you, though, that this is an experimental feature.

Best,

Dawid

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/experimental.html#experimental-features

RE: kafka partitions, data locality

Smirnov Sergey Vladimirovich (39833)

Hi,

 

Dawid, great, thanks!

Any plans to make it stable? 1.9?

 

 

Regards,

Sergey

 

Re: kafka partitions, data locality

Stefan Richter-4
Hi Sergey,

The reason this is flagged as beta is actually less about stability and more about the fact that this is supposed to be a "power user" feature, because bad things can happen if your data is not 100% correctly partitioned in the same way as Flink would partition it. This is why you should typically only use it if the data was partitioned by Flink and you are very sure what you are doing, because there is not really anything we can do at the API level to protect you from mistakes in using this feature. Eventually some runtime exceptions might show you that something is going wrong, but that is not exactly a good user experience.
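
The risk can be illustrated with a small Python model (not Flink code). It compares where an upstream partitioner puts a key with where a key-group based assignment would expect it. Both hash functions are deliberately simple stand-ins, not Kafka's or Flink's real ones, and the sketch assumes that subtask i reads exactly partition i; all names and constants are hypothetical.

```python
MAX_PARALLELISM = 128  # assumed number of key groups for this sketch
PARALLELISM = 4        # assumed number of parallel tasks and of partitions


def expected_subtask(key):
    """Simplified model: key -> key group -> owning subtask."""
    key_group = (key * 13) % MAX_PARALLELISM  # stand-in hash, NOT Flink's
    return key_group * PARALLELISM // MAX_PARALLELISM


def external_partition(key):
    """Stand-in for an upstream partitioner such as a Kafka producer's."""
    return key % PARALLELISM


def misplaced_keys(keys):
    """Keys that arrive at a subtask that does not own their key group."""
    return [k for k in keys if external_partition(k) != expected_subtask(k)]


print(misplaced_keys(range(8)))
```

In this model, six of the eight keys (1, 2, 3, 4, 5 and 7) would arrive at a subtask that does not own them, which is the kind of silent mismatch the warning above is about.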

On a different note, there is currently one open issue [1] to be aware of in connection with this feature and operator chaining, but this is something that should not be hard to fix for the next minor release.

Best,
Stefan

[1] https://issues.apache.org/jira/browse/FLINK-12296?focusedCommentId=16824945&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16824945  

RE: kafka partitions, data locality

Smirnov Sergey Vladimirovich (39833)

Hi Stefan,

 

Thanks for clarifying!

But it still remains an open question for me, because we use the keyBy method and I did not find any public interface for key reassignment (something like partitionCustom for DataStream).

As I heard, there is some internal mechanism with key groups and a mapping of keys to groups. Is it supposed to become public?

 

 

Regards,

Sergey

 

Re: kafka partitions, data locality

Fabian Hueske-2
Hi Sergey,

You are right, keys are managed in key groups. Each key belongs to a key group and one or more key groups are assigned to each parallel task of an operator.
Key groups are not exposed to users and the assignments of keys -> key-groups and key-groups -> tasks cannot be changed without changing Flink itself (i.e., a custom build).
If you don't want to change Flink, you'd need to change the partitioning in Kafka (mapping key-groups to partitions) and ensure that all partitions are read by the correct task.

I don't think this is possible (with reasonable effort) and if you get it to work it would be quite fragile with respect to changing parallelism (Kafka, Flink) etc.
Right now there is no way around partitioning the events with keyBy() if you want to use keyed state.
After the first keyBy() partitioning, reinterpretAsKeyedStream() can be used to reuse an existing partitioning.
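
The two mappings described above (keys -> key-groups and key-groups -> tasks) can be modeled in a few lines of Python. The formulas follow what Flink's KeyGroupRangeAssignment class does as far as I am aware, minus the murmur hash that Flink applies to the key's hashCode() first, so treat this as a simplified model with hypothetical function names and an assumed max parallelism of 128.

```python
def key_group_for(key_hash, max_parallelism):
    """Key -> key group. Flink also scrambles the hash first; omitted here."""
    return key_hash % max_parallelism


def subtask_for(key_group, parallelism, max_parallelism):
    """Key group -> index of the parallel task that owns it."""
    return key_group * parallelism // max_parallelism


def groups_per_subtask(parallelism, max_parallelism):
    """How many key groups each parallel task owns."""
    counts = [0] * parallelism
    for key_group in range(max_parallelism):
        counts[subtask_for(key_group, parallelism, max_parallelism)] += 1
    return counts


MAX_PARALLELISM = 128  # assumed for this sketch
print(groups_per_subtask(4, MAX_PARALLELISM))
print(groups_per_subtask(3, MAX_PARALLELISM))
print(subtask_for(40, 4, MAX_PARALLELISM), subtask_for(40, 3, MAX_PARALLELISM))
```

In this model key group 40 is owned by subtask 1 at parallelism 4 and by subtask 0 at parallelism 3, so any Kafka-side layout tuned to one parallelism stops matching as soon as the parallelism changes.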

Best, Fabian
