I am running a simple streaming job that uses keyBy on the stream to process events per key. The number of keys may vary (from a few keys to thousands). I have noticed a behavior in Flink that I would like clarified. When we use keyBy on the stream, Flink assigns keys to parallel operators so that each operator can handle events per key independently. Once a key is assigned to an operator, can it later be moved to a different operator? From what I've seen, the answer is no.
For example, let's assume that keys 1 and 2 are assigned to operator A and keys 3 and 4 are assigned to operator B. If there is a burst of data for key 1 at some later point in time, while keys 2, 3 and 4 have only a little data, will key 2 be reassigned to operator B to balance the load? If not, is there a way to do that? And if there is not, why doesn't Flink do it? |
Hi,
With keyBy, Flink ensures that all records with the same key are sent to the same operator; otherwise it would not be possible to process them as a whole. Whether it is acceptable for another operator to process part of that set of records depends on your use case. You can implement your own partitioning strategy to split your data more evenly (https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#physical-partitioning). But this depends on your knowledge of the key space and the load; Flink cannot know this in advance. I hope that helps.
Regards, Timo
(In reply to Sonex, 20/03/17 15:29.) |
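To make the point about custom partitioning concrete, here is a minimal sketch in plain Python. It is not Flink code and does not use the Flink API: `hash_partition` is only a stand-in for a stateless, hash-based key assignment, and `load_aware_assignment` is a hypothetical greedy strategy that assumes the load per key is known up front. The numbers for the four keys are made up to match the example in the question.

```python
import zlib
from collections import Counter


def hash_partition(key, parallelism):
    """Stateless, deterministic key -> operator index (stand-in for a hash-based keyBy)."""
    return zlib.crc32(str(key).encode("utf-8")) % parallelism


def load_aware_assignment(expected_load, parallelism):
    """Greedy strategy: heaviest keys first, each one to the currently least-loaded operator."""
    totals = [0] * parallelism
    assignment = {}
    ordered = sorted(expected_load.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, load in ordered:
        target = min(range(parallelism), key=lambda i: totals[i])
        assignment[key] = target
        totals[target] += load
    return assignment, totals


def per_operator_load(events, route):
    """Count how many events each operator index receives under a routing function."""
    load = Counter()
    for key in events:
        load[route(key)] += 1
    return load


# Assumed workload: key 1 is bursty, keys 2-4 are quiet.
expected_load = {1: 1000, 2: 10, 3: 10, 4: 10}
events = [key for key, n in expected_load.items() for _ in range(n)]

hashed = per_operator_load(events, lambda k: hash_partition(k, 2))
assignment, totals = load_aware_assignment(expected_load, 2)
custom = per_operator_load(events, lambda k: assignment[k])

print("hash-based load per operator:", dict(hashed))
print("load-aware assignment:", assignment, "totals:", totals)
```

Two things show up in this sketch. The hash-based routing depends only on the key, so a burst on key 1 cannot change where key 2 goes. And even the load-aware variant can only isolate the hot key on its own operator; it cannot split key 1 across operators, because every key still maps to exactly one target.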
Thanks for your response.
When using time windows, doesn't Flink know the load per window? I have observed this behavior with windows as well. |
I think it depends very much on your use case. Maybe you can use a combiner first to reduce the number of records per key. Maybe you can explain your application a bit more (which window, what type of aggregations). It often helps, for example, to introduce an artificial key and merge the results of multiple windows in a downstream operator.
(In reply to Sonex, 20/03/17 18:25.) |
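The artificial-key idea can be sketched as a two-stage aggregation. Again this is plain Python rather than Flink code, and the names `salted_key`, `pre_aggregate` and `merge_partials` are hypothetical. The sketch assumes the aggregation is a sum; the same pattern only works for aggregations whose partial results can be merged afterwards (sum, count, min, max and similar).

```python
import random
from collections import defaultdict


def salted_key(key, num_salts, rng):
    """Attach a random salt so one hot key is spread over num_salts artificial keys."""
    return (key, rng.randrange(num_salts))


def pre_aggregate(events, num_salts, rng):
    """Stage 1: aggregate per (key, salt); each artificial key could run on a different operator."""
    partial = defaultdict(int)
    for key, value in events:
        partial[salted_key(key, num_salts, rng)] += value
    return partial


def merge_partials(partial):
    """Stage 2: drop the salt and merge the partial sums per original key."""
    final = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal
    return dict(final)


# Assumed workload: key 1 is hot, keys 2-4 are quiet; every event carries the value 1.
events = [(1, 1)] * 1000 + [(2, 1)] * 10 + [(3, 1)] * 10 + [(4, 1)] * 10

partial = pre_aggregate(events, num_salts=4, rng=random.Random(0))
result = merge_partials(partial)
print(result)
```

The downstream merge step sees at most `num_salts` partial results per key and window instead of every raw record, so the hot key's heavy work is shared while the final result per key is still produced in one place.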