(DEPRECATED) Apache Flink User Mailing List archive.

Even key distribution workload

Navneeth Krishnan

Even key distribution workload

Hi All,

Currently I keyBy user, and I see uneven load distribution because some users generate a very high load while others send very few messages. Is there a recommended way to distribute the workload evenly? Has anyone else encountered this problem, and what was the workaround?

Thanks

Caizhi Weng

Re: Even key distribution workload

Hi Navneeth,

Is it possible for you to first keyBy something other than the user id (for example, the message id), then aggregate the messages of the same user within each keyed stream, and finally aggregate across all the keyed streams to get a per-user result?
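A minimal plain-Java sketch of this two-stage idea, using a per-user message count as the aggregate. It is not Flink code: the class `TwoStageCountSketch`, the `Message` type, the `#` separator, and the bucket count are all assumptions made for illustration. In a real job, stage one would be a keyBy on the combined key (typically with a window so partial results can be emitted), and stage two a keyBy on the user id alone.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoStageCountSketch {
    // Hypothetical message type: only the fields needed for the sketch.
    static final class Message {
        final String userId;
        final long messageId;

        Message(String userId, long messageId) {
            this.userId = userId;
            this.messageId = messageId;
        }
    }

    // Stage 1: key by (userId, messageId mod buckets) so that one heavy user
    // is spread over up to `buckets` keys instead of a single key.
    static Map<String, Long> stageOne(List<Message> messages, int buckets) {
        Map<String, Long> partial = new HashMap<>();
        for (Message m : messages) {
            String spreadKey = m.userId + "#" + Math.floorMod(m.messageId, buckets);
            partial.merge(spreadKey, 1L, Long::sum);
        }
        return partial;
    }

    // Stage 2: drop the bucket suffix and combine the partial counts per user.
    static Map<String, Long> stageTwo(Map<String, Long> partial) {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String spreadKey = e.getKey();
            String userId = spreadKey.substring(0, spreadKey.lastIndexOf('#'));
            totals.merge(userId, e.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Message> messages = new ArrayList<>();
        for (long id = 0; id < 100; id++) {
            messages.add(new Message("heavy", id));
        }
        messages.add(new Message("light", 100));
        System.out.println(stageTwo(stageOne(messages, 4)));
    }
}
```

The second stage still sees every user under a single key, but it receives at most one partial result per bucket rather than every raw message, so the heavy user costs far less there.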

Navneeth Krishnan <[hidden email]> wrote on Monday, July 15, 2019 at 2:38 PM:

Hi All,

Currently I have a keyBy user and I see uneven load distribution since some of the users would have very high load versus some users having very few messages. Is there a recommended way to achieve even distribution of workload? Has someone else encountered this problem and what was the workaround?

Thanks

Biao Liu

Re: Even key distribution workload

In reply to this post by Navneeth Krishnan

Hi Navneeth,

The keyBy semantics require that all records with the same key go to the same task, so this data skew is a direct result of how your data is distributed across keys.
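To make the effect concrete, here is a minimal plain-Java sketch, not Flink code. It assumes a simplified routing rule (hash of the key modulo the parallelism); Flink's actual key assignment is more involved, and the names `KeyRoutingSketch`, `subtaskFor`, and `loadPerSubtask` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KeyRoutingSketch {
    // Simplified stand-in for key-based routing: the same key always maps
    // to the same subtask index. This is an assumption, not Flink's formula.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Counts how many records each subtask would receive under that rule.
    static int[] loadPerSubtask(List<String> userIds, int parallelism) {
        int[] load = new int[parallelism];
        for (String userId : userIds) {
            load[subtaskFor(userId, parallelism)]++;
        }
        return load;
    }

    public static void main(String[] args) {
        List<String> userIds = new ArrayList<>();
        // One heavy user and a few light users.
        for (int i = 0; i < 1000; i++) {
            userIds.add("user-heavy");
        }
        for (String light : new String[] {"user-a", "user-b", "user-c"}) {
            for (int i = 0; i < 10; i++) {
                userIds.add(light);
            }
        }
        System.out.println(Arrays.toString(loadPerSubtask(userIds, 4)));
    }
}
```

Whichever subtask the heavy user hashes to receives all of that user's records, regardless of how high the parallelism is set.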

As far as I know, Flink does not handle data skew very well. There is a proposal for local aggregation that is still under discussion on the dev mailing list. It could alleviate data skew, but I guess it still needs some time.
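The general idea behind local aggregation can be sketched in plain Java as follows. This is only an illustration of the technique under stated assumptions, not the design being discussed on the dev list and not a Flink API: the class `LocalAggregationSketch` and its methods are hypothetical, and each list of user ids stands for the batch of records seen by one upstream task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LocalAggregationSketch {
    // Runs on each upstream task before the shuffle: collapses a batch of
    // records into one partial count per user.
    static Map<String, Long> preAggregate(List<String> batchOfUserIds) {
        Map<String, Long> partial = new HashMap<>();
        for (String userId : batchOfUserIds) {
            partial.merge(userId, 1L, Long::sum);
        }
        return partial;
    }

    // Runs after the keyed shuffle: merges the partial counts per user.
    static Map<String, Long> mergePartials(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(preAggregate(Arrays.asList("heavy", "heavy", "heavy", "light")));
        partials.add(preAggregate(Arrays.asList("heavy", "heavy")));
        System.out.println(mergePartials(partials));
    }
}
```

With this approach the task that owns a heavy key receives one partial result per upstream batch instead of one record per message, which is why it reduces the impact of skew without changing the key.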

As Caizhi mentioned, it is better to work around this in user code for now, for example by redistributing the skewed data.
