Scheduling of GroupByKey and CombinePerKey operations
Posted by Pawel Bartoszek
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Scheduling-of-GroupByKey-and-CombinePerKey-operations-tp17864.html
Can I ask why some operations run on only one slot? I understand that file writes should happen on only one slot, but the GroupByKey operation could be distributed across all slots. I have around 20k distinct keys every minute. Is there any way to break this operator chain?
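To clarify what I mean by breaking the chain: in plain Flink the DataStream API lets you disable chaining globally or start a new chain at a specific operator. I don't know whether the Beam Flink runner exposes an equivalent, which is part of what I'm asking. A rough Flink-only sketch for illustration (the toy topology below is not my pipeline):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Option 1: disable chaining for the whole job, so every operator
        // runs as its own task.
        // env.disableOperatorChaining();

        env.fromElements("a", "b", "c")
                .map(String::toUpperCase)
                // Option 2: break the chain at one point only; this operator is
                // not chained to the upstream map (downstream operators may
                // still chain to it).
                .startNewChain()
                .print();

        env.execute("chaining-example");
    }
}
```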
I noticed that CombinePerKey operations that don't have an IO-related transformation are scheduled across all 32 slots.
My cluster has 32 slots across 2 task managers, running Beam 2.2 and Flink 1.3.2.
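To make the setup concrete, here is a rough sketch of the kind of per-minute aggregation and windowed file write I'm describing. The class name, input data, output path and shard count are placeholders rather than my actual pipeline, and the exact TextIO windowed-write options may need adjusting for the Beam version:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class WindowedAggregateAndWrite {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(Create.of(KV.of("stream-1", 2500L), KV.of("stream-2", 1800L))) // placeholder input
                // 1-minute fixed windows, matching the per-minute key space in the question.
                .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1))))
                // Sum.longsPerKey() runs as a CombinePerKey-style aggregation.
                .apply(Sum.longsPerKey())
                .apply(MapElements.into(TypeDescriptors.strings())
                        .via((KV<String, Long> kv) -> kv.getKey() + "," + kv.getValue()))
                // Windowed file write; the shard count bounds how many subtasks do the writing.
                .apply(TextIO.write()
                        .to("/tmp/bitrate-output") // hypothetical path
                        .withWindowedWrites()
                        .withNumShards(1));

        p.run().waitUntilFinish();
    }
}
```

The write stage of a pipeline like this is what shows up below as a single chained operator in the Flink dashboard: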
Start Time | End Time | Duration | Name | Bytes received | Records received | Bytes sent | Records sent | Parallelism | Tasks | Status
---|---|---|---|---|---|---|---|---|---|---
2018-01-18, 13:56:28 | 2018-01-18, 14:37:14 | 40m 45s | GroupByKey -> ParMultiDo(WriteShardedBundles) -> ParMultiDo(Anonymous) -> xxx.pipeline.output.io.file.WriteWindowToFile-SumPlaybackBitrateResult2/TextIO.Write/WriteFiles/Reshuffle/Window.Into()/Window.Assign.out -> ParMultiDo(ReifyValueTimestamp) -> ToKeyedWorkItem | 149 MB | 333,672 | 70.8 MB | 19 | 32 | 00320000 | RUNNING

Subtasks (only a few of the 32 shown; the rest look like the idle ones):

Start Time | End Time | Duration | Bytes received | Records received | Bytes sent | Records sent | Attempt | Host | Status
---|---|---|---|---|---|---|---|---|---
2018-01-18, 13:56:28 | | 40m 45s | 2.30 MB | 0 | 2.21 MB | 0 | 1 | xxx | RUNNING
2018-01-18, 13:56:28 | | 40m 45s | 2.30 MB | 0 | 2.21 MB | 0 | 1 | xxx | RUNNING
2018-01-18, 13:56:28 | | 40m 45s | 2.30 MB | 0 | 2.21 MB | 0 | 1 | xxx | RUNNING
2018-01-18, 13:56:28 | | 40m 45s | 77.5 MB | 333,683 | 2.21 MB | 20 | 1 | xxx | RUNNING
2018-01-18, 13:56:28 | | 40m 45s | 2.30 MB | 0 | 2.21 MB | 0 | 1 | xxx | RUNNING
2018-01-18, 13:56:28 | | 40m 45s | 2.30 MB | 0 | 2.21 MB | 0 | 1 | xxx | RUNNING
Thanks,
Pawel