Scheduling of GroupByKey and CombinePerKey operations
Posted by Pawel Bartoszek
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Scheduling-of-GroupByKey-and-CombinePerKey-operations-tp17864.html
Can I ask why some operations run on only one slot? I understand that file writes should happen on only one slot, but the GroupByKey operation could be distributed across all slots. I have around 20k distinct keys every minute. Is there any way to break this operator chain?
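To clarify what I mean by breaking the chain: in plain Flink the DataStream API lets you disable chaining globally or start a new chain at a specific operator. I don't know whether the Beam Flink runner exposes an equivalent, which is part of what I'm asking. A rough Flink-only sketch for illustration (the toy topology below is not my pipeline):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Option 1: disable chaining for the whole job, so every operator
        // runs as its own task.
        // env.disableOperatorChaining();

        env.fromElements("a", "b", "c")
                .map(String::toUpperCase)
                // Option 2: break the chain at one point only; this operator is
                // not chained to the upstream map (downstream operators may
                // still chain to it).
                .startNewChain()
                .print();

        env.execute("chaining-example");
    }
}
```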
I noticed that CombinePerKey operations that don't have an IO-related transformation are scheduled across all 32 slots.
My cluster has 32 slots across 2 task managers, running Beam 2.2 and Flink 1.3.2.
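To make the setup concrete, here is a rough sketch of the kind of per-minute aggregation and windowed file write I'm describing. The class name, input data, output path and shard count are placeholders rather than my actual pipeline, and the exact TextIO windowed-write options may need adjusting for the Beam version:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class WindowedAggregateAndWrite {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(Create.of(KV.of("stream-1", 2500L), KV.of("stream-2", 1800L))) // placeholder input
                // 1-minute fixed windows, matching the per-minute key space in the question.
                .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1))))
                // Sum.longsPerKey() runs as a CombinePerKey-style aggregation.
                .apply(Sum.longsPerKey())
                .apply(MapElements.into(TypeDescriptors.strings())
                        .via((KV<String, Long> kv) -> kv.getKey() + "," + kv.getValue()))
                // Windowed file write; the shard count bounds how many subtasks do the writing.
                .apply(TextIO.write()
                        .to("/tmp/bitrate-output") // hypothetical path
                        .withWindowedWrites()
                        .withNumShards(1));

        p.run().waitUntilFinish();
    }
}
```

The write stage of a pipeline like this is what shows up below as a single chained operator in the Flink dashboard: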
Start Time | End Time | Duration | Name | Bytes received | Records received | Bytes sent | Records sent | Parallelism | Tasks | Status
---|---|---|---|---|---|---|---|---|---|---
2018-01-18, 13:56:28 | 2018-01-18, 14:37:14 | 40m 45s | GroupByKey -> ParMultiDo(WriteShardedBundles) -> ParMultiDo(Anonymous) -> xxx.pipeline.output.io.file.WriteWindowToFile-SumPlaybackBitrateResult2/TextIO.Write/WriteFiles/Reshuffle/Window.Into()/Window.Assign.out -> ParMultiDo(ReifyValueTimestamp) -> ToKeyedWorkItem | 149 MB | 333,672 | 70.8 MB | 19 | 32 | 00320000 | RUNNING

Subtasks (only a few of the 32 shown; the rest look like the idle ones):

Start Time | End Time | Duration | Bytes received | Records received | Bytes sent | Records sent | Attempt | Host | Status
---|---|---|---|---|---|---|---|---|---
2018-01-18, 13:56:28 | | 40m 45s | 2.30 MB | 0 | 2.21 MB | 0 | 1 | xxx | RUNNING
2018-01-18, 13:56:28 | | 40m 45s | 2.30 MB | 0 | 2.21 MB | 0 | 1 | xxx | RUNNING
2018-01-18, 13:56:28 | | 40m 45s | 2.30 MB | 0 | 2.21 MB | 0 | 1 | xxx | RUNNING
2018-01-18, 13:56:28 | | 40m 45s | 77.5 MB | 333,683 | 2.21 MB | 20 | 1 | xxx | RUNNING
2018-01-18, 13:56:28 | | 40m 45s | 2.30 MB | 0 | 2.21 MB | 0 | 1 | xxx | RUNNING
2018-01-18, 13:56:28 | | 40m 45s | 2.30 MB | 0 | 2.21 MB | 0 | 1 | xxx | RUNNING
Thanks,
Pawel