Batch Task Synchronization

Maria Xekalaki
Hi All,

This is more of a general question. How are tasks synchronized in batch execution? If, for example, we ran an iterative pipeline (map1 -> reduce1 -> reduce2 -> map2), and the first two operators (map1->reduce1) were chained, how would reduce2 be notified that map1 -> reduce1 have completed their execution so as to start reading its input data? I noticed that in the driver classes (MapDriver, ChainedReduceDriver etc.) there are input and output counters (numRecordsOut, numRecordsIn). Are these used to check if an operator has consumed all of its data? 

Thank you in advance.

Best Wishes,
Mary 
Re: Batch Task Synchronization

Guowei Ma
Hi Mary,

Flink uses an alignment mechanism for synchronization. At the end of a round, every upstream task (for example, reduce1) sends a message to all of its downstream tasks to signal that it has processed all of its data. Once a downstream task (reduce2) has collected this message from every one of its upstream tasks, it knows that all of its input has arrived, and it can then process that input.
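
To make the idea concrete, here is a minimal sketch in plain Java. It does not use Flink's actual runtime classes: EndOfInputAlignment and its methods are hypothetical names invented for this illustration, and the records are simply buffered in a list to keep the example small. It only shows the counting logic: the downstream task waits until every input channel has delivered its end-of-input message.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration of end-of-input alignment; not a Flink class.
class EndOfInputAlignment {
    private final int numChannels;
    // Channels whose upstream task has already sent its end-of-input message.
    private final BitSet finished;
    // Records received so far (buffered here only for simplicity).
    private final List<String> buffered = new ArrayList<>();

    EndOfInputAlignment(int numChannels) {
        if (numChannels <= 0) {
            throw new IllegalArgumentException("need at least one input channel");
        }
        this.numChannels = numChannels;
        this.finished = new BitSet(numChannels);
    }

    // A data record arrived on the given input channel.
    void onRecord(int channel, String record) {
        checkChannel(channel);
        if (finished.get(channel)) {
            throw new IllegalStateException("record after end-of-input on channel " + channel);
        }
        buffered.add(record);
    }

    // The upstream task behind this channel signals that it has no more data.
    // Returns true once every channel has signaled.
    boolean onEndOfInput(int channel) {
        checkChannel(channel);
        finished.set(channel);
        return allInputsFinished();
    }

    boolean allInputsFinished() {
        return finished.cardinality() == numChannels;
    }

    // Hands out the complete input; only legal after all channels have finished.
    List<String> drain() {
        if (!allInputsFinished()) {
            throw new IllegalStateException("input is not complete yet");
        }
        List<String> out = new ArrayList<>(buffered);
        buffered.clear();
        return out;
    }

    private void checkChannel(int channel) {
        if (channel < 0 || channel >= numChannels) {
            throw new IndexOutOfBoundsException("unknown channel " + channel);
        }
    }

    public static void main(String[] args) {
        // reduce2 with two upstream reduce1 subtasks (channels 0 and 1).
        EndOfInputAlignment reduce2 = new EndOfInputAlignment(2);
        reduce2.onRecord(0, "a");
        reduce2.onRecord(1, "b");
        System.out.println(reduce2.onEndOfInput(0)); // false: channel 1 is still open
        reduce2.onRecord(1, "c");
        System.out.println(reduce2.onEndOfInput(1)); // true: all upstream tasks are done
        System.out.println(reduce2.drain());         // [a, b, c]
    }
}
```

In this sketch, completion is decided by the explicit end-of-input messages, one per input channel, and not by comparing record counts.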

Hope this helps.

Best,
Guowei


On Mon, Apr 19, 2021 at 5:17 PM Maria Xekalaki <[hidden email]> wrote: