Hi Marco,
It basically works like this for windows:
- For any incoming record, calculate the respective window based on the event timestamp (ts). Let's assume a tumbling window for now, then we calculate by ts / window size (simplified).
- This means that at any given time, there could be an arbitrary number of open windows.
- Now a watermark comes in. The watermark tells the window operator: no elements with a lower ts can come at this point.
- The window operator closes all windows before that point in time. For each window: the aggregations are performed and the output is generated. The window is usually evicted at this point in time.
- A record coming after the respective window is closed is discarded.
I don't quite understand if you have 2 process functions or 1 process function with a window, but note that any process function directly after a window is merged into one task. So aggregation happens right away. This ProcessWindowFunction gets an Iterable of all elements. The elements are sorted by their arrival time. If you want to receive the bounds, you can either select first and last on an ordered stream or min/max.
On Mon, Apr 19, 2021 at 9:24 PM Marco Villalobos <
[hidden email]> wrote:
I have a tumbling window that aggregates into a process window function. Downstream there is a keyed process function.
[window aggregate into process function] -> keyed process function
I am not quite sure how the keyed process knows which elements are at the boundary of the window. Is there a means to communicate that?
Are watermarks the means by which we signal that either processing time or event time has finished an interval?
Is it a watermark, that can be used to signal to the downstream operators the demarcating events?
Are there other ways to do that?