coordination of sinks

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

coordination of sinks

Marco Villalobos-2
Given a source that goes into a tumbling window with a process function that yields two side outputs, in addition to the main data stream, is it possible to coordinate the order of completion
of sink 1, sink 2, and sink 3 as data leaves the tumbling window?

source -> tumbling window -> process function -> side output tag 1 -> sink 1                                               \-> side output tag 2 -> sink 2
                                             \-> main stream -> sink 3
                        

sink 1 will create partitions in PostgreSQL for me.
sink 2 will insert data into the partitioned table
sink 3 can happen in any order
but all of them need to finish before the next window fires. 

Any advice will help.
Reply | Threaded
Open this post in threaded view
|

Re: coordination of sinks

Fabian Hueske-2
Hi Marco,

You cannot really synchronize data that is being emitted via different streams (without bringing them together in an operator).

I see two options:

1) emit the event to create the partition and the data to be written into the partition to the same stream. Flink guarantees that records do not overtake records in the same partition. However, you need to ensure that all records remain in the same partition, for example by partitioning on the same ke.
2) emit the records to two different streams but have a CoProcessFunction that processes the create partition and data events. The processing function would just buffer the data events (in state) until it observes the create partition event for which it creates the partitions (in a synchronous fashion). Once the partition is created, it forwards all buffered data and the remaining data.

Hope this helps,
Fabian

Am Sa., 15. Aug. 2020 um 07:45 Uhr schrieb Marco Villalobos <[hidden email]>:
Given a source that goes into a tumbling window with a process function that yields two side outputs, in addition to the main data stream, is it possible to coordinate the order of completion
of sink 1, sink 2, and sink 3 as data leaves the tumbling window?

source -> tumbling window -> process function -> side output tag 1 -> sink 1                                               \-> side output tag 2 -> sink 2
                                             \-> main stream -> sink 3
                        

sink 1 will create partitions in PostgreSQL for me.
sink 2 will insert data into the partitioned table
sink 3 can happen in any order
but all of them need to finish before the next window fires. 

Any advice will help.