Hi guys,
I want to know if it's possible to sort events in a flink data stream. I know I can't sort a stream but is there a way in which I can buffer for a very short time and sort those events before sending it to a data sink. In our scenario we consume from a kafka topic which has multiple partitions but the data in these brokers are not partitioned by a key(its round robin) , for example we want to time order transactions associated with a particular account but since the same account number ends up in different partitions at the source for different transactions we are not able to maintain event time order in our stream processing system since the same account number ends up in different task managers and slots. We do however partition by account number when we send the events to downstream kafka sink so that transactions from the same account number end up in the same partition. This is however not good enough since the events are not sorted at the source. Any ideas for doing this is much appreciated. Best, Vishwas |
Yes, it's possible to sort a stream by the event time timestamps on
the events. This depends on having reliable watermarks -- as events that are late can not be emitted in order. There are several ways to accomplish this. You could, for example, use a ProcessFunction, and implement the sorting yourself, or use a Window, or take advantage of the sorting that's already been implemented in either the SQL or CEP libraries. For example of doing this with windows, see https://stackoverflow.com/questions/58539379/guarantee-of-event-time-order-in-flinkkafkaconsumer. For an implementation using SQL, see https://stackoverflow.com/a/54970489/2000823. For an implementation using the CEP library, including the use of side outputs for late events, see https://github.com/ververica/flink-training-exercises/blob/master/src/main/java/com/ververica/flinktraining/examples/datastream_java/cep/Sort.java. Best, David On Fri, Nov 1, 2019 at 10:21 PM Vishwas Siravara <[hidden email]> wrote: > > Hi guys, > I want to know if it's possible to sort events in a flink data stream. I know I can't sort a stream but is there a way in which I can buffer for a very short time and sort those events before sending it to a data sink. > > In our scenario we consume from a kafka topic which has multiple partitions but the data in these brokers are not partitioned by a key(its round robin) , for example we want to time order transactions associated with a particular account but since the same account number ends up in different partitions at the source for different transactions we are not able to maintain event time order in our stream processing system since the same account number ends up in different task managers and slots. We do however partition by account number when we send the events to downstream kafka sink so that transactions from the same account number end up in the same partition. This is however not good enough since the events are not sorted at the source. > > Any ideas for doing this is much appreciated. > > > Best, > Vishwas |
Free forum by Nabble | Edit this page |