Hello all,
I was wondering if someone would be kind enough to enlighten me on a few topics. We are trying to join two streams of data on a key. We were thinking of partitioning the topics in Kafka by that key, but I also saw that Flink can partition data on its own, so I was wondering whether Flink can take advantage of Kafka's partitioning, and which partitioning scheme I should go for.

As for the joins, our two datasets are very large (millions of records in each window) and we need to perform the joins very quickly (in less than 1 second). Would the built-in join mechanism be sufficient for this? From what I understand, it would need to shuffle data and could therefore be slow on large datasets. Is there a way to join via keyed state lookups to avoid the shuffling?

I have read the docs and the blog posts so far, so I have some limited understanding of how Flink works, but no practical experience yet.

Thanks
Hi Alex!

Right now, Flink does not reuse Kafka's partitioning for joins; it shuffles/partitions the data by itself. Flink is very fast at shuffling and adds very little latency on shuffles, so that is usually not an issue. The reason for that design is that we view a streaming program as something dynamic: Kafka partitions may be added or removed during a program's lifetime, and the parallelism (and with that the partitioning scheme) can change as well. With Flink handling the partitioning internally, these cases are all covered.

Concerning the join: the built-in join is definitely able to handle millions of records in a window, and it scales well. It windows the two streams together and joins within the windows. If you want responses within a second, make the window small enough that it is evaluated every 500 ms or so. If you want super-low-latency joins, you can look into using custom state: with that, you could build your own symmetric hash join, for example. That has virtually zero latency, and you can control how long each side keeps its data.

Concerning key lookups vs. shuffling: the shuffle variant is usually much faster, because it uses the network better. The shuffle is fully pipelined, many records are in flight at the same time, and it is optimized for throughput while still keeping latency quite low (http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/). In contrast, key lookups that avoid shuffling usually take some time (a millisecond or so) and limit throughput a lot, because they involve many smaller messages and add even more latency (a round trip between nodes, rather than one way).

Hope that answers your question; let me know if you have more questions!

Greetings,
Stephan

On Fri, Dec 11, 2015 at 4:00 PM, Alex Rovner <[hidden email]> wrote:
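To make the two approaches concrete, here is a minimal sketch (not from the thread itself) of the built-in windowed join described above. It assumes two Tuple2<String, Long> streams joined on the String field with 500 ms processing-time windows; the class names, fields, and in-memory sources are made up for illustration, and in practice the inputs would be Kafka consumers.

    import org.apache.flink.api.common.functions.JoinFunction;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class WindowedJoinSketch {

        // Key selector shared by both sides: join on the String field of the tuple.
        public static class KeyOf implements KeySelector<Tuple2<String, Long>, String> {
            @Override
            public String getKey(Tuple2<String, Long> t) {
                return t.f0;
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical in-memory sources; in a real job these would be Kafka consumers.
            DataStream<Tuple2<String, Long>> left =
                    env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 2L));
            DataStream<Tuple2<String, Long>> right =
                    env.fromElements(Tuple2.of("a", 10L), Tuple2.of("b", 20L));

            // Built-in join: Flink re-partitions both streams by the key and joins
            // matching records inside each 500 ms window, so results are emitted
            // shortly after each window closes.
            left.join(right)
                .where(new KeyOf())
                .equalTo(new KeyOf())
                .window(TumblingProcessingTimeWindows.of(Time.milliseconds(500)))
                .apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
                    @Override
                    public String join(Tuple2<String, Long> l, Tuple2<String, Long> r) {
                        return l.f0 + " -> " + l.f1 + "/" + r.f1;
                    }
                })
                .print();

            env.execute("Windowed join sketch");
        }
    }

And here is one way the custom-state approach could look: a symmetric hash join built on connected keyed streams with a RichCoFlatMapFunction. Each side buffers its records in keyed ListState and probes the other side's buffer as records arrive, so matches are emitted immediately. The names are again hypothetical, and this sketch keeps state forever; a real job would also decide how long each side retains its data (for example by clearing the buffers periodically).

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
    import org.apache.flink.util.Collector;

    // Runs per key: each side's records are buffered in keyed state and
    // immediately probed against the other side's buffer, so a match is
    // emitted as soon as both records have arrived.
    public class SymmetricHashJoin
            extends RichCoFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>, String> {

        private transient ListState<Long> leftBuffer;
        private transient ListState<Long> rightBuffer;

        @Override
        public void open(Configuration parameters) {
            leftBuffer = getRuntimeContext().getListState(
                    new ListStateDescriptor<>("left-buffer", Long.class));
            rightBuffer = getRuntimeContext().getListState(
                    new ListStateDescriptor<>("right-buffer", Long.class));
        }

        @Override
        public void flatMap1(Tuple2<String, Long> left, Collector<String> out) throws Exception {
            leftBuffer.add(left.f1);                    // remember this record for later matches
            Iterable<Long> others = rightBuffer.get();  // probe what the other side has buffered
            if (others != null) {
                for (Long r : others) {
                    out.collect(left.f0 + " -> " + left.f1 + "/" + r);
                }
            }
        }

        @Override
        public void flatMap2(Tuple2<String, Long> right, Collector<String> out) throws Exception {
            rightBuffer.add(right.f1);
            Iterable<Long> others = leftBuffer.get();
            if (others != null) {
                for (Long l : others) {
                    out.collect(right.f0 + " -> " + l + "/" + right.f1);
                }
            }
        }
    }

It would be wired up with something like left.connect(right).keyBy(new KeyOf(), new KeyOf()).flatMap(new SymmetricHashJoin()), so both inputs are partitioned by the same join key before reaching the function.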
Thank you Stephan for the information! On Mon, Dec 14, 2015 at 5:20 AM Stephan Ewen <[hidden email]> wrote:
-- Alex Rovner Director, Data Engineering o: 646.759.0052