Join Dataset in stream

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Join Dataset in stream

eric hoffmann

Hi.
I need to compute an euclidian distance between an input Vector and a full dataset stored in Cassandra and keep the n lowest value. The Cassandra dataset is evolving (mutable). I could do this on a batch job, but i will have to triger it each time and the input are more like a slow stream, but the computing need to be fast can i do this on a stream way? is there any better solution ?
Thx
Reply | Threaded
Open this post in threaded view
|

Re: Join Dataset in stream

Ken Krugler
Hi Eric,

This sounds like a use case for BroadcastProcessFunction  You’d use the Cassandra dataset as the source for the broadcast stream, which is distributed to every parallel instance of your custom BroadcastProcessFunction. The input vectors are a partitioned stream that’s the other input to this function (via its processElement() method). The two streams get connected as a BroadcastConnectedStream.

Note that as of Flink 1.5 it’s also easy to maintain the broadcast state.

— Ken

On Nov 14, 2018, at 11:32 PM, eric hoffmann <[hidden email]> wrote:


Hi.
I need to compute an euclidian distance between an input Vector and a full dataset stored in Cassandra and keep the n lowest value. The Cassandra dataset is evolving (mutable). I could do this on a batch job, but i will have to triger it each time and the input are more like a slow stream, but the computing need to be fast can i do this on a stream way? is there any better solution ?
Thx

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra