Re: streaming join implementation

Posted by Balaji Rajagopalan on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/streaming-join-implementation-tp6095p6098.html

Let me give you a specific example. Say event1 on stream1, with key1, happened within your 0-5 min window, and event2 on stream2, with a key2 that would have matched key1, happened at 5:01, outside the join window. You now have to correlate event2 on stream2 with event1 on stream1, which happened in the previous window. This is the corner case I mentioned before. I am not aware whether Flink can solve this problem for you; that would be nice, instead of solving it in the application.
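
To make the corner case concrete, here is a minimal sketch in plain Scala (no Flink dependency) of how a 5-minute tumbling window assigns events to windows. The names `WindowBoundaryDemo` and `windowStart` are made up for illustration, and the arithmetic assumes non-negative timestamps and windows aligned to time zero.

```scala
object WindowBoundaryDemo {
  // 5-minute tumbling window, in milliseconds.
  val windowSizeMs: Long = 5 * 60 * 1000L

  // Start of the tumbling window that contains a non-negative timestamp.
  // Windows are half-open: [start, start + windowSizeMs).
  def windowStart(timestampMs: Long): Long =
    timestampMs - (timestampMs % windowSizeMs)

  def main(args: Array[String]): Unit = {
    val event1 = 4 * 60 * 1000L + 59 * 1000L // 4:59 on stream1
    val event2 = 5 * 60 * 1000L + 1 * 1000L  // 5:01 on stream2

    println(s"event1 window start: ${windowStart(event1)} ms") // 0
    println(s"event2 window start: ${windowStart(event2)} ms") // 300000
    // The two events are 2 seconds apart but fall into different windows,
    // so a window join never sees them together.
    println(s"same window: ${windowStart(event1) == windowStart(event2)}") // false
  }
}
```

The two events are only two seconds apart, yet the window join treats them as unrelated because each window is evaluated on its own.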

On Thu, Apr 14, 2016 at 12:10 PM, Henry Cai <[hidden email]> wrote:
Thanks Balaji.  Do you mean you spill the non-matching records into Redis after 5 minutes?  Does Flink let you identify which records did not match in the current window, so that you can copy them into long-term storage?



On Wed, Apr 13, 2016 at 11:20 PM, Balaji Rajagopalan <[hidden email]> wrote:
You can implement the join in Flink (it is an inner join) with the pseudocode below. The join uses a 5-minute window. Yes, there will be some corner cases where data arriving after 5 minutes is missed by the join window. I solved this problem by storing some data in Redis and writing correlation logic to handle the corner cases that the join window missed.

val output: DataStream[OutputData] = stream1.join(stream2)
  .where(_.key1).equalTo(_.key2)
  .window(TumblingEventTimeWindows.of(Time.of(5, TimeUnit.MINUTES)))
  .apply(new SomeJoinFunction)
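
The correlation logic for the missed corner cases is not shown in this thread, so the following is only a rough sketch of one way it could look. It uses an in-memory map as a stand-in for Redis, and the names `Event` and `LateMatchBuffer`, the `retentionMs` parameter, and the rule of emitting only cross-window pairs (so that it does not duplicate what the window join already produces) are all assumptions.

```scala
import scala.collection.mutable

// Hypothetical record type; `key` plays the role of key1/key2 in the join above.
final case class Event(key: String, timestampMs: Long, payload: String)

// In-memory stand-in for the external store (Redis in the description above).
class LateMatchBuffer(windowSizeMs: Long, retentionMs: Long) {
  private val left = mutable.Map.empty[String, List[Event]]
  private val right = mutable.Map.empty[String, List[Event]]

  // True when the two events fall into different tumbling windows
  // (non-negative timestamps assumed), i.e. the window join missed the pair.
  private def crossesWindow(a: Event, b: Event): Boolean =
    a.timestampMs / windowSizeMs != b.timestampMs / windowSizeMs

  private def correlates(a: Event, b: Event): Boolean =
    crossesWindow(a, b) && math.abs(a.timestampMs - b.timestampMs) <= retentionMs

  // Buffer a stream1 event and return the pairs it forms with buffered stream2 events.
  def addLeft(e: Event): List[(Event, Event)] = {
    left(e.key) = e :: left.getOrElse(e.key, Nil)
    right.getOrElse(e.key, Nil).filter(correlates(e, _)).map(r => (e, r))
  }

  // Buffer a stream2 event and return the pairs it forms with buffered stream1 events.
  def addRight(e: Event): List[(Event, Event)] = {
    right(e.key) = e :: right.getOrElse(e.key, Nil)
    left.getOrElse(e.key, Nil).filter(correlates(_, e)).map(l => (l, e))
  }

  // Drop buffered events older than the retention period relative to nowMs.
  def evict(nowMs: Long): Unit =
    for (m <- Seq(left, right); k <- m.keys.toList) {
      val kept = m(k).filter(ev => nowMs - ev.timestampMs <= retentionMs)
      if (kept.isEmpty) m.remove(k) else m(k) = kept
    }
}

object LateMatchDemo {
  def main(args: Array[String]): Unit = {
    val buffer = new LateMatchBuffer(windowSizeMs = 300000L, retentionMs = 300000L)
    buffer.addLeft(Event("k", 299000L, "event1"))                 // 4:59, first window
    val pairs = buffer.addRight(Event("k", 301000L, "event2"))    // 5:01, second window
    println(pairs.size) // one cross-window pair recovered
  }
}
```

Calling `evict` periodically keeps the side store from growing without bound; how long to retain records is a trade-off between completeness and storage.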

On Thu, Apr 14, 2016 at 10:02 AM, Henry Cai <[hidden email]> wrote:
Hi,

We are evaluating different streaming platforms.  For a typical join between two streams

SELECT a.*, b.*
FROM a JOIN b
ON a.id = b.id

How does Flink implement the join?  The matching record from either stream can arrive late; we consider it a valid join as long as the event times of records a and b fall on the same day.

I think some streaming platforms (e.g. Google Cloud Dataflow) store the records from both streams in a K/V lookup store and do the lookup later.  Is this how Flink implements the streaming join?

If we need to store all the records in a state store, that's going to be a lot of records for a day.
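
To illustrate the K/V-store idea and the same-day condition, here is a small plain-Scala sketch. It is not how Flink implements joins internally; it only shows the general technique of keeping both sides in keyed state and probing the other side on arrival. `Rec`, `SameDayJoin`, UTC day boundaries, and eviction driven by a watermark are all assumptions made for this example.

```scala
import scala.collection.mutable

// Hypothetical record type for either stream.
final case class Rec(id: String, timestampMs: Long, value: String)

class SameDayJoin {
  private val dayMs = 24L * 60 * 60 * 1000

  // UTC day index of a non-negative epoch-millisecond timestamp.
  private def epochDay(timestampMs: Long): Long = timestampMs / dayMs

  // State for each side, keyed by (id, day): the stand-in for a K/V lookup store.
  private val sideA = mutable.Map.empty[(String, Long), Vector[Rec]]
  private val sideB = mutable.Map.empty[(String, Long), Vector[Rec]]

  // Store a record from stream a and join it with same-day records from stream b.
  def onA(a: Rec): Vector[(Rec, Rec)] = {
    val k = (a.id, epochDay(a.timestampMs))
    sideA(k) = sideA.getOrElse(k, Vector.empty) :+ a
    sideB.getOrElse(k, Vector.empty).map(b => (a, b))
  }

  // Store a record from stream b and join it with same-day records from stream a.
  def onB(b: Rec): Vector[(Rec, Rec)] = {
    val k = (b.id, epochDay(b.timestampMs))
    sideB(k) = sideB.getOrElse(k, Vector.empty) :+ b
    sideA.getOrElse(k, Vector.empty).map(a => (a, b))
  }

  // Drop state for every day strictly before the day of the watermark,
  // which bounds the state to roughly the days still open for late data.
  def evictBefore(watermarkMs: Long): Unit = {
    val minDay = epochDay(watermarkMs)
    sideA.keys.filter(_._2 < minDay).toList.foreach(k => sideA.remove(k))
    sideB.keys.filter(_._2 < minDay).toList.foreach(k => sideB.remove(k))
  }

  // Number of records currently held in state.
  def stateSize: Int = sideA.values.map(_.size).sum + sideB.values.map(_.size).sum
}
```

Because the state key includes the day, the amount held is bounded by the records of the days that have not yet been evicted, which is exactly the storage cost the question is worried about.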