DataStream joining without window

Yan Zhou [FDS Science]
It seems like Flink only supports joining DataStreams within the same time window. Why is it restricted in this way?

I think I can implement a TwoInputStreamOperator to join two DataStreams without considering the window. Inside the operator, I would create two states to cache the records of the two streams and join them in the processElement1/processElement2 methods. Should I go ahead with this approach? Are there any performance considerations here? If the concern is that the cache might take a lot of memory, we could introduce a cache policy to reduce its size. Or could we use RocksDB state?

Please advise.

Best
Yan
 
Re: DataStream joining without window

Aljoscha Krettek
Hi,

Yes, using a TwoInputStreamOperator (or even better, a CoProcessFunction, because TwoInputStreamOperator is a low-level interface that might change in the future) is currently the recommended way to implement a stream-stream join.
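
For illustration, the buffering join discussed in this thread can be modeled in plain Java. This is only a sketch of the logic, not Flink code: the class name BufferedJoin, the String keys and values, and the "left|right" output format are all assumptions made for the example. In a real job the two maps would be Flink keyed state inside a CoProcessFunction, and results would be emitted through a Collector rather than returned.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of an unwindowed two-input join. Hypothetical; does not use the Flink API. */
final class BufferedJoin {
    // One buffer per input, keyed by join key (stands in for Flink keyed state).
    private final Map<String, List<String>> left = new HashMap<>();
    private final Map<String, List<String>> right = new HashMap<>();

    /** Record from input 1: buffer it, then pair it with every buffered input-2 record for the key. */
    List<String> processElement1(String key, String value) {
        left.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        List<String> out = new ArrayList<>();
        for (String other : right.getOrDefault(key, Collections.<String>emptyList())) {
            out.add(value + "|" + other);
        }
        return out;
    }

    /** Record from input 2: buffer it, then pair it with every buffered input-1 record for the key. */
    List<String> processElement2(String key, String value) {
        right.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        List<String> out = new ArrayList<>();
        for (String other : left.getOrDefault(key, Collections.<String>emptyList())) {
            out.add(other + "|" + value);
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedJoin join = new BufferedJoin();
        System.out.println(join.processElement1("k1", "a")); // prints []
        System.out.println(join.processElement2("k1", "x")); // prints [a|x]
        System.out.println(join.processElement1("k1", "b")); // prints [b|x]
    }
}
```

Note that both buffers grow without bound in this model, which is why a cleanup policy is needed.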

As you already guessed, you need a policy for cleaning up the state that you hold. You can do this using the timer features of CoProcessFunction.
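
One possible cleanup policy is time-based retention: each buffered record gets an expiry time, and a timer firing at that time drops it. The sketch below models this in plain Java and is not Flink code; the class ExpiringBuffer, its retentionMs parameter, and the method names are assumptions for the example. In a CoProcessFunction the value returned by add would be the timestamp passed to the timer service when registering a timer, and the removal would happen in the function's onTimer callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Plain-Java model of timer-driven state cleanup. Hypothetical; does not use the Flink API. */
final class ExpiringBuffer {
    private final long retentionMs; // assumed retention period per record
    // Expiry time -> values that expire at that time, sorted by expiry.
    private final TreeMap<Long, List<String>> byExpiry = new TreeMap<>();

    ExpiringBuffer(long retentionMs) {
        this.retentionMs = retentionMs;
    }

    /** Buffers a value and returns the time at which its cleanup timer should fire. */
    long add(String value, long timestampMs) {
        long expiry = timestampMs + retentionMs;
        byExpiry.computeIfAbsent(expiry, t -> new ArrayList<>()).add(value);
        return expiry;
    }

    /** Models a timer firing at timeMs: drops every value whose expiry is at or before timeMs. */
    int onTimer(long timeMs) {
        NavigableMap<Long, List<String>> expired = byExpiry.headMap(timeMs, true);
        int removed = 0;
        for (List<String> values : expired.values()) {
            removed += values.size();
        }
        expired.clear(); // the view is backed by byExpiry, so this removes the entries
        return removed;
    }

    /** Returns the values still buffered, ordered by expiry time. */
    List<String> contents() {
        List<String> all = new ArrayList<>();
        for (List<String> values : byExpiry.values()) {
            all.addAll(values);
        }
        return all;
    }

    public static void main(String[] args) {
        ExpiringBuffer buffer = new ExpiringBuffer(1000L);
        buffer.add("a", 0L);    // expires at 1000
        buffer.add("b", 500L);  // expires at 1500
        System.out.println(buffer.onTimer(1000L)); // prints 1
        System.out.println(buffer.contents());     // prints [b]
    }
}
```

The retention period bounds how late a matching record from the other stream can arrive and still be joined, so it trades join completeness against state size.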

Also, if you keep your buffered elements using the Flink state interfaces, you can switch the state backend to the RocksDB backend if you have concerns about the state growing too big.

Best,
Aljoscha

> On 9. Oct 2017, at 23:11, Yan Zhou [FDS Science] <[hidden email]> wrote:
>
> It seems like Flink only supports joining DataStreams within the same time window. Why is it restricted in this way?
>
> I think I can implement a TwoInputStreamOperator to join two DataStreams without considering the window. Inside the operator, I would create two states to cache the records of the two streams and join them in the processElement1/processElement2 methods. Should I go ahead with this approach? Are there any performance considerations here? If the concern is that the cache might take a lot of memory, we could introduce a cache policy to reduce its size. Or could we use RocksDB state?
>
> Please advise.
>
> Best
> Yan
>

Re: DataStream joining without window

Yan Zhou [FDS Science]
Thank you for the reply. It's very helpful.

Best
Yan

On Tue, Oct 10, 2017 at 7:57 AM, Aljoscha Krettek <[hidden email]> wrote:
Hi,

Yes, using a TwoInputStreamOperator (or even better, a CoProcessFunction, because TwoInputStreamOperator is a low-level interface that might change in the future) is currently the recommended way to implement a stream-stream join.

As you already guessed, you need a policy for cleaning up the state that you hold. You can do this using the timer features of CoProcessFunction.

Also, if you keep your buffered elements using the Flink state interfaces, you can switch the state backend to the RocksDB backend if you have concerns about the state growing too big.

Best,
Aljoscha
