Re: How to stream intermediate data that is stored in external storage?

Posted by kant kodali on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/How-to-stream-intermediate-data-that-is-stored-in-external-storage-tp30748p30824.html

Hi Huyen,

That is not my problem statement. If it were just ingesting from A to B, I am sure there are enough tutorials for me to get it done. I also feel like the more I elaborate, the more confusing it gets, and I am not sure why.

I want to join two streams and I want to see/query the results of that join in real-time but I also have some constraints.

I come from the Spark world. In Spark, if I do an inner join of two streams A & B, then I can see results only when there are matching rows between A & B (by definition of inner join this makes sense). But if I do an outer join of two streams A & B, I need to specify a time constraint, and only when that time has fully elapsed can I see the rows that did not match. This means if my time constraint is one hour, I cannot query the intermediate results until the hour has passed, and this is not the behavior I want. I want to be able to run all sorts of SQL queries on the intermediate results.
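To make the Spark behavior I'm describing concrete, here is a toy sketch in plain Python (no Spark involved; `interval_outer_join` and its shape are made up purely for illustration of the semantics):

```python
from datetime import datetime, timedelta

def interval_outer_join(events_a, events_b, watermark, bound):
    """Simulate a stream-stream left outer join with a time bound.

    Matched rows are emitted as soon as both sides have arrived, but an
    unmatched row from A is only emitted (with a NULL right side) once
    the watermark passes event_time + bound -- until then it is held in
    state and is invisible to downstream queries.
    """
    b_keys = {key for key, _t in events_b}
    results = []
    for key, t in events_a:
        if key in b_keys:
            results.append((key, "matched"))          # visible immediately
        elif watermark >= t + bound:
            results.append((key, "unmatched->NULL"))  # visible only after the bound
        # else: still buffered in state, not queryable downstream
    return results

t0 = datetime(2019, 10, 31, 0, 0)
a = [("k1", t0), ("k2", t0)]
b = [("k1", t0)]
bound = timedelta(hours=1)

# Before the one-hour bound elapses, only the matched row is visible:
early = interval_outer_join(a, b, watermark=t0 + timedelta(minutes=30), bound=bound)
# After the bound elapses, the unmatched row finally shows up:
late = interval_outer_join(a, b, watermark=t0 + timedelta(hours=2), bound=bound)
```

The `early` result contains only `("k1", "matched")`; `("k2", "unmatched->NULL")` appears only in `late`. That one-hour invisibility window is exactly what I want to avoid.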

I like Flink's idea of externalizing the state, so that I don't have to worry about memory, but I am also trying to avoid writing a separate microservice that polls and displays the intermediate results of the join in real time. Instead, I am trying to see if there is a way to treat those constantly evolving intermediate results as a streaming source, maybe apply some more transformations, and push out to another sink.
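In other words, I'd like the join state itself to behave like a changelog stream. A toy sketch of the idea (plain Python, nothing Flink-specific; the function and record shapes are invented for illustration):

```python
def join_with_changelog(stream):
    """Toy join whose evolving intermediate state is itself a stream:
    every update to the state is emitted as a changelog record that a
    downstream sink could consume directly, instead of an external
    service polling the state store."""
    state = {}        # key -> (left_value, right_value)
    changelog = []    # the "stream" of intermediate-state updates
    for side, key, value in stream:
        left, right = state.get(key, (None, None))
        if side == "A":
            left = value
        else:
            right = value
        state[key] = (left, right)
        # Emit even half-joined rows, so A rows with no B match yet
        # are already visible downstream.
        changelog.append((key, left, right))
    return changelog

events = [("A", "k1", "a1"), ("B", "k1", "b1"), ("A", "k2", "a2")]
log = join_with_changelog(events)
```

Here `log` includes `("k1", "a1", None)` and `("k2", "a2", None)`, i.e. the not-yet-matched rows, and later the fully joined `("k1", "a1", "b1")` as an update. That evolving changelog is what I'd like to query and transform further.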

Hope that makes sense.

Thanks,
Kant



On Thu, Oct 31, 2019 at 2:43 AM Huyen Levan <[hidden email]> wrote:
Hi Kant,

So your problem statement is "ingest 2 streams into a data warehouse". The main component of the solution, from my view, is the SQL server. You can have a sink function that inserts records from your two streams into two different tables (A and B), or upserts them into one single table C. That upsert action itself serves as a join function; there's no need to join in Flink at all.
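A minimal sketch of that upsert-as-join idea, with a plain Python dict standing in for table C (the column names and helper are hypothetical, just to show the mechanics):

```python
def upsert(table, key, column, value):
    """Upsert into a single wide table C: insert the row if the key is
    new, otherwise fill in the column coming from the other stream.
    The merged row IS the join -- no join operator needed upstream."""
    row = table.setdefault(key, {"a": None, "b": None})
    row[column] = value

table_c = {}
upsert(table_c, "k1", "a", "from-stream-A")  # row exists with only A's half
upsert(table_c, "k1", "b", "from-stream-B")  # B's half arrives; row is now "joined"
```

Because the half-filled row is sitting in table C from the moment the first stream touches it, it is queryable immediately, with no time bound to wait out.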

There are many tools out there that can be used for that ingestion. Flink, of course, can be used for that purpose, but for me it's overkill.

Regards,
Averell

On Thu, 31 Oct. 2019, 8:19 pm kant kodali, <[hidden email]> wrote:
Hi Averell,

Yes, I want to run ad-hoc SQL queries on the joined data as well as on data that may join in the future. For example, take datasets A and B in streaming mode: a row in A can join with a row in B at some time in the future, but meanwhile, if I query the intermediate state using SQL, I want the rows in A that have not yet joined with B to also be available to query. So not just the joined results alone, but also data that might join in the future, since it's all streaming.

Thanks!