Hello all!
I'm trying to design a stream pipeline, and have trouble controlling when a JOIN is triggering an update: Setup:
SELECT * FROM Event e LEFT JOIN DimensionAtJoinTime1 d1 ON e.uid = d1.uid LEFT JOIN DimensionAtJoinTime2 d2 ON e.uid = d2.uid The DimensionAtJoinTimeX Tables being the result of earlier stream processing, possibly from the same Event table: SELECT uid, hop_start(...), sum(...) FROM Event e GROUP BY uid, hop(...) The Event Table being: SELECT ... FROM EventRawInput i WHERE i.some_field = 'some_value' Requirements:
Cheers, and a happy new year to all, Benoît [1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/streaming/temporal_tables.html#temporal-table [2] https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/streaming/joins.html#processing-time-temporal-joins [3] https://issues.apache.org/jira/browse/FLINK-15112 [4] https://issues.apache.org/jira/browse/FLINK-14200 [5] https://arxiv.org/pdf/1905.12133.pdf |
Hi Benoît, Before discussing all the options you listed, I'd like understand more about your requirements. The part I don't fully understand is, both your fact (Event) and dimension (DimensionAtJoinTimeX) tables are coming from the same table, Event or EventRawInput in your case. So it will result that both your fact and dimension tables are changing with time. My understanding is, when your DimensionAtJoinTimeX table emit the results, you don't want to change the result again. You want the fact table only join whatever data currently the dimension table have? I'm asking because your dimension table was calculated with a window aggregation, but your join logic seems doesn't care about the time attribute (LEFT JOIN DimensionAtJoinTime1 d1 ON e.uid = d1.uid). It's possible that when a record with uid=x comes from Event table, but the dimension table doesn't have any data around uid=x yet due to the window aggregation. In this case, you don't want them to join? Best, Kurt On Fri, Jan 3, 2020 at 1:11 AM Benoît Paris <[hidden email]> wrote:
|
Hi Kurt, Thank you for your answer. Yes both fact tables and dimension tables are changing over time; it was to illustrate that they could change at the same time but that we could still make a JOIN basically ignore updates from one specified side. The SQL is not the actual one I'm using, and as you have said later on, I indeed don't deal with a time attribute and just want what's in the table at processing time. At the moment my problem seems to be in good way of being resolved, and it is going to be Option 4: "LATERAL TABLE table_function" on the Blink planner; as Jark Wu seems to be -elegantly- providing a patch for the FLINK-14200 NPE bug: It was indeed about shenanigans on finding the proper RelOptSchema; Ah, I wish I had dived sooner in the source code, and I could have had the pleasure opportunity to contribute to the Flink codebase. Anyway, shout out to Jark for resolving the bug and providing a patch! I believe this will be a real enabler for CQRS architectures on Flink (we had subscriptions with regular joins, and this patch enables querying the same thing with very minor SQL modifications) Kind regards Benoît On Sat, Jan 4, 2020 at 4:22 AM Kurt Young <[hidden email]> wrote:
|
Good to hear that the patch resolved your issue, looking forward to hearing more feedback from you! Best, Kurt On Mon, Jan 6, 2020 at 5:56 AM Benoît Paris <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |