UniqueKey constraint is lost with multiple sources join in SQL

Kai Fu
Hi team,

We have a use case that joins multiple data sources to generate a continuously updated view. We defined a primary key constraint on every input source, and each key is a subset of the join condition. All joins are left joins.

In our case, the first two inputs produce the JoinKeyContainsUniqueKey input spec, which is good and performant. However, the third input source is joined with the intermediate output of the first two input tables, and that intermediate table does not carry key constraint information (although the third source table does), so the join ends up with a NoUniqueKey input spec. Given that NoUniqueKey inputs have dramatic performance implications, per the "Force Join Unique Key" email thread, we would like to know whether there is any mitigation plan for this.
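
For illustration, the shape of the query described above might look like the following Flink SQL sketch. The table names (a, b, c), the columns, and the datagen connector are placeholders invented for this example, not the actual job; the only point is that every source declares a primary key and every join condition contains that key.

```sql
-- Hypothetical sources: names, columns, and connector are assumptions.
CREATE TABLE a (
  id BIGINT,
  a_val STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH ('connector' = 'datagen');

CREATE TABLE b (
  id BIGINT,
  b_val STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH ('connector' = 'datagen');

CREATE TABLE c (
  id BIGINT,
  c_val STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH ('connector' = 'datagen');

-- First join: both inputs are keyed by id, and id is the join key.
-- Second join: the left input is the result of (a LEFT JOIN b), which
-- per this report no longer carries unique key information.
EXPLAIN PLAN FOR
SELECT a.id, a.a_val, b.b_val, c.c_val
FROM a
LEFT JOIN b ON a.id = b.id
LEFT JOIN c ON a.id = c.id;
```

The join nodes in the printed plan should show the input specs chosen for each side, which is one way to check whether a join fell back to NoUniqueKey.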

One solution I can come up with is to write the intermediate result to somewhere like Kafka with a unique constraint and then join it with the third source, but that requires extra resources. Any other suggestions? Thanks.
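
A rough sketch of that workaround, assuming three hypothetical source tables a, b, and c are already registered with primary key id. The intermediate table name, topic, broker address, and formats below are placeholders, and the upsert-kafka connector is one possible choice rather than the only one.

```sql
-- Hypothetical intermediate table that re-declares the key explicitly.
CREATE TABLE ab_intermediate (
  id BIGINT,
  a_val STRING,
  b_val STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'ab-intermediate',                          -- assumed topic name
  'properties.bootstrap.servers' = 'localhost:9092',    -- assumed broker
  'key.format' = 'json',
  'value.format' = 'json'
);

-- Job 1: materialize the first join into the keyed intermediate table.
INSERT INTO ab_intermediate
SELECT a.id, a.a_val, b.b_val
FROM a
LEFT JOIN b ON a.id = b.id;

-- Job 2: join the keyed intermediate table with the third source.
SELECT ab.id, ab.a_val, ab.b_val, c.c_val
FROM ab_intermediate AS ab
LEFT JOIN c ON ab.id = c.id;
```

The cost is the one already noted: an extra topic plus an additional write and read of the intermediate result.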

--
Best regards,
- Kai

Re: UniqueKey constraint is lost with multiple sources join in SQL

Kai Fu
As identified with the community, this is a bug; more information is in the issue https://issues.apache.org/jira/browse/FLINK-22113

On Sat, Apr 3, 2021 at 8:43 PM Kai Fu <[hidden email]> wrote:
Hi team,

We have a use case that joins multiple data sources to generate a continuously updated view. We defined a primary key constraint on every input source, and each key is a subset of the join condition. All joins are left joins.

In our case, the first two inputs produce the JoinKeyContainsUniqueKey input spec, which is good and performant. However, the third input source is joined with the intermediate output of the first two input tables, and that intermediate table does not carry key constraint information (although the third source table does), so the join ends up with a NoUniqueKey input spec. Given that NoUniqueKey inputs have dramatic performance implications, per the "Force Join Unique Key" email thread, we would like to know whether there is any mitigation plan for this.

One solution I can come up with is to write the intermediate result to somewhere like Kafka with a unique constraint and then join it with the third source, but that requires extra resources. Any other suggestions? Thanks.

--
Best regards,
- Kai


--
Best wishes,
- Kai