(DEPRECATED) Apache Flink User Mailing List archive.

UniqueKey constraint is lost with multiple sources join in SQL

Kai Fu

UniqueKey constraint is lost with multiple sources join in SQL

Hi team,

We have a use case that joins multiple data sources to generate a continuously updated view. We defined primary key constraints on all the input sources, and each key is a subset of the columns in the join condition. All joins are left joins.

In our case, the first two inputs produce the JoinKeyContainsUniqueKey input spec, which is good and performant. However, the third input source is joined with the intermediate output table of the first two inputs, and that intermediate table does not carry key constraint information (although the third source table does), so the join ends up with a NoUniqueKey input spec. Given that NoUniqueKey inputs have dramatic performance implications, per the "Force Join Unique Key" email thread, we want to know whether there is any mitigation plan for this.
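To make the three input specs concrete, here is a minimal Python sketch of how a join input could be classified from its unique keys and the join key. It is a toy model under stated assumptions, not Flink's planner code: the function `classify_input_spec`, the column name `id`, and the three-table setup are hypothetical and chosen only to mirror the situation described above.

```python
from typing import FrozenSet, Iterable, Optional


def classify_input_spec(
    unique_keys: Optional[Iterable[FrozenSet[str]]],
    join_key: FrozenSet[str],
) -> str:
    """Toy classification of one join input (hypothetical helper)."""
    keys = list(unique_keys or [])
    if not keys:
        # No unique key is known for this input.
        return "NoUniqueKey"
    if any(uk <= join_key for uk in keys):
        # Some unique key is fully covered by the join key columns.
        return "JoinKeyContainsUniqueKey"
    # A unique key exists but is not covered by the join key.
    return "HasUniqueKey"


# Assumption: every source table has primary key `id` and is joined on `id`.
join_key = frozenset({"id"})
pk = [frozenset({"id"})]

# First join: both source inputs carry their primary key.
first_left = classify_input_spec(pk, join_key)
first_right = classify_input_spec(pk, join_key)

# Second join: the intermediate result has lost its key information,
# which is the behavior reported in this thread.
second_left = classify_input_spec(None, join_key)
second_right = classify_input_spec(pk, join_key)

print(first_left, first_right, second_left, second_right)
```

In this sketch only the left side of the second join falls back to NoUniqueKey, even though its rows are still logically unique on `id`. If the key information were propagated through the first join, that side would be classified the same way as the source tables.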

One solution I can come up with is to write the intermediate result to somewhere like Kafka with a unique constraint and then join it with the third source, but that requires extra resources. Any other suggestions? Thanks.

Best regards,

- Kai

Kai Fu

Re: UniqueKey constraint is lost with multiple sources join in SQL

As identified with the community, this is a bug. More information is in the issue https://issues.apache.org/jira/browse/FLINK-22113

On Sat, Apr 3, 2021 at 8:43 PM Kai Fu <[hidden email]> wrote:

Hi team,

We have a use case that joins multiple data sources to generate a continuously updated view. We defined primary key constraints on all the input sources, and each key is a subset of the columns in the join condition. All joins are left joins.

In our case, the first two inputs produce the JoinKeyContainsUniqueKey input spec, which is good and performant. However, the third input source is joined with the intermediate output table of the first two inputs, and that intermediate table does not carry key constraint information (although the third source table does), so the join ends up with a NoUniqueKey input spec. Given that NoUniqueKey inputs have dramatic performance implications, per the "Force Join Unique Key" email thread, we want to know whether there is any mitigation plan for this.

One solution I can come up with is to write the intermediate result to somewhere like Kafka with a unique constraint and then join it with the third source, but that requires extra resources. Any other suggestions? Thanks.

--
Best regards,
- Kai

Best wishes,

- Kai