Back pressure with multiple joins

classic Classic list List threaded Threaded
4 messages Options
Dan
Reply | Threaded
Open this post in threaded view
|

Back pressure with multiple joins

Dan
Hi!

My goal is to better understand how my code impacts streaming throughput.

I have a streaming job where I join multiple tables (A, B, C, D) using interval joins.

Case 1) If I have 3 joins in the same query, I don't hit back pressure.

SELECT ...
FROM A
LEFT JOIN B
ON...
LEFT JOIN C
ON...
LEFT JOIN D
ON...

Case 2) If I create temporary views for two of the joins (for reuse with another query), I hit back a lot of back pressure.  This is selecting slightly more fields than the first.

CREATE TEMPORARY VIEW `AB`
SELECT ...
FROM A
LEFT JOIN B
...

CREATE TEMPORARY VIEW `ABC`
SELECT ...
FROM AB
LEFT JOIN C
...


Can Temporary Views increase back pressure?

If A, B, C and D are roughly the same size (fake data), does the join order matter?  E.g. I assume reducing the size of the columns in each join stage would help.

Thanks!
- Dan


Dan
Reply | Threaded
Open this post in threaded view
|

Re: Back pressure with multiple joins

Dan
When I use DataStream and implement the join myself, I can get 50x the throughput.  I assume I'm doing something wrong with Flink's Table API and SQL interface.

On Tue, Sep 22, 2020 at 11:21 PM Dan Hill <[hidden email]> wrote:
Hi!

My goal is to better understand how my code impacts streaming throughput.

I have a streaming job where I join multiple tables (A, B, C, D) using interval joins.

Case 1) If I have 3 joins in the same query, I don't hit back pressure.

SELECT ...
FROM A
LEFT JOIN B
ON...
LEFT JOIN C
ON...
LEFT JOIN D
ON...

Case 2) If I create temporary views for two of the joins (for reuse with another query), I hit back a lot of back pressure.  This is selecting slightly more fields than the first.

CREATE TEMPORARY VIEW `AB`
SELECT ...
FROM A
LEFT JOIN B
...

CREATE TEMPORARY VIEW `ABC`
SELECT ...
FROM AB
LEFT JOIN C
...


Can Temporary Views increase back pressure?

If A, B, C and D are roughly the same size (fake data), does the join order matter?  E.g. I assume reducing the size of the columns in each join stage would help.

Thanks!
- Dan


Reply | Threaded
Open this post in threaded view
|

Re: Back pressure with multiple joins

Timo Walther
Hi Dan,

could you share the plan with us using `TableEnvironment.explainSql()`
for both queries?

In general, views should not have an impact on the performance. They are
a logical concept that gives a bunch of operations a name. The contained
operations are inlined into the bigger query during optimization.

Unless you execute multiple queries in a StatementSet, the the data is
read twice from the source. How do you execute the SQL stataments?

For deterministic plans, join reordering is disabled by default. You can
set it via:

org.apache.flink.table.api.config.OptimizerConfigOptions#TABLE_OPTIMIZER_JOIN_REORDER_ENABLED

Regards,
Timo

On 23.09.20 08:23, Dan Hill wrote:

> When I use DataStream and implement the join myself, I can get 50x the
> throughput.  I assume I'm doing something wrong with Flink's Table API
> and SQL interface.
>
> On Tue, Sep 22, 2020 at 11:21 PM Dan Hill <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Hi!
>
>     My goal is to better understand how my code impacts streaming
>     throughput.
>
>     I have a streaming job where I join multiple tables (A, B, C, D)
>     using interval joins.
>
>     Case 1) If I have 3 joins in the same query, I don't hit back pressure.
>
>         SELECT ...
>         FROM A
>         LEFT JOIN B
>         ON...
>         LEFT JOIN C
>         ON...
>         LEFT JOIN D
>         ON...
>
>
>     Case 2) If I create temporary views for two of the joins (for reuse
>     with another query), I hit back a lot of back pressure.  This is
>     selecting slightly more fields than the first.
>
>         CREATE TEMPORARY VIEW `AB`
>
>         SELECT ...
>         FROM A
>         LEFT JOIN B
>         ...
>
>         CREATE TEMPORARY VIEW `ABC`
>         SELECT ...
>         FROM AB
>         LEFT JOIN C
>         ...
>
>
>
>     Can Temporary Views increase back pressure?
>
>     If A, B, C and D are roughly the same size (fake data), does the
>     join order matter?  E.g. I assume reducing the size of the columns
>     in each join stage would help.
>
>     Thanks!
>     - Dan
>
>

Dan
Reply | Threaded
Open this post in threaded view
|

Re: Back pressure with multiple joins

Dan
I can't reproduce the issue now.  I'm using the same commits so I'm guessing another environment factor is at play.  The explains are pretty similar (there's an extra LogicalProject when there is an extra view).

On Fri, Sep 25, 2020 at 12:55 AM Timo Walther <[hidden email]> wrote:
Hi Dan,

could you share the plan with us using `TableEnvironment.explainSql()`
for both queries?

In general, views should not have an impact on the performance. They are
a logical concept that gives a bunch of operations a name. The contained
operations are inlined into the bigger query during optimization.

Unless you execute multiple queries in a StatementSet, the the data is
read twice from the source. How do you execute the SQL stataments?

For deterministic plans, join reordering is disabled by default. You can
set it via:

org.apache.flink.table.api.config.OptimizerConfigOptions#TABLE_OPTIMIZER_JOIN_REORDER_ENABLED

Regards,
Timo

On 23.09.20 08:23, Dan Hill wrote:
> When I use DataStream and implement the join myself, I can get 50x the
> throughput.  I assume I'm doing something wrong with Flink's Table API
> and SQL interface.
>
> On Tue, Sep 22, 2020 at 11:21 PM Dan Hill <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Hi!
>
>     My goal is to better understand how my code impacts streaming
>     throughput.
>
>     I have a streaming job where I join multiple tables (A, B, C, D)
>     using interval joins.
>
>     Case 1) If I have 3 joins in the same query, I don't hit back pressure.
>
>         SELECT ...
>         FROM A
>         LEFT JOIN B
>         ON...
>         LEFT JOIN C
>         ON...
>         LEFT JOIN D
>         ON...
>
>
>     Case 2) If I create temporary views for two of the joins (for reuse
>     with another query), I hit back a lot of back pressure.  This is
>     selecting slightly more fields than the first.
>
>         CREATE TEMPORARY VIEW `AB`
>
>         SELECT ...
>         FROM A
>         LEFT JOIN B
>         ...
>
>         CREATE TEMPORARY VIEW `ABC`
>         SELECT ...
>         FROM AB
>         LEFT JOIN C
>         ...
>
>
>
>     Can Temporary Views increase back pressure?
>
>     If A, B, C and D are roughly the same size (fake data), does the
>     join order matter?  E.g. I assume reducing the size of the columns
>     in each join stage would help.
>
>     Thanks!
>     - Dan
>
>