I'm not sure how to express my logic simply where early triggers are a necessity. My application has large windows (~2 weeks) where early triggering is absolutely required. At the same time, most of my application's logic is relatively simple and can be expressed in SQL. There's a ton of duplication, like the following:

```
SELECT A, B, C,
       COUNT(*) FILTER (WHERE my_condition) AS total_conditions,
       COUNT(*) AS total,
       -- * 1.0 avoids integer division
       ROUND(COUNT(*) FILTER (WHERE my_condition) * 1.0 / COUNT(*), 1) AS condition_rate,
       AVG(D), AVG(E), AVG(F)
FROM foo
GROUP BY A, B, C, SESSION(...)
```

Just imagine these kinds of queries duplicated a ton, varying only which fields are being averaged and grouped by. This is fairly easy to do with SQL, with some copying and pasting. Just Ctrl+F-ing to give an idea (so far):

- COUNT: 50
- AVG: 27
- GROUP BY: 12

Since Flink doesn't support GROUPING SETS for streaming, I'll actually need to duplicate many of these queries, so this is an underestimate.

Is writing an absolute ton of custom AggregateFunction boilerplate the only way to solve this problem? Is there no way to abstract this while maintaining early triggers? I feel like I'm missing something. Is Flink SQL streaming only for short windows where triggering only at the end of the window is acceptable?

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Hi,

I don't think that (the current state of) Flink SQL is a good fit for your requirements. Each query will be executed as an independent job, so there won't be any sharing of intermediate results. You can do some of this manually if you use the Table API, but even then it won't allow for early results.

I'd recommend using the DataStream API and some parameterizable aggregation functions.

Best, Fabian

On Fri, Apr 26, 2019 at 8:49 PM deklanw <[hidden email]> wrote:
> I'm not sure how to express my logic simply where early triggers are a [...]
Hi,
Thanks for the reply. I had already almost completely lost hope in using Flink SQL, and you have confirmed that. But, like I said, I don't know how to reduce the large amount of boilerplate I foresee this requiring with the DataStream API. Can you help me with that? You mention "parameterizable aggregation functions"; can you show an example? I don't know how to do this without reinventing AVG and COUNT over and over again.

Thanks
You could implement aggregation functions that just do AVG, COUNT, etc., and a parameterizable aggregation function that can be configured to call the avg, count, etc. functions. When configuring, you would specify the input and output, for example like this:

```
input:  [int, int, double]
key:    input.1
output: [key, count(*), sum(input.2), avg(input.3)]
```

This is pretty much what the SQL optimizer would do when translating the relational aggregation operator into a Flink function.

On Mon, Apr 29, 2019 at 6:43 PM deklanw <[hidden email]> wrote:
> Hi, [...]
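The configuration idea above can be sketched in plain Java. This is a minimal standalone illustration, not Flink API: the names `FieldAgg`, `RowAggregator`, etc. are invented here, and the shape is only loosely modeled on Flink's `AggregateFunction` (create accumulator, add, get result):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// One reusable mini-aggregate per SQL function (COUNT, SUM, AVG, ...).
interface FieldAgg {
    void add(double v);
    double result();
}

class CountAgg implements FieldAgg {
    private long n;
    @Override public void add(double v) { n++; }
    @Override public double result() { return n; }
}

class SumAgg implements FieldAgg {
    private double s;
    @Override public void add(double v) { s += v; }
    @Override public double result() { return s; }
}

class AvgAgg implements FieldAgg {
    private double s;
    private long n;
    @Override public void add(double v) { s += v; n++; }
    @Override public double result() { return n == 0 ? 0.0 : s / n; }
}

// The parameterizable part: each (field index, aggregate factory) pair says
// "apply this aggregate to this input column". One instance per key/window.
class RowAggregator {
    private final int[] fields;
    private final FieldAgg[] aggs;

    RowAggregator(int[] fields, List<Supplier<FieldAgg>> factories) {
        this.fields = fields;
        this.aggs = new FieldAgg[factories.size()];
        for (int i = 0; i < aggs.length; i++) aggs[i] = factories.get(i).get();
    }

    void add(double[] row) {
        for (int i = 0; i < aggs.length; i++) aggs[i].add(row[fields[i]]);
    }

    double[] results() {
        double[] out = new double[aggs.length];
        for (int i = 0; i < aggs.length; i++) out[i] = aggs[i].result();
        return out;
    }
}

public class Demo {
    public static void main(String[] args) {
        // output: [count(*), sum(column 1), avg(column 2)];
        // CountAgg ignores its input, so column 0 is arbitrary here.
        RowAggregator agg = new RowAggregator(
                new int[] {0, 1, 2},
                List.of(CountAgg::new, SumAgg::new, AvgAgg::new));
        agg.add(new double[] {7, 2.0, 10.0});
        agg.add(new double[] {7, 3.0, 20.0});
        System.out.println(Arrays.toString(agg.results()));
        // prints [2.0, 5.0, 15.0]
    }
}
```

In a real Flink job, something like `RowAggregator` would presumably serve as the accumulator of a single generic `AggregateFunction`, applied per key and window; the early firing on long windows would then come from the window trigger (e.g. one of Flink's continuous time triggers), independently of the aggregation logic.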
Thanks for the reply.
Sorry for my lateness; I put this project on hold for a while, but I'm looking back into it now. I think I see what you're saying, but I'm not sure how to actually implement that parameterizable aggregation function. I understand what it would be conceptually, but I'm new to Scala/Java, so actually implementing it seems daunting. Do you have a link to some code where I can see something similar to work from? Preferably many such examples :P

Thanks!