Dataset column statistics


Dataset column statistics

Flavio Pompermaier
Hi to all,
I have a batch dataset and I want to get some standard info about its columns (like min, max, avg, etc.).
To achieve this I wrote a simple program that uses SQL on the Table API, like the following:

SELECT 
MAX(col1), MIN(col1), AVG(col1),
MAX(col2), MIN(col2), AVG(col2),
MAX(col3), MIN(col3), AVG(col3)
FROM MYTABLE

My dataset has about 50 fields, so the query becomes quite big (and the job plan too).
It seems that this kind of job causes the cluster to crash (too much garbage collection).
Is there any smarter way to achieve this goal (apart from running one job per column)?
Is this "normal", or is it a bug in Flink?

Best,
Flavio
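
(For context, here is a minimal, self-contained sketch of such a job, assuming a Flink 1.7-era DataSet/Table API setup; the three-column toy dataset and all names are illustrative.)

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;

public class ColumnStatsJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

        // Toy three-column dataset standing in for the real ~50-column one.
        DataSet<Tuple3<Integer, Integer, Integer>> data = env.fromElements(
                Tuple3.of(1, 10, 100), Tuple3.of(2, 20, 200), Tuple3.of(3, 30, 300));
        tEnv.registerDataSet("MYTABLE", data, "col1, col2, col3");

        Table stats = tEnv.sqlQuery(
                "SELECT MAX(col1), MIN(col1), AVG(col1), "
                        + "MAX(col2), MIN(col2), AVG(col2), "
                        + "MAX(col3), MIN(col3), AVG(col3) "
                        + "FROM MYTABLE");

        // All aggregates arrive as a single result row.
        tEnv.toDataSet(stats, Row.class).print();
    }
}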

Re: Dataset column statistics

Fabian Hueske
Hi,

You could try to enable object reuse.
Alternatively, you can give the job more heap memory or fine-tune the GC parameters.

I would not consider it a bug in Flink, but it might be something that could be improved.

Fabian
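
(A minimal sketch of the object-reuse switch mentioned above, assuming the DataSet-era ExecutionEnvironment:)

import org.apache.flink.api.java.ExecutionEnvironment;

public class EnableObjectReuse {
    public static void main(String[] args) {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Let operators reuse record objects instead of allocating a fresh
        // object per record, which reduces garbage-collection pressure.
        env.getConfig().enableObjectReuse();
    }
}

Heap memory can be raised via taskmanager.heap.size in flink-conf.yaml, and GC parameters can be passed through env.java.opts (for example, -XX:+UseG1GC).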



Re: Dataset column statistics

Flavio Pompermaier
What do you advise for computing column stats?
Should I run multiple jobs (one per column) or try to compute them all at once?

Are you ever going to consider supporting ANALYZE TABLE (like in Hive or Spark) in the Flink Table API?

Best,
Flavio
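
(For readers unfamiliar with the statement, column statistics collection in Hive and Spark SQL looks roughly like this; MYTABLE and the column names are illustrative:)

ANALYZE TABLE MYTABLE COMPUTE STATISTICS FOR COLUMNS col1, col2, col3;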



Re: Dataset column statistics

Fabian Hueske
I'd try to tune it as a single query.
If that does not work, go for as few queries as possible, splitting by column to benefit from projection push-down (see the sketch below).

This is the first time I've heard somebody request ANALYZE TABLE.
I don't see a reason why it shouldn't be added in the future.
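
(A sketch of that splitting approach; the helper below is hypothetical, and the batch size and MIN/MAX/AVG set are illustrative.)

import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;

public final class ColumnStatsSplitter {

    // Builds one MIN/MAX/AVG query per small batch of columns, so that each
    // job plan stays small and projection push-down can prune the columns
    // a given query does not touch.
    public static List<Table> statsQueries(
            BatchTableEnvironment tEnv, String table, List<String> columns, int batchSize) {
        List<Table> queries = new ArrayList<>();
        for (int i = 0; i < columns.size(); i += batchSize) {
            StringJoiner select = new StringJoiner(", ");
            for (String col : columns.subList(i, Math.min(i + batchSize, columns.size()))) {
                select.add("MAX(" + col + ")");
                select.add("MIN(" + col + ")");
                select.add("AVG(" + col + ")");
            }
            queries.add(tEnv.sqlQuery("SELECT " + select + " FROM " + table));
        }
        return queries;
    }
}

Each resulting Table can then run as its own small job, or a few can be combined, depending on how far the single big plan has to be split.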





Re: Dataset column statistics

Kurt Young
Hi,

We have implemented ANALYZE TABLE in our internal version of Flink, and we will try to contribute it back to the community.

Best,
Kurt




Re: Dataset column statistics

Flavio Pompermaier
Great, thanks! 




Re: Dataset column statistics

Flavio Pompermaier
Any news on this, Kurt?
Could you share some insight into how you implemented it?
I'm debating whether to run multiple jobs or whether the analysis could be performed in a single big job.

Best,
Flavio
