Dataset column statistics


Dataset column statistics

Flavio Pompermaier
Hi to all,
I have a batch dataset and I want to get some standard info about its columns (like min, max, avg, etc.).
To achieve this I wrote a simple program that uses SQL on the Table API, like the following:

SELECT 
MAX(col1), MIN(col1), AVG(col1),
MAX(col2), MIN(col2), AVG(col2),
MAX(col3), MIN(col3), AVG(col3)
FROM MYTABLE

My dataset has about 50 fields, so the query becomes quite big (and the job plan too).
It seems that this kind of job causes the cluster to crash (too much garbage collection).
Is there any smarter way to achieve this goal (apart from running one job per column)?
Is this "normal", or is it a bug in Flink?

Best,
Flavio
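
(For context, here is a minimal, self-contained sketch of such a job, assuming a Flink 1.7-era DataSet/Table API setup; the three-column toy dataset and all names are illustrative.)

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;

public class ColumnStatsJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

        // Toy three-column dataset standing in for the real ~50-column one.
        DataSet<Tuple3<Integer, Integer, Integer>> data = env.fromElements(
                Tuple3.of(1, 10, 100), Tuple3.of(2, 20, 200), Tuple3.of(3, 30, 300));
        tEnv.registerDataSet("MYTABLE", data, "col1, col2, col3");

        Table stats = tEnv.sqlQuery(
                "SELECT MAX(col1), MIN(col1), AVG(col1), "
                        + "MAX(col2), MIN(col2), AVG(col2), "
                        + "MAX(col3), MIN(col3), AVG(col3) "
                        + "FROM MYTABLE");

        // All aggregates arrive as a single result row.
        tEnv.toDataSet(stats, Row.class).print();
    }
}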

Re: Dataset column statistics

Fabian Hueske
Hi,

You could try to enable object reuse.
Alternatively, you can give the job more heap memory or fine-tune the GC parameters.

I would not consider it a bug in Flink, but it might be something that could be improved.

Fabian
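
(A minimal sketch of the object-reuse switch mentioned above, assuming the DataSet-era ExecutionEnvironment:)

import org.apache.flink.api.java.ExecutionEnvironment;

public class EnableObjectReuse {
    public static void main(String[] args) {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Let operators reuse record objects instead of allocating a fresh
        // object per record, which reduces garbage-collection pressure.
        env.getConfig().enableObjectReuse();
    }
}

Heap memory can be raised via taskmanager.heap.size in flink-conf.yaml, and GC parameters can be passed through env.java.opts (for example, -XX:+UseG1GC).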



Re: Dataset column statistics

Flavio Pompermaier
What do you advise for computing column stats?
Should I run multiple jobs (one per column) or try to compute them all at once?

Are you ever going to consider supporting ANALYZE TABLE (like in Hive or Spark) in the Flink Table API?

Best,
Flavio
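
(For readers unfamiliar with the statement, column statistics collection in Hive and Spark SQL looks roughly like this; MYTABLE and the column names are illustrative:)

ANALYZE TABLE MYTABLE COMPUTE STATISTICS FOR COLUMNS col1, col2, col3;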



Re: Dataset column statistics

Fabian Hueske
I'd try to tune it as a single query.
If that does not work, go for as few queries as possible, splitting by column to benefit from projection push-down (see the sketch below).

This is the first time I've heard somebody request ANALYZE TABLE.
I don't see a reason why it shouldn't be added in the future.
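
(A sketch of that splitting approach; the helper below is hypothetical, and the batch size and MIN/MAX/AVG set are illustrative.)

import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;

public final class ColumnStatsSplitter {

    // Builds one MIN/MAX/AVG query per small batch of columns, so that each
    // job plan stays small and projection push-down can prune the columns
    // a given query does not touch.
    public static List<Table> statsQueries(
            BatchTableEnvironment tEnv, String table, List<String> columns, int batchSize) {
        List<Table> queries = new ArrayList<>();
        for (int i = 0; i < columns.size(); i += batchSize) {
            StringJoiner select = new StringJoiner(", ");
            for (String col : columns.subList(i, Math.min(i + batchSize, columns.size()))) {
                select.add("MAX(" + col + ")");
                select.add("MIN(" + col + ")");
                select.add("AVG(" + col + ")");
            }
            queries.add(tEnv.sqlQuery("SELECT " + select + " FROM " + table));
        }
        return queries;
    }
}

Each resulting Table can then run as its own small job, or a few can be combined, depending on how far the single big plan has to be split.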





Re: Dataset column statistics

Kurt Young
Hi,

We have implemented ANALYZE TABLE in our internal version of Flink, and we will try to contribute it back to the community.

Best,
Kurt




Re: Dataset column statistics

Flavio Pompermaier
Great, thanks! 




Re: Dataset column statistics

Flavio Pompermaier
Any news on this, Kurt?
Could you share some insight into how you implemented it?
I'm debating whether to run multiple jobs or whether the analysis could be performed in a single big job.

Best,
Flavio
