Hi to all,
I have a batch dataset and I want to compute some standard statistics for its columns (min, max, avg, etc.). To do this I wrote a simple program that runs SQL through the Table API, like the following:

SELECT MAX(col1), MIN(col1), AVG(col1), MAX(col2), MIN(col2), AVG(col2), MAX(col3), MIN(col3), AVG(col3) FROM MYTABLE

My dataset has about 50 fields, so the query becomes quite big (and the job plan too). It seems that this kind of job causes the cluster to crash (too much garbage collection).

Is there a smarter way to achieve this goal (apart from running one job per column)? Is this "normal", or is it a bug in Flink?

Best,
Flavio
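For illustration, a query of this shape does not have to be written by hand. Below is a minimal Python sketch that builds the same SELECT for an arbitrary column list. The helper name `build_stats_query` and the column names `col1` to `col50` are assumptions made for the example (the post only says there are about 50 fields); only the table name MYTABLE and the three aggregates come from the query above.

```python
def build_stats_query(table, columns, aggregates=("MAX", "MIN", "AVG")):
    """Return one SELECT that applies every aggregate to every column.

    Hypothetical helper, not part of Flink: it only assembles a SQL string.
    """
    if not columns:
        raise ValueError("at least one column is required")
    # Keep all aggregates of a column next to each other, as in the post.
    select_items = [f"{agg}({col})" for col in columns for agg in aggregates]
    return f"SELECT {', '.join(select_items)} FROM {table}"


# Assumed column names; the real field names are not given in the thread.
columns = [f"col{i}" for i in range(1, 51)]
query = build_stats_query("MYTABLE", columns)
```

The sketch does no identifier quoting or validation, and AVG is only meaningful for numeric columns, so non-numeric fields would need a different aggregate list.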
Hi,

You could try enabling object reuse. Alternatively, you can give the job more heap memory or fine-tune the GC parameters. I would not consider it a bug in Flink, but it might be something that could be improved.

Fabian

On Wed, Nov 28, 2018 at 6:19 PM Flavio Pompermaier <[hidden email]> wrote:
What do you advise for computing column stats? Should I run multiple jobs (one per column) or try to compute everything at once?

Are you ever going to consider supporting ANALYZE TABLE (like in Hive or Spark) in the Flink Table API?

Best,
Flavio

On Thu, Nov 29, 2018 at 9:45 AM Fabian Hueske <[hidden email]> wrote:
I'd try to tune it in a single query. If that does not work, go for as few queries as possible, splitting by column for better projection push-down.

This is the first time I've heard somebody request ANALYZE TABLE. I don't see a reason why it shouldn't be added in the future.

On Thu, Nov 29, 2018 at 12:08 PM Flavio Pompermaier <[hidden email]> wrote:
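The "as few queries as possible, splitting by column" suggestion can be sketched as follows. This is a minimal Python example that only builds the SQL strings. The helper names `chunk` and `build_split_queries`, the column names, and the group size of 20 are all assumptions for illustration; the thread does not say how many columns a single query can handle, so the size would have to be found by experiment.

```python
def chunk(items, size):
    """Split a list into consecutive groups of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_split_queries(table, columns, columns_per_query,
                        aggregates=("MAX", "MIN", "AVG")):
    """Return one stats query per column group (hypothetical helper).

    Each query selects only its own columns, so a source that supports
    projection push-down would not need to read the other fields.
    """
    queries = []
    for group in chunk(columns, columns_per_query):
        items = [f"{agg}({col})" for col in group for agg in aggregates]
        queries.append(f"SELECT {', '.join(items)} FROM {table}")
    return queries


# Assumed column names and an arbitrary group size of 20.
columns = [f"col{i}" for i in range(1, 51)]
queries = build_split_queries("MYTABLE", columns, columns_per_query=20)
```

Setting `columns_per_query` to the full column count gives the single-query variant, and setting it to 1 gives the one-job-per-column variant, so the same sketch covers both ends of the trade-off discussed here.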
Hi,

We have implemented ANALYZE TABLE in our internal version of Flink, and we will try to contribute it back to the community.

Best,
Kurt

On Thu, Nov 29, 2018 at 9:23 PM Fabian Hueske <[hidden email]> wrote:
Great, thanks!

On Tue, Dec 18, 2018 at 3:26 AM Kurt Young <[hidden email]> wrote:
In reply to this post by Kurt Young
Any news on this, Kurt? Could you share some insight into how you implemented it? I'm debating whether to run multiple jobs or whether the analysis could be performed in a single big job.

Best,
Flavio

On Tue, Dec 18, 2018 at 3:26 AM Kurt Young <[hidden email]> wrote: