(DEPRECATED) Apache Flink User Mailing List archive.

The parallelism of sink is always 1 in sqlUpdate

faaron zheng

The parallelism of sink is always 1 in sqlUpdate

Hi all,

I am trying to use Flink SQL to run Hive tasks. I use tEnv.sqlUpdate to execute my SQL, which looks like "insert overwrite ... select ...". But I find that the parallelism of the sink is always 1, which is intolerable for large data. Why does this happen? Also, is there any guide for sizing TaskManager memory when I have two huge tables to hash join, for example when each table holds several TB of data?

Thanks,

Faaron

Jingsong Li

Re: The parallelism of sink is always 1 in sqlUpdate

Hi faaron,

For sink parallelism:

- What is the parallelism of the sink's input? The sink parallelism should be the same.

- Does your SQL have ORDER BY or LIMIT?

Flink batch SQL does not support range partitioning yet, so it runs ORDER BY with a parallelism of 1.

For TaskManager memory:

There is a managed memory option that you can configure [1].

[1] https://ci.apache.org/projects/flink/flink-docs-master/ops/memory/mem_setup.html#managed-memory
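
As a rough aid for the managed memory sizing mentioned above, here is a minimal sketch. It assumes the Flink 1.10-style memory model described in [1], where managed memory is a fraction of total Flink memory (taskmanager.memory.managed.fraction, default 0.4) and is shared evenly between the task slots of a TaskManager. The function name and the example numbers are hypothetical and are not taken from this thread.

```python
def managed_memory_per_slot_mb(flink_memory_mb, managed_fraction=0.4, slots=1):
    """Estimate the managed memory (in MB) available to each task slot.

    Assumption: managed memory = total Flink memory * managed fraction,
    divided evenly between the slots of one TaskManager.
    """
    if not 0.0 <= managed_fraction <= 1.0:
        raise ValueError("managed_fraction must be between 0 and 1")
    if slots < 1:
        raise ValueError("slots must be at least 1")
    managed_total_mb = flink_memory_mb * managed_fraction
    return managed_total_mb / slots


# Hypothetical example: 16 GB of total Flink memory, default fraction, 4 slots.
print(managed_memory_per_slot_mb(16384, 0.4, 4))
```

Batch operators such as hash joins draw on managed memory, so the per-slot figure gives a first idea of how much each parallel join task can hold in memory before it spills to disk.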

Best,

Jingsong Lee

(In reply to faaron zheng's message of Fri, Mar 6, 2020 at 5:38 PM.)

faaron zheng

Re: The parallelism of sink is always 1 in sqlUpdate

Thanks for your attention. The parallelism of the sink's input is 500, and there is no ORDER BY or LIMIT.

(In reply to Jingsong Li's message of Fri, Mar 6, 2020 at 6:15 PM.)

Jingsong Li

Re: The parallelism of sink is always 1 in sqlUpdate

Which sink do you use?

The sink parallelism depends on the sink implementation; see CsvTableSink [1] for an example.

[1] https://github.com/apache/flink/blob/2b13a4155fd4284f6092decba867e71eea058043/flink-table/flink-table-api-java-bridge/src/main/java/org/apache/flink/table/sinks/CsvTableSink.java#L147

Best,

Jingsong Lee

(In reply to faaron zheng's message of Fri, Mar 6, 2020 at 6:37 PM.)