(DEPRECATED) Apache Flink User Mailing List archive.

Approaches to customize the parallelism in SQL generated operators

Kai Fu

Approaches to customize the parallelism in SQL generated operators

Hi team,

Currently, all SQL-generated operators have the same parallelism by default. We ran into an issue with multi-way joins: the operators at later stages carry a heavier computational load, so the whole pipeline becomes backpressured, which occasionally causes checkpoints to fail (expire).

We would like to know whether there is any way to customize the parallelism of the SQL-generated operators individually, so that each operator's capacity matches its actual load and the load is evenly distributed across operators.

Besides customizing the parallelism of the operators, is there any other suggested way to solve the problem, or are there best practices for such multi-way joins? Thank you in advance.

Best regards,

- Kai

David Anderson-4

Re: Approaches to customize the parallelism in SQL generated operators

No, there is no mechanism available for individually tuning the parallelism of the generated operators in a SQL job. Moreover, such fine-tuning is often counter-productive. In most cases you are better off simply setting the overall parallelism to whatever is needed by the busiest operator(s). Unnecessary changes in parallelism force additional network shuffles (unless done in concert with a keyBy), and create an uneven distribution of load, with some slots having more operators than others.
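As a rough illustration of sizing the overall parallelism by the busiest operator, here is a minimal Python sketch. The operator names, input rates, and per-subtask capacities are made-up numbers, not measurements from this thread, and the helper `required_parallelism` is hypothetical. The result would then be applied as the job-wide default parallelism (for example via Flink's `parallelism.default` setting).

```python
import math


def required_parallelism(input_rate, per_subtask_capacity):
    """Smallest number of parallel subtasks able to keep up with input_rate."""
    return math.ceil(input_rate / per_subtask_capacity)


# Hypothetical figures: (input records/second, records/second one subtask can handle).
operators = {
    "source": (20_000, 10_000),
    "join_1": (20_000, 5_000),
    "join_2": (40_000, 4_000),  # later join sees more data and is slower per record
}

# Parallelism each operator would need if it could be tuned on its own.
per_operator = {
    name: required_parallelism(rate, capacity)
    for name, (rate, capacity) in operators.items()
}

# A SQL job has a single parallelism, so it is set by the busiest operator.
job_parallelism = max(per_operator.values())

print(per_operator)
print("job parallelism:", job_parallelism)
```

With these assumed numbers the last join dominates, so the whole job runs at its parallelism and the lighter operators simply have spare capacity.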

Regards,

David

In reply to eef hhj's message of Thu, Mar 18, 2021 at 1:03 PM.

Kai Fu

Re: Approaches to customize the parallelism in SQL generated operators

Hi David,

Thank you for the response. We are facing a cold-start situation for our application. During the cold-start phase, a lot of parallelism is required to keep the busiest operator from being overwhelmed, so that there is no backpressure and checkpoints work as normal. The problem is that such over-provisioned parallelism is far more than what the normal stream traffic requires, which is quite a waste.
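The trade-off can be made concrete with a little arithmetic. The sketch below is a simplified model, not anything provided by Flink: it assumes a fixed backlog, a constant incoming rate, and a fixed per-subtask capacity at the busiest operator, and all numbers and the helper `catch_up_seconds` are hypothetical.

```python
import math


def catch_up_seconds(backlog, parallelism, per_subtask_capacity, incoming_rate):
    """Time to drain a backlog while new records keep arriving.

    Returns math.inf when there is no spare capacity beyond the incoming rate.
    """
    spare = parallelism * per_subtask_capacity - incoming_rate
    if spare <= 0:
        return math.inf
    return backlog / spare


# Hypothetical cold-start scenario.
BACKLOG = 3_600_000   # records waiting in the topic at start-up
CAPACITY = 1_000      # records/second one subtask of the busiest operator handles
INCOMING = 2_000      # steady-state records/second

for p in (2, 4, 12):
    print(p, catch_up_seconds(BACKLOG, p, CAPACITY, INCOMING))
```

Under these assumptions, a parallelism of 2 is enough for steady traffic but never drains the backlog, 4 needs 30 minutes, and 12 needs 6 minutes, after which most of that capacity sits idle.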

Currently, we are thinking of limiting the read rate on the connector (Kafka) side. By limiting the throughput of each parallel source instance, the downstream operators can handle the traffic during cold start. Per our observation, it works, but we are not sure whether this is the suggested way. Any other suggestions are appreciated.

Another direction we want to explore is to change only the parallelism of the source consumer, but not that of the subsequent operators. Are there any concerns with this approach?
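The thread does not say how the read rate is limited. One common way to cap per-instance throughput is a token bucket, sketched below in plain Python. This is a generic illustration, not a Flink or Kafka API; the class `TokenBucket` and the function `simulate` are hypothetical names, and timestamps are passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Admits at most `rate_per_sec` reads per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = float(rate_per_sec)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def try_acquire(self, now, n=1):
        # Refill according to the time elapsed since the previous call.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


def simulate(bucket, arrivals):
    """Count how many of the timestamped read attempts are admitted."""
    return sum(1 for t in arrivals if bucket.try_acquire(t))


# One bucket per parallel source instance: 2 reads/second, burst of 5.
bucket = TokenBucket(rate_per_sec=2, capacity=5)
print(simulate(bucket, [0.0] * 10))  # burst at start-up
print(simulate(bucket, [1.0] * 10))  # one second later
```

Each source instance would hold its own bucket, so the total read rate scales with the source parallelism while downstream operators see a bounded load.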

-- Best wishes

Kai

In reply to David Anderson's message of Sun, Mar 21, 2021 at 1:01 AM.