Flink Streaming Job Tuning help

Flink Streaming Job Tuning help

Senthil Kumar

Hello Flink Community!

We have a fairly intensive Flink streaming application processing 8-9 million records a minute, with each record around 10 KB.

One of our steps is a keyBy operation. We are finding that Flink lags seriously behind when we introduce the keyBy (presumably because of the shuffle across the network).

We are trying to tune it ourselves (size of nodes, memory, network buffers, etc.), but before we spend too much time on this: would it be better to hire a “Flink tuning expert” to get us through?

If so, what resources does this list recommend?

Cheers,
Kumar


Re: Flink Streaming Job Tuning help

Senthil Kumar

I forgot to mention: we are consuming said records from AWS Kinesis and writing out to S3.
From: Senthil Kumar <[hidden email]>
Date: Tuesday, May 12, 2020 at 10:47 AM
To: "[hidden email]" <[hidden email]>
Subject: Flink Streaming Job Tuning help

 



Re: Flink Streaming Job Tuning help

Zhijiang(wangzhijiang999)
Hi Kumar,

I can give some general ideas for further analysis.

> We are finding that flink lags seriously behind when we introduce the keyBy (presumably because of shuffle across the network)

The `keyBy` breaks operator chaining, so it can have a significant performance impact in practice. If your pipeline without the keyBy was able to take advantage of chaining, each downstream operator consumed records directly from the preceding operator, with no buffer serialization -> network shuffle -> buffer deserialization in between. That matters all the more here because your ~10 KB records are fairly large.
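To put the shuffle cost in perspective, here is a rough back-of-the-envelope estimate of the raw network bandwidth a full re-partitioning implies at the stated throughput (the 8.5 M/min midpoint and the 10 KB record size are taken from the numbers above; everything else is arithmetic):

```python
# Back-of-the-envelope estimate of the network bandwidth a keyBy shuffle
# implies at the stated throughput (8-9 M records/min, ~10 KB each).

RECORD_SIZE_BYTES = 10 * 1024    # ~10 KB per record, as stated
RECORDS_PER_MIN = 8_500_000      # midpoint of 8-9 million per minute

bytes_per_second = RECORD_SIZE_BYTES * RECORDS_PER_MIN / 60
gb_per_second = bytes_per_second / 1024**3

# In the worst case a keyBy re-partitions nearly every record across the
# network, so cluster-wide shuffle traffic approaches this figure.
print(f"~{gb_per_second:.2f} GiB/s of raw record data")
```

At well over a GiB/s of payload (before serialization overhead), the shuffle alone can saturate typical instance NICs, which is consistent with the slowdown observed when the keyBy was introduced.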

If the keyBy is necessary in your case, the next step is to find the current bottleneck: e.g. check in the web UI whether there is back pressure. If so, identify which task is the bottleneck causing the back pressure, and trace it via the network-related metrics.
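Besides the web UI, the same back pressure samples are exposed via Flink's monitoring REST API (`GET /jobs/<job-id>/vertices/<vertex-id>/backpressure`). A small sketch of picking out the worst subtask from such a response follows; the field names assume the Flink 1.10-era response schema and may differ across versions, the helper name is mine, and the sample payload is made up purely for illustration:

```python
# Hypothetical helper: given a parsed back pressure response from Flink's
# REST API, report the subtask with the highest back pressure ratio.
# Field names ("subtasks", "subtask", "ratio") assume the Flink 1.10-era
# schema -- verify against your version's REST API docs.

def worst_subtask(backpressure_response: dict) -> tuple:
    """Return (subtask index, ratio) of the most back-pressured subtask."""
    worst = max(backpressure_response["subtasks"], key=lambda s: s["ratio"])
    return worst["subtask"], worst["ratio"]

# Made-up sample payload, for illustration only.
sample = {
    "backpressure-level": "high",
    "subtasks": [
        {"subtask": 0, "ratio": 0.10},
        {"subtask": 1, "ratio": 0.85},  # likely the bottleneck
        {"subtask": 2, "ratio": 0.20},
    ],
}
print(worst_subtask(sample))  # -> (1, 0.85)
```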

Also check whether there is data skew in your case, i.e. some tasks processing many more records than others. If so, increasing the parallelism may help balance the load.
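One cheap way to check for skew before touching the job is to replay a sample of keys through a hash partitioner offline and compare the busiest bucket to the average. A sketch (note this uses CRC32 as a stand-in; Flink's actual assignment goes through murmur-hashed key groups, so treat the result only as a rough proxy):

```python
# Offline key-skew check: bucket a sample of keys roughly the way a keyed
# operator spreads them over parallel subtasks, then compare the busiest
# subtask's load to the average. CRC32 stands in for Flink's murmur-based
# key-group hashing, so this is only an approximation.
import zlib
from collections import Counter

def skew_ratio(keys, parallelism: int) -> float:
    """max(records per subtask) / mean(records per subtask); 1.0 = balanced."""
    counts = Counter(zlib.crc32(k.encode()) % parallelism for k in keys)
    mean = len(keys) / parallelism
    return max(counts.values()) / mean

# Example: one hot key dominating a sample of 10,000 records.
sample_keys = ["hot-key"] * 5000 + [f"key-{i}" for i in range(5000)]
ratio = skew_ratio(sample_keys, parallelism=8)
print(f"skew ratio: {ratio:.1f}x the average subtask load")
```

A ratio well above 1.0 means one subtask carries most of the load; note that for a single dominant key, raising the parallelism will not spread that key's records, since all records for one key land on the same subtask.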

Best,
Zhijiang
------------------------------------------------------------------
From: Senthil Kumar <[hidden email]>
Send Time: Wednesday, May 13, 2020 00:49
Subject: Re: Flink Streaming Job Tuning help




Re: Flink Streaming Job Tuning help

Senthil Kumar

Zhijiang,

Thanks for your suggestions. We will keep them in mind!

Kumar

From: Zhijiang <[hidden email]>
Reply-To: Zhijiang <[hidden email]>
Date: Tuesday, May 12, 2020 at 10:10 PM
To: Senthil Kumar <[hidden email]>, "[hidden email]" <[hidden email]>
Subject: Re: Flink Streaming Job Tuning help

 



Re: Flink Streaming Job Tuning help

Arvid Heise-3
Hi Senthil,

Since your records are so big, I recommend taking the time to evaluate some different serializers [1].
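The evaluation itself would happen in Java against Flink's serializers (POJO, Kryo, Avro, etc.), but the measurement is simple enough to prototype early: serialize a representative record with each candidate and compare the resulting byte size. Sketched below with Python standard-library serializers purely to illustrate the shape of the comparison; the sample record is made up to roughly match the ~10 KB size mentioned in the thread:

```python
# Illustrative serializer comparison: encode one representative record
# with each candidate serializer and compare output sizes. The record is
# a made-up ~10 KB payload; in a real Flink job you would run the same
# comparison with Flink's POJO/Kryo/Avro serializers in Java.
import json
import pickle

record = {"id": 42, "payload": "x" * 10_000, "tags": ["a", "b", "c"]}

for name, encode in [
    ("json", lambda r: json.dumps(r).encode()),
    ("pickle", lambda r: pickle.dumps(r)),
]:
    print(f"{name}: {len(encode(record))} bytes")
```

At 8-9 million records a minute, even a few percent of per-record size or CPU difference between serializers adds up quickly, which is why measuring on your own records is worth the time.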


On Wed, May 13, 2020 at 5:40 PM Senthil Kumar <[hidden email]> wrote:




--

Arvid Heise | Senior Java Developer


Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng