Flink Streaming Job Tuning help

Flink Streaming Job Tuning help

Senthil Kumar

Hello Flink Community!

We have a fairly intensive Flink streaming application processing 8-9 million records a minute, with each record around 10 KB.

One of our steps is a keyBy operation. We are finding that Flink lags seriously behind when we introduce the keyBy (presumably because of the shuffle across the network).

We are trying to tune it ourselves (size of nodes, memory, network buffers, etc.), but before we spend too much time on this: would it be better to hire a “Flink tuning expert” to get us through?

If so, what resources does this list recommend?

Cheers,
Kumar


Re: Flink Streaming Job Tuning help

Senthil Kumar

I forgot to mention: we are consuming said records from AWS Kinesis and writing out to S3.
From: Senthil Kumar <[hidden email]>
Date: Tuesday, May 12, 2020 at 10:47 AM
To: "[hidden email]" <[hidden email]>
Subject: Flink Streaming Job Tuning help

 



Re: Flink Streaming Job Tuning help

Zhijiang(wangzhijiang999)
Hi Kumar,

I can give some general ideas for further analysis.

> We are finding that flink lags seriously behind when we introduce the keyBy (presumably because of shuffle across the network)

The `keyBy` breaks operator chaining, so it can have a significant performance impact in practice. If your pipeline without the keyBy was able to take advantage of chaining, each downstream operator consumed records directly from the preceding operator, with no buffer serialization -> network shuffle -> buffer deserialization in between. That matters all the more here because your ~10 KB records are fairly large.
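To put the shuffle cost in perspective, here is a rough back-of-the-envelope estimate of the raw network bandwidth a full re-partitioning implies at the stated throughput (the 8.5 M/min midpoint and the 10 KB record size are taken from the numbers above; everything else is arithmetic):

```python
# Back-of-the-envelope estimate of the network bandwidth a keyBy shuffle
# implies at the stated throughput (8-9 M records/min, ~10 KB each).

RECORD_SIZE_BYTES = 10 * 1024    # ~10 KB per record, as stated
RECORDS_PER_MIN = 8_500_000      # midpoint of 8-9 million per minute

bytes_per_second = RECORD_SIZE_BYTES * RECORDS_PER_MIN / 60
gb_per_second = bytes_per_second / 1024**3

# In the worst case a keyBy re-partitions nearly every record across the
# network, so cluster-wide shuffle traffic approaches this figure.
print(f"~{gb_per_second:.2f} GiB/s of raw record data")
```

At well over a GiB/s of payload (before serialization overhead), the shuffle alone can saturate typical instance NICs, which is consistent with the slowdown observed when the keyBy was introduced.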

If the keyBy is necessary in your case, the next step is to find the current bottleneck: e.g. check in the web UI whether there is back pressure. If so, identify which task is the bottleneck causing the back pressure, and trace it via the network-related metrics.
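Besides the web UI, the same back pressure samples are exposed via Flink's monitoring REST API (`GET /jobs/<job-id>/vertices/<vertex-id>/backpressure`). A small sketch of picking out the worst subtask from such a response follows; the field names assume the Flink 1.10-era response schema and may differ across versions, the helper name is mine, and the sample payload is made up purely for illustration:

```python
# Hypothetical helper: given a parsed back pressure response from Flink's
# REST API, report the subtask with the highest back pressure ratio.
# Field names ("subtasks", "subtask", "ratio") assume the Flink 1.10-era
# schema -- verify against your version's REST API docs.

def worst_subtask(backpressure_response: dict) -> tuple:
    """Return (subtask index, ratio) of the most back-pressured subtask."""
    worst = max(backpressure_response["subtasks"], key=lambda s: s["ratio"])
    return worst["subtask"], worst["ratio"]

# Made-up sample payload, for illustration only.
sample = {
    "backpressure-level": "high",
    "subtasks": [
        {"subtask": 0, "ratio": 0.10},
        {"subtask": 1, "ratio": 0.85},  # likely the bottleneck
        {"subtask": 2, "ratio": 0.20},
    ],
}
print(worst_subtask(sample))  # -> (1, 0.85)
```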

Also check whether there is data skew in your case, i.e. some tasks processing many more records than others. If so, increasing the parallelism may help balance the load.
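One cheap way to check for skew before touching the job is to replay a sample of keys through a hash partitioner offline and compare the busiest bucket to the average. A sketch (note this uses CRC32 as a stand-in; Flink's actual assignment goes through murmur-hashed key groups, so treat the result only as a rough proxy):

```python
# Offline key-skew check: bucket a sample of keys roughly the way a keyed
# operator spreads them over parallel subtasks, then compare the busiest
# subtask's load to the average. CRC32 stands in for Flink's murmur-based
# key-group hashing, so this is only an approximation.
import zlib
from collections import Counter

def skew_ratio(keys, parallelism: int) -> float:
    """max(records per subtask) / mean(records per subtask); 1.0 = balanced."""
    counts = Counter(zlib.crc32(k.encode()) % parallelism for k in keys)
    mean = len(keys) / parallelism
    return max(counts.values()) / mean

# Example: one hot key dominating a sample of 10,000 records.
sample_keys = ["hot-key"] * 5000 + [f"key-{i}" for i in range(5000)]
ratio = skew_ratio(sample_keys, parallelism=8)
print(f"skew ratio: {ratio:.1f}x the average subtask load")
```

A ratio well above 1.0 means one subtask carries most of the load; note that for a single dominant key, raising the parallelism will not spread that key's records, since all records for one key land on the same subtask.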

Best,
Zhijiang
------------------------------------------------------------------
From: Senthil Kumar <[hidden email]>
Send Time: Wednesday, May 13, 2020 00:49
Subject: Re: Flink Streaming Job Tuning help




Re: Flink Streaming Job Tuning help

Senthil Kumar

Zhijiang,

Thanks for your suggestions. We will keep them in mind!

Kumar

From: Zhijiang <[hidden email]>
Reply-To: Zhijiang <[hidden email]>
Date: Tuesday, May 12, 2020 at 10:10 PM
To: Senthil Kumar <[hidden email]>, "[hidden email]" <[hidden email]>
Subject: Re: Flink Streaming Job Tuning help

 



Re: Flink Streaming Job Tuning help

Arvid Heise-3
Hi Senthil,

Since your records are so big, I recommend taking the time to evaluate some different serializers [1].
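The evaluation itself would happen in Java against Flink's serializers (POJO, Kryo, Avro, etc.), but the measurement is simple enough to prototype early: serialize a representative record with each candidate and compare the resulting byte size. Sketched below with Python standard-library serializers purely to illustrate the shape of the comparison; the sample record is made up to roughly match the ~10 KB size mentioned in the thread:

```python
# Illustrative serializer comparison: encode one representative record
# with each candidate serializer and compare output sizes. The record is
# a made-up ~10 KB payload; in a real Flink job you would run the same
# comparison with Flink's POJO/Kryo/Avro serializers in Java.
import json
import pickle

record = {"id": 42, "payload": "x" * 10_000, "tags": ["a", "b", "c"]}

for name, encode in [
    ("json", lambda r: json.dumps(r).encode()),
    ("pickle", lambda r: pickle.dumps(r)),
]:
    print(f"{name}: {len(encode(record))} bytes")
```

At 8-9 million records a minute, even a few percent of per-record size or CPU difference between serializers adds up quickly, which is why measuring on your own records is worth the time.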


On Wed, May 13, 2020 at 5:40 PM Senthil Kumar <[hidden email]> wrote:




--

Arvid Heise | Senior Java Developer


Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng