Performance insights

Performance insights

Flavio Pompermaier
Hi to all,

I'm testing how to speed up my Flink job and I observed the following situations on my 6-node cluster (each node has 8 CPUs, and one node also runs the job manager):

Scenario 1:
  • # of network buffers 4096
  • parallelism: 36
  • The job fails because I have not enough network buffers
Scenario 2:
  • # of network buffers 8192
  • parallelism: 36
  • The job ends successfully in about 20 minutes 
Scenario 3:
  • # of network buffers 4096
  • 6 nodes
  • parallelism: 6
  • The job ends successfully in about 11 minutes
What can I infer from these results? That my job is I/O bound, so having more threads on the same machine accessing the disk simultaneously degrades the performance of the pipeline?

Best,
Flavio
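
For context, both knobs live in flink-conf.yaml. A minimal sketch, assuming the configuration keys used by Flink releases of that period (taskmanager.numberOfTaskSlots and taskmanager.network.numberOfBuffers) and the documentation's rule of thumb of sizing buffers at roughly slots-per-TM^2 * #TMs * 4:

    # 6 slots per TaskManager, as in the scenarios above
    taskmanager.numberOfTaskSlots: 6

    # Rule-of-thumb lower bound: 6^2 * 6 * 4 = 864 buffers, but jobs with
    # several repartitioning steps at parallelism 36 can need far more,
    # which matches Scenario 1 failing with 4096 and succeeding with 8192.
    taskmanager.network.numberOfBuffers: 8192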

Re: Performance insights

Flavio Pompermaier
Sorry, I forgot to say that numberOfTaskSlots is always 6.

Re: Performance insights

Stephan Ewen
Yes, that is definitely one possible explanation.

Another one could be that there is data skew, so the increased parallelism does not take work off the most overloaded partition (but reduces the memory available to it).
The web dashboard should actually help you with checking that.


Re: Performance insights

Flavio Pompermaier
Is there an easy way to understand if and when my data get skewed in the pipeline?

Re: Performance insights

Ufuk Celebi

> On 05 Feb 2016, at 16:38, Flavio Pompermaier <[hidden email]> wrote:
>
> Is there an easy way to understand if and when my data get skewed in the pipeline?

Yes, the web frontend shows how many bytes and records each subtask sends and receives. Skew would show up as some tasks having higher numbers than the others.

– Ufuk

Re: Performance insights

Flavio Pompermaier

And what if I detect some skew in a task? Should I try calling rebalance()? Is there a way to identify the keys causing the skew?
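
For reference, a minimal sketch of what calling rebalance() looks like in the DataSet API (the records DataSet, MyRecord type, and MyMapper are hypothetical placeholders); rebalance() redistributes elements round-robin across the parallel instances of the next operator, which evens out partition sizes but does not change how a later groupBy assigns keys:

    import org.apache.flink.api.java.DataSet;

    // 'records', 'MyRecord', and 'MyMapper' stand in for the actual job.
    // rebalance() spreads elements round-robin over all parallel instances
    // of the following operator.
    DataSet<MyRecord> spread = records.rebalance();
    DataSet<MyRecord> processed = spread.map(new MyMapper());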

Re: Performance insights

rmetzger0
You can count the number of elements per key. This allows you to see how they are distributed.
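
A minimal sketch with the DataSet API, assuming a hypothetical input DataSet<MyRecord> named records with a getKey() accessor for the grouping key; it emits (key, count) pairs that reveal heavy hitters:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;

    DataSet<Tuple2<String, Long>> countsPerKey = records
        .map(new MapFunction<MyRecord, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(MyRecord r) {
                return new Tuple2<>(r.getKey(), 1L);
            }
        })
        .groupBy(0)   // group on the key field of the tuple
        .sum(1);      // per-key element count

    countsPerKey.print();  // or writeAsCsv(...) and sort offline to spot the heaviest keys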
