Performance insights

Flavio Pompermaier
858 posts
Hi to all,

I'm testing how to speed up my Flink job and ran into the following situations on my 6-node cluster (each node has 8 CPUs, and one node also runs the JobManager):

Scenario 1:
  • # of network buffers: 4096
  • parallelism: 36
  • The job fails because there are not enough network buffers
Scenario 2:
  • # of network buffers: 8192
  • parallelism: 36
  • The job ends successfully in about 20 minutes
Scenario 3:
  • # of network buffers: 4096
  • 6 nodes
  • parallelism: 6
  • The job ends successfully in about 11 minutes
What can I infer from these results? That my job is I/O-bound, so having more threads on the same machine accessing the disk simultaneously degrades the performance of the pipeline?

Best,
Flavio

Re: Performance insights

Flavio Pompermaier
858 posts
Sorry, I forgot to say that numberOfTaskSlots is always 6.
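
For context, the Flink configuration documentation of that period suggested a rule of thumb for taskmanager.network.numberOfBuffers: #slots-per-TM^2 * #TMs * 4. Below is a minimal sketch of that arithmetic for this cluster, assuming one TaskManager per node. The helper name is hypothetical, and the formula is quoted from memory of the docs, so it should be checked against the Flink version in use.

```python
def estimate_network_buffers(slots_per_tm: int, num_task_managers: int) -> int:
    """Rule-of-thumb estimate: slots_per_tm^2 * num_task_managers * 4."""
    return slots_per_tm ** 2 * num_task_managers * 4


# Cluster from this thread: 6 TaskManagers (assumed one per node), 6 slots each.
estimate = estimate_network_buffers(slots_per_tm=6, num_task_managers=6)
print(estimate)  # 864
```

The estimate is well below the 4096 buffers that were not enough in Scenario 1, so for this job the rule of thumb is at best a starting point. The real demand likely depends on how many data exchanges the plan has active at the same time.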



Re: Performance insights

Stephan Ewen
1172 posts
Yes, that is definitely one possible explanation.

Another could be data skew: increasing the parallelism does not take work off the most overloaded partition, but it does reduce the memory available to that partition.
The web dashboard should actually help you with checking that.





Re: Performance insights

Flavio Pompermaier
858 posts
Is there an easy way to understand if and when my data get skewed in the pipeline?






Re: Performance insights

Ufuk Celebi
568 posts

> On 05 Feb 2016, at 16:38, Flavio Pompermaier <[hidden email]> wrote:
>
> Is there an easy way to understand if and when my data get skewed in the pipeline?

Yes, the web frontend shows how many bytes and records each subtask sends and receives. Skew would show up as some subtasks having much higher numbers than the others.
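
As a rough illustration of that comparison, the sketch below takes per-subtask "records received" numbers and reports how far the busiest subtask is above the average. The numbers are hypothetical values as one might copy them by hand from the web frontend, and the function name is made up for this example.

```python
from statistics import mean


def skew_ratio(records_per_subtask):
    """Return max/mean of per-subtask record counts; 1.0 means perfectly even."""
    avg = mean(records_per_subtask)
    if avg == 0:
        return 1.0
    return max(records_per_subtask) / avg


# Hypothetical "records received" values, one entry per parallel subtask.
received = [1_000, 1_100, 950, 1_050, 9_000, 900]

ratio = skew_ratio(received)
# Index of the subtask with the largest count.
hot = max(range(len(received)), key=received.__getitem__)
print(f"subtask {hot} received {ratio:.1f}x the average")
```

A ratio close to 1 suggests an even distribution. A ratio of several times the average points at one subtask doing most of the work, which adding parallelism alone would not fix.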

– Ufuk


Re: Performance insights

Flavio Pompermaier
858 posts

And what if I detect skew in some task? Do I have to call rebalance()? Is there a way to identify the keys causing the skew?



Re: Performance insights

rmetzger0
1086 posts
You can count the number of elements per key. This allows you to see how they are distributed.
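
A minimal sketch of that per-key count, written in plain Python on a small hypothetical sample rather than against the Flink API. In a real job the same idea would be a grouped count on the key field, run on the full data set or a sample of it. The function and variable names are made up for this example.

```python
from collections import Counter


def key_distribution(records, key_fn, top_n=3):
    """Return (key, count, share of all records) for the top_n most frequent keys."""
    counts = Counter(key_fn(r) for r in records)
    total = sum(counts.values())
    return [(k, c, c / total) for k, c in counts.most_common(top_n)]


# Hypothetical sample of (key, value) records.
sample = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("c", 5), ("a", 6)]

top = key_distribution(sample, key_fn=lambda r: r[0])
for key, count, share in top:
    print(f"{key}: {count} records ({share:.0%})")
```

Keys that account for a large share of all records are the likely cause of an overloaded partition.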
