Unexpected latency across operator instances

Unexpected latency across operator instances

Antonis Papaioannou

Hi,

I am experiencing strange behaviour with our Flink application, so I created a very simple sample application to demonstrate the problem.

A simple Flink application reads data from Kafka, performs a simple transformation and accesses an external Redis database within a FlatMap operator. When running the application with parallelism higher than 1, there is unexpectedly high latency on only one instance of the operator that accesses the external database (the “bad” instance is not always the same; it is randomly “selected” across runs). There are multiple Redis instances, all running in standalone mode, so each Redis request is served by the local instance. To show that the latency is not related to Redis, I completely removed the database access and simulated its latency with a sleep of about 0.1 ms, which results in the same strange behaviour.
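
For reference, the reduced job looks roughly like the following (a minimal sketch in the DataStream API; the topic name, Kafka properties and the SimulatedLookup function are placeholders for my actual code, and the short park stands in for the removed Redis read):

    import java.util.Properties;
    import java.util.concurrent.locks.LockSupport;

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.DiscardingSink;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.flink.util.Collector;

    public class LatencySample {

        // Stands in for the FlatMap that used to query Redis.
        public static class SimulatedLookup implements FlatMapFunction<String, String> {
            @Override
            public void flatMap(String value, Collector<String> out) {
                // Redis read removed; wait roughly 0.1 ms instead
                LockSupport.parkNanos(100_000L);
                out.collect(value);
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092");
            props.setProperty("group.id", "latency-sample");

            DataStream<String> source = env.addSource(
                    new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

            source
                .map(value -> value.trim())        // simple transformation
                .keyBy(value -> value)             // keyBy before the lookup operator
                .flatMap(new SimulatedLookup())
                .addSink(new DiscardingSink<>());

            env.execute("latency-sample");
        }
    }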

Profiling the application with the Flink monitoring mechanism, we see that all instances of the upstream operator are backpressured and that the input buffer pool (and the input exclusive buffer pool) usage on the “bad” node stays at 100% for the whole run.

There is no skew in the dataset. I also replaced the keyBy with rebalance, which uses a round-robin data distribution, but there is no difference.
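
Concretely, the only change was the partitioning step between the two operators (same hypothetical pipeline as in the sketch above):

    // keyed (hash) distribution — one instance shows high latency
    source.map(value -> value.trim())
          .keyBy(value -> value)
          .flatMap(new SimulatedLookup());

    // round-robin distribution — same behaviour observed
    source.map(value -> value.trim())
          .rebalance()
          .flatMap(new SimulatedLookup());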

I expected all nodes to exhibit similar (either low or high) latency. So the question is: why does only one operator instance exhibit high latency? Is there any chance there is a starvation problem due to credit-based flow control?

When the keyBy between the operators is removed, the system exhibits the expected behaviour.

I also attach a PDF with more details about the application and graphs of the monitoring data.

I hope someone has an idea about this unexpected behaviour.

Thank you,
Antonis




Attachment: unexpected_latency_report.pdf (761K)

Re: Unexpected latency across operator instances

Paul Lam
Hi Antonis,

Did you try to profile the “bad” TaskManager to see what the task thread was busy doing?

A possible culprit might be GC, if you haven't checked that yet. I've seen GC threads eating up 30% of CPU.
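
If you want a quick look without attaching a profiler, a thread dump of the “bad” TaskManager (e.g. jstack on its PID) usually shows what the task thread is stuck on, and you can turn on GC logging for the TaskManagers. Just as a sketch, assuming a standalone cluster where you control flink-conf.yaml and a JDK 8 JVM (the exact flags differ on newer JDKs):

    # flink-conf.yaml — pass GC logging flags to the TaskManager JVMs
    env.java.opts.taskmanager: "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/taskmanager-gc.log"

Flink also exposes the JVM GC metrics (Status.JVM.GarbageCollector.<Collector>.Count / .Time) per TaskManager, so you can compare the “bad” instance against the others directly from the metrics you already collect.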

Best,
Paul Lam
