"Yes, Flink 1.5.0 will come with better tools to handle this problem. Namely you will be able to limit the “in flight” data, by controlling the number of assigned credits per channel/input gate. Even without any configuring Flink 1.5.0 will out of the box buffer less data, thus mitigating the problem."
I read this in another email chain. The docs (maybe you can point me to them) are not very clear on how to do the above. Any pointers will be appreciated. Thanks much.
|
Hi Vishal, Before Flink 1.5.0, the sender tries its best to send data on the network until the wire is filled with data. From Flink 1.5.0 the network flow control is improved by a credit-based mechanism. That means the sender transfers data based on how many buffers are available on the receiver side, so there will be no data accumulated on the wire. From this point of view, the in-flight data is less than before. You can also further limit the in-flight data by controlling the number of credits on the receiver side; the related parameters are taskmanager.network.memory.buffers-per-channel and taskmanager.network.memory.floating-buffers-per-gate. If you have other questions about them, let me know and I can explain further. Zhijiang
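To make the two parameters concrete, here is a minimal sketch (my own illustration, not from the thread) of passing the keys to a local environment for experimentation; the values shown are simply the Flink 1.5.0 defaults, and on a real cluster these keys would normally be set in flink-conf.yaml on the TaskManagers:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CreditConfigSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Exclusive buffers (credits) granted per incoming channel; 2 is the default.
            conf.setInteger("taskmanager.network.memory.buffers-per-channel", 2);
            // Floating buffers shared by all channels of one input gate; 8 is the default.
            conf.setInteger("taskmanager.network.memory.floating-buffers-per-gate", 8);

            // A local environment only to illustrate where the keys would take effect.
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.createLocalEnvironment(1, conf);
            env.fromElements(1, 2, 3).print();
            env.execute("credit-config-sketch");
        }
    }

Lowering these values reduces the amount of in-flight data per channel, at the cost of potentially lower throughput on high-latency links.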
|
Awesome, thank you for pointing that out. We have seen stability on pipes where previously throttling the source (rateLimiter) was the only way out. https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/TaskManagerOptions.java#L291 This, though, seems to be a cluster-wide setting. Is it possible to do this at an operator level? Does this work with the pipe-level configuration per job (or has that been deprecated)? On Thu, Jul 5, 2018 at 11:16 PM, Zhijiang(wangzhijiang999) <[hidden email]> wrote:
|
Further, if there are metrics that allow us to chart delays per pipe on network buffers, that would be immensely helpful. On Fri, Jul 6, 2018 at 10:02 AM, Vishal Santoshi <[hidden email]> wrote:
|
The config you mentioned is not operator level, but I think it can currently be set at the job level. Operator-level control would need API support, but that seems more reasonable. There exist "inPoolUsage" and "outPoolUsage" metrics that indicate backpressure to some extent: if both of these metrics are at 100% between a producer and a consumer, the producer will be blocked (backpressured) by the consumer for a while. There is also a latency marker emitted from source to sink across the whole topology to sample latency. Maybe you can rely on these metrics for some help.
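As a hedged sketch of the metrics point (again my own illustration, not from the thread): latency markers are enabled per job through the ExecutionConfig, while buffer pool usage is exposed as the task-scoped gauges buffers.inPoolUsage and buffers.outPoolUsage, which can be charted through any configured metrics reporter or the REST API:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class LatencyMetricsSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Emit a latency marker from every source roughly once per second
            // (the 1000 ms interval is an arbitrary example value).
            env.getConfig().setLatencyTrackingInterval(1000);

            // Any simple pipeline; the buffers.inPoolUsage / buffers.outPoolUsage
            // gauges are reported per task regardless of what the job does.
            env.fromElements("a", "b", "c").print();
            env.execute("latency-metrics-sketch");
        }
    }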
|