Hi, I am confused about sharing a TCP connection for the same connectionID. If two tasks share the same connection and the first task has no available buffer in its local buffer pool, it sets autoRead to false for the channel, but that also affects the second task even if it still has buffers available. So if any one of the tasks runs out of buffers, none of the other tasks can read data from the channel. Is my understanding right? If so, are there any improvements for it? Thank you for any help!
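[Editor's note: the behavior described in the question can be modeled with a short sketch. The classes below are illustrative stand-ins, not Flink's actual network-stack classes; they only show how a single shared autoRead flag couples otherwise independent consumers.]

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified model of two input channels multiplexed over one TCP
// connection that share a single autoRead flag (hypothetical names,
// not Flink's real classes).
public class SharedChannelDemo {

    // One flag per physical connection: when any consumer runs out of
    // buffers, reading stops for *all* consumers on that connection.
    static class SharedConnection {
        boolean autoRead = true;
        final Queue<String> wire = new ArrayDeque<>();

        void send(String msg) { wire.add(msg); }

        // Delivery only happens while autoRead is true.
        String poll() { return autoRead ? wire.poll() : null; }
    }

    static class Consumer {
        final SharedConnection conn;
        int availableBuffers;

        Consumer(SharedConnection conn, int buffers) {
            this.conn = conn;
            this.availableBuffers = buffers;
        }

        String read() {
            if (availableBuffers == 0) {
                conn.autoRead = false; // back-pressures the whole connection
                return null;
            }
            String msg = conn.poll();
            if (msg != null) {
                availableBuffers--;
            }
            return msg;
        }
    }

    public static void main(String[] args) {
        SharedConnection conn = new SharedConnection();
        Consumer a = new Consumer(conn, 0); // task A: no buffers left
        Consumer b = new Consumer(conn, 8); // task B: plenty of buffers

        conn.send("data-for-b");

        System.out.println(a.read()); // null: A has no buffers, flips autoRead off
        System.out.println(b.read()); // null: B is blocked too, despite free buffers
    }
}
```

Even though task B has eight free buffers and its data is already on the wire, it reads nothing once task A has disabled autoRead.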
Yes, that is a correct description of the state of things.
A way to improve this is to introduce flow control in the application layer, where consumers only receive data when they have buffers available. They could announce on the channel how many buffers they have before they receive anything. This way the channel is never blocked, and we could actually multiplex more consumers over the same channel.

The implementation is probably a little tricky, but if you want to work on this and have time to actually do it, we can think about the details. :-) Would you be interested? If yes, let's schedule a Hangout where we brainstorm the solution and how to implement it. Ideally, we would come up with a design document, which we share on the mailing list, and then we continue with the implementation. I currently only have time to act as a guide/mentor, and you would have to do most of the implementation.

– Ufuk

On Mon, May 23, 2016 at 5:40 AM, wangzhijiang999 <[hidden email]> wrote:
> Hi, I am confused about sharing a TCP connection for the same connectionID [...]
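[Editor's note: the credit-announcement idea sketched above can be illustrated as follows. All names here (`CreditBasedSender`, `announceCredit`, etc.) are hypothetical; this is a minimal model of the proposal, not an actual Flink implementation. A consumer with zero credit only causes its own data to be queued on the sender side, so the shared channel keeps flowing for everyone else.]

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of per-consumer credit announcements over a
// shared channel (illustrative names, not Flink's actual classes).
public class CreditBasedSender {

    private final Map<String, Integer> credits = new HashMap<>();
    private final Map<String, Queue<String>> backlog = new HashMap<>();
    private final Map<String, Queue<String>> delivered = new HashMap<>();

    // A consumer announces how many buffers it has *before* receiving data.
    public void announceCredit(String consumerId, int numBuffers) {
        credits.merge(consumerId, numBuffers, Integer::sum);
        drain(consumerId);
    }

    // Data for a consumer with no credit is queued on the sender side;
    // the shared channel itself is never blocked.
    public void send(String consumerId, String msg) {
        backlog.computeIfAbsent(consumerId, k -> new ArrayDeque<>()).add(msg);
        drain(consumerId);
    }

    // Ship queued messages as long as the consumer has credit left.
    private void drain(String consumerId) {
        Queue<String> queue = backlog.getOrDefault(consumerId, new ArrayDeque<>());
        while (credits.getOrDefault(consumerId, 0) > 0 && !queue.isEmpty()) {
            credits.merge(consumerId, -1, Integer::sum);
            delivered.computeIfAbsent(consumerId, k -> new ArrayDeque<>())
                     .add(queue.poll());
        }
    }

    public Queue<String> deliveredTo(String consumerId) {
        return delivered.getOrDefault(consumerId, new ArrayDeque<>());
    }

    public static void main(String[] args) {
        CreditBasedSender sender = new CreditBasedSender();
        sender.announceCredit("taskB", 2);
        sender.send("taskA", "a1"); // taskA has no credit: queued, not sent
        sender.send("taskB", "b1"); // taskB has credit: delivered immediately
        System.out.println(sender.deliveredTo("taskA").size()); // 0
        System.out.println(sender.deliveredTo("taskB").size()); // 1
    }
}
```

Unlike the autoRead scheme, taskA running out of credit here has no effect on taskB's delivery.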
Hi Ufuk,

Thank you for the detailed explanation! As we confirmed, a task sets autoRead to false on the shared channel when it has no available buffer segment, and when the task has buffers available again, it sends a notification that sets autoRead back to true.

But in some scenarios there is a chance that autoRead will never be set back to true for the shared channel. When the buffer-available notification arrives and there are messages staged in the queue, those messages are processed first. Normally each message is put into the input channel's buffer, but if the owning task has failed and its buffer pool has been released, processing the message returns false, and the channel is never switched back to autoRead. All the other tasks sharing the channel are then affected.

In summary: if one task sets autoRead to false, and at the time it notifies that buffers are available there are staged messages to process first, then if one of those messages belongs to another, failed task, autoRead for the channel will never be set back to true. The only way out is to cancel all the tasks on the channel in order to release it. Is that right?

In the past I improved the failover strategy based on Flink for our application and noticed this issue. I am also very interested and would be pleased to do some related work for the Flink improvement you mentioned. I am actually working on improving Flink in many ways for our application, and I would like to stay in contact with you for professional advice. Thank you again!

Zhijiang Wang
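[Editor's note: the stuck-autoRead failure mode described above can be sketched like this. The class and method names are illustrative only, a simplified model of the described behavior rather than Flink's real code: the drain of staged messages aborts as soon as one message belongs to a task whose buffer pool was released, so the line that restores autoRead is never reached.]

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified model (hypothetical names, not Flink's real classes) of the
// failure mode: staged messages are drained when a buffer becomes
// available, but a message for a task whose buffer pool was released
// aborts the drain before autoRead is restored.
public class StagedDrainDemo {
    boolean autoRead = false;                 // was flipped off earlier
    final Queue<String> staged = new ArrayDeque<>();

    // Returns false if the owning task failed and its pool was released.
    boolean handOverToTask(String msg) {
        return !msg.startsWith("failed-task");
    }

    // Called when a buffer becomes available again.
    void onBufferAvailable() {
        while (!staged.isEmpty()) {
            if (!handOverToTask(staged.poll())) {
                return; // drain aborted: autoRead stays false forever
            }
        }
        autoRead = true; // only reached if every staged message succeeds
    }

    public static void main(String[] args) {
        StagedDrainDemo demo = new StagedDrainDemo();
        demo.staged.add("ok-message");
        demo.staged.add("failed-task-message"); // belongs to a failed task
        demo.staged.add("ok-message-2");
        demo.onBufferAvailable();
        System.out.println(demo.autoRead); // false: channel stays blocked
    }
}
```

All tasks still sharing the channel are now starved, even though only one task failed.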
On Mon, May 23, 2016 at 6:55 PM, wangzhijiang999 <[hidden email]> wrote:
> In summary: if one task sets autoRead to false, and at the time it notifies
> that buffers are available there are staged messages to process first, then
> if one of those messages belongs to another, failed task, autoRead for the
> channel will never be set back to true. The only way out is to cancel all
> the tasks on the channel in order to release it. Is that right?

Yes, very good observation. In this sense the failure model of Flink is baked into the way the channels are multiplexed, which is a bad thing (as you already noticed with your improved failover strategy).

If you want, let me look into this issue on a high level, and let's fix it together as a first step. Let's have a chat about this by the end of the week. Does that work for you? After that, we can continue with the flow control issue, which is definitely a bigger task.

– Ufuk
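[Editor's note: one possible direction for the first-step fix discussed here, sketched with hypothetical names (this is not the actual fix, just an illustration of decoupling task failure from the shared channel): messages staged for a failed task are discarded instead of halting the drain, so autoRead is always restored for the surviving tasks.]

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch (illustrative names only): a failed task's staged
// messages are dropped rather than aborting the drain, so the shared
// channel's autoRead flag is always restored for the surviving tasks.
public class FailureTolerantDrain {
    boolean autoRead = false;
    final Queue<String> staged = new ArrayDeque<>();

    // Stand-in for checking whether the owning task is still running.
    boolean taskIsAlive(String msg) {
        return !msg.startsWith("failed-task");
    }

    void onBufferAvailable() {
        while (!staged.isEmpty()) {
            String msg = staged.poll();
            if (!taskIsAlive(msg)) {
                continue; // drop data for the failed task, keep draining
            }
            // hand msg over to its (still running) task here
        }
        autoRead = true; // restored regardless of individual task failures
    }

    public static void main(String[] args) {
        FailureTolerantDrain drain = new FailureTolerantDrain();
        drain.staged.add("failed-task-message");
        drain.staged.add("ok-message");
        drain.onBufferAvailable();
        System.out.println(drain.autoRead); // true: channel unblocked
    }
}
```

The per-task failure is contained on the receiver side, and the other tasks multiplexed over the connection keep reading.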
I am not a Flink master or a regular user of Flink, but I would like to start contributing to it. Would it be possible to get involved in this issue and start contributing to the Flink community?

On Mon, May 23, 2016 at 10:49 PM, Ufuk Celebi <[hidden email]> wrote:
> On Mon, May 23, 2016 at 6:55 PM, wangzhijiang999 [...]
On Mon, May 23, 2016 at 7:30 PM, Deepak Sharma <[hidden email]> wrote:
> Would it be possible to get involved in this issue and start contributing
> to the Flink community?

Hey Deepak! Nice to see that you are also interested in this. If you are new to Flink, I would recommend starting by looking into one of the starter issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20FLINK%20AND%20status%20%3D%20Open%20AND%20labels%20%3D%20starter

The lower layers are quite tricky. If it's OK with you, I would rather post any follow-ups, like JIRA issues or design docs, in this thread, at which point it will be possible to chime in here, too. In the meantime, the issues I've linked are a better place to start, I think.

– Ufuk
Thanks, Ufuk.

-Deepak

On 23 May 2016 11:21 pm, "Ufuk Celebi" <[hidden email]> wrote:
> On Mon, May 23, 2016 at 7:30 PM, Deepak Sharma <[hidden email]> wrote: [...]
Hi Ufuk,

I am willing to do some work on this issue and have a basic solution for it, and I would like to get your professional suggestions on it. What is the next step? Looking forward to your reply!

Zhijiang Wang