Hi Peter & Till,
As commented in the issue, we have introduced the FLINK-10868 patch online (mainly for batch tasks). What do you think of the following two suggestions: 1) Make the time interval configurable via a parameter. At present, the default interval of 1 min is used, which is too short for batch tasks;
Best regards,
Anyang
Hi Anyang,
Thanks for raising this. I think it is reasonable, as what you are requesting is needed for batch. Let's wait for Till to give some more input.
Best Regards,
Peter Huang
On Thu, Sep 5, 2019 at 7:02 AM Anyang Hu <[hidden email]> wrote:
Thank you for the reply; I look forward to Till's advice.
Anyang
Peter Huang <[hidden email]> wrote on Thu, Sep 5, 2019 at 11:53 PM:
Hi Anyang,
Thanks for your suggestions. 1) I guess one needs to make this interval configurable. A session cluster could theoretically execute batch as well as streaming tasks, and hence I doubt that there is a single optimal value. Maybe the default could be a bit longer than 1 min, though. 2) Which component do you want to terminate immediately? I think we can consider your input while reviewing the PR. If it turns out to be a bigger change, then it would be best to create a follow-up issue once FLINK-10868 has been merged.
Cheers,
Till
On Fri, Sep 6, 2019 at 11:42 AM Anyang Hu <[hidden email]> wrote:
Hi Till,
Thank you for the reply. 1. The interval may need to be customized according to the usage scenario. For our online batch jobs, we set the interval parameter to 8h. 2. For our usage scenario, we need the client to exit immediately when the number of failed containers reaches MAXIMUM_WORKERS_FAILURE_RATE.
Best Regards,
Anyang
Till Rohrmann <[hidden email]> wrote on Fri, Sep 6, 2019 at 9:33 PM:
Hi Till,
1) From Anyang's request, I think it is reasonable to use two parameters for the rate, since a batch job runs for a while and the failure rate within a small interval is meaningless. I think they also need a failure count from the beginning of the job as a failure condition. 2) In the current implementation, MaximumFailedTaskManagerExceedingException is a SuppressRestartsException, so it will exit immediately.
Best Regards,
Peter Huang
On Sun, Sep 8, 2019 at 1:27 AM Anyang Hu <[hidden email]> wrote:
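[Editor's note: a minimal, self-contained sketch of the two failure conditions discussed above -- a rate limit within a sliding interval plus a cumulative count since job start. All class and method names here are hypothetical illustrations, not actual Flink APIs.]

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: combines an interval-based failure rate with a
// cumulative failure count, as suggested in the discussion above.
public class FailureTracker {
    private final long intervalMillis;        // sliding window, e.g. 8h for batch
    private final int maxFailuresPerInterval; // rate limit within the window
    private final int maxTotalFailures;       // cumulative limit since job start
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();
    private long totalFailures = 0;

    public FailureTracker(long intervalMillis, int maxFailuresPerInterval, int maxTotalFailures) {
        this.intervalMillis = intervalMillis;
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.maxTotalFailures = maxTotalFailures;
    }

    /** Records one container failure; returns true if either limit is exceeded. */
    public boolean recordFailureAndCheck(long nowMillis) {
        totalFailures++;
        failureTimestamps.addLast(nowMillis);
        // Evict failures that fell out of the sliding interval.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() > intervalMillis) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() > maxFailuresPerInterval
                || totalFailures > maxTotalFailures;
    }

    public static void main(String[] args) {
        FailureTracker t = new FailureTracker(60_000, 2, 100);
        System.out.println(t.recordFailureAndCheck(0));       // within limits
        System.out.println(t.recordFailureAndCheck(1_000));
        System.out.println(t.recordFailureAndCheck(2_000));   // 3 failures inside the window
        System.out.println(t.recordFailureAndCheck(120_000)); // window slid past earlier failures
    }
}
```

With a long interval (such as the 8h Anyang mentions) the window check approximates a count "from the beginning" for most batch jobs, while the cumulative limit covers jobs that outlive the window.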
Hi Peter,
After the introduction of FLINK-13184, which starts containers with multiple threads, the JM disconnection situation has been alleviated. To make the client exit immediately in a stable way, we use the following code to decide whether to call onFatalError when a MaximumFailedTaskManagerExceedingException occurs: @Override
Best regards,
Anyang
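[Editor's note: Anyang's code snippet above is truncated in the archive. The following is a self-contained sketch of the kind of check being described -- escalating only the "too many failed TaskManagers" exception to a fatal error. The exception and handler types here are stand-ins modeled on the names used in the thread, not verbatim Flink code.]

```java
// Illustrative stand-ins; not actual Flink classes.
public class FatalErrorDecision {
    /** Hypothetical marker exception for exceeding the failed-container limit. */
    static class MaximumFailedTaskManagerExceedingException extends RuntimeException {
        MaximumFailedTaskManagerExceedingException(String msg) { super(msg); }
    }

    interface FatalErrorHandler {
        void onFatalError(Throwable t);
    }

    /**
     * Escalates only the failed-TaskManager-limit case to a fatal error so the
     * client exits immediately; other failures are left to normal handling.
     * Returns true if the error was escalated.
     */
    static boolean maybeEscalate(Throwable cause, FatalErrorHandler handler) {
        if (cause instanceof MaximumFailedTaskManagerExceedingException) {
            handler.onFatalError(cause);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        FatalErrorHandler handler = t -> System.out.println("fatal: " + t.getMessage());
        maybeEscalate(new MaximumFailedTaskManagerExceedingException("limit reached"), handler);
        System.out.println(maybeEscalate(new RuntimeException("ordinary failure"), handler));
    }
}
```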
Hi Anyang,
I think we cannot take your proposal, because it means that whenever we want to call notifyAllocationFailure while there is a connection problem between the RM and the JM, we fail the whole cluster. This is something a robust and resilient system should not do, because connection problems are expected and need to be handled gracefully. Instead, if one deems the notifyAllocationFailure message very important, one would need to keep it and tell the JM once it has reconnected.
Cheers,
Till
On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <[hidden email]> wrote:
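[Editor's note: Till's "keep it and tell the JM once it has connected back" idea can be sketched as a small buffer that holds notifications while disconnected and replays them on reconnect. This is an illustration under assumed names, not Flink's actual RM/JM gateway code.]

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Sketch: buffer important notifications while the JM is unreachable,
// and flush them in order once the connection is restored.
public class AllocationFailureBuffer {
    private final Deque<String> pending = new ArrayDeque<>();
    private Consumer<String> jobMasterGateway; // null while disconnected

    /** Delivers immediately if connected, otherwise keeps the message. */
    public void notifyAllocationFailure(String failure) {
        if (jobMasterGateway != null) {
            jobMasterGateway.accept(failure);
        } else {
            pending.addLast(failure);
        }
    }

    /** Called when the JM reconnects: flush everything buffered meanwhile. */
    public void onJobMasterConnected(Consumer<String> gateway) {
        this.jobMasterGateway = gateway;
        while (!pending.isEmpty()) {
            gateway.accept(pending.removeFirst());
        }
    }

    public void onJobMasterDisconnected() {
        this.jobMasterGateway = null;
    }

    public int pendingCount() {
        return pending.size();
    }
}
```

The cluster stays up through transient disconnects, and no allocation-failure information is lost; the trade-off is that the JM learns about failures late, which is what motivates the timeout discussed next in the thread.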
Hi Till,
Some of our online batch jobs have strict SLA requirements and are not allowed to be stuck for a long time, so we took a crude approach and made the job exit immediately. Waiting for the connection to recover is a better solution; maybe we need to add a timeout while waiting for the JM connection to be restored? As for suggestion 1 (making the interval configurable): we have already done it, and if possible we would like to contribute it back to the community.
Best regards,
Anyang
Till Rohrmann <[hidden email]> wrote on Mon, Sep 9, 2019 at 3:09 PM:
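[Editor's note: the timeout Anyang proposes could look like the following bounded wait -- poll the connection status until it recovers or a deadline passes, and only then treat the failure as fatal. A sketch with assumed names, not a Flink API.]

```java
import java.util.function.BooleanSupplier;

// Sketch: wait a bounded time for the JM connection to come back before
// giving up, so SLA-bound jobs are not stuck indefinitely.
public class ReconnectWait {
    /**
     * Polls the connection status until it reports connected or the timeout
     * elapses. Returns true if the connection was restored in time.
     */
    public static boolean awaitReconnect(
            BooleanSupplier connected, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (connected.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pollMillis);
        }
        return connected.getAsBoolean(); // one last check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(awaitReconnect(() -> true, 100, 10));  // restored in time
        System.out.println(awaitReconnect(() -> false, 50, 10));  // timed out
    }
}
```

If `awaitReconnect` returns false, the client could then fall back to the immediate-exit behavior Anyang describes, giving both resilience to transient disconnects and a bounded worst-case delay.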
Suggestion 1 makes sense. For the quick termination, I think we need to think about it a bit more to find a good solution that also supports strict SLA requirements.
Cheers,
Till
On Wed, Sep 11, 2019 at 11:11 AM Anyang Hu <[hidden email]> wrote:
Thanks Till, I will continue to follow this issue and see what we can do.
Best regards,
Anyang
Till Rohrmann <[hidden email]> wrote on Wed, Sep 11, 2019 at 5:12 PM:
Hi Anyang and Till,
I think we agreed on making the interval configurable in this case. Let me revise the current PR; you can review it after that.
Best Regards,
Peter Huang
On Thu, Sep 12, 2019 at 12:53 AM Anyang Hu <[hidden email]> wrote: