suggestion of FLINK-10868

suggestion of FLINK-10868

Anyang Hu
Hi Peter & Till,

As commented in the issue, we have introduced the FLINK-10868 patch online (mainly for batch jobs). What do you think of the following two suggestions?

1) Make the time interval configurable via a parameter. At present, the default interval of 1 minute is used, which is too short for batch jobs.

2) Add a parameter to control whether onFatalError is performed when the number of failed containers reaches MAXIMUM_WORKERS_FAILURE_RATE and the JM disconnects, so that batch jobs can exit as soon as possible.
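
For illustration only, the two parameters might look roughly like the following ConfigOptions (the key names and defaults here are just my assumptions, not what the final patch has to use):

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class WorkersFailureRateOptions {

	// Suggestion 1: make the interval over which the failure rate is measured configurable.
	public static final ConfigOption<String> WORKERS_FAILURE_RATE_INTERVAL =
		ConfigOptions.key("resourcemanager.maximum-workers-failure-rate.interval")
			.defaultValue("1 min");

	// Suggestion 2: whether to call onFatalError (and thus exit the cluster) when the
	// failed container number reaches MAXIMUM_WORKERS_FAILURE_RATE and the JM is disconnected.
	public static final ConfigOption<Boolean> EXIT_ON_MAXIMUM_WORKERS_FAILURE =
		ConfigOptions.key("resourcemanager.exit-on-maximum-workers-failure")
			.defaultValue(false);
}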

Best regards,
Anyang

Re: suggestion of FLINK-10868

Peter Huang
Hi Anyang,

Thanks for raising this. I think it is reasonable, as what you requested is needed for batch. Let's wait for Till to give some more input.



Best Regards
Peter Huang


Re: suggestion of FLINK-10868

Anyang Hu
Thank you for the reply; I look forward to Till's advice.

Anyang


Re: suggestion of FLINK-10868

Till Rohrmann
Hi Anyang,

thanks for your suggestions.

1) I guess one needs to make this interval configurable. A session cluster could theoretically execute batch as well as streaming tasks and, hence, I doubt that there is an optimal value. Maybe the default could be a bit longer than 1 min, though.

2) Which component do you want to terminate immediately?

I think we can consider your input while reviewing the PR. If it turns out to be a bigger change, it would be best to create a follow-up issue once FLINK-10868 has been merged.

Cheers,
Till


Re: suggestion of FLINK-10868

Anyang Hu
Hi Till,
Thank you for the reply.

1. The interval may need to be customized according to the usage scenario. For our online batch jobs, we set the interval parameter to 8h (see the example config line below).
2. For our usage scenario, we need the client to exit immediately when the number of failed containers reaches MAXIMUM_WORKERS_FAILURE_RATE.
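
With an option like the one sketched in my first mail, setting the 8h interval would just be one line in flink-conf.yaml (the key name is only an assumption):

	resourcemanager.maximum-workers-failure-rate.interval: 8 h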

Best Regards,
Anyang


Re: suggestion of FLINK-10868

Peter Huang
Hi Till,

1) From Anyang's request, I think it is reasonable to use two parameters for the rate, since a batch job runs for a while and the failure rate within a small interval is meaningless. I think they need a failure count from the beginning of the job as the failure condition (a rough sketch follows below).

2) In the current implementation, the MaximumFailedTaskManagerExceedingException is a SuppressRestartsException, so the job will exit immediately.
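
Just to make point 1 concrete, the alternative condition could look roughly like this (the field and method names are made up for illustration, not the actual code in the PR):

	// Count every failed worker since the job started, instead of measuring
	// a rate over a short sliding interval.
	private int failedWorkersSinceStart = 0;

	private boolean maximumWorkersFailureExceeded(int maxFailedWorkers) {
		failedWorkersSinceStart++;
		return failedWorkersSinceStart > maxFailedWorkers;
	}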


Best Regards
Peter Huang

Re: suggestion of FLINK-10868

Anyang Hu
Hi Peter,

For our online batch jobs, there is a scenario where the number of failed containers reaches MAXIMUM_WORKERS_FAILURE_RATE but the client does not exit immediately (the probability of losing the JM increases greatly when thousands of containers have to be started). We found that the JM disconnection (the reason for the JM loss is unknown) causes notifyAllocationFailure to have no effect.

After the introduction of FLINK-13184, which starts containers with multiple threads, the JM disconnection situation has been alleviated. To reliably make the client exit immediately, we use the following code to decide whether to call onFatalError when a MaximumFailedTaskManagerExceedingException occurs:

@Override
public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
	validateRunsInMainThread();

	JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
	if (jobManagerRegistration != null) {
		// Normal path: forward the allocation failure to the JobManager.
		jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
	} else {
		// The JM for this job is not registered (e.g. disconnected). With our custom
		// flag enabled, treat this as fatal so the cluster process exits immediately.
		if (exitProcessOnJobManagerTimedout) {
			ResourceManagerException exception = new ResourceManagerException(
				"Job Manager is lost, can not notify allocation failure.");
			onFatalError(exception);
		}
	}
}

Best regards,
Anyang

Re: suggestion of FLINK-10868

Till Rohrmann
Hi Anyang,

I don't think we can take your proposal, because it means that whenever we want to call notifyAllocationFailure while there is a connection problem between the RM and the JM, we fail the whole cluster. A robust and resilient system should not do this, because connection problems are expected and need to be handled gracefully. Instead, if one deems the notifyAllocationFailure message to be very important, one would need to keep it and tell the JM once it has connected back.
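
Roughly sketched (the names here are only assumptions for illustration, not existing Flink code; it reuses the types from your snippet plus the usual java.util collections), the RM could buffer the notification and replay it when the JM re-registers:

	// Buffered allocation failures per job, to be replayed once the JM reconnects.
	private final Map<JobID, List<PendingAllocationFailure>> pendingAllocationFailures = new HashMap<>();

	private static final class PendingAllocationFailure {
		final AllocationID allocationId;
		final Exception cause;

		PendingAllocationFailure(AllocationID allocationId, Exception cause) {
			this.allocationId = allocationId;
			this.cause = cause;
		}
	}

	public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
		JobManagerRegistration registration = jobManagerRegistrations.get(jobId);
		if (registration != null) {
			registration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
		} else {
			// JM temporarily unreachable: remember the failure instead of failing the cluster.
			pendingAllocationFailures
				.computeIfAbsent(jobId, id -> new ArrayList<>())
				.add(new PendingAllocationFailure(allocationId, cause));
		}
	}

	// Would be called from wherever the JM for this job registers (again).
	private void replayPendingAllocationFailures(JobID jobId, JobManagerRegistration registration) {
		List<PendingAllocationFailure> pending = pendingAllocationFailures.remove(jobId);
		if (pending != null) {
			for (PendingAllocationFailure failure : pending) {
				registration.getJobManagerGateway().notifyAllocationFailure(failure.allocationId, failure.cause);
			}
		}
	}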

Cheers,
Till


Re: suggestion of FLINK-10868

Anyang Hu
Hi Till,

Some of our online batch jobs have strict SLA requirements and are not allowed to be stuck for a long time, so we took a rather crude approach to make the job exit immediately. Waiting for the connection to recover is the better solution. Maybe we need to add a timeout for the JM to restore the connection?
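
Something along these lines, continuing from the snippet I posted earlier (the timeout field and the use of scheduleRunAsync here are only an illustration of the idea, not existing code):

	// Instead of calling onFatalError right away when the JM is gone, give it some
	// time to reconnect and only fail the process if it is still missing afterwards.
	if (jobManagerRegistrations.get(jobId) == null) {
		scheduleRunAsync(() -> {
			if (jobManagerRegistrations.get(jobId) == null) {
				onFatalError(new ResourceManagerException(
					"JobManager for job " + jobId + " did not reconnect within " +
						jobManagerReconnectTimeout + ", can not notify allocation failure."));
			}
		}, jobManagerReconnectTimeout);
	}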

For suggestion 1 (making the interval configurable), we have already done it, and if possible we would like to contribute it back to the community.

Best regards,
Anyang


Re: suggestion of FLINK-10868

Till Rohrmann
Suggestion 1 makes sense. For the quick termination, I think we need to think a bit more about it to find a good solution that also supports strict SLA requirements.

Cheers,
Till


Re: suggestion of FLINK-10868

Anyang Hu
Thanks Till, I will continue to follow this issue and see what we can do.

Best regards,
Anyang


Re: suggestion of FLINK-10868

Peter Huang
Hi Anyang and Till,

I think we agreed on making the interval configurable in this case. Let me revise the current PR. You can review it after that.



Best Regards
Peter Huang
