Understanding Restart Strategy


Ashish Pokharel
Team,

Hopefully, this is a quick one. 

We have set up the restart strategy as follows in pretty much all of our apps:

    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, Time.of(30, TimeUnit.SECONDS)));

This seems pretty straightforward: the app should retry starting up to 10 times, 30 seconds apart, so for about 5 minutes. Either we are not understanding this or the behavior is inconsistent. Some of the applications restart and come back fine after issues like Kafka timeouts (which I will come back to later), but in some cases the same issues pretty much shut the app down.
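As a quick check of the "about 5 minutes" estimate, here is a minimal plain-Java sketch of the arithmetic. The class and method names are made up for illustration and are not part of Flink; the calculation only sums the configured delays between attempts and ignores however long each attempt runs before it fails.

```java
import java.time.Duration;

public class RestartWindow {

    // Hypothetical helper: total time spent waiting between restart attempts.
    // It does not account for the time each attempt takes before failing.
    static Duration totalDelay(int maxAttempts, Duration delayBetweenAttempts) {
        return delayBetweenAttempts.multipliedBy(maxAttempts);
    }

    public static void main(String[] args) {
        // Values from the restart strategy above: 10 attempts, 30 seconds apart.
        Duration window = totalDelay(10, Duration.ofSeconds(30));
        System.out.println(window.toMinutes() + " minutes"); // prints "5 minutes"
    }
}
```

So the 5-minute figure is a lower bound on how long the job keeps trying, assuming every attempt fails almost immediately.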

My first guess was that the total count of 10 is not reset after the app has recovered normally. Is there a need to manually reset the counter in an app? I doubt Flink would treat it as a counter that spans the life of an app instead of resetting it on successful start-up, but I am not sure how else to explain the behavior.

Along the same lines, what actually counts as a "restart"? Our Kafka cluster has known performance bottlenecks during certain times of day that we are working to resolve, and I notice Kafka producer timeouts quite often during those periods. When the app hits these timeouts, it recovers fine, but I don't necessarily see the entire application restarting, as I don't see my app's bootstrap logs. Does something like this count as a restart from the Restart Strategy perspective, as opposed to things like app crashes or YARN killing the application, where the app is actually restarted from scratch?

We are really liking Flink; we just need to hash out these operational issues to make it ready for prime time for all the streaming apps in our cluster.

Thanks,

Ashish
Re: Understanding Restart Strategy

Ashish Pokharel
FYI,

I think I have gotten to the bottom of this situation. For anyone who might be in a similar situation, hopefully my observations will help.

In my case, it had nothing to do with the Flink Restart Strategy, which was doing its thing as expected. The real issue was the Kafka Producer timeout counters. As I mentioned in another thread, we have a capacity issue with our Kafka cluster that ends up causing timeouts in our Flink applications (we do have throttling in place in Kafka to manage it better, but we still run into timeouts pretty often, unfortunately).

We had set our Kafka Producer retries to 10. It seems like that retry counter never gets reset, so if an app hit 10 timeouts over its lifetime, it basically couldn't start and went to a Failed state. I have yet to dig into whether this can be solved in the Flink Kafka wrapper or not. For now, we have set the retries to 0, and hopefully this situation will not happen again.
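For anyone wanting to try the same workaround, a minimal sketch of the producer settings might look like the following. It only builds a plain `java.util.Properties` object, the kind usually handed to a Kafka producer. The class name, the helper method and the broker address `broker-1:9092` are placeholders I made up; `bootstrap.servers` and `retries` are the standard Kafka producer configuration keys. How the properties get passed into the Flink Kafka sink depends on the connector version, so that part is left out.

```java
import java.util.Properties;

public class ProducerSettings {

    // Hypothetical helper that collects the Kafka producer settings discussed above.
    static Properties producerProperties(String bootstrapServers, int retries) {
        Properties props = new Properties();
        // Standard Kafka producer config keys; the values are supplied by the caller.
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("retries", Integer.toString(retries));
        return props;
    }

    public static void main(String[] args) {
        // "broker-1:9092" is a placeholder address, not a real cluster.
        Properties props = producerProperties("broker-1:9092", 0);
        System.out.println(props.getProperty("retries")); // prints "0"
    }
}
```

Note that with retries at 0 the producer does not resend a failed record itself, so any send failure surfaces to the job right away.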

If anyone has similar observations, please feel free to share.

Thanks, Ashish

On Jan 19, 2018, at 2:43 PM, ashish pok <[hidden email]> wrote:


Re: Understanding Restart Strategy

Aljoscha Krettek
Thanks for the update!

On 25. Jan 2018, at 04:12, Ashish Pokharel <[hidden email]> wrote:
