failure-rate restart strategy not working?

Shannon Carey
I recently updated my cluster with the following config:

restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s

I see the settings inside the JobManager web UI, as expected. I am not setting the restart-strategy programmatically, but the job does have checkpointing enabled.

However, if I launch a job that (intentionally) fails every 10 seconds by throwing a RuntimeException, it continues to restart beyond the limit of 3 failures.
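To make the expectation concrete, here is a minimal, self-contained sketch of a sliding-window failure-rate limit. It is a hypothetical model written for this thread, not Flink's implementation: the class name, the method names, and the exact boundary (giving up on the failure that exceeds the limit) are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of a failure-rate limit; NOT Flink's actual implementation. */
final class FailureRateSketch {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateSketch(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /** Records a failure at nowMillis and reports whether a restart is still allowed. */
    boolean recordFailureAndCheckRestart(long nowMillis) {
        // Forget failures that have fallen out of the sliding window.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() > intervalMillis) {
            failureTimestamps.removeFirst();
        }
        failureTimestamps.addLast(nowMillis);
        // Assumption: the job gives up once the window holds more failures than the limit.
        return failureTimestamps.size() <= maxFailuresPerInterval;
    }

    /** Simulates a job that fails every periodMillis; returns the restarts granted (at most cap). */
    static int restartsBeforeGivingUp(int maxFailures, long intervalMillis,
                                      long periodMillis, int cap) {
        FailureRateSketch limit = new FailureRateSketch(maxFailures, intervalMillis);
        int restarts = 0;
        for (long t = periodMillis; restarts < cap; t += periodMillis) {
            if (!limit.recordFailureAndCheckRestart(t)) {
                break;
            }
            restarts++;
        }
        return restarts;
    }

    public static void main(String[] args) {
        // 3 failures per 5 minutes, with the job failing every 10 seconds.
        System.out.println(restartsBeforeGivingUp(3, 5 * 60_000L, 10_000L, 100));
    }
}
```

Under these assumptions, a job that fails every 10 seconds should be restarted only a handful of times before the limit trips, which is why unbounded restarts suggest that a different strategy is in effect.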

Does anyone know why this might be happening? Any ideas of things I could check?

Thanks!
Shannon

Re: failure-rate restart strategy not working?

Shannon Carey
I think I figured it out: the problem is due to Flink's behavior when a job has checkpointing enabled.

When the job graph is created, if checkpointing is enabled but a restart strategy hasn't been programmatically configured, Flink changes the job graph's execution config to use the fixed delay restart strategy. That gets serialized with the job graph. Then, when the JobManager deserializes the execution config, it sees that there's a restart strategy configured for the job and uses that instead of using the restart strategy that's configured on the cluster.
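The precedence described above can be modeled in a few lines. This is only a sketch of the reported behavior: the class, the method names, and the use of plain strings for strategies are hypothetical stand-ins, not Flink's actual classes or code paths.

```java
import java.util.Optional;

/** Hypothetical model of the behavior reported in this thread; not Flink code. */
final class RestartStrategyPrecedenceSketch {

    /** Job-side step: what ends up in the serialized execution config. */
    static Optional<String> jobLevelStrategy(Optional<String> programmatic,
                                             boolean checkpointingEnabled) {
        if (programmatic.isPresent()) {
            return programmatic;
        }
        // Reported behavior: enabling checkpointing silently selects fixed-delay.
        return checkpointingEnabled ? Optional.of("fixed-delay") : Optional.empty();
    }

    /** JobManager-side step: a job-level strategy wins over the cluster setting. */
    static String effectiveStrategy(Optional<String> jobLevel, String clusterSetting) {
        return jobLevel.orElse(clusterSetting);
    }

    public static void main(String[] args) {
        String cluster = "failure-rate";
        // Checkpointing on, nothing set programmatically: the cluster setting is ignored.
        System.out.println(effectiveStrategy(jobLevelStrategy(Optional.empty(), true), cluster));
        // Checkpointing off: the cluster setting is used.
        System.out.println(effectiveStrategy(jobLevelStrategy(Optional.empty(), false), cluster));
    }
}
```

In this model the first call yields fixed-delay and the second yields failure-rate, which matches what I observed on the cluster.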

The documentation definitely needs to be adjusted. Maybe I can add some changes to https://github.com/apache/flink/pull/3059

However, should we also consider some implementation changes? Is it intentional that enabling checkpointing overrides the restart strategy set on the cluster, and that the only way to control the restart strategy on a checkpointed job is to set it programmatically? If not, would it be reasonable to set the fixed-delay restart strategy only if checkpointing is enabled AND the cluster doesn't explicitly configure a strategy? Flink would then no longer use the execution config to control the strategy, but would instead decide it in JobManager#submitJob().

-Shannon

From: Shannon Carey <[hidden email]>
Date: Thursday, January 5, 2017 at 1:50 PM
To: "[hidden email]" <[hidden email]>
Subject: failure-rate restart strategy not working?

Re: failure-rate restart strategy not working?

Stephan Ewen
I think you are right, enabling checkpointing should not override the cluster settings per se.

This is probably an unwanted artifact of the way that configuration currently works: settings explicitly set in the program trump the cluster defaults (in the config). Since activating checkpointing sets a strategy in the ExecutionConfig (program), it overrides the cluster default.

It is definitely not intended in that case. For that specific case, it makes sense to simply leave the restart strategy "undefined" and use the "fixed delay" one at runtime if none other is specified.
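A rough sketch of that resolution order, resolved at runtime rather than at job-graph creation. Everything here is hypothetical: the class and method names are made up for illustration, strategies are plain strings, and the "none" fallback for jobs without checkpointing is an assumption rather than something settled in this thread.

```java
import java.util.Optional;

/** Hypothetical model of the proposed resolution order; not an actual Flink change. */
final class ProposedRestartResolutionSketch {

    static String resolve(Optional<String> programmatic,
                          Optional<String> clusterConfigured,
                          boolean checkpointingEnabled) {
        // 1. A strategy set explicitly in the program still wins.
        if (programmatic.isPresent()) {
            return programmatic.get();
        }
        // 2. Otherwise the job leaves it "undefined" and the cluster setting applies.
        if (clusterConfigured.isPresent()) {
            return clusterConfigured.get();
        }
        // 3. Only when nothing is specified does checkpointing imply fixed-delay.
        //    Assumption: without checkpointing, the fallback is no restart at all.
        return checkpointingEnabled ? "fixed-delay" : "none";
    }

    public static void main(String[] args) {
        // Checkpointed job on a cluster configured with failure-rate.
        System.out.println(resolve(Optional.empty(), Optional.of("failure-rate"), true));
        // Checkpointed job on a cluster with no restart strategy configured.
        System.out.println(resolve(Optional.empty(), Optional.empty(), true));
    }
}
```

With this ordering, a checkpointed job on a cluster configured with failure-rate would pick up failure-rate, and fixed-delay would remain the fallback only when neither the program nor the cluster specifies anything.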

Stephan

On Fri, Jan 6, 2017 at 6:44 PM, Shannon Carey <[hidden email]> wrote:

Re: failure-rate restart strategy not working?

Aljoscha Krettek
Hi,
did you create a Jira issue for this? (I'm just getting up to speed after vacation, so sorry if you already did; I haven't read the Jira mail yet.)

Cheers,
Aljoscha

On Fri, 6 Jan 2017 at 19:08 Stephan Ewen <[hidden email]> wrote: