YARN session application attempts


YARN session application attempts

stefanobaghino
Hello everybody,

I was asking myself: are there any best practices regarding how to set the `yarn.application-attempts` configuration key when running Flink on YARN as a long-running session? The configuration page in the docs states that 1 is the default and that it is recommended to leave it at that. However, in the case of a long-running session, it seems to me that the value should be higher, so that the session can actually keep running despite JobManagers failing.
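
For reference, this is the kind of setting I mean (a hypothetical flink-conf.yaml excerpt of mine; the key comes from the docs, the value is just an example):

    # flink-conf.yaml
    # Number of ApplicationMaster (JobManager) attempts YARN may make
    # before giving up on the session.
    yarn.application-attempts: 4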

Furthermore, the HA page in the docs states the following:

"""
It’s important to note that yarn.resourcemanager.am.max-attempts is an upper bound for the application restarts. Therefore, the number of application attempts set within Flink cannot exceed the YARN cluster setting with which YARN was started.
"""

However, after some tests conducted by my colleagues and after looking at the code (FlinkYarnClientBase:522-536), it seems to me that the flink-conf.yaml key, if set, overrides the yarn-site.xml setting, which in turn overrides the fallback value of 1. Is this right? Is the documentation wrong?
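
In rough pseudo-Java, the precedence I think I'm seeing is the following (my own sketch of how I read that code, not the actual Flink source; identifiers are illustrative):

    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    // Sketch: flink-conf.yaml wins; otherwise yarn-site.xml; otherwise 1.
    void setMaxAttempts(ApplicationSubmissionContext appContext,
                        YarnConfiguration yarnConf,
                        String flinkConfValue) { // `yarn.application-attempts`, may be null
        int attempts = (flinkConfValue != null)
                ? Integer.parseInt(flinkConfValue)
                : yarnConf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, 1);
        appContext.setMaxAppAttempts(attempts);
    }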

--
BR,
Stefano Baghino

Software Engineer @ Radicalbit
Re: YARN session application attempts

Ufuk Celebi
Hey Stefano,

yarn.resourcemanager.am.max-attempts is a setting for your YARN
cluster and cannot be influenced by Flink. Flink cannot set a higher
number than this for yarn.application-attempts.
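
For illustration, that cluster-side cap is the one set in yarn-site.xml (a hypothetical excerpt; the value is just an example):

    <!-- yarn-site.xml: cluster-wide cap on ApplicationMaster attempts -->
    <property>
      <name>yarn.resourcemanager.am.max-attempts</name>
      <value>4</value>
    </property>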

The key that is set/overridden by Flink is probably only valid for the
YARN session, but I'm not too familiar with the code. Maybe someone
else can chime in.

I would recommend using a newer Hadoop version (>= 2.6), where you can
configure the failure validity interval, which counts the attempts per
time interval, e.g. it is allowed to fail 2 times within X seconds.
By default, the failure validity interval is configured to the Akka
timeout (which is 10s by default). I actually think it would make
sense to increase this a little and leave the attempts at 1 or 2 (in
the interval).

Does this help?

– Ufuk


Re: YARN session application attempts

stefanobaghino
Hi Ufuk, sorry for taking an awful lot of time to reply, but I fell behind with the mailing list in the last couple of weeks due to lack of time.
First of all, thanks for taking the time to help me.

Yes, what I was saying is that, judging from the code (and as we later confirmed with a couple of tests), the "upper bound" cited by the documentation seems invalid: the number of attempts is effectively regulated by flink-conf.yaml and falls back to yarn-site.xml only if the key is missing.

We're currently using Hadoop 2.7.1, so we'll try your solution, thanks.

I was also wondering if there's a way to ask for indefinite retries, so that a long-running streaming job can endure as many JobManager failures as possible without ever needing human intervention to restart the YARN session.

--
BR,
Stefano Baghino

Software Engineer @ Radicalbit
Re: YARN session application attempts

Till Rohrmann

Hi Stefano,

Hadoop supports this feature since version 2.6.0. You can define a time interval for the maximum number of application attempts: the application only fails ultimately once that many attempt failures have been observed within the given time interval. Flink will activate this feature if you’re using Hadoop >= 2.6.0. The failures validity interval will be set to the akka.ask.timeout value (default: 10s) [1].

[1] https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.html#setAttemptFailuresValidityInterval(long)
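
To make this concrete, here is a minimal sketch of how that Hadoop API from [1] is used (my illustration, not Flink's actual code; the numbers are just examples):

    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

    public final class AttemptValidityExample {
        static void configure(ApplicationSubmissionContext appContext) {
            // Allow up to 2 ApplicationMaster attempts ...
            appContext.setMaxAppAttempts(2);
            // ... but only count attempt failures that occurred within the
            // last 10 seconds; older failures no longer count against the limit.
            appContext.setAttemptFailuresValidityInterval(10_000L);
        }
    }

In effect, this lets a long-running session survive an unbounded number of JobManager failures over its lifetime, as long as no more than the configured number of failures fall within the validity window.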

Cheers,
Till

