Hello everybody,
I was asking myself: are there any best practices regarding how to set the `yarn.application-attempts` configuration key when running Flink on YARN as a long-running session? The configuration page in the docs states that 1 is the default and that it is recommended to leave it at that; however, in the case of a long-running session it seems to me that the value should be higher in order to actually allow the session to keep running despite Job Manager failures.

Furthermore, the HA page in the docs states the following:

"""
It’s important to note that yarn.resourcemanager.am.max-attempts is an upper bound for the application restarts. Therefore, the number of application attempts set within Flink cannot exceed the YARN cluster setting with which YARN was started.
"""

However, after some tests conducted by my colleagues and after looking at the code (FlinkYarnClientBase:522-536), it seems to me that the flink-conf.yaml key, if set, overrides the yarn-site.xml one, which in turn overrides the fallback value of 1. Is this right? Is the documentation wrong?

BR,
Stefano Baghino
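To illustrate the precedence I am describing, here is a minimal sketch; it is not the actual FlinkYarnClientBase code, just a hypothetical helper that mirrors how I read those lines:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AttemptsPrecedenceSketch {

    // Hypothetical helper mirroring the precedence as read from the client code:
    // flink-conf.yaml wins, then yarn-site.xml, then a hard-coded fallback of 1.
    static int resolveApplicationAttempts(Configuration flinkConf, YarnConfiguration yarnConf) {
        // 1) yarn.application-attempts from flink-conf.yaml, if explicitly set
        if (flinkConf.containsKey("yarn.application-attempts")) {
            return flinkConf.getInteger("yarn.application-attempts", 1);
        }
        // 2) otherwise yarn.resourcemanager.am.max-attempts from yarn-site.xml, if present
        String yarnSetting = yarnConf.get(YarnConfiguration.RM_AM_MAX_ATTEMPTS);
        if (yarnSetting != null) {
            return Integer.parseInt(yarnSetting);
        }
        // 3) otherwise fall back to a single attempt
        return 1;
    }
}
```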
Hey Stefano,
yarn.resourcemanager.am.max-attempts is a setting for your YARN cluster and cannot be influenced by Flink; Flink cannot set a higher number than this for yarn.application-attempts. The key that is set/overridden by Flink is probably only valid for the YARN session, but I'm not too familiar with that code. Maybe someone else can chime in.

I would recommend using a newer Hadoop version (>= 2.6), where you can configure the failure validity interval, which counts the attempts per time interval, e.g. the application is allowed to fail 2 times within X seconds. By default, the failure validity interval is configured to the Akka timeout (which is 10s by default). I actually think it would make sense to increase this a little and leave the attempts at 1 or 2 (within the interval).

Does this help?

– Ufuk
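As a rough illustration of the Hadoop >= 2.6 feature mentioned above, this sketch shows the underlying YARN client API rather than Flink's actual code; the attempt count and interval value are only example assumptions:

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

public class ValidityIntervalSketch {

    // Hypothetical helper: configure AM attempts plus a failure validity interval
    // on the YARN application submission context (available since Hadoop 2.6.0).
    static void configureAttempts(ApplicationSubmissionContext appContext) {
        // Allow up to 2 ApplicationMaster failures ...
        appContext.setMaxAppAttempts(2);
        // ... but only count failures that happen within a 60-second window;
        // failures older than that no longer count against the limit.
        appContext.setAttemptFailuresValidityInterval(60_000L);
    }
}
```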
Hi Ufuk,

Sorry for taking an awful lot of time to reply, but I fell behind with the ML in the last couple of weeks due to lack of time. First of all, thanks for taking the time to help me.

Yes, what I was saying was that apparently from the code (and effectively, as we later found out after a couple of tests) the "upper bound" cited by the documentation seems invalid (meaning, the number of attempts is effectively regulated by the flink-conf.yaml and falls back to the yarn-site.xml only if missing).

We're currently using Hadoop 2.7.1, so we'll try your solution, thanks. I was also wondering if there's a way to ask it to retry indefinitely, so that a long-running streaming job can endure as many Job Manager failures as possible without ever needing human intervention to restart the YARN session.

BR,
Stefano Baghino
Hi Stefano,

Hadoop supports this feature since version 2.6.0. You can define a time interval for the maximum number of application attempts. This means that the application only fails ultimately once that number of application failures has been observed within the time interval. Flink will activate this feature if you’re using Hadoop >= 2.6.0. The failures validity interval will be set to the Akka timeout.

Cheers,