Hi everyone,
we are running Flink 1.2.1 on YARN 2.4 (I know, way too old :(). Coinciding with the last Flink upgrade from 1.1.3 -> 1.2.1, we are experiencing regular TaskManager failures due to

[Taskmanager Logs]
2017-07-10 15:25:26,448 ERROR Remoting - Association to [akka.tcp://flink@<jobmanager>:45303] with UID [-382428140] irrecoverably failed. Quarantining address.
java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [1 {0, 1}] ack: ACK[3, {}]
    at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:289)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at ...

As far as I understand https://issues.apache.org/jira/browse/FLINK-3345, the taskmanager should be restarted in this case. In our case YARN does not start a new taskmanager container; the container is just missing indefinitely. Is it known that this does not work on YARN 2.4?

If it helps, I can also provide the full job and taskmanager logs...

Cheers & Thanks,

Konstantin

--
Konstantin Knauf * [hidden email] * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Managing Directors: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Registered office: Unterföhring * Amtsgericht München * HRB 135082
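To get a feeling for how often a TaskManager runs into this, the log can be searched for the quarantine message. The following is only a minimal sketch: the function name `find_quarantine_events` is hypothetical, and the pattern is derived solely from the single log line quoted above, so other Akka or Flink versions may word the message differently.

```python
import re

# Pattern derived from the one log line quoted in this thread (assumption:
# the wording is the same for every occurrence).
QUARANTINE_RE = re.compile(
    r"Association to \[(?P<address>[^\]]+)\] with UID \[(?P<uid>-?\d+)\] "
    r"irrecoverably failed\. Quarantining address\."
)


def find_quarantine_events(log_text):
    """Return (address, uid) pairs for every quarantine message in log_text."""
    # Log lines may be wrapped, so collapse all whitespace before matching.
    normalized = " ".join(log_text.split())
    return [
        (m.group("address"), int(m.group("uid")))
        for m in QUARANTINE_RE.finditer(normalized)
    ]


# The log line from the original message, used as sample input.
sample = (
    "2017-07-10 15:25:26,448 ERROR Remoting - Association to "
    "[akka.tcp://flink@<jobmanager>:45303] with UID [-382428140] "
    "irrecoverably failed. Quarantining address."
)
events = find_quarantine_events(sample)
print(events)
```

Feeding the contents of a TaskManager log file into `find_quarantine_events` gives one entry per quarantine event, which makes it easy to line the failures up with the missing containers.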
Hi Konstantin,
I dug through the pull requests linked from https://issues.apache.org/jira/browse/FLINK-3347 a bit, only to notice that the fix-version tag was wrong (it should have been 1.2.1, not 1.2.0), but you are already on that version.

In there, it was also mentioned that the quarantine monitor is disabled by default and can be enabled by setting `taskmanager.exit-on-fatal-akka-error` to true. If enabled, it should detect a quarantined task manager and shut it down. In that case, YARN should notice it and start a new one, if I'm not mistaken.

Are you already working with `taskmanager.exit-on-fatal-akka-error` enabled?

Nico

On Thursday, 3 August 2017 10:53:00 CEST Konstantin Knauf wrote:
> In our case YARN does not start a new taskmanager container, but the
> container is just missing indefinitely. Is it known, that this does not
> work on YARN 2.4?
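For reference, enabling the option in `flink-conf.yaml` would presumably look like the fragment below. This is a sketch only: the option name is taken from the message above, while the assumption is that it goes into the regular Flink configuration file like any other `taskmanager.*` setting and is picked up when the TaskManagers are next started.

```yaml
# Quarantine monitor is off by default; shut the TaskManager down
# when Akka reports a fatal (quarantine) error.
taskmanager.exit-on-fatal-akka-error: true
```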
Hi Nico,
thanks for the quick response! No, this was not enabled :( Since we are in the process of upgrading to 1.3.1: I did not find this option in the 1.3 documentation, only in 1.2. Is this the default behaviour in 1.3, or is this configuration just not documented?

Cheers,

Konstantin

On 03.08.2017 17:11, Nico Kruber wrote:
> Are you already working with `taskmanager.exit-on-fatal-akka-error` enabled?
Hi Konstantin,
I just checked the code, and the configuration option is still there and should be working. Somehow, the backport for the 1.2 release branch contained the documentation while the actual commit on master did not. Thanks for the info, let me create a hotfix for that.

Nico

On Thursday, 3 August 2017 18:05:29 CEST Konstantin Knauf wrote:
> I did not find this option in 1.3, only 1.2. Is this the default
> behaviour in 1.3 or is this configuration just not documented?
Hi Konstantin,
If you can wait at all, I would suggest skipping the update to 1.3.1 and going directly to the (not yet released) 1.3.2. Flink 1.3.0 and 1.3.1 had a few critical bugs that are now fixed. Most notably, there was a problem in the Kafka consumer that could lead to state corruption/data duplication, and incremental RocksDB checkpoints were not working correctly in some cases.

The vote for 1.3.2 is currently ongoing, and the release should happen tomorrow or by Monday at the latest.

Best,
Aljoscha

> On 4. Aug 2017, at 11:09, Nico Kruber <[hidden email]> wrote:
>
> I just checked the code and the configuration option is still there and
> should be working.