Akka Quarantine & Old YARN Versions

snntr
Hi everyone,

we are running Flink 1.2.1 on YARN 2.4 (I know, way too old :().
Since the last Flink upgrade from 1.1.3 to 1.2.1, we have been
experiencing regular TaskManager failures due to

[Taskmanager Logs]
2017-07-10 15:25:26,448 ERROR Remoting - Association to [akka.tcp://flink@<jobmanager>:45303] with UID [-382428140] irrecoverably failed. Quarantining address.
java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [1 {0, 1}] ack: ACK[3, {}]
        at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:289)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at ...

As far as I understand https://issues.apache.org/jira/browse/FLINK-3345,
the TaskManager should be restarted in this case. In our case, however,
YARN does not start a new TaskManager container; the container just stays
missing indefinitely. Is it known that this does not work on YARN 2.4?

If it helps, I can also provide the full job and taskmanager logs...

Cheers & Thanks,

Konstantin

--
Konstantin Knauf * [hidden email] * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082


Re: Akka Quarantine & Old YARN Versions

Nico Kruber
Hi Konstantin,
I dug through the pull requests linked from
https://issues.apache.org/jira/browse/FLINK-3347 a bit, only to notice that
the fix-version tag was wrong (it should have been 1.2.1, not 1.2.0), but you
are already on that version.

In there, it was also mentioned that the quarantine monitor is disabled by
default and can be enabled by setting `taskmanager.exit-on-fatal-akka-error`
to true. If enabled, it should detect a quarantined task manager and shut it
down. In that case, YARN should notice it and start a new one, if I'm not
mistaken.

Are you already working with `taskmanager.exit-on-fatal-akka-error` enabled?
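
For reference, a minimal sketch of how this option might be set in
flink-conf.yaml. The key name and the fact that it is off by default are
taken from the message above; that the TaskManagers have to be restarted
to pick up the change is an assumption:

```yaml
# flink-conf.yaml (sketch; assumes the option is read at TaskManager startup)
# Shut the TaskManager process down when Akka reports a fatal error such as
# a quarantine, so that YARN can notice the lost container.
taskmanager.exit-on-fatal-akka-error: true
```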


Nico

On Thursday, 3 August 2017 10:53:00 CEST Konstantin Knauf wrote:

> Hi everyone,
>
> [...]


Re: Akka Quarantine & Old YARN Versions

snntr
Hi Nico,

thanks for the quick response! No, this was not enabled :( Since we are
in the process of upgrading to 1.3.1: I did not find this option in 1.3,
only in 1.2. Is this the default behaviour in 1.3, or is this configuration
option just not documented?

Cheers,

Konstantin

On 03.08.2017 17:11, Nico Kruber wrote:

> Hi Konstantin,
> [...]

Re: Akka Quarantine & Old YARN Versions

Nico Kruber
Hi Konstantin,
I just checked the code and the configuration option is still there and should
be working. Somehow, the backport for the 1.2 release branch did contain the
documentation while the actual commit on master did not.
Thanks for the info, let me create a hotfix to fix that.


Nico

On Thursday, 3 August 2017 18:05:29 CEST Konstantin Knauf wrote:

> Hi Nico,
>
> [...]


Re: Akka Quarantine & Old YARN Versions

Aljoscha Krettek
Hi Konstantin,

If you can wait at all, I would suggest skipping the update to 1.3.1 and going directly to the (not yet released) 1.3.2. Flink 1.3.0 and 1.3.1 had a few critical bugs that are now fixed. Most notably, there was a problem in the Kafka consumer that could lead to state corruption or data duplication, and incremental RocksDB checkpoints were not working correctly in some cases.

The vote for 1.3.2 is currently ongoing and the release should happen tomorrow or by Monday at the latest.

Best,
Aljoscha

> On 4. Aug 2017, at 11:09, Nico Kruber <[hidden email]> wrote:
>
> Hi Konstantin,
> [...]