(DEPRECATED) Apache Flink User Mailing List archive.

Flink 1.2 Jobmanager OOME - CheckpointCoordinators

snntr

Flink 1.2 Jobmanager OOME - CheckpointCoordinators

Hi everyone,

I am currently running a small Flink job locally, which checkpoints
every 100ms.

After a few minutes the JM crashes with an OOME. In the heap dump I can
see that a TimerTask holds references to all completed
CheckpointCoordinators. I assume this task is supposed to clean these
checkpoints up eventually.

First, is this the expected behaviour? Second, is there a configuration
option to trigger this cleanup timer earlier?

Cheers,

Konstantin

--
Konstantin Knauf * [hidden email] * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082

Ufuk Celebi

Re: Flink 1.2 Jobmanager OOME - CheckpointCoordinators

@Konstantin: Could you share a relevant part of the heap dump just to
get a second look?

The timer tasks are responsible for aborting the checkpoint if a
checkpoint timeout occurs. You can decrease the timeout via the
CheckpointConfig
(env.getCheckpointConfig().setCheckpointTimeout(long)); the current
default is 10 minutes.

On a first skim of the checkpoint coordinator code I didn't see
anything that cancels these tasks when the checkpoint is fully ack'd.
@Stephan: I think we should do that. What do you think?
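
The idea of cancelling the timeout once a checkpoint is fully acknowledged can be sketched as follows. This is only an illustration using the JDK's ScheduledThreadPoolExecutor, not the actual CheckpointCoordinator code; the class and method names (TimeoutCancellationSketch, startCheckpoint, completeCheckpoint) are hypothetical.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: names are illustrative and not taken from Flink.
final class TimeoutCancellationSketch {
    private final ScheduledThreadPoolExecutor timer = new ScheduledThreadPoolExecutor(1);

    TimeoutCancellationSketch() {
        // Remove cancelled tasks from the queue right away instead of
        // keeping them (and whatever they reference) until they would expire.
        timer.setRemoveOnCancelPolicy(true);
    }

    /** Schedules the abort action for a pending checkpoint and returns its handle. */
    ScheduledFuture<?> startCheckpoint(Runnable abortOnTimeout, long timeoutMillis) {
        return timer.schedule(abortOnTimeout, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Called once all tasks have acknowledged the checkpoint. */
    void completeCheckpoint(ScheduledFuture<?> timeoutHandle) {
        timeoutHandle.cancel(false);
    }

    /** Number of timeout tasks still held by the timer. */
    int queuedTimeouts() {
        return timer.getQueue().size();
    }

    void shutdown() {
        timer.shutdownNow();
    }

    public static void main(String[] args) {
        TimeoutCancellationSketch sketch = new TimeoutCancellationSketch();
        for (int i = 0; i < 1000; i++) {
            // 600_000 ms corresponds to the 10 minute default timeout.
            ScheduledFuture<?> handle = sketch.startCheckpoint(() -> {}, 600_000L);
            sketch.completeCheckpoint(handle);
        }
        System.out.println("timeout tasks still queued: " + sketch.queuedTimeouts());
        sketch.shutdown();
    }
}
```

Without the cancel call, every timeout task would stay queued, together with anything it references, until its delay has elapsed.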

snntr

Re: Flink 1.2 Jobmanager OOME - CheckpointCoordinators

Hi Ufuk,

Thanks for looking into it. I have shared the heap dump with you (link
in a separate e-mail). Additionally, I have attached two screenshots of
the dump.

I was actually wrong in my original e-mail; I overlooked the "$1" in the
class name. It really seems that it's just the TimerTasks created in
CheckpointCoordinator:453. With a checkpoint interval of 100ms this
means 600 checkpoints per minute, so with the default timeout of 10
minutes 6000 checkpoints accumulate in the JobManager until the first
TimerTasks (which hold a reference to the checkpoint) expire. After
roughly 4500 checkpoints the OOME happens.
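
The arithmetic above can be checked with a few lines of Java. This is just a back-of-the-envelope sketch; the class and method names are made up for illustration, and it assumes the 100ms interval and the 10 minute default timeout mentioned in this thread.

```java
import java.time.Duration;

// Hypothetical helper, not part of Flink.
final class TimerTaskBacklog {
    /** Number of checkpoints triggered within the given window. */
    static long checkpointsWithin(Duration interval, Duration window) {
        return window.toMillis() / interval.toMillis();
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMillis(100);  // checkpoint interval of the job
        Duration timeout = Duration.ofMinutes(10);   // default checkpoint timeout

        // Checkpoints per minute: 60000 ms / 100 ms
        System.out.println(checkpointsWithin(interval, Duration.ofMinutes(1)));
        // TimerTasks alive before the first one expires: 600000 ms / 100 ms
        System.out.println(checkpointsWithin(interval, timeout));
    }
}
```

This prints 600 and 6000, so an OOME at roughly 4500 checkpoints occurs before any TimerTask has expired.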

From my understanding, this timer should be deleted as soon as the
checkpoint is completed.

Cheers,

Konstantin

Attachments: jm_dump1.jpg (163K), jm_dump2.jpg (141K)