Flink 1.2 Jobmanager OOME - CheckpointCoordinators

snntr
Hi everyone,

I am currently running a small Flink job locally that checkpoints
every 100 ms.

After a few minutes the JM crashes with an OOME. In the heap dump I can
see that a TimerTask holds references to all completed
CheckpointCoordinators. I assume this task is supposed to clean these
checkpoints up eventually.

First, is this the expected behaviour? Second, is there a configuration
option to trigger this cleanup timer earlier?

Cheers,

Konstantin

--
Konstantin Knauf * [hidden email] * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082


Re: Flink 1.2 Jobmanager OOME - CheckpointCoordinators

Ufuk Celebi
@Konstantin: Could you share the relevant part of the heap dump so that
we can take a second look?

The timer tasks are responsible for aborting a checkpoint if the
checkpoint timeout occurs. You can decrease the timeout via the
CheckpointConfig
(env.getCheckpointConfig().setCheckpointTimeout(long)); the current
default is 10 minutes.

On a first skim of the checkpoint coordinator code I didn't see
anything that cancels these tasks when the checkpoint is fully ack'd.
@Stephan: I think we should do that. What do you think?
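
A minimal sketch of what such a cancellation could look like, assuming the timeout is driven by a plain java.util.Timer. The class and method names (CheckpointTimeoutSketch, PendingCheckpoint, startCheckpoint, completeCheckpoint) are hypothetical and are not taken from the Flink code base. Note that TimerTask.cancel() alone leaves the task in the timer's queue until its scheduled time; Timer.purge() is what drops the queue's reference to it.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only; all names here are hypothetical, not Flink classes.
class CheckpointTimeoutSketch {

    /** Hypothetical stand-in for a pending checkpoint and whatever it references. */
    static final class PendingCheckpoint {
        final long id;

        PendingCheckpoint(long id) {
            this.id = id;
        }
    }

    // Daemon timer thread, so it does not keep the JVM alive.
    private final Timer timer = new Timer("checkpoint-timeout", true);
    private final long timeoutMillis;
    final AtomicInteger aborted = new AtomicInteger();

    CheckpointTimeoutSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Schedules the timeout task and returns it so it can be cancelled later. */
    TimerTask startCheckpoint(final PendingCheckpoint checkpoint) {
        TimerTask canceller = new TimerTask() {
            @Override
            public void run() {
                // Runs only if the checkpoint was not completed in time.
                // The captured 'checkpoint' is what keeps the object reachable.
                aborted.incrementAndGet();
                System.out.println("Checkpoint " + checkpoint.id + " timed out");
            }
        };
        timer.schedule(canceller, timeoutMillis);
        return canceller;
    }

    /** Called once the checkpoint is fully acknowledged. */
    boolean completeCheckpoint(TimerTask canceller) {
        // cancel() returns true if it prevented the scheduled execution.
        boolean prevented = canceller.cancel();
        // purge() removes cancelled tasks from the timer queue, releasing them for GC.
        timer.purge();
        return prevented;
    }

    public static void main(String[] args) {
        CheckpointTimeoutSketch coordinator = new CheckpointTimeoutSketch(600_000L);
        int cancelled = 0;
        for (long id = 0; id < 1_000; id++) {
            TimerTask canceller = coordinator.startCheckpoint(new PendingCheckpoint(id));
            if (coordinator.completeCheckpoint(canceller)) {
                cancelled++;
            }
        }
        System.out.println("cancelled=" + cancelled + ", aborted=" + coordinator.aborted.get());
    }
}
```

Calling purge() after every completed checkpoint keeps the sketch simple; with many tasks in flight, purging periodically might be the better trade-off.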

On Tue, Feb 28, 2017 at 4:06 PM, Konstantin Knauf
<[hidden email]> wrote:

> [...]

Re: Flink 1.2 Jobmanager OOME - CheckpointCoordinators

snntr
Hi Ufuk,

thanks for looking into it. I have shared the heap dump with you (link
in a separate e-mail). Additionally, I have attached two screenshots of
the dump.

I was actually wrong in my original e-mail; I overlooked the "$1" in the
class name. It really seems that it is just the TimerTasks created in
CheckpointCoordinator:453. With a checkpoint interval of 100 ms, this
means 600 checkpoints per minute, so 6000 checkpoints are held in the
JobManager until the first TimerTasks (each of which holds a reference
to its checkpoint) expire after the default 10-minute timeout. After
roughly 4500 checkpoints the OOME happens.
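
The arithmetic above can be checked with a small sketch. The class and method names are made up for illustration; the only inputs are the 100 ms interval and the 10-minute default timeout mentioned in this thread. The second call shows what a 1-minute timeout would give; whether that stays below the point where the heap fills up depends on the heap size, which I have not measured.

```java
import java.time.Duration;

// Illustrative arithmetic only; hypothetical names.
class PendingTimerEstimate {

    /** Number of timeout tasks alive at once if none is removed before it expires. */
    static long tasksAlive(Duration checkpointInterval, Duration checkpointTimeout) {
        return checkpointTimeout.toMillis() / checkpointInterval.toMillis();
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMillis(100);

        // Default timeout of 10 minutes: 600,000 ms / 100 ms = 6000 tasks.
        System.out.println(tasksAlive(interval, Duration.ofMinutes(10))); // prints 6000

        // Timeout lowered to 1 minute: 60,000 ms / 100 ms = 600 tasks.
        System.out.println(tasksAlive(interval, Duration.ofMinutes(1))); // prints 600
    }
}
```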

From my understanding, this timer task should be cancelled as soon as
the checkpoint is completed.

Cheers,

Konstantin


On 28.02.2017 18:16, Ufuk Celebi wrote:

> [...]

Attachments: jm_dump1.jpg (163K), jm_dump2.jpg (141K)