Hi everyone,
I am currently running a small Flink job locally, which checkpoints every 100ms.

After a few minutes the JM crashes with an OOME. In the heap dump I can see that a TimerTask holds references to all completed CheckpointCoordinators. I assume this task is supposed to clean these checkpoints up eventually.

First, is this the expected behaviour? Second, is there a configuration option to trigger this cleanup timer earlier?

Cheers,

Konstantin

--
Konstantin Knauf * [hidden email] * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
@Konstantin: Could you share a relevant part of the heap dump just to get a second look?

The timer tasks are responsible for aborting the checkpoint if a checkpoint timeout occurs. You can decrease the timeout via the CheckpointConfig (env.getCheckpointConfig().setCheckpointTimeout(long)); the current default is 10 mins.

On a first skim of the checkpoint coordinator code I didn't see anything that cancels these tasks when the checkpoint is fully ack'd. @Stephan: I think we should do that. What do you think?

On Tue, Feb 28, 2017 at 4:06 PM, Konstantin Knauf <[hidden email]> wrote:
> [...]
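To illustrate the cancellation suggested above, here is a minimal sketch using only the JDK's java.util.Timer. It is not Flink's CheckpointCoordinator code: the class CheckpointTimeoutSketch, the PendingCheckpoint stand-in, and all method names are hypothetical. It assumes the timeout is implemented as one TimerTask per checkpoint. The point it shows is that TimerTask.cancel() alone leaves the task (and whatever it references) in the timer queue until its due time, while Timer.purge() removes cancelled tasks right away.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentLinkedQueue;

public class CheckpointTimeoutSketch {

    /** Hypothetical stand-in for a pending checkpoint; not Flink's class. */
    public static final class PendingCheckpoint {
        final long id;
        final byte[] state = new byte[1024]; // payload kept alive by any reference
        public PendingCheckpoint(long id) { this.id = id; }
    }

    private final Timer timer = new Timer("checkpoint-timeouts", true); // daemon thread
    private final ConcurrentLinkedQueue<Long> abortedIds = new ConcurrentLinkedQueue<>();

    /** Schedules a task that records the checkpoint as aborted once the timeout fires. */
    public TimerTask scheduleTimeout(PendingCheckpoint checkpoint, long timeoutMillis) {
        TimerTask canceller = new TimerTask() {
            @Override
            public void run() {
                // The captured 'checkpoint' stays reachable for as long as
                // this task sits in the timer queue.
                abortedIds.add(checkpoint.id);
            }
        };
        timer.schedule(canceller, timeoutMillis);
        return canceller;
    }

    /** Called when the checkpoint is fully acknowledged; returns the number of purged tasks. */
    public int onCompleted(TimerTask canceller) {
        canceller.cancel();   // only marks the task as cancelled
        return timer.purge(); // removes cancelled tasks from the queue (linear in queue size)
    }

    /** Schedules and immediately completes the given number of checkpoints. */
    public static int runDemo(int checkpoints, long timeoutMillis) {
        CheckpointTimeoutSketch sketch = new CheckpointTimeoutSketch();
        int purged = 0;
        for (long id = 0; id < checkpoints; id++) {
            TimerTask canceller = sketch.scheduleTimeout(new PendingCheckpoint(id), timeoutMillis);
            purged += sketch.onCompleted(canceller);
        }
        sketch.timer.cancel(); // stop the timer thread
        return purged;
    }

    public static void main(String[] args) {
        System.out.println("purged tasks: " + runDemo(1000, 10 * 60 * 1000L));
    }
}
```

Calling purge() on every completion is simple but costs time proportional to the queue length; a real implementation might purge in batches or use a scheduler that drops cancelled tasks itself.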
Hi Ufuk,
thanks for looking into it. I have shared the heap dump with you (link in a separate e-mail). Additionally, I have attached two screenshots of the dump.

I was actually wrong in my original e-mail: I overlooked the "$1" in the class name. It really seems that it's just the TimerTasks created in CheckpointCoordinator:453.

With a checkpoint interval of 100ms this means 600 checkpoints per minute, so with the default timeout of 10 minutes there are 6000 checkpoints in the JobManager until the first TimerTasks (which hold a reference to the checkpoint) expire. After roughly 4500 checkpoints the OOME happens.

From my understanding, this timer should be deleted as soon as the checkpoint is completed.

Cheers,

Konstantin

On 28.02.2017 18:16, Ufuk Celebi wrote:
> [...]

Attachments: jm_dump1.jpg (163K), jm_dump2.jpg (141K)
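The arithmetic above can be written out as a small sketch. It assumes that every checkpoint's TimerTask stays referenced until its timeout fires; the class and method names are made up for illustration. It also shows why lowering the checkpoint timeout, as suggested earlier in the thread, would shrink the number of tasks held at once.

```java
public class OutstandingTimers {

    /** Number of timeout tasks alive at once if none is removed before its timeout fires. */
    public static long outstanding(long intervalMillis, long timeoutMillis) {
        return timeoutMillis / intervalMillis;
    }

    public static void main(String[] args) {
        long intervalMillis = 100;                  // checkpoint interval from the thread
        long defaultTimeoutMillis = 10 * 60 * 1000; // 10 minute default timeout

        // 600 checkpoints per minute for 10 minutes
        System.out.println(outstanding(intervalMillis, defaultTimeoutMillis)); // prints 6000

        // With a hypothetical 10 second timeout far fewer tasks are retained
        System.out.println(outstanding(intervalMillis, 10_000)); // prints 100
    }
}
```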