Flink and swapping question

13 messages

Flink and swapping question

Flavio Pompermaier
Hi to all,
I'd like to know whether memory swapping could cause a taskmanager crash. 
In my cluster of virtual machines I'm seeing this strange behavior: sometimes, when memory gets swapped, the taskmanager on that machine dies unexpectedly without logging any error.

Is that possible or not?

Best,
Flavio

Re: Flink and swapping question

Greg Hogan
Hi Flavio,

Flink handles interrupts so the only silent killer I am aware of is Linux's OOM killer. Are you seeing such a message in dmesg?
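
Something like the following should surface the relevant kernel log entries, if there are any (a plain grep over dmesg, nothing Flink-specific):

  dmesg | grep -iE "killed process|out of memory"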

Greg

Re: Flink and swapping question

Flavio Pompermaier
Hi Greg,
I carefully monitored all TM memory with jstat -gcutil and there's no full GC, only young collections.
The initial situation on the dying TM is:

  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT   
  0.00 100.00  33.57  88.74  98.42  97.17    159    2.508     1    0.255    2.763
  0.00 100.00  90.14  88.80  98.67  97.17    197    2.617     1    0.255    2.873
  0.00 100.00  27.00  88.82  98.75  97.17    234    2.730     1    0.255    2.986

After about 10 hours of processing it is:

  0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
  0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
  0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267

So I don't think that OOM could be the cause.
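
For reference, the numbers above were collected with a command along these lines, where <TM_PID> and the 5-second interval are just placeholders:

  jstat -gcutil <TM_PID> 5000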

However, the cluster is running on ESXi vSphere VMs and we have already experienced unexpected job crashes caused by ESXi moving a heavily loaded VM to another (less loaded) physical machine. I wouldn't be surprised if swapping is also handled somehow differently.
Looking at the Cloudera widgets I see that the crash is usually preceded by an intense cpu_iowait period.
I fear that Flink's unsafe memory access could be a problem in those scenarios. Am I wrong?

Any insight or debugging technique is greatly appreciated.
Best,
Flavio


Re: Flink and swapping question

Flavio Pompermaier
Hi Greg, you were right! After running dmesg I found "Out of memory: Kill process 13574 (java)".
This is really strange because the JVM of the TM is very calm.
Moreover, there are 7 GB of memory available (out of 32) but somehow the OS decides to start swapping and, when it runs out of available swap memory, the OS decides to kill the Flink TM :(
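
A simple way to watch whether the OS is actively swapping is to keep an eye on the si/so (swap-in/swap-out) columns of vmstat, e.g. sampling every 5 seconds:

  vmstat 5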

Any idea of what's going on here?

Re: Flink and swapping question

Flavio Pompermaier
I can confirm that after giving less memory to the Flink TM the job was able to run successfully.
After almost 2 weeks of pain, here is a summary of our experience with Flink in virtualized environments (such as VMware ESXi):
  1. Disable the virtualization "feature" that transfers a VM from a (heavily loaded) physical machine to another one (to balance resource consumption)
  2. Check dmesg when a TM dies without logging anything (usually it goes OOM and the OS kills it, and that is where you find the log of it)
  3. CentOS 7 on ESXi seems to start swapping VERY early (in my case I see the OS start swapping even when 12 out of 32 GB of memory are free)!
We're still investigating how this behavior could be fixed: the problem is that it's better not to disable swapping, because otherwise VMware could start ballooning (which is definitely worse...).
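
One option worth evaluating (not a definitive fix; the value below is just an example) is lowering vm.swappiness instead of disabling swap completely:

  sysctl -w vm.swappiness=10                      # apply immediately
  echo "vm.swappiness = 10" >> /etc/sysctl.conf   # keep it across reboots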

I hope these tips can save someone else's day.

Best,
Flavio

Re: Flink and swapping question

Flavio Pompermaier
Hi to all,
I think we found the root cause of all the problems. Looking at dmesg there was a "crazy" total-vm size associated with the OOM error, a LOT bigger than the TaskManager's available memory.
In our case the TM had a max heap of 14 GB, while the dmesg error was reporting a required amount of memory in the order of 60 GB!

[ 5331.992539] Out of memory: Kill process 24221 (java) score 937 or sacrifice child
[ 5331.992619] Killed process 24221 (java) total-vm:64800680kB, anon-rss:31387544kB, file-rss:6064kB, shmem-rss:0kB

That definitely shouldn't be possible with an ordinary JVM (and our TM was running without off-heap settings), so we looked at the parameters used to run the TM JVM, and indeed a really huge amount of memory is given to MaxDirectMemorySize. To my big surprise, Flink runs a TM with this parameter set to 8388607T.. does that make any sense??
Is the importance of this parameter documented anywhere (and why it is used in non-off-heap mode as well)? Is it related to the network buffers?
It should also be documented that this parameter should be added to the TM heap when reserving memory for Flink (IMHO).
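
For anyone who wants to double-check what their TM was actually started with, the effective value can be read from the running JVM, e.g. (<TM_PID> being the TaskManager process id):

  jinfo -flag MaxDirectMemorySize <TM_PID>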

I hope these painful sessions of Flink troubleshooting will be of added value sooner or later.

Best,
Flavio

Re: Flink and swapping question

Aljoscha Krettek
Hi Flavio,

Is this running on YARN or bare metal? Did you manage to find out where this insanely large parameter is coming from?

Best,
Aljoscha

Re: Flink and swapping question

Nico Kruber
FYI: taskmanager.sh sets this parameter but also states the following:

  # Long.MAX_VALUE in TB: This is an upper bound, much less direct memory will be used
  TM_MAX_OFFHEAP_SIZE="8388607T"


Nico

Re: Flink and swapping question

Flavio Pompermaier
Hi to all,
I'm still trying to understand what's going on in our production Flink cluster.
The facts are:

1. The Flink cluster runs on 5 VMware VMs managed by ESXi
2. On a specific job we have, without limiting the direct memory to 5g (see the sketch below), the TM gets killed by the OS almost immediately, because the memory required by the TM at some point becomes huge, like > 100 GB (other jobs seem to be less affected by the problem)
3. Although the memory consumption is much better this way, the Flink TM memory continuously grows job after job (of this problematic type): we set the TM max heap to 14 GB and the JVM's required memory can be ~30 GB. How is that possible?
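
A minimal sketch of how the 5g cap from point 2 can be set in flink-conf.yaml, assuming env.java.opts is how the limit is injected in your Flink version (note that taskmanager.sh also sets its own MaxDirectMemorySize, so it's worth verifying the effective value with jinfo after restarting):

  env.java.opts: -XX:MaxDirectMemorySize=5g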

My fear is that there's some annoying memory leak / bad memory allocation in the Flink network layer, but I don't have any evidence of this (except the fact that the VM which doesn't have an HDFS datanode underneath the Flink TM is the one with the biggest TM virtual memory consumption).

Thanks for the help,
Flavio

Re: Flink and swapping question

Fabian Hueske-2
Hi Flavio,

can you post all the memory configuration parameters of your workers?
Did you investigate whether the direct or the heap memory grew?
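
For concreteness, the kind of flink-conf.yaml entries meant here (key names as in the Flink 1.2/1.3 docs; the values below are only placeholders):

  taskmanager.heap.mb: 14336
  taskmanager.memory.off-heap: false
  taskmanager.memory.fraction: 0.7
  taskmanager.network.numberOfBuffers: 2048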

Thanks, Fabian

Re: Flink and swapping question

Stephan Ewen
Hi!

I would actually be surprised if this is an issue in core Flink.

  - The MaxDirectMemory parameter is pretty meaningless, it really is a max and does not have an impact on how much is actually allocated.

  - In most cases reported so far, the leak was in a library that was used in the user code

  - If you do not use off-heap memory in Flink, then there are a few other culprits that can cause high virtual memory consumption:
      - Netty, if you bumped the Netty version in a custom build
      - Flink's Netty, if the job has a crazy high number of concurrent network shuffles (we are talking 1000s here)
      - Some old Java versions have I/O memory leaks (I think some older Java 6 and Java 7 versions were affected)


To diagnose that better:

  - Are these batch or streaming jobs? 
  - If it is streaming, which state backend are you using?
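
If you want to see where the non-heap memory actually goes, one option is the JVM's native memory tracking; it has to be enabled at startup (e.g. by adding -XX:NativeMemoryTracking=summary to env.java.opts), and then:

  jcmd <TM_PID> VM.native_memory summary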

Stephan



Re: Flink and swapping question

Flavio Pompermaier
Hi Stephan,
I also think that the error is more related to Netty.
The only suspicious libraries I use are Parquet and Thrift.
I'm not using off-heap memory.
What do you mean by "crazy high number of concurrent network shuffles"? How can I count that?
We're using Java 8.

Thanks a lot,
Flavio



Re: Flink and swapping question

Flavio Pompermaier
I forgot to mention that my jobs are all batch (at the moment).

Do you think that this problem could be related to 
Kurt also told me to add "env.java.opts: -Dio.netty.recycler.maxCapacity.default=1".
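
Combined with the direct-memory cap discussed earlier, that would make the env.java.opts line look roughly like this (both values are things we are experimenting with, not recommendations):

  env.java.opts: -XX:MaxDirectMemorySize=5g -Dio.netty.recycler.maxCapacity.default=1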

Best,
Flavio

On Tue, Jun 6, 2017 at 7:42 PM, Flavio Pompermaier <[hidden email]> wrote:
Hi Stephan,
I also think that the error is more related to netty.
The only suspicious library I use are parquet or thrift.
I'm not using off-heap memory.
What do you mean for "crazy high number of concurrent network shuffles"?how can I count that?
We're using java 8.

Thanks a lot,
Flavio



On 6 Jun 2017 7:13 pm, "Stephan Ewen" <[hidden email]> wrote:
Hi!

I would actually be surprised if this is an issue in core Flink.

  - The MaxDirectMemory parameter is pretty meaningless, it really is a max and does not have an impact on how much is actually allocated.

  - In most cases we had reported so far, the leak was in a library that was used in the user code

  - If you do not use offheap memory in Flink, then there are few other culprits that can cause high virtual memory consumption:
      - Netty, if you bumped the Netty version in a custom build
      - Flink's Netty, if the job has a crazy high number of concurrent network shuffles (we are talking 1000s here)
      - Some old Java versions have I/O memory leaks (I think some older Java 6 and Java 7 versions were affected)


To diagnose that better:

  - Are these batch or streaming jobs? 
  - If it is streaming, which state backend are you using?

Stephan


On Tue, Jun 6, 2017 at 12:00 PM, Fabian Hueske <[hidden email]> wrote:
Hi Flavio,

can you post all the memory configuration parameters of your workers?
Did you investigate whether the direct or the heap memory grew?
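A quick way to check (just a sketch, assuming the TM is started with -XX:NativeMemoryTracking=summary and you know its pid) is to compare the JVM's own accounting with what the OS sees:

  jcmd <tm-pid> VM.native_memory summary   # heap vs. non-heap allocations tracked by the JVM
  grep VmRSS /proc/<tm-pid>/status         # resident memory as seen by the OS
  pmap -x <tm-pid> | tail -n 1             # total mapped virtual memory

If the resident size keeps growing while the heap numbers stay flat, the growth is happening outside the heap (direct buffers or native allocations).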

Thanks, Fabian

2017-05-29 20:53 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Hi to all,
I'm still trying to understand what's going on in our production Flink cluster.
The facts are:

1. The Flink cluster runs on 5 VMWare VMs managed by ESXi
2. On a specific job we have, without limiting the direct memory to 5 GB, the TM gets killed by the OS almost immediately because the memory required by the TM, at some point, becomes huge, like > 100 GB (other jobs seem to be less affected by the problem)
3. Although the memory consumption is much better this way, the Flink TM memory continuously grows job after job (of this problematic type): we set the TM max heap to 14 GB and the JVM required memory can be ~30 GB. How is that possible?

My fear is that there's some annoying memory leak / bad memory allocation at the Flink network level, but I don't have any evidence of this (except the fact that the VM which doesn't have an HDFS datanode underneath the Flink TM is the one with the biggest TM virtual memory consumption).
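(We compared the TMs roughly like this, just a sketch with made-up hostnames, printing the virtual and resident size of the java processes on each node:

  for h in tm1 tm2 tm3 tm4 tm5; do echo "== $h"; ssh $h 'ps -o vsz=,rss=,comm= -C java'; done

and the node without the datanode consistently shows the largest vsz.)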

Thanks for the help,
Flavio

On 29 May 2017 15:37, "Nico Kruber" <[hidden email]> wrote:
FYI: taskmanager.sh sets this parameter but also states the following:

  # Long.MAX_VALUE in TB: This is an upper bound, much less direct memory will be used
  TM_MAX_OFFHEAP_SIZE="8388607T"
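On a running TaskManager you can see what this ends up as on the actual JVM command line, e.g. (a sketch, assuming you know the TM's pid):

  ps -o command= -p <tm-pid> | tr ' ' '\n' | grep MaxDirectMemorySize
  # should print something like: -XX:MaxDirectMemorySize=8388607T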


Nico

On Monday, 29 May 2017 15:19:47 CEST Aljoscha Krettek wrote:
> Hi Flavio,
>
> Is this running on YARN or bare metal? Did you manage to find out where this
> insanely large parameter is coming from?
>
> Best,
> Aljoscha
>
> > On 25. May 2017, at 19:36, Flavio Pompermaier <[hidden email]>
> > wrote:
> >
> > Hi to all,
> > I think we found the root cause of all the problems. Looking at dmesg,
> > there was a "crazy" total-vm size associated with the OOM error, a LOT
> > bigger than the TaskManager's available memory. In our case, the TM had a
> > max heap of 14 GB while the dmesg error was reporting a required amount of
> > memory in the order of 60 GB!
> >
> > [ 5331.992539] Out of memory: Kill process 24221 (java) score 937 or sacrifice child
> > [ 5331.992619] Killed process 24221 (java) total-vm:64800680kB, anon-rss:31387544kB, file-rss:6064kB, shmem-rss:0kB
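> > (For anyone else searching their logs, these lines are easy to spot with
> > something like this, just a sketch:
> >
> >   dmesg -T | egrep -i 'out of memory|killed process'
> >
> > and the total-vm / anon-rss numbers reported there are per-process, so they
> > can be compared directly with what the TM was configured to use.)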
> >
> > That definitely wasn't possible with an ordinary JVM (and our TM was
> > running without off-heap settings), so we looked at the parameters used
> > to run the TM JVM and indeed there was a really huge amount of memory
> > given to MaxDirectMemorySize. To my big surprise, Flink runs a TM with
> > this parameter set to 8,388,607T. Does that make any sense?? Is the
> > importance of this parameter documented anywhere (and why it is used
> > in non off-heap mode as well)? Is it related to network buffers? It
> > should also be documented that this parameter should be added to the TM
> > heap when reserving memory for Flink (IMHO).
> >
> > I hope that these painful sessions of Flink troubleshooting will be of
> > added value to someone sooner or later.
> >
> > Best,
> > Flavio
> >
> > On Thu, May 25, 2017 at 10:21 AM, Flavio Pompermaier <[hidden email]
> > <mailto:[hidden email]>> wrote: I can confirm that after giving
> > less memory to the Flink TM the job was able to run successfully. After
> > almost 2 weeks of pain, we summarize here our experience with Flink in
> > virtualized environments (such as VMWare ESXi):
> >   - Disable the virtualization "feature" that transfers a VM from a
> >     (heavily loaded) physical machine to another one (to balance the
> >     resource consumption).
> >   - Check dmesg when a TM dies without logging anything (usually it goes
> >     OOM and the OS kills it, but there you can find the log of this thing).
> >   - CentOS 7 on ESXi seems to start swapping VERY early (in my case I see
> >     the OS start swapping even if there are 12 out of 32 GB of free
> >     memory)! We're still investigating how this behavior could be fixed:
> >     the problem is that it's better not to disable swapping, because
> >     otherwise VMWare could start ballooning (that is definitely worse...);
> >     see the sysctl sketch below.
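> > A middle ground we are testing (just a sketch, assuming root access on the
> > TM hosts) is to lower the kernel's tendency to swap instead of disabling
> > swap completely:
> >
> >   sysctl -w vm.swappiness=1                                    # prefer reclaiming page cache over swapping
> >   echo "vm.swappiness = 1" > /etc/sysctl.d/99-swappiness.conf  # persist across reboots
> >
> > Whether that is enough to stop the early swapping on CentOS 7 is still an
> > open question for us.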
> >
> > I hope these tips can save someone else's day.
> >
> > Best,
> > Flavio
> >
> > On Wed, May 24, 2017 at 4:28 PM, Flavio Pompermaier <[hidden email]
> > <mailto:[hidden email]>> wrote: Hi Greg, you were right! After
> > typing dmesg I found "Out of memory: Kill process 13574 (java)". This is
> > really strange because the JVM of the TM is very calm.
> > Moreover, there are 7 GB of memory available (out of 32) but somehow the
> > OS decides to start swapping and, when it runs out of available swap
> > memory, the OS decides to kill the Flink TM :(
> >
> > Any idea of what's going on here?