Job Manager becomes irresponsive if the size of the session cluster grows

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Job Manager becomes irresponsive if the size of the session cluster grows

prakhar_mathur
Hello,

We have deployed multiple Flink clusters on Kubernetess with 1 replica of Jobmanager and multiple of Taskmanager as per the requirement. Recently we are observing that on increasing the number of Taskmanagers for a cluster, the Jobmanager becomes irresponsive. It stops sending statsd metric for some irregular interval. Even the Jobmanager pod keeps restarting because it stops responding to the liveliness probe which results in Kubernetes killing the pod. We tried increasing the resources given(CPU, RAM) but it didn't help.

Regards
Prakhar Mathur
Product Engineer
GO-JEK
Reply | Threaded
Open this post in threaded view
|

Re: Job Manager becomes irresponsive if the size of the session cluster grows

Biao Liu
Hi Prakhar,

Have you ever checked the garbage collection of master?
Which version of Flink are you using? How many TaskManagers in your cluster? 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午1:54写道:
Hello,

We have deployed multiple Flink clusters on Kubernetess with 1 replica of Jobmanager and multiple of Taskmanager as per the requirement. Recently we are observing that on increasing the number of Taskmanagers for a cluster, the Jobmanager becomes irresponsive. It stops sending statsd metric for some irregular interval. Even the Jobmanager pod keeps restarting because it stops responding to the liveliness probe which results in Kubernetes killing the pod. We tried increasing the resources given(CPU, RAM) but it didn't help.

Regards
Prakhar Mathur
Product Engineer
GO-JEK
Reply | Threaded
Open this post in threaded view
|

Re: Job Manager becomes irresponsive if the size of the session cluster grows

Biao Liu
Hi,

It seems to be good based on your GC metrics. You could double check the GC log if you enable it. The GC log is more direct.
I'm not sure what's happening in your JobManager. But I'm pretty sure that Flink could support far more larger scale cluster than yours.

Have you ever checked the log file of JobManager? Is there any suspicious warning or error log?
Have you ever tried some analytic tools to check the internal state of JobManager, like jstack.

It's hard to do a deeper analysis based on current informations. It might be helpful if you could provide more details. 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午2:12写道:
Hi,

We are using v1.6.2, currently, the number of TaskManagers are 70. We have the GC metrics on a dashboard also. Sum of Status.JVM.GarbageCollector.MarkSweepCompact.Time grouped by 1 min is somewhere between 75 to 125 and Status.JVM.GarbageCollector.MarkSweepCompact.Count is fixed at 10.

On Thu, Jul 18, 2019 at 11:32 AM Biao Liu <[hidden email]> wrote:
Hi Prakhar,

Have you ever checked the garbage collection of master?
Which version of Flink are you using? How many TaskManagers in your cluster? 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午1:54写道:
Hello,

We have deployed multiple Flink clusters on Kubernetess with 1 replica of Jobmanager and multiple of Taskmanager as per the requirement. Recently we are observing that on increasing the number of Taskmanagers for a cluster, the Jobmanager becomes irresponsive. It stops sending statsd metric for some irregular interval. Even the Jobmanager pod keeps restarting because it stops responding to the liveliness probe which results in Kubernetes killing the pod. We tried increasing the resources given(CPU, RAM) but it didn't help.

Regards
Prakhar Mathur
Product Engineer
GO-JEK
Reply | Threaded
Open this post in threaded view
|

Re: Job Manager becomes irresponsive if the size of the session cluster grows

prakhar_mathur

On Mon, Jul 22, 2019, 16:08 Prakhar Mathur <[hidden email]> wrote:
Hi,

We enabled GC logging, here are the logs

[GC (Allocation Failure) [PSYoungGen: 6482015K->70303K(6776832K)] 6955827K->544194K(20823552K), 0.0591479 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6587039K->38642K(6763008K)] 7060930K->512614K(20809728K), 0.0740978 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6502858K->192077K(6734848K)] 6976829K->666144K(20781568K), 0.0841759 secs] [Times: user=0.17 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6647378K->50108K(6759424K)] 7121446K->524248K(20806144K), 0.0622997 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6501890K->60606K(6779904K)] 6976029K->534961K(20826624K), 0.0637955 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6586046K->40411K(6768640K)] 7060401K->514839K(20815360K), 0.0729137 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6543919K->51886K(6797824K)] 7018346K->526385K(20844544K), 0.0649143 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6601690K->191832K(6754304K)] 7076190K->666427K(20801024K), 0.1029686 secs] [Times: user=0.18 sys=0.00, real=0.10 secs]
[GC (Allocation Failure) [PSYoungGen: 6742947K->62693K(6781952K)] 7217543K->537361K(20828672K), 0.0639272 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6589986K->66299K(6770688K)] 7064653K->541039K(20817408K), 0.0701853 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6590742K->42995K(6800896K)] 7065481K->517798K(20847616K), 0.0595729 secs] [Times: user=0.08 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6608678K->66127K(6793728K)] 7083482K->541011K(20840448K), 0.0608270 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6608886K->62063K(6822400K)] 7083769K->537027K(20869120K), 0.0675917 secs] [Times: user=0.10 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6617146K->200674K(6812160K)] 7092110K->812325K(20858880K), 1.1685237 secs] [Times: user=3.53 sys=0.71, real=1.17 secs]
[GC (Allocation Failure) [PSYoungGen: 6773610K->194848K(6633984K)] 7385261K->806700K(20680704K), 0.0858601 secs] [Times: user=0.19 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6617888K->44002K(6723072K)] 7229740K->655854K(20769792K), 0.0647387 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6420043K->194672K(6702080K)] 7031895K->806604K(20748800K), 0.0833792 secs] [Times: user=0.18 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6603376K->187059K(6596096K)] 7215308K->799063K(20642816K), 0.0906529 secs] [Times: user=0.17 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6572690K->51850K(6715904K)] 7184694K->663990K(20762624K), 0.0837285 secs] [Times: user=0.11 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6452380K->44766K(6708224K)] 7064519K->656993K(20754944K), 0.0809864 secs] [Times: user=0.10 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6445790K->42654K(6730752K)] 7058017K->654961K(20777472K), 0.0686401 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6473374K->41410K(6723072K)] 7085681K->654085K(20769792K), 0.0714228 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6448507K->59031K(6752256K)] 7061182K->671786K(20798976K), 0.0599829 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6489819K->68484K(6739968K)] 7102574K->681619K(20786688K), 0.0674541 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6537251K->257001K(6625792K)] 7150386K->938268K(20672512K), 1.1775771 secs] [Times: user=4.11 sys=0.22, real=1.18 secs]
[GC (Allocation Failure) [PSYoungGen: 6575609K->76090K(6695936K)] 7256876K->757765K(20742656K), 0.0805662 secs] [Times: user=0.11 sys=0.01, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6444858K->42242K(6719488K)] 7126533K->723917K(20766208K), 0.0611949 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6399332K->143985K(6710784K)] 7081006K->826084K(20757504K), 0.0771257 secs] [Times: user=0.16 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6503905K->67207K(6736896K)] 7186004K->750758K(20783616K), 0.0808856 secs] [Times: user=0.11 sys=0.01, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6468318K->59200K(6724608K)] 7151869K->743099K(20771328K), 0.0709461 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6484294K->51134K(6757376K)] 7168193K->735120K(20804096K), 0.0772947 secs] [Times: user=0.11 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6503020K->77337K(6747648K)] 7187007K->761395K(20794368K), 0.0835827 secs] [Times: user=0.12 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6505125K->66972K(6783488K)] 7189184K->751110K(20830208K), 0.0671168 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6544467K->201750K(6734336K)] 7228606K->885969K(20781056K), 0.0931485 secs] [Times: user=0.20 sys=0.00, real=0.10 secs]
[GC (Allocation Failure) [PSYoungGen: 6733846K->43858K(6763520K)] 7418065K->728157K(20810240K), 0.0683046 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6525200K->66239K(6756864K)] 7209498K->750617K(20803584K), 0.0764187 secs] [Times: user=0.10 sys=0.00, real=0.08 secs]


We were going through the thread dump, but couldn't find a reason for such behaviour

"flink-rest-server-netty-worker-thread-8" - Thread t@126233
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <2654bbbf> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <45bd4356> (a java.util.Collections$UnmodifiableSet)
        - locked <24d4aa0> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-7" - Thread t@123834
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <2a9b97d7> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <29115878> (a java.util.Collections$UnmodifiableSet)
        - locked <61909ff> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-6" - Thread t@121404
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <6ee94bd6> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <183a4cf3> (a java.util.Collections$UnmodifiableSet)
        - locked <3b4b1514> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-5" - Thread t@118945
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <7553b573> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <51c0f7f4> (a java.util.Collections$UnmodifiableSet)
        - locked <1bd774eb> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-4" - Thread t@117073
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <78572130> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <1047fb64> (a java.util.Collections$UnmodifiableSet)
        - locked <466ec1cf> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"gcs-async-channel-pool-19" - Thread t@116997
   java.lang.Thread.State: TIMED_WAITING
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <1be52a4a> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"RMI TCP Connection(36)-127.0.0.1" - Thread t@114070
   java.lang.Thread.State: TIMED_WAITING
        at java.lang.Object.wait(Native Method)
        - waiting on <22b14b0b> (a com.sun.jmx.remote.internal.ArrayNotificationBuffer)
        at com.sun.jmx.remote.internal.ArrayNotificationBuffer.fetchNotifications(ArrayNotificationBuffer.java:449)
        at com.sun.jmx.remote.internal.ArrayNotificationBuffer$ShareBuffer.fetchNotifications(ArrayNotificationBuffer.java:227)
        at com.sun.jmx.remote.internal.ServerNotifForwarder.fetchNotifs(ServerNotifForwarder.java:274)
        at javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1270)
        at javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1268)
        at javax.management.remote.rmi.RMIConnectionImpl.fetchNotifications(RMIConnectionImpl.java:1274)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
        at sun.rmi.transport.Transport$1.run(Transport.java:200)
        at sun.rmi.transport.Transport$1.run(Transport.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$592/690739656.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

On Thu, Jul 18, 2019 at 12:25 PM Biao Liu <[hidden email]> wrote:
Hi,

It seems to be good based on your GC metrics. You could double check the GC log if you enable it. The GC log is more direct.
I'm not sure what's happening in your JobManager. But I'm pretty sure that Flink could support far more larger scale cluster than yours.

Have you ever checked the log file of JobManager? Is there any suspicious warning or error log?
Have you ever tried some analytic tools to check the internal state of JobManager, like jstack.

It's hard to do a deeper analysis based on current informations. It might be helpful if you could provide more details. 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午2:12写道:
Hi,

We are using v1.6.2, currently, the number of TaskManagers are 70. We have the GC metrics on a dashboard also. Sum of Status.JVM.GarbageCollector.MarkSweepCompact.Time grouped by 1 min is somewhere between 75 to 125 and Status.JVM.GarbageCollector.MarkSweepCompact.Count is fixed at 10.

On Thu, Jul 18, 2019 at 11:32 AM Biao Liu <[hidden email]> wrote:
Hi Prakhar,

Have you ever checked the garbage collection of master?
Which version of Flink are you using? How many TaskManagers in your cluster? 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午1:54写道:
Hello,

We have deployed multiple Flink clusters on Kubernetess with 1 replica of Jobmanager and multiple of Taskmanager as per the requirement. Recently we are observing that on increasing the number of Taskmanagers for a cluster, the Jobmanager becomes irresponsive. It stops sending statsd metric for some irregular interval. Even the Jobmanager pod keeps restarting because it stops responding to the liveliness probe which results in Kubernetes killing the pod. We tried increasing the resources given(CPU, RAM) but it didn't help.

Regards
Prakhar Mathur
Product Engineer
GO-JEK
Reply | Threaded
Open this post in threaded view
|

Re: Job Manager becomes irresponsive if the size of the session cluster grows

Biao Liu
Hi Prakhar,

Sorry I could not find any abnormal message from your GC log and stack trace.
Have you ever tried deploying the cluster in other ways? Not on Kubernetes. Like on YARN or standalone. Just for narrowing down the scope.


On Tue, Jul 23, 2019 at 12:34 PM Prakhar Mathur <[hidden email]> wrote:

On Mon, Jul 22, 2019, 16:08 Prakhar Mathur <[hidden email]> wrote:
Hi,

We enabled GC logging, here are the logs

[GC (Allocation Failure) [PSYoungGen: 6482015K->70303K(6776832K)] 6955827K->544194K(20823552K), 0.0591479 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6587039K->38642K(6763008K)] 7060930K->512614K(20809728K), 0.0740978 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6502858K->192077K(6734848K)] 6976829K->666144K(20781568K), 0.0841759 secs] [Times: user=0.17 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6647378K->50108K(6759424K)] 7121446K->524248K(20806144K), 0.0622997 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6501890K->60606K(6779904K)] 6976029K->534961K(20826624K), 0.0637955 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6586046K->40411K(6768640K)] 7060401K->514839K(20815360K), 0.0729137 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6543919K->51886K(6797824K)] 7018346K->526385K(20844544K), 0.0649143 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6601690K->191832K(6754304K)] 7076190K->666427K(20801024K), 0.1029686 secs] [Times: user=0.18 sys=0.00, real=0.10 secs]
[GC (Allocation Failure) [PSYoungGen: 6742947K->62693K(6781952K)] 7217543K->537361K(20828672K), 0.0639272 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6589986K->66299K(6770688K)] 7064653K->541039K(20817408K), 0.0701853 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6590742K->42995K(6800896K)] 7065481K->517798K(20847616K), 0.0595729 secs] [Times: user=0.08 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6608678K->66127K(6793728K)] 7083482K->541011K(20840448K), 0.0608270 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6608886K->62063K(6822400K)] 7083769K->537027K(20869120K), 0.0675917 secs] [Times: user=0.10 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6617146K->200674K(6812160K)] 7092110K->812325K(20858880K), 1.1685237 secs] [Times: user=3.53 sys=0.71, real=1.17 secs]
[GC (Allocation Failure) [PSYoungGen: 6773610K->194848K(6633984K)] 7385261K->806700K(20680704K), 0.0858601 secs] [Times: user=0.19 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6617888K->44002K(6723072K)] 7229740K->655854K(20769792K), 0.0647387 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6420043K->194672K(6702080K)] 7031895K->806604K(20748800K), 0.0833792 secs] [Times: user=0.18 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6603376K->187059K(6596096K)] 7215308K->799063K(20642816K), 0.0906529 secs] [Times: user=0.17 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6572690K->51850K(6715904K)] 7184694K->663990K(20762624K), 0.0837285 secs] [Times: user=0.11 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6452380K->44766K(6708224K)] 7064519K->656993K(20754944K), 0.0809864 secs] [Times: user=0.10 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6445790K->42654K(6730752K)] 7058017K->654961K(20777472K), 0.0686401 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6473374K->41410K(6723072K)] 7085681K->654085K(20769792K), 0.0714228 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6448507K->59031K(6752256K)] 7061182K->671786K(20798976K), 0.0599829 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6489819K->68484K(6739968K)] 7102574K->681619K(20786688K), 0.0674541 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6537251K->257001K(6625792K)] 7150386K->938268K(20672512K), 1.1775771 secs] [Times: user=4.11 sys=0.22, real=1.18 secs]
[GC (Allocation Failure) [PSYoungGen: 6575609K->76090K(6695936K)] 7256876K->757765K(20742656K), 0.0805662 secs] [Times: user=0.11 sys=0.01, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6444858K->42242K(6719488K)] 7126533K->723917K(20766208K), 0.0611949 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6399332K->143985K(6710784K)] 7081006K->826084K(20757504K), 0.0771257 secs] [Times: user=0.16 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6503905K->67207K(6736896K)] 7186004K->750758K(20783616K), 0.0808856 secs] [Times: user=0.11 sys=0.01, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6468318K->59200K(6724608K)] 7151869K->743099K(20771328K), 0.0709461 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6484294K->51134K(6757376K)] 7168193K->735120K(20804096K), 0.0772947 secs] [Times: user=0.11 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6503020K->77337K(6747648K)] 7187007K->761395K(20794368K), 0.0835827 secs] [Times: user=0.12 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6505125K->66972K(6783488K)] 7189184K->751110K(20830208K), 0.0671168 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6544467K->201750K(6734336K)] 7228606K->885969K(20781056K), 0.0931485 secs] [Times: user=0.20 sys=0.00, real=0.10 secs]
[GC (Allocation Failure) [PSYoungGen: 6733846K->43858K(6763520K)] 7418065K->728157K(20810240K), 0.0683046 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6525200K->66239K(6756864K)] 7209498K->750617K(20803584K), 0.0764187 secs] [Times: user=0.10 sys=0.00, real=0.08 secs]


We were going through the thread dump, but couldn't find a reason for such behaviour

"flink-rest-server-netty-worker-thread-8" - Thread t@126233
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <2654bbbf> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <45bd4356> (a java.util.Collections$UnmodifiableSet)
        - locked <24d4aa0> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-7" - Thread t@123834
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <2a9b97d7> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <29115878> (a java.util.Collections$UnmodifiableSet)
        - locked <61909ff> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-6" - Thread t@121404
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <6ee94bd6> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <183a4cf3> (a java.util.Collections$UnmodifiableSet)
        - locked <3b4b1514> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-5" - Thread t@118945
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <7553b573> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <51c0f7f4> (a java.util.Collections$UnmodifiableSet)
        - locked <1bd774eb> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-4" - Thread t@117073
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <78572130> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <1047fb64> (a java.util.Collections$UnmodifiableSet)
        - locked <466ec1cf> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"gcs-async-channel-pool-19" - Thread t@116997
   java.lang.Thread.State: TIMED_WAITING
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <1be52a4a> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"RMI TCP Connection(36)-127.0.0.1" - Thread t@114070
   java.lang.Thread.State: TIMED_WAITING
        at java.lang.Object.wait(Native Method)
        - waiting on <22b14b0b> (a com.sun.jmx.remote.internal.ArrayNotificationBuffer)
        at com.sun.jmx.remote.internal.ArrayNotificationBuffer.fetchNotifications(ArrayNotificationBuffer.java:449)
        at com.sun.jmx.remote.internal.ArrayNotificationBuffer$ShareBuffer.fetchNotifications(ArrayNotificationBuffer.java:227)
        at com.sun.jmx.remote.internal.ServerNotifForwarder.fetchNotifs(ServerNotifForwarder.java:274)
        at javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1270)
        at javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1268)
        at javax.management.remote.rmi.RMIConnectionImpl.fetchNotifications(RMIConnectionImpl.java:1274)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
        at sun.rmi.transport.Transport$1.run(Transport.java:200)
        at sun.rmi.transport.Transport$1.run(Transport.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$592/690739656.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

On Thu, Jul 18, 2019 at 12:25 PM Biao Liu <[hidden email]> wrote:
Hi,

It seems to be good based on your GC metrics. You could double check the GC log if you enable it. The GC log is more direct.
I'm not sure what's happening in your JobManager. But I'm pretty sure that Flink could support far more larger scale cluster than yours.

Have you ever checked the log file of JobManager? Is there any suspicious warning or error log?
Have you ever tried some analytic tools to check the internal state of JobManager, like jstack.

It's hard to do a deeper analysis based on current informations. It might be helpful if you could provide more details. 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午2:12写道:
Hi,

We are using v1.6.2, currently, the number of TaskManagers are 70. We have the GC metrics on a dashboard also. Sum of Status.JVM.GarbageCollector.MarkSweepCompact.Time grouped by 1 min is somewhere between 75 to 125 and Status.JVM.GarbageCollector.MarkSweepCompact.Count is fixed at 10.

On Thu, Jul 18, 2019 at 11:32 AM Biao Liu <[hidden email]> wrote:
Hi Prakhar,

Have you ever checked the garbage collection of master?
Which version of Flink are you using? How many TaskManagers in your cluster? 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午1:54写道:
Hello,

We have deployed multiple Flink clusters on Kubernetess with 1 replica of Jobmanager and multiple of Taskmanager as per the requirement. Recently we are observing that on increasing the number of Taskmanagers for a cluster, the Jobmanager becomes irresponsive. It stops sending statsd metric for some irregular interval. Even the Jobmanager pod keeps restarting because it stops responding to the liveliness probe which results in Kubernetes killing the pod. We tried increasing the resources given(CPU, RAM) but it didn't help.

Regards
Prakhar Mathur
Product Engineer
GO-JEK
Reply | Threaded
Open this post in threaded view
|

Re: Job Manager becomes irresponsive if the size of the session cluster grows

prakhar_mathur
Hi,

So we were deploying our flink clusters on YARN earlier but then we moved to kubernetes, but then our clusters were not this big. Have you guys seen issues with job manager rest server becoming irresponsive on kubernetes before?

On Fri, Jul 26, 2019, 14:28 Biao Liu <[hidden email]> wrote:
Hi Prakhar,

Sorry I could not find any abnormal message from your GC log and stack trace.
Have you ever tried deploying the cluster in other ways? Not on Kubernetes. Like on YARN or standalone. Just for narrowing down the scope.


On Tue, Jul 23, 2019 at 12:34 PM Prakhar Mathur <[hidden email]> wrote:

On Mon, Jul 22, 2019, 16:08 Prakhar Mathur <[hidden email]> wrote:
Hi,

We enabled GC logging, here are the logs

[GC (Allocation Failure) [PSYoungGen: 6482015K->70303K(6776832K)] 6955827K->544194K(20823552K), 0.0591479 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6587039K->38642K(6763008K)] 7060930K->512614K(20809728K), 0.0740978 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6502858K->192077K(6734848K)] 6976829K->666144K(20781568K), 0.0841759 secs] [Times: user=0.17 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6647378K->50108K(6759424K)] 7121446K->524248K(20806144K), 0.0622997 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6501890K->60606K(6779904K)] 6976029K->534961K(20826624K), 0.0637955 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6586046K->40411K(6768640K)] 7060401K->514839K(20815360K), 0.0729137 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6543919K->51886K(6797824K)] 7018346K->526385K(20844544K), 0.0649143 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6601690K->191832K(6754304K)] 7076190K->666427K(20801024K), 0.1029686 secs] [Times: user=0.18 sys=0.00, real=0.10 secs]
[GC (Allocation Failure) [PSYoungGen: 6742947K->62693K(6781952K)] 7217543K->537361K(20828672K), 0.0639272 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6589986K->66299K(6770688K)] 7064653K->541039K(20817408K), 0.0701853 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6590742K->42995K(6800896K)] 7065481K->517798K(20847616K), 0.0595729 secs] [Times: user=0.08 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6608678K->66127K(6793728K)] 7083482K->541011K(20840448K), 0.0608270 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6608886K->62063K(6822400K)] 7083769K->537027K(20869120K), 0.0675917 secs] [Times: user=0.10 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6617146K->200674K(6812160K)] 7092110K->812325K(20858880K), 1.1685237 secs] [Times: user=3.53 sys=0.71, real=1.17 secs]
[GC (Allocation Failure) [PSYoungGen: 6773610K->194848K(6633984K)] 7385261K->806700K(20680704K), 0.0858601 secs] [Times: user=0.19 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6617888K->44002K(6723072K)] 7229740K->655854K(20769792K), 0.0647387 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6420043K->194672K(6702080K)] 7031895K->806604K(20748800K), 0.0833792 secs] [Times: user=0.18 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6603376K->187059K(6596096K)] 7215308K->799063K(20642816K), 0.0906529 secs] [Times: user=0.17 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6572690K->51850K(6715904K)] 7184694K->663990K(20762624K), 0.0837285 secs] [Times: user=0.11 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6452380K->44766K(6708224K)] 7064519K->656993K(20754944K), 0.0809864 secs] [Times: user=0.10 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6445790K->42654K(6730752K)] 7058017K->654961K(20777472K), 0.0686401 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6473374K->41410K(6723072K)] 7085681K->654085K(20769792K), 0.0714228 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6448507K->59031K(6752256K)] 7061182K->671786K(20798976K), 0.0599829 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6489819K->68484K(6739968K)] 7102574K->681619K(20786688K), 0.0674541 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6537251K->257001K(6625792K)] 7150386K->938268K(20672512K), 1.1775771 secs] [Times: user=4.11 sys=0.22, real=1.18 secs]
[GC (Allocation Failure) [PSYoungGen: 6575609K->76090K(6695936K)] 7256876K->757765K(20742656K), 0.0805662 secs] [Times: user=0.11 sys=0.01, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6444858K->42242K(6719488K)] 7126533K->723917K(20766208K), 0.0611949 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6399332K->143985K(6710784K)] 7081006K->826084K(20757504K), 0.0771257 secs] [Times: user=0.16 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6503905K->67207K(6736896K)] 7186004K->750758K(20783616K), 0.0808856 secs] [Times: user=0.11 sys=0.01, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6468318K->59200K(6724608K)] 7151869K->743099K(20771328K), 0.0709461 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6484294K->51134K(6757376K)] 7168193K->735120K(20804096K), 0.0772947 secs] [Times: user=0.11 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6503020K->77337K(6747648K)] 7187007K->761395K(20794368K), 0.0835827 secs] [Times: user=0.12 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6505125K->66972K(6783488K)] 7189184K->751110K(20830208K), 0.0671168 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6544467K->201750K(6734336K)] 7228606K->885969K(20781056K), 0.0931485 secs] [Times: user=0.20 sys=0.00, real=0.10 secs]
[GC (Allocation Failure) [PSYoungGen: 6733846K->43858K(6763520K)] 7418065K->728157K(20810240K), 0.0683046 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6525200K->66239K(6756864K)] 7209498K->750617K(20803584K), 0.0764187 secs] [Times: user=0.10 sys=0.00, real=0.08 secs]


We were going through the thread dump, but couldn't find a reason for such behaviour

"flink-rest-server-netty-worker-thread-8" - Thread t@126233
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <2654bbbf> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <45bd4356> (a java.util.Collections$UnmodifiableSet)
        - locked <24d4aa0> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-7" - Thread t@123834
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <2a9b97d7> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <29115878> (a java.util.Collections$UnmodifiableSet)
        - locked <61909ff> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-6" - Thread t@121404
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <6ee94bd6> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <183a4cf3> (a java.util.Collections$UnmodifiableSet)
        - locked <3b4b1514> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-5" - Thread t@118945
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <7553b573> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <51c0f7f4> (a java.util.Collections$UnmodifiableSet)
        - locked <1bd774eb> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-4" - Thread t@117073
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <78572130> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <1047fb64> (a java.util.Collections$UnmodifiableSet)
        - locked <466ec1cf> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"gcs-async-channel-pool-19" - Thread t@116997
   java.lang.Thread.State: TIMED_WAITING
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <1be52a4a> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"RMI TCP Connection(36)-127.0.0.1" - Thread t@114070
   java.lang.Thread.State: TIMED_WAITING
        at java.lang.Object.wait(Native Method)
        - waiting on <22b14b0b> (a com.sun.jmx.remote.internal.ArrayNotificationBuffer)
        at com.sun.jmx.remote.internal.ArrayNotificationBuffer.fetchNotifications(ArrayNotificationBuffer.java:449)
        at com.sun.jmx.remote.internal.ArrayNotificationBuffer$ShareBuffer.fetchNotifications(ArrayNotificationBuffer.java:227)
        at com.sun.jmx.remote.internal.ServerNotifForwarder.fetchNotifs(ServerNotifForwarder.java:274)
        at javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1270)
        at javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1268)
        at javax.management.remote.rmi.RMIConnectionImpl.fetchNotifications(RMIConnectionImpl.java:1274)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
        at sun.rmi.transport.Transport$1.run(Transport.java:200)
        at sun.rmi.transport.Transport$1.run(Transport.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$592/690739656.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

On Thu, Jul 18, 2019 at 12:25 PM Biao Liu <[hidden email]> wrote:
Hi,

It seems to be good based on your GC metrics. You could double check the GC log if you enable it. The GC log is more direct.
I'm not sure what's happening in your JobManager. But I'm pretty sure that Flink could support far more larger scale cluster than yours.

Have you ever checked the log file of JobManager? Is there any suspicious warning or error log?
Have you ever tried some analytic tools to check the internal state of JobManager, like jstack.

It's hard to do a deeper analysis based on current informations. It might be helpful if you could provide more details. 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午2:12写道:
Hi,

We are using v1.6.2, currently, the number of TaskManagers are 70. We have the GC metrics on a dashboard also. Sum of Status.JVM.GarbageCollector.MarkSweepCompact.Time grouped by 1 min is somewhere between 75 to 125 and Status.JVM.GarbageCollector.MarkSweepCompact.Count is fixed at 10.

On Thu, Jul 18, 2019 at 11:32 AM Biao Liu <[hidden email]> wrote:
Hi Prakhar,

Have you ever checked the garbage collection of master?
Which version of Flink are you using? How many TaskManagers in your cluster? 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午1:54写道:
Hello,

We have deployed multiple Flink clusters on Kubernetess with 1 replica of Jobmanager and multiple of Taskmanager as per the requirement. Recently we are observing that on increasing the number of Taskmanagers for a cluster, the Jobmanager becomes irresponsive. It stops sending statsd metric for some irregular interval. Even the Jobmanager pod keeps restarting because it stops responding to the liveliness probe which results in Kubernetes killing the pod. We tried increasing the resources given(CPU, RAM) but it didn't help.

Regards
Prakhar Mathur
Product Engineer
GO-JEK
Reply | Threaded
Open this post in threaded view
|

Re: Job Manager becomes irresponsive if the size of the session cluster grows

Biao Liu
Hi Prakhar,

Sorry I don't have much experience on k8s. Maybe some other guys could help.

On Fri, Jul 26, 2019 at 6:20 PM Prakhar Mathur <[hidden email]> wrote:
Hi,

So we were deploying our flink clusters on YARN earlier but then we moved to kubernetes, but then our clusters were not this big. Have you guys seen issues with job manager rest server becoming irresponsive on kubernetes before?

On Fri, Jul 26, 2019, 14:28 Biao Liu <[hidden email]> wrote:
Hi Prakhar,

Sorry I could not find any abnormal message from your GC log and stack trace.
Have you ever tried deploying the cluster in other ways? Not on Kubernetes. Like on YARN or standalone. Just for narrowing down the scope.


On Tue, Jul 23, 2019 at 12:34 PM Prakhar Mathur <[hidden email]> wrote:

On Mon, Jul 22, 2019, 16:08 Prakhar Mathur <[hidden email]> wrote:
Hi,

We enabled GC logging, here are the logs

[GC (Allocation Failure) [PSYoungGen: 6482015K->70303K(6776832K)] 6955827K->544194K(20823552K), 0.0591479 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6587039K->38642K(6763008K)] 7060930K->512614K(20809728K), 0.0740978 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6502858K->192077K(6734848K)] 6976829K->666144K(20781568K), 0.0841759 secs] [Times: user=0.17 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6647378K->50108K(6759424K)] 7121446K->524248K(20806144K), 0.0622997 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6501890K->60606K(6779904K)] 6976029K->534961K(20826624K), 0.0637955 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6586046K->40411K(6768640K)] 7060401K->514839K(20815360K), 0.0729137 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6543919K->51886K(6797824K)] 7018346K->526385K(20844544K), 0.0649143 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6601690K->191832K(6754304K)] 7076190K->666427K(20801024K), 0.1029686 secs] [Times: user=0.18 sys=0.00, real=0.10 secs]
[GC (Allocation Failure) [PSYoungGen: 6742947K->62693K(6781952K)] 7217543K->537361K(20828672K), 0.0639272 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6589986K->66299K(6770688K)] 7064653K->541039K(20817408K), 0.0701853 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6590742K->42995K(6800896K)] 7065481K->517798K(20847616K), 0.0595729 secs] [Times: user=0.08 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6608678K->66127K(6793728K)] 7083482K->541011K(20840448K), 0.0608270 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6608886K->62063K(6822400K)] 7083769K->537027K(20869120K), 0.0675917 secs] [Times: user=0.10 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6617146K->200674K(6812160K)] 7092110K->812325K(20858880K), 1.1685237 secs] [Times: user=3.53 sys=0.71, real=1.17 secs]
[GC (Allocation Failure) [PSYoungGen: 6773610K->194848K(6633984K)] 7385261K->806700K(20680704K), 0.0858601 secs] [Times: user=0.19 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6617888K->44002K(6723072K)] 7229740K->655854K(20769792K), 0.0647387 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6420043K->194672K(6702080K)] 7031895K->806604K(20748800K), 0.0833792 secs] [Times: user=0.18 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6603376K->187059K(6596096K)] 7215308K->799063K(20642816K), 0.0906529 secs] [Times: user=0.17 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6572690K->51850K(6715904K)] 7184694K->663990K(20762624K), 0.0837285 secs] [Times: user=0.11 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6452380K->44766K(6708224K)] 7064519K->656993K(20754944K), 0.0809864 secs] [Times: user=0.10 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6445790K->42654K(6730752K)] 7058017K->654961K(20777472K), 0.0686401 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6473374K->41410K(6723072K)] 7085681K->654085K(20769792K), 0.0714228 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6448507K->59031K(6752256K)] 7061182K->671786K(20798976K), 0.0599829 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6489819K->68484K(6739968K)] 7102574K->681619K(20786688K), 0.0674541 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6537251K->257001K(6625792K)] 7150386K->938268K(20672512K), 1.1775771 secs] [Times: user=4.11 sys=0.22, real=1.18 secs]
[GC (Allocation Failure) [PSYoungGen: 6575609K->76090K(6695936K)] 7256876K->757765K(20742656K), 0.0805662 secs] [Times: user=0.11 sys=0.01, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6444858K->42242K(6719488K)] 7126533K->723917K(20766208K), 0.0611949 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6399332K->143985K(6710784K)] 7081006K->826084K(20757504K), 0.0771257 secs] [Times: user=0.16 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6503905K->67207K(6736896K)] 7186004K->750758K(20783616K), 0.0808856 secs] [Times: user=0.11 sys=0.01, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6468318K->59200K(6724608K)] 7151869K->743099K(20771328K), 0.0709461 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6484294K->51134K(6757376K)] 7168193K->735120K(20804096K), 0.0772947 secs] [Times: user=0.11 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6503020K->77337K(6747648K)] 7187007K->761395K(20794368K), 0.0835827 secs] [Times: user=0.12 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6505125K->66972K(6783488K)] 7189184K->751110K(20830208K), 0.0671168 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6544467K->201750K(6734336K)] 7228606K->885969K(20781056K), 0.0931485 secs] [Times: user=0.20 sys=0.00, real=0.10 secs]
[GC (Allocation Failure) [PSYoungGen: 6733846K->43858K(6763520K)] 7418065K->728157K(20810240K), 0.0683046 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6525200K->66239K(6756864K)] 7209498K->750617K(20803584K), 0.0764187 secs] [Times: user=0.10 sys=0.00, real=0.08 secs]


We were going through the thread dump, but couldn't find a reason for such behaviour

"flink-rest-server-netty-worker-thread-8" - Thread t@126233
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <2654bbbf> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <45bd4356> (a java.util.Collections$UnmodifiableSet)
        - locked <24d4aa0> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-7" - Thread t@123834
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <2a9b97d7> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <29115878> (a java.util.Collections$UnmodifiableSet)
        - locked <61909ff> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-6" - Thread t@121404
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <6ee94bd6> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <183a4cf3> (a java.util.Collections$UnmodifiableSet)
        - locked <3b4b1514> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-5" - Thread t@118945
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <7553b573> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <51c0f7f4> (a java.util.Collections$UnmodifiableSet)
        - locked <1bd774eb> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-4" - Thread t@117073
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <78572130> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <1047fb64> (a java.util.Collections$UnmodifiableSet)
        - locked <466ec1cf> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"gcs-async-channel-pool-19" - Thread t@116997
   java.lang.Thread.State: TIMED_WAITING
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <1be52a4a> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"RMI TCP Connection(36)-127.0.0.1" - Thread t@114070
   java.lang.Thread.State: TIMED_WAITING
        at java.lang.Object.wait(Native Method)
        - waiting on <22b14b0b> (a com.sun.jmx.remote.internal.ArrayNotificationBuffer)
        at com.sun.jmx.remote.internal.ArrayNotificationBuffer.fetchNotifications(ArrayNotificationBuffer.java:449)
        at com.sun.jmx.remote.internal.ArrayNotificationBuffer$ShareBuffer.fetchNotifications(ArrayNotificationBuffer.java:227)
        at com.sun.jmx.remote.internal.ServerNotifForwarder.fetchNotifs(ServerNotifForwarder.java:274)
        at javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1270)
        at javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1268)
        at javax.management.remote.rmi.RMIConnectionImpl.fetchNotifications(RMIConnectionImpl.java:1274)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
        at sun.rmi.transport.Transport$1.run(Transport.java:200)
        at sun.rmi.transport.Transport$1.run(Transport.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$592/690739656.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

On Thu, Jul 18, 2019 at 12:25 PM Biao Liu <[hidden email]> wrote:
Hi,

It seems to be good based on your GC metrics. You could double check the GC log if you enable it. The GC log is more direct.
I'm not sure what's happening in your JobManager. But I'm pretty sure that Flink could support far more larger scale cluster than yours.

Have you ever checked the log file of JobManager? Is there any suspicious warning or error log?
Have you ever tried some analytic tools to check the internal state of JobManager, like jstack.

It's hard to do a deeper analysis based on current informations. It might be helpful if you could provide more details. 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午2:12写道:
Hi,

We are using v1.6.2, currently, the number of TaskManagers are 70. We have the GC metrics on a dashboard also. Sum of Status.JVM.GarbageCollector.MarkSweepCompact.Time grouped by 1 min is somewhere between 75 to 125 and Status.JVM.GarbageCollector.MarkSweepCompact.Count is fixed at 10.

On Thu, Jul 18, 2019 at 11:32 AM Biao Liu <[hidden email]> wrote:
Hi Prakhar,

Have you ever checked the garbage collection of master?
Which version of Flink are you using? How many TaskManagers in your cluster? 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午1:54写道:
Hello,

We have deployed multiple Flink clusters on Kubernetess with 1 replica of Jobmanager and multiple of Taskmanager as per the requirement. Recently we are observing that on increasing the number of Taskmanagers for a cluster, the Jobmanager becomes irresponsive. It stops sending statsd metric for some irregular interval. Even the Jobmanager pod keeps restarting because it stops responding to the liveliness probe which results in Kubernetes killing the pod. We tried increasing the resources given(CPU, RAM) but it didn't help.

Regards
Prakhar Mathur
Product Engineer
GO-JEK
Reply | Threaded
Open this post in threaded view
|

Re: Job Manager becomes irresponsive if the size of the session cluster grows

Richard Deurwaarder
Hello,

We run into the same problem. We've done most of the same steps/observations: 
  • increase memory
  • increase cpu
  • No noticable increase in GC activity
  • Little network io
Our current setup has the liveliness probe disabled and we've increased (akka)timeouts, this seems to help a bit but it still feels very unresponsive.
 

On Fri, Jul 26, 2019 at 12:41 PM Biao Liu <[hidden email]> wrote:
Hi Prakhar,

Sorry I don't have much experience on k8s. Maybe some other guys could help.

On Fri, Jul 26, 2019 at 6:20 PM Prakhar Mathur <[hidden email]> wrote:
Hi,

So we were deploying our flink clusters on YARN earlier but then we moved to kubernetes, but then our clusters were not this big. Have you guys seen issues with job manager rest server becoming irresponsive on kubernetes before?

On Fri, Jul 26, 2019, 14:28 Biao Liu <[hidden email]> wrote:
Hi Prakhar,

Sorry I could not find any abnormal message from your GC log and stack trace.
Have you ever tried deploying the cluster in other ways? Not on Kubernetes. Like on YARN or standalone. Just for narrowing down the scope.


On Tue, Jul 23, 2019 at 12:34 PM Prakhar Mathur <[hidden email]> wrote:

On Mon, Jul 22, 2019, 16:08 Prakhar Mathur <[hidden email]> wrote:
Hi,

We enabled GC logging, here are the logs

[GC (Allocation Failure) [PSYoungGen: 6482015K->70303K(6776832K)] 6955827K->544194K(20823552K), 0.0591479 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6587039K->38642K(6763008K)] 7060930K->512614K(20809728K), 0.0740978 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6502858K->192077K(6734848K)] 6976829K->666144K(20781568K), 0.0841759 secs] [Times: user=0.17 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6647378K->50108K(6759424K)] 7121446K->524248K(20806144K), 0.0622997 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6501890K->60606K(6779904K)] 6976029K->534961K(20826624K), 0.0637955 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6586046K->40411K(6768640K)] 7060401K->514839K(20815360K), 0.0729137 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6543919K->51886K(6797824K)] 7018346K->526385K(20844544K), 0.0649143 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6601690K->191832K(6754304K)] 7076190K->666427K(20801024K), 0.1029686 secs] [Times: user=0.18 sys=0.00, real=0.10 secs]
[GC (Allocation Failure) [PSYoungGen: 6742947K->62693K(6781952K)] 7217543K->537361K(20828672K), 0.0639272 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6589986K->66299K(6770688K)] 7064653K->541039K(20817408K), 0.0701853 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6590742K->42995K(6800896K)] 7065481K->517798K(20847616K), 0.0595729 secs] [Times: user=0.08 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6608678K->66127K(6793728K)] 7083482K->541011K(20840448K), 0.0608270 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6608886K->62063K(6822400K)] 7083769K->537027K(20869120K), 0.0675917 secs] [Times: user=0.10 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6617146K->200674K(6812160K)] 7092110K->812325K(20858880K), 1.1685237 secs] [Times: user=3.53 sys=0.71, real=1.17 secs]
[GC (Allocation Failure) [PSYoungGen: 6773610K->194848K(6633984K)] 7385261K->806700K(20680704K), 0.0858601 secs] [Times: user=0.19 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6617888K->44002K(6723072K)] 7229740K->655854K(20769792K), 0.0647387 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6420043K->194672K(6702080K)] 7031895K->806604K(20748800K), 0.0833792 secs] [Times: user=0.18 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6603376K->187059K(6596096K)] 7215308K->799063K(20642816K), 0.0906529 secs] [Times: user=0.17 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6572690K->51850K(6715904K)] 7184694K->663990K(20762624K), 0.0837285 secs] [Times: user=0.11 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6452380K->44766K(6708224K)] 7064519K->656993K(20754944K), 0.0809864 secs] [Times: user=0.10 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6445790K->42654K(6730752K)] 7058017K->654961K(20777472K), 0.0686401 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6473374K->41410K(6723072K)] 7085681K->654085K(20769792K), 0.0714228 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6448507K->59031K(6752256K)] 7061182K->671786K(20798976K), 0.0599829 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6489819K->68484K(6739968K)] 7102574K->681619K(20786688K), 0.0674541 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6537251K->257001K(6625792K)] 7150386K->938268K(20672512K), 1.1775771 secs] [Times: user=4.11 sys=0.22, real=1.18 secs]
[GC (Allocation Failure) [PSYoungGen: 6575609K->76090K(6695936K)] 7256876K->757765K(20742656K), 0.0805662 secs] [Times: user=0.11 sys=0.01, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6444858K->42242K(6719488K)] 7126533K->723917K(20766208K), 0.0611949 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6399332K->143985K(6710784K)] 7081006K->826084K(20757504K), 0.0771257 secs] [Times: user=0.16 sys=0.00, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6503905K->67207K(6736896K)] 7186004K->750758K(20783616K), 0.0808856 secs] [Times: user=0.11 sys=0.01, real=0.08 secs]
[GC (Allocation Failure) [PSYoungGen: 6468318K->59200K(6724608K)] 7151869K->743099K(20771328K), 0.0709461 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6484294K->51134K(6757376K)] 7168193K->735120K(20804096K), 0.0772947 secs] [Times: user=0.11 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6503020K->77337K(6747648K)] 7187007K->761395K(20794368K), 0.0835827 secs] [Times: user=0.12 sys=0.00, real=0.09 secs]
[GC (Allocation Failure) [PSYoungGen: 6505125K->66972K(6783488K)] 7189184K->751110K(20830208K), 0.0671168 secs] [Times: user=0.09 sys=0.00, real=0.06 secs]
[GC (Allocation Failure) [PSYoungGen: 6544467K->201750K(6734336K)] 7228606K->885969K(20781056K), 0.0931485 secs] [Times: user=0.20 sys=0.00, real=0.10 secs]
[GC (Allocation Failure) [PSYoungGen: 6733846K->43858K(6763520K)] 7418065K->728157K(20810240K), 0.0683046 secs] [Times: user=0.09 sys=0.00, real=0.07 secs]
[GC (Allocation Failure) [PSYoungGen: 6525200K->66239K(6756864K)] 7209498K->750617K(20803584K), 0.0764187 secs] [Times: user=0.10 sys=0.00, real=0.08 secs]


We were going through the thread dump, but couldn't find a reason for such behaviour

"flink-rest-server-netty-worker-thread-8" - Thread t@126233
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <2654bbbf> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <45bd4356> (a java.util.Collections$UnmodifiableSet)
        - locked <24d4aa0> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-7" - Thread t@123834
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <2a9b97d7> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <29115878> (a java.util.Collections$UnmodifiableSet)
        - locked <61909ff> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-6" - Thread t@121404
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <6ee94bd6> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <183a4cf3> (a java.util.Collections$UnmodifiableSet)
        - locked <3b4b1514> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-5" - Thread t@118945
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <7553b573> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <51c0f7f4> (a java.util.Collections$UnmodifiableSet)
        - locked <1bd774eb> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-rest-server-netty-worker-thread-4" - Thread t@117073
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <78572130> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
        - locked <1047fb64> (a java.util.Collections$UnmodifiableSet)
        - locked <466ec1cf> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"gcs-async-channel-pool-19" - Thread t@116997
   java.lang.Thread.State: TIMED_WAITING
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <1be52a4a> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"RMI TCP Connection(36)-127.0.0.1" - Thread t@114070
   java.lang.Thread.State: TIMED_WAITING
        at java.lang.Object.wait(Native Method)
        - waiting on <22b14b0b> (a com.sun.jmx.remote.internal.ArrayNotificationBuffer)
        at com.sun.jmx.remote.internal.ArrayNotificationBuffer.fetchNotifications(ArrayNotificationBuffer.java:449)
        at com.sun.jmx.remote.internal.ArrayNotificationBuffer$ShareBuffer.fetchNotifications(ArrayNotificationBuffer.java:227)
        at com.sun.jmx.remote.internal.ServerNotifForwarder.fetchNotifs(ServerNotifForwarder.java:274)
        at javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1270)
        at javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1268)
        at javax.management.remote.rmi.RMIConnectionImpl.fetchNotifications(RMIConnectionImpl.java:1274)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
        at sun.rmi.transport.Transport$1.run(Transport.java:200)
        at sun.rmi.transport.Transport$1.run(Transport.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$592/690739656.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

On Thu, Jul 18, 2019 at 12:25 PM Biao Liu <[hidden email]> wrote:
Hi,

It seems to be good based on your GC metrics. You could double check the GC log if you enable it. The GC log is more direct.
I'm not sure what's happening in your JobManager. But I'm pretty sure that Flink could support far more larger scale cluster than yours.

Have you ever checked the log file of JobManager? Is there any suspicious warning or error log?
Have you ever tried some analytic tools to check the internal state of JobManager, like jstack.

It's hard to do a deeper analysis based on current informations. It might be helpful if you could provide more details. 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午2:12写道:
Hi,

We are using v1.6.2, currently, the number of TaskManagers are 70. We have the GC metrics on a dashboard also. Sum of Status.JVM.GarbageCollector.MarkSweepCompact.Time grouped by 1 min is somewhere between 75 to 125 and Status.JVM.GarbageCollector.MarkSweepCompact.Count is fixed at 10.

On Thu, Jul 18, 2019 at 11:32 AM Biao Liu <[hidden email]> wrote:
Hi Prakhar,

Have you ever checked the garbage collection of master?
Which version of Flink are you using? How many TaskManagers in your cluster? 


Prakhar Mathur <[hidden email]> 于2019年7月18日周四 下午1:54写道:
Hello,

We have deployed multiple Flink clusters on Kubernetess with 1 replica of Jobmanager and multiple of Taskmanager as per the requirement. Recently we are observing that on increasing the number of Taskmanagers for a cluster, the Jobmanager becomes irresponsive. It stops sending statsd metric for some irregular interval. Even the Jobmanager pod keeps restarting because it stops responding to the liveliness probe which results in Kubernetes killing the pod. We tried increasing the resources given(CPU, RAM) but it didn't help.

Regards
Prakhar Mathur
Product Engineer
GO-JEK