|
Hello,
We run into the same problem. We've done most of the same steps/observations: - increase memory
- increase cpu
- No noticable increase in GC activity
- Little network io
Our current setup has the liveliness probe disabled and we've increased (akka)timeouts, this seems to help a bit but it still feels very unresponsive. Hi Prakhar,
Sorry I don't have much experience on k8s. Maybe some other guys could help.
On Fri, Jul 26, 2019 at 6:20 PM Prakhar Mathur < [hidden email]> wrote: Hi,
So we were deploying our flink clusters on YARN earlier but then we moved to kubernetes, but then our clusters were not this big. Have you guys seen issues with job manager rest server becoming irresponsive on kubernetes before?
Hi Prakhar,
Sorry I could not find any abnormal message from your GC log and stack trace. Have you ever tried deploying the cluster in other ways? Not on Kubernetes. Like on YARN or standalone. Just for narrowing down the scope.
On Tue, Jul 23, 2019 at 12:34 PM Prakhar Mathur < [hidden email]> wrote:
Hi,
We enabled GC logging, here are the logs
[GC (Allocation Failure) [PSYoungGen: 6482015K->70303K(6776832K)] 6955827K->544194K(20823552K), 0.0591479 secs] [Times: user=0.09 sys=0.00, real=0.06 secs] [GC (Allocation Failure) [PSYoungGen: 6587039K->38642K(6763008K)] 7060930K->512614K(20809728K), 0.0740978 secs] [Times: user=0.08 sys=0.00, real=0.07 secs] [GC (Allocation Failure) [PSYoungGen: 6502858K->192077K(6734848K)] 6976829K->666144K(20781568K), 0.0841759 secs] [Times: user=0.17 sys=0.00, real=0.09 secs] [GC (Allocation Failure) [PSYoungGen: 6647378K->50108K(6759424K)] 7121446K->524248K(20806144K), 0.0622997 secs] [Times: user=0.08 sys=0.00, real=0.07 secs] [GC (Allocation Failure) [PSYoungGen: 6501890K->60606K(6779904K)] 6976029K->534961K(20826624K), 0.0637955 secs] [Times: user=0.09 sys=0.00, real=0.06 secs] [GC (Allocation Failure) [PSYoungGen: 6586046K->40411K(6768640K)] 7060401K->514839K(20815360K), 0.0729137 secs] [Times: user=0.08 sys=0.00, real=0.07 secs] [GC (Allocation Failure) [PSYoungGen: 6543919K->51886K(6797824K)] 7018346K->526385K(20844544K), 0.0649143 secs] [Times: user=0.09 sys=0.00, real=0.07 secs] [GC (Allocation Failure) [PSYoungGen: 6601690K->191832K(6754304K)] 7076190K->666427K(20801024K), 0.1029686 secs] [Times: user=0.18 sys=0.00, real=0.10 secs] [GC (Allocation Failure) [PSYoungGen: 6742947K->62693K(6781952K)] 7217543K->537361K(20828672K), 0.0639272 secs] [Times: user=0.09 sys=0.00, real=0.06 secs] [GC (Allocation Failure) [PSYoungGen: 6589986K->66299K(6770688K)] 7064653K->541039K(20817408K), 0.0701853 secs] [Times: user=0.08 sys=0.00, real=0.07 secs] [GC (Allocation Failure) [PSYoungGen: 6590742K->42995K(6800896K)] 7065481K->517798K(20847616K), 0.0595729 secs] [Times: user=0.08 sys=0.00, real=0.06 secs] [GC (Allocation Failure) [PSYoungGen: 6608678K->66127K(6793728K)] 7083482K->541011K(20840448K), 0.0608270 secs] [Times: user=0.09 sys=0.00, real=0.06 secs] [GC (Allocation Failure) [PSYoungGen: 6608886K->62063K(6822400K)] 7083769K->537027K(20869120K), 0.0675917 secs] [Times: user=0.10 sys=0.00, real=0.07 secs] [GC (Allocation Failure) [PSYoungGen: 6617146K->200674K(6812160K)] 7092110K->812325K(20858880K), 1.1685237 secs] [Times: user=3.53 sys=0.71, real=1.17 secs] [GC (Allocation Failure) [PSYoungGen: 6773610K->194848K(6633984K)] 7385261K->806700K(20680704K), 0.0858601 secs] [Times: user=0.19 sys=0.00, real=0.09 secs] [GC (Allocation Failure) [PSYoungGen: 6617888K->44002K(6723072K)] 7229740K->655854K(20769792K), 0.0647387 secs] [Times: user=0.09 sys=0.00, real=0.06 secs] [GC (Allocation Failure) [PSYoungGen: 6420043K->194672K(6702080K)] 7031895K->806604K(20748800K), 0.0833792 secs] [Times: user=0.18 sys=0.00, real=0.08 secs] [GC (Allocation Failure) [PSYoungGen: 6603376K->187059K(6596096K)] 7215308K->799063K(20642816K), 0.0906529 secs] [Times: user=0.17 sys=0.00, real=0.09 secs] [GC (Allocation Failure) [PSYoungGen: 6572690K->51850K(6715904K)] 7184694K->663990K(20762624K), 0.0837285 secs] [Times: user=0.11 sys=0.00, real=0.09 secs] [GC (Allocation Failure) [PSYoungGen: 6452380K->44766K(6708224K)] 7064519K->656993K(20754944K), 0.0809864 secs] [Times: user=0.10 sys=0.00, real=0.08 secs] [GC (Allocation Failure) [PSYoungGen: 6445790K->42654K(6730752K)] 7058017K->654961K(20777472K), 0.0686401 secs] [Times: user=0.08 sys=0.00, real=0.07 secs] [GC (Allocation Failure) [PSYoungGen: 6473374K->41410K(6723072K)] 7085681K->654085K(20769792K), 0.0714228 secs] [Times: user=0.08 sys=0.00, real=0.07 secs] [GC (Allocation Failure) [PSYoungGen: 6448507K->59031K(6752256K)] 7061182K->671786K(20798976K), 0.0599829 secs] [Times: user=0.09 sys=0.00, real=0.06 secs] [GC (Allocation Failure) [PSYoungGen: 6489819K->68484K(6739968K)] 7102574K->681619K(20786688K), 0.0674541 secs] [Times: user=0.09 sys=0.00, real=0.07 secs] [GC (Allocation Failure) [PSYoungGen: 6537251K->257001K(6625792K)] 7150386K->938268K(20672512K), 1.1775771 secs] [Times: user=4.11 sys=0.22, real=1.18 secs] [GC (Allocation Failure) [PSYoungGen: 6575609K->76090K(6695936K)] 7256876K->757765K(20742656K), 0.0805662 secs] [Times: user=0.11 sys=0.01, real=0.08 secs] [GC (Allocation Failure) [PSYoungGen: 6444858K->42242K(6719488K)] 7126533K->723917K(20766208K), 0.0611949 secs] [Times: user=0.09 sys=0.00, real=0.06 secs] [GC (Allocation Failure) [PSYoungGen: 6399332K->143985K(6710784K)] 7081006K->826084K(20757504K), 0.0771257 secs] [Times: user=0.16 sys=0.00, real=0.08 secs] [GC (Allocation Failure) [PSYoungGen: 6503905K->67207K(6736896K)] 7186004K->750758K(20783616K), 0.0808856 secs] [Times: user=0.11 sys=0.01, real=0.08 secs] [GC (Allocation Failure) [PSYoungGen: 6468318K->59200K(6724608K)] 7151869K->743099K(20771328K), 0.0709461 secs] [Times: user=0.09 sys=0.00, real=0.07 secs] [GC (Allocation Failure) [PSYoungGen: 6484294K->51134K(6757376K)] 7168193K->735120K(20804096K), 0.0772947 secs] [Times: user=0.11 sys=0.00, real=0.07 secs] [GC (Allocation Failure) [PSYoungGen: 6503020K->77337K(6747648K)] 7187007K->761395K(20794368K), 0.0835827 secs] [Times: user=0.12 sys=0.00, real=0.09 secs] [GC (Allocation Failure) [PSYoungGen: 6505125K->66972K(6783488K)] 7189184K->751110K(20830208K), 0.0671168 secs] [Times: user=0.09 sys=0.00, real=0.06 secs] [GC (Allocation Failure) [PSYoungGen: 6544467K->201750K(6734336K)] 7228606K->885969K(20781056K), 0.0931485 secs] [Times: user=0.20 sys=0.00, real=0.10 secs] [GC (Allocation Failure) [PSYoungGen: 6733846K->43858K(6763520K)] 7418065K->728157K(20810240K), 0.0683046 secs] [Times: user=0.09 sys=0.00, real=0.07 secs] [GC (Allocation Failure) [PSYoungGen: 6525200K->66239K(6756864K)] 7209498K->750617K(20803584K), 0.0764187 secs] [Times: user=0.10 sys=0.00, real=0.08 secs]
We were going through the thread dump, but couldn't find a reason for such behaviour
"flink-rest-server-netty-worker-thread-8" - Thread t@126233 java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <2654bbbf> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet) - locked <45bd4356> (a java.util.Collections$UnmodifiableSet) - locked <24d4aa0> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers: - None
"flink-rest-server-netty-worker-thread-7" - Thread t@123834 java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <2a9b97d7> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet) - locked <29115878> (a java.util.Collections$UnmodifiableSet) - locked <61909ff> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers: - None
"flink-rest-server-netty-worker-thread-6" - Thread t@121404 java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <6ee94bd6> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet) - locked <183a4cf3> (a java.util.Collections$UnmodifiableSet) - locked <3b4b1514> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers: - None
"flink-rest-server-netty-worker-thread-5" - Thread t@118945 java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <7553b573> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet) - locked <51c0f7f4> (a java.util.Collections$UnmodifiableSet) - locked <1bd774eb> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers: - None
"flink-rest-server-netty-worker-thread-4" - Thread t@117073 java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <78572130> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet) - locked <1047fb64> (a java.util.Collections$UnmodifiableSet) - locked <466ec1cf> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers: - None
"gcs-async-channel-pool-19" - Thread t@116997 java.lang.Thread.State: TIMED_WAITING at sun.misc.Unsafe.park(Native Method) - parking to wait for <1be52a4a> (a java.util.concurrent.SynchronousQueue$TransferStack) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460) at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362) at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers: - None
"RMI TCP Connection(36)-127.0.0.1" - Thread t@114070 java.lang.Thread.State: TIMED_WAITING at java.lang.Object.wait(Native Method) - waiting on <22b14b0b> (a com.sun.jmx.remote.internal.ArrayNotificationBuffer) at com.sun.jmx.remote.internal.ArrayNotificationBuffer.fetchNotifications(ArrayNotificationBuffer.java:449) at com.sun.jmx.remote.internal.ArrayNotificationBuffer$ShareBuffer.fetchNotifications(ArrayNotificationBuffer.java:227) at com.sun.jmx.remote.internal.ServerNotifForwarder.fetchNotifs(ServerNotifForwarder.java:274) at javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1270) at javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1268) at javax.management.remote.rmi.RMIConnectionImpl.fetchNotifications(RMIConnectionImpl.java:1274) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357) at sun.rmi.transport.Transport$1.run(Transport.java:200) at sun.rmi.transport.Transport$1.run(Transport.java:197) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.Transport.serviceCall(Transport.java:196) at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$592/690739656.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
Hi,
It seems to be good based on your GC metrics. You could double check the GC log if you enable it. The GC log is more direct. I'm not sure what's happening in your JobManager. But I'm pretty sure that Flink could support far more larger scale cluster than yours.
Have you ever checked the log file of JobManager? Is there any suspicious warning or error log? Have you ever tried some analytic tools to check the internal state of JobManager, like jstack.
It's hard to do a deeper analysis based on current informations. It might be helpful if you could provide more details.
Hi,
We are using v1.6.2, currently, the number of TaskManagers are 70. We have the GC metrics on a dashboard also. Sum of Status.JVM.GarbageCollector.MarkSweepCompact.Time grouped by 1 min is somewhere between 75 to 125 and Status.JVM.GarbageCollector.MarkSweepCompact.Count is fixed at 10.
Hi Prakhar,
Have you ever checked the garbage collection of master? Which version of Flink are you using? How many TaskManagers in your cluster?
Hello,
We have deployed multiple Flink clusters on Kubernetess with 1 replica of Jobmanager and multiple of Taskmanager as per the requirement. Recently we are observing that on increasing the number of Taskmanagers for a cluster, the Jobmanager becomes irresponsive. It stops sending statsd metric for some irregular interval. Even the Jobmanager pod keeps restarting because it stops responding to the liveliness probe which results in Kubernetes killing the pod. We tried increasing the resources given(CPU, RAM) but it didn't help.
Regards Prakhar Mathur Product Engineer GO-JEK
|