Hi Flink Community,
In our Flink deployments, we see some Flink threads become CPU-busy/stuck after a few hours, with the stack trace below:
"Sink:AggregationSink (2/4)" #567 daemon prio=5 os_prio=0 tid=0x00007f901dc97000 nid=0x254 runnable [0x00007f8fe017f000]
java.lang.Thread.State: RUNNABLE
at org.apache.flink.api.java.typeutils.runtime.DataInputViewStream.read(DataInputViewStream.java:70)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:146)
at org.apache.flink.api.java.typeutils.runtime.NoFetchingInput.require(NoFetchingInput.java:76)
at com.esotericsoftware.kryo.io.Input.readVarInt(Input.java:355)
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:641)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:752)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:236)
at org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.deserialize(StreamElementSerializer.java:187)
at org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.deserialize(StreamElementSerializer.java:40)
at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:109)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:147)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:261)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:656)
at java.lang.Thread.run(Thread.java:745)
A few observations from the stack while it is stuck:
In SpillingAdaptiveSpanningRecordDeserializer.getNextRecord, nonSpanningRemaining = 13.
All 13 of these bytes had already been read before reaching com.esotericsoftware.kryo.io.Input.readVarInt.
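To make the suspected failure mode concrete, here is a minimal sketch of how a "require N bytes" loop can stay RUNNABLE and burn CPU once the available bytes are used up. This is not Flink or Kryo code. ZeroAtEndStream and requireWithGuard are hypothetical names, and the idea that the underlying read returns 0 rather than -1 after the 13 bytes are consumed is an unverified assumption on our side.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class SpinningReadSketch {

    /** Hypothetical stream: serves `remaining` bytes, then returns 0 (not -1) on bulk reads. */
    static final class ZeroAtEndStream extends InputStream {
        private int remaining;

        ZeroAtEndStream(int remaining) {
            this.remaining = remaining;
        }

        @Override
        public int read() {
            if (remaining == 0) {
                return -1;
            }
            remaining--;
            return 0;
        }

        @Override
        public int read(byte[] b, int off, int len) {
            int n = Math.min(len, remaining);
            remaining -= n;
            // Assumption under test: 0 is returned once exhausted.
            // A well-behaved InputStream would return -1 here.
            return n;
        }
    }

    /**
     * Reads exactly `required` bytes. Without the zero-read guard, this loop
     * would never terminate on a stream that keeps returning 0.
     */
    static int requireWithGuard(InputStream in, int required, int maxZeroReads) {
        byte[] buffer = new byte[required];
        int bytesRead = 0;
        int zeroReads = 0;
        while (bytesRead < required) {
            int n;
            try {
                n = in.read(buffer, bytesRead, required - bytesRead);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (n == -1) {
                throw new UncheckedIOException(new EOFException("No more bytes left."));
            }
            if (n == 0) {
                // No progress: count consecutive empty reads and bail out eventually.
                if (++zeroReads >= maxZeroReads) {
                    throw new IllegalStateException("read made no progress " + zeroReads + " times");
                }
            } else {
                zeroReads = 0;
                bytesRead += n;
            }
        }
        return bytesRead;
    }

    public static void main(String[] args) {
        InputStream in = new ZeroAtEndStream(13);
        System.out.println("read " + requireWithGuard(in, 13, 1000) + " bytes");
        try {
            // All 13 bytes are gone; asking for one more makes no progress.
            requireWithGuard(in, 1, 1000);
        } catch (IllegalStateException e) {
            System.out.println("guard tripped: " + e.getMessage());
        }
    }
}
```

If this is what happens in our job, it would point to a record whose serialized length and content disagree, so that deserialization asks for more bytes than the record holds. We have not confirmed that.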
We are wondering whether this is a serialization bug or memory segment corruption.
Any pointers on how to debug this further would be much appreciated.
Regards,
Mehar