flink 1.7.2 freezes, waiting indefinitely for the buffer availability


Indraneel R
Hi,

We are trying to run a very simple Flink pipeline that sessionizes events from a Kinesis stream. It uses:
 - an event-time session window with a 30-minute gap,
 - a continuous trigger interval of 15 minutes, and
 - a late-arrival (allowed lateness) duration of 10 hours.
This is how the job graph looks (a minimal sketch of the pipeline follows the screenshot).

[Attached image: Screenshot 2019-04-10 at 12.08.25 AM.png]
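For reference, here is a minimal sketch of such a pipeline in the Scala DataStream API. The `Event` case class, the `sessionKey` field and the `SessionSummary` function are illustrative placeholders, not our actual code:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.ContinuousEventTimeTrigger
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class Event(sessionKey: String, timestamp: Long, payload: String)

// Placeholder window function standing in for the ScalaProcessWindowFunctionWrapper in the graph
class SessionSummary extends ProcessWindowFunction[Event, String, String, TimeWindow] {
  override def process(key: String, ctx: Context,
                       elements: Iterable[Event], out: Collector[String]): Unit =
    out.collect(s"$key: ${elements.size} events in [${ctx.window.getStart}, ${ctx.window.getEnd})")
}

object SessionizeSketch {
  def build(events: DataStream[Event]): DataStream[String] =
    events
      .keyBy(_.sessionKey)
      // sessions close after a 30-minute event-time gap
      .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
      // fire intermediate (early) results every 15 minutes of event time
      .trigger(ContinuousEventTimeTrigger.of[TimeWindow](Time.minutes(15)))
      // keep window state and accept late events for up to 10 hours
      .allowedLateness(Time.hours(10))
      .process(new SessionSummary)
}
```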
What we are observing is that after 2-3 days of continuous running, the job becomes progressively unstable and eventually freezes completely.

Thread dump analysis revealed that it is waiting indefinitely at
    `LocalBufferPool.requestMemorySegment(LocalBufferPool.java:261)`
for a memory segment to become available.
While it is waiting it holds the checkpoint lock, and therefore blocks all other threads as well, since they are all waiting to synchronize on the `checkpointLock` object.
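To illustrate the pattern we think we are seeing (this is only a schematic of the locking shape, not Flink's actual source code): the task thread emits records while holding the checkpoint lock, the emit can park inside `requestMemorySegment` when no buffer is free, and the checkpoint/timer threads then queue up on the same lock.

```scala
// Schematic of the observed contention; NOT Flink source code, just the shape of it.
class TaskContentionSketch(checkpointLock: AnyRef) {

  // Task thread: processes and emits records while holding the checkpoint lock.
  def processElement(record: AnyRef): Unit =
    checkpointLock.synchronized {
      emit(record) // can park in LocalBufferPool.requestMemorySegment if no buffer is free
    }

  // Checkpoint / timer threads: need the same lock, so they block while the
  // task thread is parked waiting for a network buffer.
  def triggerCheckpoint(): Unit =
    checkpointLock.synchronized {
      // snapshot operator state ...
    }

  private def emit(record: AnyRef): Unit = {
    // serialize the record and request a memory segment from the local buffer pool
  }
}
```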

But we are not able to figure out why it cannot get a segment, because there is no indication of backpressure, at least in the Flink UI.
Here is our job configuration (a rough network-buffer estimate based on these numbers follows the list):

number of TaskManagers: 4
jobmanager.heap.size: 8000m
taskmanager.heap.size: 11000m
taskmanager.numberOfTaskSlots: 4
parallelism.default: 16
taskmanager.network.memory.max: 5gb
taskmanager.network.memory.min: 3gb
taskmanager.network.memory.buffers-per-channel: 8
taskmanager.network.memory.floating-buffers-per-gate: 16
taskmanager.memory.size: 13gb  

data rate: 250 messages/sec (~1 MB/sec)
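
For context, here is our own back-of-envelope estimate of what one keyed shuffle should need from the network pool, assuming the default 32 KB segment size (a rough sketch, not Flink's exact internal accounting):

```scala
// Rough estimate of network buffers used by one input gate of the keyed shuffle,
// based on the configuration above. Back-of-envelope only.
object BufferEstimate extends App {
  val parallelism       = 16          // parallelism.default
  val buffersPerChannel = 8           // taskmanager.network.memory.buffers-per-channel
  val floatingPerGate   = 16          // taskmanager.network.memory.floating-buffers-per-gate
  val segmentSizeBytes  = 32 * 1024   // default taskmanager.memory.segment-size (32 KB)

  // a keyed shuffle gives each downstream subtask one input channel per upstream subtask
  val buffersPerGate = parallelism * buffersPerChannel + floatingPerGate // 144
  val slotsPerTm     = 4                                                 // taskmanager.numberOfTaskSlots
  val buffersPerTm   = buffersPerGate * slotsPerTm                       // 576
  val mbPerTm        = buffersPerTm.toLong * segmentSizeBytes / (1024.0 * 1024) // ~18 MB

  println(f"buffers per gate: $buffersPerGate, per TM: $buffersPerTm, ~$mbPerTm%.0f MB")
}
```

Even counting the output side as well, that is far below the configured 3-5 GB network pool.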

Any ideas on what could be the issue? 

regards
-Indraneel



Attachments: thread-dump-analysis.txt (8K), source-blocking-other-threads.png (101K)

Re: flink 1.7.2 freezes, waiting indefinitely for the buffer availability

Rahul Jain
We are also seeing something very similar. Looks like a bug. 

It seems to get stuck in LocalBufferPool forever and the job has to be restarted.

Is anyone else facing this too?

Re: flink 1.7.2 freezes, waiting indefinitely for the buffer availability

Guowei Ma
Hi,
Could you jstack the downstream task (the window operator) and have a look at what the window operator is doing?
Best,
Guowei


Re: flink 1.7.2 freezes, waiting indefinitely for the buffer availability

Indraneel R
Hi,

We analysed that, and even some of the sink threads seem to be waiting on a lock, but it is not clear which object they are waiting for.

Async calls on Window(EventTimeSessionWindows(1800000), ContinuousEventTimeTrigger, ScalaProcessWindowFunctionWrapper) -> Map -> Sink: Unnamed (14/16) - priority:5 - threadId:0x00007fa5ac008800 - nativeId:0xf5e - nativeId (decimal):3934 - state:WAITING
stackTrace:
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000060199b4d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

Attached is a complete thread dump.

regards
-Indraneel




Attachment: jstack-2v9lj.txt (174K)