Re: Stream Task seems to be blocked after checkpoint timeout

Posted by Tony Wei on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Stream-Task-seems-to-be-blocked-after-checkpoint-timeout-tp15861p15869.html

Hi,

These are some metrics from before I stopped the TM. The blacked-out parts are sensitive company information, so I had to hide them. Sorry about that.
I hope this helps to show what happened to my streaming job. Thank you.

This job has only two tasks. The "records out per second" metrics are for the task and operators in front of the buffer, and the "records in per second" metrics are for the task and operators behind it.
Inline image 1

Best Regards,
Tony Wei

2017-09-26 23:08 GMT+08:00 Tony Wei <[hidden email]>:
Hi,

Something weird happened on my streaming job.

I found that my streaming job seemed to be blocked for a long time, and I saw the situation shown in the picture below. (Checkpoints #1245 and #1246 both finished 7/8 tasks and were then marked as timed out by the JM. Other checkpoints failed in the same state as #1247 until I restarted the TM.)

Inline image 1

I'm not sure what happened, but the consumer stopped fetching records, buffer usage was at 100%, and the downstream task did not seem to fetch data anymore. It was as if the whole TM had stopped.

However, after I restarted the TM and forced the job to restart from the latest completed checkpoint, everything worked again. I don't know how to reproduce the problem.

The attachment is my TM log. Because it contains many user logs and sensitive information, I kept only the log lines from `org.apache.flink...`.

My cluster setup is one JM and one TM with 4 available slots.

The streaming job uses all slots, the checkpoint interval is 5 minutes, and the maximum number of concurrent checkpoints is 3.
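
As a rough sketch only (this is not the actual job code), the checkpoint settings described above would look something like this with the DataStream API. The class name is hypothetical, and the parallelism of 4 is an assumption chosen to fill the 4 slots. Compiling it requires the Flink streaming dependency on the classpath.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name, used for illustration only.
public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Assumption: parallelism 4, so that the job occupies all 4 slots of the TM.
        env.setParallelism(4);

        // Trigger a checkpoint every 5 minutes (the interval is given in milliseconds).
        env.enableCheckpointing(5 * 60 * 1000L);

        // Allow up to 3 checkpoints to be in progress at the same time.
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(3);

        // Sources, operators, sinks and the env.execute(...) call are omitted here.
    }
}
```

With more than one concurrent checkpoint allowed, a new checkpoint can be triggered while an earlier one is still pending, which would be consistent with several checkpoints appearing as in progress at once in the picture.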

Please let me know if more information is needed to find out what happened to my streaming job. Thanks for your help.

Best Regards,
Tony Wei