Hi all,
I am encountering a weird problem when running Flink 1.6 in YARN per-job clusters. The job fails about half an hour after it starts. The related log is attached as an image; it comes from one of the TaskManagers. There are no other related log lines and no ERROR-level logs. The job just runs for tens of minutes without printing anything and then suddenly throws this exception. It is reproducible in my production environment, but not in my test environment. The 'Buffer pool is destroyed' exception is always thrown while emitting a latency marker.
Hi, I think the problem in the attached image is not the root cause of your job failure. There must be some other task or TaskManager failure; the JobManager then cancels all the related tasks, and the exception in the attached image is just a side effect of task cancellation. You can review the JobManager log to check whether any failure there caused the whole job to fail. FYI, the TaskManager may also be killed by YARN for exceeding its memory limits. You mentioned the job fails about half an hour after it starts, so I would not rule out the possibility that the TaskManager was killed by YARN. Best, Zhijiang
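To follow the advice above, one way to locate the first real failure is to scan the JobManager log rather than the TaskManager log from the image. A minimal sketch (the `yarn logs` invocation is the standard way to fetch aggregated logs; the sample log lines below are invented purely to demonstrate the grep pattern, not taken from the reporter's job):

```shell
# With YARN, aggregated logs can be fetched first, e.g.:
#   yarn logs -applicationId application_XXXX_YYYY > app.log
# Here we build a tiny sample jobmanager.log just to show what to look for.
cat > jobmanager.log <<'EOF'
2018-09-07 13:10:01,100 INFO  Job myjob switched from state RUNNING to FAILING.
2018-09-07 13:10:01,101 WARN  Association with remote system failed.
2018-09-07 13:10:02,000 INFO  Task Source (1/4) switched from RUNNING to CANCELING.
EOF
# The earliest FAILING / exception line points at the root cause; later
# "Buffer pool is destroyed" entries are fallout from task cancellation.
grep -nE 'FAILING|Exception|failed' jobmanager.log | head -3
```

The first matching line (the transition to FAILING, or the first exception) is usually the actual failure; everything after it is the JobManager tearing the job down.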
In reply to this post by 杨力
Hi Bill, Can you provide more information, such as whether checkpointing is enabled, whether exactly-once mode is specified, and whether any back pressure shows up in the Flink web UI? Here is a ticket that also reports this question. [1] The same question has also been asked on Stack Overflow, but I don't know if the answer is valid. [2] Thanks, vino. 杨力 <[hidden email]> wrote on Fri, Sep 7, 2018 at 1:09 PM:
Thank you for your advice. I had not noticed that the log level was set to WARN. The INFO logs suggest that the job fails because of an akka timeout, and the root cause is a long GC pause. On Fri, Sep 7, 2018 at 5:43 PM Zhijiang(wangzhijiang999) <[hidden email]> wrote:
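For reference, a minimal sketch of the knobs involved in this kind of diagnosis (the values below are illustrative assumptions, not recommendations; verify them against your own Flink 1.6 conf/ directory):

```yaml
# conf/flink-conf.yaml (illustrative values, not recommendations)
# Raise the akka ask timeout so long GC pauses are less likely to
# trip RPC timeouts (the Flink 1.6 default is 10 s):
akka.ask.timeout: 60 s
# Surface GC pauses in the TaskManager logs:
env.java.opts: "-XX:+PrintGCDetails -XX:+PrintGCDateStamps"
```

The log level itself lives in conf/log4j.properties (e.g. `log4j.rootLogger=INFO, file`), which is where the WARN setting mentioned above would have hidden the akka timeout messages. Raising the timeout only masks the symptom; the GC pause itself still needs to be addressed (heap sizing, state size, etc.).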