Flink 1.6 Job fails with IllegalStateException: Buffer pool is destroyed.


杨力
Hi all,
I am encountering a weird problem when running Flink 1.6 in YARN per-job clusters.
The job fails about half an hour after it starts. The related log is attached as an image.

This piece of log comes from one of the TaskManagers. There are no other related log lines
and no ERROR-level logs. The job just runs for tens of minutes without printing any logs
and then suddenly throws this exception.

It is reproducible in my production environment, but not in my test environment.
The 'Buffer pool is destroyed' exception is always thrown while emitting a latency marker.

Attachment: image.png (92K)

Re: Flink 1.6 Job fails with IllegalStateException: Buffer pool is destroyed.

Zhijiang(wangzhijiang999)
Hi,

I think the problem in the attached image is not the root cause of your job failure. There must be other task or TaskManager failures; all the related tasks are then cancelled by the JobManager, and the problem in the attached image is just a consequence of that task cancellation.

You can review the JobManager log to check whether any failure caused the whole job to fail.
FYI, the TaskManager may be killed by YARN for exceeding its memory limit. You mentioned that the job fails about half an hour after it starts, so I guess there is a possibility that the TaskManager is killed by YARN.
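
As a rough illustration of the log review suggested above, the following Java sketch returns the first line in a list of log lines that reports a transition to FAILED, which should precede the cancellations it triggers. The class name, the method name, and the sample lines are hypothetical, and the "to FAILED" wording is an assumption about how state transitions appear in your JobManager log; adjust the match to what your log actually contains.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Flink.
class FirstFailureFinder {

    // Returns the first line that mentions a transition to FAILED, if any.
    // The "to FAILED" substring is an assumed log wording.
    static Optional<String> firstFailure(List<String> lines) {
        return lines.stream()
                .filter(line -> line.contains("to FAILED"))
                .findFirst();
    }

    public static void main(String[] args) {
        // Made-up sample lines for illustration only.
        List<String> sample = Arrays.asList(
                "INFO  Map (1/4) switched from DEPLOYING to RUNNING.",
                "INFO  Map (3/4) switched from RUNNING to FAILED.",
                "INFO  Map (1/4) switched from RUNNING to CANCELING.");
        System.out.println(
                firstFailure(sample).orElse("no FAILED transition found"));
    }
}
```

In practice the lines would come from the JobManager log file, for example via `java.nio.file.Files.readAllLines`, and the log must be written at INFO level or lower for such transitions to appear at all.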

Best,
Zhijiang
------------------------------------------------------------------
From: 杨力 <[hidden email]>
Sent: Friday, September 7, 2018, 13:09
To: user <[hidden email]>
Subject: Flink 1.6 Job fails with IllegalStateException: Buffer pool is destroyed.



Re: Flink 1.6 Job fails with IllegalStateException: Buffer pool is destroyed.

vino yang
In reply to this post by 杨力
Hi Bill,

Can you provide more information, such as whether checkpointing is enabled, whether exactly-once is specified, and whether the Flink web UI shows any back pressure?
Here is a ticket that also reports this problem. [1]
The same question has also been asked on Stack Overflow, but I don't know if the answer is valid. [2]


Thanks, vino.


Re: Flink 1.6 Job fails with IllegalStateException: Buffer pool is destroyed.

杨力
Thank you for your advice. I had not noticed that the log level was set to WARN.
The INFO logs suggest that the job fails because of an Akka timeout, and the root cause is a long GC pause.
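
For anyone who wants to check for the same cause, here is a minimal Java sketch that reads the accumulated garbage-collection time of the JVM it runs in through the standard `java.lang.management` API. The class and method names are hypothetical, and the reported value is the collectors' accumulated elapsed time, which only approximates pause time for concurrent collectors; it is a sketch of the measurement, not how Flink itself reports GC metrics.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Flink.
class GcTimeProbe {

    // Sums the accumulated collection time (in ms) over all collectors
    // of the current JVM. getCollectionTime() returns -1 when undefined,
    // so non-positive values are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc
                : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcMillis();
        Thread.sleep(1000); // observation window of one second
        long after = totalGcMillis();
        System.out.println(
                "GC time during the window: " + (after - before) + " ms");
    }
}
```

If the GC time within a window comes close to the length of the window itself, the process is spending most of its time collecting, which is the kind of situation in which timeouts between components can fire.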

On Fri, Sep 7, 2018 at 5:43 PM Zhijiang(wangzhijiang999) <[hidden email]> wrote:
You may need to configure at least the INFO level for the Flink logger; currently the messages are too limited to debug the problem.

Best,
Zhijiang
------------------------------------------------------------------
From: 杨力 <[hidden email]>
Sent: Friday, September 7, 2018, 17:21
To: Zhijiang(wangzhijiang999) <[hidden email]>
Subject: Re: Flink 1.6 Job fails with IllegalStateException: Buffer pool is destroyed.

I have checked the logs from the YARN NodeManagers, and there is no record of any kill action. There is no record of a job cancellation in the JobManager's log either.

Here are the job logs retrieved from YARN.

https://pastebin.com/raw/1yHLYR65

On Fri, Sep 7, 2018 at 3:22 PM Zhijiang(wangzhijiang999) <[hidden email]> wrote: