(DEPRECATED) Apache Flink User Mailing List archive.

Taskmanager JVM crash

Flavio Pompermaier

Taskmanager JVM crash

Hi to all,

I have a Flink 1.3.1 job that runs multiple times.

Everything goes well for some time (e.g. 10 jobs). Then, one or more TMs suddenly die.

In the .out file I find something like this:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f6f3897712f, pid=18794, tid=140110535448320
#
# JRE version: Java(TM) SE Runtime Environment (8.0_72-b15) (build 1.8.0_72-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.72-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libc.so.6+0x7f12f]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/user/hs_err_pid18794.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#

Attached the produced error report. Do you find anything useful?

I can also send you the job's jar together with the data, but it is about 200 MB.

Best,

Flavio

Attachment: hs_err_pid18794.log (104K)
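For readers triaging a crash like this one, the header fields can be pulled out programmatically. The following is a minimal sketch, not part of the original thread: the function name `parse_crash_header` and the regular expressions are illustrative assumptions, and the sample text is copied from the .out excerpt above.

```python
import re

# Header lines copied from the .out excerpt in the message above.
SAMPLE = """\
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f6f3897712f, pid=18794, tid=140110535448320
#
# Problematic frame:
# C [libc.so.6+0x7f12f]
#
"""

# Both patterns are assumptions based only on the layout seen in this excerpt.
SIGNAL_RE = re.compile(
    r"#\s*(SIG\w+) \((0x[0-9a-fA-F]+)\) at pc=(0x[0-9a-fA-F]+), pid=(\d+), tid=(\d+)"
)
FRAME_RE = re.compile(r"#\s*([A-Za-z])\s+\[([^\]+]+)\+(0x[0-9a-fA-F]+)\]")


def parse_crash_header(text):
    """Return a dict with the signal and problematic-frame fields found in text."""
    info = {}
    m = SIGNAL_RE.search(text)
    if m:
        info["signal"] = m.group(1)
        info["signal_code"] = m.group(2)
        info["pc"] = int(m.group(3), 16)
        info["pid"] = int(m.group(4))
        info["tid"] = int(m.group(5))
    m = FRAME_RE.search(text)
    if m:
        # In HotSpot's frame legend, "C" marks native code.
        info["frame_type"] = m.group(1)
        info["library"] = m.group(2)
        info["offset"] = int(m.group(3), 16)
    return info


if __name__ == "__main__":
    info = parse_crash_header(SAMPLE)
    print(info["signal"], info["library"], hex(info["offset"]))
    # pc minus the frame offset: a candidate load address for the library.
    print(hex(info["pc"] - info["offset"]))
```

For this header, pc minus the frame offset is page-aligned, which is consistent with the crash address lying inside libc.so.6 itself, i.e. in native code rather than in compiled Java code. That alone does not say which caller passed libc a bad pointer.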

Stefan Richter

Re: Taskmanager JVM crash

Hi,

That looks like a known issue where Flink did not wait for the timer service to shut down before disposing of the state backends. This problem is fixed in the 1.4 and later branches.
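The ordering problem described here can be illustrated generically. The sketch below is not Flink code: `FakeStateBackend` and `run_and_shutdown` are hypothetical names, and a Python thread pool stands in for the timer service. It only shows the safe order, which is to wait for pending callbacks to finish before disposing of the resource they use.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeStateBackend:
    """Hypothetical stand-in for a resource that must not be used after dispose()."""

    def __init__(self):
        self.disposed = False
        self.writes = 0
        self._lock = threading.Lock()

    def write(self):
        with self._lock:
            if self.disposed:
                raise RuntimeError("write after dispose")
            self.writes += 1

    def dispose(self):
        with self._lock:
            self.disposed = True


def run_and_shutdown(num_timers=50):
    """Run timer callbacks, then shut down in the safe order."""
    backend = FakeStateBackend()
    timers = ThreadPoolExecutor(max_workers=4)
    futures = [timers.submit(backend.write) for _ in range(num_timers)]
    # First wait until every pending callback has finished ...
    timers.shutdown(wait=True)
    # ... and only then dispose of the resource the callbacks touch.
    backend.dispose()
    errors = [f.exception() for f in futures if f.exception() is not None]
    return backend.writes, errors


if __name__ == "__main__":
    writes, errors = run_and_shutdown()
    print(writes, errors)
```

If `dispose()` were called before `shutdown(wait=True)`, some callbacks could run against a disposed backend. In this Python sketch that raises an exception; with natively allocated memory, the same mistake could plausibly end in a segmentation fault instead.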

Best,

Stefan

On 14.05.2018 at 14:12, Flavio Pompermaier <[hidden email]> wrote:

Hi to all,
I have a Flink 1.3.1 job that runs multiple times.
Everything goes well for some time (e.g. 10 jobs). Then, one or more TMs suddenly die.

In the .out file I find something like this:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f6f3897712f, pid=18794, tid=140110535448320
#
# JRE version: Java(TM) SE Runtime Environment (8.0_72-b15) (build 1.8.0_72-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.72-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libc.so.6+0x7f12f]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/user/hs_err_pid18794.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#

Attached the produced error report. Do you find anything useful?
I can also send you the job's jar together with the data, but it is about 200 MB.

Best,
Flavio

<hs_err_pid18794.log>

Flavio Pompermaier

Re: Taskmanager JVM crash

My job is a batch job, not a streaming job. Is it possible that the cause is the one you mentioned?

On Mon, 14 May 2018, 14:23 Stefan Richter, <[hidden email]> wrote:

Hi,

That looks like a known issue where Flink did not wait for the timer service to shut down before disposing of the state backends. This problem is fixed in the 1.4 and later branches.

Best,
Stefan

Stefan Richter

Re: Taskmanager JVM crash

No, the problem I mentioned does not affect batch jobs. It must be something different then, but unfortunately the dump is not very helpful to me because of the "error occurred during error reporting (printing native stack)" message.
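To check quickly whether an error report suffers from this limitation, one can scan it for that phrase. This is a minimal sketch: the helper name `failed_report_steps` is hypothetical, and the sample text is made up around the phrase quoted above rather than taken from the attached report.

```python
import re

# Matches the phrase quoted above and captures the step that failed.
MARKER_RE = re.compile(r"error occurred during error reporting \(([^)]*)\)")

# Made-up sample built around the quoted phrase; the real report is not shown in the thread.
SAMPLE_REPORT = """\
# Problematic frame:
# C [libc.so.6+0x7f12f]
[error occurred during error reporting (printing native stack)]
"""


def failed_report_steps(report_text):
    """Return the reporting steps that the JVM itself failed to complete."""
    return MARKER_RE.findall(report_text)


if __name__ == "__main__":
    print(failed_report_steps(SAMPLE_REPORT))
```

A non-empty result means part of the report is missing, so other sources such as a core dump (enabled with "ulimit -c unlimited", as the JVM message suggests) may be needed.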

On 14.05.2018 at 14:26, Flavio Pompermaier <[hidden email]> wrote:

My job is a batch job, not a streaming job. Is it possible that the cause is the one you mentioned?

On Mon, 14 May 2018, 14:23 Stefan Richter, <[hidden email]> wrote:
Hi,

That looks like a known issue where Flink did not wait for the timer service to shut down before disposing of the state backends. This problem is fixed in the 1.4 and later branches.

Best,
Stefan
