Taskmanager JVM crash

Flavio Pompermaier
Hi to all,
I have a Flink 1.3.1 job that runs multiple times.
Everything goes well for some time (e.g. 10 jobs). Then one or more TMs suddenly die.

In the .out file I find something like this:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f6f3897712f, pid=18794, tid=140110535448320
#
# JRE version: Java(TM) SE Runtime Environment (8.0_72-b15) (build 1.8.0_72-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.72-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libc.so.6+0x7f12f]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/user/hs_err_pid18794.log
#
# If you would like to submit a bug report, please visit:
#


I have attached the generated error report. Do you see anything useful in it?
I can also send you the job's jar with the data, but it is about 200 MB.

Best,
Flavio

Attachment: hs_err_pid18794.log (104K)
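
When comparing crashes across several TaskManagers, it can help to pull the key fields out of the header quoted above. The following is only a minimal sketch: the helper name `parse_crash_header` and both regular expressions are assumptions written against the lines shown in this post, not a general hs_err parser.

```python
import re

# Hypothetical helper, written only against the header lines quoted in this thread.
# Matches e.g. "#  SIGSEGV (0xb) at pc=0x..., pid=18794, tid=140110535448320"
SIGNAL_RE = re.compile(
    r"#\s+(SIG\w+) \((0x[0-9a-f]+)\) at pc=(0x[0-9a-f]+), pid=(\d+), tid=(\d+)"
)
# Matches e.g. "# C  [libc.so.6+0x7f12f]"; in hs_err files "C" marks a native-code frame.
FRAME_RE = re.compile(r"#\s+([A-Za-z])\s+\[([^\]+]+)\+(0x[0-9a-f]+)\]")


def parse_crash_header(text):
    """Return a dict with the signal, pc, pid, tid and problematic frame, if present."""
    info = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        m = SIGNAL_RE.match(line)
        if m:
            info["signal"] = m.group(1)
            info["pc"] = m.group(3)
            info["pid"] = int(m.group(4))
            info["tid"] = int(m.group(5))
        # The frame itself is on the line following "# Problematic frame:".
        if line.startswith("# Problematic frame:") and i + 1 < len(lines):
            f = FRAME_RE.match(lines[i + 1])
            if f:
                info["frame_type"] = f.group(1)
                info["library"] = f.group(2)
                info["offset"] = f.group(3)
    return info


if __name__ == "__main__":
    sample = (
        "#  SIGSEGV (0xb) at pc=0x00007f6f3897712f, pid=18794, tid=140110535448320\n"
        "#\n"
        "# Problematic frame:\n"
        "# C  [libc.so.6+0x7f12f]\n"
    )
    print(parse_crash_header(sample))
```

For the header in this post, the sketch reports a SIGSEGV in a native frame inside libc.so.6, which only says where the fault surfaced, not which code passed libc a bad pointer.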
Re: Taskmanager JVM crash

Stefan Richter
Hi,

That looks like a known issue where Flink did not wait for the timer service to shut down before disposing of the state backends. This problem is fixed in the 1.4 and later branches.

Best,
Stefan 

(In reply to Flavio Pompermaier's message of 14 May 2018, 14:12.)


Re: Taskmanager JVM crash

Flavio Pompermaier
My job is a batch job, not a streaming one. Could the cause still be the one you mentioned?

(In reply to Stefan Richter's message of 14 May 2018, 14:23.)


Re: Taskmanager JVM crash

Stefan Richter
No, the problem I mentioned does not affect batch jobs, so it must be something different. Unfortunately, the dump does not look very helpful to me because of the "error occurred during error reporting (printing native stack)" message.
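
To see quickly whether a given report is affected in the same way, one could scan it for that message. This is a rough sketch under the assumption that such failures appear in the file as text of the form "error occurred during error reporting (<step>)", as quoted above; the helper name `failed_report_steps` is made up for illustration.

```python
import re

# Assumption: the JVM marks a failed report section with text of this form,
# with the failed step named in parentheses (as in the message quoted above).
MARKER_RE = re.compile(r"error occurred during error reporting \(([^)]*)\)")


def failed_report_steps(report_text):
    """Return the names of the report steps that failed, in order of appearance."""
    return MARKER_RE.findall(report_text)


if __name__ == "__main__":
    # Illustrative input only, not taken from the attached file.
    sample = (
        "Stack: ...\n"
        "[error occurred during error reporting (printing native stack)]\n"
    )
    for step in failed_report_steps(sample):
        print("report section incomplete:", step)
```

If the list is non-empty for "printing native stack", the native stack trace is missing from the report, so a core dump (enabled with "ulimit -c unlimited", as the JVM output suggests) is likely the better source of information.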

(In reply to Flavio Pompermaier's message of 14 May 2018, 14:26.)