Hi to all,

I have a Flink 1.3.1 job that runs multiple times. Everything goes well for a while (e.g. 10 jobs). Then one or more TaskManagers (TMs) suddenly die. In the .out file I find something like this:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f6f3897712f, pid=18794, tid=140110535448320
#
# JRE version: Java(TM) SE Runtime Environment (8.0_72-b15) (build 1.8.0_72-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.72-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libc.so.6+0x7f12f]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/user/hs_err_pid18794.
#
# If you would like to submit a bug report, please visit:
#

I have attached the resulting error report. Do you see anything useful in it? I can also send you the job's jar together with the data, but it is about 200 MB.

Best,
Flavio

hs_err_pid18794.log (104K)
Hi,
that looks like a known issue where Flink did not wait for the timer service to shut down before disposing the state backends. This problem is fixed in the 1.4 and later branches.

Best,
Stefan
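To illustrate the ordering problem described above in general terms, here is a minimal Java sketch. It assumes a plain `ScheduledExecutorService` standing in for the timer service and a hypothetical `FakeNativeBackend` standing in for a state backend that wraps native memory; `shutdownInOrder` is likewise a hypothetical helper. None of these are Flink's actual classes or the actual fix. The point is only the order: stop the timers, wait until no callback is still running, and only then dispose the backend.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownOrderSketch {

    /** Hypothetical stand-in for a state backend backed by native memory. */
    static final class FakeNativeBackend implements AutoCloseable {
        private final AtomicBoolean disposed = new AtomicBoolean(false);

        void update(long value) {
            if (disposed.get()) {
                // A real native backend would not throw here; touching freed
                // native memory is what can crash the JVM with a SIGSEGV.
                throw new IllegalStateException("backend already disposed");
            }
        }

        boolean isDisposed() {
            return disposed.get();
        }

        @Override
        public void close() {
            disposed.set(true);
        }
    }

    /** Stops the timers, waits for running callbacks, and only then disposes the backend. */
    static boolean shutdownInOrder(ScheduledExecutorService timers,
                                   FakeNativeBackend backend,
                                   long timeoutMillis) throws InterruptedException {
        timers.shutdownNow(); // cancel pending timers and interrupt running ones
        boolean terminated = timers.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        if (terminated) {
            backend.close(); // safe: no timer callback can reach the backend anymore
        }
        return terminated; // false means callbacks may still run, so the backend is left alive
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService timers = Executors.newSingleThreadScheduledExecutor();
        FakeNativeBackend backend = new FakeNativeBackend();
        // Periodic "timer" callback that keeps writing to the backend.
        timers.scheduleAtFixedRate(() -> backend.update(42L), 0, 1, TimeUnit.MILLISECONDS);
        Thread.sleep(20);
        boolean ok = shutdownInOrder(timers, backend, 1000);
        System.out.println("terminated=" + ok + ", disposed=" + backend.isDisposed());
    }
}
```

Disposing the backend without the `awaitTermination` step would leave a window in which a timer callback that is already running can still access it, which is the kind of race that can end in a native crash.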
My job is a batch job, not a streaming one. Is it possible that the cause is the one you mentioned?

On Mon, 14 May 2018, 14:23 Stefan Richter, <[hidden email]> wrote:
No, the problem I mentioned does not affect batch jobs, so it must be something different. Unfortunately, the dump does not look very helpful to me because of the "error occurred during error reporting (printing native stack)" message.