SIGSEGV error


Marek Maj
Hi everyone, 

Recently we decided to upgrade from Flink 1.7.2 to 1.8.1. After the upgrade, our task managers started to fail intermittently with a SIGSEGV error.

While adjusting the code to 1.8.1, we noticed some changes around the TypeSerializerSnapshot interface and its implementations. At that time we had a few custom serializers, which we decided to drop during the migration in favor of Flink's default serializers. We don't mind clearing the state as part of the migration; migrating with state does not seem worth the effort.

Unfortunately, after running the new version we see SIGSEGV errors from time to time. Serialization may not be the real cause, but at the moment it seems the most probable one. We have not made any significant code changes outside the serialization area.

We run the job on YARN, HDP version 2.7.3.2.6.2.0-205.
Checkpoint configuration: RocksDB backend, not incremental, 50s min processing time

You can find parts of the JobManager log and the ErrorFile log of the failed container below.

Any suggestions are welcome.

Best regards
Marek Maj

jobmanager.log

2019-09-10 16:30:28.177 INFO  o.a.f.r.c.CheckpointCoordinator   - Completed checkpoint 47 for job c8a9ae03785ade86348c3189cf7dd965 (18532488122 bytes in 60871 ms).

2019-09-10 16:31:19.223 INFO  o.a.f.r.c.CheckpointCoordinator   - Triggering checkpoint 48 @ 1568111478177 for job c8a9ae03785ade86348c3189cf7dd965.

2019-09-10 16:32:19.280 INFO  o.a.f.r.c.CheckpointCoordinator   - Completed checkpoint 48 for job c8a9ae03785ade86348c3189cf7dd965 (19049515705 bytes in 61083 ms).

2019-09-10 16:33:10.480 INFO  o.a.f.r.c.CheckpointCoordinator   - Triggering checkpoint 49 @ 1568111589279 for job c8a9ae03785ade86348c3189cf7dd965.

2019-09-10 16:33:36.773 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:33:48.782 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:00.802 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:03.800 INFO  o.a.flink.yarn.YarnResourceManager   - The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023 timed out.

2019-09-10 16:34:03.801 INFO  o.a.flink.yarn.YarnResourceManager   - Closing TaskExecutor connection container_e67_1568017536744_0044_01_000023 because: The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023  timed out.

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - my-function (1/32) (ae416d03ddc94a3633673c4050b8f2ae) switched from RUNNING to FAILED.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.c.CheckpointCoordinator   - Discarding checkpoint 49 of job c8a9ae03785ade86348c3189cf7dd965.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - Job ProcessingJob (c8a9ae03785ade86348c3189cf7dd965) switched from state RUNNING to FAILING.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
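
A note on the checkpoint lines at the top of this log: the time between one checkpoint completing and the next one being triggered can be read straight off the timestamps. The sketch below does that arithmetic, assuming the first line's date is 2019-09-10 like the others. The class and method names are made up for illustration. Both gaps come out at about 51 s, which would fit the "50s" setting being a minimum pause between checkpoints, though that reading is an assumption.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class CheckpointGaps {
    // Timestamp layout used by the jobmanager.log lines.
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    /** Milliseconds elapsed between two log timestamps. */
    public static long gapMillis(String completed, String nextTriggered) {
        LocalDateTime from = LocalDateTime.parse(completed, LOG_TS);
        LocalDateTime to = LocalDateTime.parse(nextTriggered, LOG_TS);
        return Duration.between(from, to).toMillis();
    }

    public static void main(String[] args) {
        // Checkpoint 47 completed -> checkpoint 48 triggered.
        System.out.println(gapMillis("2019-09-10 16:30:28.177", "2019-09-10 16:31:19.223"));
        // Checkpoint 48 completed -> checkpoint 49 triggered.
        System.out.println(gapMillis("2019-09-10 16:32:19.280", "2019-09-10 16:33:10.480"));
    }
}
```

This prints 51046 and 51200. Checkpoint 49 never completes: the TaskManager heartbeat times out about 53 s after it is triggered.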



hs_err_pid_262348.log for failed container

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f294944b2c2, pid=262348, tid=0x00007f2916833700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libzip.so+0xb2c2]  inflateEnd+0x32
#
# Core dump written. Default location: /data/hadoop/yarn/local/usercache/flink/appcache/application_1568017536744_0044/container_e67_1568017536744_0044_01_000023/core or core.262348
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f29440e8000):  JavaThread "Finalizer" daemon [_thread_in_native, id=262401, stack(0x00007f2916733000,0x00007f2916834000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000001080

Registers:
RAX=0x00007f0100000001, RBX=0x00007f2945e52770, RCX=0x0000000000000180, RDX=0x00007f2945e52770
RSP=0x00007f29168323d0, RBP=0x00007f29168323e0, RSI=0x0000000000001040, RDI=0x00007f2945e52770
R8 =0x00000007bff0f170, R9 =0x0000000000000006, R10=0x00007f2935017a08, R11=0x00007f294b583d50
R12=0x00007f29440e81f8, R13=0x00007f293135cc58, R14=0x00007f2916832490, R15=0x00007f29440e8000
RIP=0x00007f294944b2c2, EFLAGS=0x0000000000010202, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f29168323d0)
0x00007f29168323d0:   ffffffff440e8000 00007f2945e52770
0x00007f29168323e0:   00007f2916832400 00007f294944338e
0x00007f29168323f0:   00007f293135cc58 0000000000000000
0x00007f2916832400:   00007f2916832468 00007f2935017a34
0x00007f2916832410:   00007f2916832540 00007f293501306d
0x00007f2916832420:   00007f29350055d0 00007f2916832428
0x00007f2916832430:   0000000000000000 00007f2916832490
0x00007f2916832440:   00007f293135cd70 0000000000000000
0x00007f2916832450:   00007f293135cc58 0000000000000000
0x00007f2916832460:   00007f2916832488 00007f29168324e8
0x00007f2916832470:   00007f29350082bd 00000006ab616900
0x00007f2916832480:   00007f2935011538 00007f2945e52770
0x00007f2916832490:   00000007bff0f1e8 00000007bff0f1e8
0x00007f29168324a0:   00000007bff0f1e8 00007f2916832498
0x00007f29168324b0:   00007f293135c5e5 00007f2916832518
0x00007f29168324c0:   00007f293135cd70 00007f29313f9840
0x00007f29168324d0:   00007f293135c618 00007f2916832488
0x00007f29168324e0:   00007f2916832518 00007f2916832580
0x00007f29168324f0:   00007f29350082bd 0000000000000000
0x00007f2916832500:   00007f2945e52770 0000000000000000
0x00007f2916832510:   00000007bff0f1e8 00000007bff0cd38
0x00007f2916832520:   0000000000000009 00000007bff0f158
0x00007f2916832530:   0000006ce4720709 00000007bff0cd98
0x00007f2916832540:   00007f2916832520 00007f293132f631
0x00007f2916832550:   00007f29168325d8 00007f2931330ce0
0x00007f2916832560:   0000000000000000 00007f293132f6c0
0x00007f2916832570:   00007f2916832518 00007f29168325d8
0x00007f2916832580:   00007f2916832620 00007f29350082bd
0x00007f2916832590:   0000000000000000 0000000000000000
0x00007f29168325a0:   0000000000000000 0000000000000000
0x00007f29168325b0:   0000000000000000 0000000000000000
0x00007f29168325c0:   00000007bff0f158 00000007bff0cd38 

Instructions: (pc=0x00007f294944b2c2)
0x00007f294944b2a2:   fe ff ff ff 48 83 c4 08 5b c9 c3 0f 1f 00 48 8b
0x00007f294944b2b2:   77 28 48 85 f6 74 e8 48 8b 47 38 48 85 c0 74 df
0x00007f294944b2c2:   48 8b 56 40 48 85 d2 74 11 48 89 d6 48 8b 7f 40
0x00007f294944b2d2:   ff d0 48 8b 43 38 48 8b 73 28 48 8b 7b 40 ff d0 

Register to memory mapping:

RAX=0x00007f0100000001 is an unknown value
RBX=0x00007f2945e52770 is an unknown value
RCX=0x0000000000000180 is an unknown value
RDX=0x00007f2945e52770 is an unknown value
RSP=0x00007f29168323d0 is pointing into the stack for thread: 0x00007f29440e8000
RBP=0x00007f29168323e0 is pointing into the stack for thread: 0x00007f29440e8000
RSI=0x0000000000001040 is an unknown value
RDI=0x00007f2945e52770 is an unknown value
R8 =0x00000007bff0f170 is an oop
[Ljava.lang.Object; 
 - klass: 'java/lang/Object'[]
 - length: 16
R9 =0x0000000000000006 is an unknown value
R10=0x00007f2935017a08 is at code_begin+808 in an Interpreter codelet
method entry point (kind = native)  [0x00007f29350176e0, 0x00007f2935017fe0]  2304 bytes
R11=0x00007f294b583d50: <offset 0x9c3d50> in /usr/jdk64/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so at 0x00007f294abc0000
R12=0x00007f29440e81f8 is an unknown value
R13={method} {0x00007f293135cc58} 'end' '(J)V' in 'java/util/zip/Inflater'
R14=0x00007f2916832490 is pointing into the stack for thread: 0x00007f29440e8000
R15=0x00007f29440e8000 is a thread


Stack: [0x00007f2916733000,0x00007f2916834000],  sp=0x00007f29168323d0,  free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libzip.so+0xb2c2]  inflateEnd+0x32
C  [libzip.so+0x338e]  Java_java_util_zip_Inflater_end+0x1e
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x690c66]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1056
V  [libjvm.so+0x691171]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
V  [libjvm.so+0x691617]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x47
V  [libjvm.so+0x72c990]  thread_entry(JavaThread*, Thread*)+0xa0
V  [libjvm.so+0xa755f3]  JavaThread::thread_main_inner()+0x103
V  [libjvm.so+0xa7573c]  JavaThread::run()+0x11c
V  [libjvm.so+0x926138]  java_start(Thread*)+0x108
C  [libpthread.so.0+0x7e25]  start_thread+0xc5

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub
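
A note on reading the crash report above: the register dump, the instruction bytes and the fault address agree with each other. Assuming standard x86-64 encoding, the four bytes at pc (48 8b 56 40) are a 64-bit load from [RSI + 0x40], and with RSI=0x0000000000001040 that is exactly the si_addr 0x1080 reported for the SEGV_MAPERR. The sketch below redoes that decoding. It handles only this one instruction form, and the class and method names are hypothetical.

```java
public class FaultAddress {
    /** r/m field of a ModRM byte: index of the base register (6 = RSI when REX.B is clear). */
    public static int baseRegisterIndex(int modrm) {
        return modrm & 0x7;
    }

    /**
     * Effective address of "REX.W 8B /r" (MOV r64, r/m64) when the ModRM byte
     * uses mod=01, i.e. base register plus a sign-extended 8-bit displacement.
     */
    public static long effectiveAddress(int[] insn, long baseRegister) {
        int modrm = insn[2];
        int mod = (modrm >> 6) & 0x3;
        boolean usesSib = baseRegisterIndex(modrm) == 4; // r/m=100 means a SIB byte follows
        if (insn[0] != 0x48 || insn[1] != 0x8b || mod != 1 || usesSib) {
            throw new IllegalArgumentException("only REX.W MOV r64, [reg+disp8] is handled");
        }
        byte disp8 = (byte) insn[3]; // the cast sign-extends the displacement
        return baseRegister + disp8;
    }

    public static void main(String[] args) {
        int[] atPc = {0x48, 0x8b, 0x56, 0x40}; // bytes at pc=0x00007f294944b2c2
        long rsi = 0x1040L;                    // RSI from the register dump
        System.out.printf("base register index: %d%n", baseRegisterIndex(atPc[2]));
        System.out.printf("fault address: 0x%x%n", effectiveAddress(atPc, rsi));
    }
}
```

The computed address is 0x1080. A base pointer as small as 0x1040 is not a plausible heap address, so inflateEnd appears to have followed a pointer that was no longer valid when the Finalizer thread called Inflater.end. The report alone does not say why.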

Re: SIGSEGV error

Till Rohrmann
Hi Marek,

Could you share the log statements that were written before the SIGSEGV? They might help us understand what happened beforehand. It would also help to see your custom serializer implementations. I'm also pulling in Gordon, who worked on the TypeSerializerSnapshot improvements.

Cheers,
Till

On Thu, Sep 12, 2019 at 9:28 AM Marek Maj <[hidden email]> wrote:

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:03.800 INFO  o.a.flink.yarn.YarnResourceManager   - The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023 timed out.

2019-09-10 16:34:03.801 INFO  o.a.flink.yarn.YarnResourceManager   - Closing TaskExecutor connection container_e67_1568017536744_0044_01_000023 because: The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023  timed out.

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - my-function (1/32) (ae416d03ddc94a3633673c4050b8f2ae) switched from RUNNING to FAILED.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.c.CheckpointCoordinator   - Discarding checkpoint 49 of job c8a9ae03785ade86348c3189cf7dd965.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - Job ProcessingJob (c8a9ae03785ade86348c3189cf7dd965) switched from state RUNNING to FAILING.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



hs_err_pid_262348.log for failed container

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f294944b2c2, pid=262348, tid=0x00007f2916833700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libzip.so+0xb2c2]  inflateEnd+0x32
#
# Core dump written. Default location: /data/hadoop/yarn/local/usercache/flink/appcache/application_1568017536744_0044/container_e67_1568017536744_0044_01_000023/core or core.262348
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f29440e8000):  JavaThread "Finalizer" daemon [_thread_in_native, id=262401, stack(0x00007f2916733000,0x00007f2916834000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000001080

Registers:
RAX=0x00007f0100000001, RBX=0x00007f2945e52770, RCX=0x0000000000000180, RDX=0x00007f2945e52770
RSP=0x00007f29168323d0, RBP=0x00007f29168323e0, RSI=0x0000000000001040, RDI=0x00007f2945e52770
R8 =0x00000007bff0f170, R9 =0x0000000000000006, R10=0x00007f2935017a08, R11=0x00007f294b583d50
R12=0x00007f29440e81f8, R13=0x00007f293135cc58, R14=0x00007f2916832490, R15=0x00007f29440e8000
RIP=0x00007f294944b2c2, EFLAGS=0x0000000000010202, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f29168323d0)
0x00007f29168323d0:   ffffffff440e8000 00007f2945e52770
0x00007f29168323e0:   00007f2916832400 00007f294944338e
0x00007f29168323f0:   00007f293135cc58 0000000000000000
0x00007f2916832400:   00007f2916832468 00007f2935017a34
0x00007f2916832410:   00007f2916832540 00007f293501306d
0x00007f2916832420:   00007f29350055d0 00007f2916832428
0x00007f2916832430:   0000000000000000 00007f2916832490
0x00007f2916832440:   00007f293135cd70 0000000000000000
0x00007f2916832450:   00007f293135cc58 0000000000000000
0x00007f2916832460:   00007f2916832488 00007f29168324e8
0x00007f2916832470:   00007f29350082bd 00000006ab616900
0x00007f2916832480:   00007f2935011538 00007f2945e52770
0x00007f2916832490:   00000007bff0f1e8 00000007bff0f1e8
0x00007f29168324a0:   00000007bff0f1e8 00007f2916832498
0x00007f29168324b0:   00007f293135c5e5 00007f2916832518
0x00007f29168324c0:   00007f293135cd70 00007f29313f9840
0x00007f29168324d0:   00007f293135c618 00007f2916832488
0x00007f29168324e0:   00007f2916832518 00007f2916832580
0x00007f29168324f0:   00007f29350082bd 0000000000000000
0x00007f2916832500:   00007f2945e52770 0000000000000000
0x00007f2916832510:   00000007bff0f1e8 00000007bff0cd38
0x00007f2916832520:   0000000000000009 00000007bff0f158
0x00007f2916832530:   0000006ce4720709 00000007bff0cd98
0x00007f2916832540:   00007f2916832520 00007f293132f631
0x00007f2916832550:   00007f29168325d8 00007f2931330ce0
0x00007f2916832560:   0000000000000000 00007f293132f6c0
0x00007f2916832570:   00007f2916832518 00007f29168325d8
0x00007f2916832580:   00007f2916832620 00007f29350082bd
0x00007f2916832590:   0000000000000000 0000000000000000
0x00007f29168325a0:   0000000000000000 0000000000000000
0x00007f29168325b0:   0000000000000000 0000000000000000
0x00007f29168325c0:   00000007bff0f158 00000007bff0cd38 

Instructions: (pc=0x00007f294944b2c2)
0x00007f294944b2a2:   fe ff ff ff 48 83 c4 08 5b c9 c3 0f 1f 00 48 8b
0x00007f294944b2b2:   77 28 48 85 f6 74 e8 48 8b 47 38 48 85 c0 74 df
0x00007f294944b2c2:   48 8b 56 40 48 85 d2 74 11 48 89 d6 48 8b 7f 40
0x00007f294944b2d2:   ff d0 48 8b 43 38 48 8b 73 28 48 8b 7b 40 ff d0 

Register to memory mapping:

RAX=0x00007f0100000001 is an unknown value
RBX=0x00007f2945e52770 is an unknown value
RCX=0x0000000000000180 is an unknown value
RDX=0x00007f2945e52770 is an unknown value
RSP=0x00007f29168323d0 is pointing into the stack for thread: 0x00007f29440e8000
RBP=0x00007f29168323e0 is pointing into the stack for thread: 0x00007f29440e8000
RSI=0x0000000000001040 is an unknown value
RDI=0x00007f2945e52770 is an unknown value
R8 =0x00000007bff0f170 is an oop
[Ljava.lang.Object; 
 - klass: 'java/lang/Object'[]
 - length: 16
R9 =0x0000000000000006 is an unknown value
R10=0x00007f2935017a08 is at code_begin+808 in an Interpreter codelet
method entry point (kind = native)  [0x00007f29350176e0, 0x00007f2935017fe0]  2304 bytes
R11=0x00007f294b583d50: <offset 0x9c3d50> in /usr/jdk64/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so at 0x00007f294abc0000
R12=0x00007f29440e81f8 is an unknown value
R13={method} {0x00007f293135cc58} 'end' '(J)V' in 'java/util/zip/Inflater'
R14=0x00007f2916832490 is pointing into the stack for thread: 0x00007f29440e8000
R15=0x00007f29440e8000 is a thread


Stack: [0x00007f2916733000,0x00007f2916834000],  sp=0x00007f29168323d0,  free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libzip.so+0xb2c2]  inflateEnd+0x32
C  [libzip.so+0x338e]  Java_java_util_zip_Inflater_end+0x1e
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x690c66]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1056
V  [libjvm.so+0x691171]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
V  [libjvm.so+0x691617]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x47
V  [libjvm.so+0x72c990]  thread_entry(JavaThread*, Thread*)+0xa0
V  [libjvm.so+0xa755f3]  JavaThread::thread_main_inner()+0x103
V  [libjvm.so+0xa7573c]  JavaThread::run()+0x11c
V  [libjvm.so+0x926138]  java_start(Thread*)+0x108
C  [libpthread.so.0+0x7e25]  start_thread+0xc5

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub

Re: SIGSEGV error

Stephan Ewen
Given that the segfault happens in the JVM's ZIP stream code, I am curious whether this is a bug in Flink or in the JVM core libs that just happens to be triggered now by newer versions of Flink.

I found this on StackOverflow, which looks like it could be related: https://stackoverflow.com/questions/38326183/jvm-crashed-in-java-util-zip-zipfile-getentry
Can you try the suggested option "-Dsun.zip.disableMemoryMapping=true"?
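
To confirm that the option actually reached the TaskManager JVM, a small check like the sketch below might help. On Flink, extra JVM options are normally passed through `env.java.opts` (or `env.java.opts.taskmanager`) in flink-conf.yaml. The class name `ZipMappingCheck` and the helper `isMappingDisabled` are made up for illustration. The parsing rule (an empty value or "true" disables memory mapping) reflects my understanding of how JDK 8's ZipFile reads the property, so treat it as an assumption rather than documented behaviour.

```java
// Hypothetical helper, not part of Flink or the JDK.
class ZipMappingCheck {

    /**
     * Assumed rule (based on JDK 8's java.util.zip.ZipFile): memory mapping
     * is disabled when the property is set to an empty value or to "true"
     * in any letter case. An unset property leaves memory mapping enabled.
     */
    public static boolean isMappingDisabled(String value) {
        return value != null && (value.isEmpty() || value.equalsIgnoreCase("true"));
    }

    public static void main(String[] args) {
        // Read the property as it was passed on the JVM command line.
        String value = System.getProperty("sun.zip.disableMemoryMapping");
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("sun.zip.disableMemoryMapping = " + value);
        System.out.println("memory mapping disabled: " + isMappingDisabled(value));
    }
}
```

As far as I know the JDK reads this property once at startup, so it should be set on the command line of the container JVM. Setting it later with System.setProperty is unlikely to have any effect.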


On Fri, Sep 13, 2019 at 11:36 AM Till Rohrmann <[hidden email]> wrote:
Hi Marek,

could you share the log statements from before the SIGSEGV with us? They might help us understand what happened. It would also help to get access to your custom serializer implementations. I'm also pulling in Gordon, who worked on the TypeSerializerSnapshot improvements.

Cheers,
Till

On Thu, Sep 12, 2019 at 9:28 AM Marek Maj <[hidden email]> wrote:
Hi everyone, 

Recently we decided to upgrade from Flink 1.7.2 to 1.8.1. After the upgrade, our task managers started to fail with a SIGSEGV error from time to time.

While adjusting the code to 1.8.1, we noticed that there were some changes around the TypeSerializerSnapshot interface and its implementations. At that time we had a few custom serializers, which we decided to drop during the migration in favor of Flink's default serializers. We don't mind clearing the state as part of the migration; migrating with state does not seem worth the effort.

Unfortunately, after running the new version we see SIGSEGV errors from time to time. Serialization may not be the real cause, but at the moment it seems the most probable one. We have not made any significant code changes outside the serialization area.

We run the job on YARN, HDP version 2.7.3.2.6.2.0-205.
Checkpoint configuration: RocksDB backend, not incremental, 50s min processing time

You can find parts of the JobManager log and the ErrorFile log of the failed container included below.

Any suggestions are welcome

Best regards
Marek Maj

jobmanager.log

2019-09-10 16:30:28.177 INFO  o.a.f.r.c.CheckpointCoordinator   - Completed checkpoint 47 for job c8a9ae03785ade86348c3189cf7dd965 (18532488122 bytes in 60871 ms).

2019-09-10 16:31:19.223 INFO  o.a.f.r.c.CheckpointCoordinator   - Triggering checkpoint 48 @ 1568111478177 for job c8a9ae03785ade86348c3189cf7dd965.

2019-09-10 16:32:19.280 INFO  o.a.f.r.c.CheckpointCoordinator   - Completed checkpoint 48 for job c8a9ae03785ade86348c3189cf7dd965 (19049515705 bytes in 61083 ms).

2019-09-10 16:33:10.480 INFO  o.a.f.r.c.CheckpointCoordinator   - Triggering checkpoint 49 @ 1568111589279 for job c8a9ae03785ade86348c3189cf7dd965.

2019-09-10 16:33:36.773 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:33:48.782 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:00.802 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:03.800 INFO  o.a.flink.yarn.YarnResourceManager   - The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023 timed out.

2019-09-10 16:34:03.801 INFO  o.a.flink.yarn.YarnResourceManager   - Closing TaskExecutor connection container_e67_1568017536744_0044_01_000023 because: The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023  timed out.

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - my-function (1/32) (ae416d03ddc94a3633673c4050b8f2ae) switched from RUNNING to FAILED.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.c.CheckpointCoordinator   - Discarding checkpoint 49 of job c8a9ae03785ade86348c3189cf7dd965.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - Job ProcessingJob (c8a9ae03785ade86348c3189cf7dd965) switched from state RUNNING to FAILING.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



hs_err_pid_262348.log for failed container

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f294944b2c2, pid=262348, tid=0x00007f2916833700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libzip.so+0xb2c2]  inflateEnd+0x32
#
# Core dump written. Default location: /data/hadoop/yarn/local/usercache/flink/appcache/application_1568017536744_0044/container_e67_1568017536744_0044_01_000023/core or core.262348
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f29440e8000):  JavaThread "Finalizer" daemon [_thread_in_native, id=262401, stack(0x00007f2916733000,0x00007f2916834000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000001080

Registers:
RAX=0x00007f0100000001, RBX=0x00007f2945e52770, RCX=0x0000000000000180, RDX=0x00007f2945e52770
RSP=0x00007f29168323d0, RBP=0x00007f29168323e0, RSI=0x0000000000001040, RDI=0x00007f2945e52770
R8 =0x00000007bff0f170, R9 =0x0000000000000006, R10=0x00007f2935017a08, R11=0x00007f294b583d50
R12=0x00007f29440e81f8, R13=0x00007f293135cc58, R14=0x00007f2916832490, R15=0x00007f29440e8000
RIP=0x00007f294944b2c2, EFLAGS=0x0000000000010202, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f29168323d0)
0x00007f29168323d0:   ffffffff440e8000 00007f2945e52770
0x00007f29168323e0:   00007f2916832400 00007f294944338e
0x00007f29168323f0:   00007f293135cc58 0000000000000000
0x00007f2916832400:   00007f2916832468 00007f2935017a34
0x00007f2916832410:   00007f2916832540 00007f293501306d
0x00007f2916832420:   00007f29350055d0 00007f2916832428
0x00007f2916832430:   0000000000000000 00007f2916832490
0x00007f2916832440:   00007f293135cd70 0000000000000000
0x00007f2916832450:   00007f293135cc58 0000000000000000
0x00007f2916832460:   00007f2916832488 00007f29168324e8
0x00007f2916832470:   00007f29350082bd 00000006ab616900
0x00007f2916832480:   00007f2935011538 00007f2945e52770
0x00007f2916832490:   00000007bff0f1e8 00000007bff0f1e8
0x00007f29168324a0:   00000007bff0f1e8 00007f2916832498
0x00007f29168324b0:   00007f293135c5e5 00007f2916832518
0x00007f29168324c0:   00007f293135cd70 00007f29313f9840
0x00007f29168324d0:   00007f293135c618 00007f2916832488
0x00007f29168324e0:   00007f2916832518 00007f2916832580
0x00007f29168324f0:   00007f29350082bd 0000000000000000
0x00007f2916832500:   00007f2945e52770 0000000000000000
0x00007f2916832510:   00000007bff0f1e8 00000007bff0cd38
0x00007f2916832520:   0000000000000009 00000007bff0f158
0x00007f2916832530:   0000006ce4720709 00000007bff0cd98
0x00007f2916832540:   00007f2916832520 00007f293132f631
0x00007f2916832550:   00007f29168325d8 00007f2931330ce0
0x00007f2916832560:   0000000000000000 00007f293132f6c0
0x00007f2916832570:   00007f2916832518 00007f29168325d8
0x00007f2916832580:   00007f2916832620 00007f29350082bd
0x00007f2916832590:   0000000000000000 0000000000000000
0x00007f29168325a0:   0000000000000000 0000000000000000
0x00007f29168325b0:   0000000000000000 0000000000000000
0x00007f29168325c0:   00000007bff0f158 00000007bff0cd38 

Instructions: (pc=0x00007f294944b2c2)
0x00007f294944b2a2:   fe ff ff ff 48 83 c4 08 5b c9 c3 0f 1f 00 48 8b
0x00007f294944b2b2:   77 28 48 85 f6 74 e8 48 8b 47 38 48 85 c0 74 df
0x00007f294944b2c2:   48 8b 56 40 48 85 d2 74 11 48 89 d6 48 8b 7f 40
0x00007f294944b2d2:   ff d0 48 8b 43 38 48 8b 73 28 48 8b 7b 40 ff d0 

Register to memory mapping:

RAX=0x00007f0100000001 is an unknown value
RBX=0x00007f2945e52770 is an unknown value
RCX=0x0000000000000180 is an unknown value
RDX=0x00007f2945e52770 is an unknown value
RSP=0x00007f29168323d0 is pointing into the stack for thread: 0x00007f29440e8000
RBP=0x00007f29168323e0 is pointing into the stack for thread: 0x00007f29440e8000
RSI=0x0000000000001040 is an unknown value
RDI=0x00007f2945e52770 is an unknown value
R8 =0x00000007bff0f170 is an oop
[Ljava.lang.Object; 
 - klass: 'java/lang/Object'[]
 - length: 16
R9 =0x0000000000000006 is an unknown value
R10=0x00007f2935017a08 is at code_begin+808 in an Interpreter codelet
method entry point (kind = native)  [0x00007f29350176e0, 0x00007f2935017fe0]  2304 bytes
R11=0x00007f294b583d50: <offset 0x9c3d50> in /usr/jdk64/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so at 0x00007f294abc0000
R12=0x00007f29440e81f8 is an unknown value
R13={method} {0x00007f293135cc58} 'end' '(J)V' in 'java/util/zip/Inflater'
R14=0x00007f2916832490 is pointing into the stack for thread: 0x00007f29440e8000
R15=0x00007f29440e8000 is a thread


Stack: [0x00007f2916733000,0x00007f2916834000],  sp=0x00007f29168323d0,  free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libzip.so+0xb2c2]  inflateEnd+0x32
C  [libzip.so+0x338e]  Java_java_util_zip_Inflater_end+0x1e
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x690c66]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1056
V  [libjvm.so+0x691171]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
V  [libjvm.so+0x691617]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x47
V  [libjvm.so+0x72c990]  thread_entry(JavaThread*, Thread*)+0xa0
V  [libjvm.so+0xa755f3]  JavaThread::thread_main_inner()+0x103
V  [libjvm.so+0xa7573c]  JavaThread::run()+0x11c
V  [libjvm.so+0x926138]  java_start(Thread*)+0x108
C  [libpthread.so.0+0x7e25]  start_thread+0xc5

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub
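
A note on reading the dump above: the instruction bytes at the faulting pc, `48 8b 56 40`, decode on x86-64 to `mov rdx, [rsi+0x40]`, and the reported si_addr (0x1080) is exactly RSI (0x1040) plus 0x40. The native code therefore dereferenced a pointer whose value, 0x1040, does not look like a valid heap address. That would be consistent with the zlib stream structure having been freed or overwritten before the finalizer called Inflater.end(), but this interpretation is an assumption; the log alone does not prove it. The sketch below only reproduces the arithmetic from the two log lines. The class name `HsErrTriage` and the helper `hexField` are hypothetical and are not part of Flink or the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HsErrTriage {
    // Returns the hex value that follows "name=" or "name:" in one hs_err line.
    public static long hexField(String text, String name) {
        Pattern p = Pattern.compile(Pattern.quote(name) + "\\s*[=:]\\s*0x([0-9a-fA-F]+)");
        Matcher m = p.matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("field not found: " + name);
        }
        return Long.parseUnsignedLong(m.group(1), 16);
    }

    public static void main(String[] args) {
        // Both lines are copied from the hs_err file shown above.
        String siginfo = "siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), "
                + "si_addr: 0x0000000000001080";
        String regs = "RSP=0x00007f29168323d0, RBP=0x00007f29168323e0, "
                + "RSI=0x0000000000001040, RDI=0x00007f2945e52770";

        long faultAddr = hexField(siginfo, "si_addr");
        long rsi = hexField(regs, "RSI");

        // The difference matches the 0x40 displacement of the faulting instruction.
        System.out.printf("si_addr - RSI = 0x%x%n", faultAddr - rsi);
    }
}
```

Run against the values in this thread, the program prints `si_addr - RSI = 0x40`.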

Re: SIGSEGV error

Joshua Fan
Hi Stephan, Till

Recently I tried to upgrade a Flink job from 1.7 to 1.11. Unfortunately, a strange problem appeared: "SIGSEGV (0xb) at pc=0x0000000000000025, pid=135306, tid=140439001388800". The pid log is attached.
It is a simple job that consumes messages from Kafka and writes them to HDFS in gzip format. It runs on 1.11 for about 2 minutes, then the JVM crashes; the job restarts and the JVM crashes again, until the application fails.
I also tried setting -Dsun.zip.disableMemoryMapping=true, but it did not help; the same crash keeps happening. Google suggests upgrading to JDK 9, but that is not feasible.
Any suggestions? Thanks a lot.

Yours sincerely
Josh

On Fri, Sep 13, 2019 at 11:11 PM Stephan Ewen <[hidden email]> wrote:
Given that the segfault happens in the JVM's ZIP stream code, I am curious whether this is a bug in Flink or in the JVM core libs that happens to be triggered now by newer versions of Flink.

I found this on StackOverflow, which looks like it could be related: https://stackoverflow.com/questions/38326183/jvm-crashed-in-java-util-zip-zipfile-getentry
Can you try the suggested option "-Dsun.zip.disableMemoryMapping=true"?
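
Before concluding that the option has no effect, it may be worth confirming that it actually reached the TaskManager JVM. The sketch below is one way to check from inside a running JVM, using the standard `RuntimeMXBean.getInputArguments()` call. The class name `ZipMappingFlagCheck` and its helper are hypothetical. How the flag is passed is also an assumption about the setup: on Flink it is typically set through `env.java.opts` in flink-conf.yaml, so check the documentation for the version in use.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class ZipMappingFlagCheck {
    static final String PROPERTY = "sun.zip.disableMemoryMapping";

    // True if the JVM arguments contain -Dsun.zip.disableMemoryMapping
    // with an empty value or the value "true" (case-insensitive).
    public static boolean flagPassedOnCommandLine(List<String> jvmArgs) {
        String prefix = "-D" + PROPERTY;
        for (String arg : jvmArgs) {
            if (arg.equals(prefix)) {
                return true;
            }
            if (arg.startsWith(prefix + "=")) {
                String value = arg.substring(prefix.length() + 1);
                if (value.isEmpty() || value.equalsIgnoreCase("true")) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments the current JVM was started with (excludes main-method args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("system property: " + System.getProperty(PROPERTY));
        System.out.println("passed as JVM argument: " + flagPassedOnCommandLine(jvmArgs));
    }
}
```

The same check could be logged once from the `open()` method of any rich function in the job, so that the result shows up in the TaskManager log of each container.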


On Fri, Sep 13, 2019 at 11:36 AM Till Rohrmann <[hidden email]> wrote:
Hi Marek,

could you share with us the log statements that were written before the SIGSEGV? They might help us understand what happened beforehand. Moreover, it would be helpful to get access to your custom serializer implementations. I'm also pulling in Gordon, who worked on the TypeSerializerSnapshot improvements.

Cheers,
Till

On Thu, Sep 12, 2019 at 9:28 AM Marek Maj <[hidden email]> wrote:
Hi everyone, 

Recently we decided to upgrade from Flink 1.7.2 to 1.8.1. After the upgrade, our task managers started to fail with a SIGSEGV error from time to time.

While adjusting the code to 1.8.1, we noticed that there were some changes around the TypeSerializerSnapshot interface and its implementations. At that time we had a few custom serializers, which we decided to throw out during the migration in favor of Flink's default serializers. We don't mind clearing the state in the process; migrating with state does not seem worth the effort.

Unfortunately, after running the new version we see SIGSEGV errors from time to time. Serialization may not be the real cause, but at the moment it seems the most probable one. We have not made any significant code changes outside the serialization area.

We run the job on YARN, HDP version 2.7.3.2.6.2.0-205.
Checkpoint configuration: RocksDB backend, not incremental, 50s min processing time

You can find parts of JobManager log and ErrorFile log of failed container included below. 

Any suggestions are welcome

Best regards
Marek Maj

jobmanager.log

2019-09-10 16:30:28.177 INFO  o.a.f.r.c.CheckpointCoordinator   - Completed checkpoint 47 for job c8a9ae03785ade86348c3189cf7dd965 (18532488122 bytes in 60871 ms).

2019-09-10 16:31:19.223 INFO  o.a.f.r.c.CheckpointCoordinator   - Triggering checkpoint 48 @ 1568111478177 for job c8a9ae03785ade86348c3189cf7dd965.

2019-09-10 16:32:19.280 INFO  o.a.f.r.c.CheckpointCoordinator   - Completed checkpoint 48 for job c8a9ae03785ade86348c3189cf7dd965 (19049515705 bytes in 61083 ms).

2019-09-10 16:33:10.480 INFO  o.a.f.r.c.CheckpointCoordinator   - Triggering checkpoint 49 @ 1568111589279 for job c8a9ae03785ade86348c3189cf7dd965.

2019-09-10 16:33:36.773 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:33:48.782 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:00.802 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:03.800 INFO  o.a.flink.yarn.YarnResourceManager   - The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023 timed out.

2019-09-10 16:34:03.801 INFO  o.a.flink.yarn.YarnResourceManager   - Closing TaskExecutor connection container_e67_1568017536744_0044_01_000023 because: The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023  timed out.

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - my-function (1/32) (ae416d03ddc94a3633673c4050b8f2ae) switched from RUNNING to FAILED.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.c.CheckpointCoordinator   - Discarding checkpoint 49 of job c8a9ae03785ade86348c3189cf7dd965.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - Job ProcessingJob (c8a9ae03785ade86348c3189cf7dd965) switched from state RUNNING to FAILING.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



hs_err_pid_262348.log for failed container

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f294944b2c2, pid=262348, tid=0x00007f2916833700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libzip.so+0xb2c2]  inflateEnd+0x32
#
# Core dump written. Default location: /data/hadoop/yarn/local/usercache/flink/appcache/application_1568017536744_0044/container_e67_1568017536744_0044_01_000023/core or core.262348
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f29440e8000):  JavaThread "Finalizer" daemon [_thread_in_native, id=262401, stack(0x00007f2916733000,0x00007f2916834000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000001080

Registers:
RAX=0x00007f0100000001, RBX=0x00007f2945e52770, RCX=0x0000000000000180, RDX=0x00007f2945e52770
RSP=0x00007f29168323d0, RBP=0x00007f29168323e0, RSI=0x0000000000001040, RDI=0x00007f2945e52770
R8 =0x00000007bff0f170, R9 =0x0000000000000006, R10=0x00007f2935017a08, R11=0x00007f294b583d50
R12=0x00007f29440e81f8, R13=0x00007f293135cc58, R14=0x00007f2916832490, R15=0x00007f29440e8000
RIP=0x00007f294944b2c2, EFLAGS=0x0000000000010202, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f29168323d0)
0x00007f29168323d0:   ffffffff440e8000 00007f2945e52770
0x00007f29168323e0:   00007f2916832400 00007f294944338e
0x00007f29168323f0:   00007f293135cc58 0000000000000000
0x00007f2916832400:   00007f2916832468 00007f2935017a34
0x00007f2916832410:   00007f2916832540 00007f293501306d
0x00007f2916832420:   00007f29350055d0 00007f2916832428
0x00007f2916832430:   0000000000000000 00007f2916832490
0x00007f2916832440:   00007f293135cd70 0000000000000000
0x00007f2916832450:   00007f293135cc58 0000000000000000
0x00007f2916832460:   00007f2916832488 00007f29168324e8
0x00007f2916832470:   00007f29350082bd 00000006ab616900
0x00007f2916832480:   00007f2935011538 00007f2945e52770
0x00007f2916832490:   00000007bff0f1e8 00000007bff0f1e8
0x00007f29168324a0:   00000007bff0f1e8 00007f2916832498
0x00007f29168324b0:   00007f293135c5e5 00007f2916832518
0x00007f29168324c0:   00007f293135cd70 00007f29313f9840
0x00007f29168324d0:   00007f293135c618 00007f2916832488
0x00007f29168324e0:   00007f2916832518 00007f2916832580
0x00007f29168324f0:   00007f29350082bd 0000000000000000
0x00007f2916832500:   00007f2945e52770 0000000000000000
0x00007f2916832510:   00000007bff0f1e8 00000007bff0cd38
0x00007f2916832520:   0000000000000009 00000007bff0f158
0x00007f2916832530:   0000006ce4720709 00000007bff0cd98
0x00007f2916832540:   00007f2916832520 00007f293132f631
0x00007f2916832550:   00007f29168325d8 00007f2931330ce0
0x00007f2916832560:   0000000000000000 00007f293132f6c0
0x00007f2916832570:   00007f2916832518 00007f29168325d8
0x00007f2916832580:   00007f2916832620 00007f29350082bd
0x00007f2916832590:   0000000000000000 0000000000000000
0x00007f29168325a0:   0000000000000000 0000000000000000
0x00007f29168325b0:   0000000000000000 0000000000000000
0x00007f29168325c0:   00000007bff0f158 00000007bff0cd38 

Instructions: (pc=0x00007f294944b2c2)
0x00007f294944b2a2:   fe ff ff ff 48 83 c4 08 5b c9 c3 0f 1f 00 48 8b
0x00007f294944b2b2:   77 28 48 85 f6 74 e8 48 8b 47 38 48 85 c0 74 df
0x00007f294944b2c2:   48 8b 56 40 48 85 d2 74 11 48 89 d6 48 8b 7f 40
0x00007f294944b2d2:   ff d0 48 8b 43 38 48 8b 73 28 48 8b 7b 40 ff d0 

Register to memory mapping:

RAX=0x00007f0100000001 is an unknown value
RBX=0x00007f2945e52770 is an unknown value
RCX=0x0000000000000180 is an unknown value
RDX=0x00007f2945e52770 is an unknown value
RSP=0x00007f29168323d0 is pointing into the stack for thread: 0x00007f29440e8000
RBP=0x00007f29168323e0 is pointing into the stack for thread: 0x00007f29440e8000
RSI=0x0000000000001040 is an unknown value
RDI=0x00007f2945e52770 is an unknown value
R8 =0x00000007bff0f170 is an oop
[Ljava.lang.Object; 
 - klass: 'java/lang/Object'[]
 - length: 16
R9 =0x0000000000000006 is an unknown value
R10=0x00007f2935017a08 is at code_begin+808 in an Interpreter codelet
method entry point (kind = native)  [0x00007f29350176e0, 0x00007f2935017fe0]  2304 bytes
R11=0x00007f294b583d50: <offset 0x9c3d50> in /usr/jdk64/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so at 0x00007f294abc0000
R12=0x00007f29440e81f8 is an unknown value
R13={method} {0x00007f293135cc58} 'end' '(J)V' in 'java/util/zip/Inflater'
R14=0x00007f2916832490 is pointing into the stack for thread: 0x00007f29440e8000
R15=0x00007f29440e8000 is a thread


Stack: [0x00007f2916733000,0x00007f2916834000],  sp=0x00007f29168323d0,  free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libzip.so+0xb2c2]  inflateEnd+0x32
C  [libzip.so+0x338e]  Java_java_util_zip_Inflater_end+0x1e
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x690c66]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1056
V  [libjvm.so+0x691171]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
V  [libjvm.so+0x691617]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x47
V  [libjvm.so+0x72c990]  thread_entry(JavaThread*, Thread*)+0xa0
V  [libjvm.so+0xa755f3]  JavaThread::thread_main_inner()+0x103
V  [libjvm.so+0xa7573c]  JavaThread::run()+0x11c
V  [libjvm.so+0x926138]  java_start(Thread*)+0x108
C  [libpthread.so.0+0x7e25]  start_thread+0xc5

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub

hs_err_pid135306.log (173K)

Re: SIGSEGV error

Joshua Fan
Hi all,

Most posts I found say that "Most of the times, the crashes in ZIP_GetEntry occur when the jar file being accessed has been modified/overwritten while the JVM instance was running.", but I do not know when, or which, jar file was modified by the job running in Flink.

For your information.

Yours sincerely
Josh
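
One way to find out when, and which, jar file changes is to fingerprint the jar files the JVM was started with and compare them again later. The following is only a minimal sketch and not part of Flink: the class name `JarChangeWatcher` and its methods are made up for illustration, and it only inspects the entries of a classpath string such as `java.class.path`. Jars that Flink loads through its user-code class loader, or that YARN localizes elsewhere, are an assumption outside this sketch and would have to be passed in explicitly.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Flink or JDK class.
class JarChangeWatcher {

    /** Records "size:lastModifiedMillis" for every existing .jar entry of the classpath string. */
    static Map<Path, String> snapshot(String classPath) throws IOException {
        Map<Path, String> result = new LinkedHashMap<>();
        for (String entry : classPath.split(File.pathSeparator)) {
            if (!entry.endsWith(".jar")) {
                continue; // directories and empty entries are ignored
            }
            Path jar = Paths.get(entry);
            if (Files.isRegularFile(jar)) {
                String fingerprint = Files.size(jar) + ":" + Files.getLastModifiedTime(jar).toMillis();
                result.put(jar, fingerprint);
            }
        }
        return result;
    }

    /** Returns the jars from the first snapshot that are missing or different in the second one. */
    static List<Path> changed(Map<Path, String> before, Map<Path, String> after) {
        List<Path> out = new ArrayList<>();
        for (Map.Entry<Path, String> e : before.entrySet()) {
            if (!e.getValue().equals(after.get(e.getKey()))) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String classPath = System.getProperty("java.class.path");
        Map<Path, String> atStart = snapshot(classPath);
        // In a long-running process this comparison would be repeated periodically.
        Map<Path, String> later = snapshot(classPath);
        System.out.println("jars tracked: " + atStart.size());
        System.out.println("changed since start: " + changed(atStart, later));
    }
}
```

Logging the result of `changed` with a timestamp from a background thread would give both the "when" and the "which", at least for the jars covered by the snapshot.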

On Tue, May 18, 2021 at 10:15 AM Joshua Fan <[hidden email]> wrote:
Hi Stephan, Till

Recently I tried to upgrade a Flink job from 1.7 to 1.11. Unfortunately, a weird problem appeared: "SIGSEGV (0xb) at pc=0x0000000000000025, pid=135306, tid=140439001388800". The hs_err log for that pid is attached.
It is a simple job that consumes messages from Kafka and writes them to HDFS in gzip format. On 1.11 it runs for about 2 minutes, then the JVM crashes; the job restarts and the JVM crashes again, and this repeats until the application fails.
I also tried setting -Dsun.zip.disableMemoryMapping=true, but it did not help; the same crash keeps happening. Google suggests upgrading the JDK to JDK 9, but that is not feasible.
Any suggestions? Thanks a lot.

Yours sincerely
Josh

On Fri, Sep 13, 2019 at 11:11 PM Stephan Ewen <[hidden email]> wrote:
Given that the segfault happens in the JVM's ZIP stream code, I am curious whether this is a bug in Flink or in the JVM core libs that happens to be triggered now by newer versions of Flink.

I found this on StackOverflow, which looks like it could be related: https://stackoverflow.com/questions/38326183/jvm-crashed-in-java-util-zip-zipfile-getentry
Can you try the suggested option "-Dsun.zip.disableMemoryMapping=true"?
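
Before judging whether the option helps, it may be worth confirming that it actually reached the TaskManager JVM (in Flink, JVM options are commonly passed through `env.java.opts` in flink-conf.yaml). The following is a minimal sketch under that assumption; the class `ZipMmapFlagCheck` is hypothetical, and it only reports the raw system property value without verifying how the JDK interprets it.

```java
// Hypothetical helper, not a Flink or JDK class.
class ZipMmapFlagCheck {

    static final String KEY = "sun.zip.disableMemoryMapping";

    /** Describes whether the system property is visible inside the current JVM. */
    static String describe() {
        String value = System.getProperty(KEY);
        if (value == null) {
            return KEY + " is not set in this JVM";
        }
        return KEY + " = \"" + value + "\"";
    }

    public static void main(String[] args) {
        // Run inside the same JVM as the task, e.g. from an operator's open() method,
        // and check the TaskManager log for the printed line.
        System.out.println(describe());
    }
}
```

If the property shows up as not set inside the task, the option was probably applied to a different process (for example only to the client or the JobManager) rather than to the TaskManager.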


On Fri, Sep 13, 2019 at 11:36 AM Till Rohrmann <[hidden email]> wrote:
Hi Marek,

could you share the log statements that were written before the SIGSEGV with us? They might help us understand what happened before the crash. Moreover, it would be helpful to get access to your custom serializer implementations. I'm also pulling in Gordon, who worked on the TypeSerializerSnapshot improvements.

Cheers,
Till

On Thu, Sep 12, 2019 at 9:28 AM Marek Maj <[hidden email]> wrote:
Hi everyone, 

Recently we decided to upgrade from Flink 1.7.2 to 1.8.1. After the upgrade, our task managers started to fail with a SIGSEGV error from time to time.

In the process of adjusting the code to 1.8.1, we noticed that there were some changes around the TypeSerializerSnapshot interface and its implementations. At that time we had a few custom serializers, which we decided to throw out during the migration and rely on Flink's default serializers instead. We don't mind clearing the state during the migration; the effort of migrating with state does not seem worth it.

Unfortunately, after running the new version we see SIGSEGV errors from time to time. Serialization may not be the real cause, but at the moment it seems the most probable one. We have not made any significant code changes outside the serialization area.

We run the job on YARN, HDP version 2.7.3.2.6.2.0-205.
Checkpoint configuration: RocksDB backend, not incremental, 50s min processing time

You can find parts of the JobManager log and the error file log of the failed container below.

Any suggestions are welcome.

Best regards
Marek Maj

jobmanager.log

2019-09-10 16:30:28.177 INFO  o.a.f.r.c.CheckpointCoordinator   - Completed checkpoint 47 for job c8a9ae03785ade86348c3189cf7dd965 (18532488122 bytes in 60871 ms).

2019-09-10 16:31:19.223 INFO  o.a.f.r.c.CheckpointCoordinator   - Triggering checkpoint 48 @ 1568111478177 for job c8a9ae03785ade86348c3189cf7dd965.

2019-09-10 16:32:19.280 INFO  o.a.f.r.c.CheckpointCoordinator   - Completed checkpoint 48 for job c8a9ae03785ade86348c3189cf7dd965 (19049515705 bytes in 61083 ms).

2019-09-10 16:33:10.480 INFO  o.a.f.r.c.CheckpointCoordinator   - Triggering checkpoint 49 @ 1568111589279 for job c8a9ae03785ade86348c3189cf7dd965.

2019-09-10 16:33:36.773 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:33:48.782 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:00.802 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:03.800 INFO  o.a.flink.yarn.YarnResourceManager   - The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023 timed out.

2019-09-10 16:34:03.801 INFO  o.a.flink.yarn.YarnResourceManager   - Closing TaskExecutor connection container_e67_1568017536744_0044_01_000023 because: The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023  timed out.

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - my-function (1/32) (ae416d03ddc94a3633673c4050b8f2ae) switched from RUNNING to FAILED.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.c.CheckpointCoordinator   - Discarding checkpoint 49 of job c8a9ae03785ade86348c3189cf7dd965.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - Job ProcessingJob (c8a9ae03785ade86348c3189cf7dd965) switched from state RUNNING to FAILING.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



hs_err_pid_262348.log for failed container

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f294944b2c2, pid=262348, tid=0x00007f2916833700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libzip.so+0xb2c2]  inflateEnd+0x32
#
# Core dump written. Default location: /data/hadoop/yarn/local/usercache/flink/appcache/application_1568017536744_0044/container_e67_1568017536744_0044_01_000023/core or core.262348
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f29440e8000):  JavaThread "Finalizer" daemon [_thread_in_native, id=262401, stack(0x00007f2916733000,0x00007f2916834000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000001080

Registers:
RAX=0x00007f0100000001, RBX=0x00007f2945e52770, RCX=0x0000000000000180, RDX=0x00007f2945e52770
RSP=0x00007f29168323d0, RBP=0x00007f29168323e0, RSI=0x0000000000001040, RDI=0x00007f2945e52770
R8 =0x00000007bff0f170, R9 =0x0000000000000006, R10=0x00007f2935017a08, R11=0x00007f294b583d50
R12=0x00007f29440e81f8, R13=0x00007f293135cc58, R14=0x00007f2916832490, R15=0x00007f29440e8000
RIP=0x00007f294944b2c2, EFLAGS=0x0000000000010202, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f29168323d0)
0x00007f29168323d0:   ffffffff440e8000 00007f2945e52770
0x00007f29168323e0:   00007f2916832400 00007f294944338e
0x00007f29168323f0:   00007f293135cc58 0000000000000000
0x00007f2916832400:   00007f2916832468 00007f2935017a34
0x00007f2916832410:   00007f2916832540 00007f293501306d
0x00007f2916832420:   00007f29350055d0 00007f2916832428
0x00007f2916832430:   0000000000000000 00007f2916832490
0x00007f2916832440:   00007f293135cd70 0000000000000000
0x00007f2916832450:   00007f293135cc58 0000000000000000
0x00007f2916832460:   00007f2916832488 00007f29168324e8
0x00007f2916832470:   00007f29350082bd 00000006ab616900
0x00007f2916832480:   00007f2935011538 00007f2945e52770
0x00007f2916832490:   00000007bff0f1e8 00000007bff0f1e8
0x00007f29168324a0:   00000007bff0f1e8 00007f2916832498
0x00007f29168324b0:   00007f293135c5e5 00007f2916832518
0x00007f29168324c0:   00007f293135cd70 00007f29313f9840
0x00007f29168324d0:   00007f293135c618 00007f2916832488
0x00007f29168324e0:   00007f2916832518 00007f2916832580
0x00007f29168324f0:   00007f29350082bd 0000000000000000
0x00007f2916832500:   00007f2945e52770 0000000000000000
0x00007f2916832510:   00000007bff0f1e8 00000007bff0cd38
0x00007f2916832520:   0000000000000009 00000007bff0f158
0x00007f2916832530:   0000006ce4720709 00000007bff0cd98
0x00007f2916832540:   00007f2916832520 00007f293132f631
0x00007f2916832550:   00007f29168325d8 00007f2931330ce0
0x00007f2916832560:   0000000000000000 00007f293132f6c0
0x00007f2916832570:   00007f2916832518 00007f29168325d8
0x00007f2916832580:   00007f2916832620 00007f29350082bd
0x00007f2916832590:   0000000000000000 0000000000000000
0x00007f29168325a0:   0000000000000000 0000000000000000
0x00007f29168325b0:   0000000000000000 0000000000000000
0x00007f29168325c0:   00000007bff0f158 00000007bff0cd38 

Instructions: (pc=0x00007f294944b2c2)
0x00007f294944b2a2:   fe ff ff ff 48 83 c4 08 5b c9 c3 0f 1f 00 48 8b
0x00007f294944b2b2:   77 28 48 85 f6 74 e8 48 8b 47 38 48 85 c0 74 df
0x00007f294944b2c2:   48 8b 56 40 48 85 d2 74 11 48 89 d6 48 8b 7f 40
0x00007f294944b2d2:   ff d0 48 8b 43 38 48 8b 73 28 48 8b 7b 40 ff d0 

Register to memory mapping:

RAX=0x00007f0100000001 is an unknown value
RBX=0x00007f2945e52770 is an unknown value
RCX=0x0000000000000180 is an unknown value
RDX=0x00007f2945e52770 is an unknown value
RSP=0x00007f29168323d0 is pointing into the stack for thread: 0x00007f29440e8000
RBP=0x00007f29168323e0 is pointing into the stack for thread: 0x00007f29440e8000
RSI=0x0000000000001040 is an unknown value
RDI=0x00007f2945e52770 is an unknown value
R8 =0x00000007bff0f170 is an oop
[Ljava.lang.Object; 
 - klass: 'java/lang/Object'[]
 - length: 16
R9 =0x0000000000000006 is an unknown value
R10=0x00007f2935017a08 is at code_begin+808 in an Interpreter codelet
method entry point (kind = native)  [0x00007f29350176e0, 0x00007f2935017fe0]  2304 bytes
R11=0x00007f294b583d50: <offset 0x9c3d50> in /usr/jdk64/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so at 0x00007f294abc0000
R12=0x00007f29440e81f8 is an unknown value
R13={method} {0x00007f293135cc58} 'end' '(J)V' in 'java/util/zip/Inflater'
R14=0x00007f2916832490 is pointing into the stack for thread: 0x00007f29440e8000
R15=0x00007f29440e8000 is a thread


Stack: [0x00007f2916733000,0x00007f2916834000],  sp=0x00007f29168323d0,  free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libzip.so+0xb2c2]  inflateEnd+0x32
C  [libzip.so+0x338e]  Java_java_util_zip_Inflater_end+0x1e
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x690c66]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1056
V  [libjvm.so+0x691171]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
V  [libjvm.so+0x691617]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x47
V  [libjvm.so+0x72c990]  thread_entry(JavaThread*, Thread*)+0xa0
V  [libjvm.so+0xa755f3]  JavaThread::thread_main_inner()+0x103
V  [libjvm.so+0xa7573c]  JavaThread::run()+0x11c
V  [libjvm.so+0x926138]  java_start(Thread*)+0x108
C  [libpthread.so.0+0x7e25]  start_thread+0xc5

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub
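
The register dump and the instruction bytes in the error file are enough to see which memory access faulted. The four bytes at the crashing pc, `48 8b 56 40`, encode a 64-bit load through RSI with an 8-bit displacement, and RSI plus that displacement equals the reported si_addr. The following is a small sketch of that arithmetic; the class `FaultAddress` is hypothetical and its decoder handles only this one instruction form.

```java
// Hypothetical helper for reading the hs_err dump; handles only "REX.W 8B /r disp8".
class FaultAddress {

    private static final String[] REG64 =
            {"rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"};

    /** Decodes a 4-byte load such as 48 8b 56 40 into assembly text. */
    static String decodeLoad(int rex, int opcode, int modrm, int disp8) {
        int mod = (modrm >> 6) & 3;
        int reg = (modrm >> 3) & 7;
        int rm = modrm & 7;
        // mod=01 means "base register + 8-bit displacement"; rm=4 would need a SIB byte.
        if (rex != 0x48 || opcode != 0x8b || mod != 1 || rm == 4) {
            throw new IllegalArgumentException("only REX.W MOV r64, [r64+disp8] is handled");
        }
        return String.format("mov %s, [%s+0x%x]", REG64[reg], REG64[rm], disp8);
    }

    /** Effective address of a base-plus-displacement memory operand. */
    static long effectiveAddress(long base, int disp8) {
        return base + disp8;
    }

    public static void main(String[] args) {
        // Bytes at pc=0x00007f294944b2c2 and RSI=0x1040, both taken from the dump.
        System.out.println(decodeLoad(0x48, 0x8b, 0x56, 0x40));
        System.out.printf("0x%x%n", effectiveAddress(0x1040L, 0x40));
    }
}
```

With RSI=0x0000000000001040 the load targets 0x1080, which matches si_addr in the siginfo line. RSI itself was loaded from the structure passed in RDI (bytes `48 8b 77 28` just before the pc), so this suggests that inflateEnd was handed a stream whose internal pointer was already invalid, not that the Java frames above it were at fault by themselves.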

Re: SIGSEGV error

Till Rohrmann
Hi Joshua,

could you check whether the job also fails when it does not use the gzip format? This could help us narrow down the culprit. Moreover, you could now try running your job and Flink with Java 11.

Cheers,
Till
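
To separate the JVM's zip code from Flink, one option is to exercise java.util.zip on the same JDK outside the job. The following is a minimal sketch with a hypothetical class `GzipRoundTrip`; it does not reproduce Flink's HDFS sink. It compresses and decompresses in a loop and uses try-with-resources, so the native Deflater and Inflater state is released on close rather than later by the finalizer thread (the crashing thread in the error file above was the Finalizer).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-alone check, not a Flink class.
class GzipRoundTrip {

    /** Compresses the bytes; closing the stream finishes the gzip trailer and frees the Deflater. */
    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data);
        }
        return buffer.toByteArray();
    }

    /** Decompresses the bytes; closing the stream frees the Inflater deterministically. */
    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "a message consumed from kafka".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < 10_000; i++) {
            if (!Arrays.equals(original, decompress(compress(original)))) {
                throw new IllegalStateException("round trip mismatch at iteration " + i);
            }
        }
        System.out.println("gzip round trip ok");
    }
}
```

If this loop runs cleanly on the cluster's JDK while the job still crashes with the gzip format enabled, the problem more likely lies in how the zip or jar resources are opened and closed inside the job than in the compression itself.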

On Tue, May 18, 2021 at 5:10 AM Joshua Fan <[hidden email]> wrote:
Hi all,

Most posts I found say that "Most of the times, the crashes in ZIP_GetEntry occur when the jar file being accessed has been modified/overwritten while the JVM instance was running.", but I do not know when, or which, jar file was modified by the job running in Flink.

For your information.

Yours sincerely
Josh

On Tue, May 18, 2021 at 10:15 AM Joshua Fan <[hidden email]> wrote:
Hi Stephan, Till

Recently I tried to upgrade a Flink job from 1.7 to 1.11. Unfortunately, a weird problem appeared: "SIGSEGV (0xb) at pc=0x0000000000000025, pid=135306, tid=140439001388800". The hs_err log for that pid is attached.
It is a simple job that consumes messages from Kafka and writes them to HDFS in gzip format. On 1.11 it runs for about 2 minutes, then the JVM crashes; the job restarts and the JVM crashes again, and this repeats until the application fails.
I also tried setting -Dsun.zip.disableMemoryMapping=true, but it did not help; the same crash keeps happening. Google suggests upgrading the JDK to JDK 9, but that is not feasible.
Any suggestions? Thanks a lot.

Yours sincerely
Josh

Stephan Ewen <[hidden email]> 于2019年9月13日周五 下午11:11写道:
Given that the segfault happens in the JVM's ZIP stream code, I am curious is this is a bug in Flink or in the JVM core libs, that happens to be triggered now by newer versions of FLink.

I found this on StackOverflow, which looks like it could be related: https://stackoverflow.com/questions/38326183/jvm-crashed-in-java-util-zip-zipfile-getentry
Can you try the suggested option "-Dsun.zip.disableMemoryMapping=true"?


On Fri, Sep 13, 2019 at 11:36 AM Till Rohrmann <[hidden email]> wrote:
Hi Marek,

could you share the logs statements which happened before the SIGSEGV with us? They might be helpful to understand what happened before. Moreover, it would be helpful to get access to your custom serializer implementations. I'm also pulling in Gordon who worked on the TypeSerializerSnapshot improvements.

Cheers,
Till

On Thu, Sep 12, 2019 at 9:28 AM Marek Maj <[hidden email]> wrote:
Hi everyone, 

Recently we decided to upgrade from flink 1.7.2 to 1.8.1. After an upgrade our task managers started to fail with SIGSEGV error from time to time.

In process of adjusting the code to 1.8.1, we noticed that there were some changes around TypeSerializerSnapshot interface and its implementations. At that time we had a few custom serializers which we decided to throw out during migration and then leverage flink default serializers. We don't mind clearing the state in the process of migration, an effort to migrate with state seems to be not worth it. 

Unfortunately after running new version we see SIGSEGV errors from time to time. It may be that serialization is not the real cause, but at the moment it seems to be the most probable reason. We have not performed any significant code changes besides serialization area.

We run job on yarn, hdp version 2.7.3.2.6.2.0-205. 
Checkpoint configuration: RocksDB backend, not incremental, 50s min processing time

You can find parts of JobManager log and ErrorFile log of failed container included below. 

Any suggestions are welcome

Best regards
Marek Maj

jobmanager.log

019-09-10 16:30:28.177 INFO  o.a.f.r.c.CheckpointCoordinator   - Completed checkpoint 47 for job c8a9ae03785ade86348c3189cf7dd965 (18532488122 bytes in 60871 ms).

2019-09-10 16:31:19.223 INFO  o.a.f.r.c.CheckpointCoordinator   - Triggering checkpoint 48 @ 1568111478177 for job c8a9ae03785ade86348c3189cf7dd965.

2019-09-10 16:32:19.280 INFO  o.a.f.r.c.CheckpointCoordinator   - Completed checkpoint 48 for job c8a9ae03785ade86348c3189cf7dd965 (19049515705 bytes in 61083 ms).

2019-09-10 16:33:10.480 INFO  o.a.f.r.c.CheckpointCoordinator   - Triggering checkpoint 49 @ 1568111589279 for job c8a9ae03785ade86348c3189cf7dd965.

2019-09-10 16:33:36.773 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:33:48.782 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:00.802 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:03.800 INFO  o.a.flink.yarn.YarnResourceManager   - The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023 timed out.

2019-09-10 16:34:03.801 INFO  o.a.flink.yarn.YarnResourceManager   - Closing TaskExecutor connection container_e67_1568017536744_0044_01_000023 because: The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023  timed out.

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - my-function (1/32) (ae416d03ddc94a3633673c4050b8f2ae) switched from RUNNING to FAILED.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.c.CheckpointCoordinator   - Discarding checkpoint 49 of job c8a9ae03785ade86348c3189cf7dd965.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - Job ProcessingJob (c8a9ae03785ade86348c3189cf7dd965) switched from state RUNNING to FAILING.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



hs_err_pid_262348.log for failed container

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f294944b2c2, pid=262348, tid=0x00007f2916833700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libzip.so+0xb2c2]  inflateEnd+0x32
#
# Core dump written. Default location: /data/hadoop/yarn/local/usercache/flink/appcache/application_1568017536744_0044/container_e67_1568017536744_0044_01_000023/core or core.262348
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f29440e8000):  JavaThread "Finalizer" daemon [_thread_in_native, id=262401, stack(0x00007f2916733000,0x00007f2916834000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000001080

Registers:
RAX=0x00007f0100000001, RBX=0x00007f2945e52770, RCX=0x0000000000000180, RDX=0x00007f2945e52770
RSP=0x00007f29168323d0, RBP=0x00007f29168323e0, RSI=0x0000000000001040, RDI=0x00007f2945e52770
R8 =0x00000007bff0f170, R9 =0x0000000000000006, R10=0x00007f2935017a08, R11=0x00007f294b583d50
R12=0x00007f29440e81f8, R13=0x00007f293135cc58, R14=0x00007f2916832490, R15=0x00007f29440e8000
RIP=0x00007f294944b2c2, EFLAGS=0x0000000000010202, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f29168323d0)
0x00007f29168323d0:   ffffffff440e8000 00007f2945e52770
0x00007f29168323e0:   00007f2916832400 00007f294944338e
0x00007f29168323f0:   00007f293135cc58 0000000000000000
0x00007f2916832400:   00007f2916832468 00007f2935017a34
0x00007f2916832410:   00007f2916832540 00007f293501306d
0x00007f2916832420:   00007f29350055d0 00007f2916832428
0x00007f2916832430:   0000000000000000 00007f2916832490
0x00007f2916832440:   00007f293135cd70 0000000000000000
0x00007f2916832450:   00007f293135cc58 0000000000000000
0x00007f2916832460:   00007f2916832488 00007f29168324e8
0x00007f2916832470:   00007f29350082bd 00000006ab616900
0x00007f2916832480:   00007f2935011538 00007f2945e52770
0x00007f2916832490:   00000007bff0f1e8 00000007bff0f1e8
0x00007f29168324a0:   00000007bff0f1e8 00007f2916832498
0x00007f29168324b0:   00007f293135c5e5 00007f2916832518
0x00007f29168324c0:   00007f293135cd70 00007f29313f9840
0x00007f29168324d0:   00007f293135c618 00007f2916832488
0x00007f29168324e0:   00007f2916832518 00007f2916832580
0x00007f29168324f0:   00007f29350082bd 0000000000000000
0x00007f2916832500:   00007f2945e52770 0000000000000000
0x00007f2916832510:   00000007bff0f1e8 00000007bff0cd38
0x00007f2916832520:   0000000000000009 00000007bff0f158
0x00007f2916832530:   0000006ce4720709 00000007bff0cd98
0x00007f2916832540:   00007f2916832520 00007f293132f631
0x00007f2916832550:   00007f29168325d8 00007f2931330ce0
0x00007f2916832560:   0000000000000000 00007f293132f6c0
0x00007f2916832570:   00007f2916832518 00007f29168325d8
0x00007f2916832580:   00007f2916832620 00007f29350082bd
0x00007f2916832590:   0000000000000000 0000000000000000
0x00007f29168325a0:   0000000000000000 0000000000000000
0x00007f29168325b0:   0000000000000000 0000000000000000
0x00007f29168325c0:   00000007bff0f158 00000007bff0cd38 

Instructions: (pc=0x00007f294944b2c2)
0x00007f294944b2a2:   fe ff ff ff 48 83 c4 08 5b c9 c3 0f 1f 00 48 8b
0x00007f294944b2b2:   77 28 48 85 f6 74 e8 48 8b 47 38 48 85 c0 74 df
0x00007f294944b2c2:   48 8b 56 40 48 85 d2 74 11 48 89 d6 48 8b 7f 40
0x00007f294944b2d2:   ff d0 48 8b 43 38 48 8b 73 28 48 8b 7b 40 ff d0 

Register to memory mapping:

RAX=0x00007f0100000001 is an unknown value
RBX=0x00007f2945e52770 is an unknown value
RCX=0x0000000000000180 is an unknown value
RDX=0x00007f2945e52770 is an unknown value
RSP=0x00007f29168323d0 is pointing into the stack for thread: 0x00007f29440e8000
RBP=0x00007f29168323e0 is pointing into the stack for thread: 0x00007f29440e8000
RSI=0x0000000000001040 is an unknown value
RDI=0x00007f2945e52770 is an unknown value
R8 =0x00000007bff0f170 is an oop
[Ljava.lang.Object; 
 - klass: 'java/lang/Object'[]
 - length: 16
R9 =0x0000000000000006 is an unknown value
R10=0x00007f2935017a08 is at code_begin+808 in an Interpreter codelet
method entry point (kind = native)  [0x00007f29350176e0, 0x00007f2935017fe0]  2304 bytes
R11=0x00007f294b583d50: <offset 0x9c3d50> in /usr/jdk64/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so at 0x00007f294abc0000
R12=0x00007f29440e81f8 is an unknown value
R13={method} {0x00007f293135cc58} 'end' '(J)V' in 'java/util/zip/Inflater'
R14=0x00007f2916832490 is pointing into the stack for thread: 0x00007f29440e8000
R15=0x00007f29440e8000 is a thread


Stack: [0x00007f2916733000,0x00007f2916834000],  sp=0x00007f29168323d0,  free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libzip.so+0xb2c2]  inflateEnd+0x32
C  [libzip.so+0x338e]  Java_java_util_zip_Inflater_end+0x1e
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x690c66]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1056
V  [libjvm.so+0x691171]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
V  [libjvm.so+0x691617]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x47
V  [libjvm.so+0x72c990]  thread_entry(JavaThread*, Thread*)+0xa0
V  [libjvm.so+0xa755f3]  JavaThread::thread_main_inner()+0x103
V  [libjvm.so+0xa7573c]  JavaThread::run()+0x11c
V  [libjvm.so+0x926138]  java_start(Thread*)+0x108
C  [libpthread.so.0+0x7e25]  start_thread+0xc5

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub
Re: SIGSEGV error

Joshua Fan
Hi Till,
I also tried the job without gzip and it ran into the same error.
But the problem is solved now. I was about to give up on it when I found the mail at http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JVM-crash-SIGSEGV-in-ZIP-GetEntry-td17326.html, so I think it may have had something to do with serialization.
What I changed:
Before:
OperatorStateStore stateStore = context.getOperatorStateStore();
ListStateDescriptor lsd = new ListStateDescriptor("bucket-states", State.class);
After:
OperatorStateStore stateStore = context.getOperatorStateStore();
ListStateDescriptor lsd = new ListStateDescriptor("bucket-states", new JavaSerializer());
Hope this is helpful.
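
For context, here is a minimal sketch of what the change above relies on. It assumes the JavaSerializer in the snippet is Flink's org.apache.flink.runtime.state.JavaSerializer, which uses standard Java serialization, so the state class has to implement java.io.Serializable and survive a round trip. The BucketState class and its fields are hypothetical stand-ins for the State class in the snippet, and the example uses only the JDK, not Flink.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Objects;

class SerializationRoundTrip {

    // Hypothetical stand-in for the State class from the post (assumption).
    static final class BucketState implements Serializable {
        private static final long serialVersionUID = 1L;

        final String bucketPath;
        final long offset;

        BucketState(String bucketPath, long offset) {
            this.bucketPath = bucketPath;
            this.offset = offset;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) {
                return true;
            }
            if (!(other instanceof BucketState)) {
                return false;
            }
            BucketState that = (BucketState) other;
            return offset == that.offset && Objects.equals(bucketPath, that.bucketPath);
        }

        @Override
        public int hashCode() {
            return Objects.hash(bucketPath, offset);
        }
    }

    // Writes the value with plain Java serialization and returns the bytes.
    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    // Reads one object back from bytes produced by serialize().
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        BucketState original = new BucketState("part-0", 42L);
        Object copy = deserialize(serialize(original));
        System.out.println(original.equals(copy)); // prints "true"
    }
}
```

If the round trip fails here (for example with a NotSerializableException), the same class would likely fail as operator state with this serializer too.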

Yours sincerely
Josh




at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:00.802 WARN  o.a.f.r.r.h.l.m.MetricFetcherImpl   - Requesting TaskManager's path for query services failed.

java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)

at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:816)

at akka.dispatch.OnComplete.internal(Future.scala:258)

at akka.dispatch.OnComplete.internal(Future.scala:256)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)

at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)

at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)

at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

at java.lang.Thread.run(Thread.java:745)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#374570759]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

... 9 common frames omitted

2019-09-10 16:34:03.800 INFO  o.a.flink.yarn.YarnResourceManager   - The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023 timed out.

2019-09-10 16:34:03.801 INFO  o.a.flink.yarn.YarnResourceManager   - Closing TaskExecutor connection container_e67_1568017536744_0044_01_000023 because: The heartbeat of TaskManager with id container_e67_1568017536744_0044_01_000023  timed out.

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - my-function (1/32) (ae416d03ddc94a3633673c4050b8f2ae) switched from RUNNING to FAILED.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.c.CheckpointCoordinator   - Discarding checkpoint 49 of job c8a9ae03785ade86348c3189cf7dd965.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - Job ProcessingJob (c8a9ae03785ade86348c3189cf7dd965) switched from state RUNNING to FAILING.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



hs_err_pid_262348.log for failed container

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f294944b2c2, pid=262348, tid=0x00007f2916833700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libzip.so+0xb2c2]  inflateEnd+0x32
#
# Core dump written. Default location: /data/hadoop/yarn/local/usercache/flink/appcache/application_1568017536744_0044/container_e67_1568017536744_0044_01_000023/core or core.262348
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f29440e8000):  JavaThread "Finalizer" daemon [_thread_in_native, id=262401, stack(0x00007f2916733000,0x00007f2916834000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000001080

Registers:
RAX=0x00007f0100000001, RBX=0x00007f2945e52770, RCX=0x0000000000000180, RDX=0x00007f2945e52770
RSP=0x00007f29168323d0, RBP=0x00007f29168323e0, RSI=0x0000000000001040, RDI=0x00007f2945e52770
R8 =0x00000007bff0f170, R9 =0x0000000000000006, R10=0x00007f2935017a08, R11=0x00007f294b583d50
R12=0x00007f29440e81f8, R13=0x00007f293135cc58, R14=0x00007f2916832490, R15=0x00007f29440e8000
RIP=0x00007f294944b2c2, EFLAGS=0x0000000000010202, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f29168323d0)
0x00007f29168323d0:   ffffffff440e8000 00007f2945e52770
0x00007f29168323e0:   00007f2916832400 00007f294944338e
0x00007f29168323f0:   00007f293135cc58 0000000000000000
0x00007f2916832400:   00007f2916832468 00007f2935017a34
0x00007f2916832410:   00007f2916832540 00007f293501306d
0x00007f2916832420:   00007f29350055d0 00007f2916832428
0x00007f2916832430:   0000000000000000 00007f2916832490
0x00007f2916832440:   00007f293135cd70 0000000000000000
0x00007f2916832450:   00007f293135cc58 0000000000000000
0x00007f2916832460:   00007f2916832488 00007f29168324e8
0x00007f2916832470:   00007f29350082bd 00000006ab616900
0x00007f2916832480:   00007f2935011538 00007f2945e52770
0x00007f2916832490:   00000007bff0f1e8 00000007bff0f1e8
0x00007f29168324a0:   00000007bff0f1e8 00007f2916832498
0x00007f29168324b0:   00007f293135c5e5 00007f2916832518
0x00007f29168324c0:   00007f293135cd70 00007f29313f9840
0x00007f29168324d0:   00007f293135c618 00007f2916832488
0x00007f29168324e0:   00007f2916832518 00007f2916832580
0x00007f29168324f0:   00007f29350082bd 0000000000000000
0x00007f2916832500:   00007f2945e52770 0000000000000000
0x00007f2916832510:   00000007bff0f1e8 00000007bff0cd38
0x00007f2916832520:   0000000000000009 00000007bff0f158
0x00007f2916832530:   0000006ce4720709 00000007bff0cd98
0x00007f2916832540:   00007f2916832520 00007f293132f631
0x00007f2916832550:   00007f29168325d8 00007f2931330ce0
0x00007f2916832560:   0000000000000000 00007f293132f6c0
0x00007f2916832570:   00007f2916832518 00007f29168325d8
0x00007f2916832580:   00007f2916832620 00007f29350082bd
0x00007f2916832590:   0000000000000000 0000000000000000
0x00007f29168325a0:   0000000000000000 0000000000000000
0x00007f29168325b0:   0000000000000000 0000000000000000
0x00007f29168325c0:   00000007bff0f158 00000007bff0cd38 

Instructions: (pc=0x00007f294944b2c2)
0x00007f294944b2a2:   fe ff ff ff 48 83 c4 08 5b c9 c3 0f 1f 00 48 8b
0x00007f294944b2b2:   77 28 48 85 f6 74 e8 48 8b 47 38 48 85 c0 74 df
0x00007f294944b2c2:   48 8b 56 40 48 85 d2 74 11 48 89 d6 48 8b 7f 40
0x00007f294944b2d2:   ff d0 48 8b 43 38 48 8b 73 28 48 8b 7b 40 ff d0 

Register to memory mapping:

RAX=0x00007f0100000001 is an unknown value
RBX=0x00007f2945e52770 is an unknown value
RCX=0x0000000000000180 is an unknown value
RDX=0x00007f2945e52770 is an unknown value
RSP=0x00007f29168323d0 is pointing into the stack for thread: 0x00007f29440e8000
RBP=0x00007f29168323e0 is pointing into the stack for thread: 0x00007f29440e8000
RSI=0x0000000000001040 is an unknown value
RDI=0x00007f2945e52770 is an unknown value
R8 =0x00000007bff0f170 is an oop
[Ljava.lang.Object; 
 - klass: 'java/lang/Object'[]
 - length: 16
R9 =0x0000000000000006 is an unknown value
R10=0x00007f2935017a08 is at code_begin+808 in an Interpreter codelet
method entry point (kind = native)  [0x00007f29350176e0, 0x00007f2935017fe0]  2304 bytes
R11=0x00007f294b583d50: <offset 0x9c3d50> in /usr/jdk64/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so at 0x00007f294abc0000
R12=0x00007f29440e81f8 is an unknown value
R13={method} {0x00007f293135cc58} 'end' '(J)V' in 'java/util/zip/Inflater'
R14=0x00007f2916832490 is pointing into the stack for thread: 0x00007f29440e8000
R15=0x00007f29440e8000 is a thread


Stack: [0x00007f2916733000,0x00007f2916834000],  sp=0x00007f29168323d0,  free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libzip.so+0xb2c2]  inflateEnd+0x32
C  [libzip.so+0x338e]  Java_java_util_zip_Inflater_end+0x1e
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x690c66]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1056
V  [libjvm.so+0x691171]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
V  [libjvm.so+0x691617]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x47
V  [libjvm.so+0x72c990]  thread_entry(JavaThread*, Thread*)+0xa0
V  [libjvm.so+0xa755f3]  JavaThread::thread_main_inner()+0x103
V  [libjvm.so+0xa7573c]  JavaThread::run()+0x11c
V  [libjvm.so+0x926138]  java_start(Thread*)+0x108
C  [libpthread.so.0+0x7e25]  start_thread+0xc5

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub

Re: SIGSEGV error

Till Rohrmann
Great to hear that you fixed the problem by specifying an explicit serializer for the state.

Cheers,
Till

On Tue, May 18, 2021 at 9:43 AM Joshua Fan <[hidden email]> wrote:
Hi Till,
I also tried the job without gzip; it ran into the same error.
But the problem is solved now. I was about to give up when I found the mail at http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JVM-crash-SIGSEGV-in-ZIP-GetEntry-td17326.html, which made me think it might be related to serialization.
What I changed:
Before:
OperatorStateStore stateStore = context.getOperatorStateStore();
ListStateDescriptor<State> lsd = new ListStateDescriptor<>("bucket-states", State.class);
After:
OperatorStateStore stateStore = context.getOperatorStateStore();
ListStateDescriptor<State> lsd = new ListStateDescriptor<>("bucket-states", new JavaSerializer<State>());
Hope this is helpful.
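
For readers wondering what this change switches between: as far as I know, giving the descriptor State.class lets Flink choose a serializer through its type extraction, while JavaSerializer goes through plain Java serialization (ObjectOutputStream). The thread does not establish why this avoids the crash. The following is only a standard-library sketch of a Java-serialization round trip; BucketState and its fields are made up for illustration and are not the actual State class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for the State class; the fields are invented for illustration.
final class BucketState implements Serializable {
    private static final long serialVersionUID = 1L;
    final String bucketPath;
    final long offset;

    BucketState(String bucketPath, long offset) {
        this.bucketPath = bucketPath;
        this.offset = offset;
    }
}

final class JavaSerializationRoundTrip {
    // Write the object with plain Java serialization into a byte array.
    static byte[] toBytes(Serializable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    // Read the object back; the caller decides the expected type.
    @SuppressWarnings("unchecked")
    static <T> T fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        BucketState restored = fromBytes(toBytes(new BucketState("/tmp/bucket-0", 42L)));
        System.out.println(restored.bucketPath + " @ " + restored.offset);
    }
}
```

Note that this path only works if the state class (and everything it references) implements Serializable.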

Yours sincerely
Josh



On Tue, May 18, 2021 at 2:54 PM Till Rohrmann <[hidden email]> wrote:
Hi Joshua,

could you try whether the job also fails when not using the gzip format? This could help us narrow down the culprit. Moreover, you could try to run your job and Flink with Java 11 now.

Cheers,
Till

On Tue, May 18, 2021 at 5:10 AM Joshua Fan <[hidden email]> wrote:
Hi all,

Most posts say that "Most of the times, the crashes in ZIP_GetEntry occur when the jar file being accessed has been modified/overwritten while the JVM instance was running.", but I do not know when, or which, jar file would have been modified while the job was running in Flink.

For your information.
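
One way to narrow down the "which jar was modified" question might be to fingerprint the jars the JVM uses and compare them again later. The following is a rough sketch, not something taken from this thread: JarChangeDetector is a hypothetical helper, and classpathJars() only sees jars listed in java.class.path, so jars loaded through Flink's user-code class loader would have to be passed in explicitly.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: remembers size and modification time of files
// and later reports which of them look different.
final class JarChangeDetector {
    // absolute path -> "size:lastModified" recorded when the detector was created
    private final Map<String, String> baseline = new LinkedHashMap<>();

    JarChangeDetector(List<File> files) {
        for (File f : files) {
            baseline.put(f.getAbsolutePath(), fingerprint(f));
        }
    }

    // Jars on the JVM classpath only (an assumption about where to look).
    static List<File> classpathJars() {
        List<File> jars = new ArrayList<>();
        String cp = System.getProperty("java.class.path", "");
        for (String entry : cp.split(File.pathSeparator)) {
            if (entry.endsWith(".jar")) {
                jars.add(new File(entry));
            }
        }
        return jars;
    }

    private static String fingerprint(File f) {
        return f.exists() ? f.length() + ":" + f.lastModified() : "missing";
    }

    // Paths whose size or modification time differs from the recorded baseline.
    List<String> changedSinceSnapshot() {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> e : baseline.entrySet()) {
            if (!e.getValue().equals(fingerprint(new File(e.getKey())))) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }
}
```

A detector created at startup with JarChangeDetector.classpathJars() could be polled from a background thread and its result logged; a rewrite that keeps both size and timestamp identical would go unnoticed by this check.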

Yours sincerely
Josh

On Tue, May 18, 2021 at 10:15 AM Joshua Fan <[hidden email]> wrote:
Hi Stephan, Till

Recently I tried to upgrade a Flink job from 1.7 to 1.11. Unfortunately, this weird problem appeared: "SIGSEGV (0xb) at pc=0x0000000000000025, pid=135306, tid=140439001388800". The pid log is attached.
It is a simple job that consumes messages from Kafka and writes them to HDFS in gzip format. On 1.11 it runs for about 2 minutes, then the JVM crashes; the job restarts and the JVM crashes again, until the application fails.
I also tried setting -Dsun.zip.disableMemoryMapping=true, but it did not help; the same crash keeps happening. Google suggests upgrading the JDK to 1.9, but that is not feasible.
Any suggestions? Thanks a lot.

Yours sincerely
Josh

On Fri, Sep 13, 2019 at 11:11 PM Stephan Ewen <[hidden email]> wrote:
Given that the segfault happens in the JVM's ZIP stream code, I am curious whether this is a bug in Flink or in the JVM core libs that happens to be triggered now by newer versions of Flink.

I found this on StackOverflow, which looks like it could be related: https://stackoverflow.com/questions/38326183/jvm-crashed-in-java-util-zip-zipfile-getentry
Can you try the suggested option "-Dsun.zip.disableMemoryMapping=true"?
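
When trying this option, it may be worth confirming that it actually reached the crashing TaskManager JVM and not only the client or JobManager. A small sketch of such a check using only the standard management API follows; JvmFlagCheck is a hypothetical helper name, and how the option is handed to the TaskManagers (for example via env.java.opts in flink-conf.yaml) is an assumption about the setup.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper for checking whether a -D option was passed to this JVM.
final class JvmFlagCheck {
    // True if "-D<key>" or "-D<key>=<value>" appears among the given JVM arguments.
    static boolean hasSystemPropertyArg(List<String> jvmArgs, String key) {
        String prefix = "-D" + key;
        for (String arg : jvmArgs) {
            if (arg.equals(prefix) || arg.startsWith(prefix + "=")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments the running JVM was started with (excluding main-class arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String key = "sun.zip.disableMemoryMapping";
        System.out.println("passed on command line: " + hasSystemPropertyArg(jvmArgs, key));
        System.out.println("value seen by JVM: " + System.getProperty(key));
    }
}
```

The same two lines could be logged once from an operator's open() method so that the output lands in the TaskManager log of the affected container.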


On Fri, Sep 13, 2019 at 11:36 AM Till Rohrmann <[hidden email]> wrote:
Hi Marek,

could you share the log statements from before the SIGSEGV with us? They might help us understand what led up to it. Moreover, it would be helpful to get access to your custom serializer implementations. I'm also pulling in Gordon who worked on the TypeSerializerSnapshot improvements.

Cheers,
Till

On Thu, Sep 12, 2019 at 9:28 AM Marek Maj <[hidden email]> wrote:
Hi everyone, 

Recently we decided to upgrade from Flink 1.7.2 to 1.8.1. After the upgrade, our task managers started to fail with a SIGSEGV error from time to time.

While adjusting the code to 1.8.1, we noticed some changes around the TypeSerializerSnapshot interface and its implementations. At that time we had a few custom serializers, which we decided to drop during the migration in favor of Flink's default serializers. We don't mind clearing the state during the migration; migrating with state does not seem worth the effort.

Unfortunately, after running the new version we see SIGSEGV errors from time to time. Serialization may not be the real cause, but at the moment it seems the most probable one. We have not made any significant code changes outside the serialization area.

We run the job on YARN, HDP version 2.7.3.2.6.2.0-205.
Checkpoint configuration: RocksDB backend, not incremental, 50s min processing time

You can find parts of the JobManager log and the ErrorFile log of the failed container included below.

Any suggestions are welcome.

Best regards
Marek Maj


2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - my-function (1/32) (ae416d03ddc94a3633673c4050b8f2ae) switched from RUNNING to FAILED.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.c.CheckpointCoordinator   - Discarding checkpoint 49 of job c8a9ae03785ade86348c3189cf7dd965.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2019-09-10 16:34:03.803 INFO  o.a.f.r.e.ExecutionGraph   - Job ProcessingJob (c8a9ae03785ade86348c3189cf7dd965) switched from state RUNNING to FAILING.

org.apache.flink.util.FlinkException: The assigned slot container_e67_1568017536744_0044_01_000023_0 was removed.

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1080)

at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)

at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1187)

at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)

at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)

at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)

at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

at akka.actor.ActorCell.invoke(ActorCell.scala:495)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

at akka.dispatch.Mailbox.run(Mailbox.scala:224)

at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

hs_err_pid_262348.log for failed container

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f294944b2c2, pid=262348, tid=0x00007f2916833700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_112-b15) (build 1.8.0_112-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.112-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libzip.so+0xb2c2]  inflateEnd+0x32
#
# Core dump written. Default location: /data/hadoop/yarn/local/usercache/flink/appcache/application_1568017536744_0044/container_e67_1568017536744_0044_01_000023/core or core.262348
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f29440e8000):  JavaThread "Finalizer" daemon [_thread_in_native, id=262401, stack(0x00007f2916733000,0x00007f2916834000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000001080

Registers:
RAX=0x00007f0100000001, RBX=0x00007f2945e52770, RCX=0x0000000000000180, RDX=0x00007f2945e52770
RSP=0x00007f29168323d0, RBP=0x00007f29168323e0, RSI=0x0000000000001040, RDI=0x00007f2945e52770
R8 =0x00000007bff0f170, R9 =0x0000000000000006, R10=0x00007f2935017a08, R11=0x00007f294b583d50
R12=0x00007f29440e81f8, R13=0x00007f293135cc58, R14=0x00007f2916832490, R15=0x00007f29440e8000
RIP=0x00007f294944b2c2, EFLAGS=0x0000000000010202, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f29168323d0)
0x00007f29168323d0:   ffffffff440e8000 00007f2945e52770
0x00007f29168323e0:   00007f2916832400 00007f294944338e
0x00007f29168323f0:   00007f293135cc58 0000000000000000
0x00007f2916832400:   00007f2916832468 00007f2935017a34
0x00007f2916832410:   00007f2916832540 00007f293501306d
0x00007f2916832420:   00007f29350055d0 00007f2916832428
0x00007f2916832430:   0000000000000000 00007f2916832490
0x00007f2916832440:   00007f293135cd70 0000000000000000
0x00007f2916832450:   00007f293135cc58 0000000000000000
0x00007f2916832460:   00007f2916832488 00007f29168324e8
0x00007f2916832470:   00007f29350082bd 00000006ab616900
0x00007f2916832480:   00007f2935011538 00007f2945e52770
0x00007f2916832490:   00000007bff0f1e8 00000007bff0f1e8
0x00007f29168324a0:   00000007bff0f1e8 00007f2916832498
0x00007f29168324b0:   00007f293135c5e5 00007f2916832518
0x00007f29168324c0:   00007f293135cd70 00007f29313f9840
0x00007f29168324d0:   00007f293135c618 00007f2916832488
0x00007f29168324e0:   00007f2916832518 00007f2916832580
0x00007f29168324f0:   00007f29350082bd 0000000000000000
0x00007f2916832500:   00007f2945e52770 0000000000000000
0x00007f2916832510:   00000007bff0f1e8 00000007bff0cd38
0x00007f2916832520:   0000000000000009 00000007bff0f158
0x00007f2916832530:   0000006ce4720709 00000007bff0cd98
0x00007f2916832540:   00007f2916832520 00007f293132f631
0x00007f2916832550:   00007f29168325d8 00007f2931330ce0
0x00007f2916832560:   0000000000000000 00007f293132f6c0
0x00007f2916832570:   00007f2916832518 00007f29168325d8
0x00007f2916832580:   00007f2916832620 00007f29350082bd
0x00007f2916832590:   0000000000000000 0000000000000000
0x00007f29168325a0:   0000000000000000 0000000000000000
0x00007f29168325b0:   0000000000000000 0000000000000000
0x00007f29168325c0:   00000007bff0f158 00000007bff0cd38 

Instructions: (pc=0x00007f294944b2c2)
0x00007f294944b2a2:   fe ff ff ff 48 83 c4 08 5b c9 c3 0f 1f 00 48 8b
0x00007f294944b2b2:   77 28 48 85 f6 74 e8 48 8b 47 38 48 85 c0 74 df
0x00007f294944b2c2:   48 8b 56 40 48 85 d2 74 11 48 89 d6 48 8b 7f 40
0x00007f294944b2d2:   ff d0 48 8b 43 38 48 8b 73 28 48 8b 7b 40 ff d0 

Register to memory mapping:

RAX=0x00007f0100000001 is an unknown value
RBX=0x00007f2945e52770 is an unknown value
RCX=0x0000000000000180 is an unknown value
RDX=0x00007f2945e52770 is an unknown value
RSP=0x00007f29168323d0 is pointing into the stack for thread: 0x00007f29440e8000
RBP=0x00007f29168323e0 is pointing into the stack for thread: 0x00007f29440e8000
RSI=0x0000000000001040 is an unknown value
RDI=0x00007f2945e52770 is an unknown value
R8 =0x00000007bff0f170 is an oop
[Ljava.lang.Object; 
 - klass: 'java/lang/Object'[]
 - length: 16
R9 =0x0000000000000006 is an unknown value
R10=0x00007f2935017a08 is at code_begin+808 in an Interpreter codelet
method entry point (kind = native)  [0x00007f29350176e0, 0x00007f2935017fe0]  2304 bytes
R11=0x00007f294b583d50: <offset 0x9c3d50> in /usr/jdk64/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so at 0x00007f294abc0000
R12=0x00007f29440e81f8 is an unknown value
R13={method} {0x00007f293135cc58} 'end' '(J)V' in 'java/util/zip/Inflater'
R14=0x00007f2916832490 is pointing into the stack for thread: 0x00007f29440e8000
R15=0x00007f29440e8000 is a thread


Stack: [0x00007f2916733000,0x00007f2916834000],  sp=0x00007f29168323d0,  free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libzip.so+0xb2c2]  inflateEnd+0x32
C  [libzip.so+0x338e]  Java_java_util_zip_Inflater_end+0x1e
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x690c66]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1056
V  [libjvm.so+0x691171]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
V  [libjvm.so+0x691617]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x47
V  [libjvm.so+0x72c990]  thread_entry(JavaThread*, Thread*)+0xa0
V  [libjvm.so+0xa755f3]  JavaThread::thread_main_inner()+0x103
V  [libjvm.so+0xa7573c]  JavaThread::run()+0x11c
V  [libjvm.so+0x926138]  java_start(Thread*)+0x108
C  [libpthread.so.0+0x7e25]  start_thread+0xc5

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  java.util.zip.Inflater.end(J)V+0
j  java.util.zip.Inflater.end()V+29
j  java.util.zip.ZipFile.close()V+169
j  sun.net.www.protocol.jar.URLJarFile.close()V+18
j  sun.net.www.protocol.jar.URLJarFile.finalize()V+1
J 9535% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f293674cec0 [0x00007f293674cc00+0x2c0]
v  ~StubRoutines::call_stub