(DEPRECATED) Apache Flink User Mailing List archive. - Uncaught exception in FatalExitExceptionHandler causing JM crash while canceling job

Uncaught exception in FatalExitExceptionHandler causing JM crash while canceling job

Posted by Kelly Smith on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Uncaught-exception-in-FatalExitExceptionHandler-causing-JM-crash-while-canceling-job-tp40627.html

Hi folks,

I recently upgraded to Flink 1.12.0 and I’m hitting an issue where my JM crashes while cancelling a job. The crash causes the Kubernetes readiness probes to fail and the JM to be restarted. The restarted JM then gets into a bad state while it tries to recover using ZK and a checkpoint which no longer exists.

This is the only information being logged before the process exits:

method: uncaughtException
   msg: FATAL: Thread 'cluster-io-thread-4' produced an uncaught exception. Stopping the process...
   pod: dev-dsp-flink-canary-test-9fa6d3e7-jm-59884f579-w8r6x
   stack: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@41554407 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@5d0ec6f7[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 25977]
      at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
      at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
      at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326)
      at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
      at java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:622)
      at java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668)
      at org.apache.flink.runtime.concurrent.ScheduledExecutorServiceAdapter.execute(ScheduledExecutorServiceAdapter.java:62)
      at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.scheduleTriggerRequest(CheckpointCoordinator.java:1152)
      at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.lambda$cleanCheckpoint$0(CheckpointsCleaner.java:58)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)

https://github.com/apache/flink/blob/release-1.12.0/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointsCleaner.java#L58

I’m not sure how to debug this further, but it seems like an internal Flink bug?
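For reading the trace: the JDK part of it is the standard behaviour of an executor that has already been shut down. Any further execute() call on it is rejected with a RejectedExecutionException, and if that call happens inside a task started with execute() on another pool, the exception escapes to that thread's uncaught-exception handler. Below is a minimal sketch of just that JDK mechanism. It is not Flink code and is only an assumption about how the two executors in the trace might interact. The class name, the thread name "sketch-io-thread" and the recording handler (a stand-in for a handler that would stop the process) are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RejectedAfterShutdownSketch {

    // Runs a task on a hypothetical "io" pool that tries to hand work to a
    // timer which was already shut down, and returns the exception that
    // reached the io thread's uncaught-exception handler.
    static Throwable reproduce() {
        CompletableFuture<Throwable> uncaught = new CompletableFuture<>();

        ExecutorService ioExecutor = Executors.newSingleThreadExecutor(runnable -> {
            Thread t = new Thread(runnable, "sketch-io-thread");
            // Stand-in for a fatal handler: record the error instead of exiting.
            t.setUncaughtExceptionHandler((thread, error) -> uncaught.complete(error));
            return t;
        });

        // Assumption: the timer is terminated before the io task runs,
        // for example because the job has just been cancelled.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.shutdownNow();

        // execute() (unlike submit()) lets the exception escape to the handler.
        ioExecutor.execute(() -> timer.execute(() -> { }));

        try {
            return uncaught.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("handler was not invoked", e);
        } finally {
            ioExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // prints "java.util.concurrent.RejectedExecutionException"
        System.out.println(reproduce().getClass().getName());
    }
}
```

If the real cause is similar, the thing to look for would be an ordering problem between shutting down the timer and running the remaining cleanup tasks, but that is a guess from the trace and not something the sketch proves.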

More info:

Checkpoints are stored in S3 and I’m using the S3 connector
Identical code has been running on Flink 1.11.x for months with no issues

Thanks,

Kelly