Flink 1.10 permanent JVM hang when stopped

Hunter Herman

Hi Flink users!

TL;DR: My Flink TaskManagers frequently hang permanently in a shutdown handler’s Thread.sleep() call when I issue a stop, and I’ve hit a wall trying to debug it. https://issues.apache.org/jira/browse/FLINK-17470

I’m really scratching my head at this issue. In one particular environment where we have set up Flink 1.10 (on AWS boxes running CentOS 7) with HA JobManagers, the Flink TaskManagers will sometimes (fairly often) enter a permanent hang when we try to stop them with the taskmanager script. The hang appears to happen in org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run, inside a Thread.sleep() call. My googling turned up reports of hangs in Thread.sleep() being caused by deadlocks at an OS (?) level. The most obvious difference to me is that in our case every thread in the JVM is blocked on the pthread_wait() syscall.
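For readers unfamiliar with the pattern, here is a rough sketch of a "delayed terminator" style shutdown safeguard: a shutdown hook starts a daemon thread that sleeps for a grace period and then force-halts the JVM if a clean shutdown has not finished. This is not Flink's actual source. The class name, method names, and exit code below are hypothetical, and the forced-exit action is injectable only so the sketch can be exercised without killing the process.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative sketch only; names are hypothetical, not Flink's implementation. */
final class DelayedTerminatorSketch {

    /** Sleeps for the grace period, then runs the forced-exit action. */
    static final class DelayedTerminator implements Runnable {
        private final long delayMillis;
        private final Runnable forcedExit;

        DelayedTerminator(long delayMillis, Runnable forcedExit) {
            this.delayMillis = delayMillis;
            this.forcedExit = forcedExit;
        }

        @Override
        public void run() {
            try {
                // This is the kind of Thread.sleep() call the hung thread is parked in.
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                // Interrupted before the grace period elapsed: give up quietly.
                Thread.currentThread().interrupt();
                return;
            }
            forcedExit.run();
        }
    }

    /** Starts the terminator on a daemon thread so it never keeps the JVM alive itself. */
    static Thread startTerminator(long delayMillis, Runnable forcedExit) {
        Thread t = new Thread(new DelayedTerminator(delayMillis, forcedExit), "delayed-terminator");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Registers a shutdown hook that arms the terminator once shutdown begins. */
    static void install(long delayMillis) {
        Runtime.getRuntime().addShutdownHook(new Thread(
                // Runtime.halt() skips remaining shutdown hooks; exit code 1 is an arbitrary placeholder.
                () -> startTerminator(delayMillis, () -> Runtime.getRuntime().halt(1)),
                "shutdown-safeguard"));
    }
}
```

In a design like this, the sleeping thread is expected to wake up after the delay and halt the process, which is why a JVM that stays parked in that sleep indefinitely is surprising.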

 

Anyway, I’m at a loss here. If anyone in the Flink community has ever seen an issue like this, I would love to hear your insight! Stack traces and OS version information are in the linked ticket if anyone’s curious.

Thanks!

-Hunter Herman