After job cancel, leftover ZK state prevents job manager startup


Micah Wylde
Hello,

We've been seeing an issue with several Flink 1.5.4 clusters that looks like this:

1. Job is cancelled with a savepoint
2. The jar is deleted from our HA blobstore (S3)
3. The jobgraph in ZK is *not* deleted
4. We restart the cluster
5. Startup fails in recovery because the jar is not available, with the stacktrace:

00:13:58.486 ERROR o.a.f.r.e.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
    at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Cannot set up the user code libraries: No such file or directory: s3://streamingplatform-production/{JOB_NAME}/flink/highavailability/{JOB_NAME}/blob/job_5a3fe2c00c05efd3a552a1c6707d2c10/blob_p-6d585831f5c947335ac505b400cf8f3630cc706a-42355c2885b668b0bc5e15b856141b0
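The failure mode can be sketched abstractly: HA recovery enumerates the job graphs registered in ZooKeeper and then fetches each job's blobs (the user jar), so a job graph entry that outlives its blobs is fatal at startup. The following Python sketch is purely illustrative of that sequence, not Flink's actual recovery code; all names (`recover_jobs`, `BlobMissingError`, the dict-based stores) are hypothetical.

```python
# Illustrative model of the recovery sequence described above.
# These names are NOT Flink's real APIs; the dicts stand in for the
# ZooKeeper jobgraph store and the HA blobstore (S3).

class BlobMissingError(Exception):
    pass

def recover_jobs(zk_jobgraphs: dict, blobstore: dict) -> list:
    """Recover every job graph registered in ZK; fail fast if a
    referenced blob was already deleted from the HA blobstore."""
    recovered = []
    for job_id, blob_key in zk_jobgraphs.items():
        if blob_key not in blobstore:
            # Mirrors "Cannot set up the user code libraries:
            # No such file or directory: s3://..." in the trace above.
            raise BlobMissingError(
                f"job {job_id}: blob {blob_key} not found in HA blobstore")
        recovered.append(job_id)
    return recovered

# Healthy case: jobgraph and blob are both present.
assert recover_jobs({"job_a": "blob_1"}, {"blob_1": b"jar"}) == ["job_a"]

# The reported bug: cancel deleted the blob but left the jobgraph
# in ZK, so cluster startup aborts instead of skipping the stale entry.
try:
    recover_jobs({"job_a": "blob_1"}, {})
except BlobMissingError as e:
    print("startup failed:", e)
```

Consistent with this model, an operational workaround is to remove the stale job graph node under Flink's ZooKeeper jobgraphs path before restarting the cluster; the exact node path depends on your `high-availability.zookeeper.path.*` configuration.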

This superficially seems similar to several issues that have apparently been fixed in 1.5.4, like FLINK-10255 and FLINK-10184.

Has anybody else seen this issue on 1.5.4 (or later) clusters? Or any advice for debugging?

Thanks,
Micah

Re: After job cancel, leftover ZK state prevents job manager startup

Stefan Richter
Hi,

Thanks for reporting the problem. The exception trace indeed looks very similar to the traces in the discussion for FLINK-10184. I will pull in Till, who worked on the fix, to hear his opinion. Maybe the current fix only made the problem less likely to appear but is not yet complete?

Best,
Stefan



Re: After job cancel, leftover ZK state prevents job manager startup

Till Rohrmann
Hi Micah,

the problem indeed looks similar to FLINK-10255. Could you tell me your exact cluster setup (HA with standby JobManagers?)? Moreover, DEBUG-level logs from all JobManagers would be helpful for further debugging.

Cheers,
Till



Re: After job cancel, leftover ZK state prevents job manager startup

Micah Wylde
We have a single JobManager in an HA setup. From the logs and metrics, it appears that before the issue occurred there was a long (15 s) GC pause on the JobManager, which triggered a leadership election. Because there is only one JobManager, the same one became leader again after it recovered.

In the logs, we see:

2018-11-15 19:28:10.039 INFO Client session timed out, have not heard from server in 27384ms for sessionid 0x3655a5782843c03, closing socket connection and attempting reconnect
2018-11-15 19:28:10.138 INFO State change: SUSPENDED
2018-11-15 19:28:10.141 WARN Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.144 INFO JobManager for job {JOB}. Interval: 300000 ms. Lookback: 3600000 ms, KafkaOffsetMins: 0 min (853c2f252ac2e44de63025ec36efefdc) was revoked leadership at akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/jobmanager_0.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The contender akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/dispatcher no longer participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The contender akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/jobmanager_0 no longer participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The contender http://localhost:18080 no longer participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The contender akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/resourcemanager no longer participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.148 WARN ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
2018-11-15 19:28:10.148 INFO http://localhost:18080 lost leadership
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.440 WARN SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-265845598028730621.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-11-15 19:28:10.440 INFO Opening socket connection to server 10.0.39.156/10.0.39.156:2181
2018-11-15 19:28:10.440 ERROR Authentication failed
2018-11-15 19:28:10.441 INFO Socket connection established to 10.0.39.156/10.0.39.156:2181, initiating session
2018-11-15 19:28:10.459 INFO Session establishment complete on server 10.0.39.156/10.0.39.156:2181, sessionid = 0x3655a5782843c03, negotiated timeout = 40000
2018-11-15 19:28:10.459 INFO State change: RECONNECTED
2018-11-15 19:28:10.460 INFO Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-11-15 19:28:10.461 INFO ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-11-15 19:28:10.470 INFO JobManager runner for {job}. Interval: 300000 ms. Lookback: 3600000 ms, KafkaOffsetMins: 0 min (853c2f252ac2e44de63025ec36efefdc) was granted leadership with session id 3409e3b7-8824-4352-a44d-9dac4b101221 at akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/jobmanager_0.
2018-11-15 19:28:10.480 INFO http://localhost:18080 was granted leadership with leaderSessionID=764d2013-5812-4721-be24-1c47a58c0314
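The session in these logs timed out after ~27 s of silence against a negotiated 40 s timeout, so long GC pauses like this will keep producing SUSPENDED/revoked-leadership cycles. One mitigation worth considering (a sketch only; it treats the session loss, not the underlying jobgraph-cleanup bug) is to give the ZooKeeper client more headroom in `flink-conf.yaml`. The key names below are Flink's ZooKeeper HA options, but the values are illustrative and defaults vary by version:

```yaml
# flink-conf.yaml -- illustrative values, tune for your environment.
# Longer sessions mean a multi-second GC pause is less likely to
# revoke leadership and re-trigger job recovery.
high-availability.zookeeper.client.session-timeout: 60000      # ms
high-availability.zookeeper.client.connection-timeout: 15000   # ms
```

Note that the effective session timeout is negotiated with the ZooKeeper server, so the server-side `maxSessionTimeout` may also need raising.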



