Cancelled job not showing its details

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Cancelled job not showing its details

Julio Biason
Hello,

I had a job that was failing -- a bug on our code -- so I decided to cancel it and deploy the fix. Because I couldn't create a savepoint due the job restarting, I decided to kill it anyway and use the web interface to get the last successful checkpoint.

The problem is: the interface is not showing anything for the job. The details page show nothing, not even the pipeline.

The only thing that seems related in the JobManager logs is this:

2018-10-02 19:03:14,214 [flink-akka.actor.default-dispatcher-4158] ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  - Implementation error: Unhandled exception.
java.lang.IllegalArgumentException: Negative number of in progress checkpoints
        at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:139)
        at org.apache.flink.runtime.checkpoint.CheckpointStatsCounts.<init>(CheckpointStatsCounts.java:72)
        at org.apache.flink.runtime.checkpoint.CheckpointStatsCounts.createSnapshot(CheckpointStatsCounts.java:177)
        at org.apache.flink.runtime.checkpoint.CheckpointStatsTracker.createSnapshot(CheckpointStatsTracker.java:166)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.getCheckpointStatsSnapshot(ExecutionGraph.java:553)
        at org.apache.flink.runtime.executiongraph.ArchivedExecutionGraph.createFrom(ArchivedExecutionGraph.java:340)
        at org.apache.flink.runtime.jobmaster.JobMaster.requestJob(JobMaster.java:923)
        at sun.reflect.GeneratedMethodAccessor101.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:162)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


--
Julio Biason, Sofware Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: <a href="callto:+555130838101" value="+555130838101" style="color:rgb(17,85,204);font-family:arial,sans-serif;font-size:12.8px" target="_blank">+55 51 3083 8101  |  Mobile: <a href="callto:+5551996209291" style="color:rgb(17,85,204)" target="_blank">+55 51 99907 0554

Screenshot_2018-10-02 Apache Flink Web Dashboard.png (25K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Cancelled job not showing its details

Julio Biason
Oh, another piece of information:

Because the job was failing and restarting, I did a cancel via the CLI tool during one of the restarts.

On Tue, Oct 2, 2018 at 4:03 PM, Julio Biason <[hidden email]> wrote:
Hello,

I had a job that was failing -- a bug on our code -- so I decided to cancel it and deploy the fix. Because I couldn't create a savepoint due the job restarting, I decided to kill it anyway and use the web interface to get the last successful checkpoint.

The problem is: the interface is not showing anything for the job. The details page show nothing, not even the pipeline.

The only thing that seems related in the JobManager logs is this:

2018-10-02 19:03:14,214 [flink-akka.actor.default-dispatcher-4158] ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  - Implementation error: Unhandled exception.
java.lang.IllegalArgumentException: Negative number of in progress checkpoints
        at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:139)
        at org.apache.flink.runtime.checkpoint.CheckpointStatsCounts.<init>(CheckpointStatsCounts.java:72)
        at org.apache.flink.runtime.checkpoint.CheckpointStatsCounts.createSnapshot(CheckpointStatsCounts.java:177)
        at org.apache.flink.runtime.checkpoint.CheckpointStatsTracker.createSnapshot(CheckpointStatsTracker.java:166)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.getCheckpointStatsSnapshot(ExecutionGraph.java:553)
        at org.apache.flink.runtime.executiongraph.ArchivedExecutionGraph.createFrom(ArchivedExecutionGraph.java:340)
        at org.apache.flink.runtime.jobmaster.JobMaster.requestJob(JobMaster.java:923)
        at sun.reflect.GeneratedMethodAccessor101.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:162)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


--
Julio Biason, Sofware Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: <a href="callto:+555130838101" value="+555130838101" style="color:rgb(17,85,204);font-family:arial,sans-serif;font-size:12.8px" target="_blank">+55 51 3083 8101  |  Mobile: <a href="callto:+5551996209291" style="color:rgb(17,85,204)" target="_blank">+55 51 99907 0554



--
Julio Biason, Sofware Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: <a href="callto:+555130838101" value="+555130838101" style="color:rgb(17,85,204);font-family:arial,sans-serif;font-size:12.8px" target="_blank">+55 51 3083 8101  |  Mobile: <a href="callto:+5551996209291" style="color:rgb(17,85,204)" target="_blank">+55 51 99907 0554
Reply | Threaded
Open this post in threaded view
|

Re: Cancelled job not showing its details

Andrey Zagrebin
Hi Julio,

this might be a bug in job stats. Can you please create an issue in Jira describing the steps you were doing and complete logs?

Best,
Andrey

On 2 Oct 2018, at 21:11, Julio Biason <[hidden email]> wrote:

Oh, another piece of information:

Because the job was failing and restarting, I did a cancel via the CLI tool during one of the restarts.

On Tue, Oct 2, 2018 at 4:03 PM, Julio Biason <[hidden email]> wrote:
Hello,

I had a job that was failing -- a bug on our code -- so I decided to cancel it and deploy the fix. Because I couldn't create a savepoint due the job restarting, I decided to kill it anyway and use the web interface to get the last successful checkpoint.

The problem is: the interface is not showing anything for the job. The details page show nothing, not even the pipeline.

The only thing that seems related in the JobManager logs is this:

2018-10-02 19:03:14,214 [flink-akka.actor.default-dispatcher-4158] ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  - Implementation error: Unhandled exception.
java.lang.IllegalArgumentException: Negative number of in progress checkpoints
        at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:139)
        at org.apache.flink.runtime.checkpoint.CheckpointStatsCounts.<init>(CheckpointStatsCounts.java:72)
        at org.apache.flink.runtime.checkpoint.CheckpointStatsCounts.createSnapshot(CheckpointStatsCounts.java:177)
        at org.apache.flink.runtime.checkpoint.CheckpointStatsTracker.createSnapshot(CheckpointStatsTracker.java:166)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.getCheckpointStatsSnapshot(ExecutionGraph.java:553)
        at org.apache.flink.runtime.executiongraph.ArchivedExecutionGraph.createFrom(ArchivedExecutionGraph.java:340)
        at org.apache.flink.runtime.jobmaster.JobMaster.requestJob(JobMaster.java:923)
        at sun.reflect.GeneratedMethodAccessor101.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:162)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


-- 
Julio Biason, Sofware Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: <a href="callto:+555130838101" value="+555130838101" target="_blank" style="color: rgb(17, 85, 204); font-family: arial, sans-serif; font-size: 12.8px;" class="">+55 51 3083 8101  |  Mobile: <a href="callto:+5551996209291" target="_blank" style="color: rgb(17, 85, 204);" class="">+55 51 99907 0554



-- 
Julio Biason, Sofware Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: <a href="callto:+555130838101" value="+555130838101" target="_blank" style="color: rgb(17, 85, 204); font-family: arial, sans-serif; font-size: 12.8px;" class="">+55 51 3083 8101  |  Mobile: <a href="callto:+5551996209291" target="_blank" style="color: rgb(17, 85, 204);" class="">+55 51 99907 0554