Checkpoint error - "The job has failed"

Dan
My Flink job failed to checkpoint with a "The job has failed" error.  The logs contained no other recent errors.  I kept hitting the error even after cancelling and restarting the jobs.  The error only went away when I restarted my JobManager and TaskManager.

What error am I hitting?  It looks like there is bad state that lives outside the scope of a job.

How often do people restart their JobManagers and TaskManagers to deal with errors like this?

Re: Checkpoint error - "The job has failed"

rmetzger0
Hi Dan,

Can you send me the JobManager logs so I can take a look as well? (They will also tell me which Flink version you are using.)



On Mon, Apr 26, 2021 at 7:20 AM Dan Hill <[hidden email]> wrote:
My Flink job failed to checkpoint with a "The job has failed" error.  The logs contained no other recent errors.  I kept hitting the error even after cancelling and restarting the jobs.  The error only went away when I restarted my JobManager and TaskManager.

What error am I hitting?  It looks like there is bad state that lives outside the scope of a job.

How often do people restart their JobManagers and TaskManagers to deal with errors like this?

Re: Checkpoint error - "The job has failed"

Yun Tang
Hi Dan,

I think you might be using an older version of Flink; this problem was resolved by FLINK-16753 [1] after Flink 1.10.3.



Best
Yun Tang

From: Robert Metzger <[hidden email]>
Sent: Monday, April 26, 2021 14:46
To: Dan Hill <[hidden email]>
Cc: user <[hidden email]>
Subject: Re: Checkpoint error - "The job has failed"
 
Hi Dan,

Can you send me the JobManager logs so I can take a look as well? (They will also tell me which Flink version you are using.)




Re: Checkpoint error - "The job has failed"

Dan
Hey Yun and Robert,

I'm using Flink v1.11.1.

Robert, I'll send you a separate email with the logs.

On Mon, Apr 26, 2021 at 12:46 AM Yun Tang <[hidden email]> wrote:
Hi Dan,

I think you might be using an older version of Flink; this problem was resolved by FLINK-16753 [1] after Flink 1.10.3.



Best
Yun Tang


Re: Checkpoint error - "The job has failed"

Yun Tang
Hi Dan,

If you check the "Fix Versions" field in FLINK-16753 [1], you will see that this bug was resolved in 1.11.3, not in 1.11.1.
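
As a rough illustration of that version check, here is a minimal Python sketch. It assumes plain dotted numeric versions (no "-SNAPSHOT" or similar suffixes), and the helper names parse_version and includes_fix are made up for this example; they are not part of Flink. It only compares versions on the same release line, because a fix version on one line says nothing about another line.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version such as '1.11.1' or 'v1.11.1' into (1, 11, 1)."""
    # Assumption: every component is a plain integer.
    return tuple(int(part) for part in version.lstrip("vV").split("."))


def includes_fix(running: str, fixed_in: str) -> bool:
    """Return True if `running` is at or past `fixed_in` on the same release line."""
    run, fix = parse_version(running), parse_version(fixed_in)
    if run[:2] != fix[:2]:
        # A 1.11.x fix version cannot be compared against a 1.10.x or 1.12.x build;
        # look up the fix version listed for that release line instead.
        raise ValueError("compare against the fix version for the same release line")
    return run >= fix


if __name__ == "__main__":
    # Versions taken from this thread: running 1.11.1, fix listed for 1.11.3.
    print(includes_fix("1.11.1", "1.11.3"))  # False
```

With the versions from this thread, the check reports that 1.11.1 does not yet include the fix.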


Best
Yun Tang

From: Dan Hill <[hidden email]>
Sent: Tuesday, April 27, 2021 7:50
To: Yun Tang <[hidden email]>
Cc: Robert Metzger <[hidden email]>; user <[hidden email]>
Subject: Re: Checkpoint error - "The job has failed"
 
Hey Yun and Robert,

I'm using Flink v1.11.1.

Robert, I'll send you a separate email with the logs.


Re: Checkpoint error - "The job has failed"

Dan
Oh, interesting.  Yeah, that could be it.  We'll update to v1.12 soon.  Thanks, Robert and Yun!

On Wed, Apr 28, 2021 at 1:30 AM Yun Tang <[hidden email]> wrote:
Hi Dan,

If you check the "Fix Versions" field in FLINK-16753 [1], you will see that this bug was resolved in 1.11.3, not in 1.11.1.


Best
Yun Tang
