The Flink job recovered with wrong checkpoint state.

The Flink job recovered with wrong checkpoint state.

Thomas Huang
Hi Flink Community,

Currently, I'm using yarn-cluster mode to submit Flink jobs on YARN. I haven't set up the high-availability (ZooKeeper) configuration, but I have set a restart strategy:

 env.getConfig.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 3000))

i.e. up to 10 restart attempts, with a 3-second (3000 ms) delay between attempts.
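
For reference, if a 30-second delay was actually the intent, the overload that takes an explicit Time value makes the unit unambiguous. A minimal sketch using the standard Flink API (the 30-second value here is illustrative):

 import java.util.concurrent.TimeUnit
 import org.apache.flink.api.common.restartstrategy.RestartStrategies
 import org.apache.flink.api.common.time.Time
 import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

 val env = StreamExecutionEnvironment.getExecutionEnvironment
 // Up to 10 restart attempts, waiting 30 seconds between attempts.
 // The Time overload avoids confusing milliseconds with seconds.
 env.getConfig.setRestartStrategy(
   RestartStrategies.fixedDelayRestart(10, Time.of(30, TimeUnit.SECONDS)))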

Today, the infra team did a rolling restart of the YARN platform. Although the JobManager restarted, the job did not recover from the latest checkpoint; all TaskManagers started from the default job configuration, which was not expected.

Does this mean I have to set up high-availability configuration for yarn-cluster mode, or is this a bug?

Best wishes.


Re: The Flink job recovered with wrong checkpoint state.

Yun Tang
Hi Thomas

The answer is yes. Without high availability, once the JobManager goes down, the job graph and the last checkpoint will not be recovered, even if the JobManager is relaunched by YARN.
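
For reference, a minimal ZooKeeper-based HA setup in flink-conf.yaml looks roughly like the following (option names are from the standard Flink configuration; the quorum hosts and storage path are placeholders to adapt to your environment):

 # Enable ZooKeeper-based high availability
 high-availability: zookeeper
 # ZooKeeper quorum that stores pointers to the JobManager metadata
 high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
 # Durable storage for job graphs and completed-checkpoint metadata
 high-availability.storageDir: hdfs:///flink/ha/
 # Allow YARN to restart the application master (JobManager) more than once
 yarn.application-attempts: 10

With this in place, a relaunched JobManager can read the last completed checkpoint's metadata from ZooKeeper/HDFS and resume from it instead of starting the job from its default configuration.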

Best
Yun Tang



Re: The Flink job recovered with wrong checkpoint state.

Thomas Huang
[hidden email], thanks.
