Checkpoint failure

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Checkpoint failure

Navneeth Krishnan
Hi All,

Occasionally I run into failed checkpoints error where 2 or 3 consecutive checkpoints fails after running for a minute and then it recovers. This is causing delay in processing the incoming data since there is huge amount of data buffered during the failed checkpoints. I don't see any errors in the taskmanager logs but here is the error in the jobmanager log. The state size is around 100 mb.

Checkpoint configuration:
OptionValue
Checkpointing ModeExactly Once
Interval1m 0s
Timeout1m 0s
Minimum Pause Between Checkpoints5s
Maximum Concurrent Checkpoints1
Persist Checkpoints ExternallyEnabled (retain on cancellation)
Jobmanager Log:

2019-07-05 17:53:54,125 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 79515b6550d2c223701be0a9c870995f of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,141 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 630984cdd5e66b4d9ea95a91cb4d23f6 of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,168 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from e12ed2e185a37559f93181905a52ebeb of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,215 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 1fede192e2ff11e0905d98ff5ff6f9ce of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,223 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from d4e895eb20cc259c95b249cd0252930f of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,310 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from be5c711d7b37ed6d8022224dc447db91 of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,351 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 1ed52695cc407f2f143d2bb5d23cbdbb of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,398 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 2e43cf968ad399c0b8426239a7dd081c of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,959 [flink-akka.actor.default-dispatcher-465868] INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9868 (279307042 bytes in 50707 ms).
2019-07-05 17:54:04,174 [Checkpoint Timer] INFO o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9869 @ 1562349244171
2019-07-05 17:54:10,709 [flink-akka.actor.default-dispatcher-465905] INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9869 (253638470 bytes in 6430 ms).
2019-07-05 17:55:04,174 [Checkpoint Timer] INFO o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9870 @ 1562349304171
2019-07-05 17:55:09,816 [flink-akka.actor.default-dispatcher-465913] INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9870 (138649543 bytes in 5551 ms).
2019-07-05 17:56:04,174 [Checkpoint Timer] INFO o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9871 @ 1562349364171

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Checkpoint failure

Navneeth Krishnan
Hi All,

Any pointers on the below checkpoint failure scenario. Appreciate all the help. Thanks

Thanks

On Sun, Jul 7, 2019 at 9:23 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

Occasionally I run into failed checkpoints error where 2 or 3 consecutive checkpoints fails after running for a minute and then it recovers. This is causing delay in processing the incoming data since there is huge amount of data buffered during the failed checkpoints. I don't see any errors in the taskmanager logs but here is the error in the jobmanager log. The state size is around 100 mb.

Checkpoint configuration:
OptionValue
Checkpointing ModeExactly Once
Interval1m 0s
Timeout1m 0s
Minimum Pause Between Checkpoints5s
Maximum Concurrent Checkpoints1
Persist Checkpoints ExternallyEnabled (retain on cancellation)
Jobmanager Log:

2019-07-05 17:53:54,125 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 79515b6550d2c223701be0a9c870995f of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,141 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 630984cdd5e66b4d9ea95a91cb4d23f6 of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,168 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from e12ed2e185a37559f93181905a52ebeb of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,215 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 1fede192e2ff11e0905d98ff5ff6f9ce of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,223 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from d4e895eb20cc259c95b249cd0252930f of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,310 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from be5c711d7b37ed6d8022224dc447db91 of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,351 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 1ed52695cc407f2f143d2bb5d23cbdbb of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,398 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 2e43cf968ad399c0b8426239a7dd081c of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,959 [flink-akka.actor.default-dispatcher-465868] INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9868 (279307042 bytes in 50707 ms).
2019-07-05 17:54:04,174 [Checkpoint Timer] INFO o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9869 @ 1562349244171
2019-07-05 17:54:10,709 [flink-akka.actor.default-dispatcher-465905] INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9869 (253638470 bytes in 6430 ms).
2019-07-05 17:55:04,174 [Checkpoint Timer] INFO o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9870 @ 1562349304171
2019-07-05 17:55:09,816 [flink-akka.actor.default-dispatcher-465913] INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9870 (138649543 bytes in 5551 ms).
2019-07-05 17:56:04,174 [Checkpoint Timer] INFO o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9871 @ 1562349364171

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Checkpoint failure

Biao Liu
Hi,

It seems that sometimes your job can not finish checkpoint during 1 minute (timeout in your configuration).
The checkpoint of 9867 is expired. The 9868 costs 50 seconds, almost reaches the timeout limit. However the 9869 and 9870 finished quite soon.

It's more like a performance issue. So it's hard to get a collusion based on your simple description.
It's might be helpful to check more details of your job. For example, the garbage collection of your Task Manager, the checkpoint alignment (check the metric: checkpointAlignmentTime), the distribution of your input data, the performance of state backend, etc.


Navneeth Krishnan <[hidden email]> 于2019年7月14日周日 上午5:01写道:
Hi All,

Any pointers on the below checkpoint failure scenario. Appreciate all the help. Thanks

Thanks

On Sun, Jul 7, 2019 at 9:23 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

Occasionally I run into failed checkpoints error where 2 or 3 consecutive checkpoints fails after running for a minute and then it recovers. This is causing delay in processing the incoming data since there is huge amount of data buffered during the failed checkpoints. I don't see any errors in the taskmanager logs but here is the error in the jobmanager log. The state size is around 100 mb.

Checkpoint configuration:
OptionValue
Checkpointing ModeExactly Once
Interval1m 0s
Timeout1m 0s
Minimum Pause Between Checkpoints5s
Maximum Concurrent Checkpoints1
Persist Checkpoints ExternallyEnabled (retain on cancellation)
Jobmanager Log:

2019-07-05 17:53:54,125 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 79515b6550d2c223701be0a9c870995f of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,141 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 630984cdd5e66b4d9ea95a91cb4d23f6 of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,168 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from e12ed2e185a37559f93181905a52ebeb of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,215 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 1fede192e2ff11e0905d98ff5ff6f9ce of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,223 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from d4e895eb20cc259c95b249cd0252930f of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,310 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from be5c711d7b37ed6d8022224dc447db91 of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,351 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 1ed52695cc407f2f143d2bb5d23cbdbb of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,398 [flink-akka.actor.default-dispatcher-465901] WARN o.a.f.r.c.CheckpointCoordinator - Received late message for now expired checkpoint attempt 9867 from 2e43cf968ad399c0b8426239a7dd081c of job 00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,959 [flink-akka.actor.default-dispatcher-465868] INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9868 (279307042 bytes in 50707 ms).
2019-07-05 17:54:04,174 [Checkpoint Timer] INFO o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9869 @ 1562349244171
2019-07-05 17:54:10,709 [flink-akka.actor.default-dispatcher-465905] INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9869 (253638470 bytes in 6430 ms).
2019-07-05 17:55:04,174 [Checkpoint Timer] INFO o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9870 @ 1562349304171
2019-07-05 17:55:09,816 [flink-akka.actor.default-dispatcher-465913] INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9870 (138649543 bytes in 5551 ms).
2019-07-05 17:56:04,174 [Checkpoint Timer] INFO o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9871 @ 1562349364171

Thanks