Flink failing to restore from checkpoint

Flink failing to restore from checkpoint

Claude Murad
Hello, 

I ran a Flink job in a Kubernetes Application cluster with four TaskManagers. The job ran fine for several hours but then crashed with the following exception, which seems to occur when restoring from a checkpoint. The UI shows the following checkpoint counts:

Triggered: 68, In Progress: 0, Completed: 67, Failed: 1, Restored: 292


Any ideas about this failure? 


Thanks



Attachment: FlinkCheckpointFailure.txt (6K)
Re: Flink failing to restore from checkpoint

Piotr Nowojski-4
Hi,

What Flink version are you using, and in what scenario does this happen? It could be a number of things, most likely that your file mounted under:
> /mnt/checkpoints/5dde50b6e70608c63708cbf979bce4aa/shared/47993871-c7eb-4fec-ae23-207d307c384a
has disappeared or stopped being accessible, for example something like this [1] (which is not a Flink bug).

Have you tried looking for this path manually? Does it exist? Have you looked in the JobManager/TaskManager logs for all entries that are referring to this path? 
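
If it helps, the two checks above (does the path exist, and which log lines mention it) could be scripted roughly like this. This is only a sketch: the log directory `/opt/flink/log` and the `*.log*` file pattern are assumptions about your setup, the helper names are made up for illustration, and it would need to run somewhere that both the checkpoint volume and the logs are visible, such as inside a TaskManager pod.

```python
from pathlib import Path
from typing import List, Tuple

# Path quoted in the message above.
STATE_PATH = Path(
    "/mnt/checkpoints/5dde50b6e70608c63708cbf979bce4aa"
    "/shared/47993871-c7eb-4fec-ae23-207d307c384a"
)
# Assumption: where the JobManager/TaskManager logs live; adjust as needed.
LOG_DIR = Path("/opt/flink/log")


def path_status(path: Path) -> str:
    """Return 'missing', 'directory', or 'file' for the given path."""
    if not path.exists():
        return "missing"
    return "directory" if path.is_dir() else "file"


def find_mentions(log_dir: Path, needle: str) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, line) for each log line containing needle."""
    hits: List[Tuple[str, int, str]] = []
    if not log_dir.is_dir():
        return hits
    for log_file in sorted(log_dir.glob("*.log*")):
        if not log_file.is_file():
            continue
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with log_file.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if needle in line:
                    hits.append((log_file.name, lineno, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    print(f"{STATE_PATH}: {path_status(STATE_PATH)}")
    # Search for the last path component so that differently prefixed
    # references to the same state file are also matched.
    for name, lineno, line in find_mentions(LOG_DIR, STATE_PATH.name):
        print(f"{name}:{lineno}: {line}")
```

Searching for the last path component rather than the full path is a deliberate choice here, since log entries may print the location with a different prefix or URI scheme.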

To help you, we would need more information. If it happened after taking a savepoint, this could be a recently fixed issue [2]. If you are using SQL (Blink planner), it could be, for example, this [3].

Piotrek



On Mon, 29 Mar 2021 at 14:58, Claude M <[hidden email]> wrote:
Hello, 

I ran a Flink job in a Kubernetes Application cluster with four TaskManagers. The job ran fine for several hours but then crashed with the following exception, which seems to occur when restoring from a checkpoint. The UI shows the following checkpoint counts:

Triggered: 68, In Progress: 0, Completed: 67, Failed: 1, Restored: 292


Any ideas about this failure? 


Thanks