Retrieving name of last external checkpoint directory

Dawid Wysakowicz
Hi,

We are running a few jobs on YARN, and in case of a failure that a job cannot recover from on its own, we want to restore the job manually from its last successful external checkpoint. The problem is that ${state.checkpoints.dir} contains checkpoint directories for all the jobs we are running. How can we find the last successful external checkpoint for a particular job? I would be grateful for any pointers.

Regards,
Dawid

Re: Retrieving name of last external checkpoint directory

Aljoscha Krettek
Hi,

I think there is currently no easy way of doing this. Things that come to mind are:
 - looking at the JM log
 - polling the JM REST interface for completed externalised checkpoints
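
A minimal sketch of the REST polling option, under stated assumptions: the JobManager is assumed to serve per-job checkpoint statistics as JSON at /jobs/<job-id>/checkpoints, with the newest completed checkpoint under latest.completed.external_path. The endpoint and field names should be checked against the Flink version in use; the function names are hypothetical.

```python
import json
from typing import Optional
from urllib.request import urlopen


def latest_external_path(stats: dict) -> Optional[str]:
    """Return the external path of the newest completed checkpoint, or None."""
    # Assumed response shape: {"latest": {"completed": {"external_path": "..."}}}
    completed = (stats.get("latest") or {}).get("completed") or {}
    return completed.get("external_path") or None


def fetch_checkpoint_stats(base_url: str, job_id: str) -> dict:
    """Fetch checkpoint statistics for one job (the endpoint is an assumption)."""
    with urlopen(f"{base_url}/jobs/{job_id}/checkpoints") as resp:
        return json.load(resp)
```

Keeping the parsing separate from the HTTP call means the same function can be applied to a response saved by any other means, for example a scheduled poll that stores the last seen path outside the cluster.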

The good news is that Flink 1.5 will rework how externalised checkpoints work a bit: basically, all checkpoints can now be considered externalised and the metadata will be stored in the root directory of the checkpoint, not in one global directory for all jobs. This way, the metadata for externalised checkpoints resides in the checkpoint directory of each job and it should be reasonably simple to restore from that.
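
Once the metadata lives in each job's own checkpoint directory, finding the latest checkpoint could be as simple as the sketch below. It assumes a layout of <checkpoint-root>/<job-id>/chk-<n>/_metadata on a locally mounted filesystem; the layout and the function name are assumptions, and a directory on HDFS would need an HDFS client instead of pathlib.

```python
import re
from pathlib import Path
from typing import Optional

# Checkpoint directories are assumed to be named chk-<number>.
CHK_DIR = re.compile(r"chk-(\d+)")


def newest_checkpoint(job_dir: Path) -> Optional[Path]:
    """Return the highest-numbered chk-N directory that contains _metadata."""
    candidates = []
    for child in job_dir.iterdir():
        match = CHK_DIR.fullmatch(child.name)
        # A directory without _metadata is treated as incomplete and skipped.
        if match and (child / "_metadata").is_file():
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

Comparing the numeric suffix rather than the directory name avoids picking chk-9 over chk-10, which a plain string sort would do.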

Best,
Aljoscha

> On 15. Feb 2018, at 10:55, Dawid Wysakowicz <[hidden email]> wrote:
> [...]

Re: Retrieving name of last external checkpoint directory

Chesnay Schepler
There is the "lastCheckpointExternalPath" metric that is scoped by job. You could access this via JMX.
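
A rough sketch of reading that metric. Instead of JMX, it assumes the metric values have already been fetched as a JSON list of {"id": ..., "value": ...} entries (for instance from a metrics endpoint of the JobManager); that shape, the "n/a" placeholder, and the function name are assumptions rather than something confirmed in this thread.

```python
from typing import Optional

METRIC_NAME = "lastCheckpointExternalPath"


def external_path_from_metrics(metrics: list) -> Optional[str]:
    """Pick the external checkpoint path out of a list of metric entries."""
    for entry in metrics:
        if entry.get("id") == METRIC_NAME:
            value = entry.get("value")
            # "n/a" is assumed to be the placeholder when no path exists yet.
            return value if value and value != "n/a" else None
    return None
```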

On 20.02.2018 17:17, Aljoscha Krettek wrote:
> [...]


Re: Retrieving name of last external checkpoint directory

Dawid Wysakowicz
Thanks for your suggestions. In the end I integrated a step that alters flink-conf.yaml into our job submission, which we always do via custom Ansible scripts. This way each job has its own directory for external checkpoints.
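
The Ansible scripts are not shown here, so the following is only an illustrative sketch of the idea: rewrite the state.checkpoints.dir entry of flink-conf.yaml to a per-job subdirectory before submitting. The function name, the base directory, and the job name are hypothetical.

```python
def set_checkpoint_dir(conf_text: str, base_dir: str, job_name: str) -> str:
    """Return flink-conf.yaml text with a per-job state.checkpoints.dir."""
    key = "state.checkpoints.dir"
    new_line = f"{key}: {base_dir.rstrip('/')}/{job_name}"
    lines = conf_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        # Match only an active "key: value" line; commented lines are left alone.
        if line.split(":", 1)[0].strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

With a layout like this, the external checkpoints found under <base_dir>/<job_name> can only belong to that one job, which sidesteps the lookup problem from the original question.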

Best,
Dawid

> On 20 Feb 2018, at 17:21, Chesnay Schepler <[hidden email]> wrote:
> [...]

