Externalized checkpoints and metadata


Externalized checkpoints and metadata

Juan Gentile

Hello,

 

We are trying to use externalized checkpoints with RocksDB on Hadoop HDFS.

We would like to know the proper way to resume from a saved checkpoint, as we are currently running many jobs on the same Flink cluster.
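For reference, the setup described above boils down to something like the following sketch (Flink 1.4-era streaming API; the HDFS path and job name are illustrative, not taken from this thread). With externalized checkpoints enabled this way, the metadata files land in the cluster-wide directory set by state.checkpoints.dir in flink-conf.yaml, which is why all jobs write to the same place:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ExternalizedCheckpointsJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // RocksDB state backend with checkpoint data on HDFS (illustrative path).
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoint-data"));

            // Checkpoint every 60 s and retain checkpoints when the job is
            // cancelled, so they can be resumed later (externalized checkpoints).
            env.enableCheckpointing(60_000);
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // ... job topology goes here ...
            env.execute("my-job");
        }
    }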

 

The problem is that when we want to restart the jobs and pass the metadata file (or directory), there is one metadata file per job, but the files are not easily identifiable by name:

Example:

/checkpoints/checkpoint_metadata-69053704a5ca

/checkpoints/checkpoint_metadata-c7c016909607

 

We are not using savepoints. Reading the documentation, I see there are two ways to resume: passing the metadata file (not practical for us, as we have many jobs) or passing the directory.

But by default the directory option looks for a _metadata file, which doesn't exist.
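For reference, both resume options go through the same CLI flag: "bin/flink run -s hdfs:///checkpoints/checkpoint_metadata-69053704a5ca job.jar" (with job.jar standing in for the actual job) resumes from an explicit metadata file, while pointing -s at a directory makes Flink look for a file named _metadata inside it, i.e. the savepoint layout. That is exactly why the directory form fails for externalized checkpoint files named as above.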

 

Thank you,

Juan G.


Re: Externalized checkpoints and metadata

hao gao
Hi Juan,

We modified the Flink code a little to change the checkpoint directory structure so we can easily identify which checkpoint belongs to which job.
You can read my note or the PR.
Hope it helps

Thanks
Hao
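Until a change like that is available upstream, one stopgap for telling the metadata files apart is to list them with their HDFS modification times and correlate those with job start and cancel times. A minimal sketch using the Hadoop FileSystem API (the namenode address and checkpoint path are illustrative):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListCheckpointMetadata {
        public static void main(String[] args) throws Exception {
            // Illustrative namenode address; use your cluster's configuration.
            FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration());
            for (FileStatus status : fs.listStatus(new Path("/checkpoints"))) {
                // The modification time is the only built-in hint linking a
                // metadata file to the job that wrote it.
                System.out.printf("%d\t%s%n", status.getModificationTime(), status.getPath());
            }
        }
    }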


Re: Externalized checkpoints and metadata

Juan Gentile

Hello all,

 

Thank you all for your responses. I'll take a look at your code, Hao, and probably implement something similar.

I'd like to ask, though, so that we know what to expect from Flink in the future: will this issue be addressed, considering that we already have three different companies implementing different (but similar) solutions to the same problem?

Maybe we could think of adding this issue here: https://cwiki.apache.org/confluence/display/FLINK/FLIP-10%3A+Unify+Checkpoints+and+Savepoints ?

 

Thank you,

Juan G.

 



Re: Externalized checkpoints and metadata

gerryzhou

Hi Juan,

I think you are right, and there may be more than three companies implementing different solutions for this. I created a ticket to address it: https://issues.apache.org/jira/browse/FLINK-9260. Hopefully this helps reduce others' redundant efforts on this (if it is accepted by the community in the end).

Best Regards,
Sihua Zhou
