Wait a sec, I just checked out the code again and it seems we already do that: https://github.com/apache/flink/blob/9071e3befb8c279f73c3094c9f6bddc0e7cce9e5/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L210
If there were some checkpoints but none could be read, we fail recovery.
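For context, here is a minimal sketch of the kind of guard being discussed. It is not the actual Flink source (that lives at the link above); the types and method names here are illustrative placeholders only:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: if ZooKeeper knows about completed checkpoints but
// none of their state handles can be read back (e.g. HDFS is unreachable),
// recovery fails instead of silently starting the job from scratch.
public class RecoverySketch {

    static List<CompletedCheckpoint> recover(List<StateHandle> handlesInZooKeeper) throws Exception {
        List<CompletedCheckpoint> retrieved = new ArrayList<>();
        for (StateHandle handle : handlesInZooKeeper) {
            try {
                retrieved.add(handle.retrieve()); // may hit HDFS
            } catch (Exception e) {
                // this handle could not be read; try the next one
            }
        }
        if (retrieved.isEmpty() && !handlesInZooKeeper.isEmpty()) {
            throw new Exception(
                "Found " + handlesInZooKeeper.size() + " checkpoints in ZooKeeper, "
                    + "but none could be retrieved. Failing recovery.");
        }
        return retrieved; // the latest valid checkpoint is then used for the restore
    }

    // hypothetical placeholder types for the sketch
    interface StateHandle { CompletedCheckpoint retrieve() throws Exception; }
    interface CompletedCheckpoint {}
}
```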
|
As in, if there are checkpoint handles in ZK, there should be no reason to start a new job (bad handle, no HDFS connectivity, etc.). Yes, that sums it up. On Wed, Jan 24, 2018 at 5:35 AM, Aljoscha Krettek <[hidden email]> wrote:
|
Did you see my second mail?
|
Sorry. There are 2 scenarios:

* Idempotent sinks use case (see the sketch below), where we want to restore from the latest valid checkpoint. If I understand the code correctly, we try to retrieve the completed checkpoints for all handles in ZK and abort (throw an exception) if there are handles but no corresponding completed checkpoints in HDFS; otherwise we use the latest valid checkpoint state. On abort, a restart and thus a restore of the pipe is issued, repeating the above execution. If the failure in HDFS was transient, a retry will succeed; otherwise, once the retry limit is reached, the pipeline is aborted for good.

* Non-idempotent sinks, where we have no retries. We do not want to recover from the last available checkpoint, as the above code will do, because the further back into history we go, the more duplicates will be delivered. The only solution is to use exactly-once semantics of the source and sinks, if possible.

On Wed, Jan 24, 2018 at 7:20 AM, Aljoscha Krettek <[hidden email]> wrote:
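To illustrate the first scenario, a hedged sketch of an idempotent sink: every write is a keyed upsert, so replaying records after a restore from an older checkpoint overwrites previous writes instead of producing duplicates. The `KeyValueStore` client and `MyRecord` type are stand-ins invented for this sketch, not Flink APIs:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Sketch of an idempotent sink: each record is written as an upsert keyed by a
// stable record id, so re-delivered records after a restore simply overwrite
// earlier writes rather than creating duplicates.
public class IdempotentUpsertSink extends RichSinkFunction<IdempotentUpsertSink.MyRecord> {

    // hypothetical external key-value store client, NOT a Flink API
    public interface KeyValueStore {
        void put(String key, String value);
    }

    public static class MyRecord {
        public String id;      // stable unique id, used as the upsert key
        public String payload;
    }

    private transient KeyValueStore store;

    @Override
    public void open(Configuration parameters) {
        store = connectToStore(); // assumption: connection details omitted
    }

    @Override
    public void invoke(MyRecord record) {
        store.put(record.id, record.payload); // keyed upsert -> replay-safe
    }

    private KeyValueStore connectToStore() {
        throw new UnsupportedOperationException("placeholder for a real store client");
    }
}
```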
|
To add to this, we are assuming that the default configuration will fail a pipeline if a checkpoint fails, and that it will hit the recovery loop if and only if the retry limit has not been reached. On Thu, Jan 25, 2018 at 7:00 AM, Vishal Santoshi <[hidden email]> wrote:
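For reference, a sketch of the kind of job configuration being assumed here, using Flink's fixed-delay restart strategy (the interval, attempt count, and delay values are illustrative only): a failed checkpoint or recovery attempt triggers a restart, and the job only fails permanently once the attempt limit is exhausted.

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // checkpoint every 60s
        env.enableCheckpointing(60_000);

        // retry up to 10 times with a 30s delay between attempts; once the
        // limit is reached the job is failed for good, as assumed in the thread
        env.setRestartStrategy(
            RestartStrategies.fixedDelayRestart(10, Time.of(30, TimeUnit.SECONDS)));
    }
}
```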
|
The assumption in your previous mail is correct.
Just to double check: the initially affected version you were running was 1.3.2, correct? The issue should be fixed in all active branches (1.4, 1.5, 1.6) and additionally in 1.3. Currently released versions with this fix: 1.4.0 and 1.4.1; 1.5.0 is in the making. We are looking to create a dedicated 1.3.3 release for this fix. On Thu, Jan 25, 2018 at 5:13 PM, Vishal Santoshi <[hidden email]> wrote:
|
Yes. We have not hit the snag in 1.4.0 (our current version). Then again, this only occurs under sustained downtime on Hadoop, and that has been more stable lately :) On Wed, Mar 7, 2018 at 4:09 PM, Stephan Ewen <[hidden email]> wrote:
|