State Recovery when job fails and auto-recovers

Sameer Wadkar
Hi,

We have a job which is using ValueState. We have turned off checkpoints. The state is backed by RocksDB, which is backed by S3.

If the job fails on an exception (e.g., partitions not available or an occasional S3 404 error) and auto-recovers, is the entire state lost, or does it continue from the last saved state? We see that the job keeps the same identifier. We don’t mind losing data during the small interval when the job is recovering. But because we are using ValueState as a custom global window to accumulate state for a key over a 3-hour window, we don’t want to lose all of it.

Checkpointing is not an option because each checkpoint takes a long time and the state is huge.

Thanks,
Sameer

Re: State Recovery when job fails and auto-recovers

Hequn Cheng
Hi Sameer,

In case of a failure, the job restarts the operators and resets them to the latest successful checkpoint. So if you turn off checkpoints, all state will be lost on recovery.
Generally speaking, snapshots are lightweight and can be taken frequently without much impact on performance. If checkpointing does affect the performance of your job and you don't want to lose all of your state, you can try increasing the checkpoint interval, for example:
// start a checkpoint every 600000 ms (10min)
env.enableCheckpointing(600000);
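
To make the trade-off concrete, here is a minimal sketch of the recovery behaviour described above. It is a toy model in plain Java, not Flink code: the class CheckpointRecoverySketch and all of its methods are hypothetical names invented for illustration, and it assumes one event per minute adding 1 to a single key. It only shows that recovery falls back to the last snapshot, so with checkpoints off nothing survives, while a 10-minute interval loses at most the last 10 minutes of accumulation.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of checkpoint-based recovery. NOT Flink code; all names are hypothetical. */
final class CheckpointRecoverySketch {
    private final long checkpointIntervalMs;                    // <= 0 means checkpointing is off
    private Map<String, Long> liveState = new HashMap<>();      // state the job is working on
    private Map<String, Long> lastCheckpoint = new HashMap<>(); // empty until a snapshot is taken
    private long lastCheckpointTimeMs = 0L;

    CheckpointRecoverySketch(long checkpointIntervalMs) {
        this.checkpointIntervalMs = checkpointIntervalMs;
    }

    /** Accumulates a value for a key and snapshots the state once the interval has elapsed. */
    void process(String key, long value, long nowMs) {
        liveState.merge(key, value, Long::sum);
        if (checkpointIntervalMs > 0 && nowMs - lastCheckpointTimeMs >= checkpointIntervalMs) {
            lastCheckpoint = new HashMap<>(liveState);
            lastCheckpointTimeMs = nowMs;
        }
    }

    /** Simulates a failure followed by an automatic restart: state is reset to the last snapshot. */
    void failAndRecover() {
        liveState = new HashMap<>(lastCheckpoint);
    }

    long get(String key) {
        return liveState.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        CheckpointRecoverySketch noCheckpoints = new CheckpointRecoverySketch(0);
        CheckpointRecoverySketch tenMinutes = new CheckpointRecoverySketch(10 * minute);

        // 175 minutes into a 3-hour accumulation, one event per minute for the same key.
        for (int m = 1; m <= 175; m++) {
            noCheckpoints.process("k", 1, m * minute);
            tenMinutes.process("k", 1, m * minute);
        }

        noCheckpoints.failAndRecover();
        tenMinutes.failAndRecover();

        System.out.println("no checkpoints:  " + noCheckpoints.get("k")); // prints 0
        System.out.println("10 min interval: " + tenMinutes.get("k"));    // prints 170
    }
}
```

In this model the job with a 10-minute interval resumes from the snapshot taken at minute 170, so only the last 5 minutes of the 3-hour accumulation are lost, whereas the job without checkpoints starts again from empty state.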

Best, Hequn 

On Thu, Oct 18, 2018 at 7:19 AM Sameer Wadkar <[hidden email]> wrote: