fault tolerance: suspend and resume?


fault tolerance: suspend and resume?

Istvan Soos
Hi,

I was wondering how Flink's fault tolerance works, because this page
is short on the details:
https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/fault_tolerance.html

My environment has a backend service that may be out for a couple of
hours (sad, but we are working on fixing that). I have a sink that
writes to that service, and when the service is down it throws an
exception. This brings the process down and I need to intervene
manually to get it up and running again.

I was thinking of rewriting the sink to loop until it is able to write
the data (with a multi-hour tolerance before it throws an
exception). I hope that this will create backpressure on the process,
"suspend" the processing, and "resume" it when the backend service
comes back up again.
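
Roughly what I have in mind for the sink, just as a sketch
(BackendClient and writeRecord are placeholders for my real service
client, not actual APIs):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.io.IOException;

public class RetryingSink extends RichSinkFunction<String> {

    private static final long MAX_WAIT_MILLIS = 6 * 60 * 60 * 1000L; // give up after ~6 hours
    private static final long RETRY_INTERVAL_MILLIS = 30 * 1000L;    // wait 30 seconds between attempts

    // Placeholder for my actual client of the backend service.
    private transient BackendClient client;

    @Override
    public void open(Configuration parameters) {
        client = new BackendClient();
    }

    @Override
    public void invoke(String value) throws Exception {
        long deadline = System.currentTimeMillis() + MAX_WAIT_MILLIS;
        while (true) {
            try {
                client.writeRecord(value); // try to write to the backend
                return;                    // success, move on to the next record
            } catch (IOException e) {
                if (System.currentTimeMillis() > deadline) {
                    throw e; // multi-hour tolerance exceeded: fail the job
                }
                // Blocking here stalls invoke(), which should backpressure the upstream operators.
                Thread.sleep(RETRY_INTERVAL_MILLIS);
            }
        }
    }
}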

Am I right with that assumption? Is there a better way to make
suspending and resuming automatic?

Thanks,
  Istvan

Re: fault tolerance: suspend and resume?

Ufuk Celebi
Yes, the backpressure behaviour you describe is correct. With
checkpointing enabled, the job should resume as soon as the sink can
contact the backend service again (you would see the job fail many
times until the service is live again, but in the end it should
work). You can control the restart behaviour via the restart
strategies (https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/fault_tolerance.html).
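
For example, something along these lines (untested sketch; pick an
attempt count and delay that cover your multi-hour outage):

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.concurrent.TimeUnit;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every minute so a restart can resume from recent state.
        env.enableCheckpointing(60 * 1000);

        // Keep restarting for a long time, e.g. up to 1000 attempts with
        // 30 seconds between them (roughly 8 hours of retries).
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                1000,
                Time.of(30, TimeUnit.SECONDS)));

        // ... build your sources, transformations and the sink here, then:
        // env.execute("my-job");
    }
}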

For jobs that are in state RUNNING you can trigger savepoints
(https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html).
This does not work yet for jobs in any other state (finished,
failed, cancelled, etc.), so it wouldn't work after your job has
thrown the exception. Savepoints are triggered from outside your
application via the CLI. You might want to write a script that
triggers savepoints periodically and that monitors your Flink job and
backend service. If the Flink job fails, the script would wait for the
backend service to be live again and then re-submit the job from the
most recent savepoint. Maybe this helps as a workaround.
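
A rough sketch of such a watchdog in Java (the flink path, job id, jar
path and the health checks are placeholders for your setup; only the
"flink savepoint <jobId>" and "flink run -s <savepointPath>" CLI calls
are standard):

import java.util.concurrent.TimeUnit;

public class SavepointWatchdog {

    private static final String FLINK = "/opt/flink/bin/flink"; // adjust to your installation
    private static final String JOB_ID = "<job-id>";            // placeholder
    private static final String JAR = "/path/to/your-job.jar";  // placeholder

    private static String lastSavepointPath = null;

    public static void main(String[] args) throws Exception {
        while (true) {
            if (jobIsRunning()) {
                // Periodically trigger a savepoint.
                run(FLINK, "savepoint", JOB_ID);
                // lastSavepointPath = ... (extract the path from the command output)
            } else if (backendIsLive()) {
                // Job failed and the backend is back: resume from the latest savepoint.
                if (lastSavepointPath != null) {
                    run(FLINK, "run", "-s", lastSavepointPath, JAR);
                } else {
                    run(FLINK, "run", JAR);
                }
            }
            TimeUnit.MINUTES.sleep(5);
        }
    }

    private static int run(String... command) throws Exception {
        Process p = new ProcessBuilder(command).inheritIO().start();
        return p.waitFor();
    }

    private static boolean jobIsRunning() {
        return true; // placeholder: e.g. check `flink list` or the JobManager's REST API
    }

    private static boolean backendIsLive() {
        return true; // placeholder: e.g. a health-check request against the backend service
    }
}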

– Ufuk
