Hi,
I was wondering how Flink's fault tolerance works, because this page is short on the details: https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/fault_tolerance.html My environment has a backend service that may be out for a couple of hours (sad, but working on fixing that). I have a sink that would like to write to that service, and in such cases it throws an exception. This brings the process down and I need to manually intervene to get it up and running again. I was thinking to rewrite the sink to loop until it is able to write the data (and have a multi-hour long tolarence before it throws an exception). I hope that it will create a backpressure on the process, "suspend" the processing and "resume" it when the backend service goes up again. Am I right with that assumption? Is there a better way to make suspending and resuming automatic? Thanks, Istvan |
Yes, the back pressure behaviour you describe is correct. With
checkpointing enable, the job should resume as soon as the sink can contact the backend service again (you would see that the job fails many times until the service is live again, but at the end it should work). You can control the restart behaviour via the restart strategies (https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/fault_tolerance.html). For jobs that are in state RUNNING you can trigger savepoints (https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html). This does not work for jobs in any other state (e.g. finished/failed/cancelled etc.) yet, so it wouldn't work after your job has thrown the Exception. Savepoints are triggered from the outside of your application via the CLI. You might want to write a script that triggers savepoints periodically and which monitors your Flink job and backend service. If the Flink job fails, the script would wait for the backend service to be live again and then re-submit the job from the most recent savepoint. Maybe this helps as a work around. – Ufuk On Wed, Jul 27, 2016 at 10:10 AM, Istvan Soos <[hidden email]> wrote: > Hi, > > I was wondering how Flink's fault tolerance works, because this page > is short on the details: > https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/fault_tolerance.html > > My environment has a backend service that may be out for a couple of > hours (sad, but working on fixing that). I have a sink that would like > to write to that service, and in such cases it throws an exception. > This brings the process down and I need to manually intervene to get > it up and running again. > > I was thinking to rewrite the sink to loop until it is able to write > the data (and have a multi-hour long tolarence before it throws an > exception). I hope that it will create a backpressure on the process, > "suspend" the processing and "resume" it when the backend service goes > up again. > > Am I right with that assumption? Is there a better way to make > suspending and resuming automatic? > > Thanks, > Istvan |
Free forum by Nabble | Edit this page |