My Flink job runs in Kubernetes. This is the setup:
1. One job running as a job cluster with one job manager
2. HA powered by ZooKeeper (works fine)
3. Job/Deployment manifests stored in GitHub and deployed to Kubernetes by Argo
4. State persisted to S3

If I were to stop the job (drain and take a savepoint) and resume, I would have to update the job manager manifest with the savepoint location (see the sketch after this message), commit it to GitHub, and redeploy. After the deployment, I would presumably have to modify the manifest again to remove the savepoint location so as to avoid starting the application from the same savepoint later. This raises some questions:

1. If the job manager were to crash before the manifest is updated again, won't Kubernetes restart the job manager from the savepoint rather than the latest checkpoint?
2. Is there a way to ensure that restoration from a savepoint doesn't happen more than once, or at least not after the first successful checkpoint?
3. If even one checkpoint has been finalized, then the job should prefer the checkpoint rather than the savepoint. Will that happen automatically given ZooKeeper?
4. Is it possible to not have to remove the savepoint path from the Kubernetes manifest and simply rely on newer checkpoints/savepoints? It feels rather clumsy to have to add and remove it manually. We could use a cron job to remove it, but it's still clumsy.
5. Is there a way of asking Flink to use the latest savepoint rather than specifying the location of the savepoint? If I were to manually rename the S3 savepoint location to something fixed (s3://fixed_savepoint_path_always), would there be any problem restoring the job?
6. Is there any open source tool that solves this problem?
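For context, the manifest change being discussed is usually just the savepoint argument handed to the job-cluster entry point. Below is a minimal sketch of such a fragment; the image, job class, and S3 paths are placeholders, and the exact arguments depend on the Flink version and entry point in use (recent standalone job/application cluster entry points accept --fromSavepoint and --allowNonRestoredState):

# Sketch of the relevant container spec in a job-cluster manifest.
# All names and paths below are illustrative placeholders.
containers:
  - name: flink-jobmanager
    image: my-registry/my-flink-job:1.11.2
    args:
      - standalone-job
      - --job-classname
      - com.example.MyStreamingJob
      # These are the lines that have to be added for a restore
      # and removed again afterwards in the workflow described above:
      - --fromSavepoint
      - s3://my-bucket/savepoints/savepoint-abc123
      - --allowNonRestoredState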
Hi,

If you start a Flink job from a savepoint and the job then needs to recover, it will only reuse the savepoint if no later checkpoint has been created. Flink will always use the latest checkpoint/savepoint taken.

Cheers,
Till

On Wed, Dec 16, 2020 at 9:47 PM vishalovercome <[hidden email]> wrote:
Thanks for your reply!
What I have seen is that the job terminates when there's an intermittent loss of connectivity with ZooKeeper. This is in fact the most common reason why our jobs are terminating at this point. Worse, the job is unable to restore from a checkpoint during some (not all) of these terminations. Under these scenarios, won't the job try to recover from the savepoint?

I've gone through various tickets reporting stability issues due to ZooKeeper that you've mentioned you intend to resolve soon. But until the ZooKeeper-based HA is stable, should we assume that the job will repeatedly restore from savepoints? I would rather rely on Kafka offsets to resume where the job left off than on savepoints.
What exactly are the problems when the checkpoint recovery does not work? Even if the ZooKeeper connection is temporarily lost, which leads to the JobMaster losing leadership and the job being suspended, the next leader should continue from where the first job stopped because of the lost ZooKeeper connection.

What happens under the hood when restoring from a savepoint is that it is inserted into the CompletedCheckpointStore, where the other checkpoints are also stored. If a failure then happens, Flink will first try to recover from a checkpoint/savepoint in the CompletedCheckpointStore, and only if this store does not contain any checkpoints/savepoints will it use the savepoint with which the job was started. The CompletedCheckpointStore persists the checkpoint/savepoint information by writing pointers to ZooKeeper.

Cheers,
Till

On Mon, Dec 21, 2020 at 11:38 AM vishalovercome <[hidden email]> wrote:
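For readers following along, the store described above is what the standard ZooKeeper HA settings enable. A minimal illustrative flink-conf.yaml fragment is below; the quorum address, paths, and bucket are placeholders rather than values from this thread:

# Illustrative ZooKeeper HA configuration (placeholder hosts and paths).
# ZooKeeper only holds pointers to completed checkpoints/savepoints;
# the checkpoint metadata itself is written to the HA storage directory.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /my-job-cluster
high-availability.storageDir: s3://my-bucket/flink/ha/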
I don't know how to reproduce it, but what I've observed are three kinds of termination when connectivity with ZooKeeper is somehow disrupted. I don't think it's an issue with ZooKeeper itself, as the same ensemble has supported a much bigger Kafka cluster for a few years.

1. The first kind is exactly this: https://github.com/apache/flink/pull/11338. Basically, temporary loss of connectivity or a rolling upgrade of ZooKeeper will cause the job to terminate. It eventually restarts from where it left off.
2. The second kind is when the job terminates and restarts for the same reason but is unable to recover from a checkpoint. I think it's similar to this: https://issues.apache.org/jira/browse/FLINK-19154. If upgrading to 1.12.0 (from 1.11.2) will fix the second issue, then I'll upgrade.
3. The third kind is where the job repeatedly restarts because it is unable to establish a session with ZooKeeper. I don't know if reducing the session timeout will help here, but in this case I'm forced to disable ZooKeeper HA entirely, as the job cannot even restart.

I could create a JIRA ticket for discussing the ZooKeeper issue itself if you suggest, but the ZooKeeper and savepoint issues are related, as I'm not sure what will happen in each of the above cases.
Hi Vishal,

Thanks for the detailed description of the problems.

1. This is currently the intended behaviour of Flink. The reason is that if the system is no longer connected to ZooKeeper, then we cannot rule out that another process has taken over the leadership. FLINK-10052 has the goal of making this behaviour configurable, and we intend to include it in the next major release.
2. This is indeed a bug in the newly introduced application mode. It should be fixed with Flink 1.11.3 or 1.12.0. Hence, I would recommend upgrading your Flink cluster.
3. Hard to tell what the problem is here. From Flink's perspective, if it cannot establish a connection to ZooKeeper, then it cannot be sure who the leader is and whether it should start executing jobs. Maybe there is a problem with the connection to the ZooKeeper cluster from the nodes on which Flink runs. Decreasing the session timeout usually makes the connection less stable if it is a network issue.

Cheers,
Till

On Mon, Dec 21, 2020 at 3:53 PM vishalovercome <[hidden email]> wrote:
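On point 3, the session timeout mentioned above is a client-side Flink setting. A sketch of the relevant flink-conf.yaml keys is below; the values shown are only illustrative, not a recommendation from this thread:

# Illustrative ZooKeeper client timeout settings (values are placeholders, in milliseconds).
# A shorter session timeout detects leadership loss sooner but makes the
# connection more sensitive to transient network problems.
high-availability.zookeeper.client.session-timeout: 60000
high-availability.zookeeper.client.connection-timeout: 15000
high-availability.zookeeper.client.retry-wait: 5000
high-availability.zookeeper.client.max-retry-attempts: 3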