Hi, we are attempting to migrate our Flink cluster to K8s, and are looking into options for automating job upgrades. Has anyone here done it with an init container? Or is there a simpler way?

0. Assume we have a JobManager with a few TaskManagers running in a StatefulSet, managed with Helm.
1. A new Helm chart is published and Helm attempts the upgrade. Since it is a StatefulSet, the new versions of the JobManager and TaskManagers are started while the old ones are still running.
2. In the JobManager pod there is an init container whose purpose is to find the currently running JobManager with the previous version of the job (either via ZooKeeper or via a Kubernetes service that points to the currently running JobManager). Once it finds it, it runs cancel-with-savepoint using the Flink CLI and passes the savepoint URL to the main container via a shared volume.
3. The JobManager container starts, finds the savepoint, and restores the new version of the job with the state from the savepoint.
4. The new pods pass their health checks, so the old pods are destroyed by Kubernetes.

What happens if there is no previous JobManager running? The init container sees that and just exits without doing any other work.

Caveat: most of the solutions I noticed use operators, which feel quite a bit more complex. Yet since I haven't found any solution using an init container, I'm guessing I'm missing something; I just can't figure out what.

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
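The upgrade flow described in the steps above could be sketched as an init-container shell script. This is only a sketch under assumptions: the service address, savepoint directory, and volume path are hypothetical, and the parsing of the Flink CLI output is a guess that should be verified against your Flink version.

```shell
#!/bin/sh
# Sketch of the init-container upgrade step described above.
# Assumptions (hypothetical, adjust to your setup):
#   - OLD_JM points at the K8s service of the currently running JobManager
#   - SAVEPOINT_DIR is a durable location (e.g. s3://...) Flink can write to
#   - /shared is a volume mounted into both the init and the main container
OLD_JM="${OLD_JM:-flink-jobmanager-old:8081}"
SAVEPOINT_DIR="${SAVEPOINT_DIR:-s3://my-bucket/savepoints}"

# Extract the savepoint path from the CLI output. The exact wording of the
# message is an assumption; check it against your Flink release.
extract_savepoint_path() {
  printf '%s\n' "$1" | sed -n 's/.*Savepoint stored in \(.*\)\.$/\1/p'
}

main() {
  # Find the single running job on the old JobManager (assumes one job;
  # the `flink list` line format is an assumption as well).
  job_id=$(flink list -r -m "$OLD_JM" \
    | sed -n 's/.* : \([0-9a-f]\{32\}\) : .*(RUNNING)$/\1/p' | head -n1)
  if [ -z "$job_id" ]; then
    echo "no previous JobManager/job found, nothing to do"
    exit 0
  fi
  # Cancel with savepoint and hand the path to the main container.
  out=$(flink cancel -s "$SAVEPOINT_DIR" -m "$OLD_JM" "$job_id")
  extract_savepoint_path "$out" > /shared/savepoint-path
}

# Only run when explicitly asked, so the functions can be sourced for testing.
if [ "${1:-}" = "run" ]; then
  main
fi
```

The main container would then check `/shared/savepoint-path` and, if present, start the job with `flink run -s "$(cat /shared/savepoint-path)" ...`.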
Hi Barisa,

it seems that there is no immediate answer to your concrete question here, so I wanted to ask you a more general question back: did you consider using the Community Edition of Ververica Platform for your purposes [1]? It comes with complete lifecycle management for Flink jobs on K8s. It also exposes a full REST API for integrating into CI/CD workflows, so if you do not need the UI, you can just ignore it. The Community Edition is permanently free for commercial use at any scale.

I see that you are already using Helm, so installation could be very straightforward [2]. Here is the documentation with a more comprehensive "Getting started" guide [3].

Best regards,

--
Alexander Fedulov | Solutions Architect
Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

On Wed, Apr 29, 2020 at 5:32 PM Barisa Obradovic <[hidden email]> wrote:
Hi Barisa,

from what you've described I believe it could work, but I have never tried it out. Maybe you could report back once you have tried it; I believe it would be interesting to hear your experience with this approach.

One thing to note is that the approach hinges on the old JobManager still running. If, for whatever reason, the old JobManager fails shortly before the new one comes up, then you might not be able to upgrade the job you want. You could mitigate the problem by using externalized checkpoints [1], but then you would fall back to an earlier point.

Cheers,
Till

On Thu, Apr 30, 2020 at 3:38 PM Alexander Fedulov <[hidden email]> wrote:
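For reference, the externalized-checkpoints fallback Till mentions could look roughly like the following `flink-conf.yaml` fragment. The bucket path is hypothetical, and the retention key name matches recent Flink versions; older releases configure retention in the job code via `CheckpointConfig#enableExternalizedCheckpoints` instead, so verify against your release.

```yaml
# Durable location for checkpoint metadata (path is hypothetical).
state.checkpoints.dir: s3://my-bucket/checkpoints
# Keep the latest checkpoint when the job is cancelled or fails, so a new
# JobManager can resume from it even if no savepoint could be taken.
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
```

Resuming from a retained checkpoint uses the same `-s <path>` flag as a savepoint, at the cost of rewinding to the last completed checkpoint rather than an exact cancel point.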
|
Thanks all.

1) For now we will try with in-house Kubernetes and see how it goes.
2) Till, cheers, I'll give it a stab, though I'll likely end up with an operator or some other workflow tool. (I've gotten multiple weird looks when I mentioned the init container approach at work; at this point I was mostly curious whether I could see what's so obviously wrong with it.)

Regarding "what if the job manager is down?": in that case, or if the job manager is restarting (so I can't create a savepoint anyway), the only way to upgrade would be, as you suggested, externalized checkpoints. Otherwise, the only option would be to wait to start the upgrade until the job manager stops restarting (if it's an external dependency that is causing it), and resume from the checkpoint.

The complexity of the job manager being in a restarting state is something I'd prefer not to handle in an init container; AFAIK, if the job is restarting, we shouldn't even try to do the upgrade (or we could, if we are okay with losing the state). An operator sounds like a much saner way to handle this.

On Thu, 30 Apr 2020 at 17:59, Till Rohrmann <[hidden email]> wrote:
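One way to implement the "don't even try to upgrade while the job is restarting" guard described above is to query the JobManager's monitoring REST API before attempting the savepoint. A hedged sketch, assuming the `/jobs` endpoint's JSON shape (`{"jobs":[{"id":"...","status":"RUNNING"}]}`) and a hypothetical service name; the jq-free JSON parsing is deliberately crude:

```shell
#!/bin/sh
# Sketch: only proceed with the upgrade if the job is in state RUNNING.

# Pull the status of the first job out of the /jobs JSON payload.
# (Assumes the {"jobs":[{"id":...,"status":...}]} layout; brittle by design,
# use jq or a real JSON parser in production.)
first_job_status() {
  printf '%s\n' "$1" | sed -n 's/.*"status":"\([A-Z]*\)".*/\1/p' | head -n1
}

check_upgradable() {
  # JobManager address is hypothetical; point it at your JM service.
  payload=$(curl -sf "http://flink-jobmanager:8081/jobs") || return 1
  [ "$(first_job_status "$payload")" = "RUNNING" ]
}
```

An init container (or an operator's reconcile loop) could call `check_upgradable` and back off, or deliberately skip the savepoint, when the job is in `RESTARTING` or `FAILING` state.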
|