I'm looking at the job cluster mode, it looks great and I am considering migrating our jobs off our "legacy" session cluster and into Kubernetes.

I do need to ask some questions, because I haven't found a lot of details in the documentation about how it works yet, and I gave up following the DI around in the code after a while.

Let's say I have a deployment for the job "leader" in HA with ZK, and another deployment for the taskmanagers.

I want to upgrade the code or configuration and start from a savepoint, in an automated way.

Best I can figure, I cannot just update the deployment resources in Kubernetes and allow the containers to restart in an arbitrary order. Instead, I expect sequencing is important, something along the lines of this:

1. issue savepoint command on leader
2. wait for savepoint
3. destroy all leader and taskmanager containers
4. deploy new leader, with savepoint url
5. deploy new taskmanagers

For example, I imagine old taskmanagers (with an old version of my job) attaching to the new leader and causing a problem.

Does that sound right, or am I overthinking it?

If not, has anyone tried implementing any automation for this yet?
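For steps 1 and 2 of the sequence above (issue a savepoint and wait for it), one way to script this is against the JobManager's REST API rather than the CLI. The sketch below is only a rough illustration in Python: the service address flink-jobmanager:8081 and the target directory s3://my-bucket/savepoints are made-up placeholders, and the exact endpoint shapes should be double-checked against the REST API docs for your Flink version.

import time
import requests

JM = "http://flink-jobmanager:8081"          # placeholder JobManager service address
TARGET_DIR = "s3://my-bucket/savepoints"     # placeholder savepoint target directory

# Find the single running job (a job cluster runs exactly one job).
jobs = requests.get(f"{JM}/jobs").json()["jobs"]
job_id = next(j["id"] for j in jobs if j["status"] == "RUNNING")

# Step 1: trigger a savepoint for that job.
resp = requests.post(
    f"{JM}/jobs/{job_id}/savepoints",
    json={"target-directory": TARGET_DIR, "cancel-job": False},
)
trigger_id = resp.json()["request-id"]

# Step 2: poll the trigger until the savepoint completes, then record its location.
while True:
    status = requests.get(f"{JM}/jobs/{job_id}/savepoints/{trigger_id}").json()
    if status["status"]["id"] == "COMPLETED":
        savepoint_path = status["operation"]["location"]
        break
    time.sleep(2)

print("savepoint written to", savepoint_path)

If the savepoint fails, the same status response carries a failure cause instead of a location, so a real script would want to handle that case and add a timeout.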
Hi Derek,
I am not an expert in Kubernetes, so I will cc Till, who should be able to help you more.

As for the automation of such a process, I would recommend having a look at dA platform [1], which is built on top of Kubernetes.

Best,

Dawid

[1] https://data-artisans.com/platform-overview
Hi Derek,
I think your automation steps look good. Recreating the deployments should not take long, and as you mention, this way you can avoid unpredictable old/new version collisions.

Best,
Andrey
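For the deployment-recreation part (steps 3 to 5), the ordering can be kept explicit with a small wrapper around kubectl. This is only a sketch under assumptions: the deployment names flink-taskmanager and flink-jobmanager and the manifest file names are placeholders, and it assumes the job cluster image can be pointed at the savepoint recorded earlier (for example via a --fromSavepoint argument or whatever restore mechanism your image exposes).

import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Step 3: tear everything down, taskmanagers first, so no old worker
# can re-attach to a leader that is already running the new code.
run("kubectl", "delete", "deployment", "flink-taskmanager", "--ignore-not-found")
run("kubectl", "delete", "deployment", "flink-jobmanager", "--ignore-not-found")

# Step 4: bring up the new jobmanager ("leader"); its manifest is assumed
# to hand the savepoint path to the entrypoint (e.g. --fromSavepoint ...).
run("kubectl", "apply", "-f", "jobmanager-deployment.yaml")
run("kubectl", "rollout", "status", "deployment/flink-jobmanager")

# Step 5: only then bring up the new taskmanagers.
run("kubectl", "apply", "-f", "taskmanager-deployment.yaml")
run("kubectl", "rollout", "status", "deployment/flink-taskmanager")

Because rollout status blocks until the new pods are available, the sequencing holds without any fixed sleeps.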
Hi Derek,

what I would recommend is to trigger the cancel-with-savepoint command [1]. This will create a savepoint and terminate the job execution. Next you simply need to respawn the job cluster, which you provide with the savepoint to resume from.

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint

Cheers,
Till
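Cancel-with-savepoint collapses steps 1 to 3 of the earlier list into one operation. On the command line this is roughly bin/flink cancel -s [targetDirectory] <jobID>; over the REST API it is the same savepoint trigger as in the earlier sketch, just with cancel-job set to true. A minimal variant, again with placeholder addresses and ids:

import requests

JM = "http://flink-jobmanager:8081"      # placeholder JobManager address
job_id = "<job-id>"                      # the running job's id, e.g. from GET /jobs

# Trigger a savepoint AND cancel the job in one request; the returned
# trigger id is polled exactly as in the plain-savepoint sketch above.
resp = requests.post(
    f"{JM}/jobs/{job_id}/savepoints",
    json={"target-directory": "s3://my-bucket/savepoints", "cancel-job": True},
)
print("trigger id:", resp.json()["request-id"])

Once the operation reports COMPLETED, the job is gone and the savepoint location can be fed straight into the redeploy steps.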
Sounds good.

Is someone working on this automation today?

If not, although my time is tight, I may be able to work on a PR for getting us started down the path of Kubernetes native cluster mode.
Hi Derek,

there is this issue [1] which tracks the active Kubernetes integration. Jin Sun already started implementing some parts of it. There should also be some PRs open for it. Please check them out.

[1] https://issues.apache.org/jira/browse/FLINK-9953

Cheers,
Till
Has any progress been made on this? There are a number of folks in the community looking to help out.

-H
Hi Heath,

I just learned that people from Alibaba already made some good progress with FLINK-9953. I'm currently talking to them in order to see how we can merge this contribution into Flink as fast as possible. Since I'm quite busy due to the upcoming release, I hope that other community members will help out with the reviewing once the PRs are opened.

Cheers,
Till
My team and I are keen to help out with testing and review as soon as there is a pull request.

-H
Alright, I'll get back to you once the PRs are open. Thanks a lot for your help :-)

Cheers,
Till
Hi Heath and Till,

thanks for offering to help review this feature. I just reassigned the JIRAs to myself after an offline discussion with Jin. Let us work together to get Kubernetes integrated natively with Flink. Thanks.
Great, my team is eager to get started. I'm curious what progress has been made so far?

-H
Hi Heath,

I think some of the PRs are already open and ready for review [1, 2].

Cheers,
Till