long lived standalone job session cluster in kubernetes

Derek VerLee

I'm looking at the job cluster mode; it looks great, and I am considering migrating our jobs off our "legacy" session cluster and into Kubernetes.

I do need to ask some questions, because I haven't found a lot of details in the documentation about how it works yet, and I gave up following the DI around in the code after a while.

Let's say I have a deployment for the job "leader" in HA with ZK, and another deployment for the taskmanagers.

I want to upgrade the code or configuration and start from a savepoint, in an automated way.

Best I can figure, I can not just update the deployment resources in kubernetes and allow the containers to restart in an arbitrary order.

Instead, I expect sequencing is important, something along the lines of this (a rough script sketch follows the list):

1. issue savepoint command on leader
2. wait for savepoint
3. destroy all leader and taskmanager containers
4. deploy new leader, with savepoint url
5. deploy new taskmanagers
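
A rough sketch of how those five steps might be scripted is below. Everything in it is a placeholder or assumption rather than something prescribed by Flink or Kubernetes: the JobManager REST address, the savepoint target, the deployment names, and a manifest file that templates the savepoint path via a {{SAVEPOINT}} marker.

# Rough sketch only: endpoints, bucket, deployment and manifest names are placeholders.
import subprocess
import time
import requests

JM = "http://flink-jobmanager:8081"          # JobManager REST endpoint
SAVEPOINT_DIR = "s3://my-bucket/savepoints"  # savepoint target directory

# 1. issue savepoint command on the leader (POST /jobs/:jobid/savepoints)
job_id = requests.get(f"{JM}/jobs").json()["jobs"][0]["id"]
trigger_id = requests.post(
    f"{JM}/jobs/{job_id}/savepoints",
    json={"target-directory": SAVEPOINT_DIR, "cancel-job": False},
).json()["request-id"]

# 2. wait for the savepoint to complete and record its location
while True:
    result = requests.get(f"{JM}/jobs/{job_id}/savepoints/{trigger_id}").json()
    if result["status"]["id"] == "COMPLETED":
        savepoint_path = result["operation"]["location"]
        break
    time.sleep(5)

# 3. destroy all leader and taskmanager containers
subprocess.run(
    ["kubectl", "delete", "deployment", "flink-jobmanager", "flink-taskmanager"],
    check=True,
)

# 4. deploy the new leader, handing it the savepoint URL
#    (assumes the manifest carries a {{SAVEPOINT}} placeholder in the container args)
manifest = open("jobmanager-deployment.yaml").read().replace("{{SAVEPOINT}}", savepoint_path)
subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest.encode(), check=True)

# 5. deploy the new taskmanagers
subprocess.run(["kubectl", "apply", "-f", "taskmanager-deployment.yaml"], check=True)

However the manifests are applied (here, or from CI), the point is only that the savepoint must be completed and its location recorded before anything is torn down.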


For example, without that sequencing I imagine old taskmanagers (running an old version of my job) attaching to the new leader and causing a problem.

Does that sound right, or am I overthinking it?

If not, has anyone tried implementing any automation for this yet?

Re: long lived standalone job session cluster in kubernetes

Dawid Wysakowicz
Hi Derek,

I am not an expert in Kubernetes, so I will cc Till, who should be able to help you more.

As for automating such a process, I would recommend having a look at the dA Platform [1], which is built on top of Kubernetes.

Best,

Dawid

[1] https://data-artisans.com/platform-overview

Re: long lived standalone job session cluster in kubernetes

Andrey Zagrebin
Hi Derek,

I think your automation steps look good. Recreating the deployments should not take long and, as you mention, this way you can avoid unpredictable old/new version collisions.

Best,
Andrey
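
For concreteness, a minimal sketch of what the recreated leader ("job cluster") Deployment could look like follows. The image name, job class, savepoint path and the --fromSavepoint flag are assumptions, not something from this thread; the exact entrypoint arguments depend on the flink-container image of your Flink version, so check its docs.

# Sketch of the job-cluster ("leader") Deployment; image, class name and savepoint
# flag are illustrative and depend on your Flink version's flink-container entrypoint.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
        - name: jobmanager
          image: my-registry/my-flink-job:latest   # image built via flink-container (placeholder)
          args:
            - job-cluster
            - --job-classname
            - com.example.MyStreamingJob
            - --fromSavepoint                      # assumes the entrypoint supports this flag
            - s3://my-bucket/savepoints/savepoint-abc123
          ports:
            - containerPort: 6123   # RPC
            - containerPort: 8081   # REST / web UI
          # HA / ZooKeeper settings would normally come from a flink-conf.yaml
          # mounted via a ConfigMap rather than from container args.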

Re: long lived standalone job session cluster in kubernetes

Till Rohrmann
Hi Derek,

What I would recommend is to trigger the cancel-with-savepoint command [1]. This will create a savepoint and terminate the job execution. Next you simply need to respawn the job cluster, providing it with the savepoint to resume from.

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
Cheers,
Till
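
A minimal sketch of that step against the REST API follows (roughly what bin/flink cancel -s <targetDirectory> <jobID> does from the CLI); the JobManager address, job id and target directory are placeholders.

# Sketch only: trigger cancel-with-savepoint via REST and poll until it completes.
import time
import requests

JM = "http://flink-jobmanager:8081"   # placeholder JobManager REST address
JOB_ID = "<jobId>"                    # placeholder job id

trigger_id = requests.post(
    f"{JM}/jobs/{JOB_ID}/savepoints",
    json={"target-directory": "s3://my-bucket/savepoints", "cancel-job": True},
).json()["request-id"]

while True:
    result = requests.get(f"{JM}/jobs/{JOB_ID}/savepoints/{trigger_id}").json()
    if result["status"]["id"] == "COMPLETED":
        print("respawn the job cluster from:", result["operation"]["location"])
        break
    time.sleep(5)

The printed savepoint location is then what the respawned job cluster gets pointed at.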

Re: long lived standalone job session cluster in kubernetes

Derek VerLee

Sounds good.

Is someone working on this automation today?

If not, although my time is tight, I may be able to work on a PR to get us started down the path of a Kubernetes-native cluster mode.

Re: long lived standalone job session cluster in kubernetes

Till Rohrmann
Hi Derek,

There is this issue [1], which tracks the active Kubernetes integration. Jin Sun has already started implementing some parts of it. There should also be some PRs open for it. Please check them out.

[1] https://issues.apache.org/jira/browse/FLINK-9953
Cheers,
Till

Re: long lived standalone job session cluster in kubernetes

Heath Albritton
Has any progress been made on this? There are a number of folks in the community looking to help out.


-H

Re: long lived standalone job session cluster in kubernetes

Till Rohrmann
Hi Heath,

I just learned that people from Alibaba have already made some good progress with FLINK-9953. I'm currently talking to them to see how we can merge this contribution into Flink as fast as possible. Since I'm quite busy due to the upcoming release, I hope that other community members will help out with the reviewing once the PRs are opened.

Cheers,
Till

Re: long lived standalone job session cluster in kubernetes

Heath Albritton
My team and I are keen to help out with testing and review as soon as there is a pull request.

-H

Re: long lived standalone job session cluster in kubernetes

Till Rohrmann
Alright, I'll get back to you once the PRs are open. Thanks a lot for your help :-)

Cheers,
Till

Re: long lived standalone job session cluster in kubernetes

Chunhui Shi
Hi Heath and Till, thanks for offering to help review this feature. I just reassigned the JIRAs to myself after an offline discussion with Jin. Let us work together to get Kubernetes integrated natively with Flink. Thanks.

Re: long lived standalone job session cluster in kubernetes

Heath Albritton
Great, my team is eager to get started. I'm curious what progress has been made so far?

-H

Re: long lived standalone job session cluster in kubernetes

Till Rohrmann
Hi Heath,

I think some of the PRs are already open and ready for review [1, 2].


Cheers,
Till
