I have my main application updating with a blue-green deployment strategy, whereby a new version (always called green) starts receiving an initial fraction of the web traffic and then, based on the error rates, we progressively increase the % of traffic until 100% of traffic is being handled by the green version. At that point we decommission blue, and green becomes the new blue when the next version comes along.
Applied to Flink, my initial thought is that you would run the two topologies in parallel, but the first action of each topology would be a filter based on the key. You would use a consistent transformation of the key into a number between 0 and 100, and the filter would be:

    (key) -> color == green ? f(key) < level : f(key) >= level

Then I can use a suitable metric to determine whether the new topology is working, and ramp the level up or down.

One issue I foresee is what happens if the level changes mid-window: I will have output from both topologies when the window ends. In the case of my output, which is aggregatable, I will get the same results from two rows as from one row *provided* that the switch from blue to green is synchronized between the two topologies. That sounds like a hard problem, though.

Another thought I had was to let the web front-end decide, based on the same key-vs-level approach. Rather than submit the raw event, I would add the target topology to the event, and the filter would just select based on whether it is the target topology. This has the advantage that I know each event will only ever be processed by one of green or blue. Heck, I could even use the main web application's blue-green deployment to drive the Flink blue-green deployment, because due to the way I structure my results I don't care if I get two rows of counts for a time window or one row: I'm adding up the total counts across multiple rows, and sum is sum!

Has anyone else had to deal with this type of thing?

-stephenc
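A rough, untested sketch of that first filter, with Event and getKey() as illustrative names not taken from the post, and f() as just a stable hash of the key into [0, 100):

    import org.apache.flink.api.common.functions.FilterFunction;

    /** Keeps the share of the key space that this topology's colour should process. */
    public class BlueGreenFilter implements FilterFunction<Event> {
        private final boolean isGreen; // true in the green topology, false in blue
        private final int level;       // % of the key space routed to green, 0..100

        public BlueGreenFilter(boolean isGreen, int level) {
            this.isGreen = isGreen;
            this.level = level;
        }

        /** Consistent transformation of the key into 0..99. */
        private static int f(String key) {
            return Math.floorMod(key.hashCode(), 100);
        }

        @Override
        public boolean filter(Event event) {
            int bucket = f(event.getKey());
            return isGreen ? bucket < level : bucket >= level;
        }
    }

Note that the level here is fixed at job submission time; making it change while the job runs is exactly the mid-window synchronization problem described above (the broadcast-state idea further down the thread addresses that).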
On Mon, 11 Feb 2019 at 13:26, Stephen Connolly <[hidden email]> wrote:
In other words, if a blue web node receives an event upload it adds "blue", whereas if a green web node receives an event upload it adds "green" (not quite those strings, but rather the web deployment sequence number). This has the advantage that the web nodes do not need to parse the event payload.

The % of web traffic will result in the matching % of events being sent to blue and green. Also, this means that all keys get processed at the target % during the deployment, which can help flush out bugs.

I can therefore stop the old topology more than one window after the green web nodes start getting 100% of traffic, in order to allow any windows still in flight to flush all the way to the datastore... Out-of-order events would be tagged as green once green is at 100% of traffic, and so can be processed correctly... And I can completely ignore topology migration serialization issues...

Sounding very tempting... there must be something wrong... (or maybe my data storage plan just allows me to make this kind of optimization!)
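With the tag travelling on the event, the Flink side collapses to a one-line filter. A minimal sketch, assuming the web tier stamps each event with the deployment sequence number of the node that received it (events, Event, and getDeploymentId() are illustrative names):

    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.streaming.api.datastream.DataStream;

    // This topology's own deployment sequence number, passed at submission time.
    ParameterTool params = ParameterTool.fromArgs(args);
    final String myDeploymentId = params.getRequired("deployment.id");

    // Keep only the events that the web tier tagged for this topology.
    DataStream<Event> owned = events
            .filter(e -> myDeploymentId.equals(e.getDeploymentId()));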
Another possibility would be injecting pseudo-events into the source and having a stateful filter. The event would be something like "key X is now owned by green". I can do that because getting a list of keys seen in the past X minutes is cheap (we have it already). But it's unclear what impact adding such state to the filter would have.

On Mon, 11 Feb 2019 at 13:33, Stephen Connolly <[hidden email]> wrote:
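One way to sketch that stateful filter: key both the data stream and the control stream by the event key, connect them, and keep the current owner in keyed state. Event, OwnershipEvent, and their accessors are illustrative names, not anything from the original post:

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
    import org.apache.flink.util.Collector;

    /**
     * Passes through events whose key is currently owned by this topology's colour.
     * Ownership flips when a pseudo-event ("key X is now owned by green") arrives
     * on the key-partitioned control stream.
     */
    public class OwnershipFilter extends KeyedCoProcessFunction<String, Event, OwnershipEvent, Event> {
        private final String myColour;     // "blue" or "green"
        private final String defaultOwner; // owner assumed before any pseudo-event is seen
        private transient ValueState<String> owner;

        public OwnershipFilter(String myColour, String defaultOwner) {
            this.myColour = myColour;
            this.defaultOwner = defaultOwner;
        }

        @Override
        public void open(Configuration parameters) {
            owner = getRuntimeContext().getState(new ValueStateDescriptor<>("owner", String.class));
        }

        @Override
        public void processElement1(Event event, Context ctx, Collector<Event> out) throws Exception {
            String current = owner.value();
            if (myColour.equals(current == null ? defaultOwner : current)) {
                out.collect(event);
            }
        }

        @Override
        public void processElement2(OwnershipEvent change, Context ctx, Collector<Event> out) throws Exception {
            owner.update(change.getNewOwner()); // e.g. "green"
        }
    }

The cost the post worries about is visible here: one ValueState entry per key ever seen, which is why the broadcast variant in the next message is attractive.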
On Mon, 11 Feb 2019 at 14:10, Stephen Connolly <[hidden email]> wrote:
Hmmm, it might not need to be quite so stateful: if the filter was implemented as a BroadcastProcessFunction or a KeyedBroadcastProcessFunction, I could run the key -> threshold transformation and compare it to the level from the broadcast state... That way the broadcast events wouldn't need to be associated with any specific key and could just be {"level":56}
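That could look something like the following sketch, treating the broadcast event as a bare integer level for brevity (Event and getKey() are illustrative, and f() is the same stable key hash as before):

    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
    import org.apache.flink.util.Collector;

    public class LevelFilter extends BroadcastProcessFunction<Event, Integer, Event> {
        /** Single-entry broadcast state holding the current level (0..100). */
        static final MapStateDescriptor<String, Integer> LEVEL_DESC =
                new MapStateDescriptor<>("level", Types.STRING, Types.INT);

        private final boolean isGreen;

        public LevelFilter(boolean isGreen) {
            this.isGreen = isGreen;
        }

        private static int f(String key) {
            return Math.floorMod(key.hashCode(), 100);
        }

        @Override
        public void processElement(Event event, ReadOnlyContext ctx, Collector<Event> out) throws Exception {
            Integer level = ctx.getBroadcastState(LEVEL_DESC).get("level");
            int effective = level == null ? 0 : level; // before any broadcast, blue keeps everything
            boolean mine = isGreen ? f(event.getKey()) < effective : f(event.getKey()) >= effective;
            if (mine) {
                out.collect(event);
            }
        }

        @Override
        public void processBroadcastElement(Integer level, Context ctx, Collector<Event> out) throws Exception {
            ctx.getBroadcastState(LEVEL_DESC).put("level", level);
        }
    }

Wiring it up would be along the lines of:

    events.connect(levelSource.broadcast(LevelFilter.LEVEL_DESC))
          .process(new LevelFilter(true)); // true in the green topology

One caveat: broadcast elements are not guaranteed to arrive at the same moment on every parallel instance, so the two topologies can still briefly disagree about the level, which is the same mid-window caveat raised earlier in the thread.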