zalando-stups / senza

Deploy immutable application stacks and create and execute AWS CloudFormation templates in a sane way

Home Page: https://pypi.python.org/pypi/stups-senza

Error during traffic switch causes 0% traffic for all stacks

ePaul opened this issue

Background

We had two stacks, one with 100% weight, and a broken one (failed deployment) with 0% weight.
After a third stack was deployed, our CD system switched traffic to it:

13:35:56.202 Running: /tools/run registry.opensource.zalan.do/stups/toolchain-stups:22 -- senza traffic purchase-orders-management.yaml 201904041320 100 --region eu-central-1
13:35:59.030 Calculating new weights.. OK
13:35:59.031 Stack Name                │Version     │Identifier                             │Old Weight%│Delta │Compensation│New Weight%│Current
13:35:59.031 purchase-orders-management              purchase-orders-management-201904031151         0.0                             0.0         
13:35:59.031 purchase-orders-management 201903281417 purchase-orders-management-201903281417       100.0 -100.0                      0.0         
13:35:59.031 purchase-orders-management 201904041320 purchase-orders-management-201904041320         0.0  100.0                    100.0 <       
13:36:01.074 Setting weights for purchase-orders-management.goodbuy.zalan.do...Validation Error: Stack:arn:aws:cloudformation:eu-central-1:383379053614:stack/purchase-orders-management-201904031151/0ecefee0-56ca-11e9-99be-026d43bbed96 is in CREATE_FAILED state and can not be updated.

So the traffic switching failed because of the broken stack. So far, so good.

Problem

But when we looked at the weights later, they looked like this:

$ senza traffic purchase-orders-management
Stack Name                │Version     │Identifier                             │Weight%
purchase-orders-management              purchase-orders-management-201904031151     0.0 
purchase-orders-management 201903281417 purchase-orders-management-201903281417     0.0 
purchase-orders-management 201904041320 purchase-orders-management-201904041320     0.0 

So now all stacks (including the broken one) had a weight of 0.0. That is definitely not correct.

Guess on what happened

Looking into the code of senza traffic, it looks like the command computes the new percentages (and displays them, as we can see), and then goes through them one-by-one, issuing the API call to change the weights. As soon as one of them fails, the whole command stops.

In this case, the effect seems to be that version 201903281417 is first set to 0, then the update of the broken stack is attempted (and fails), and setting 201904041320 to 100 is never even attempted.
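To make the suspected sequence concrete, here is a minimal simulation. It is not senza's actual code: `apply_sequentially`, the `broken` set, and the update order are assumptions taken from the guess above, and the stack versions stand in for the real stack names.

```python
def apply_sequentially(current, updates, broken):
    """Apply weight updates one by one and stop at the first failing stack.

    Hypothetical model of the suspected behaviour, not senza's implementation.
    """
    weights = dict(current)
    for stack, new_weight in updates:
        if stack in broken:
            # Stands in for the CREATE_FAILED validation error from the log;
            # every update after this point is skipped.
            break
        weights[stack] = new_weight
    return weights


current = {"201904031151": 0.0, "201903281417": 100.0, "201904041320": 0.0}
# Assumed order: old stack first, then the broken stack, then the new stack.
updates = [
    ("201903281417", 0.0),
    ("201904031151", 0.0),
    ("201904041320", 100.0),
]
result = apply_sequentially(current, updates, broken={"201904031151"})
print(result)
```

Under these assumptions every weight ends at 0.0, which matches the output of `senza traffic` shown above.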

What should happen

When switching traffic, the weight increases for some stacks should be applied before the weight decreases for other stacks.
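One way to express this ordering, as a sketch only: sort the planned updates by their delta so that increases come first and decreases last. The `order_updates` helper and the plain dictionaries are hypothetical and not part of senza.

```python
def order_updates(current, target):
    """Return (stack, new_weight) pairs, largest weight increase first.

    Hypothetical helper: sorts by delta (new minus old) in descending order,
    so increases are applied before decreases.
    """
    def delta(item):
        stack, new_weight = item
        return new_weight - current.get(stack, 0.0)

    return sorted(target.items(), key=delta, reverse=True)


current = {"201904031151": 0.0, "201903281417": 100.0, "201904041320": 0.0}
target = {"201904031151": 0.0, "201903281417": 0.0, "201904041320": 100.0}
ordered = order_updates(current, target)
for stack, weight in ordered:
    print(stack, weight)
```

With this order, the new stack is raised to 100 before anything else is touched. If a later update fails, as it did for the broken stack, the old stack simply keeps its weight and traffic is shared between old and new instead of dropping to 0% everywhere.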