- Foreword
- Introduction
- Prerequisites
- How does this implementation work?
- Installation and swarm setup
- I - Docker Installation
- II - Swarm Setup
- III - Services' Deployment
- Control of the System
- Updating our app
- Checking the health of the service
- Determining the LIVE instance
- Switching between LIVE and IDLE instances
- Draining a node
- Disposing of the services
- Research Material
- Afterthoughts
Just over a week ago, a challenge crossed my desk. It read: Thou shall find a way to implement thy app with no downtime nor pain for thy user, and thou shall solve it within a week!
I thought huh, how hard can it be?
And as usual, when such a question is asked, Murphy came to my place to stay for a few days. Because you know, that is how these things are.
And so this odyssey began...
This repository is about the journey of a sysadmin/developer looking for a way to solve such challenge, having never done this before.
In the past, most deployment methods required the site or app to go into maintenance mode or altogether offline in order to get updated. This used to be inconvenient for the user, as he can't use our service during such window.
Historically, such maintenance windows used to be performed during off-hours, but as services become more mission-critical or are provided across the globe, such practices become less and less acceptable, as there are no off-hours anymore.
Also, if something went wrong, a rollback might require the old version to be re-uploaded or restored from an out-of-site backup, adding up to the total downtime and also to the unhappiness of the userbase.
So, having considered that, I began my research into the matter, and after some time, I came across Mr. Martin Fowler's blog. It seems he has had the exact same problem I was presented with. Namely, how to perform Continuous Delivery without having to put the site on maintenance mode, or stopping the service altogether while doing so.
His approach was quite simple, yet elegant:
By having two production machines running the exact same version of our app, only one of them being live at any given moment handling all user requests and traffic while the other is idle, the new version of our app could be deployed to the idle machine, which is accessible on a different address and/or port, without interrupting the users from accessing our service.
Whenever it was required that the new version went LIVE, this could be easily accomplished by just redirecting traffic from the live to the idle machine while allowing all active requests to gracefully finish. That way, the moment the switch is made, all new requests are handled by the now LIVE machine, while the now IDLE one is in the process of being drained.
The technique is called Blue/Green Deployment (also Red/Black or A/B), and in this article, we are going to implement it using docker swarm.
In order to proceed with these instructions, you will first need to have:
Three networked VM's
- Manager1, our swarm controller
- Worker1, our first swarm worker
- Worker2, our second swarm worker
The following ports must be available. On some systems, these ports are open by default
- TCP port 2377 for cluster management communications
- TCP and UDP port 7946 for communication among nodes
- UDP port 4789 for overlay network traffic
- TCP ports 3000 and 3001 for connecting our services
- TCP ports 80 and 8080 on manager1
The VM's require the following software installed
- Ubuntu 16.04 xenial, 64-bit on all machines
- Docker Engine 1.12 or newer on all machines (I'm using the latest version)
- Nginx 1.10 or newer on manager1
Also, we need to assign fixed IP addresses to those machines. In this article, I will be using 192.168.2.11 for the manager/proxy, 192.168.2.12 for the first worker and 192.168.2.13 for the second worker.
Our local machine will require the latest version of Docker installed, and also an ssh client of your choice. I'm currently using vanilla Ubuntu 16.04 xenial 64bit with Docker version 17.03.0-ce.
Using docker swarm, we will create a swarm consisting of a manager and two workers. On it, we will be running two services listening on different ports. These will be our Blue and Green. Each service will be running identical versions of our app.
The swarm will be behind an nginx proxy which will be exposed to the internet. This proxy will determine which service is LIVE and which is IDLE, and will route all incoming traffic on port 80 to our LIVE serice, and all traffic on port 8080 to the IDLE service.
Whenever we need to update our software, we will do so on the IDLE service, while leaving the LIVE one deal with users.
Once app testing is complete, and the newly updated version is deemed fit for production, the proxy will be commanded to graciously switch traffic from LIVE to IDLE, swapping their roles. This way, those requests that are in progress will be allowed to finish, while new requests will be directed to the now LIVE service without any kind of downtime, hence performing updates without downtime.
First, we are going to install docker and its associated components. For this, run the docker-install.sh script (located on the setup folder) on manager1, worker1 and worker2. This script executes the following commands:
NOTE: Remember this was made for Ubuntu Xenial. If you are running a different version on the VM's, or not using linux altogether, this script won't work for you!
# Add repository keys
apt-key adv \
--keyserver hkp://p80.pool.sks-keyservers.net:80 \
--recv-keys 58118E89F3A912897C070ADBF76221572C52609D
# Add repository
apt-add-repository 'deb https://apt.dockerproject.org/repo ubuntu-xenial main'
# Update package lists
apt-get update
# Install docker engine (will install docker too)
apt-get install -y docker-engine
# Installation of Docker Machine
curl -L https://github.com/docker/machine/releases/download/v0.10.0/docker-machine-`uname -s`-`uname -m` >/tmp/docker-machine &&
chmod +x /tmp/docker-machine &&
cp /tmp/docker-machine /usr/local/bin/docker-machine
# Appending docker to proper user group, so we can run it without sudo
usermod -aG docker $(whoami)
NOTE: As usergoups are enumerated at login, you will need to log out and back in for this last command to take effect. There are ways to do it without re-logging, but it's dirty, so we won't go that way unless it's really necessary
Connect to the Manager node by SSH, and once logged in, install nginx:
sudo apt-get install nginx
Once installed, create a file named "/var/live" and write "blue" to it:
echo "blue" > /var/live
Whenever we start the VM or toggle between blue and green, this will tell us which service is currently LIVE. It will also be used by consul in order to write the nginx config file.
Connect to the Manager node by SSH, and once logged in, install consul-template by running the consul-install.sh script (located on the setup folder). This script executes the following commands:
wget https://releases.hashicorp.com/consul-template/0.18.0/consul-template_0.18.0_linux_amd64.zip
unzip consul-template_0.18.0_linux_amd64.zip -d /usr/local/bin
chmod +x /usr/local/bin/consul-template
mkdir /templates
Once done, open default.ctmpl and replace the MANAGER-IP-ADDRESS placeholders with the IP address of your manager node, then copy this file to /templates
In order to set nginx to its initial value, run the toggle.sh_ script with blue as a parameter. This will command consul to write the new nginx configuration, and then reload nginx so the changes take effect.
Connect to the Manager node by SSH, and once logged in, run the following command:
docker swarm init --advertise-addr <MANAGER-IP-ADDRESS>
If successfully executed, you will get the following output:
Swarm initialized: current node () is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token <TOKEN> \
<MANGER-IP-ADDRESS>:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
Save the first command for later use, as we will be using it on each of the worker nodes.
Exit the Manager, SSH into the first worker node, and run the command we saved earlier:
docker swarm join \
--token <TOKEN> \
<MANGER-IP-ADDRESS>:2377
NOTE: The placeholders will instead be the token and IP address of the manager node.
After a few seconds, you will get the following message, which means the node was added to our swarm successfully:
This node joined a swarm as a worker.
Repeat this step on the second worker node
SSH into the machine where the manager node runs and run the following command to see the worker nodes' status:
docker node ls
You will see output similar to the following, which means our swarm is running as expected:
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
03g1y59jwfg7cf99w4lt0f662 worker2 Ready Active
9j68exjopxe7wfl6yuxml7a7j worker1 Ready Active
dxn1zf6l61qsb1josjja83ngz * manager1 Ready Active Leader
Now that our swarm is up and running, we will Deploy our services. For the pruposes of this article we will be using the following app, which can be found here:
- GitHub source repo - https://github.com/Korrd/challenge-bg-scroll
- Docker hub image - https://hub.docker.com/r/korrd2/challenge-bg-scroll/
SSH into the Manager node, and run the following commands:
docker service create --workdir="/app" --publish 3000:3000 --name="blue-service" korrd2/challenge-bg-scroll:1.0.3
docker service create --workdir="/app" --publish 3001:3000 --name="green-service" korrd2/challenge-bg-scroll:1.0.4
This will create two services, each running identical copies of our app.
Run the following command in order to see its status:
docker service ls
You will get output similar to the following, which means our services are running as expected:
ID NAME MODE REPLICAS IMAGE
6dkch7ddr3mz blue-service replicated 1/1 korrd2/challenge-bg-scroll:1.0.4
84f44rcqmaqb green-service replicated 1/1 korrd2/challenge-bg-scroll:1.0.3
At this point, if you navigate to http://MANGER-IP-ADDRESS or http://MANGER-IP-ADDRESS:8080, you will see both the LIVE and IDLE services up and running.
Once up and running, controlling our system is quite easy. Scripts are provided by this solution that take care of it with a single command. I am going to enumerate and explain them one by one.
NOTE: These scripts are meant to be run on the manager node.
In order to update our app, run either deploy-blue.sh or deploy-green.sh using the image tag as parameter.
NOTE: Both scripts will first check whether you are deploying to the LIVE or the IDLE service. If deployment to LIVE is detected, the script will abort.
In order to scale our services run either scale-blue.sh or scale-green.sh, to change the desired state of the service running in the swarm. These scripts take a number as a parameter, which will determine its size.
Run the following command to see the updated task list:
docker service ps <SERVICE-ID>
For this, we have the healthcheck.sh script. It checks whether our services are working, doing poorly, or not working at all. It prints their status on screen and also writes it down to a logfile located on /var/log/challenge.
NOTE: This health check is quite rudimentary, and was written only for the purpose of demonstrating how to check our services are working. It will fail should the internet connection between you and the host were interrupted. Response time can also be affected by your network traffic.
A more robust check would involve a series of services running on each container, proxy, and outside our swarm.
- The one running on the container would send an all-is-well signal at an interval either to a #Slack channel, email account or destination of your choice. Call it a dead-man switch. Whenever the signal stops coming, we know there is a problem with that container.
- The one running on the proxy would check whether BLUE and GREEN are responding, and report any failure to one of the channels. It too should also send its own all-is-well signal so we know if it dies.
- The one running outside the cluster would check against it and inform us in case of a timeout. (This one would be similar to our current script).
That way, we can be sure that
- a) Whenever we have a failure anywhere on our swarm, it would let us know.
- b) If one of the health check services were to fail, the all-is-well signal would stop, hence it would not fail silently (like what happened to the people at GitLab with their backup system).
In order to do so, we just run the get-live-service.sh script. It will return the value of the LIVE instance.
NOTE: If the instance is undefined, it will issue a warning.
The toggle.sh script takes care of toggling LIVE and IDLE. Those will Force the proxy to set the live environment to their respective instances.
NOTE: If the instance is undefined, it will issue a warning.
Nodes can be drained with the following command:
docker node update --availability drain <NODE-ID>
where NODE-ID is the name of the node we want to drain (eg. worker1, worker2, etc)
I wrote no script for such action, but it can be done after stopping the swarm by executing the following commands on the manager node:
docker service rm blue-service
docker service rm green-service
-
BlueGreenDeployment, Martin Fowler - https://martinfowler.com/bliki/BlueGreenDeployment.html
-
Using Blue-Green Deployment to Reduce Downtime and Risk - https://docs.cloudfoundry.org/devguide/deploy-apps/blue-green.html
-
B/G deployment - https://lostechies.com/gabrielschenker/2016/05/23/blue-green-deployment/
-
B/G deployment, using containers - https://blog.tutum.co/2015/06/08/blue-green-deployment-using-containers/
-
Blue/Green deployment with HAproxy - https://www.reddit.com/r/sysadmin/comments/53h919/blue_green_deployment_with_haproxy/
-
Docker flow - https://technologyconversations.com/2016/04/18/docker-flow/
-
Research into doing it with Jenkins - https://technologyconversations.com/2015/12/08/blue-green-deployment-to-docker-swarm-with-jenkins-workflow-plugin/
-
Best practices for writing Dockerfiles - https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/
-
Installing Docker Machine - https://docs.docker.com/machine/install-machine/#installing-machine-directly
-
Docker in production environments - https://docs.docker.com/compose/production/
-
Zero downtime deployment with docker - https://medium.com/@korolvs/zero-downtime-deployment-with-docker-d9ef54e48c4#.kz6wgafyu
-
Another Zero downtime article - https://www.perimeterx.com/blog/zero-downtime-deployment-with-docker/
-
Bluegreen with haproxy - https://github.com/docker/dockercloud-haproxy
-
Docker compose overview - https://docs.docker.com/compose/overview/
-
How to create a swarm - https://docs.docker.com/engine/swarm/swarm-tutorial/create-swarm/
-
Deployment strategies (High Availability) - https://docs.docker.com/docker-cloud/infrastructure/deployment-strategies/
-
Dockerfile reference - https://docs.docker.com/engine/reference/builder/
-
Automated Builds - https://docs.docker.com/docker-cloud/builds/automated-build/
-
Reload a Linux users-group assignments without logging out - https://superuser.com/questions/272061/reload-a-linux-users-group-assignments-without-logging-out
-
GitLab postmortem, or as I like to call it: "Criminal Negligence: how all went to hell because we didn't check that our backup mechanisms were working properly, nor we cared to run a simulacrum of our disaster recovery procedure in order to determine if it did work as intended" - https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
-
Pushing to a git remote - https://help.github.com/articles/pushing-to-a-remote/
-
Removing leading zeroes from a variable in bash - https://coderwall.com/p/cobcna/bash-removing-leading-zeroes-from-a-variable
-
Curl get response time - https://viewsby.wordpress.com/2013/01/07/get-response-time-with-curl/
-
How to dockerize a nodejs app - https://nodejs.org/en/docs/guides/nodejs-docker-webapp/
-
Bash seems not to like floating point numbers... - http://stackoverflow.com/questions/19597962/bash-illegal-number
-
Swarm mode overview - https://docs.docker.com/engine/swarm/
-
Swarm mode key concepts - https://docs.docker.com/engine/swarm/key-concepts/
-
Swarm mode tutorial - https://docs.docker.com/engine/swarm/swarm-tutorial/
-
Blue green with swarm and Jenkins - https://technologyconversations.com/2015/07/02/scaling-to-infinity-with-docker-swarm-docker-compose-and-consul-part-34-blue-green-deployment-automation-and-self-healing-procedure/
-
Load-balancing with nginx - https://www.nginx.com/resources/admin-guide/load-balancer/
-
Nginx reload with no downtime - http://serverfault.com/questions/378581/nginx-config-reload-without-downtime
As you can see, Deployment is now way easier using the blue/green technique. Also almost totally painless and very fast.
Add some Ansible/Jenkins magic to this solution, and you can have a fully automated CI/CD pipeline running in no time. Nowadays, the path to the MVP is almost a joyride.
On a more personal note, this was quite the journey. Before starting, I saw myself as a knowledgeable sysop, yet diving into this I discovered a whole new world I wasn't previously aware of. I'm already thinking on ways to improve this (quite basic) solution further.
- Write a script to automate installation and setup of the swarm
- Improve control scripts to consider the result of each command before proceeding to the next one
- Add a 3rd service, where the latest version of the app would be autodeployed for testing purposes whenever someone does a git push.
Added control scripts and finished testing the solution. Also wrote the remaining entries on the readme file.
The solution got modified to run as a swarm using docker swarm. It is now much more simple and less error-prone than the previous version, and also uses the latest technology offered by Docker as of this date.