wtsi-ssg/autoscale-etcd

This repository contains a template to deploy an autoscaling service on openstack. The scale-up is handled by heat and the scale down is handled by the etcd scripts.

Before using, you should replace the configscript with your own script, see doc/writing_a_configscript.md for help
To use this repo (credentials should be valid for the deploy location and should probably be a service account):

git clone git@gitlab.internal.sanger.ac.uk/bh9/autoscale-etcd.git
cd autoscale-etcd
openstack stack create -t template.yaml\ 
--parameter OS_USERNAME=my_OS_USERNAME\  
--parameter OS_TENANT_NAME=my_OS_TENANT_NAME\
--parameter OS_PASSWORD=my_OS_PASSWORD\
--parameter net_name="my_network"\
--parameter image="Ubuntu Xenial"\
--parameter sec_grps=["cloudforms_ssh_in","internal_etcd","netdata","my_service_sec_grp"]
--parameter insance_type="m1.small"\
--parameter key_name=my_key_name\
--parameter configscript="my_remote_script_name.sh"\
my_stack_name

internal_etcd should open ports 12379 and 12380 and netdata should open 19999. These can both be open to just the private network (assuming the metrics server is in the same private network).
Other (optional) parameters:

name	default	description
OS_REGION	regionOne	the nova region it should be deployed to (delta only has regionOne)
capacity	3	The target capacity that the cluster should aim to be when load is low
scaledownperiod	200	The minimum time between scale down operations
etcdclientport	12379	The tcp port which etcd uses to handle client requests
etcdpeerport	12380	The tcp port which etcd uses to communicate internally
retries	10	The number of attemtps to join the cluster before failure
lockattemptperiod	10	The minimum time between a single host's lock acquire attempts
min_cluster	3	minimum cluster size (the point at which heat will autoreplace failed nodes). Note that the scale down scripts will only scale down to capacity, not min_cluster
max_cluster	5	The maximum size of the cluster
scaleupcooldown	240	The minimum time between scale up operations, note that ceilometer applies a minimum of 10 minutes due to it's gathering period
etcdclientscheme	http	The protocol used to serve client requests (Note: https uses auto-tls. Since VMs have low entropy, this step can take 5 minutes)
etcdpeerscheme	http	The protocol used for peer-to-peer communications (Note: https uses auto-tls. Since VMs have low entropy, this step can take 5 minutes)
proxies	0	The number of proxies in front of the etcd cluster (use proxyconfig.sh to configure them to also act as e.g. a mongos)
downmetric	NETDATA_SYSTEM_CPU_IDLE	the netdata metric to use for scaling down (e.g. `NETDATA_SYSTEM_CPU_IDLE` or `NETDATA_SYSTEM_LIAD_LOAD1` or `NETDATA_SYSTEM_IO_IN, see doc/all_metrics for more options). Currently, this is only as it would be returned, however, I plan to add rate of change
threshold	10	the scaledown threshold of the chosen metric (default is NETDATA_SYSTEM_CPU_IDLE)
comparator	'<'	the comparator between the metric and the threshold (options are '>' '<' '==' '<=' '>=')
upmetric	cpu_util	the heat metric to scale up for
metrics_server	0	whether or not a metrics server should be deployed, 1 or 0
failtolerance	20	every lockattemptperiod seconds, an etcd communication is made. If this fails, the failmarker goes up by 5, but if it succeeds, the marker goes down by 1. If this marker exceeds failtolerance, the machine is removed

Supplied with this is a pair of packer templates. Using the images produced by these templates has dropped the maximum time the cluster is down a member from ~200s to ~70s on the same hardware, however they are not compulsory. Note: to use other images, add the commands in the relevant template to the top of etcd/etcd_autoscale

Note: etcd/etcd_autoscale.sh is based on https://github.com/MonsantoCo/etcd-aws-cluster

wtsi-ssg / autoscale-etcd

About

Languages