anthroprose / operations

Chef Cookbook and Model Defined Infrastructure for Instant Single Stack Continuous Operations


operations


v0.2.0

Chef Cookbook for a single stack operations machine.

This cookbook and its associated role & metadata are currently tuned for an m3.xlarge (4 cores, 15G of RAM); we started with a c3.large (2 cores, 3.75G of RAM) and moved up to give ElasticSearch extra headroom for large log bursts of the half-million-per-minute variety, and because the Node.js statsD daemon eats CPU. In production we are capable of aggregating logs, indexing and serving live analytics for approximately 40,000 Transactions Per Minute of our Web App, at anywhere from 3 - 6 log lines per request (NginX, uWSGI, App), which works out to 250,000 to 500,000 log lines per minute at peak! Additionally, approximately 5,000,000 (yeah, that's millions) time series datapoints are aggregated and written every minute from diamond and statsD calls in the codebase.

No special tuning has occurred; we are using standard EBS with no PIOPS or custom kernel settings at this point. We're thinking about switching to https://github.com/armon/statsite or https://github.com/bitly/statsdaemon for a less CPU-intensive statsD daemon (the Node.js version currently uses more CPU than ElasticSearch, Carbon or Logstash).

Included is a CloudFormation template which sets up a 1:1 Min/Max ASG to guarantee uptime of the instance. All data is stored under /opt, which is an EBS mountpoint in AWS. Snapshots are taken every hour, and on boot/reboot the machine checks for old snapshots to mount under /opt instead of re-installing or re-creating the drive. At most you may lose up to 1 hour of data with this setup (small gaps in graphs).

Note that when creating a new AMI from a running server, AWS will by default include the sdh volume. Do not let it: the snapshot restore will not work properly.
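The hourly freeze/snapshot/thaw cycle could be wired up as a cron entry along these lines (a sketch only; the volume id and snapshot description are placeholders, not the cookbook's actual implementation):

```
# /etc/cron.d/opt-snapshot -- hourly snapshot of the EBS volume mounted on /opt.
# Freeze the XFS filesystem, snapshot, then thaw. vol-XXXXXXXX is a placeholder;
# in practice the id would come from instance metadata or node attributes.
0 * * * * root xfs_freeze -f /opt && aws ec2 create-snapshot --volume-id vol-XXXXXXXX --description "opt hourly"; xfs_freeze -u /opt
```

Note the `;` before the final `xfs_freeze -u`: the thaw runs even if the snapshot call fails, so the filesystem never stays frozen.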

Log Aggregation/Analysis

  • ElasticSearch
  • Logstash
  • Kibana
  • Rsyslog
  • Redis
  • Beaver

Time Series / Metrics

  • Graphite
  • StatsD
  • Tattle (probably going to be replaced with Seyren)
  • Skyline (In Progress)

Continuous Integration / Delivery

  • Jenkins
  • Test Kitchen (In Progress)

Infrastructure Reporting

  • Netflix's ICE for AWS Billing Reporting

Changelog

Test Kitchen

AWS

  • Coming soon

Requirements

  • CentOS/RHEL/Amazon Linux

packages

  • rubygems - chef, and gems
  • ruby-devel - for compiling and installing gems

pip packages

  • beaver==31 - Log shipping
  • flask - lightweight web framework
  • grequests - gevent async http
  • scikits.statsmodels - stats
  • scipy - stats
  • numpy - for crunching stats
  • pandas - data structures
  • patsy - statistical models
  • statsmodels - statistical models
  • msgpack_python - serialization
  • boto - api calls

rubygems

  • chef-zero - mock all the things
  • test-kitchen - test all the things
  • kitchen-ec2 - test all the things in the cloud

chef cookbooks

  • recipe[yum] - packages
  • recipe[user] - users
  • recipe[cron] - crontabs
  • recipe[rsyslog::client] - log aggregation
  • recipe[git] - code checkout
  • recipe[chef-solo-search] - if not using chef server
  • recipe[graphite] - time series graphing
  • recipe[sudo] - users
  • recipe[redisio] - log aggregation queue
  • recipe[java] - all the things
  • recipe[maven] - for building seyren
  • recipe[postfix] - alerting
  • recipe[mysql::server] - metadata storage
  • recipe[logstash::server] - log aggregation
  • recipe[statsd] - time series data
  • recipe[elasticsearch] - log aggregation and document store
  • recipe[nginx] - http(s)
  • recipe[kibana] - log aggregation visualization
  • recipe[jenkins::server] - continuous integration/delivery
  • recipe[chatbot] - hipchat v2 api bot
  • recipe[chatbot::init] - init.d for bot

projects

to consider

  • revily - On-call scheduling and incident response

Attributes

operations::default

| Key | Type | Description | Default |
|-----|------|-------------|---------|
| `['operations']['user']` | String | system account | operations |
| `['operations']['ssh_keys']` | String Set [Array] | public ssh keys for system account's authorized_keys | ["","",""] |
| `['rsyslog']['server_ip']` | String | syslog server for rsyslog forwarding | syslog.internal.operations.com |
| `['rsyslog']['port']` | Integer | syslog port for rsyslog forwarding | 5544 |
| `['logstash']['server']['base_config_cookbook']` | String | cookbook with logstash server config template | operations |
| `['logstash']['server']['install_rabbitmq']` | Boolean | whether to install rabbitmq | false |
| `['logstash']['server']['xmx']` | String | java max ram | 512M |
| `['logstash']['server']['xms']` | String | java min ram | 512M |
| `['statsd']['delete_idle_stats']` | Boolean | delete idle stats | true |
| `['statsd']['delete_timers']` | Boolean | delete idle timers | true |
| `['statsd']['delete_gauges']` | Boolean | delete idle gauges | true |
| `['statsd']['delete_sets']` | Boolean | delete idle sets | true |
| `['statsd']['delete_counters']` | Boolean | delete idle counters | true |
| `['statsd']['flush_interval']` | Integer | flush interval in ms - set this the same as diamond! (1 minute here) | 60000 |
| `['authorization']['sudo']['passwordless']` | Boolean | allow passwordless sudo | true |
| `['authorization']['sudo']['users']` | String Set [Array] | list of users to allow sudo access | ["ec2-user", "operations"] |
| `['postfix']['main']['smtpd_use_tls']` | Boolean | use tls when connecting out | false |
| `['tattle']['listen_port']` | Integer | port for tattle webapp | 8082 |
| `['tattle']['url']` | String | url for alert emails to link back | tattle.internal.operations.com |
| `['tattle']['admin_email']` | String | email alerts are from | ops@operations.com |
| `['tattle']['doc_root']` | String | docroot for tattle webapp | /opt/tattle |
| `['chatbot']['rooms']` | String Set [Array] | list of hipchat rooms to join | ["alpha", "names"] |
| `['chatbot']['username']` | String | hipchat account username | realname |
| `['chatbot']['password']` | String | hipchat password | xx |
| `['chatbot']['nickname']` | String | nickname for bot | eggdrop |
| `['chatbot']['api_key']` | String | v2 api key | md5 |
| `['nginx']['default_domain']` | String | default vhost listener | localhost |
| `['nginx']['default_site_enabled']` | Boolean | allow default docroot | false |
| `['nginx']['sites']['proxy']` | String Set [Array of {Object}] | snazzy nginx proxy metadata | [ { "domain":"graphite.internal.operations.com", "directory":"/opt/graphite/webapp/content/", "proxy_location" : "http://localhost:8080" }, { "domain":"anthracite.internal.operations.com", "directory":"/opt/anthracite/", "proxy_location" : "http://localhost:8081" }, { "domain":"tattle.internal.operations.com", "directory":"/opt/tattle/", "proxy_location" : "http://localhost:8082" }, { "domain":"skyline.internal.operations.com", "directory":"/opt/skyline/", "proxy_location" : "http://localhost:1500" }, { "domain":"jenkins.internal.operations.com", "directory":"/opt/jenkins/", "proxy_location" : "http://localhost:8089" } ] |
| `['apache']['listen_ports']` | Integer [Array] | ports that apache can vhost listen on | 8080 |
| `['graphite']['listen_port']` | Integer | graphite vhost listener | 8080 |
| `['graphite']['graphite_web']['bitmap_support']` | Boolean | compile fancy bitmap support | false |
| `['kibana']['webserver_hostname']` | String | hostname for kibana | kibana.internal.operations.com |
| `['kibana']['webserver_listen']` | String | ip to bind to | * |
| `['elasticsearch']['allocated_memory']` | String | ram for elasticsearch | 2048m |
| `['elasticsearch']['version']` | String | version to install | 0.90.11 |
| `['elasticsearch']['path']['data']` | String | path to data store | /opt/elasticsearch/data |
| `['elasticsearch']['path']['work']` | String | path to work store | /opt/elasticsearch/work |
| `['elasticsearch']['path']['logs']` | String | path to logs | /var/log/elasticsearch |
| `['mysql']['server_debian_password']` | String | debian-sys-maint password for mysql | xx |
| `['mysql']['server_repl_password']` | String | replication password for mysql | xx |
| `['mysql']['server_root_password']` | String | root password for mysql | xx |
| `['jenkins']['server']['port']` | Integer | port jenkins lives on | 8089 |
| `['jenkins']['server']['home']` | String | data dir | /opt/jenkins |
| `['jenkins']['server']['url']` | String | url for jenkins | http://jenkins.internal.operations.com |

Features/Usage

operations::default

  • Include this on any node to get all of the prerequisites for log and metrics shipping
  • Just set: "rsyslog" => { "server_ip" => "syslog.internal.operations.com", "port" => "5544" }
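For example, as role or node attribute JSON (the hostname and port are the sample values above; substitute your own syslog endpoint):

```json
{
  "rsyslog": {
    "server_ip": "syslog.internal.operations.com",
    "port": "5544"
  }
}
```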

operations::infrastructure

  • Some attributes must be overridden in a role rather than defaulted - check the role JSON. We do this because attributes are set and overridden across a large number of cookbooks.
  • If using AWS, it self-snapshots the EBS volume mounted at /opt once an hour by freezing the XFS filesystem, snapshotting, and then thawing the drive.
  • If using AWS, it uses UserData to check for previous snapshots and loads the latest one instead of creating a new /opt mount (bounce-back servers! you lose up to 1 hour of data/gaps in graphs with this)
  • Log Aggregation/Indexing/Querying for your entire Infrastructure
  • Time Series data collection and graphing
  • Event annotation for tracking operation events such as deploys/downtime along with graphs
  • Alerting for Time Series Data
  • Jenkins for reporting on timed/cron'd operational tasks, or for actual continuous integration/delivery
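The bounce-back restore could look roughly like this in UserData (a sketch under assumptions: the "opt-backup" tag filter, the query, and the 100G fallback size are illustrative, not the template's actual logic):

```
# Reuse the newest snapshot of the /opt volume if one exists, else build fresh.
SNAP=$(aws ec2 describe-snapshots \
  --filters Name=tag:Name,Values=opt-backup \
  --query 'Snapshots | sort_by(@, &StartTime)[-1].SnapshotId' --output text)
if [ -n "$SNAP" ] && [ "$SNAP" != "None" ]; then
  aws ec2 create-volume --snapshot-id "$SNAP" --availability-zone "$AZ"
else
  aws ec2 create-volume --size 100 --availability-zone "$AZ"
fi
# ...then attach the volume as sdh and mount it under /opt.
```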

Notes for Scale

  • If you are running redis 2.4.x, increase the ulimit or upgrade to 2.6.x; running out of file descriptors will cause 100% CPU and a non-responsive redis (reference)
  • The Node.js statsD daemon is the highest CPU user; consider running a C version such as statsite or bitly's statsdaemon
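For the redis file-descriptor issue above, the limit can be raised in /etc/security/limits.conf before resorting to an upgrade (65536 is an arbitrary example value, not a tuned recommendation):

```
redis soft nofile 65536
redis hard nofile 65536
```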

Contributing

  1. Fork the repository on Github
  2. Create a named feature branch (like add_component_x)
  3. Write your change
  4. Write tests for your change (if applicable)
  5. Run the tests, ensuring they all pass
  6. Submit a Pull Request using Github

License and Authors

Authors:

  • anthroprose - corley@avast.com
  • rdickey - rdickeyvii@gmail.com


License: GNU General Public License v2.0
