Chef Cookbook for a single stack operations machine.
This cookbook and its associated role & metadata are currently tuned for an m3.xlarge (4 cores, 15G of RAM); we started with a c3.large (2 cores, 3.75G of RAM) but moved up to give ElasticSearch extra headroom for large log bursts of the half-million-lines-per-minute variety, and because the Node.js statsD eats CPU. In production we are capable of aggregating logs, indexing, and serving live analytics for approximately 40,000 transactions per minute of our web app, at anywhere from 3 - 6 log lines per request (NginX, uWSGI, App) - that is 250,000 to 500,000 log lines per minute at peak! Additionally, approximately 5,000,000 (yes, that's millions) time series datapoints are aggregated and written every minute from Diamond and statsD calls in the codebase.
No special tuning has occurred, and we are using standard EBS with no PIOPS or kernel settings at this point. We're thinking about switching to https://github.com/armon/statsite or https://github.com/bitly/statsdaemon for a less CPU-intensive statsD daemon (it currently uses more CPU than ElasticSearch, Carbon or Logstash).
Included is a CloudFormation template which will set up a 1:1 Min/Max ASG for guaranteeing uptime of the instance. All data is stored under /opt, which is an EBS mountpoint in AWS. Snapshots are taken every hour, and on boot/reboot the machine checks for old snapshots to mount under /opt instead of re-installing or re-creating the drive. At most you may lose up to 1 hour of data with this setup (small gaps in graphs).
Note that when creating a new AMI from a running server, AWS will by default include the sdh volume. Do not let it, as the snapshot restore will not work properly.
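The boot-time restore described above can be sketched roughly as follows. This is a hedged sketch, not the cookbook's actual UserData: the `latest_snapshot` helper and the tag-filter convention are illustrative assumptions, and the commented EC2 calls assume the aws-sdk gem.

```ruby
require 'time'

# Pick the most recent completed snapshot from a list of snapshot
# descriptions (hashes with :state and :start_time, mirroring what
# EC2's DescribeSnapshots returns). Returns nil if none are usable,
# in which case a fresh /opt volume should be created instead.
def latest_snapshot(snapshots)
  snapshots
    .select { |s| s[:state] == 'completed' }
    .max_by { |s| Time.parse(s[:start_time]) }
end

# Hypothetical boot-time flow (names are illustrative, not the real
# cookbook's): restore the newest /opt snapshot if one exists.
#
#   resp = ec2.describe_snapshots(
#     filters: [{ name: 'tag:Name', values: ['operations-opt'] }])
#   snap = latest_snapshot(resp.snapshots.map(&:to_h))
#   if snap
#     ec2.create_volume(snapshot_id: snap[:snapshot_id],
#                       availability_zone: az)   # bounce-back restore
#   else
#     ec2.create_volume(size: 100, availability_zone: az)  # brand-new /opt
#   end
```

Because snapshots are hourly, the newest completed snapshot is at most an hour old, which is where the "lose up to 1 hour of data" bound comes from.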
- ElasticSearch
- Logstash
- Kibana
- Rsyslog
- Redis
- Beaver
- Graphite
- StatsD
- Tattle (probably going to replace with seyren)
- Skyline (In Progress)
- Jenkins
- Test Kitchen (In Progress)
- Netflix's ICE for AWS Billing Reporting
- Read the Changelog
- Read the Test Kitchen Readme
- Coming soon
- CentOS/RHEL/Amazon Linux
- rubygems - chef, and gems
- ruby-devel - for compiling and installing gems
- beaver==31 - Log shipping
- flask - lightweight web framework
- grequests - gevent async http
- scikits.statsmodels - stats
- scipy - stats
- numpy - for crunching stats
- pandas - data structures
- patsy - statistical models
- statsmodels - statistical models
- msgpack_python - serialization
- boto - api calls
- chef-zero - mock all the things
- test-kitchen - test all the things
- kitchen-ec2 - test all the things in the cloud
- recipe[yum] - packages
- recipe[user] - users
- recipe[cron] - crontabs
- recipe[rsyslog::client] - log aggregation
- recipe[git] - code checkout
- recipe[chef-solo-search] - if not using chef server
- recipe[graphite] - time series graphing
- recipe[sudo] - users
- recipe[redisio] - log aggregation queue
- recipe[java] - all the things
- recipe[maven] - for building seyren
- recipe[postfix] - alerting
- recipe[mysql::server] - metadata storage
- recipe[logstash::server] - log aggregation
- recipe[statsd] - time series data
- recipe[elasticsearch] - log aggregation and document store
- recipe[nginx] - http(s)
- recipe[kibana] - log aggregation visualization
- recipe[jenkins::server] - continuous integration/delivery
- recipe[chatbot] - hipchat v2 api bot
- recipe[chatbot::init] - init.d for bot
- diamond - metrics & monitoring
- beaver - log shipping
- anthracite - event annotation for metrics
- seyren - better alerting than tattle
- aws-minions - snapshot backups & restores, dynamic dns
- skyline - anomaly detection
- test kitchen - chef continuous integration
- ice - aws billing reports
- revily - On-call scheduling and incident response
Key | Type | Description | Default |
---|---|---|---|
['operations']['user'] | String | system account | operations |
['operations']['ssh_keys'] | String Set [Array] | public ssh keys for system account's authorized_keys | ["","",""] |
['rsyslog']['server_ip'] | String | syslog server for rsyslog forwarding | syslog.internal.operations.com |
['rsyslog']['port'] | Integer | syslog port for rsyslog forwarding | 5544 |
['logstash']['server']['base_config_cookbook'] | String | cookbook with logstash server config template | operations |
['logstash']['server']['install_rabbitmq'] | Boolean | to install rabbitmq or not | false |
['logstash']['server']['xmx'] | String | java max ram | 512M |
['logstash']['server']['xms'] | String | java min ram | 512M |
['statsd']['delete_idle_stats'] | Boolean | delete idle stats | true |
['statsd']['delete_timers'] | Boolean | delete idle timers | true |
['statsd']['delete_gauges'] | Boolean | delete idle gauges | true |
['statsd']['delete_sets'] | Boolean | delete idle sets | true |
['statsd']['delete_counters'] | Boolean | delete idle counters | true |
['statsd']['flush_interval'] | Integer | flush interval in ms - set this the same as diamond! (1 minute here) | 60000 |
['authorization']['sudo']['passwordless'] | Boolean | allow passwordless sudo | true |
['authorization']['sudo']['users'] | String Set [Array] | list of users to allow sudo access | ["ec2-user", "operations"] |
['postfix']['main']['smtpd_use_tls'] | Boolean | use tls when connecting out | false |
['tattle']['listen_port'] | Integer | port for tattle webapp | 8082 |
['tattle']['url'] | String | url for alert emails to link back | tattle.internal.operations.com |
['tattle']['admin_email'] | String | email alerts are from | ops@operations.com |
['tattle']['doc_root'] | String | docroot for tattle webapp | /opt/tattle |
['chatbot']['rooms'] | String Set [Array] | list of hipchat rooms to join | ["alpha", "names"] |
['chatbot']['username'] | String | hipchat account username | realname |
['chatbot']['password'] | String | hipchat password | xx |
['chatbot']['nickname'] | String | nickname for bot | eggdrop |
['chatbot']['api_key'] | String | v2 api key | md5 |
['nginx']['default_domain'] | String | default vhost listener | localhost |
['nginx']['default_site_enabled'] | Boolean | allow default docroot | false |
['nginx']['sites']['proxy'] | String Set [Array of {Object}] | snazzy nginx proxy metadata | [ { "domain":"graphite.internal.operations.com", "directory":"/opt/graphite/webapp/content/", "proxy_location" : "http://localhost:8080" }, { "domain":"anthracite.internal.operations.com", "directory":"/opt/anthracite/", "proxy_location" : "http://localhost:8081" }, { "domain":"tattle.internal.operations.com", "directory":"/opt/tattle/", "proxy_location" : "http://localhost:8082" }, { "domain":"skyline.internal.operations.com", "directory":"/opt/skyline/", "proxy_location" : "http://localhost:1500" }, { "domain":"jenkins.internal.operations.com", "directory":"/opt/jenkins/", "proxy_location" : "http://localhost:8089" } ] |
['apache']['listen_ports'] | Integer [Array] | ports that apache can vhost listen on | 8080 |
['graphite']['listen_port'] | Integer | graphite vhost listener | 8080 |
['graphite']['graphite_web']['bitmap_support'] | Boolean | compile fancy bitmap support | false |
['kibana']['webserver_hostname'] | String | hostname for kibana | kibana.internal.operations.com |
['kibana']['webserver_listen'] | String | ip to bind to | * |
['elasticsearch']['allocated_memory'] | String | ram for elasticsearch | 2048m |
['elasticsearch']['version'] | String | version to install | 0.90.11 |
['elasticsearch']['path']['data'] | String | path to data store | /opt/elasticsearch/data |
['elasticsearch']['path']['work'] | String | path to work store | /opt/elasticsearch/work |
['elasticsearch']['path']['logs'] | String | path to logs | /var/log/elasticsearch |
['mysql']['server_debian_password'] | String | debian-sys-maint password for mysql | xx |
['mysql']['server_repl_password'] | String | replication password for mysql | xx |
['mysql']['server_root_password'] | String | root password for mysql | xx |
['jenkins']['server']['port'] | Integer | port jenkins lives on | 8089 |
['jenkins']['server']['home'] | String | data dir | /opt/jenkins |
['jenkins']['server']['url'] | String | url for jenkins | http://jenkins.internal.operations.com |
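As an example, a minimal role JSON overriding a few of the attributes above might look like this (hostnames and values are illustrative, not required settings):

```json
{
  "name": "operations",
  "override_attributes": {
    "rsyslog": { "server_ip": "syslog.internal.operations.com", "port": 5544 },
    "logstash": { "server": { "xms": "512M", "xmx": "512M" } },
    "elasticsearch": { "allocated_memory": "2048m" },
    "statsd": { "flush_interval": 60000 }
  }
}
```

Remember to keep ['statsd']['flush_interval'] aligned with Diamond's interval, per the table above.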
- Include this on any node for all of the pre-reqs for log and metrics shipping
- Just set:
"rsyslog" => { "server_ip" => "syslog.internal.operations.com", "port" => "5544" }
- Some attributes must be overridden, not defaulted. Check the role JSON; we use role-level overrides because attributes are set and overridden across a large number of cookbooks.
- If using AWS, it self-snapshots the /opt mounted EBS once an hour by freezing the XFS filesystem, snapshotting and then thawing the drive.
- If using AWS, it uses UserData to check for previous snapshots and loads the latest one instead of creating a new /opt mount (bounce-back servers! you lose up to 1 hour of data / gaps in graphs with this)
- Log Aggregation/Indexing/Querying for your entire Infrastructure
- Time Series data collection and graphing
- Event annotation for tracking operation events such as deploys/downtime along with graphs
- Alerting for Time Series Data
- Jenkins for reporting on timed/cron'd operational tasks, or for actual continuous integration/delivery
- If you are running redis 2.4.x, increase the ulimit or upgrade to 2.6.x; running out of file descriptors will cause 100% CPU and a non-responsive redis (reference)
- The Node.js statsD is the highest CPU user; consider running a C version
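The hourly snapshot cycle described above (freeze XFS, snapshot, thaw) can be sketched like this. This is a hedged sketch under stated assumptions: the function name, device IDs, and the use of the AWS CLI are illustrative, not the cookbook's exact implementation, though the `xfs_freeze -f`/`-u` flags are the standard freeze/unfreeze switches.

```ruby
# Build the command sequence for a consistent EBS snapshot of an XFS
# mount: quiesce writes, take the point-in-time snapshot, resume writes.
def snapshot_commands(mountpoint, volume_id)
  [
    "xfs_freeze -f #{mountpoint}",        # freeze the filesystem
    "aws ec2 create-snapshot --volume-id #{volume_id} " \
    "--description 'hourly /opt backup'", # point-in-time EBS snapshot
    "xfs_freeze -u #{mountpoint}"         # thaw the filesystem
  ]
end

# An hourly cron job would run these in order, making sure the thaw
# happens even if the snapshot call fails:
#
#   cmds = snapshot_commands('/opt', 'vol-12345678')
#   begin
#     system(cmds[0])
#     system(cmds[1])
#   ensure
#     system(cmds[2])
#   end
```

The `ensure` block matters: leaving an XFS filesystem frozen blocks all writes under /opt until it is thawed.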
- Fork the repository on Github
- Create a named feature branch (like add_component_x)
- Write your change
- Write tests for your change (if applicable)
- Run the tests, ensuring they all pass
- Submit a Pull Request using Github
Authors:
- corley@avast.com - anthroprose
- rdickeyvii@gmail.com - rdickey