terraform + ansible to setup a prometheus cluster on an Ubuntu pair in AWS
This project aims to showcase a simple setup - Prometheus running on two Ubuntu servers in AWS, setup in the Hierarchical federation model. Hierarchical federation allows for one Prometheus server to aggregate data from one or more Prometheus servers which may be acting under a tree model. This enables a Primary Prometheus server to collect aggregate data from other servers in the network and showcase the health of the network as a whole.
As part of the automation effort, we are using terraform to create (and destroy) the Ubuntu servers in AWS. We are then using ansible to install and setup Prometheus (and ancillarily nginx) on the servers and to configure the Prometheus setup to monitor itself (and in the case of the Primary server, to monitor the secondary as well).
To use this repo, please follow these instructions -
- Create a key pair in your AWS
us-east-2
region and name itprometheus_key_pair
. You can find the relevant page here. AWS will download the.pem
file locally. - Create an access key under your security credentials and save the key ID and secret for later. The page to do this is here.
- Ensure
terraform
,ansible
, andawscli v2
are installed on your system. My setup is on my Mac, so I was able to usehomebrew
to install these packages - Install the
prometheus
role usingansible galaxy
with the command -ansible-galaxy collection install prometheus.prometheus
- Export
AWS_ACCESS_KEY_ID
andAWS_SECRET_ACCESS_KEY
to your terminal or save them to terraform.tfvars and import to the main.tf file - Run
terraform init
and thenterraform apply
(orterraform apply -auto-approve
to skip manual agreement prompts) - Once the terraform setup is complete, copy the IPs of the primary and secondary servers from the output of the terraform plan
- Check that the servers are up and running using
ssh -i prometheus_key_pair.pem ubuntu@<server_ip>
on both servers - Replace
<primary_ip>
and<secondary_ip>
in theinventory.ini
and theprometheus_config.yaml
files. Ensure the path of theprometheus_key_pair.pem
file is correct. - Now, you can run ansible, with the command
ansible-playbook -i inventory.ini prometheus.yaml
- Once the ansible setup is complete, you can now visit the Prometheus servers on port
9090
of the public IPs of the servers to view the data being collected. The graphs on the Primary server will include data from the Secondary server as well. My favorite to look at isprocess_cpu_seconds_total
, which shows the CPU usage of the Prometheus servers over time. - Once your work is complete, don't forget to wipe the AWS resources using
terraform destroy -auto-approve
to ensure you're not footing an unnecessary bill 🙂
The relevant files in this repo are as follows -
- README.md - this file, it contains information about the overall goals of the project, as well as instructions on how to execute it locally
- main.tf - this is the file used by terraform to create the resources in AWS. As mentioned before, credentials and the setup of the awscli are not part of this file and should be setup manually beforehand.
- prometheus.yaml - this is the file used by ansible to create the necessary resources inside the Ubuntu servers once they're up and running. It installs Prometheus and nginx, setups up monitoring, and reloads Prometheus to ensure the configuration is accepted.
- prometheus_config.yaml - this and
prometheus_config_secondary.yaml
are files used by ansible to configure the data scrape jobs performed by Prometheus in the Primary and Secondary servers. - inventory.ini - this file is where the crendentials and IPs of the servers is stored for ansible to use to configure the servers. It must be manually populated after terraform outputs the relevant information upon successful completion.
There are a variety of improvements we can do to this setup. They largely fall into two buckets, somewhat corresponding to what we do in ansible and in terraform
- Make the setup more secure
- We should not be exposing the default SSH port to the open internet
- We should potentially use a jumphost (preferable a spot instance jump host) to connect to the other servers in the network
- Consider whether this entire VPC needs any direct internet connectivity. Can it connect via a VPN connection to make it more secure?
- Consider connecting Prometheus not over the public IP but over the Private IP of the servers.
- Can Prometheus itself be secured via passwords/API access keys?
- Remove all hardcoding - the region, the AMI ID, ports for Prometheus, SSH, and nginx, CIDR block for the VPC private IP range should all be configurable and user supplied, with reasonable defaults assumed. Instance type and Spot options should also be programmable.
- Add checks in terraform to wait till the servers are actually provisioned and up and running. Right now, we have to manually check that the servers are working before running ansible, or ansible fails to perform the tasks.
- A CI/CD pipeline to automate the terraform and ansible execution
- To achieve the CI pipeline, we also need to be able to pass public and private IP data (and any other relevant information) from terraform to Ansible. Consider writing a simple script or to use terraform file writing capabilities to write to the inventory.ini and the prometheus_config.yaml files
- Also, two separate files for the Prometheus configuration should not be needed. Consider using Jinja to templatize the setup. There is a benefit to using two separate files though - it's amply clear that the Primary server needs to aggregate data from all secondary servers. So a simple non-template config for the secondary servers (which allows them to only monitor themselves) makes sense.
- Automate AWS resource destruction at the end of the test cycle (if there is testing involved)