Ansible Automation Platform monitor script, for AAP 1.x (Tower) or 2.x (Controller). It plugs into any monitoring system that can act on non-zero exit codes. If any issue is detected, the script exits with 1.
- Install the AAP CLI tool: https://docs.ansible.com/ansible-tower/latest/html/towercli/
- Create a token to authenticate with: https://docs.ansible.com/ansible-tower/latest/html/administration/oauth2_token_auth.html#application-token-functions
- Clone this repository:
git clone https://github.com/mglantz/aapmonitor
- Install the tool and put the config in place:
cp aapmonitor/aapmonitor.py /path/to/bin/
cp aapmonitor/aapmonitor.cfg /etc
- Edit the config file to set your token and the monitoring alarm limits.
- Run aapmonitor.py to check that it works, then start monitoring by triggering the script from your monitoring system.
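
For monitoring systems that need a small wrapper rather than calling the script directly, the exit-code contract can be sketched as below. This is a minimal sketch, not part of aapmonitor: the helper name `run_check` and the 60-second timeout are assumptions made for illustration, and the script path is the placeholder from the install step.

```python
import subprocess


def run_check(command, timeout=60):
    """Run a check command and return (healthy, output).

    Hypothetical helper: "healthy" simply means the command exited with 0,
    which matches the script's contract of exiting with 1 on any issue.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing script or a hung check is treated as unhealthy.
        return False, str(exc)
    output = (result.stdout + result.stderr).strip()
    return result.returncode == 0, output


# Example call, using the placeholder path from the install step:
# healthy, output = run_check(["/path/to/bin/aapmonitor.py"])
```

Treating a timeout or a missing script as unhealthy is a design choice here: a check that cannot run should raise an alarm rather than pass silently.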
The script monitors the following:
- Whether it is possible to connect to the cluster API and fetch metrics data using the AAP CLI tool. This gives a general indication of health.
- jobs_running: Alert if we run too many jobs
- jobs_pending: Alert if we have too many pending jobs, indicating that the cluster or a node is saturated capacity-wise.
- jobs_failed_limit: Alert if n new jobs have failed since the last check, indicating something is wrong with the jobs, the cluster, the network or access.
- forks_remaining: Alert if we have fewer than n forks of remaining capacity on any node, indicating capacity issues, likely memory or CPU related.
- subs_remaining: Alert if we run out of subscriptions. Even though Ansible Automation Platform does not do subscription management, it's an indication. Turn off by setting to 0.
- inventory_limit: Alert if we detect fewer than n inventories. Indicates potential configuration mistakes which would cause automation to not run.
- projects limit: Indicates potential configuration mistakes which would cause automation to not run.
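
The limits above can be read as simple comparisons between current metrics and configured thresholds. The following is a minimal sketch of that idea, not the actual logic in aapmonitor.py: the function names, the metric keys (`new_failed_jobs`, `forks_remaining_per_node`, `inventories`) and the exact comparison operators are assumptions. The projects check is left out because its setting is not fully described above.

```python
def evaluate(metrics, limits):
    """Return a list of alert messages; an empty list means healthy.

    Hypothetical helper: metric key names and operators are assumptions.
    """
    alerts = []
    if metrics["jobs_running"] > limits["jobs_running"]:
        alerts.append("too many running jobs")
    if metrics["jobs_pending"] > limits["jobs_pending"]:
        alerts.append("too many pending jobs")
    if metrics["new_failed_jobs"] >= limits["jobs_failed_limit"]:
        alerts.append("too many newly failed jobs")
    # Alert if any single node has fewer forks left than the limit.
    if any(f < limits["forks_remaining"]
           for f in metrics["forks_remaining_per_node"]):
        alerts.append("a node is low on forks")
    # A limit of 0 turns the subscription check off.
    if (limits["subs_remaining"] != 0
            and metrics["subs_remaining"] < limits["subs_remaining"]):
        alerts.append("running out of subscriptions")
    if metrics["inventories"] < limits["inventory_limit"]:
        alerts.append("fewer inventories than expected")
    return alerts


def exit_code(alerts):
    """Map alerts to the exit code a monitoring system would see."""
    return 1 if alerts else 0
```

Collecting every alert before deciding the exit code, instead of exiting on the first failure, lets a single run report all problems at once.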