CG for Day 2 (day) https://app.k8s.chaosgenius.io
Imagine you're moving into a house. If Day 1 operations are moving into the house (installation), Day 2 operations are the "housekeeping" stage of a software's life cycle: the care and feeding of the software, which keeps it stable and healthy in production.
- Bringing up production-grade infrastructure
- Easy path to Install/Upgrade ( https://app.k8s.chaosgenius.io )
- Autoscaling for anomaly/DD workers: more workers are added when anomaly/DD tasks are waiting in the queue (POC here: https://github.com/chaos-genius/celery-worker-autoscale )
- Autoscaling the k8s cluster by adding more EC2 instances when the current servers are fully occupied and cannot accommodate the workers required to pick up pending tasks
- Observability with Grafana: resource utilization and log aggregation for running pods using https://grafana.k8s.chaosgenius.io/
- Storing cloud provider account credentials in environment variables
- terraform apply
  # Infrastructure as code (IaC)
- helm upgrade --install my-chaosgenius chaosgenius --version=0.6.0
  # Installs version 0.6.0 of chaosgenius if it is not already installed; otherwise upgrades the existing installation (a.k.a. "release" in Helm terminology) named "my-chaosgenius".
- k8s cluster provisioning for non-AWS cloud vendors
- Exploring other trigger strategies for scaling workers; currently we scale when the number of tasks waiting in the queue stays above a certain number for more than a certain duration
- Cluster Autoscaler takes a minute or two to add nodes when the current worker nodes are occupied and DD/anomaly workers are pending scheduling; we can look into Karpenter (https://karpenter.sh/, https://towardsdev.com/karpenter-vs-cluster-autoscaler-dd877b91629b), which is fairly new and was released a few months back
- Setting up appropriate Pod resource limits for better utilisation
- Currently, if a persistent volume binds to additional nodes, the cluster does not scale back down to 0 (this should be a fairly simple fix)
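
The current worker-scaling trigger mentioned in the list (scale when the number of waiting tasks stays beyond a threshold for longer than some duration) could look roughly like the sketch below. This is only an illustration and not the code in the celery-worker-autoscale POC: the class name, the parameter names, and the example values are all hypothetical, and reading the queue length from the broker is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SustainedBacklogTrigger:
    """Fires once the queue has stayed above a threshold for a minimum time.

    All names and values here are hypothetical, for illustration only.
    """

    max_waiting_tasks: int        # backlog size that counts as "too many"
    min_duration_seconds: float   # how long the backlog must persist
    _exceeded_since: Optional[float] = field(default=None, init=False, repr=False)

    def should_scale_up(self, waiting_tasks: int, now: float) -> bool:
        if waiting_tasks <= self.max_waiting_tasks:
            # Backlog is within limits: reset the timer.
            self._exceeded_since = None
            return False
        if self._exceeded_since is None:
            # First observation above the threshold: start the timer.
            self._exceeded_since = now
        # Scale only if the backlog has lasted at least the minimum duration.
        return now - self._exceeded_since >= self.min_duration_seconds


# Example: samples are (timestamp in seconds, tasks waiting in queue).
trigger = SustainedBacklogTrigger(max_waiting_tasks=10, min_duration_seconds=60)
samples = [(0, 25), (30, 25), (60, 25), (90, 3)]
decisions = [trigger.should_scale_up(n, now=t) for t, n in samples]
print(decisions)  # [False, False, True, False]
```

Requiring the backlog to persist avoids adding workers for short spikes; the cost is that scale-up is delayed by at least the chosen duration, on top of any time the cluster needs to add nodes.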