akshayknarayan / kubeinvaders

Gamified Chaos Engineering Tool for Kubernetes

Table of Contents

  1. Description
  2. Installation
  3. Usage
  4. Architecture
  5. Persistence
  6. Known Problems & Troubleshooting
  7. Metrics
  8. Security
  9. Test Loading and Chaos Experiment Presets
  10. Community
  11. Community blogs and videos
  12. License

These are the slides from the Chaos Engineering talk I prepared for FOSDEM 2023. Unfortunately I could not be present at my talk :D but I would still like to share them with the community.

Watch Kubernetes logs through the web tail console

Define Chaos Experiments

Description

Through k-inv a.k.a. KubeInvaders you can stress a Kubernetes cluster in a fun way and check how resilient it is.

Installation

Try with Docker (for development purposes only)

docker run -p 8080:8080 \
--env K8S_TOKEN=<k8s_service_account_token>  \
--env ENDPOINT=localhost:8080 \
--env INSECURE_ENDPOINT=true \
--env KUBERNETES_SERVICE_HOST=<k8s_controlplane_host> \
--env KUBERNETES_SERVICE_PORT_HTTPS=<k8s_controlplane_port> \
--env NAMESPACE=<comma_separated_namespaces_to_stress> \
luckysideburn/kubeinvaders:develop

Install to Kubernetes with Helm (v3+)

Artifact HUB

helm repo add kubeinvaders https://lucky-sideburn.github.io/helm-charts/
helm repo update

kubectl create namespace kubeinvaders

helm install kubeinvaders --set-string config.target_namespace="namespace1\,namespace2" \
-n kubeinvaders kubeinvaders/kubeinvaders --set ingress.enabled=true --set ingress.hostName=kubeinvaders.io --set deployment.image.tag=v1.9.6

SCC for Openshift

oc adm policy add-scc-to-user anyuid -z kubeinvaders

Route for Openshift

I should add this to the helm chart...

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: kubeinvaders
  namespace: "kubeinvaders"
spec:
  host: "kubeinvaders.io"
  to:
    name: kubeinvaders
  tls:
    termination: edge

Usage

At the top you will find some metrics as described below:

Current Replicas State Delay is a metric that shows how long the cluster takes to return to the desired number of pod replicas.
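
For intuition: the delay is the time the gap between desired and ready replicas takes to close again after pods get killed. Below is a minimal sketch of that comparison (not part of KubeInvaders), assuming the official kubernetes Python client, a Deployment-based workload, and a placeholder namespace "namespace1":

# Illustrative only: compare desired vs. ready replicas per Deployment
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

for dep in apps.list_namespaced_deployment(namespace="namespace1").items:
    desired = dep.spec.replicas or 0
    ready = dep.status.ready_replicas or 0
    if ready < desired:
        print(f"{dep.metadata.name}: {ready}/{desired} replicas ready")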

Below is a control panel you can use to switch various features on and off.

YouTube HowTo

Video how-to for version v1.9

Start The Chaos Experiment

Press the button "Start" to start the automatic pilot (the button changes to "Stop" to disable this feature).

Enable Shuffle

Press the button "Enable Shuffle" to randomly switch the positions of pods or k8s nodes (button changes to "Disable Shuffle" to disable this feature).

Enable Auto Jump Between Namespace

Press the button "Auto NS Switch" to randomly switch between namespaces (button changes to "Disable Auto NS Switch" to disable this feature).

Show / Hide pods name

Press the button "Hide Pods Name" to hide the name of the pods under the aliens (button changes to "Show Pods Name" to disable this feature).

Information about current status and events

As shown below, on the game screen near the spaceship there are details about the current cluster, namespace, and some configuration settings.

Under the + and - buttons, a bar shows the most recent game events.

Do Kube-linter Lint

You can run kube-linter through KubeInvaders to scan resources for best practices and possible improvements.

Example from YouTube
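
If you want to try the same lint outside the UI, a rough sketch using the kube-linter CLI (the manifests/ path is a placeholder for your own manifest directory):

import subprocess

# Run kube-linter against a local directory of manifests and print the report
result = subprocess.run(["kube-linter", "lint", "manifests/"], capture_output=True, text=True)
print(result.stdout)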

Show Special Keys

Press 'h' or select 'Show Special Keys' from the menu.

Zoom In / Out

Press the + or - buttons to zoom the game screen in or out.

Chaos Containers for master and worker nodes

  • Select "Show Current Chaos Container for nodes" from the menu to see which container starts when you fire against a worker node (not an alien: aliens are pods).

  • Select "Set Custom Chaos Container for nodes" from the menu to use your preferred image or configuration against nodes.

Architecture

(architecture diagram)

Persistence

"Kinv" uses Redis for save and manage data. Redis is configured with "appendonly".

At moment the helm chart does not support PersistentVolumes but this task is in the to do list...
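
If you want to sanity-check the persistence mode, here is an illustrative snippet using redis-py, assuming you have port-forwarded the KubeInvaders Redis instance to localhost:6379 (the port-forward and the client call are not part of KubeInvaders itself):

import redis

# Illustrative: confirm appendonly persistence on the port-forwarded Redis
r = redis.Redis(host="localhost", port=6379)
print(r.config_get("appendonly"))  # expected: {'appendonly': 'yes'}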

Known problems and troubleshooting

  • It seems that KubeInvaders does not work with EKS because of problems with the ServiceAccount.
  • At the moment, installing KubeInvaders into a namespace that is not named "kubeinvaders" is not supported.
  • I have only tested KubeInvaders with a Kubernetes cluster installed through Kubespray.
  • If you don't see aliens, please follow these steps:
  1. Open a terminal and run "kubectl logs <pod_of_kubeinvader> -n kubeinvaders -f"
  2. From another terminal, run curl "https://<your_kubeinvaders_url>/kube/pods?action=list&namespace=namespace1" -k
  3. Open an issue with the logs attached

Hands-on Tutorial

To experience KubeInvaders in action, try it out in this free O'Reilly Katacoda scenario, KubeInvaders.

Prometheus Metrics

KubeInvaders exposes metrics for Prometheus through the standard /metrics endpoint.

This is an example of Prometheus configuration:

scrape_configs:
- job_name: kubeinvaders
  static_configs:
  - targets:
    - kubeinvaders.kubeinvaders.svc.cluster.local:8080

Example of metrics:

Metric                                                        Description
chaos_jobs_node_count{node="workernode01"}                    Total number of chaos jobs executed per node
chaos_node_jobs_total                                         Total number of chaos jobs executed against all worker nodes
deleted_pods_total                                            Total number of deleted pods
deleted_namespace_pods_count{namespace="myawesomenamespace"}  Total number of deleted pods per namespace
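
To eyeball these counters during an experiment, you can also scrape the endpoint directly; a small sketch, assuming the in-cluster service address used in the Prometheus example above (adjust the URL if you go through the ingress):

import requests

# Illustrative: fetch the KubeInvaders metrics and print the chaos counters
resp = requests.get("http://kubeinvaders.kubeinvaders.svc.cluster.local:8080/metrics")
for line in resp.text.splitlines():
    if line.startswith(("chaos_", "deleted_")):
        print(line)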

Download Grafana dashboard

Security

To restrict access to the KubeInvaders endpoint, add this annotation to the ingress:

nginx.ingress.kubernetes.io/whitelist-source-range: <your_ip>/32
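
As a quick sanity check (illustrative, assuming the kubeinvaders.io host from the Helm example above): a client outside the whitelisted range should receive HTTP 403 from ingress-nginx.

import requests

# Illustrative: from a non-whitelisted source IP this should print 403
resp = requests.get("https://kubeinvaders.io/", verify=False)
print(resp.status_code)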

Test Loading and Chaos Experiment Presets

Cassandra

from cassandra.cluster import Cluster
from random import randint
import time

def main():
    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()

    session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 1 }")
    session.execute("CREATE TABLE IF NOT EXISTS test.messages (id int PRIMARY KEY, message text)")

    # Let the driver bind parameters instead of interpolating them into the query
    for i in range(1000):
        session.execute("INSERT INTO test.messages (id, message) VALUES (%s, %s)", (i, str(randint(0, 1000))))
        time.sleep(0.001)

    cluster.shutdown()

if __name__ == "__main__":
    main()

Consul

import time
import consul

# Connect to the Consul cluster
client = consul.Consul()

# Continuously register and deregister a service
while True:
    # Register the service
    client.agent.service.register(
        "stress-test-service",
        port=8080,
        tags=["stress-test"],
        check=consul.Check.tcp("localhost", 8080, "10s")  # Check.tcp is a class-level helper, not an instance method
    )

    # Deregister the service
    client.agent.service.deregister("stress-test-service")

    time.sleep(1)

Elasticsearch

import time
from elasticsearch import Elasticsearch

# Connect to the Elasticsearch cluster
es = Elasticsearch(["localhost"])

# Continuously index and delete documents
while True:
    # Index a document
    es.index(index="test-index", doc_type="test-type", id=1, body={"test": "test"})

    # Delete the document
    es.delete(index="test-index", doc_type="test-type", id=1)

    time.sleep(1)

Etcd3

import time
import etcd3

# Connect to the etcd3 cluster
client = etcd3.client()

# Continuously set and delete keys
while True:
    # Set a key
    client.put("/stress-test-key", "stress test value")

    # Delete the key
    client.delete("/stress-test-key")

    time.sleep(1)

Gitlab

import gitlab
import time

gl = gitlab.Gitlab('https://gitlab.example.com', private_token='my_private_token')

def create_project(i):
    # GitLab requires unique project names/paths, so suffix with the counter
    project = gl.projects.create({'name': 'My Project %d' % i})
    print("Created project: ", project.name)

def main():
    for i in range(1000):
        create_project(i)
        time.sleep(0.001)

if __name__ == "__main__":
    main()

Http

import time
import requests

# Set up the URL to send requests to
url = 'http://localhost:8080/'

# Set up the number of requests to send
num_requests = 10000

# Set up the payload to send
payload = {'key': 'value'}

# Send the requests
start_time = time.time()
for i in range(num_requests):
    requests.post(url, json=payload)
end_time = time.time()

# Calculate the throughput
throughput = num_requests / (end_time - start_time)
print(f'Throughput: {throughput} requests/second')

Jira

import time
from jira import JIRA

# Connect to the Jira instance
jira = JIRA(
    server="https://jira.example.com",
    basic_auth=("user", "password")
)

# Continuously create and delete issues
while True:
    # Create an issue
    issue = jira.create_issue(
        project="PROJECT",
        summary="Stress test issue",
        description="This is a stress test issue.",
        issuetype={"name": "Bug"}
    )

    # Delete the issue (Issue objects expose a delete() method)
    issue.delete()

    time.sleep(1)

Kafka

import time
import random

from kafka import KafkaProducer

# Set up the Kafka producer
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

# Set up the topic to send messages to
topic = 'test'

# Set up the number of messages to send
num_messages = 10000

# Set up the payload to send
payload = b'a' * 1000000

# Send the messages
start_time = time.time()
for i in range(num_messages):
    producer.send(topic, payload)
end_time = time.time()

# Calculate the throughput
throughput = num_messages / (end_time - start_time)
print(f'Throughput: {throughput} messages/second')

# Flush and close the producer
producer.flush()
producer.close()

Kubernetes

import time
import kubernetes

# Load kubeconfig (use load_incluster_config() when running inside a pod)
kubernetes.config.load_kube_config()

# Create a Kubernetes client
client = kubernetes.client.CoreV1Api()

# Continuously create and delete pods
while True:
    # Create a pod
    pod = kubernetes.client.V1Pod(
        metadata=kubernetes.client.V1ObjectMeta(name="stress-test-pod"),
        spec=kubernetes.client.V1PodSpec(
            containers=[kubernetes.client.V1Container(
                name="stress-test-container",
                image="nginx:latest"
            )]
        )
    )
    client.create_namespaced_pod(namespace="default", body=pod)

    # Delete the pod
    client.delete_namespaced_pod(name="stress-test-pod", namespace="default")

    time.sleep(1)

Mongodb

import time
import random
from pymongo import MongoClient

# Set up the MongoDB client
client = MongoClient('mongodb://localhost:27017/')

# Set up the database and collection to use
db = client['test']
collection = db['test']

# Set up the number of documents to insert
num_documents = 10000

# Set up the payload to insert
payload = {'key': 'a' * 1000000}

# Insert the documents
start_time = time.time()
for i in range(num_documents):
    # Copy the payload: insert_one adds an _id to the dict it is given,
    # and re-inserting the same _id raises DuplicateKeyError
    collection.insert_one(dict(payload))
end_time = time.time()

# Calculate the throughput
throughput = num_documents / (end_time - start_time)
print(f'Throughput: {throughput} documents/second')

# Close the client
client.close()

Mysql

import time
import mysql.connector

# Connect to the MySQL database
cnx = mysql.connector.connect(
    host="localhost",
    user="root",
    password="password",
    database="test"
)
cursor = cnx.cursor()

# Continuously insert rows into the "test_table" table
while True:
    cursor.execute("INSERT INTO test_table (col1, col2) VALUES (%s, %s)", (1, 2))
    cnx.commit()
    time.sleep(1)

# Close the database connection
cnx.close()

Nomad

import time
import nomad

# Create a Nomad client
client = nomad.Nomad()

# Create a batch of jobs to submit to Nomad
jobs = [{
    "ID": "stress-test-job",  # Nomad requires a job ID in addition to Name
    "Name": "stress-test-job",
    "Type": "batch",
    "Datacenters": ["dc1"],
    "TaskGroups": [{
        "Name": "stress-test-task-group",
        "Tasks": [{
            "Name": "stress-test-task",
            "Driver": "raw_exec",
            "Config": {
                "command": "sleep 10"
            },
            "Resources": {
                "CPU": 500,
                "MemoryMB": 512
            }
        }]
    }]
}]

# Continuously submit the batch of jobs to Nomad
while True:
    for job in jobs:
        # python-nomad registers jobs by posting a {"Job": ...} payload
        client.jobs.register_job({"Job": job})
    time.sleep(1)

Postgresql

import time
import random
import psycopg2

# Set up the connection parameters
params = {
    'host': 'localhost',
    'port': '5432',
    'database': 'test',
    'user': 'postgres',
    'password': 'password'
}

# Connect to the database
conn = psycopg2.connect(**params)

# Set up the cursor
cur = conn.cursor()

# Set up the table and payload to insert
table_name = 'test'
payload = 'a' * 1000000

# Set up the number of rows to insert
num_rows = 10000

# Insert the rows
start_time = time.time()
for i in range(num_rows):
    cur.execute(f"INSERT INTO {table_name} (col) VALUES (%s)", (payload,))
conn.commit()
end_time = time.time()

# Calculate the throughput
throughput = num_rows / (end_time - start_time)
print(f'Throughput: {throughput} rows/second')

# Close the cursor and connection
cur.close()
conn.close()

Prometheus

import time
import random
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Set up the metrics registry
registry = CollectorRegistry()

# Set up the metric to push
gauge = Gauge('test_gauge', 'A test gauge', registry=registry)

# Set up the push gateway URL
push_gateway = 'http://localhost:9091'

# Set up the number of pushes to send
num_pushes = 10000

# Set up the metric value to push
value = random.random()

# Push the metric
start_time = time.time()
for i in range(num_pushes):
    gauge.set(value)
    push_to_gateway(push_gateway, job='test_job', registry=registry)
end_time = time.time()

# Calculate the throughput
throughput = num_pushes / (end_time - start_time)
print(f'Throughput: {throughput} pushes/second')

Rabbit

import pika
import time

def send_message(channel, message):
    channel.basic_publish(exchange='', routing_key='test_queue', body=message)
    print("Sent message: ", message)

def main():
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='test_queue')

    for i in range(1000):
        send_message(channel, str(i))
        time.sleep(0.001)

    connection.close()

if __name__ == "__main__":
    main()

Ssh

import paramiko

# Define the servers to connect to
servers = ['server1', 'server2', 'server3']

# Load the private key used for authentication (connect needs a private key, not a public one)
key = paramiko.RSAKey.from_private_key_file('/path/to/your-private-key')

for server in servers:
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(hostname=server, username='your-username', pkey=key)
    stdin, stdout, stderr = ssh.exec_command('your-command')
    print(stdout.read())
    ssh.close()

Vault

import time
import hvac

# Connect to the Vault instance
client = hvac.Client()
client.auth.approle.login(role_id="approle-id", secret_id="secret-id")

# Continuously read and write secrets
while True:
    # Write a secret
    client.write("secret/stress-test", value="secret value")

    # Read the secret
    client.read("secret/stress-test")

    time.sleep(1)

Community

Please reach out with news, bugs, feature requests, and other issues.

Community blogs and videos

License

KubeInvaders is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.
