A little chaos engineering application for kubernetes resilience testing.
- Configured kubernetes cluster with an ingress controller deployed
- Ingress controller nodeports mapped to 443 and 80.
- Configured `~/.kube/config`
- Installed kubectl (`brew install kubectl` or https://kubernetes.io/docs/tasks/tools/install-kubectl/)
- Successful execution of `kubectl get nodes`
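As a quick sanity check, the commands below should succeed before you proceed. The `ingress-nginx` namespace is only an assumption here; adjust it to wherever your ingress controller is deployed.

```sh
# Confirm cluster access and the ingress controller's 80/443 node port mappings
kubectl get nodes
kubectl get svc -n ingress-nginx
```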
Modify `./config/discovery.yaml` to fit your needs. There are two major sections, `nodes` and `ingresses`.
- The `nodes` section allows you to specify whether you want to periodically drain nodes, how often, and which nodes. These settings live under `enabled`, `interval`, and the `fields` + `labels` selectors. `interval` can be specified as `10s` or `1h`. `enabled` is `true` or `false`. `labels` contains a list of filters based on labels; `fields` has a list of filters based on fields. Some examples can be found here: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ . It is a pretty powerful tool.
- The `ingresses` section allows you to specify the ingress discovery process. You can specify `fields` and `labels` selectors, plus `enabled` and `interval` settings like above, but there are three ingress-specific settings. `protocol` allows you to specify a default protocol for non-host-specific ingresses: either `http` or `https`. Those same ingresses also need a default `port` and a `defaultHost`. If an ingress route contains a host, we will use that instead. If an ingress has a reference in `tls` pointing to such a host, we will assume it is https on port 443; otherwise, http on port 80 (see the manifest sketch after this list).
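To make that inference rule concrete, here is a hypothetical ingress manifest; all names are invented for illustration. Because `shop.example.com` appears under `tls`, kube-entropy would treat it as https on port 443; without the `tls` entry it would fall back to http on port 80 (or your configured defaults).

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shop                       # hypothetical ingress
spec:
  tls:
    - hosts:
        - shop.example.com         # host referenced in tls => https on 443
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: shop
                port:
                  number: 8080
```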
Run the discovery by executing `./kube-entropy -mode discovery`. It will create a test plan file that captures a number of settings, including full ingress URIs, HTTP response codes, and key HTTP headers.
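The test plan format itself isn't documented here, but conceptually each entry records a baseline along these lines (an illustrative sketch, not the actual file layout):

```yaml
# Illustrative only -- the real test plan fields may differ
- uri: https://www.avsatum.com/
  httpCode: 200
  headers:
    Content-Type: text/html
    Server: nginx
```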
In this mode, applications are stressed based on the test plan while ingress states are continuously monitored. If the HTTP status changes, or the set of HTTP headers changes (excluding some basic ones, like `Content-Length` or `Set-Cookie`), this indicates an application error or a fallback to the default backend. Looking at the application logs allows you to determine the source of the instability. You can also enable external monitors. Run this function by executing `./kube-entropy -mode chaos`.
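For intuition, the header change detection could look roughly like the Go sketch below; the exact ignore list and comparison logic inside kube-entropy may differ.

```go
package main

import "fmt"

// Headers considered volatile and therefore ignored when diffing responses.
// Content-Length and Set-Cookie come from the description above; Date is an
// assumed addition, since it changes on every response.
var ignored = map[string]bool{
	"Content-Length": true,
	"Set-Cookie":     true,
	"Date":           true,
}

// headersChanged reports whether any non-ignored header was added, removed,
// or modified between the discovery-time baseline and the current response.
func headersChanged(baseline, current map[string]string) bool {
	for k, v := range baseline {
		if !ignored[k] && current[k] != v {
			return true
		}
	}
	for k := range current {
		if _, ok := baseline[k]; !ok && !ignored[k] {
			return true
		}
	}
	return false
}

func main() {
	base := map[string]string{"Server": "nginx", "Content-Length": "123"}
	cur := map[string]string{"Server": "openresty", "Content-Length": "456"}
	fmt.Println(headersChanged(base, cur)) // true: Server header changed
}
```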
An example `discovery.yaml`:

```yaml
---
nodes:
  enabled: true
  fields:
    - spec.unschedulable!=true
  labels:
  interval: 5m
ingresses:
  protocol: https
  port: 443
  defaultHost: www.avsatum.com
  selector:
    enabled: true
    interval: 2s
    fields:
      - metadata.namespace=default
      - metadata.namespace!=kube-system
      - metadata.namespace!=docker
    labels:
  successHttpCodes:
    - 2xx
    - 30x
    - 403
```
kube-entropy is designed to randomly trigger two separate stress events: pod restarts and node drains. Two types of monitoring are supported: service monitoring and ingress monitoring. Each monitoring type and stress action is independently controlled by its own label and field selectors and timing interval.
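For intuition, the two stress actions roughly correspond to these manual kubectl operations; kube-entropy presumably performs the equivalents through the Kubernetes API:

```sh
kubectl delete pod <pod-name> -n <namespace>    # pod restart
kubectl drain <node-name> --ignore-daemonsets   # node drain
kubectl uncordon <node-name>                    # make the node schedulable again
```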
Service monitoring is designed primarily to keep internal communications in check. If monitoring from within the cluster, service endpoints are invoked directly (only TCP checking is used). If monitoring from outside the cluster, node ports are checked against some `nodePortHost`, which is most likely a load balancer. The node port, as well as the service port information, is obtained from the service definitions. If you use a complex port mapping outside of kubernetes, try deploying kube-entropy into your cluster.
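The exact shape of the service monitoring configuration isn't shown above; based on the description it presumably looks something like this, where every field name except `nodePortHost` is a guess:

```yaml
# Hypothetical sketch of a service monitoring section
services:
  enabled: true
  interval: 10s
  nodePortHost: lb.example.com   # load balancer fronting the node ports
```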
Ingress monitoring is useful to determine if the application responds to ingress requests. As with all kubernetes ingresses, these are reverse proxy routes through the ingress controller (usually nginx), into service and pod IPs. When a pod gets deleted, its IP will be removed from the ingress controller configuration. If the ingress controller doesn't refresh its configuration, an ingress call can potentially be routed to a stale pod IP, which is what we're trying to avoid. Ingress monitoring is HTTP-based; a list of acceptable HTTP codes can be specified in the kube-entropy config file:
```yaml
successHttpCodes:
  - 2xx
  - 3xx
  - 401
```
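The `2xx`/`30x` notation suggests that `x` acts as a wildcard digit. Under that assumption (the semantics aren't formally specified above), a matcher could look like this:

```go
package main

import (
	"fmt"
	"strconv"
)

// matchCode reports whether an HTTP status code matches a pattern such as
// "2xx", "30x", or "401", treating 'x' as a wildcard digit.
func matchCode(pattern string, code int) bool {
	s := strconv.Itoa(code)
	if len(s) != len(pattern) {
		return false
	}
	for i := range pattern {
		if pattern[i] != 'x' && pattern[i] != s[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(matchCode("2xx", 204)) // true
	fmt.Println(matchCode("30x", 302)) // true
	fmt.Println(matchCode("401", 403)) // false
}
```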
Planned features:
- DNS disruption
- Network connectivity disruption
- Support for Istio/Knative