CockroachDB Runbook Template
Overview
This document is a template for creating a custom CockroachDB runbook, also known as a CockroachDB operations manual.
A runbook is a reference document that describes a CockroachDB deployment in a specific application environment, together with the related tasks, checklists, and operational procedures.
This template provides an overall structure and implementation outlines for common CockroachDB operating procedures. It is meant to expedite the creation of a custom runbook, an important deliverable of the overall IT system that helps ensure the required state of operational preparedness.
Customers who already have a CockroachDB runbook can use this template to check their existing manual for completeness.
In practice, CockroachDB operators strive to automate most checks and procedures. This template, however, focuses on documenting the detailed checklists and steps that make up individual operational procedures. Automating these procedures is outside the scope of this document.
Terms Used in this Document
CockroachDB Node is an instance of a cockroach server process. To underscore this point: a node is not a (virtual) server, an instance of an OS, or a container. Cockroach Labs strongly recommends running one CockroachDB node per OS instance or per container.
CockroachDB Cluster is a set of connected CockroachDB Nodes that form a single system that works together on all tasks.
Platform is a set of compatible hardware, whether physical, virtualized, or containerized, together with related structures, on which CockroachDB can run. Examples of platforms include bare metal x86_64, AWS EC2, Google Cloud Platform, Microsoft Azure, VMware vSphere, Docker, and Kubernetes.
Contents
- Service or System Overview
- Business Overview
- Technical Overview
- Hardware Platform
- Virtualization or Containerization
- Operating System
- Clock Management
- Network Design
- Data Volumes
- Planned Capacity
- Cluster Right-Sizing, Expansion Strategy
- Cluster Topology and Configuration
- Auto-Scaling
- Connection Management (Pooling, Balancing, Failover/Failback)
- Transactions: Implicit vs. Explicit
- Transaction Retries
- Upstream Dependent Systems
- Downstream Dependent Systems
- Ecosystem Tools
- Deployment and Configuration Management Tools
- Routine Maintenance Procedures
- The Most Common Problems experienced by CockroachDB users
- Monitoring and Alerting
- Diagnostic and Support
- Emergency Procedures / Operation Continuity
Useful Resources and Examples
- Monitoring alerts deployed in the Cockroach Cloud managed service (common, dedicated, host)
- Including the 6 alerts delivered to users of Cockroach Cloud Dedicated
- Available Monitoring Metrics