mrsimonemms / post-mortems

How to run effective incident post-morterms

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Post-mortems

Post-mortem documents to learn lessons from our incidents

What is a post-mortem?

Our software will break. The only way to write truly secure and reliable applications is to write no code.

We must learn from our mistakes so that these issues don't cascade.

Guiding principles

For more information on the subject, see the Google SRE book

In order to be effective, our post-mortems must be:

  • open
  • blameless
  • constructive

Open

The software we make for our customers is not built in isolation. No one person knows everything about everything we build, so we cannot expect to solve all problems by working in isolation.

Trust is fostered by honesty. This repository is open to all.

Blameless

An atmosphere of blame leads to a culture where problems are ignored.

If we feel like we're blamed for a problem, we get fearful for our jobs. Problems that are easily solvable become magnified because we will be blamed. Oftentimes, by apportioning blame we miss the root cause and the opportunity of preventing the problem from recurring.

In its simplest form, Person A may have done something wrong, but they were only able to make that mistake because of the systems in place that led to them being able to make a mistake. By punishing Person A, it doesn't solve the root cause and failure to solve the root cause means that the identity of Person A becomes pot luck.

If we have a bad pit stop, it's not because the mechanic has just underperformed, it's because his equipment is not up to the job or the training hasn't been good enough or our wheel nuts are not how they should be.

Toto Wolff, Team Principal, Mercedes F1

When looked at like this, we achieve nothing when we blame the person. We are still relying on luck to avoid a repetition of the problem.

Constructive

We commit that our post-mortems have actions and that we will complete them. Our actions will be assigned an owner who is responsible for ensuring that the action is completed with a set time.

Creating a new incident

To create a new incident, run make new-incident and follow the instructions.

Open in a container

Commit style

All commits must be done in the Conventional Commit format.

<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

About

How to run effective incident post-morterms

License:Apache License 2.0


Languages

Language:JavaScript 83.9%Language:Makefile 16.1%