KlavierCat / post-mortems

A collection of postmortems. Pull requests welcome :-)

Home Page:http://danluu.com/postmortem-lessons/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

A List of Post-mortems!



Table of Contents

Config Errors

Hardware/Power Failures

Conflicts

Time

Uncategorized

Other lists of postmortems

Analysis

Contributors

Config Errors

Cloudflare. A bad config (router rule) caused all of their edge routers to crash, taking down all of Cloudflare.

Etsy. Sending multicast traffic without properly configuring switches caused an Etsy global outage.

Facebook. A bad config took down both Facebook and Instagram.

Google Cloud. A bad config (autogenerated) removed all Google Compute Engine IP blocks from BGP announcements.

Google. A bad config (autogenerated) took down most Google services.

Google. A bad config caused a quota service to fail, which caused multiple services to fail (including gmail).

Google. / was checked into the URL blacklist, causing every URL to show a warning.

Microsoft. A bad config took down Azure storage.

Stack Overflow. A bad firewall config blocked stackexchange/stackoverflow.

Sentry. Wrong Amazon S3 settings on backups lead to data leak.

TravisCI. A configuration issue (incomplete password rotation) led to "leaking" VMs, leading to elevated build queue times.

TravisCI. A configuration issue (automated age-based Google Compute Engine VM image cleanup job) caused stable base VM images to be deleted.

Valve. Although there's no official postmortem, it looks like a bad BGP config severed Valve's connection to Level 3, Telia, and Abovenet/Zayo, which resulted in a global Steam outage.

Hardware/Power Failures

Amazon. An unknown event caused a transformer to fail. One of the PLCs that checks that generator power is in phase failed for an unknown reason, which prevented a set of backup generators from coming online. This affected EC2, EBS, and RDS in EU West.

Amazon. Bad weather caused power failures throughout AWS US East. A single backup generator failed to deliver stable power when power switched over to backup and the generator was loaded. This is despite having passed a load tests two months earlier, and passing weekly power-on tests.

Amazon. At 10:25pm PDT on June 4, loss of power at an AWS Sydney facility resulting from severe weather in that area lead to disruption to a significant number of instances in an Availability Zone. Due to the signature of the power loss, power isolation breakers did not engage, resulting in backup energy reserves draining into the degraded power grid.

ARPANET. A malfunctioning IMP (Interface Message Processor) corrupted routing data, software recomputed checksums propagating bad data with good checksums, incorrect sequence numbers caused buffers to fill, full buffers caused loss of keepalive packets and nodes took themselves off the network. From 1980.

FirstEnergy / General Electric. FirstEnergy had a local failure when some transmission lines hit untrimmed foliage. The normal process is to have an alarm go off, which causes human operators to re-distribute power. But the GE system that was monitoring this had a bug which prevented the alarm from getting triggered, which eventually caused a cascading failure that eventually affected 55 million people.

GitHub. On January 28th, 2016 GitHub experienced a disruption in the power at their primary datacenter.

Google. Successive lightning strikes on their European datacenter (europe-west1-b) caused loss of power to Google Compute Engine storage systems within that region. I/O errors were observed on a subset of Standard Persistent Disks (HDDs) and permanent data loss was observed on a small fraction of those.

Sun/Oracle. Sun famously didn't include ECC in a couple generations of server parts. This resulted in data corruption and crashing. Following Sun's typical MO, they made customers that reported a bug sign an NDA before explaining the issue.

Conflicts

CCP Games A typo and a name conflict caused the installer to sometimes delete the boot.ini file on installation of an expansion for EVE Online - with consequences.

GoCardless. All queries on a critical PostgreSQL table were blocked by the combination of an extremely fast database migration and a long-running read query, causing 15 seconds of downtime.

Knight Capital. A combination of conflicting deployed versions and re-using a previously used bit caused a $460M loss.

WebKit code repository. The WebKit repository, a Subversion repository configured to use deduplication, became unavailable after two files with the same SHA-1 hash were checked in as test data, with the intention of implementing a safety check for collisions. The two files had different md5 sums and so a checkout would fail a consistency check. For context, the first public SHA-1 hash collision had very recently been announced, with an example of two colliding files.

Time

Azure Certificates that were valid for one year were created. Instead of using an appropriate library, someone wrote code that computed one year to be the current date plus one year. On February 29th 2012, this resulted in the creation of certificates with an expiration date of February 29th 2013, which were rejected because of the invalid date. This caused an Azure global outage that lasted for most of a day.

Cloudflare Backwards time flow from tracking the 27th leap second on 2016-12-31T23:59:60Z caused the weighted round-robin selection of DNS resolvers (RRDNS) to panic and fail on some CNAME lookups. Go's time.Now() was incorrectly assumed to be monotonic; this injected negative values into calls to rand.Int63n(), which panics in that case.

Linux Leap second code was called from the timer interrupt handler, which held xtime_lock. That code did a printk to log the leap second. printk wakes up klogd, which can sometimes try to get the time, which waits on xtime_lock, causing a deadlock.

Linux When a leap second occured, CLOCK_REALTIME was rewound by one second. This was not done via a mechanism that would update hrtimer base.offset (clock_was_set). This meant that when a timer interrupt happened, TIMER_ABSTIME CLOCK_REALTIME timers got expired one second early, including timers set for less than one second. This caused applications that used sleep for less than one second in a loop to spinwait without sleeping, causing high load on many systems. This caused a large number of web services to go down in 2012.

Uncategorized

Allegro. The Allegro platform suffered a failure of a subsystem responsible for asynchronous distributed task processing. The problem affected many areas, e.g. features such as purchasing numerous offers via cart and bulk offer editing (including price list editing) did not work at all. Moreover, it partially failed to send daily newsletter with new offers. Also some parts of internal administration panel were affected.

Amazon. Human error. On February 28th 2017 9:37AM PST, the Amazon S3 team was debugging a minor issue. Despite using an established playbook, one of the commands intending to remove a small number of servers was issued with a typo, inadvertently causing a larger set of servers to be removed. These servers supported critical S3 systems. As a result, dependent systems required a full restart to correctly operate, and the system underwent widespread outages for US-EAST-1 (Northern Virginia) until final resolution at 1:54PM PST. Since Amazon's own services such as EC2 and EBS rely on S3 as well, it caused a vast cascading failure which affected hundreds of companies.

Amazon. Message corruption caused the distributed server state function to overwhelm resources on the S3 request processing fleet.

Amazon. Human error during a routine networking upgrade led to a resource crunch, exacerbated by software bugs, that ultimately resulted in an outage across all US East Availability Zones as well as a loss of 0.07% of volumes.

Amazon. Elastic Load Balancer ran into problems when "a maintenance process that was inadvertently run against the production ELB state data".

Amazon. A "network disruption" caused metadata services to experience load that caused resposne times to exceed timeout values, causing storage nodes to take themselves down. Nodes that took themselves down continued to retry, ensuring that load on metadata services couldn't decrease.

AppNexus. A double free revealed by a database update caused all "impression bus" servers to crash simultaneously. This wasn't caught in staging and made it into production because a time delay is required to trigger the bug, and the staging period didn't have a built-in delay.

AT&T. A bad line of C code introduced a race hazard which in due course collapsed the phone network. After a planned outage, the quickfire resumption messages triggered the race, causing more reboots which retriggered the problem. "The problem repeated iteratively throughout the 114 switches in the network, blocking over 50 million calls in the nine hours it took to stabilize the system." From 1990.

Bitly. Hosted source code repo contained credentials granting access to bitly backups, including hashed passwords.

BrowserStack. An old prototype machine with the Shellshock vulnerability still active had secret keys on it which ultimately led to a security breach of the Production system.

Buildkite. Database capacity downgrade in an attempt to minimise AWS spend resulted in lack of capacity to support Buildkite customers at peak, leading to cascading collapse of dependent servers.

CCP Games. A problematic logging channel caused cluster nodes dying off during the cluster start sequence after rolling out a new game patch.

CircleCI. A GitHub outage and recovery caused an unexpectedly large incoming load. For reasons that aren't specified, a large load causes CircleCI's queue system to slow down, in this case to handling one transaction per minute.

Dropbox. This postmortem is pretty thin and I'm not sure what happened. It sounds like, maybe, a scheduled OS upgrade somehow caused some machines to get wiped out, which took out some databases.

European Space Agency. An overflow occured when converting a 16-bit number to a 64-bit numer in the Ariane 5 intertial guidance system, causing the rocket to crash. The actual overflow occured in code that wasn't necessary for operation but was running anyway. According to one account, this caused a diagnostic error message to get printed out, and the diagnostic error message was somehow interpreted as actual valid data. According to another account, no trap handler was installed for the overflow.

Etsy. First, a deploy that was supposed to be a small bugfix deploy also caused live databases to get upgraded on running production machines. To make sure that this didn't cause any corruption, Etsy stopped serving traffic to run integrity checks. Second, an overflow in ids (signed 32-bit ints) caused some database operations to fail. Etsy didn't trust that this wouldn't result in data corruption and took down the site while the upgrade got pushed.

Foursquare. MongoDB fell over under load when it ran out of memory. The failure was catastrophic and not graceful due to a a query pattern that involved a read-load with low levels of locality (each user check-in caused a read of all check-ins for the user's history, and records were 300 bytes with no spatial locality, meaning that most of the data pulled in from each page was unecessary). A lack of monitoring on the MongoDB instances caused the high load to go undetected until the load became catastrophic, causing 17 hours of downtime spanning two incidents in two days.

Gitlab 2014. After the primary locked up and was restarted, it was brought back up with the wrong filesystem, causing a global outage. See also HN discussion.

Gitlab 2017. Influx of requests overloaded the database, caused replication to lag, tired admin deleted the wrong directory, six hours of data lost. See also earlier report and HN discussion.

Gliffy. While attempting to resolve an issue with a backup system the production database was accidentally deleted causing a system outage.

Google. Checking the vendor string instead of feature flags renders NaCl unusable on otherwise compatible non-mainstream hardware platforms.

Google. A mail system emailed people more than 20 times. This happened because mail was sent with a batch cron job that sent mail to everyone who was marked as waiting for mail. This was a non-atomic operation and the batch job didn't mark people as not waiting until all messages were sent.

GPS/GLONASS. A bad update that caused incorrect orbital mechanics calculations caused GPS satellites that use GLONASS to broadcast incorrect positions for 10 hours. The bug was noticed and rolled back almost immediately due to (?) this didn't fix the issue.

Healthcare.gov.

Heroku. Having a system that requires scheduled manual updates resulted in an error which caused US customers to be unable to scale, stop or restart dynos, or route HTTP traffic, and also prevented all customers from being able to deploy.

Intel. A scripting bug caused the generation of the divider logic in the Pentium to very occasionally produce incorrect results. The bug wasn't caught in testing because of an incorrect assumption in a proof of correctness.

Joyent. Operations on Manta were blocked because a lock couldn't be obtained on their PostgreSQL metadata servers. This was due to a combination of PostgreSQL's transaction wraparound maintence taking a lock on something, and a Joyent query that unecessarily tried to take a global lock.

Kickstarter. Primary DB became inconsistent with all replicas, which wasn't detected until a query failed. This was caused by a MySQL bug which sometimes caused order by to be ignored.

Kings College London. 3PAR suffered catastrophic outage which highlighted a failure in internal process.

Mailgun. Secondary MongoDB servers became overloaded and while troubleshooting accidentally pushed a change that sent all secondary traffic to the primary MongoDB server, overloading it as well and exacerbating the problem.

Medium. Polish users were unable to use their "Ś" key on Medium.

NASA. A design flaw in the Apollo 11 rendezvous radar produced excess CPU load, causing the spacecraft computer to restart during lunar landing.

NASA. Use of different units of measurement (metric vs. English) caused Mars Climate Orbiter to fail. This is basically the same issue Sweden ran into in 1628 with its ship.

NASA. NASA's Mars Pathfilder spacecraft experienced system resets a few days after landing on Mars (1997). Debugging features were remotely enabled until the cause was found: a priority inversion problem in the VxWorks operating system. The OS software was remotely patched (all the way to Mars) to fix the problem by adding priority inheritance to the task scheduler.

Netflix. An EBS outage in one availability zone was mitigated by migrating to other availability zones.

Platform.sh. Outage during a scheduled maintenance window because there were too much data for Zookeeper to boot.

Reddit experienced an outage for 1.5 hours, followed by another 1.5 hours of degraded performance on Thursday August 11 2016. This was due to an error during a migration of a critical backend system.

Salesforce Initial disruption due to power failure in one datacenter led to cascading failures with a database cluster and file discrepancies resulting in cross data center failover issues.

Sentry. Transaction ID Wraparound in Postgres caused Sentry to go down for most of a working day.

Spotify. Lack of exponential backoff in a microservice caused a cascading failure, leading to notable service degradation.

Stack Exchange. Enabling StackEgg for all users resulted in heavy load on load balancers and consequently, a DDoS.

Stack Exchange. Backtracking implementation in the underlying regex engine turned out to be very expensive for a particular post leading to health-check failures and eventual outage.

Stack Exchange. Porting old Careers 2.0 code to the new Developer Story caused a leak of users' information.

Stack Exchange. The primary SQL-Server triggered a bugcheck on the SQL Server process, causing the Stack Exchange sites to go into read only mode, and eventually a complete outage.

Strava. Hit the signed integer limit on a primary key, causing uploads to fail.

Stripe. Manual operations are regularly executed on production databases. A manual operation was done incorrectly (missing dependency), causing the Stripe API to go down for 90 minutes.

Sweden. Use of different rulers by builders caused the Vasa to be more heavily built on its port side and the ship's designer, not having built a ship with two gun decks before, overbuilt the upper decks, leading to a design that was top heavy. Twenty minutes into its maiden voyage in 1628, the ship heeled to port and sank.

Tarsnap. A batch job which scans for unused blocks in Amazon S3 and marks them to be freed encountered a condition where all retries for freeing certain blocks would fail. The batch job logs its actions to local disk and this log grew without bound. When the filesystem filled, this caused other filesystem writes to fail, and the Tarsnap service stopped. Manually removing the log file restored service.

Therac-25. The Therac-25 was a radiation therapy machine involved in at least six accidents between 1985 and 1987 in which patients were given massive overdoses of radiation. Because of concurrent programming errors, it sometimes gave its patients radiation doses that were thousands of times greater than normal, resulting in death or serious injury.

Valve. Steam's desktop client deleted all local files and directories. The thing I find most interesting about this is that, after this blew up on social media, there were widespread reports that this was reported to Valve months earlier. But Valve doesn't triage most bugs, resulting in an extremely long time-to-mitigate, despite having multiple bugreports on this issue.

Yeller. A network partition in a cluster caused some messages to get delayed, up to 6-7 hours. For reasons that aren't clear, a rolling restart of the cluster healed the partition. There's some suspicious that it was due to cached routes, but there wasn't enough logging information to tell for sure.

Unfortunately, most of the interesting post-mortems I know about are locked inside confidential pages at Google and Microsoft. Please add more links if you know of any interesting public post mortems! is a pretty good resource; other links to collections of post mortems are also appreciated.

Other lists of postmortems

Availability Digest website.

Google+ postmortems community.

John Daily's list of postmortems (in json).

Jeff Hammerbacher's list of postmortems.

NASA lessons learned database.

Tim Freeman's list of postmortems

Wikimedia's postmortems.

Autopsy.io's list of Startup failures.

Analysis

How Complex Systems Fail

John Allspaw on Resilience Engineering

Contributors

  • Aaron Wigley
  • Ahmet Alp Balkan
  • Allon Murienik
  • Amber Yust
  • Anthony Elizondo
  • Anuj Pahuja
  • Benjamin Gilbert
  • Brad Baris
  • Brendan McLoughlin
  • Brian Scanlan
  • Brock Boland
  • Chris Higgs
  • Connor Shea
  • Dan Luu
  • Dan Nguyen
  • David Pate
  • Dov Murik
  • Ed Spittles
  • Florent Genette
  • Franck Arnulfo
  • gnomon
  • Grey Baker
  • Jacob Kaplan-Moss
  • James Graham
  • Jason Dusek
  • John Daily
  • jomo
  • Julia Hansbrough
  • Julian Szulc
  • Justin Montgomery
  • KS Chan
  • Kevin Brown
  • Kunal Mehta
  • Luan Cestari
  • Mark Dennehy
  • Massimiliano Arione
  • Matt Day
  • Michael Robinson
  • Mike Doherty
  • Mohit Agarwal
  • Nat Welch
  • Nate Parsons
  • Nick Sweeting
  • Raul Ochoa
  • Ruairi Carroll
  • Samuel Hunter
  • Sean Escriva
  • Siddharth Kannan
  • Tim Freeman
  • Tom Crayford
  • Vaibhav Bhembre
  • Veit Heller
  • Vincent Ambo

About

A collection of postmortems. Pull requests welcome :-)

http://danluu.com/postmortem-lessons/