grepory / monitorama2016

Literature Review for Fault Detection in Distributed Systems

Monitoring is Dead: Long Live Monitoring

The video from Monitorama 2016 is live on Vimeo and clickable below. Slides are available on Speaker Deck.

Watch the talk!

Abstract

Monitoring systems have not changed significantly in 20 years and have fallen behind the way we build software. Our software now consists of large distributed systems made up of many non-uniform, interacting components, while the core functionality of monitoring systems has stagnated. Furthermore, the people responsible for monitoring and operating these systems often lack expert knowledge of the systems under observation. In this talk, we will explore how our current monitoring capabilities are failing us and discuss how we can build systems that are both reliable and observable while making our lives (or the lives of the people responsible for their operations in production) easier.

References

  1. Fischer, M., Lynch, N., and Paterson, M. Impossibility of Distributed Consensus with One Faulty Process. In Journal of the Association for Computing Machinery, Vol. 32, No. 2, April 1985, pp. 374-382.
  2. Lamport, L., Shostak, R., and Pease, M. The Byzantine Generals Problem. In ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, July 1982, pp. 382-401.
  3. Poledna, S., Burns, A., Wellings, A., and Barrett, P. Replica Determinism and Flexible Scheduling in Hard Real-Time Dependable Systems. In IEEE Transactions on Computers, Vol. 49, No. 2, February 2000, pp. 100-111.
  4. Videla, A. Failure Modes in Distributed Systems. Blog post, December 2013.

Further Reading

Tools
