There are 20 repositories under the fault-tolerance topic.
These are the best resources for System Design on the Internet
Dkron - Distributed, fault tolerant job scheduling system https://dkron.io
Fault-tolerant job scheduler for Mesos that handles dependencies and ISO 8601-based schedules
Highly-available Distributed Fault-tolerant Runtime
A list of papers about distributed consensus.
Service Discovery and Governance Platform for Microservice and Distributed Architecture
List of Elixir books
Asynchronous & Fault-tolerant PHP Framework for Distributed Applications.
Golem is an open source durable computing platform that makes it easy to build and deploy highly reliable distributed systems.
A library for replicating your Python class between multiple servers, based on the Raft protocol
An open source Valkey client library that supports Valkey and Redis open source 6.2, 7.0, and 7.2. Valkey GLIDE is designed for reliability, optimized performance, and high availability for Valkey and Redis OSS based applications. GLIDE is a multi-language client library, written in Rust with bindings for programming languages such as Java and Python.
Simmy is a chaos-engineering and fault-injection tool, integrating with the Polly resilience project for .NET
**No Longer Maintained** Official RAMCloud repo
Notes on Lindsey Kuper's lectures on Distributed Systems
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
Python Actor concurrency library
Must-read Papers for File System (FS)
Serverless chaos monkey for AWS (runs on AWS Lambda) ☁️ 💥
A daemon, running in background on a Linux router or firewall, monitoring the state of multiple internet uplinks/providers and changing the routing accordingly. LAN/DMZ internet traffic is load balanced between the uplinks.
Implementation of the Raft distributed consensus algorithm among TCP peers on .NET / .NETStandard / .NETCore / dotnet
ZIO-native utilities for making resilient distributed systems
Lightweight Java SDK used as Proxyless Service Governance