MIT 6.824 Distributed Systems

This repository contains all course materials for MIT 6.824: Distributed Systems (2020), including

  • lecture notes
  • papers for each lecture
  • lab solutions written in Golang

Summary

Concepts and general techniques of distributed systems introduced in this course:

  1. MapReduce: simplified data processing on large clusters
  • structure: single master + multiple workers
  • job: each worker processes one task at a time
    • map task: map(k1, v1) -> list(k2, v2)
    • reduce task: reduce(k2, list(v2)) -> list(v2)
  • fault tolerance: the master reschedules failed tasks on other workers
  • example
    • word count (see the Go sketch below)
    • distributed grep
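A minimal word-count Map/Reduce pair illustrating the map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(v2) signatures above. The KeyValue type and the in-process grouping loop are assumptions for illustration, not the lab framework's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is an assumed intermediate pair type: map emits list(k2, v2).
type KeyValue struct {
	Key   string
	Value string
}

// Map: (filename, contents) -> list of (word, "1") pairs.
func Map(filename string, contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool { return !unicode.IsLetter(r) })
	kvs := make([]KeyValue, 0, len(words))
	for _, w := range words {
		kvs = append(kvs, KeyValue{Key: w, Value: "1"})
	}
	return kvs
}

// Reduce: (word, list of "1") -> number of occurrences.
func Reduce(key string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	kvs := Map("doc.txt", "the quick brown fox jumps over the lazy dog the end")
	// Group values by key (what the MapReduce framework does between the phases).
	groups := map[string][]string{}
	for _, kv := range kvs {
		groups[kv.Key] = append(groups[kv.Key], kv.Value)
	}
	fmt.Println("the ->", Reduce("the", groups["the"])) // the -> 3
}
```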
  2. Go, RPC, threads (a minimal net/rpc sketch follows)
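A small example of Go's standard net/rpc package, with the server accepting connections on a separate goroutine; the Arith service and its Args/Reply types are made up for illustration.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
)

// Args and Reply are hypothetical message types for a toy RPC service.
type Args struct{ A, B int }
type Reply struct{ Sum int }

type Arith struct{}

// Add follows net/rpc's required method shape: (args, *reply) error.
func (a *Arith) Add(args Args, reply *Reply) error {
	reply.Sum = args.A + args.B
	return nil
}

func main() {
	// Server: register the service and serve connections in a goroutine.
	srv := rpc.NewServer()
	if err := srv.Register(new(Arith)); err != nil {
		log.Fatal(err)
	}
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go srv.Accept(ln)

	// Client: dial the server and make a synchronous call.
	client, err := rpc.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	var reply Reply
	if err := client.Call("Arith.Add", Args{A: 2, B: 3}, &reply); err != nil {
		log.Fatal(err)
	}
	fmt.Println("2 + 3 =", reply.Sum)
}
```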

  3. GFS: the Google File System

  • structure: single master + chunk servers
    • each file is split into 64MB chunks and spread over chunk servers for fault-tolerance and parallel performance
  • read steps: the client computes a chunk index from the byte offset (see the sketch below), asks the master for the chunk handle and chunk server locations, then reads from a chunk server
  • record append steps: the client pushes data to all replicas, then asks the primary, which picks the offset, applies the append, and tells the secondaries to apply it at the same offset
  • weak consistency: records are not guaranteed to appear in the same order on all replicas, and data from a failed append may still be visible
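A minimal sketch of the client-side arithmetic behind a GFS read, assuming only the fixed 64 MB chunk size mentioned above; the function name is hypothetical.

```go
package main

import "fmt"

const chunkSize = 64 << 20 // fixed 64 MB chunk size, as in GFS

// chunkLocation translates a file byte offset into the (chunk index, offset
// within chunk) pair a client would use when asking the master which chunk
// servers hold the data.
func chunkLocation(fileOffset int64) (chunkIndex int64, chunkOffset int64) {
	return fileOffset / chunkSize, fileOffset % chunkSize
}

func main() {
	idx, off := chunkLocation(200 << 20) // byte 200 MB into the file
	fmt.Printf("chunk index %d, offset %d bytes\n", idx, off) // chunk index 3, offset 8388608 bytes
}
```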
  4. VM-FT: primary/backup replication for fault-tolerant VMware virtual machines
  • failure kind: fail-stop
  • two main replication approaches:
    • state transfer: simpler
    • replicated state machine: uses less network bandwidth
  • log entry: instruction #, type, data
    • deterministic: val = x + y
    • non-deterministic: timer interrupts, network packet arrivals, etc. may cause divergence
  • output rule: before the primary sends output, it must wait for the backup to acknowledge all previous log entries (sketched below)
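A toy model of the log entry fields listed above and of the output rule; the Primary struct and its sequence-number bookkeeping are assumptions for illustration.

```go
package main

import "fmt"

// LogEntry mirrors the fields listed above: instruction number, type, data.
type LogEntry struct {
	InstrNum int64  // instruction # at which the event occurred
	Type     string // e.g. "timer interrupt", "network packet arrival"
	Data     []byte
}

// Primary is a toy model of the primary VM's replication state.
type Primary struct {
	lastSent  int64 // highest log entry sequence sent to the backup
	lastAcked int64 // highest log entry sequence acknowledged by the backup
}

// canRelease applies the output rule: output produced after log entry seq may
// leave the primary only once the backup has acknowledged everything up to seq.
func (p *Primary) canRelease(outputAfterSeq int64) bool {
	return p.lastAcked >= outputAfterSeq
}

func main() {
	p := &Primary{lastSent: 10, lastAcked: 8}
	fmt.Println(p.canRelease(8))  // true: backup already has entries 1..8
	fmt.Println(p.canRelease(10)) // false: must wait for acks 9 and 10
}
```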
  5. Go concurrency control (a combined example follows this item)
  • channel
  • mutex
  • time
  • condition variable
  • Go memory model
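A small runnable example combining three of the tools above: a mutex protecting a shared counter, a condition variable for waiting on a predicate, and a channel signaling completion.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	done := make(chan struct{})
	count := 0

	// Workers: each increments the shared counter under the mutex,
	// then wakes anyone waiting on the condition variable.
	for i := 0; i < 5; i++ {
		go func() {
			mu.Lock()
			count++
			cond.Broadcast()
			mu.Unlock()
		}()
	}

	// Waiter: sleeps on the condition variable until all workers are done,
	// then reports completion over a channel.
	go func() {
		mu.Lock()
		for count < 5 {
			cond.Wait() // releases mu while asleep, reacquires before returning
		}
		mu.Unlock()
		done <- struct{}{}
	}()

	<-done // channel receive blocks until the waiter sends
	fmt.Println("count =", count)
}
```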
  6. Raft: an understandable distributed consensus algorithm (leader-election sketch below)
  • split brain problem
  • idea:
    • leader election
    • log replication
    • log compaction (snapshot)
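A sketch of the leader-election rules: randomized election timeouts and the one-vote-per-term logic of RequestVote. It omits the log-freshness check and all RPC/locking machinery, so it is illustrative rather than a lab solution.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout is randomized so that split votes are unlikely to repeat:
// followers time out and start elections at different moments.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// RequestVoteArgs carries the candidate's term (log-freshness fields omitted).
type RequestVoteArgs struct {
	Term        int
	CandidateID int
}

type Raft struct {
	currentTerm int
	votedFor    int // -1 means "no vote granted this term"
}

// handleRequestVote adopts higher terms and grants at most one vote per term,
// the core of leader election.
func (rf *Raft) handleRequestVote(args RequestVoteArgs) (term int, granted bool) {
	if args.Term < rf.currentTerm {
		return rf.currentTerm, false // stale candidate
	}
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term // step down into the newer term
		rf.votedFor = -1
	}
	if rf.votedFor == -1 || rf.votedFor == args.CandidateID {
		rf.votedFor = args.CandidateID
		return rf.currentTerm, true
	}
	return rf.currentTerm, false
}

func main() {
	rf := &Raft{votedFor: -1}
	_, ok := rf.handleRequestVote(RequestVoteArgs{Term: 1, CandidateID: 2})
	fmt.Println("vote granted:", ok, "timeout:", electionTimeout())
	_, ok = rf.handleRequestVote(RequestVoteArgs{Term: 1, CandidateID: 3})
	fmt.Println("second candidate, same term:", ok) // false: one vote per term
}
```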
  7. ZooKeeper: wait-free coordination for internet-scale systems
  • idea: a Raft-like replicated system that also allows reads from any local replica, which may return stale data
  • linearizable writes: clients' writes are serialized by the leader
  • FIFO client order: each client specifies an order for its operations (reads and writes)
  • general-purpose coordination service: a file-system-like tree of znodes and a set of API
    • distributed locks without the herd effect (sketch below)
    • configuration master
    • test-and-set for distributed systems
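A sketch of the herd-free lock recipe: each client creates an ephemeral sequential znode and, if its znode is not the lowest, watches only its immediate predecessor, so a release wakes exactly one waiter. The zkClient interface here is hypothetical and only shows the shape of the calls, not the API of any particular Go ZooKeeper library.

```go
package zklock

import "sort"

// zkClient is a hypothetical, minimal ZooKeeper client interface for this sketch.
type zkClient interface {
	CreateEphemeralSequential(pathPrefix string) (string, error) // returns e.g. "/lock/lock-0000000007"
	Children(dir string) ([]string, error)                       // returns names relative to dir
	WaitForDelete(path string) error                             // blocks until the znode is deleted
}

// acquireLock returns once this client's znode has the lowest sequence number.
func acquireLock(zk zkClient, dir string) (myNode string, err error) {
	myNode, err = zk.CreateEphemeralSequential(dir + "/lock-")
	if err != nil {
		return "", err
	}
	for {
		children, err := zk.Children(dir)
		if err != nil {
			return "", err
		}
		sort.Strings(children) // sequence numbers are zero-padded, so strings sort correctly
		if dir+"/"+children[0] == myNode {
			return myNode, nil // lowest sequence number: lock acquired
		}
		// Find the znode just before ours and wait for it alone (no thundering herd).
		prev := children[0]
		for _, c := range children {
			if dir+"/"+c == myNode {
				break
			}
			prev = c
		}
		if err := zk.WaitForDelete(dir + "/" + prev); err != nil {
			return "", err
		}
	}
}
```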
  8. CRAQ: chain replication with apportioned queries
  • Chain Replication (CR): writes go to the head, reads to the tail; simpler and easier to load-balance than Raft
  • apportioned queries: read from any intermediate node (read-path sketch below)
    • each replica stores a list of versions per object: one clean plus possibly several newer dirty ones
    • write: create a new dirty version as the write passes down the chain; mark it clean as the ACK passes back up
    • read: reply immediately if the latest local version is clean; if dirty, ask the tail for the latest committed version number
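A single-node sketch of the apportioned read path described above; the Replica and Tail types are assumptions, and the chain and networking are omitted.

```go
package craq

// Version is one stored version of an object at a replica; Clean means the
// write has been acknowledged back down the chain.
type Version struct {
	Num   uint64
	Value []byte
	Clean bool
}

// Tail is a hypothetical stub for asking the tail which version number is
// the latest committed one for a key.
type Tail interface {
	LatestCommitted(key string) uint64
}

// Replica holds, per object, a list of versions: at most one clean plus any
// number of newer dirty ones.
type Replica struct {
	versions map[string][]Version
	tail     Tail
}

// Read implements the apportioned-query rule: if the newest local version is
// clean, answer immediately; if it is dirty, ask the tail for the latest
// committed version number and return that version's value.
func (r *Replica) Read(key string) []byte {
	vs := r.versions[key]
	if len(vs) == 0 {
		return nil
	}
	newest := vs[len(vs)-1]
	if newest.Clean {
		return newest.Value
	}
	committed := r.tail.LatestCommitted(key)
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].Num == committed {
			return vs[i].Value
		}
	}
	return nil
}
```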
  9. Aurora: Amazon's cloud database

  10. Frangipani: a scalable distributed file system

  • Petal: a fault-tolerant virtual disk storage service
  • performance: caching with cache coherence protocol
    • write-back scheme
    • lock server (LS): request / grant / revoke / release messages
  • crash recovery:
    • write-ahead log in Petal that other workstations (WS) can read to recover
    • lock leases: the LS only takes back a lock and recovers its log after the lease expires (sketch below)
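A tiny sketch of the lease rule, using a hypothetical Lease type: recovery of a crashed workstation's locks and write-ahead log may begin only after its lease has expired.

```go
package frangipani

import "time"

// Lease records when a workstation's lock lease expires.
type Lease struct {
	Owner   string
	Expires time.Time
}

// canRecover applies the lease rule: the lock server may reclaim a crashed
// workstation's locks (and replay its write-ahead log from Petal) only after
// the lease has expired, so a slow-but-alive workstation cannot still be
// writing under a lock the server has already given away.
func canRecover(l Lease, now time.Time) bool {
	return now.After(l.Expires)
}
```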
  11. Distributed Transactions: concurrency control + atomic commit
  • ACID: correct behavior of a transaction: Atomic, Consistent, Isolated (serializable), Durable
  • Pessimistic Concurrency Control: acquire locks before access
  • Two-phase Locking (2PL): RW locks
  • Two-Phase Commit (2PC): a transaction coordinator drives PREPARE, then COMMIT/ABORT (coordinator sketch below)
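A minimal coordinator sketch of 2PC, assuming participants are local objects rather than RPC endpoints; the Participant interface and yesVoter type are made up for illustration.

```go
package main

import "fmt"

// Participant is a hypothetical interface for one transaction participant
// (in practice these calls would be RPCs).
type Participant interface {
	Prepare(txID int) bool // vote yes (can commit) or no
	Commit(txID int)
	Abort(txID int)
}

// twoPhaseCommit: phase 1 collects PREPARE votes; phase 2 sends COMMIT only
// if every participant voted yes, otherwise ABORT. Once a participant votes
// yes it must remain able to commit even after a crash, so real participants
// persist their vote before replying.
func twoPhaseCommit(txID int, parts []Participant) bool {
	for _, p := range parts {
		if !p.Prepare(txID) {
			for _, q := range parts {
				q.Abort(txID)
			}
			return false
		}
	}
	for _, p := range parts {
		p.Commit(txID)
	}
	return true
}

// yesVoter is a toy participant that always votes yes.
type yesVoter struct{ name string }

func (y yesVoter) Prepare(txID int) bool { return true }
func (y yesVoter) Commit(txID int)       { fmt.Println(y.name, "COMMIT", txID) }
func (y yesVoter) Abort(txID int)        { fmt.Println(y.name, "ABORT", txID) }

func main() {
	ok := twoPhaseCommit(7, []Participant{yesVoter{"A"}, yesVoter{"B"}})
	fmt.Println("committed:", ok)
}
```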
  12. Spanner: Google's globally replicated database
  • External Consistency: strong consistency, usually linearizability
  • transaction:
    • RW transactions: two-phase commit with Paxos-replicated participants
    • RO transactions: snapshot isolation, i.e. assign every transaction a timestamp, and each replica stores multiple timestamped versions of each record; no locking or 2PC, so much faster
  • clock synchronization
    • Google's time reference system
    • TrueTime: returns an interval [earliest, latest]
    • two rules: start rule and commit wait (commit-wait sketch below)
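A sketch of the commit-wait rule, assuming a made-up trueTimeNow() with a fixed ±4 ms uncertainty; real TrueTime derives its bound from GPS and atomic clocks.

```go
package main

import (
	"fmt"
	"time"
)

// Interval is TrueTime's answer to "what time is it": the true time is
// guaranteed to lie within [Earliest, Latest].
type Interval struct {
	Earliest, Latest time.Time
}

// trueTimeNow stands in for TrueTime; the ±4ms uncertainty is an assumption
// for illustration only.
func trueTimeNow() Interval {
	now := time.Now()
	return Interval{Earliest: now.Add(-4 * time.Millisecond), Latest: now.Add(4 * time.Millisecond)}
}

// commitWait implements the commit-wait rule: after choosing the commit
// timestamp ts (the start rule picks ts >= TT.now().Latest), the coordinator
// delays until TT.now().Earliest > ts, so ts has certainly passed before the
// transaction's effects become visible anywhere.
func commitWait(ts time.Time) {
	for !trueTimeNow().Earliest.After(ts) {
		time.Sleep(time.Millisecond)
	}
}

func main() {
	ts := trueTimeNow().Latest // start rule: timestamp no earlier than TT.now().Latest
	commitWait(ts)
	fmt.Println("safe to expose transaction at timestamp", ts)
}
```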
  13. FaRM: a high-performance distributed storage system built on RDMA and OCC
  • Fast RPC:
    • kernel bypass: the application interacts with the NIC directly, e.g. via the DPDK toolkit
    • RDMA (Remote Direct Memory Access) NIC: handles all reads/writes for applications and has its own reliable transfer protocol
  • Optimistic Concurrency Control (OCC): read, buffered write, validation, commit/abort
    • execution phase: read - buffered write
    • commit phase: (write) lock, (read) validation, commit backups, commit primaries (single-machine sketch below)
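A single-machine sketch of OCC's lock/validate/commit steps using per-object version numbers; real FaRM performs these steps with one-sided RDMA against primaries and backups, which is omitted here, and the Object/Tx types are assumptions.

```go
package occ

import "sync"

// Object is a toy in-memory record whose header holds a version number and a
// lock bit, standing in for a FaRM object at a primary.
type Object struct {
	mu      sync.Mutex
	locked  bool
	version uint64
	value   []byte
}

// Tx buffers writes during the execution phase and remembers the version of
// every object it read, so the commit phase can validate.
type Tx struct {
	reads  map[*Object]uint64 // object -> version observed when read
	writes map[*Object][]byte // buffered writes, invisible until commit
}

// Commit sketches the commit phase: LOCK the write set, VALIDATE that read
// versions are unchanged, then install writes and bump versions.
func (tx *Tx) Commit() bool {
	var locked []*Object
	abort := func() bool {
		for _, obj := range locked {
			obj.mu.Lock()
			obj.locked = false
			obj.mu.Unlock()
		}
		return false
	}
	// LOCK: acquire every object in the write set; abort if already locked.
	for obj := range tx.writes {
		obj.mu.Lock()
		if obj.locked {
			obj.mu.Unlock()
			return abort()
		}
		obj.locked = true
		obj.mu.Unlock()
		locked = append(locked, obj)
	}
	// VALIDATE: every object read must still have the version we observed,
	// and must not be locked by some other transaction.
	for obj, ver := range tx.reads {
		_, inWriteSet := tx.writes[obj]
		obj.mu.Lock()
		ok := obj.version == ver && (!obj.locked || inWriteSet)
		obj.mu.Unlock()
		if !ok {
			return abort()
		}
	}
	// COMMIT: install the buffered writes, bump versions, release locks.
	for obj, val := range tx.writes {
		obj.mu.Lock()
		obj.value = val
		obj.version++
		obj.locked = false
		obj.mu.Unlock()
	}
	return true
}
```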
  14. Spark: a fault-tolerant abstraction for in-memory cluster computing
  • RDD (Resilient Distributed Datasets): in-memory partitions of read-only data
  • execution:
    • transformations (lazy): map, filter, join
    • actions (eager, trigger computation): count, collect, save
  • fault tolerance: a lineage graph records the dependencies between RDDs; lost partitions are re-executed from lineage (toy sketch below)
  • application:
    • PageRank
    • ML tasks
    • Generalized MapReduce
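A toy, single-process model of the lazy-transformation/eager-action split: Map only extends the lineage, and Collect walks the lineage to compute (or recompute) the data. This is a conceptual analogy in Go, not Spark's API.

```go
package rdd

// RDD is a toy model of a resilient distributed dataset: it records how to
// compute itself from its parent (the lineage) instead of storing results.
type RDD struct {
	parent  *RDD
	compute func(in []int) []int // how to derive this RDD from its parent
	source  []int                // only set for the root RDD
}

// FromSlice makes a root RDD out of in-memory data.
func FromSlice(data []int) *RDD { return &RDD{source: data} }

// Map is a lazy transformation: it only extends the lineage graph.
func (r *RDD) Map(f func(int) int) *RDD {
	return &RDD{parent: r, compute: func(in []int) []int {
		out := make([]int, len(in))
		for i, v := range in {
			out[i] = f(v)
		}
		return out
	}}
}

// Collect is an action: it walks the lineage back to the source and actually
// computes. If a partition is lost, the same walk re-executes it from lineage.
func (r *RDD) Collect() []int {
	if r.parent == nil {
		return r.source
	}
	return r.compute(r.parent.Collect())
}
```

For example, FromSlice([]int{1, 2, 3}).Map(func(x int) int { return x * x }).Collect() returns [1 4 9]; recomputing a "lost" partition is simply calling Collect again along the recorded lineage.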
  15. Memcached at Facebook: balancing performance and consistency when deploying memcache in a large web-service architecture
  • architecture evolution of Web service
  • performance consistency trade-off
  • memcache: data cache between front end and back end servers
    • look-aside (as opposed to look-through) caching (sketch below)
    • on write: invalidate instead of update
    • leases: address two problems
      • stale data: memcache may otherwise hold stale data indefinitely
      • thundering herd: heavy concurrent reads and writes to the same key
    • cold cache warmup
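A minimal look-aside sketch with delete-on-write invalidation; the in-memory store type stands in for both the database and memcached, and leases are not modeled.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a toy stand-in for either the backing database or memcached.
type store struct {
	mu sync.Mutex
	m  map[string]string
}

func newStore() *store { return &store{m: map[string]string{}} }

func (s *store) get(k string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.m[k]
	return v, ok
}

func (s *store) set(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

func (s *store) delete(k string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.m, k)
}

// Read is look-aside caching: try the cache first; on a miss, read the
// database and populate the cache (the client, not the cache, talks to the DB).
func Read(cache, db *store, key string) string {
	if v, ok := cache.get(key); ok {
		return v
	}
	v, _ := db.get(key)
	cache.set(key, v)
	return v
}

// Write updates the database and then invalidates (deletes) the cache entry
// rather than updating it, so reordered updates cannot leave the cache newer
// than it should be relative to the database.
func Write(cache, db *store, key, value string) {
	db.set(key, value)
	cache.delete(key)
}

func main() {
	cache, db := newStore(), newStore()
	Write(cache, db, "user:1", "alice")
	fmt.Println(Read(cache, db, "user:1")) // miss -> DB -> fill cache
	fmt.Println(Read(cache, db, "user:1")) // hit
}
```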
  16. COPS: scalable causal consistency for wide-area storage
  • ALPS system: Availability, low Latency, Partition-tolerance, high Scalability
  • CAP theorem: strong consistency, availability, and partition tolerance cannot all be achieved
  • consistency models (stronger > weaker):
    • linearizability (strong consistency) > sequential > causal+ > causal > FIFO
    • sequential > per-key sequential > eventual
  • causal+ consistency: causal consistency + convergent conflict handling
  • implementation:
    • overview: client library and data storage nodes per data center
    • version: Lamport Logical Clock
    • dependency: records potential causality
    • conflict handling: last-writer-wins by default (versioning sketch below)
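A sketch of the version/dependency machinery: Lamport-clock versions with node IDs for last-writer-wins, and writes that carry the versions they causally depend on. The Store and Write types are assumptions, and the replication transport between data centers is omitted.

```go
package cops

// Version is a Lamport timestamp extended with a node ID so ties break
// deterministically (the default last-writer-wins conflict handling).
type Version struct {
	Clock  uint64
	NodeID uint32
}

// After reports whether v wins over w under last-writer-wins.
func (v Version) After(w Version) bool {
	if v.Clock != w.Clock {
		return v.Clock > w.Clock
	}
	return v.NodeID > w.NodeID
}

// Write carries a value plus the versions it causally depends on; a remote
// data center may apply it only after all dependencies are already visible.
type Write struct {
	Key     string
	Value   []byte
	Version Version
	Deps    map[string]Version // key -> version that must already be visible
}

type entry struct {
	Value   []byte
	Version Version
}

// Store models one data center's local key-value store.
type Store struct {
	data map[string]entry
}

// depsSatisfied checks that each dependency is already visible locally.
func (s *Store) depsSatisfied(w Write) bool {
	for key, need := range w.Deps {
		cur, ok := s.data[key]
		if !ok || (cur.Version != need && !cur.Version.After(need)) {
			return false
		}
	}
	return true
}

// Apply installs a replicated write once its dependencies are satisfied,
// resolving concurrent conflicts by last-writer-wins on the version.
func (s *Store) Apply(w Write) bool {
	if !s.depsSatisfied(w) {
		return false // caller retries after the missing dependencies arrive
	}
	cur, ok := s.data[w.Key]
	if !ok || w.Version.After(cur.Version) {
		s.data[w.Key] = entry{Value: w.Value, Version: w.Version}
	}
	return true
}
```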
  17. Certificate Transparency (CT)
  • man-in-the-middle attacks: the pre-1995 (pre-SSL) web had no defense against them
  • digital signatures
  • digital certificates: SSL / TLS / HTTPS
  • certificate transparency
    • motivation: some certificate authorities (CAs) become malicious
    • idea: allow certificates to be public for audit via a public CT log system
    • implementation:
      • Merkle tree: inclusion proofs and consistency proofs (inclusion-proof sketch below)
      • gossip: monitors and browsers compare signed tree heads (STHs) to check consistency and detect fork attacks
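A sketch of verifying a Merkle inclusion proof for a log with a power-of-two number of leaves, using RFC 6962's 0x00/0x01 hash prefixes; real CT proofs also cover non-power-of-two tree sizes, which this omits.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// leafHash and nodeHash use RFC 6962's domain-separation prefixes
// (0x00 for leaves, 0x01 for interior nodes).
func leafHash(data []byte) []byte {
	h := sha256.Sum256(append([]byte{0x00}, data...))
	return h[:]
}

func nodeHash(left, right []byte) []byte {
	h := sha256.New()
	h.Write([]byte{0x01})
	h.Write(left)
	h.Write(right)
	return h.Sum(nil)
}

// verifyInclusion hashes the leaf, then combines it with each sibling hash on
// the path to the root, using the leaf index's bits to decide left vs right,
// and compares the result against the root hash from the signed tree head.
func verifyInclusion(leaf []byte, index int, proof [][]byte, root []byte) bool {
	h := leafHash(leaf)
	for _, sibling := range proof {
		if index%2 == 1 {
			h = nodeHash(sibling, h)
		} else {
			h = nodeHash(h, sibling)
		}
		index /= 2
	}
	return bytes.Equal(h, root)
}

func main() {
	// Build a tiny 4-leaf tree by hand, then prove leaf 2 is included.
	leaves := [][]byte{[]byte("certA"), []byte("certB"), []byte("certC"), []byte("certD")}
	l := make([][]byte, 4)
	for i, d := range leaves {
		l[i] = leafHash(d)
	}
	n01, n23 := nodeHash(l[0], l[1]), nodeHash(l[2], l[3])
	root := nodeHash(n01, n23)

	proof := [][]byte{l[3], n01} // siblings on the path from leaf 2 to the root
	fmt.Println("included:", verifyInclusion([]byte("certC"), 2, proof, root)) // true
}
```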
  18. Bitcoin
  19. Blockstack: an effort to re-decentralize the internet, e.g. DNS, PKI, storage backends
  • decentralized apps: move ownership of data into users' hands instead of the service provider's
  • Public Key Infrastructure (PKI): maps names to data locations and public keys, an essential system for secure internet apps; yet no global PKI has been deployed
  • Zooko's triangle: a PKI that is globally unique, human-readable, and decentralized is hard to achieve
