- GFS: Storage
- MapReduce: Computation
- Chubby: Coordination
Why Distributed? (in 2003)
- HDD 1TB
- Google > 100 billion pages, about 1PB
- need > 1000 machines
- at 100 MB/s, ~100 days to process 1PB on one machine
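The arithmetic behind these bullets can be checked with a short sketch. It assumes decimal units (1 PB = 10^15 bytes) and that 100 MB/s is the sequential read rate of one machine. The perfect-parallelism helper `days_with` is a hypothetical idealization, not something from the notes.

```python
# Back-of-the-envelope check of the numbers above (decimal units assumed).
PB = 10**15                 # bytes in 1 PB
MB = 10**6                  # bytes in 1 MB
SECONDS_PER_DAY = 86_400

data_bytes = 1 * PB
bytes_per_second = 100 * MB  # assumed read rate of a single machine

# Time for one machine to scan the whole data set once.
single_machine_days = data_bytes / bytes_per_second / SECONDS_PER_DAY


def days_with(machines):
    """Idealized scan time with perfect parallelism (an assumption)."""
    return single_machine_days / machines


print(f"1 machine: {single_machine_days:.0f} days")
print(f"1000 machines: {days_with(1000) * 24:.1f} hours")
```

One machine needs roughly 116 days, which matches the "~100 days" estimate. Under the idealized assumption, 1000 machines bring a full scan down to a few hours.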
The world before MapReduce
- Systems community
- Distributed = Multi-thread
- Distributed shared memory
- HPC community
- Fault tolerance?
What about doing it the hard way?
- fault tolerance
- Programming framework/model
- input/output on GFS
- no persistence
- online tasks
- as a service (process) or as a library?
- auto parallelism, load balance, fault tolerance.
- input: a set of kv pairs
- map (UDF): kv pair -> a set of kv pairs
- reduce (UDF): a set of kv pairs with the same key -> kv pair
Example: word count
- input k: doc-name, v: doc-content
- map output: k: each word, v: 1
- reduce output: k: each word, v: word count
- reversed links
- sorting (requires extra)
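The programming model and the first two examples can be sketched in a few lines. This is a sequential stand-in, not the real framework: `run_mapreduce` is a hypothetical driver that maps, groups by key, and reduces in one process. The input format for reversed links (page name, list of outgoing links) is also an assumption.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Each reduce call sees one key together with all of its values.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Word count: map emits (word, 1), reduce sums the ones.
def wc_map(doc_name, content):
    for word in content.split():
        yield word, 1


def wc_reduce(word, counts):
    return word, sum(counts)


# Reversed links: map emits (target, source), reduce collects the sources.
def links_map(page, outlinks):
    for target in outlinks:
        yield target, page


def links_reduce(target, sources):
    return target, sorted(sources)


docs = [("d1", "a b a"), ("d2", "b c")]
word_counts = run_mapreduce(docs, wc_map, wc_reduce)

graph = [("p1", ["p2", "p3"]), ("p2", ["p3"])]
reversed_links = run_mapreduce(graph, links_map, links_reduce)

print(word_counts)
print(reversed_links)
```

Only the two UDFs change between the examples. The grouping step in the middle is what the framework provides.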
- Input (in GFS) -> (M) Map jobs -> (R) Reduce jobs -> Output (in GFS)
- Map workers and Reduce workers communicate over the network (RPC)
- why not use external storage, the same as for input/output?
- A master that assigns jobs to workers
- UDF partition
- Linked as library
- debug locally
- bootstrap time?
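The "UDF partition" bullet can be illustrated with a minimal sketch of how map output might be split into R buckets, one per reduce job. All function names here are hypothetical. `zlib.crc32` is used because Python's built-in `hash()` for strings is salted per process, so different workers would disagree on the bucket.

```python
import zlib
from urllib.parse import urlparse


def default_partition(key, num_reducers):
    """Default rule: stable hash of the key, modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def host_partition(url, num_reducers):
    """User-defined rule: all URLs of one host go to the same reducer."""
    return default_partition(urlparse(url).netloc, num_reducers)


def bucket(pairs, num_reducers, partition=default_partition):
    """Split one map job's output into R buckets, one per reduce job."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


map_output = [("a", 1), ("b", 1), ("a", 1), ("c", 1)]
for r, pairs in enumerate(bucket(map_output, 2)):
    print(f"bucket for reduce job {r}: {pairs}")
```

Because the rule depends only on the key, every map job sends a given key to the same reduce job, so that job sees all values for the key.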
Load balancing and pipelining example
- M=3, R=2
W1: m1-----m1 m3------m3
W2: m2--------------------m2
W3: ..........1.1.......1.3..1.2 r1---r1
W4: ...........2.1.......2.3...2.2 r2---r2
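The timeline can be reproduced with a small simulation. This is a sketch under stated assumptions: the map durations are made up, the master is modeled as greedily handing the next map job to whichever worker is idle first, and each reduce worker fetches partition `r.m` as soon as map job `m` finishes.

```python
import heapq


def schedule_maps(durations, workers):
    """Greedy assignment: the next map job goes to the first idle worker."""
    idle = [(0, w) for w in workers]  # (time the worker becomes idle, worker)
    heapq.heapify(idle)
    plan = []  # entries are (job, worker, start, end)
    for job, duration in durations.items():
        start, worker = heapq.heappop(idle)
        plan.append((job, worker, start, start + duration))
        heapq.heappush(idle, (start + duration, worker))
    return plan


def fetch_order(plan, reducer):
    """Reducer r fetches partition r.m in the order the map jobs finish."""
    finished = sorted(plan, key=lambda entry: entry[3])
    return [f"{reducer}.{job[1:]}" for job, _, _, _ in finished]


# Hypothetical durations: m2 is the long job, as in the timeline above.
durations = {"m1": 3, "m2": 10, "m3": 4}
plan = schedule_maps(durations, ["W1", "W2"])
reduce_start = max(end for _, _, _, end in plan)

for job, worker, start, end in plan:
    print(f"{worker}: {job} runs from t={start} to t={end}")
print("r1 fetches", fetch_order(plan, 1), "and starts at t =", reduce_start)
```

W1 picks up m3 after finishing m1 while W2 is still busy with m2 (load balancing). The reduce workers copy finished partitions while m2 is still running (pipelining).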
- Failure types: networks, server crashes, disk corruption, “gray” failures
- Map: relaunch
- Reduce: relaunch, commit output with atomic GFS operations (e.g. rename)
- Slow worker = failed worker
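Relaunching is safe only if a repeated or duplicate execution cannot leave partial output behind. The sketch below assumes the atomic operation is a rename: write to a temporary file, then rename it to the final name. The local filesystem and `os.replace` stand in for GFS, and `run_with_relaunch` and `flaky_reduce` are hypothetical names.

```python
import os
import tempfile


def commit_reduce_output(final_path, lines):
    """Write to a private temp file, then atomically rename it into place."""
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    with os.fdopen(fd, "w") as f:
        f.writelines(line + "\n" for line in lines)
    # Readers see either the old file or the complete new one, never a mix.
    os.replace(tmp_path, final_path)


def run_with_relaunch(task, max_attempts=3):
    """Re-execute a failed task, as the master would on another worker."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError:
            continue
    raise RuntimeError("task failed on every attempt")


def flaky_reduce(attempt):
    if attempt == 1:
        raise RuntimeError("simulated worker crash")
    return ["a\t2", "b\t2"]


with tempfile.TemporaryDirectory() as d:
    out = os.path.join(d, "part-00000")
    commit_reduce_output(out, run_with_relaunch(flaky_reduce))
    # A duplicate execution (e.g. a backup for a slow worker) commits the same data.
    commit_reduce_output(out, ["a\t2", "b\t2"])
    with open(out) as f:
        committed = f.read()

print(committed, end="")
```

Because the reduce function is deterministic here, it does not matter which execution's rename lands last. The final file holds exactly one complete copy.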
- What is the “scarce” resource?
- Optimization to reduce network traffic
- Combiner: local reduce on the map side (reduce function must be commutative and associative)
- Locality: read input from a nearby GFS replica
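The effect of a combiner on network traffic can be seen in a small word-count sketch. The function names are hypothetical, and the number of kv pairs leaving the map side is used as a rough proxy for bytes sent.

```python
from collections import Counter


def map_without_combiner(content):
    """Plain word-count map: one (word, 1) pair per occurrence."""
    return [(word, 1) for word in content.split()]


def map_with_combiner(content):
    """Same map, followed by a local reduce: one (word, partial_count) per word."""
    return sorted(Counter(content.split()).items())


def reduce_counts(pairs):
    """Reduce side: sum whatever counts arrive for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


doc = "to be or not to be"
raw = map_without_combiner(doc)
combined = map_with_combiner(doc)

print(len(raw), "pairs sent without a combiner")
print(len(combined), "pairs sent with a combiner")
print(reduce_counts(raw) == reduce_counts(combined))
```

Addition is commutative and associative, so partial sums can be merged in any order and grouping without changing the result. A function such as the mean does not have this property directly and would need to ship (sum, count) pairs instead.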