MapReduce
Google Infrastructure
- GFS: Storage
- MapReduce: Computation
- Chubby: Coordination
Why Distributed? (in 2003)
- Storage
- HDD: ~1 TB per disk
- Google: > 100 billion pages, about 1 PB total
- need > 1000 machines just to store it
- Computation
- at ~100 MB/s per disk, ~100 days to scan 1 PB on one machine (see the estimate below)
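A quick back-of-the-envelope check of that figure (plain Python, decimal units assumed):

    # time to scan 1 PB sequentially at ~100 MB/s on a single machine
    bytes_total = 1e15            # 1 PB (decimal)
    throughput  = 100e6           # 100 MB/s
    seconds = bytes_total / throughput
    print(seconds / 86400)        # ~115.7 days, i.e. "about 100 days"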
The world before MapReduce
- Systems community
- Distributed = multi-threaded programming
- Distributed shared memory
- HPC community
- MPI
- Fault tolerance?
What about doing it the hard way?
- send-recv
- coordination
- debug
- optimize
- fault tolerance
MapReduce
- Programming framework/model
- input/output on GFS
- no persistence
- online tasks
- as a service (process) or as a library?
- automatic parallelism, load balancing, fault tolerance
Concept
- input->map->reduce->output
- input: a set of kv pairs
- map (UDF): kv pair -> a set of kv pairs
- reduce (UDF): a key + all values emitted for it -> output kv pair(s) (typed sketch below)
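A typed sketch of those two UDFs (Python; the k1/v1/k2/v2 type names follow the MapReduce paper, the function names are mine):

    from typing import Iterator, Tuple, TypeVar

    K1 = TypeVar("K1"); V1 = TypeVar("V1")   # input key/value types
    K2 = TypeVar("K2"); V2 = TypeVar("V2")   # intermediate/output key/value types

    # map: one input pair -> a stream of intermediate pairs
    def map_udf(key: K1, value: V1) -> Iterator[Tuple[K2, V2]]: ...

    # reduce: one intermediate key + all values emitted for it -> output value(s)
    def reduce_udf(key: K2, values: Iterator[V2]) -> Iterator[V2]: ...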
Example: word count
- input k: doc-name, v: doc-content
- map: k: each word, v: 1
- reduce: k: each word, v: word count (runnable sketch below)
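A runnable single-machine sketch of word count (the driver just emulates the framework's group-by-key step; nothing here is the real distributed implementation):

    from collections import defaultdict

    def wc_map(doc_name, doc_content):
        for word in doc_content.split():    # emit (word, 1) per occurrence
            yield (word, 1)

    def wc_reduce(word, counts):
        yield (word, sum(counts))            # sum the 1s for one word

    def run(inputs):                         # toy driver: map, group by key, reduce
        groups = defaultdict(list)
        for k, v in inputs:
            for ik, iv in wc_map(k, v):
                groups[ik].append(iv)
        return [out for ik in sorted(groups) for out in wc_reduce(ik, groups[ik])]

    print(run([("doc1", "a b a"), ("doc2", "b c")]))   # [('a', 2), ('b', 2), ('c', 1)]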
More examples:
- grep
- reverse web-link graph (target -> list of sources), sketched after this list
- sorting (requires an order-preserving partition function)
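For the reverse web-link graph, a hedged sketch (the href regex is a stand-in for a real link extractor):

    import re

    def extract_links(html):                      # crude stand-in link extractor
        return re.findall(r'href="([^"]+)"', html)

    def revlink_map(source_url, page_html):
        for target_url in extract_links(page_html):
            yield (target_url, source_url)        # emit (target, source)

    def revlink_reduce(target_url, source_urls):
        yield (target_url, list(source_urls))     # all pages that link to target_url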
Distributed implementation
- Input (in GFS) -> M map jobs -> R reduce jobs -> Output (in GFS)
- Map workers and reduce workers communicate over the network (RPC); possible alternative?
- A master that assigns jobs to workers
- UDF partition function (see the sketch below)
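The paper's default partitioner is hash(key) mod R, and a user-supplied UDF can override it (the paper's example: hash(Hostname(urlkey)) mod R, so all URLs from one host land in the same output file). A Python sketch, using hashlib so the hash is stable across workers:

    import hashlib
    from urllib.parse import urlparse

    def default_partition(key, R):
        # which of the R reduce tasks receives this intermediate key
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % R

    def hostname_partition(url_key, R):
        # custom partitioner: group all URLs from the same host together
        return default_partition(urlparse(url_key).hostname, R)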
Load balancing and pipelining example
- M=3, R=2
- flow (one row per worker, time runs left to right; mX = map task X running, X.Y = reduce X fetching map X's partition of map Y's output, rX = reduce task X running)
W1: m1-----m1 m3------m3
W2: m2--------------------m2
W3: ..........1.1.......1.3..1.2 r1---r1
W4: ...........2.1.......2.3...2.2 r2---r2
Fault tolerance
- Failure types: network failures, server crashes, disk corruption, “gray” failures
- Master: failure is rare (a single machine); the paper simply aborts the job
- Workers
- Map: relaunch even completed tasks (their output lives on the failed worker's local disk)
- Reduce: relaunch in-progress tasks only; completed output is already in GFS, committed via atomic rename (see the sketch below)
- Slow worker ("straggler") is treated like a failed worker: schedule backup executions of its tasks
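On the reduce side, "atomic GFS functions" means each reduce task writes a private temporary file and atomically renames it to the final output name only when it completes, so a re-executed or backup task can never leave partial output. A local-filesystem sketch of that idea (POSIX rename standing in for the GFS rename):

    import os, tempfile

    def commit_reduce_output(final_path, records):
        dir_name = os.path.dirname(final_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=dir_name)   # private temp file
        with os.fdopen(fd, "w") as f:
            for key, value in records:
                f.write(f"{key}\t{value}\n")
        # atomic replace: duplicate commits (backup tasks, re-executions)
        # still leave exactly one complete output file
        os.replace(tmp_path, final_path)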