MapReduce

Google Infrastructure

  • GFS: Storage
  • MapReduce: Computation
  • Chubby: Coordination

Why Distributed? (in 2003)

  • Storage
    • one HDD holds ~1 TB
    • Google: > 100 billion pages, about 1 PB in total
    • need > 1000 machines just to store it
  • Computation
    • a single disk reads at ~100 MB/s, so ~100 days to process 1 PB on one machine (quick check below)
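  • A rough sanity check of the arithmetic above (all figures are the 2003-era assumptions from this list, not measurements):
    # Back-of-the-envelope check in Python
    corpus_bytes = 1e15           # ~1 PB of crawled pages
    disk_capacity_bytes = 1e12    # ~1 TB per HDD
    scan_rate = 100e6             # ~100 MB/s sequential read from one disk

    print(corpus_bytes / disk_capacity_bytes)   # ~1000 disks/machines just to store it
    print(corpus_bytes / scan_rate / 86400)     # ~116 days for one machine to read it once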

The world before MapReduce

  • Systems community
    • distributed programming = multi-threading
    • distributed shared memory (DSM)
  • HPC community
    • MPI
    • fault tolerance?

What about doing it the hard way?

  • explicit send/recv message passing
  • coordination across machines
  • debugging
  • performance tuning
  • fault tolerance

MapReduce

  • Programming framework/model
  • input/output on GFS
  • no persistent state in the framework itself (inputs and outputs persist in GFS)
  • offline batch jobs, not online/interactive tasks
  • as a service (process) or as a library?
  • automatic parallelization, load balancing, and fault tolerance

Concept

  • input->map->reduce->output
  • input: a set of kv pairs
  • map (UDF): kv pair -> a set of kv pairs
  • reduce (UDF): a key and the list of all its intermediate values -> output value(s) (sketched below)
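  • A minimal sequential sketch of this model in Python (the map_reduce name and signature are invented here for illustration, not Google's API): run the map UDF over every input pair, group intermediate values by key, then run the reduce UDF once per key.
    from collections import defaultdict

    def map_reduce(inputs, mapf, reducef):
        # Sequential simulation of the model: no distribution, no fault tolerance.
        intermediate = defaultdict(list)
        # Map phase: each input (k, v) pair emits a set of intermediate (k, v) pairs.
        for k, v in inputs:
            for ik, iv in mapf(k, v):
                intermediate[ik].append(iv)
        # Reduce phase: each intermediate key plus all of its values yields one output.
        return {ik: reducef(ik, ivs) for ik, ivs in intermediate.items()}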

Example: word count

  • input: k: doc-name, v: doc-content
  • map: k: each word, v: 1
  • reduce: k: each word, v: word count (sketched below)
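  • Word count written as UDFs for the toy map_reduce sketch above:
    def wc_map(doc_name, doc_content):
        # Emit (word, 1) for every word in the document.
        return [(word, 1) for word in doc_content.split()]

    def wc_reduce(word, counts):
        # Sum the 1s emitted for this word across all documents.
        return sum(counts)

    docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy brown dog")]
    print(map_reduce(docs, wc_map, wc_reduce))  # {'the': 2, 'brown': 2, 'quick': 1, ...}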

More examples:

  • grep
  • reverse web-link graph (e.g. the sketch below)
  • sorting (requires extra care: a partition function that range-partitions keys so the reduce outputs are globally ordered)
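  • For instance, reverse web-link graph UDFs in the same toy framework (a sketch; the input is assumed here to be (page URL, list of outgoing links)):
    def revlink_map(page_url, outgoing_links):
        # For every link page_url -> target, emit (target, page_url).
        return [(target, page_url) for target in outgoing_links]

    def revlink_reduce(target, sources):
        # Collect all pages that link to `target`.
        return sorted(set(sources))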

Distributed implementation

  • Input (in GFS) —> (M) Map jobs —> (R) Reduce jobs —> Output (in GFS)
  • Map workers and reduce workers communicate over the network (RPC); possible alternative?
  • A master that assigns jobs to workers
  • the partition function that maps intermediate keys to reduce tasks can be a UDF (see the sketch below)
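  • The default is to spread intermediate keys over the R reduce tasks with hash(key) mod R; a user-supplied partition function can override this, e.g. to keep all URLs from one host in the same reduce task. A sketch (function names invented here):
    def default_partition(key, R):
        # hash(key) mod R decides which reduce task receives this intermediate key.
        return hash(str(key)) % R

    def host_partition(url_key, R):
        # Example UDF partition: all URLs from the same host go to one reduce task.
        host = url_key.split("/")[2] if "://" in url_key else url_key
        return hash(host) % R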

Load balancing and pipelining example

  • M=3, R=2
  • flow (Wi = worker i; mX = map task X running; Y.X = reduce partition Y fetched from map task X's output; rY = reduce task Y running)
    W1: m1-----m1 m3------m3
    W2: m2--------------------m2
    W3: ..........1.1.......1.3..1.2 r1---r1
    W4: ...........2.1.......2.3...2.2 r2---r2
  • reduce workers fetch map output as each map task finishes (pipelining), but a reduce task can only run once all M map tasks are done; the long m2 on W2 is the straggler that delays both reduces
    

Fault tolerance

  • Failure types: network failures, server crashes, disk corruption, “gray” failures
  • Master: a single master, assumed to fail seldom
  • Workers (sketched below)
    • Map: relaunch the task; even completed map tasks must be redone, since their output lives on the failed worker's local disk
    • Reduce: relaunch in-progress tasks; completed output is already in GFS, and the atomic GFS rename makes the final commit safe
  • Slow worker = failed worker: treat stragglers like failures and launch backup copies of their tasks
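  • A rough sketch of how a master might apply the worker-failure rules above (data structures, names, and the timeout value are invented, not from the paper): re-queue any task whose worker stopped heartbeating, including completed map tasks, because their output lived on that worker's local disk; completed reduce output is already safe in GFS.
    import time

    PING_TIMEOUT_S = 60  # assumed heartbeat timeout, not a figure from the paper

    def handle_worker_failures(worker_last_ping, tasks, now=None):
        # Re-queue tasks whose worker has stopped answering the master's pings.
        now = time.time() if now is None else now
        dead = {w for w, last in worker_last_ping.items() if now - last > PING_TIMEOUT_S}
        for t in tasks:
            if t["worker"] in dead:
                if t["state"] == "in_progress":
                    t["state"] = "idle"   # hand the task to another worker
                elif t["state"] == "done" and t["kind"] == "map":
                    t["state"] = "idle"   # map output was on the dead worker's local disk
                # completed reduce tasks stay "done": their output is already in GFS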