목차
ppt
MapReduce
3 steps of MapReduce
- Map:
- Apply a user-written Map function to each input element
- Group by key: Sort and shuffle
- System sorts all the key-value pairs by key, and outputs key/(list of values) pairs
- Reduce:
- User-written Reduce function is applied to each key/(list of values)
Map-Reduce: A diagram

Map-Reduce: In Parallel

Example Word Count

MapReduce Pattern

MapReduce Phases

Deciding on what will be the key and what will be the value
➔ programmer’s responsibility
MapReduce: Execution Framework
MapReduce Execution Framework takes care of:
- Partitioning the input data
- Scheduling the program’s execution across a set of machines
- Performing the group by key step
- In practice, this is is the bottleneck
- Handling machine failures
- Managing required inter-machine communication
Example 1: Word Count

Example 1: Word Count

Q: What are the Key and Value Pairs of Map and Reduce?
Map: Key=word, Value=1
Reduce: Key=word, Value=aggregated count
Example 1: Word Count

Q: Do you see any place we can improve the efficiency?
Local aggregation at mapper will be able to improve MapReduce efficiency.
WordCount: No Combine

WordCount: Combine

MapReduce: Combiner
-
Combiner: do local aggregation/combine task at mapper

-
Q: What are the benefits of using combiner:
- Reduce memory/disk requirement of Map tasks
- Reduce network traffic
-
Q: Can we remove the reduce function?
- No, reducer still needs to process records with same key but from different mappers
-
Q: How would you implement combiner?
- It is the same as Reducer!
Combiner Example

Combine: Bandwidth Optimization
- Combiner is an optimization, not a requirement. (optional)
- allow local aggregation (after mapper) before shuffle & sort (“mini-reducer”)
- Executed on same machine as mapper
- Used as an optimization to reduce network traffic.
Partitioner : Load Balancing
-
System uses a default partition function:
-
Partitioner computes hash value of key, takes mod of value with # of reducers

Partitioner : Load Balancing
- Hopefully same amount of load to each reducer (load balancing)
- But may be Zipfian distribution
- Zipf's law is an empirical law formulated using mathematical statistics
- Ex) given some corpus of natural language,
1st: “the” 2nd : “of ” 3rd: “and”
More on MapReduce