Table of Contents
MapReduce
3 steps of MapReduce
- Map:
- Apply a user-written Map function to each input element
- Group by key: Sort and shuffle
- System sorts all the key-value pairs by key, and outputs key/(list of values) pairs
- Reduce:
- User-written Reduce function is applied to each key/(list of values)
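The three steps above can be sketched as a minimal single-machine simulation (a hypothetical `run_mapreduce` helper, not the real distributed framework):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map: apply the user-written Map function to each input element.
    intermediate = []
    for x in inputs:
        intermediate.extend(map_fn(x))
    # Group by key: sort all (key, value) pairs, then group values per key.
    intermediate.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in kvs])
               for k, kvs in groupby(intermediate, key=itemgetter(0)))
    # Reduce: apply the user-written Reduce function to each key/(list of values).
    return [reduce_fn(k, vs) for k, vs in grouped]
```

In the real framework each phase runs in parallel across machines; this sketch only shows the data flow.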
Map-Reduce: A diagram
Map-Reduce: In Parallel
Example Word Count
MapReduce Pattern
MapReduce Phases
Deciding what will be the key and what will be the value
➔ the programmer’s responsibility
MapReduce: Execution Framework
MapReduce Execution Framework takes care of:
- Partitioning the input data
- Scheduling the program’s execution across a set of machines
- Performing the group by key step
- In practice, this is the bottleneck
- Handling machine failures
- Managing required inter-machine communication
Example 1: Word Count
Q: What are the Key and Value Pairs of Map and Reduce?
Map: Key=word, Value=1
Reduce: Key=word, Value=aggregated count
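These key/value choices can be sketched as a pair of user-written functions (names are illustrative):

```python
def wc_map(line):
    # Map: Key = word, Value = 1 for every occurrence.
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    # Reduce: Key = word, Value = aggregated count.
    return (word, sum(counts))
```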
Example 1: Word Count
Q: Do you see any place where we can improve the efficiency?
Local aggregation at the mapper can improve MapReduce efficiency.
WordCount: No Combine
WordCount: Combine
MapReduce: Combiner
- Combiner: do local aggregation/combine task at the mapper
- Q: What are the benefits of using a combiner?
- Reduce memory/disk requirement of Map tasks
- Reduce network traffic
- Q: Can we remove the reduce function?
- No, the reducer still needs to process records with the same key but from different mappers
- Q: How would you implement the combiner?
- It is the same as the Reducer!
Combiner Example
Combine: Bandwidth Optimization
- Combiner is an optimization, not a requirement (optional)
- Allows local aggregation (after the mapper) before shuffle & sort (“mini-reducer”)
- Executed on the same machine as the mapper
- Used as an optimization to reduce network traffic
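The combiner idea can be sketched as a hypothetical `combine` helper that runs on one mapper's output before shuffle & sort, reusing the reduce function:

```python
from collections import defaultdict

def combine(pairs, reduce_fn):
    # Local aggregation of one mapper's output before shuffle ("mini-reducer").
    buckets = defaultdict(list)
    for k, v in pairs:
        buckets[k].append(v)
    # Apply the same function as the Reducer to each local key group.
    return [reduce_fn(k, vs) for k, vs in buckets.items()]
```

Note that reusing the reducer as the combiner is only safe when the reduce operation is commutative and associative (e.g. sum, max).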
Partitioner : Load Balancing
- System uses a default partition function:
- Partitioner computes the hash value of the key, then takes it mod the number of reducers
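The default partition function amounts to one line (a sketch; real frameworks use their own hash functions):

```python
def default_partition(key, num_reducers):
    # hash(key) mod (number of reducers) picks the reducer for this key.
    return hash(key) % num_reducers
```

All pairs with the same key always land on the same reducer, which is what makes the group-by-key step correct.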
- Hopefully each reducer gets about the same amount of load (load balancing)
- But key frequencies may follow a Zipfian distribution
- Zipf's law is an empirical law formulated using mathematical statistics
- Ex) given some corpus of natural language:
  1st: “the”, 2nd: “of”, 3rd: “and”
More on MapReduce
- Apply mapper to every input key-value pair stored in DFS
- Generate arbitrary number of intermediate (k,v)
- Aggregate locally
- Assign to reducers
- Distributed group by operation (shuffle) on intermediate keys
- Sort intermediate results by key (not across reducers)
- Aggregate intermediate results
- Generate final output to DFS (one file per reducer)
Example 2: Counting URLs
- Find the count of each URL in web logs
- The map function processes logs of web page requests and outputs (URL, 1)
- The reduce function adds together all values for the same URL and emits a (URL, total count) pair
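A sketch of these two functions, assuming a hypothetical log format of "timestamp URL" per line:

```python
def url_map(log_line):
    # Assumed log layout: "<timestamp> <URL>"; emit (URL, 1) per request.
    _, url = log_line.split()
    return [(url, 1)]

def url_reduce(url, ones):
    # Sum all the 1s for the same URL into a total count.
    return (url, sum(ones))
```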
Example 3: Count URL access frequency
Example 4: Stock Summary
- Find the average daily gain of each company from 1/1/2000 ~ 12/31/2015
- Data is a set of lines: { date, company, start_price, end_price }
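One possible key/value choice: Map keys on the company and emits the day's gain; Reduce averages the gains. A sketch, assuming the lines are comma-separated:

```python
def stock_map(line):
    # Assumed layout: "date,company,start_price,end_price"
    date, company, start, end = line.split(",")
    # Key = company, Value = that day's gain.
    return [(company, float(end) - float(start))]

def stock_reduce(company, gains):
    # Average daily gain over all of the company's records.
    return (company, sum(gains) / len(gains))
```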
Example 5: Average salaries in regions
- Show zip codes where average salaries are in the ranges: (1) < $100K (2) $100K ... $500K (3) > $500K
- Data is a set of lines: { name, age, address, zip, salary }
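One way to do this: Map keys on the zip code and emits the salary; Reduce computes the average for the zip code and assigns it to a range. A sketch with illustrative function names:

```python
def salary_map(record):
    # record: (name, age, address, zip, salary); Key = zip code, Value = salary.
    name, age, address, zipcode, salary = record
    return [(zipcode, salary)]

def salary_reduce(zipcode, salaries):
    # Average the salaries for this zip code, then bucket into the three ranges.
    avg = sum(salaries) / len(salaries)
    if avg < 100_000:
        bucket = "< $100K"
    elif avg <= 500_000:
        bucket = "$100K ... $500K"
    else:
        bucket = "> $500K"
    return (zipcode, bucket)
```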
Example 6: Find Common Friends
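A common approach (a sketch, not necessarily the one the slides use): for each person, Map emits one record per friendship edge, keyed by the sorted pair of names and carrying the person's full friend list; Reduce intersects the two lists that arrive for each pair.

```python
def friends_map(person, friends):
    # Key = sorted (person, friend) pair, Value = person's whole friend set.
    return [(tuple(sorted((person, f))), set(friends)) for f in friends]

def friends_reduce(pair, friend_sets):
    # Each pair appears twice (once from each side); intersect the two sets.
    common = set.intersection(*friend_sets)
    # Exclude the two people themselves from their common friends.
    return (pair, common - set(pair))
```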