Table of Contents
MapReduce
3 steps of MapReduce
- Map:
- Apply a user-written Map function to each input element
- Group by key: Sort and shuffle
- System sorts all the key-value pairs by key, and outputs key/(list of values) pairs
- Reduce:
- User-written Reduce function is applied to each key/(list of values)
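The three steps above can be sketched as a minimal single-machine simulation (a hypothetical `run_mapreduce` helper, not the real distributed framework):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map: apply the user-written Map function to each input element.
    intermediate = []
    for x in inputs:
        intermediate.extend(map_fn(x))
    # Group by key: sort all (key, value) pairs, then group values per key.
    intermediate.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in kvs])
               for k, kvs in groupby(intermediate, key=itemgetter(0)))
    # Reduce: apply the user-written Reduce function to each key/(list of values).
    return [reduce_fn(k, vs) for k, vs in grouped]
```

In the real framework each phase runs in parallel across machines; this sketch only shows the data flow.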
Map-Reduce: A diagram
Map-Reduce: In Parallel
Example Word Count
MapReduce Pattern
MapReduce Phases
Deciding what will be the key and what will be the value
➔ the programmer’s responsibility
MapReduce: Execution Framework
MapReduce Execution Framework takes care of:
- Partitioning the input data
- Scheduling the program’s execution across a set of machines
- Performing the group by key step
- In practice, this is the bottleneck
- Handling machine failures
- Managing required inter-machine communication
Example 1: Word Count
Q: What are the Key and Value Pairs of Map and Reduce?
Map: Key=word, Value=1
Reduce: Key=word, Value=aggregated count
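These key/value choices can be sketched as a pair of user-written functions (names are illustrative):

```python
def wc_map(line):
    # Map: Key = word, Value = 1 for every occurrence.
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    # Reduce: Key = word, Value = aggregated count.
    return (word, sum(counts))
```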
Example 1: Word Count
Q: Do you see any place where we can improve the efficiency?
Local aggregation at the mapper can improve MapReduce efficiency.
WordCount: No Combine
WordCount: Combine
MapReduce: Combiner
- Combiner: do local aggregation/combine task at the mapper
- Q: What are the benefits of using a combiner?
- Reduce memory/disk requirement of Map tasks
- Reduce network traffic
- Q: Can we remove the reduce function?
- No, the reducer still needs to process records with the same key but from different mappers
- Q: How would you implement the combiner?
- It is the same as the Reducer!
Combiner Example
Combine: Bandwidth Optimization
- Combiner is an optimization, not a requirement (optional)
- Allows local aggregation (after the mapper) before shuffle & sort (“mini-reducer”)
- Executed on the same machine as the mapper
- Used as an optimization to reduce network traffic
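The combiner idea can be sketched as a hypothetical `combine` helper that runs on one mapper's output before shuffle & sort, reusing the reduce function:

```python
from collections import defaultdict

def combine(pairs, reduce_fn):
    # Local aggregation of one mapper's output before shuffle ("mini-reducer").
    buckets = defaultdict(list)
    for k, v in pairs:
        buckets[k].append(v)
    # Apply the same function as the Reducer to each local key group.
    return [reduce_fn(k, vs) for k, vs in buckets.items()]
```

Note that reusing the reducer as the combiner is only safe when the reduce operation is commutative and associative (e.g. sum, max).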
Partitioner : Load Balancing
- System uses a default partition function:
- Partitioner computes the hash value of the key, then takes it mod the number of reducers
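The default partition function amounts to one line (a sketch; real frameworks use their own hash functions):

```python
def default_partition(key, num_reducers):
    # hash(key) mod (number of reducers) picks the reducer for this key.
    return hash(key) % num_reducers
```

All pairs with the same key always land on the same reducer, which is what makes the group-by-key step correct.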
- Hopefully each reducer gets about the same amount of load (load balancing)
- But key frequencies may follow a Zipfian distribution
- Zipf's law is an empirical law formulated using mathematical statistics
- Ex) given some corpus of natural language:
  1st: “the”, 2nd: “of”, 3rd: “and”
More on MapReduce
- Apply mapper to every input key-value pair stored in DFS
- Generate arbitrary number of intermediate (k,v)
- Aggregate locally
- Assign to reducers
- Distributed group by operation (shuffle) on intermediate keys
- Sort intermediate results by key (not across reducers)
- Aggregate intermediate results
- Generate final output to DFS (one file per reducer)
Example 2: Counting URLs
- Find the count of each URL in web logs
- The map function processes logs of web page requests and outputs (URL, 1)
- The reduce function adds together all values for the same URL and emits a (URL, total count) pair
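A sketch of these two functions, assuming a hypothetical log format of "timestamp URL" per line:

```python
def url_map(log_line):
    # Assumed log layout: "<timestamp> <URL>"; emit (URL, 1) per request.
    _, url = log_line.split()
    return [(url, 1)]

def url_reduce(url, ones):
    # Sum all the 1s for the same URL into a total count.
    return (url, sum(ones))
```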
Example 3: Count URL access frequency
Example 4: Stock Summary
- Find the average daily gain of each company from 1/1/2000 ~ 12/31/2015
- Data is a set of lines: { date, company, start_price, end_price }
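One possible key/value choice: Map keys on the company and emits the day's gain; Reduce averages the gains. A sketch, assuming the lines are comma-separated:

```python
def stock_map(line):
    # Assumed layout: "date,company,start_price,end_price"
    date, company, start, end = line.split(",")
    # Key = company, Value = that day's gain.
    return [(company, float(end) - float(start))]

def stock_reduce(company, gains):
    # Average daily gain over all of the company's records.
    return (company, sum(gains) / len(gains))
```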
Example 5: Average salaries in regions
- Show zip codes where average salaries are in the ranges: (1) < $100K (2) $100K ... $500K (3) > $500K
- Data is a set of lines: { name, age, address, zip, salary }
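One way to do this: Map keys on the zip code and emits the salary; Reduce computes the average for the zip code and assigns it to a range. A sketch with illustrative function names:

```python
def salary_map(record):
    # record: (name, age, address, zip, salary); Key = zip code, Value = salary.
    name, age, address, zipcode, salary = record
    return [(zipcode, salary)]

def salary_reduce(zipcode, salaries):
    # Average the salaries for this zip code, then bucket into the three ranges.
    avg = sum(salaries) / len(salaries)
    if avg < 100_000:
        bucket = "< $100K"
    elif avg <= 500_000:
        bucket = "$100K ... $500K"
    else:
        bucket = "> $500K"
    return (zipcode, bucket)
```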
Example 6: Find Common Friends
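A common approach (a sketch, not necessarily the one the slides use): for each person, Map emits one record per friendship edge, keyed by the sorted pair of names and carrying the person's full friend list; Reduce intersects the two lists that arrive for each pair.

```python
def friends_map(person, friends):
    # Key = sorted (person, friend) pair, Value = person's whole friend set.
    return [(tuple(sorted((person, f))), set(friends)) for f in friends]

def friends_reduce(pair, friend_sets):
    # Each pair appears twice (once from each side); intersect the two sets.
    common = set.intersection(*friend_sets)
    # Exclude the two people themselves from their common friends.
    return (pair, common - set(pair))
```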