What is MapReduce?
- MapReduce is
- a programming model for expressing distributed computations on massive datasets, and
- an execution framework (a.k.a runtime) for large-scale data processing on commodity clusters
- Developed at Google in 2004 ( Jeffrey Dean & Sanjay Ghemawat )
What is MapReduce?
- It is a programming model that processes large data by:
- applying a function to each logical record in the input (map)
- categorizing and combining the intermediate results into summary values (reduce)
- Google's MapReduce is inspired by the map and reduce functions in the functional programming language LISP.
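The two-phase model above can be sketched in a few lines of Python. This is a hypothetical single-machine illustration of the programming model (word count), not the Hadoop API: `map_fn` emits (key, value) pairs for each record, and `reduce_fn` combines all values sharing a key into a summary value.

```python
def map_fn(record):
    """Map: emit (word, 1) for each word in one input record."""
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    """Reduce: combine the intermediate values for one key into a summary value."""
    return (key, sum(values))

def run_mapreduce(records):
    # Map phase: apply map_fn to every logical record in the input.
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Shuffle phase: group intermediate values by key.
    groups = {}
    for key, value in intermediate:
        groups.setdefault(key, []).append(value)
    # Reduce phase: summarize each group.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run_mapreduce(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real cluster the map calls, the shuffle, and the reduce calls each run in parallel across many machines; only the two user-supplied functions stay the same.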
MapReduce vs. Hadoop
- MapReduce is a programming model for writing and executing applications that require processing massive amounts of data + an execution framework for large-scale data processing on commodity clusters
- Originally developed at Google
- Hadoop is an open-source implementation of MapReduce, whose development was led by Yahoo Research
(and is now part of the Apache project)
- Rapidly expanding software ecosystem
Hadoop
- Created in 2005 by Doug Cutting and Mike Cafarella
- Originally developed for Nutch, an open-source search engine
- "Hadoop" has no meaning; it was the name of Cutting's son's toy elephant
- Based on MapReduce, proposed by Google
- Open source through Apache
- Many Big Data tools built on or around Hadoop
MapReduce vs. Hadoop
| | MapReduce | Hadoop |
|---|---|---|
| Organization | Google | Yahoo/Apache |
| Implementation Language | C++ | Java |
| Distributed File System | GFS (Google File System) | HDFS (Hadoop Distributed File System) |
| Database | Bigtable | HBase |
| Distributed lock manager | Chubby | ZooKeeper |
Apache Hadoop History
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
OSDI'04: Sixth Symposium on Operating System Design and Implementation. December, 2004.
MapReduce
- MapReduce was invented at Google to compute PageRank
- The PageRank algorithm is at the heart of Google's search algorithm
- Google needed an efficient, effective way to compute PageRank for a crawled set of websites on a cluster of machines
- MapReduce was designed to address this problem
What is MapReduce used for?
- Google
- Index construction for Google Search
- Article clustering for Google News
- Statistical machine translation ...
- Yahoo!
- Index construction for Yahoo! Search
- Spam detection for Yahoo Mail
- Facebook
- Web log processing
- Ad optimization
- Spam detection
Why did MapReduce become so popular?
- Distributed Computation Before MapReduce
- how to divide the workload among multiple machines?
- how to distribute data and programs to other machines?
- how to schedule tasks?
- what happens if a task fails while running? ...
- Distributed Computation After MapReduce
- how to write the Map function?
- how to write the Reduce function?
→ MapReduce lowered the knowledge barrier to distributed computation.
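The lowered barrier can be made concrete with a sketch: to solve a new problem, the programmer writes only a map and a reduce function, and the framework supplies the grouping, scheduling, and fault handling. This hypothetical example (max temperature per year, with made-up input lines) swaps in different map/reduce functions while the "framework" part stays generic.

```python
def map_fn(line):
    """Map: parse a '<year> <temp>' record and emit (year, temp)."""
    year, temp = line.split()
    return [(year, int(temp))]

def reduce_fn(year, temps):
    """Reduce: the maximum temperature observed for one year."""
    return (year, max(temps))

def run_job(lines):
    # Stand-in for the execution framework: group map output by key,
    # then reduce each group. In a real system this part also handles
    # scheduling, data distribution, and machine failures.
    groups = {}
    for line in lines:
        for key, value in map_fn(line):
            groups.setdefault(key, []).append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

readings = ["1950 0", "1950 22", "1949 111", "1949 78"]
print(run_job(readings))  # {'1950': 22, '1949': 111}
```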
MapReduce
Programming Model + Execution Framework
Programming Model
- Simple Model
- Programmer only describes logic
Execution Framework
- Works on commodity clusters
- Scales to thousands of machines
- Hides all hard system problems from the programmer
− Scheduling
− Data distribution
− Machine failure
− ...
MapReduce Design Goals
- Scalability to large data volumes:
- 1,000s of machines, 10,000s of disks
- Cost-efficiency:
- Commodity machines
- Commodity network
- Automatic fault-tolerance
- Easy to use
“Low-end server platform is about 4 times more cost efficient than a high-end shared memory platform from the same vendor.”
Big Ideas behind MapReduce
- Scale Out Instead of Scaling Up
- A large number of commodity low-end servers is more cost-effective for data-intensive applications
- Failures are Common
- Suppose a cluster is built using machines with an MTBF (mean time between failures) of 1,000 days
- For a 10,000-server cluster, there are on average 10 failures per day!
- MapReduce implementation copes with failures
(Data is replicated / Automatic task restarts)
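The failure estimate above is a simple back-of-the-envelope calculation: if each machine fails on average once every 1,000 days, then a 10,000-machine cluster sees about 10 failures per day in expectation.

```python
mtbf_days = 1000        # mean time between failures, per machine
cluster_size = 10_000   # number of servers in the cluster

# Each machine contributes an average of 1/mtbf_days failures per day,
# so the expected number of failures across the whole cluster per day is:
failures_per_day = cluster_size / mtbf_days
print(failures_per_day)  # 10.0
```

At that rate, machine failure is a routine event rather than an exception, which is why replication and automatic task restarts are built into the framework.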
Big Ideas behind MapReduce
- Move Code to the Data
- Code transfer is much cheaper than transferring massive amounts of data.
- Idea: move code to where the data reside (Exploit Locality)
- Process Data Sequentially (Avoid Random Access)
- Random accesses to data stored on disks are much more costly than sequential accesses
- Disk seek times are determined by mechanical factors
Big Ideas behind MapReduce
- Hide System-Level Details from Programmers
- Provide a simple abstraction
- Programmer defines what computations are
- MapReduce execution framework takes care of how the computations are carried out