What is MapReduce?
- MapReduce is
- a programming model for expressing distributed computations on massive datasets, and
- an execution framework (a.k.a runtime) for large-scale data processing on commodity clusters
- Developed at Google in 2004 ( Jeffrey Dean & Sanjay Ghemawat )
What is MapReduce?
- It is a programming model that processes large data by:
- applying a function to each logical record in the input (map)
- categorizing and combining the intermediate results into summary values (reduce)
- Google's MapReduce is inspired by the map and reduce functions in the functional programming language LISP.
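The two-phase model above can be sketched in a few lines of Python. This is a hypothetical single-machine illustration of the programming model (word count), not the Hadoop API: `map_fn` emits (key, value) pairs for each record, and `reduce_fn` combines all values sharing a key into a summary value.

```python
def map_fn(record):
    """Map: emit (word, 1) for each word in one input record."""
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    """Reduce: combine the intermediate values for one key into a summary value."""
    return (key, sum(values))

def run_mapreduce(records):
    # Map phase: apply map_fn to every logical record in the input.
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Shuffle phase: group intermediate values by key.
    groups = {}
    for key, value in intermediate:
        groups.setdefault(key, []).append(value)
    # Reduce phase: summarize each group.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run_mapreduce(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real cluster the map calls, the shuffle, and the reduce calls each run in parallel across many machines; only the two user-supplied functions stay the same.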
MapReduce vs. Hadoop
- MapReduce is a programming model for writing and executing applications that require processing massive amounts of data + an execution framework for large-scale data processing on commodity clusters
- Originally developed at Google
- Hadoop is an open-source implementation of MapReduce, whose development was led by Yahoo Research
(and is now part of the Apache project)
- Rapidly expanding software ecosystem
Hadoop
- Created in 2005 by Doug Cutting and Mike Cafarella
- Originally developed for Nutch, an open-source search engine
- "Hadoop" has no meaning; it was the name of Cutting's son's toy elephant
- Based on MapReduce, proposed by Google
- Open source through Apache
- Many Big Data tools built on or around Hadoop
MapReduce vs. Hadoop
| | MapReduce | Hadoop |
|---|---|---|
| Organization | Google | Yahoo/Apache |
| Implementation Language | C++ | Java |
| Distributed File System | GFS (Google File System) | HDFS (Hadoop Distributed File System) |
| Database | Bigtable | HBase |
| Distributed lock manager | Chubby | ZooKeeper |
Apache Hadoop History
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
OSDI'04: Sixth Symposium on Operating System Design and Implementation. December, 2004.
MapReduce
- MapReduce was invented at Google to compute PageRank
- The PageRank algorithm is at the heart of Google's search algorithm
- Google needed an efficient, effective way to compute PageRank for a crawled set of websites on a cluster of machines
- MapReduce was designed to address this problem
What is MapReduce used for?
- Google
- Index construction for Google Search
- Article clustering for Google News
- Statistical machine translation ...
- Yahoo!
- Index construction for Yahoo! Search
- Spam detection for Yahoo Mail
- Facebook
- Web log processing
- Ad optimization
- Spam detection
Why did MapReduce become so popular?
- Distributed Computation Before MapReduce
- how to divide the workload among multiple machines?
- how to distribute data and programs to other machines?
- how to schedule tasks?
- what happens if a task fails while running? ...
- Distributed Computation After MapReduce
- how to write the Map function?
- how to write the Reduce function?
→ MapReduce lowered the knowledge barrier to distributed computation.
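The lowered barrier can be made concrete with a sketch: to solve a new problem, the programmer writes only a map and a reduce function, and the framework supplies the grouping, scheduling, and fault handling. This hypothetical example (max temperature per year, with made-up input lines) swaps in different map/reduce functions while the "framework" part stays generic.

```python
def map_fn(line):
    """Map: parse a '<year> <temp>' record and emit (year, temp)."""
    year, temp = line.split()
    return [(year, int(temp))]

def reduce_fn(year, temps):
    """Reduce: the maximum temperature observed for one year."""
    return (year, max(temps))

def run_job(lines):
    # Stand-in for the execution framework: group map output by key,
    # then reduce each group. In a real system this part also handles
    # scheduling, data distribution, and machine failures.
    groups = {}
    for line in lines:
        for key, value in map_fn(line):
            groups.setdefault(key, []).append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

readings = ["1950 0", "1950 22", "1949 111", "1949 78"]
print(run_job(readings))  # {'1950': 22, '1949': 111}
```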
MapReduce
Programming Model + Execution Framework
Programming Model
- Simple Model
- Programmer only describes logic
Execution Framework
- Works on commodity clusters
- Scales to thousands of machines
- Hides all hard system problems from the programmer
− Scheduling
− Data distribution
− Machine failure
− ...
MapReduce Design Goals
- Scalability to large data volumes:
- 1,000s of machines, 10,000s of disks
- Cost-efficiency:
- Commodity machines
- Commodity network
- Automatic fault-tolerance
- Easy to use
“Low-end server platform is about 4 times more cost efficient than a high-end shared memory platform from the same vendor.”
Big Ideas behind MapReduce
- Scale Out Instead of Scaling Up
- A large number of commodity low-end servers is more cost-effective for data-intensive applications
- Failures are Common
- Suppose a cluster is built using machines with an MTBF (mean time between failures) of 1,000 days
- For a 10,000-server cluster, there are on average 10 failures per day!
- MapReduce implementation copes with failures
(Data is replicated / Automatic task restarts)
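The failure estimate above is a simple back-of-the-envelope calculation: if each machine fails on average once every 1,000 days, then a 10,000-machine cluster sees about 10 failures per day in expectation.

```python
mtbf_days = 1000        # mean time between failures, per machine
cluster_size = 10_000   # number of servers in the cluster

# Each machine contributes an average of 1/mtbf_days failures per day,
# so the expected number of failures across the whole cluster per day is:
failures_per_day = cluster_size / mtbf_days
print(failures_per_day)  # 10.0
```

At that rate, machine failure is a routine event rather than an exception, which is why replication and automatic task restarts are built into the framework.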
Big Ideas behind MapReduce
- Move Code to the Data
- Code transfer is much cheaper than transferring massive amounts of data.
- Idea: move code to where the data reside (Exploit Locality)
- Process Data Sequentially (Avoid Random Access)
- Random accesses to data stored on disks are much more costly than sequential accesses
- Disk seek times are determined by mechanical factors
Big Ideas behind MapReduce
- Hide System-Level Details from Programmers
- Provide a simple abstraction
- Programmer defines what computations are
- MapReduce execution framework takes care of how the computations are carried out