Table of Contents
MapReduce = Programming Model + Execution Framework
Cluster Architecture

Simplest environment for parallel processing
- No dependencies among the data
- Data can be split into many smaller chunks
- Each process can work on one chunk
- Master/worker approach (a minimal sketch follows this list)
- The master manages task assignments to the workers
- Workers execute the tasks assigned to them by the master
- Communication occurs only between the master process and the worker processes
(no direct worker-to-worker communication)
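A minimal sketch of this master/worker pattern using Python's multiprocessing; the queue-based protocol and the chunking scheme are illustrative assumptions, not part of any real MapReduce implementation:

```python
from multiprocessing import Process, Queue

def worker(task_queue, result_queue):
    # Workers talk only to the master via queues, never to each other.
    while True:
        chunk = task_queue.get()
        if chunk is None:                 # sentinel from the master: no more work
            break
        result_queue.put(sum(chunk))      # process the assigned chunk

def master(data, num_workers=4, chunk_size=100):
    task_queue, result_queue = Queue(), Queue()
    workers = [Process(target=worker, args=(task_queue, result_queue))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    # Split the data into independent chunks and hand them out as tasks.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for chunk in chunks:
        task_queue.put(chunk)
    for _ in workers:                     # one sentinel per worker
        task_queue.put(None)
    total = sum(result_queue.get() for _ in chunks)
    for w in workers:
        w.join()
    return total

if __name__ == "__main__":
    print(master(list(range(1000))))      # 499500
```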
MapReduce Execution Framework
- Handles scheduling
- Assigns workers to map and reduce tasks
- Handles data distribution
- Handles synchronization
- Gathers, sorts, and shuffles intermediate data (sketched after this list)
- Handles errors and faults
- Detects worker failures and restarts their tasks
- Everything happens on top of a distributed file system (GFS, HDFS)
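The gather/sort/shuffle responsibility can be pictured as grouping every intermediate (key, value) pair by key before reduction. A toy, in-memory sketch; the real framework performs this across machines and disks:

```python
from collections import defaultdict

def shuffle_and_sort(intermediate_pairs):
    # Group intermediate (key, value) pairs by key and return them in
    # key order, mirroring what happens between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return sorted(groups.items())    # [(key, [values...]), ...]

pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
print(shuffle_and_sort(pairs))       # [('apple', [1, 1]), ('banana', [1])]
```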
Distributed File System
- A Distributed File System (DFS) enables programs to store and access remote files exactly as they do local ones, allowing users to access files from any computer on a network.
- File system spread over multiple, autonomous computers.
- A distributed file system should provide:
- Transparency: hide the details of where a file is located.
- High availability: files remain easily accessible regardless of their physical location.
Distributed File System
- These objectives are difficult to achieve because a distributed file system is vulnerable to failures in the underlying network as well as crashes of the machines that host the files.
- Replication can be used to alleviate this problem.
- However, replication introduces additional issues such as consistency.
- Example
- GFS (Google File System) for Google’s MapReduce
- HDFS (Hadoop Distributed File System) for Hadoop
GFS/HDFS
Distribution of the input
- Input data is partitioned into splits of size S and is processed by M mappers
- Splitting the data helps exploit data-level parallelism
- The number of map tasks is usually larger than the number of available worker machines (for better dynamic load balancing)
- The split size S is typically the size of a distributed file system block; a worked example follows
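As a worked example, assuming a 128 MB block size (a common HDFS default), a 1 GB input is divided into M = 8 splits:

```python
import math

def num_splits(input_size_bytes, split_size_bytes=128 * 1024 * 1024):
    # M = ceil(input size / split size); 128 MB is a typical HDFS block size.
    return math.ceil(input_size_bytes / split_size_bytes)

print(num_splits(1024 * 1024 * 1024))   # 1 GB input -> 8 map tasks
```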
Execution Flow Overview

Overall schematic diagram for MapReduce framework

Master
- Only one master per MapReduce computation
- The master:
- assigns map and reduce tasks to idle workers
- informs mappers of the location of their input data
- stores the state (idle, in-progress, or completed) of each task and the identity of each worker machine
- for each completed map task, stores the locations and sizes of the intermediate files produced by the mapper; this information is pushed to workers that have in-progress reduce tasks (a bookkeeping sketch follows)
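A sketch of the master's bookkeeping described above; the class and field names are illustrative assumptions, not from any real implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaskInfo:
    state: str = "idle"              # idle | in-progress | completed
    worker_id: Optional[str] = None  # identity of the assigned worker machine

@dataclass
class MasterState:
    map_tasks: dict = field(default_factory=dict)     # task id -> TaskInfo
    reduce_tasks: dict = field(default_factory=dict)  # task id -> TaskInfo
    # For each completed map task: locations and sizes of its intermediate
    # files, pushed to workers that have in-progress reduce tasks.
    intermediate_files: dict = field(default_factory=dict)

    def complete_map_task(self, task_id, file_locations):
        self.map_tasks[task_id].state = "completed"
        self.intermediate_files[task_id] = file_locations
```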
MapReduce: Step-by-Step Execution
- Split the input into M pieces and start copies of the program on different machines
- the master assigns work to idle machines
- Map task (sketched after this list):
- read the input split and parse the key/value pairs
- pass each pair to the user-defined Map function
- write intermediate key/value pairs to disk in R files, partitioned by the partitioning function
- pass the locations of the intermediate files back to the master
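A sketch of one map task following the steps above, with word count as the user-defined Map function; the on-disk file naming and the hash partitioner are illustrative assumptions:

```python
def user_map(key, value):
    # User-defined Map: word count over a line of text.
    for word in value.split():
        yield word, 1

def run_map_task(task_id, input_split, R):
    # One in-memory bucket per reduce task, chosen by the
    # partitioning function hash(key) mod R.
    buckets = [[] for _ in range(R)]
    for key, value in input_split:                  # parse key/value pairs
        for ikey, ivalue in user_map(key, value):   # apply user-defined Map
            buckets[hash(ikey) % R].append((ikey, ivalue))
    locations = []
    for r, bucket in enumerate(buckets):            # write R partitioned files
        path = f"mr-{task_id}-{r}.txt"
        with open(path, "w") as f:
            for ikey, ivalue in bucket:
                f.write(f"{ikey}\t{ivalue}\n")
        locations.append(path)
    return locations                                # reported back to the master
```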
MapReduce: Step-by-Step Execution
- The master notifies the reduce workers of the locations of the intermediate files
- Reduction is distributed over R tasks, which cover different parts of the intermediate keys' domain
- Reduce task (sketched after this list): read the intermediate data, sort it by intermediate key, and pass each key with its set of values to the user-defined Reduce function, appending the results to a final output file
- MapReduce completes when all map and reduce tasks have finished
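A matching sketch of one reduce task: it reads its partition r from every map task's output, sorts by intermediate key, groups, and applies the user-defined Reduce; file names follow the hypothetical map-task sketch above:

```python
from itertools import groupby
from operator import itemgetter

def user_reduce(key, values):
    # User-defined Reduce: sum the counts for one word.
    return sum(values)

def run_reduce_task(r, map_task_ids):
    pairs = []
    for task_id in map_task_ids:      # fetch partition r from every mapper
        with open(f"mr-{task_id}-{r}.txt") as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                pairs.append((key, int(value)))
    pairs.sort(key=itemgetter(0))     # sort so equal keys become adjacent
    with open(f"mr-out-{r}.txt", "w") as out:
        for key, group in groupby(pairs, key=itemgetter(0)):
            result = user_reduce(key, (v for _, v in group))
            out.write(f"{key}\t{result}\n")
```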
MapReduce: Output
- The output of MapReduce is R output files
(one per reduce task)
- The partitioning function for intermediate keys can be defined by the user (an example follows after this list)
- Result files can be merged or fed to another MapReduce job
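For instance, a user-defined partitioner can route all URLs from the same host to the same reduce task, and hence the same output file; this hash-by-host example mirrors the one in the original MapReduce paper:

```python
from urllib.parse import urlparse

def default_partition(key, R):
    # Default partitioning function: spread keys evenly across R reducers.
    return hash(key) % R

def host_partition(url_key, R):
    # Custom partitioner: all URLs with the same host map to the same
    # reduce task, so each output file holds one set of hosts.
    return hash(urlparse(url_key).netloc) % R

print(host_partition("http://example.com/a", 4) ==
      host_partition("http://example.com/b", 4))   # True
```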
Execution Overview
(1) MapReduce splits the input files into M splits, then starts many copies of the program on a cluster of servers.

(2) One copy, the master, is special; the rest are workers. The master picks idle workers and assigns each one of the M map tasks or one of the R reduce tasks. A single-machine driver sketch follows.
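Putting steps (1) and (2) together, a single-machine driver that mimics this flow using the hypothetical map- and reduce-task sketches above; it runs tasks sequentially, whereas the real master would assign them to idle workers in parallel:

```python
def run_job(input_splits, M, R):
    # (1) Run M map tasks, one per input split.
    map_ids = list(range(M))
    for task_id, split in zip(map_ids, input_splits):
        run_map_task(task_id, split, R)    # from the map-task sketch above
    # (2) Run R reduce tasks; each reads its partition from every mapper.
    for r in range(R):
        run_reduce_task(r, map_ids)        # from the reduce-task sketch above

splits = [[("doc1", "the quick brown fox")],
          [("doc2", "the lazy dog")]]
run_job(splits, M=2, R=2)                  # writes mr-out-0.txt and mr-out-1.txt
```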
