Table of Contents
MapReduce = Programming Model + Execution Framework
Cluster Architecture

Simplest environment for parallel processing
- No dependencies among the data
- Data can be split into many smaller chunks
- Each process can work on one chunk
- Master/worker approach (a minimal sketch follows this list)
- The master manages task assignments to the workers
- Workers execute the tasks assigned to them by the master
- Communication occurs only between the master process and the worker processes
(no direct worker-to-worker communication)
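A minimal sketch of this master/worker pattern using Python's multiprocessing; the queue-based protocol and the chunking scheme are illustrative assumptions, not part of any real MapReduce implementation:

```python
from multiprocessing import Process, Queue

def worker(task_queue, result_queue):
    # Workers talk only to the master via queues, never to each other.
    while True:
        chunk = task_queue.get()
        if chunk is None:                 # sentinel from the master: no more work
            break
        result_queue.put(sum(chunk))      # process the assigned chunk

def master(data, num_workers=4, chunk_size=100):
    task_queue, result_queue = Queue(), Queue()
    workers = [Process(target=worker, args=(task_queue, result_queue))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    # Split the data into independent chunks and hand them out as tasks.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for chunk in chunks:
        task_queue.put(chunk)
    for _ in workers:                     # one sentinel per worker
        task_queue.put(None)
    total = sum(result_queue.get() for _ in chunks)
    for w in workers:
        w.join()
    return total

if __name__ == "__main__":
    print(master(list(range(1000))))      # 499500
```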
MapReduce Execution Framework
- Handles scheduling
- Assigns workers to map and reduce tasks
- Handles data distribution
- Handles synchronization
- Gathers, sorts, and shuffles intermediate data (sketched after this list)
- Handles errors and faults
- Detects worker failures and restarts their tasks
- Everything happens on top of a distributed file system (GFS, HDFS)
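The gather/sort/shuffle responsibility can be pictured as grouping every intermediate (key, value) pair by key before reduction. A toy, in-memory sketch; the real framework performs this across machines and disks:

```python
from collections import defaultdict

def shuffle_and_sort(intermediate_pairs):
    # Group intermediate (key, value) pairs by key and return them in
    # key order, mirroring what happens between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return sorted(groups.items())    # [(key, [values...]), ...]

pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
print(shuffle_and_sort(pairs))       # [('apple', [1, 1]), ('banana', [1])]
```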
Distributed File System
- A Distributed File System (DFS) enables programs to store and access remote files exactly as they do local ones, allowing users to access files from any computer on a network.
- File system spread over multiple, autonomous computers.
- A distributed file system should provide:
- Transparency: hide the details of where a file is located.
- High availability: files remain easily accessible regardless of their physical location.
Distributed File System
- These objectives are difficult to achieve because a distributed file system is vulnerable to failures in the underlying network as well as crashes of the machines that host the files.
- Replication can be used to alleviate this problem.
- However, replication introduces additional issues such as consistency.
- Example
- GFS (Google File System) for Google’s MapReduce
- HDFS (Hadoop Distributed File System) for Hadoop
GFS/HDFS
Distribution of the input
- Input data is partitioned into splits of size S and is processed by M mappers
- Splitting the data helps exploit data-level parallelism
- The number of map tasks is usually larger than the number of available worker machines (for better dynamic load balancing)
- The split size S is typically the size of a distributed file system block; a worked example follows
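As a worked example, assuming a 128 MB block size (a common HDFS default), a 1 GB input is divided into M = 8 splits:

```python
import math

def num_splits(input_size_bytes, split_size_bytes=128 * 1024 * 1024):
    # M = ceil(input size / split size); 128 MB is a typical HDFS block size.
    return math.ceil(input_size_bytes / split_size_bytes)

print(num_splits(1024 * 1024 * 1024))   # 1 GB input -> 8 map tasks
```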
Execution Flow Overview

Overall schematic diagram for MapReduce framework

Master
- Only one master per MapReduce computation
- The master:
- assigns map and reduce tasks to idle workers
- informs mappers of the location of their input data
- stores the state (idle, in-progress, or completed) of each task and the identity of each worker machine
- for each completed map task, stores the locations and sizes of the intermediate files produced by the mapper; this information is pushed to workers that have in-progress reduce tasks (a bookkeeping sketch follows)
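A sketch of the master's bookkeeping described above; the class and field names are illustrative assumptions, not from any real implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaskInfo:
    state: str = "idle"              # idle | in-progress | completed
    worker_id: Optional[str] = None  # identity of the assigned worker machine

@dataclass
class MasterState:
    map_tasks: dict = field(default_factory=dict)     # task id -> TaskInfo
    reduce_tasks: dict = field(default_factory=dict)  # task id -> TaskInfo
    # For each completed map task: locations and sizes of its intermediate
    # files, pushed to workers that have in-progress reduce tasks.
    intermediate_files: dict = field(default_factory=dict)

    def complete_map_task(self, task_id, file_locations):
        self.map_tasks[task_id].state = "completed"
        self.intermediate_files[task_id] = file_locations
```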
MapReduce: Step-by-Step Execution
- Split the input into M pieces and start copies of the program on different machines
- the master assigns work to idle machines
- Map task (sketched after this list):
- read the input split and parse the key/value pairs
- pass each pair to the user-defined Map function
- write intermediate key/value pairs to disk in R files, partitioned by the partitioning function
- pass the locations of the intermediate files back to the master
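A sketch of one map task following the steps above, with word count as the user-defined Map function; the on-disk file naming and the hash partitioner are illustrative assumptions:

```python
def user_map(key, value):
    # User-defined Map: word count over a line of text.
    for word in value.split():
        yield word, 1

def run_map_task(task_id, input_split, R):
    # One in-memory bucket per reduce task, chosen by the
    # partitioning function hash(key) mod R.
    buckets = [[] for _ in range(R)]
    for key, value in input_split:                  # parse key/value pairs
        for ikey, ivalue in user_map(key, value):   # apply user-defined Map
            buckets[hash(ikey) % R].append((ikey, ivalue))
    locations = []
    for r, bucket in enumerate(buckets):            # write R partitioned files
        path = f"mr-{task_id}-{r}.txt"
        with open(path, "w") as f:
            for ikey, ivalue in bucket:
                f.write(f"{ikey}\t{ivalue}\n")
        locations.append(path)
    return locations                                # reported back to the master
```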
MapReduce: Step-by-Step Execution
- The master notifies the reduce workers of the locations of the intermediate files
- Reduction is distributed over R tasks, which cover different parts of the intermediate keys' domain
- Reduce task (sketched after this list): read the intermediate data, sort it by intermediate key, and pass each key with its set of values to the user-defined Reduce function, appending the results to a final output file
- MapReduce completes when all map and reduce tasks have finished
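A matching sketch of one reduce task: it reads its partition r from every map task's output, sorts by intermediate key, groups, and applies the user-defined Reduce; file names follow the hypothetical map-task sketch above:

```python
from itertools import groupby
from operator import itemgetter

def user_reduce(key, values):
    # User-defined Reduce: sum the counts for one word.
    return sum(values)

def run_reduce_task(r, map_task_ids):
    pairs = []
    for task_id in map_task_ids:      # fetch partition r from every mapper
        with open(f"mr-{task_id}-{r}.txt") as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                pairs.append((key, int(value)))
    pairs.sort(key=itemgetter(0))     # sort so equal keys become adjacent
    with open(f"mr-out-{r}.txt", "w") as out:
        for key, group in groupby(pairs, key=itemgetter(0)):
            result = user_reduce(key, (v for _, v in group))
            out.write(f"{key}\t{result}\n")
```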
MapReduce: Output
- The output of MapReduce is R output files
(one per reduce task)
- The partitioning function for intermediate keys can be defined by the user (an example follows after this list)
- Result files can be merged or fed to another MapReduce job
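For instance, a user-defined partitioner can route all URLs from the same host to the same reduce task, and hence the same output file; this hash-by-host example mirrors the one in the original MapReduce paper:

```python
from urllib.parse import urlparse

def default_partition(key, R):
    # Default partitioning function: spread keys evenly across R reducers.
    return hash(key) % R

def host_partition(url_key, R):
    # Custom partitioner: all URLs with the same host map to the same
    # reduce task, so each output file holds one set of hosts.
    return hash(urlparse(url_key).netloc) % R

print(host_partition("http://example.com/a", 4) ==
      host_partition("http://example.com/b", 4))   # True
```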
Execution Overview
(1) MapReduce splits the input files into M splits, then starts many copies of the program on a cluster of servers.

(2) One copy, the master, is special; the rest are workers. The master picks idle workers and assigns each one of the M map tasks or one of the R reduce tasks. A single-machine driver sketch follows.
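Putting steps (1) and (2) together, a single-machine driver that mimics this flow using the hypothetical map- and reduce-task sketches above; it runs tasks sequentially, whereas the real master would assign them to idle workers in parallel:

```python
def run_job(input_splits, M, R):
    # (1) Run M map tasks, one per input split.
    map_ids = list(range(M))
    for task_id, split in zip(map_ids, input_splits):
        run_map_task(task_id, split, R)    # from the map-task sketch above
    # (2) Run R reduce tasks; each reads its partition from every mapper.
    for r in range(R):
        run_reduce_task(r, map_ids)        # from the reduce-task sketch above

splits = [[("doc1", "the quick brown fox")],
          [("doc2", "the lazy dog")]]
run_job(splits, M=2, R=2)                  # writes mr-out-0.txt and mr-out-1.txt
```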
