|
Alan Kaminsky
|
|
•
|
|
Department of Computer Science
|
|
•
|
|
Rochester Institute of Technology
|
|
•
|
|
4486 +
2220 =
6706
|
|
Home Page
|
|
Distributed Systems
|
|
4005-730-01
|
|
Spring Quarter 2013
|
|
Course Page
|
4005-730 Distributed Systems
Lecture Notes -- Module 8. Map-Reduce Systems
Prof. Alan Kaminsky
Rochester Institute of Technology -- Department of Computer Science
- All examples and figures taken from:
- T. White. Hadoop: The Definitive Guide, 2nd Edition. O'Reilly, 2011.
http://www.dilbert.com
The Map-Reduce Paradigm
- Example: Determine the maximum temperature for each year from National Climatic Data Center weather data
- A large quantity of raw data
- YYYY = Year
- TTTTT = Temperature in units of 0.1 C; 9999 means missing
- Q = Quality code
YYYY TTTTTQ
0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
0029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF104991999999999999999999
0029029070999991901010120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00941+99999102001ADDGF108991999999999999999999
0029029070999991901010206004+64333+023450FM-12+000599999V0201801N008219999999N0000001N9-00611+99999101831ADDGF108991999999999999999999
0029029070999991901010213004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00561+99999101761ADDGF108991999999999999999999
0029029070999991901010220004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999101751ADDGF108991999999999999999999
0029029070999991901010306004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00671+99999101701ADDGF106991999999999999999999
0029029070999991901010313004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999101741ADDGF108991999999999999999999
0029029070999991901010320004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00281+99999101741ADDGF108991999999999999999999
0029029070999991901010406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102311ADDGF108991999999999999999999
Input data
Figure 2-1. MapReduce logical data flow
White, op. cit.
Figure 2-2. MapReduce data flow with a single reduce task
White, op. cit.
Figure 2-3. MapReduce data flow with multiple reduce tasks
White, op. cit.
Figure 2-4. MapReduce data flow with no reduce tasks
White, op. cit.
Hadoop
- Hadoop web site: http://hadoop.apache.org/
- A full-featured, enterprise-scale implementation of map-reduce in Java
- Who uses Hadoop? See http://wiki.apache.org/hadoop/PoweredBy
- Hadoop code for the maximum temperature example
- Demo
- Input files
- Command line
$ hadoop MaxTemperature '*.txt' output
- Output directory
$ ls -l output
total 4
-rwxrwxrwx 1 ark ark 18 2011-10-23 14:27 part-00000
-rwxrwxrwx 1 ark ark 0 2011-10-23 14:27 _SUCCESS
$ cat output/part-00000
1901 317
1902 244
- Combiner (optional)
- Performs the reduction step on the output of each mapper before sending it to the reducer
- Can reduce load on the reducer if there are many mappers
- Hadoop code for the maximum temperature example with a combiner
- Demo
- Command line
$ hadoop MaxTemperatureWithCombiner '*.txt' output2
- Output directory
$ ls -l output2
total 4
-rwxrwxrwx 1 ark ark 18 2011-10-23 14:29 part-00000
-rwxrwxrwx 1 ark ark 0 2011-10-23 14:29 _SUCCESS
$ cat output2/part-00000
1901 317
1902 244
Hadoop Architecture
Figure 6-1. How Hadoop runs a MapReduce Job
White, op. cit.
- Input splits
- Determines the number of mappers and which part of the input data is processed by each mapper
- Splitting is very flexible: by files; by pieces of files; etc.
- TaskTracker
- In charge of map and reduce tasks running on a particular node
- Has a certain number of mapper slots
- Has a certain number of reducer slots
- Picks from available mapper tasks based on proximity of input split storage to task tracker node
- Picks from available reducer tasks arbitrarily
- Speculative execution
- If all tasks for a job have been launched . . .
- And some task is taking a long time to complete . . .
- Then start another copy of the slow task . . .
- Hoping the copy might finish sooner
- When one copy finishes, kill the other copy
- Fault tolerance
- Task failure: Re-run the failed mapper or reducer task; if the same task keeps failing, give up and abort the job
- Task tracker failure: Stop using the failed task tracker; re-run its tasks in a different task tracker
- Job tracker failure: The whole job fails
Hadoop Distributed File System (HDFS)
- Designed for:
- Fast streaming access . . .
- To very large data sets . . .
- Stored on a distributed cluster of nodes . . .
- With redundancy, for greater locality and for fault tolerance
- Default block size: 64 MB
- To reduce the number of disk seeks while streaming
Figure 3-1. A client reading data from HDFS
White, op. cit.
Figure 3-3. A client writing data to HDFS
White, op. cit.
What Map-Reduce Is Good For
- If you have petabytes of data to analyze, just transferring the data off the disk is going to take forever
- Instead:
- Split the data into N pieces, each on a different disk (node)
- Run N mappers in parallel, each on a different node
- Each mapper only has to read 1/N of the data from its own local disk
- Map-reduce versus traditional relational database
| |
|
Map-reduce |
|
RDBMS |
| Data size |
|
Petabytes |
|
Gigabytes |
| Disk access |
|
Streaming (batch) |
|
Random (interactive or batch) |
| Structure |
|
Unstructured or structured |
|
Highly structured |
| Updates |
|
Write once, read many times |
|
Write and read many times |
- Map-reduce versus traditional parallel computing (OpenMP, MPI, etc.)
| |
|
Map-reduce |
|
Parallel computing |
| Type of problems |
|
Mainly data-intensive |
|
Mainly CPU-intensive |
| Thread/process coupling |
|
Minimal |
|
Minimal to maximal |
| Programming patterns |
|
Just one |
|
Anything |
| Programming effort |
|
Small |
|
Medium to large |
| Fault tolerance |
|
Automatic |
|
Manual |
|
Distributed Systems
|
|
4005-730-01
|
|
Spring Quarter 2013
|
|
Course Page
|
|
Alan Kaminsky
|
|
•
|
|
Department of Computer Science
|
|
•
|
|
Rochester Institute of Technology
|
|
•
|
|
4486 +
2220 =
6706
|
|
Home Page
|
Copyright © 2013 Alan Kaminsky.
All rights reserved.
Last updated 26-Apr-2013.
Please send comments to ark@cs.rit.edu.