Advanced topics on MapReduce with Hadoop presentation

1
Advanced topics on MapReduce with Hadoop

Jiaheng Lu
Department of Computer Science
Renmin University of China
www.jiahenglu.net

2
Outline

Brief Review
Chaining MapReduce Jobs
Join in MapReduce
Bloom Filter

3
Brief Review

A parallel programming framework
Divide and merge

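[Figure: the input data is divided into splits (split0, split1, split2); each split is processed by a map task (Mappers); the shuffle routes the map output to the reduce tasks (Reducers), which write the output data (output0, output1).]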
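The flow in the figure can be sketched in plain Java, without Hadoop. This is only an illustrative simulation: the class name, the method names and the word-count task are assumptions chosen for the sketch, not part of the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map -> shuffle -> reduce flow (no Hadoop needed).
class MapReduceReviewSketch {

    // Map phase: emit one (word, 1) pair per word in a split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, with keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the value list of each key.
    static TreeMap<String, Integer> reduce(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    // Driver: map every split, then shuffle and reduce the combined output.
    static TreeMap<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String s : splits) {
            all.addAll(map(s));
        }
        return reduce(shuffle(all));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c", "a")));
        // prints {a=3, b=2, c=1}
    }
}
```

In a real cluster the map calls run in parallel on different nodes, and the shuffle moves data over the network; here everything runs in one process.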
4
Chaining MapReduce jobs

Chaining in a sequence
Chaining with complex dependency
Chaining preprocessing and postprocessing steps

5
Chaining in a sequence

Simple and straightforward
MAP -> REDUCE -> MAP -> REDUCE -> MAP ...
The output of one job is the input to the next
Similar to Unix pipes
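A minimal plain-Java sketch of sequential chaining, assuming two hypothetical jobs (a word count, then a grouping of words by count). Each "job" reads lines and writes lines, as a MapReduce job reads and writes text files; none of these names come from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java sketch of chaining jobs in a sequence (no Hadoop needed).
class SequentialChainSketch {

    // Hypothetical job 1: count words, writing "word<TAB>count" lines.
    static List<String> wordCountJob(List<String> lines) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String w : line.trim().split("\\s+")) {
                if (!w.isEmpty()) {
                    counts.merge(w, 1, Integer::sum);
                }
            }
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(e.getKey() + "\t" + e.getValue());
        }
        return out;
    }

    // Hypothetical job 2: read job 1's output and write "count<TAB>word,word,..." lines.
    static List<String> groupByCountJob(List<String> lines) {
        TreeMap<Integer, List<String>> byCount = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split("\t");
            byCount.computeIfAbsent(Integer.parseInt(f[1]), k -> new ArrayList<>()).add(f[0]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : byCount.entrySet()) {
            out.add(e.getKey() + "\t" + String.join(",", e.getValue()));
        }
        return out;
    }

    // Driver: run the jobs in order; each job's output becomes the next job's input.
    static List<String> runChain(List<String> input, List<Function<List<String>, List<String>>> jobs) {
        List<String> data = input;
        for (Function<List<String>, List<String>> job : jobs) {
            data = job.apply(data);
        }
        return data;
    }

    public static void main(String[] args) {
        List<Function<List<String>, List<String>>> jobs = new ArrayList<>();
        jobs.add(SequentialChainSketch::wordCountJob);
        jobs.add(SequentialChainSketch::groupByCountJob);
        System.out.println(runChain(List.of("a b a", "b c"), jobs));
    }
}
```

With Hadoop, the driver would instead submit the jobs one after another and set the input path of each job to the output path of the previous one.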

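// Fragment of a driver's run() method (old org.apache.hadoop.mapred API).
// Classes come from org.apache.hadoop.conf, org.apache.hadoop.io,
// org.apache.hadoop.mapred and org.apache.hadoop.mapred.lib.
// in and out are assumed to be the job's input and output paths.
Configuration conf = getConf();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
// Add Map1 to the chain with its own configuration. The class arguments are
// its input key, input value, output key and output value types.
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class,
    LongWritable.class, Text.class, Text.class,
    Text.class, true, map1Conf);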

7
Chaining with complex dependency

Jobs are not chained in a linear fashion
Use the addDependingJob() method to add dependency information
x.addDependingJob(y): job x will not start until job y has finished
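The effect of addDependingJob() can be sketched with a small plain-Java scheduler. SketchJob and runAll() are hypothetical names invented for this sketch; only the method name addDependingJob() comes from the slide. In Hadoop, the Job and JobControl classes in org.apache.hadoop.mapred.jobcontrol play these roles.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java sketch of dependency-driven job scheduling (no Hadoop needed).
class JobDependencySketch {

    static class SketchJob {
        final String name;
        final List<SketchJob> dependsOn = new ArrayList<>();

        SketchJob(String name) {
            this.name = name;
        }

        // x.addDependingJob(y): x must wait until y has finished.
        void addDependingJob(SketchJob y) {
            dependsOn.add(y);
        }
    }

    // Repeatedly "run" every job whose dependencies have all finished,
    // and return the names in the order the jobs were run.
    static List<String> runAll(List<SketchJob> jobs) {
        Set<SketchJob> done = new LinkedHashSet<>();
        while (done.size() < jobs.size()) {
            boolean progressed = false;
            for (SketchJob job : jobs) {
                if (!done.contains(job) && done.containsAll(job.dependsOn)) {
                    done.add(job);
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic or missing dependency");
            }
        }
        List<String> order = new ArrayList<>();
        for (SketchJob j : done) {
            order.add(j.name);
        }
        return order;
    }

    public static void main(String[] args) {
        SketchJob a = new SketchJob("A"), b = new SketchJob("B"), c = new SketchJob("C");
        c.addDependingJob(a);   // C consumes the outputs of A and B,
        c.addDependingJob(b);   // so it depends on both
        System.out.println(runAll(List.of(c, a, b)));
        // prints [A, B, C]
    }
}
```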
8
Chaining preprocessing and postprocessing steps

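Example: removing stop words in information retrieval (IR)
Approaches
Separate jobs for each step: inefficient
Chaining those steps into a single job
Use ChainMapper.addMapper() and ChainReducer.setReducer()
Pattern: one or more mappers, one reducer, then zero or more mappers (MAP+ | REDUCE | MAP*)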
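A plain-Java sketch of running preprocessing mappers, a reducer and a postprocessing step inside one job. The records pass between the steps in memory instead of through intermediate files, which is the point of ChainMapper and ChainReducer. The stop-word list, class name and method names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java sketch of chained map steps around a single reduce (no Hadoop needed).
class ChainedStepsSketch {

    // Illustrative stop-word list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "the", "of"));

    // Preprocessing mapper 1: split a line into lower-case words.
    static List<String> tokenize(String line) {
        return Arrays.asList(line.toLowerCase().trim().split("\\s+"));
    }

    // Preprocessing mapper 2: drop stop words and empty tokens.
    static List<String> dropStopWords(String word) {
        if (word.isEmpty() || STOP_WORDS.contains(word)) {
            return Collections.emptyList();
        }
        return Collections.singletonList(word);
    }

    // Apply the mappers in order; each mapper turns one record into zero or more records.
    static List<String> runMappers(List<String> records, List<Function<String, List<String>>> mappers) {
        List<String> current = records;
        for (Function<String, List<String>> mapper : mappers) {
            List<String> next = new ArrayList<>();
            for (String r : current) {
                next.addAll(mapper.apply(r));
            }
            current = next;
        }
        return current;
    }

    static List<String> run(List<String> lines) {
        // Map steps before the reduce.
        List<Function<String, List<String>>> pre = new ArrayList<>();
        pre.add(ChainedStepsSketch::tokenize);
        pre.add(ChainedStepsSketch::dropStopWords);
        List<String> words = runMappers(lines, pre);

        // Reduce: count each remaining word.
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }

        // Map step after the reduce: format each (word, count) pair as a line.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(e.getKey() + "\t" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("The cat of the house", "a cat")));
    }
}
```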
9
Join in MapReduce

Reduce-side join
Broadcast join
Map-side filtering and Reduce-side join, filtering by:
A given key
A range of keys from one dataset (broadcast)
A Bloom filter

10
Reduce-side join

Map
output <key, value>
key: the join key; value: the record tagged with its data source
Reduce
do a full cross-product of the values from the different sources
output the combined records

11
Example

table x
a  b
1  ab
1  cd
4  ef

table y
a  c
1  b
2  d
4  c

map() output (key = join key, value = record tagged with its source table)
key  value
1    x ab
1    x cd
4    x ef
1    y b
2    y d
4    y c

shuffle() output
key  value list
1    x ab, x cd, y b
2    y d
4    x ef, y c

reduce() output
a  b   c
1  ab  b
1  cd  b
4  ef  c
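The reduce-side join on the example tables x and y can be simulated in plain Java. This is a sketch under stated assumptions: the class and method names are invented for illustration, rows are held in arrays rather than files, and grouping into a sorted map stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join (no Hadoop needed).
class ReduceSideJoinSketch {

    // Map: emit (join key, "tag value") for each row; the tag names the source table.
    // Grouping by key in the sorted map stands in for the shuffle phase.
    static void map(String tag, String[][] rows, TreeMap<String, List<String>> shuffled) {
        for (String[] row : rows) {
            shuffled.computeIfAbsent(row[0], k -> new ArrayList<>()).add(tag + " " + row[1]);
        }
    }

    // Reduce: split one key's values by tag, then output the full cross-product.
    static List<String> reduce(String key, List<String> values) {
        List<String> fromX = new ArrayList<>();
        List<String> fromY = new ArrayList<>();
        for (String v : values) {
            String[] parts = v.split(" ", 2);
            if (parts[0].equals("x")) {
                fromX.add(parts[1]);
            } else {
                fromY.add(parts[1]);
            }
        }
        List<String> joined = new ArrayList<>();
        for (String b : fromX) {
            for (String c : fromY) {
                joined.add(key + " " + b + " " + c);
            }
        }
        return joined;
    }

    static List<String> join(String[][] x, String[][] y) {
        TreeMap<String, List<String>> shuffled = new TreeMap<>();
        map("x", x, shuffled);
        map("y", y, shuffled);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffled.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] x = {{"1", "ab"}, {"1", "cd"}, {"4", "ef"}};
        String[][] y = {{"1", "b"}, {"2", "d"}, {"4", "c"}};
        System.out.println(join(x, y));
        // prints [1 ab b, 1 cd b, 4 ef c]
    }
}
```

Key 2 appears only in table y, so its cross-product is empty and it produces no output row.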
12
Broadcast join (replicated join)

Broadcast the smaller table to every mapper
Do the join in map()
Ship the table through the distributed cache
DistributedCache.addCacheFile()
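A plain-Java sketch of the broadcast join, assuming the smaller table fits in memory. The class and method names are hypothetical. In a real job the lookup table would be loaded once per map task from the file shipped through the distributed cache.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a broadcast (replicated) join (no Hadoop needed).
class BroadcastJoinSketch {

    // Setup: load the broadcast (smaller) table into an in-memory hash table.
    static Map<String, List<String>> loadSmallTable(String[][] small) {
        Map<String, List<String>> lookup = new HashMap<>();
        for (String[] row : small) {
            lookup.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return lookup;
    }

    // map(): join one record of the larger table against the in-memory table.
    static List<String> map(String[] row, Map<String, List<String>> lookup) {
        List<String> joined = new ArrayList<>();
        for (String match : lookup.getOrDefault(row[0], Collections.emptyList())) {
            joined.add(row[0] + " " + row[1] + " " + match);
        }
        return joined;
    }

    // The whole join happens in the map phase: no shuffle and no reduce.
    static List<String> join(String[][] large, String[][] small) {
        Map<String, List<String>> lookup = loadSmallTable(small);
        List<String> out = new ArrayList<>();
        for (String[] row : large) {
            out.addAll(map(row, lookup));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] large = {{"1", "ab"}, {"1", "cd"}, {"4", "ef"}};
        String[][] small = {{"1", "b"}, {"2", "d"}, {"4", "c"}};
        System.out.println(join(large, small));
        // prints [1 ab b, 1 cd b, 4 ef c]
    }
}
```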

13
Map-side filtering and Reduce-side join

Join key: student IDs from info
Generate an IDs file from info
Broadcast the IDs file
Join
What if the IDs file can't be stored in memory?
Use a Bloom filter

14
A Bloom Filter

Introduction
Implementation of bloom filter
Use in MapReduce join

15
Introduction to Bloom Filter

Space-efficient data structure of constant size
Adds and tests elements: add(), contains()
No false negatives and a small probability of false positives

16
Implementation of bloom filter

Allocate a bit array
Add an element
generate k indexes (for example, with k hash functions)
set the k bits to 1
Test an element
generate the same k indexes
all k bits are 1: true; otherwise: false
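The steps above can be sketched as a small Java class built on java.util.BitSet. The way the k indexes are derived here (two base hashes combined as h1 + i * h2) is an assumption chosen for brevity, not something the slides specify; any k independent hash functions would serve.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch over strings.
class BloomFilterSketch {
    private final BitSet bits;
    private final int m;   // number of bits in the array
    private final int k;   // number of indexes generated per element

    BloomFilterSketch(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Generate the i-th index of an element (assumed scheme: h1 + i * h2 mod m).
    private int index(String element, int i) {
        int h1 = element.hashCode();
        int h2 = ((h1 * 0x9E3779B1) >>> 15) | 1;   // forced odd, so it is never zero
        return Math.floorMod(h1 + i * h2, m);
    }

    // Add: set the k bits of the element to 1.
    void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(index(element, i));
        }
    }

    // Test: false means "definitely never added"; true means "possibly added".
    boolean contains(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(element, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(64, 3);
        filter.add("alice");
        filter.add("bob");
        System.out.println(filter.contains("alice"));   // prints true (no false negatives)
        System.out.println(filter.contains("bob"));     // prints true
    }
}
```

The size of the bit array is fixed when the filter is created, so memory use does not grow as elements are added; the price is that the false-positive rate rises as more bits are set.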

17
Example
false positives

index             0 1 2 3 4 5 6 7 8 9
initial state     0 0 0 0 0 0 0 0 0 0
add x(0,2,6)      1 0 1 0 0 0 1 0 0 0
add y(0,3,9)      1 0 1 1 0 0 1 0 0 1
contain m(1,3,9)  bit 1 is 0: false
contain n(0,2,9)  bits 0, 2 and 9 are all 1: true, although n was never added (a false positive)
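The example can be replayed on a 10-bit java.util.BitSet. The indexes are passed in explicitly, as on the slide, instead of being computed by hash functions; the class and method names are illustrative.

```java
import java.util.BitSet;

// Replays the Bloom filter example with explicit indexes.
class BloomExampleReplay {

    // Add: set every given index to 1.
    static void add(BitSet bits, int... indexes) {
        for (int i : indexes) {
            bits.set(i);
        }
    }

    // Test: true only if every given index is 1.
    static boolean contains(BitSet bits, int... indexes) {
        for (int i : indexes) {
            if (!bits.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(10);
        add(bits, 0, 2, 6);                            // add x
        add(bits, 0, 3, 9);                            // add y
        System.out.println(bits);                      // prints {0, 2, 3, 6, 9}
        System.out.println(contains(bits, 1, 3, 9));   // m: prints false
        System.out.println(contains(bits, 0, 2, 9));   // n: prints true (a false positive)
    }
}
```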
18
Use in MapReduce join

A separate subjob creates the Bloom filter
Broadcast the Bloom filter and use it in map() of the join job
Drop the records whose keys fail the filter, and do the join in reduce()
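A plain-Java sketch of the two steps, under stated assumptions: the filter size, the two hash functions and all names are invented for illustration, and sorted maps stand in for the shuffle. The filter is built from the join keys of the smaller table and then applied to the larger one.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of Bloom-filtered map-side filtering plus a reduce-side join.
class BloomFilteredJoinSketch {
    static final int M = 1024;   // bits in the filter (illustrative size)

    // Two illustrative hash functions, so k = 2.
    static int[] indexes(String key) {
        int h = key.hashCode();
        return new int[] {Math.floorMod(h, M), Math.floorMod(h * 31 + 17, M)};
    }

    // Subjob 1: build the filter from the join keys of the smaller table.
    static BitSet buildFilter(String[][] small) {
        BitSet bits = new BitSet(M);
        for (String[] row : small) {
            for (int i : indexes(row[0])) {
                bits.set(i);
            }
        }
        return bits;
    }

    static boolean mightContain(BitSet bits, String key) {
        for (int i : indexes(key)) {
            if (!bits.get(i)) {
                return false;
            }
        }
        return true;
    }

    // Subjob 2: map() drops large-table records that fail the filter; reduce() joins.
    static List<String> join(String[][] small, String[][] large) {
        BitSet filter = buildFilter(small);   // "broadcast" to every mapper
        TreeMap<String, List<String>> smallByKey = new TreeMap<>();
        TreeMap<String, List<String>> largeByKey = new TreeMap<>();
        for (String[] row : small) {
            smallByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        for (String[] row : large) {
            if (!mightContain(filter, row[0])) {
                continue;                     // dropped before the shuffle
            }
            largeByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Reduce: a false positive that passed the filter finds no partner and is discarded here.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : largeByKey.entrySet()) {
            List<String> partners = smallByKey.getOrDefault(e.getKey(), Collections.emptyList());
            for (String l : e.getValue()) {
                for (String s : partners) {
                    out.add(e.getKey() + " " + l + " " + s);
                }
            }
        }
        return out;
    }
}
```

Because the filter has no false negatives, no matching record is ever dropped; false positives only cost some extra shuffle traffic, and the reduce step removes them.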

19
References

Chuck Lam, Hadoop in Action
Jairam Chandar, Join Algorithms using Map/Reduce

20
THANK YOU
21
Hadoop
