CS347: Map-Reduce - PowerPoint PPT Presentation

About This Presentation

Title:

CS347: Map-Reduce

Description:

KDD Infrastructure – PowerPoint PPT presentation

Slides: 64

Provided by: hec73

Learn more at: http://infolab.stanford.edu

Transcript and Presenter's Notes

1
CS347: Map-Reduce & Pig

Hector Garcia-Molina
Stanford University

2
"Big Data" Open Source Systems

Infrastructure for distributed data computations: Map-Reduce, S4, Hyracks, Pregel, Storm, Muppet
Components: Memcached, ZooKeeper, Kestrel
Data services: Pig, F1, Cassandra, HBase, Bigtable, Hive

3
Motivation for Map-Reduce
Recall one of our sort strategies:
[Figure: inputs R1, R2, R3 are partitioned on key boundaries k0 and k1 ("process data, partition"); each partition then gets a local sort ("additional processing") to produce the Result.]
4
Another example: Asymmetric fragment + replicate join
[Figure: labels Ra, Rb, Sa, Sb; steps "f partition", "union", "Local join", "Result"; stage annotations "process data, partition" and "additional processing".]
5
Building Text Index - Part I
original Map-Reduce application...
[Figure: a page stream (page 1: rat, dog; page 2: dog, cat; page 3: rat, dog) goes through Loading, Tokenizing, Sorting, and Flushing, producing intermediate runs on disk.]
After tokenizing: (rat, 1) (dog, 1) (dog, 2) (cat, 2) (rat, 3) (dog, 3)
After sorting: (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3)
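The loading, tokenizing, and sorting steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_run` is hypothetical, pages are given as (page id, text) pairs, and the run is returned in memory rather than flushed to disk.

```python
def build_run(pages):
    """Return one sorted run of (word, page_id) postings."""
    postings = []
    for page_id, text in pages:           # loading
        for word in text.split():         # tokenizing
            postings.append((word, page_id))
    postings.sort()                       # sorting by word, then by page id
    return postings                       # a real system would flush this run to disk


# The three pages from the slide.
pages = [(1, "rat dog"), (2, "dog cat"), (3, "rat dog")]
run = build_run(pages)
print(run)
```

The printed run matches the sorted posting list shown on the slide.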
6
Building Text Index - Part II
[Figure: Merge combines the intermediate runs into the final index.]
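A possible way to picture the merge step, assuming each intermediate run is a sorted list of (word, page id) pairs. The name `merge_runs` and the two sample runs are hypothetical; `heapq.merge` and `itertools.groupby` are standard-library functions.

```python
import heapq
from itertools import groupby


def merge_runs(runs):
    """Merge sorted runs of (word, page_id) into {word: [page_id, ...]}."""
    merged = heapq.merge(*runs)           # streams the runs in sorted order
    return {word: [page for _, page in group]
            for word, group in groupby(merged, key=lambda p: p[0])}


# Two hypothetical intermediate runs, each already sorted.
run_a = [("cat", 2), ("dog", 1), ("dog", 2)]
run_b = [("dog", 3), ("rat", 1), ("rat", 3)]
index = merge_runs([run_a, run_b])
print(index)
```

Because every run is already sorted, the merge never needs to hold more than one entry per run in memory at a time.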
7
Generalizing Map-Reduce
[Figure: the Part I pipeline (page stream, Loading, Tokenizing, Sorting, Flushing to intermediate runs on disk), now labeled Map.]
8
Generalizing Map-Reduce
[Figure: the Part II merge of intermediate runs into the final index, now labeled Reduce.]
9
Map Reduce

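Input: $R = \{r_1, r_2, \ldots, r_n\}$, functions $M$, $R$
$M(r_i) \rightarrow \{[k_1, v_1], [k_2, v_2], \ldots\}$
$R(k_i, \mathit{valSet}) \rightarrow [k_i, \mathit{valSet}']$
Let $S = \{[k, v] \mid [k, v] \in M(r) \text{ for some } r \in R\}$
Let $K = \{k \mid [k, v] \in S, \text{ for any } v\}$
Let $G(k) = \{v \mid [k, v] \in S\}$
Output $= \{[k, T] \mid k \in K,\ T = R(k, G(k))\}$

S is bag
G is bag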
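The definitions on this slide translate almost line by line into a sequential Python sketch. The function name `map_reduce` and the sample functions `M` and `R` are assumptions for illustration; bags are modeled as lists so that duplicates are kept.

```python
from collections import defaultdict


def map_reduce(records, M, R):
    """Sequential reading of the slide's definition of the Map-Reduce output."""
    S = [pair for r in records for pair in M(r)]       # bag S of [k, v] pairs
    G = defaultdict(list)                              # G(k): bag of values for k
    for k, v in S:
        G[k].append(v)
    return {k: R(k, vals) for k, vals in G.items()}    # the keys of G form K


# Hypothetical M and R: count word occurrences.
M = lambda r: [(w, 1) for w in r.split()]
R = lambda k, vals: sum(vals)
counts = map_reduce(["cat dog cat", "bat dog"], M, R)
print(counts)
```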
10
References

"MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, available at http://labs.google.com/papers/mapreduce-osdi04.pdf
"Pig Latin: A Not-So-Foreign Language for Data Processing", Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins, available at http://wiki.apache.org/pig/

11
Example: Counting Word Occurrences

map(String doc, String value)
  // doc is document name
  // value is document content
  for each word w in value
    EmitIntermediate(w, "1")
Example:
map(doc, "cat dog cat bat dog") emits [cat 1], [dog 1], [cat 1], [bat 1], [dog 1]

Why does map have 2 parameters?

12
Example: Counting Word Occurrences

reduce(String key, Iterator values)
  // key is a word
  // values is a list of counts
  int result = 0
  for each v in values
    result += ParseInt(v)
  Emit(AsString(result))
Example:
reduce(dog, "1 1 1 1") emits "4"

should emit (dog, 4)??
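The two pseudocode functions can be tried out with a small single-process Python sketch. The driver `word_count` is a hypothetical stand-in for the framework, and here `reduce_fn` returns the key together with the count, which is one possible answer to the question above rather than what the original pseudocode emits.

```python
from collections import defaultdict


def map_fn(doc, value):
    """doc is the document name, value is the document content."""
    for w in value.split():
        yield (w, "1")


def reduce_fn(key, values):
    """key is a word, values is a list of counts given as strings."""
    result = 0
    for v in values:
        result += int(v)
    return (key, str(result))


def word_count(docs):
    """Hypothetical driver: group map output by key, then reduce each group."""
    groups = defaultdict(list)
    for doc, value in docs.items():
        for k, v in map_fn(doc, value):
            groups[k].append(v)
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]


counts = word_count({"doc": "cat dog cat bat dog"})
print(counts)
```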
13
Google MR Overview
14
Implementation Issues

Combine function
File system
Partition of input, keys
Failures
Backup tasks
Ordering of results

15
Combine Function
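[Figure: without combine, map workers send repeated pairs such as [cat 1], [cat 1], [cat 1], ... and [dog 1], [dog 1], ... to the reduce workers; with combine, they send [cat 3], ... and [dog 2], ... instead.]
Combine is like a local reduce applied before distribution.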
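As a rough sketch of the idea, a combine function for word counting can sum the counts on each map worker before anything is sent over the network. The name `combine` is hypothetical; pre-aggregating like this is safe here because addition is associative and commutative.

```python
from collections import Counter


def combine(pairs):
    """Local reduce on one map worker: sum the counts per word."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


cat_out = combine([("cat", 1), ("cat", 1), ("cat", 1)])
dog_out = combine([("dog", 1), ("dog", 1)])
print(cat_out, dog_out)
```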
16
Distributed File System
reduce worker must be able to access local disks on map workers
all data transfers are through distributed file system
any worker must be able to write its part of answer; answer is left as distributed file
worker must be able to access any part of input file
17
Partition of input, keys

How many workers, partitions of input file?

How many workers?
How many splits? Best to have many splits per worker: improves load balance, and if a worker fails, it is easier to spread its tasks
Should workers be assigned to splits near them?
Similar questions for reduce workers
[Figure: an input file divided into splits 1, 2, 3, ..., 9, assigned to workers.]
18
Failures

Distributed implementation should produce same output as would have been produced by a non-faulty sequential execution of the program.
General strategy: Master detects worker failures, and has work re-done by another worker.

[Figure: the master checks the worker on split j ("ok?"); on failure it tells another worker to redo j.]
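The general strategy can be illustrated with a toy master loop. Everything here is an assumption made for the sketch: failures are modeled as a `RuntimeError`, the names `run_with_retries` and `flaky` are hypothetical, and a real master would detect failures by timeouts or pings rather than exceptions.

```python
def run_with_retries(splits, workers, process):
    """Toy master: give each split to a worker; on failure, redo it elsewhere."""
    results = {}
    for j, split in enumerate(splits):
        for worker in workers:                 # try workers in turn
            try:
                results[j] = process(worker, split)
                break                          # split j is done
            except RuntimeError:
                continue                       # worker failed: redo j on the next one
        else:
            raise RuntimeError(f"no worker could process split {j}")
    return results


def flaky(worker, split):
    """Hypothetical task: worker w1 is down, the others sum their split."""
    if worker == "w1":
        raise RuntimeError("worker down")
    return sum(split)


results = run_with_retries([[1, 2], [3]], ["w1", "w2"], flaky)
print(results)
```

Re-running a split is only safe when the task is deterministic, which is what lets the output match a non-faulty sequential execution.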
19
Backup Tasks

A straggler is a machine that takes unusually long (e.g., because of a bad disk) to finish its work.
A straggler can delay final completion.
When the overall job is close to finishing, the master schedules backup executions for the remaining in-progress tasks.

Must be able to eliminate redundant results
20
Ordering of Results

Final result (at each node) is in key order:
[k1, T1] [k2, T2] [k3, T3] [k4, T4]

also in key order:
[k1, v1] [k3, v3]
21
Example: Sorting Records
[Figure: workers W1, W2, W3 and W5, W6; annotation: "one or two records for k6?"]
Map: extract k, output [k, record]
Reduce: Do nothing!
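One way to see why an empty reduce is enough: if keys are assigned to reduce workers by range and each worker receives its pairs in key order, concatenating the workers' outputs yields a sorted result. The sketch below assumes range partitioning with explicit `boundaries`; the function name `mr_sort` and the sample records are hypothetical.

```python
import bisect


def mr_sort(records, key, boundaries):
    """Sort records with an identity reduce and range-partitioned keys."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        k = key(rec)                                    # map: emit [k, record]
        partitions[bisect.bisect_right(boundaries, k)].append((k, rec))
    # The framework orders each partition by key; reduce passes records through.
    return [[rec for _, rec in sorted(p, key=lambda kv: kv[0])]
            for p in partitions]


records = [(6, "f"), (1, "a"), (4, "d"), (6, "g"), (2, "b")]
parts = mr_sort(records, key=lambda r: r[0], boundaries=[3])
print(parts)
```

In this sketch both records with key 6 survive, because the identity reduce emits every value it receives.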
22
Other Issues

Skipping bad records
Debugging

23
MR Claimed Advantages

Model easy to use, hides details of
parallelization, fault recovery
Many problems expressible in MR framework
Scales to thousands of machines

24
MR Possible Disadvantages

1-input 2-stage data flow rigid, hard to adapt to
other scenarios
Custom code needs to be written even for the most
common operations, e.g., projection and filtering
Opaque nature of map, reduce functions impedes
optimization

25
Questions

Can MR be made more declarative?
How can we perform joins?
How can we perform approximate grouping?
Example: for all keys that are similar, reduce all values for those keys

26
Additional Topics

Hadoop: open-source Map-Reduce system
Pig: Yahoo system that builds on MR but is more declarative

27
Pig & Pig Latin

A layer on top of map-reduce (Hadoop)
Pig is the system
Pig Latin is the query language
Pig Latin is a hybrid between
high-level declarative query language in the
spirit of SQL
low-level, procedural programming à la map-reduce.

28
Example

Table urls (url, category, pagerank)
Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category. In SQL:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6

29
Example in Pig Latin

SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
In Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

30
good_urls = FILTER urls BY pagerank > 0.2;
urls: (url, category, pagerank)
good_urls: (url, category, pagerank)
31
groups = GROUP good_urls BY category;
good_urls: (url, category, pagerank)
groups: (category, good_urls)
32
big_groups = FILTER groups BY COUNT(good_urls) > 1;
groups: (category, good_urls)
big_groups: (category, good_urls)
33
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
big_groups: (category, good_urls)
output: (category, AVG(good_urls.pagerank))
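To make the four-step dataflow concrete, here is a plain Python rendering of it. This is only a sketch of the semantics, not how Pig executes the program; the function name `avg_pagerank`, its parameters, and the sample rows are hypothetical, and the group-size threshold defaults to 1 as in the step-by-step slides.

```python
from collections import defaultdict


def avg_pagerank(urls, min_pagerank=0.2, min_count=1):
    """Each line mirrors one Pig Latin statement from the example."""
    good_urls = [u for u in urls if u[2] > min_pagerank]          # FILTER
    groups = defaultdict(list)                                     # GROUP BY category
    for u in good_urls:
        groups[u[1]].append(u)
    big_groups = {c: g for c, g in groups.items()
                  if len(g) > min_count}                           # FILTER on COUNT
    return {c: sum(u[2] for u in g) / len(g)
            for c, g in big_groups.items()}                        # FOREACH ... AVG


# Hypothetical rows of (url, category, pagerank).
urls = [("a.com", "news", 0.75), ("b.com", "news", 0.5),
        ("c.com", "news", 0.125), ("d.com", "sports", 0.875)]
output = avg_pagerank(urls)
print(output)
```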
34
Features

Similar to specifying a query execution plan
(i.e., a dataflow graph), thereby making it
easier for programmers to understand and control
how their data processing task is executed.
Support for a flexible, fully nested data model
Extensive support for user-defined functions
Ability to operate over plain input files without
any schema information.
Novel debugging environment useful when dealing
with enormous data sets.

35
Execution Control: Good or Bad?

Example:
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Should system re-order filters?

36
User Defined Functions

Example:
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);

should be groups.url ?
.gov (x.fbi.gov, .gov, 0.7) ...
.edu (y.yale.edu, .edu, 0.5) ...
.com (z.cnn.com, .com, 0.9) ...
UDF top10 can return scalar or set
.gov (fbi.gov) (cia.gov) ...
.edu (yale.edu) ...
.com (cnn.com) (ibm.com) ...
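A UDF such as top10 is just a function applied to the bag of each group. The Python sketch below assumes that top10 ranks by pagerank and returns a bag of one-field tuples; that behavior, the helper `group_by_category`, and the sample rows are assumptions for illustration, since the slide does not define the UDF.

```python
from collections import defaultdict


def top10(urls):
    """Hypothetical UDF: the 10 highest-pagerank urls of one group, as a bag."""
    best = sorted(urls, key=lambda u: u[2], reverse=True)[:10]
    return [(u[0],) for u in best]


def group_by_category(urls):
    """GROUP urls BY category, as a dict from category to bag of tuples."""
    groups = defaultdict(list)
    for u in urls:
        groups[u[1]].append(u)
    return groups


urls = [("x.fbi.gov", ".gov", 0.7), ("y.cia.gov", ".gov", 0.9),
        ("z.cnn.com", ".com", 0.9)]
output = {cat: top10(bag) for cat, bag in group_by_category(urls).items()}
print(output)
```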
37
Data Model

Atom, e.g., 'alice'
Tuple, e.g., ('alice', 'lakers')
Bag, e.g., { ('alice', 'lakers') ('alice', ('iPod', 'apple')) }
Map, e.g., [ 'fan of' → { ('lakers') ('iPod') }, 'age' → 20 ]

Note: Bags can currently only hold tuples. So {1, 2, 3} is stored as {(1) (2) (3)}
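For readers more familiar with Python, the four kinds of values can be mimicked with built-in types. The mapping chosen here (bag as a list of tuples, map as a dict) and the helper `to_bag` are assumptions for illustration only.

```python
atom = "alice"
tup = ("alice", "lakers")
bag = [("alice", "lakers"), ("alice", ("iPod", "apple"))]    # tuples may nest
mapping = {"fan of": [("lakers",), ("iPod",)], "age": 20}    # values of any type


def to_bag(values):
    """Wrap each atom in a 1-tuple, since bags can only hold tuples."""
    return [(v,) for v in values]


print(to_bag([1, 2, 3]))
```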
38
Expressions in Pig Latin
Should be (1) (2)
See flatten examples ahead
39
Specifying Input Data
queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);

queries: handle for future use
'query_log.txt': input file
myLoad(): custom deserializer
(userId, queryString, timestamp): output schema
40
For Each

expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
See example next slide
Note: each tuple is processed independently; good for parallelism
To remove one level of nesting:
expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));

41
ForEach and Flattening
'lakers rumors' is a single string value
plus userId
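The difference between the nested and flattened forms of FOREACH can be sketched in Python. The function `expand_query` is a hypothetical stand-in for the expandQuery UDF; here it simply returns a bag of two one-field tuples.

```python
def expand_query(q):
    """Hypothetical stand-in for expandQuery: returns a bag of expansions."""
    return [(q + " rumors",), (q + " news",)]


def foreach_nested(queries):
    # FOREACH queries GENERATE userId, expandQuery(queryString)
    return [(uid, expand_query(q)) for uid, q, _ in queries]


def foreach_flattened(queries):
    # FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString))
    return [(uid,) + t for uid, q, _ in queries for t in expand_query(q)]


queries = [("alice", "lakers", 1)]
nested = foreach_nested(queries)
flat = foreach_flattened(queries)
print(nested)
print(flat)
```

Each input tuple is handled on its own, so the loop bodies could run on different workers without coordination.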
42
Flattening Example (Fill In)
X (A, B, C)
Y = FOREACH X GENERATE A, FLATTEN(B), C;
43
Flattening Example (Fill In)
Y = FOREACH X GENERATE A, FLATTEN(B), C;
Z = FOREACH Y GENERATE A, B, FLATTEN(C);
Is Z = Z' where
Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C) ?
44
Flattening Example
X (A, B, C)
Note: first tuple is (a1, b1, b2, {(c1)(c2)})
Y = FOREACH X GENERATE A, FLATTEN(B), C;
Flatten is not recursive
Note: attribute naming gets complicated. For example, $2 for first tuple is b2; for third tuple it is {(c1)(c2)}.
45
Flattening Example
Y = FOREACH X GENERATE A, FLATTEN(B), C;
Z = FOREACH Y GENERATE A, B, FLATTEN(C);
Note that Z = Z' where
Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C);
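The claim that flattening in two steps gives the same result as flattening both fields at once can be checked on a small example. The table X below is hypothetical (the slide's own data is not available here), with B and C modeled as bags of one-field tuples; the function names are also assumptions.

```python
def flatten_b(X):
    # Y = FOREACH X GENERATE A, FLATTEN(B), C
    return [(a,) + b + (C,) for a, B, C in X for b in B]


def flatten_last(Y):
    # Z = FOREACH Y GENERATE A, B, FLATTEN(C)   (C is the last field of Y)
    return [row[:-1] + c for row in Y for c in row[-1]]


def flatten_both(X):
    # Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)
    return [(a,) + b + c for a, B, C in X for b in B for c in C]


# Hypothetical table X with bag-valued fields B and C.
X = [("a1", [("b1",), ("b2",)], [("c1",), ("c2",)]),
     ("a2", [("b3",)], [("c3",)])]
Y = flatten_b(X)
Z = flatten_last(Y)
Z_prime = flatten_both(X)
print(Z == Z_prime)
```

Note that `flatten_b` leaves C untouched as a nested bag, which mirrors the remark that flatten is not recursive.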
46
Filter

real_queries = FILTER queries BY userId neq 'bot';
real_queries = FILTER queries BY NOT isBot(userId);

isBot: UDF function
47
Co-Group

Two data sets, for example:
results: (queryString, url, position)
revenue: (queryString, adSlot, amount)
grouped_data = COGROUP results BY queryString, revenue BY queryString;
url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue));
Co-Group more flexible than SQL JOIN
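A small Python sketch may help show what COGROUP produces and why it is more general than a join. The function names and sample rows are hypothetical. Each output tuple holds the group key plus one bag per input, and an equi-join falls out as the cross product of the two bags within each group.

```python
from collections import defaultdict


def cogroup(results, revenue):
    """COGROUP results BY queryString, revenue BY queryString (field 0 of each)."""
    groups = defaultdict(lambda: ([], []))
    for r in results:
        groups[r[0]][0].append(r)
    for r in revenue:
        groups[r[0]][1].append(r)
    return [(k, res, rev) for k, (res, rev) in sorted(groups.items())]


def join(results, revenue):
    """Equi-join derived from the cogroup: cross product inside each group."""
    return [r + v for _, res, rev in cogroup(results, revenue)
            for r in res for v in rev]


results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("iPod", "side", 20)]
grouped = cogroup(results, revenue)
joined = join(results, revenue)
print(grouped)
print(joined)
```

Unlike the join, the cogroup keeps keys that appear in only one input (with an empty bag for the other), and it lets a UDF such as distributeRevenue see both bags together.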

48
CoGroup vs Join
49
Group (Simple CoGroup)

grouped_revenue = GROUP revenue BY queryString;
query_revenues = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;
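In Python terms, the GROUP followed by SUM amounts to a per-key total. The function name `total_revenue` and the sample rows are hypothetical.

```python
from collections import defaultdict


def total_revenue(revenue):
    """GROUP revenue BY queryString, then SUM(revenue.amount) per group."""
    totals = defaultdict(int)
    for query_string, ad_slot, amount in revenue:
        totals[query_string] += amount
    return sorted(totals.items())


revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 10)]
query_revenues = total_revenue(revenue)
print(query_revenues)
```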

50
CoGroup Example 1
X (A, B, C)
Y (A, B, D)
Z1 = GROUP X BY A;
Z1 (A, X)
51
CoGroup Example 1
X (A, B, C)
Y (A, B, D)
Z1 = GROUP X BY A;
Z1 (A, X)
52
CoGroup Example 2
X (A, B, C)
Y (A, B, D)
Syntax not in paper but being added
Z2 = GROUP X BY (A, B);
Z2 (?, X)
53
CoGroup Example 2
X (A, B, C)
Y (A, B, D)
Syntax not in paper but being added
Z2 = GROUP X BY (A, B);
Z2 (A/B?, X)
54
CoGroup Example 3
X (A, B, C)
Y (A, B, D)
Z3 = COGROUP X BY A, Y BY A;
Z3 (A, X, Y)
55
CoGroup Example 3
X (A, B, C)
Y (A, B, D)
Z3 = COGROUP X BY A, Y BY A;
Z3 (A, X, Y)
56
CoGroup Example 4
X (A, B, C)
Y (A, B, D)
Z4 = COGROUP X BY A, Y BY B;
Z4 (A, X, Y)
57
CoGroup Example 4
X (A, B, C)
Y (A, B, D)
Z4 = COGROUP X BY A, Y BY B;
Z4 (A, X, Y)
58
CoGroup With Function Call?
X (A, B)
Y = GROUP X BY A;
Z = GROUP X BY SUM(A);
SUM: adds integers in tuple
Y (A, X)
Z (?, X)
59
CoGroup With Function Call?
X (A, B)
Y = GROUP X BY A;
Z = GROUP X BY SUM(A);
SUM: adds integers in tuple
Y (A, X)
Z (SUM(A)/A?, X)
60
Pig Latin Join