CS347: Map-Reduce - PowerPoint PPT Presentation

About This Presentation
Title:

CS347: Map-Reduce

Description:

KDD Infrastructure – PowerPoint PPT presentation

Slides: 64
Provided by: hec73
Category:
Tags: cs347 | hive | map | reduce

Transcript and Presenter's Notes

Title: CS347: Map-Reduce


1
CS347: Map-Reduce & Pig
  • Hector Garcia-Molina
  • Stanford University

2
"Big Data" Open Source Systems
  • Infrastructure for distributed data computations
  • Map-Reduce, S4, Hyracks, Pregel, Storm, Muppet
  • Components
  • Memcached, ZooKeeper, Kestrel
  • Data services
  • Pig, F1, Cassandra, HBase, Bigtable, Hive

3
Motivation for Map-Reduce
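Recall one of our sort strategies:
[Figure: fragments R1, R2, R3 are repartitioned on split keys k0 and k1, each partition is sorted locally, and the sorted partitions form the result. Stage labels: process data, partition, additional processing.]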
4
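Another example: asymmetric fragment-and-replicate join
[Figure: R is partitioned by function f into Ra and Rb; Ra and Rb are joined locally with Sa and Sb, and the local results are unioned into the final result. Stage labels: process data, partition, additional processing.]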
5
Building Text Index - Part I
Original Map-Reduce application...
[Figure: a page stream (page 1: rat, dog; page 2: dog, cat; page 3: rat, dog) goes through loading, tokenizing, and sorting. The (word, page) pairs (rat, 1) (dog, 1) (dog, 2) (cat, 2) (rat, 3) (dog, 3) are sorted into (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3) and flushed to disk as intermediate runs.]
6
Building Text Index - Part II
[Figure: the intermediate runs are merged into the final index.]
7
Generalizing Map-Reduce
[Figure: the Part I text-index pipeline (loading, tokenizing, sorting, and flushing intermediate runs to disk), now labeled Map.]
8
Generalizing Map-Reduce
[Figure: the merge of the intermediate runs into the final index, now labeled Reduce.]
9
Map Reduce
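  • Input: $R = \{r_1, r_2, \ldots, r_n\}$, functions $M$, $R$
  • $M(r_i) \rightarrow \{[k_1, v_1], [k_2, v_2], \ldots\}$
  • $R(k_i, \mathit{valSet}) \rightarrow [k_i, \mathit{valSet}']$
  • Let $S = \{[k, v] \mid [k, v] \in M(r) \text{ for some } r \in R\}$
  • Let $K = \{k \mid [k, v] \in S \text{ for any } v\}$
  • Let $G(k) = \{v \mid [k, v] \in S\}$
  • Output $= \{[k, T] \mid k \in K,\ T = R(k, G(k))\}$

S is a bag
G is a bag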
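The definitions above can be read as a small sequential program. The following is a minimal Python sketch of that reading, not the distributed implementation; the name `map_reduce` and the toy `M` and `R` functions are assumptions made for illustration.

```python
from collections import defaultdict

def map_reduce(records, M, R):
    """Sequential reading of the definition: build S, group it into G, apply R per key."""
    # S: bag of all (k, v) pairs emitted by M over every input record
    S = [pair for r in records for pair in M(r)]
    # G(k): bag of values v such that (k, v) is in S; the keys of G form K
    G = defaultdict(list)
    for k, v in S:
        G[k].append(v)
    # Output: one (k, T) per key k in K, with T = R(k, G(k))
    return {k: R(k, vals) for k, vals in G.items()}

# Toy functions (assumptions): total string length per first letter
def M(r):
    return [(r[0], len(r))]

def R(k, vals):
    return sum(vals)

result = map_reduce(["cat", "cow", "dog"], M, R)
print(result)
```

Because S and G are bags, duplicate pairs are kept, which is why `G` is built from lists rather than sets.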
10
References
  • MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, available at http://labs.google.com/papers/mapreduce-osdi04.pdf
  • Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins, available at http://wiki.apache.org/pig/

11
Example: Counting Word Occurrences
  • map(String doc, String value):
      // doc is document name
      // value is document content
      for each word w in value:
        EmitIntermediate(w, "1");
  • Example:
  • map(doc, "cat dog cat bat dog") emits [cat 1], [dog 1], [cat 1], [bat 1], [dog 1]
  • Why does map have 2 parameters?

12
Example: Counting Word Occurrences
  • reduce(String key, Iterator values):
      // key is a word
      // values is a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
  • Example:
  • reduce("dog", "1 1 1 1") emits "4"

Should it emit (dog, 4)?
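The two word-count functions can be tried end to end in a few lines. This is a minimal single-process sketch in Python; the names `map_fn`, `reduce_fn`, and `word_count` are assumptions, and the reduce here returns the (key, count) pair, which is one possible answer to the question on the slide.

```python
from collections import defaultdict

def map_fn(doc, value):
    """doc is the document name, value is the document content."""
    return [(w, "1") for w in value.split()]

def reduce_fn(key, values):
    """key is a word, values is an iterable of counts encoded as strings."""
    result = 0
    for v in values:
        result += int(v)
    return (key, str(result))

def word_count(docs):
    # Group intermediate pairs by key, then reduce each group in key order
    groups = defaultdict(list)
    for name, content in docs.items():
        for k, v in map_fn(name, content):
            groups[k].append(v)
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]

counts = word_count({"doc": "cat dog cat bat dog"})
print(counts)
```

Note that `map_fn` never uses `doc` in this example; the first parameter matters for applications, such as the text index, where the output refers to the document.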
13
Google MR Overview
14
Implementation Issues
  • Combine function
  • File system
  • Partition of input, keys
  • Failures
  • Backup tasks
  • Ordering of results

15
Combine Function
[Figure: without combining, map workers send cat 1, cat 1, cat 1... and dog 1, dog 1... to the reduce workers; with combining, they send cat 3... and dog 2...]
Combine is like a local reduce applied before distribution.
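The effect of a combine function can be seen by counting how many pairs leave the map workers. This is a minimal Python sketch for word counting, where the combiner is valid because addition is associative and commutative; the function names and the two input splits are assumptions.

```python
from collections import Counter, defaultdict

def map_words(text):
    return [(w, 1) for w in text.split()]

def combine(pairs):
    """Local reduce on one map worker: collapse repeated keys before sending."""
    local = Counter()
    for k, v in pairs:
        local[k] += v
    return list(local.items())

def reduce_counts(all_pairs):
    totals = defaultdict(int)
    for k, v in all_pairs:
        totals[k] += v
    return dict(totals)

splits = ["cat cat cat dog", "dog cat"]          # one split per map worker
raw = [map_words(s) for s in splits]
combined = [combine(p) for p in raw]

sent_without = sum(len(p) for p in raw)          # pairs shipped without a combiner
sent_with = sum(len(p) for p in combined)        # pairs shipped with a combiner
totals = reduce_counts(p for part in combined for p in part)
totals_raw = reduce_counts(p for part in raw for p in part)
print(sent_without, sent_with, totals)
```

The final counts are the same either way; only the amount of intermediate data changes.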
16
Distributed File System
Reduce workers must be able to access local disks on map workers.
All data transfers are through the distributed file system.
Any worker must be able to write its part of the answer; the answer is left as a distributed file.
Workers must be able to access any part of the input file.
17
Partition of input, keys
  • How many workers, partitions of input file?

How many workers? Best to have many splits per worker: this improves load balance, and if a worker fails, it is easier to spread its tasks.
How many splits?
Should workers be assigned to splits near them?
Similar questions for reduce workers.
[Figure: an input file divided into numbered splits (1, 2, 3, ..., 9), with workers assigned to them.]
18
Failures
  • Distributed implementation should produce same
    output as would have been produced by a
    non-faulty sequential execution of the program.
  • General strategy: master detects worker failures and has the work re-done by another worker.

[Figure: the master checks on the worker processing split j ("ok?"); on failure, it tells another worker to redo j.]
19
Backup Tasks
  • Straggler is a machine that takes unusually long
    (e.g., bad disk) to finish its work.
  • A straggler can delay final completion.
  • When the overall job is close to finishing, the master schedules backup executions for the remaining in-progress tasks.

Must be able to eliminate redundant results
20
Ordering of Results
  • Final result (at each node) is in key order

[Figure: pairs [k1, v1], [k3, v3], ... flow into the final result [k1, T1], [k2, T2], [k3, T3], [k4, T4]; a callout on the diagram reads "also in key order".]
21
Example: Sorting Records
[Figure: dataflow among workers W1, W2, W3, W5, W6. Callout: one or two records for k6?]
Map: extract k, output [k, record]
Reduce: do nothing!
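Sorting with an identity reduce works because the framework already delivers pairs to each reduce node in key order. The sketch below simulates this in Python under two assumptions that are not on the slide: keys are range-partitioned across reduce workers (so concatenating their outputs gives a global order), and the identity reduce emits every record, so two records with key 6 both appear in the output. All names are hypothetical.

```python
import bisect

def map_extract(record, key_of):
    """Map: extract k, output (k, record)."""
    return (key_of(record), record)

def partition(key, split_points):
    """Range partitioner: reduce worker i receives the i-th key range."""
    return bisect.bisect_right(split_points, key)

def mr_sort(splits, key_of, split_points):
    buckets = [[] for _ in range(len(split_points) + 1)]
    for split in splits:                      # each split is handled by one map worker
        for record in split:
            k, rec = map_extract(record, key_of)
            buckets[partition(k, split_points)].append((k, rec))
    # Pairs reach each reduce worker in key order; reduce is the identity
    return [[rec for _, rec in sorted(b, key=lambda kv: kv[0])] for b in buckets]

splits = [[(6, "f"), (1, "a")], [(4, "d"), (6, "f2")], [(2, "b")]]
out = mr_sort(splits, key_of=lambda r: r[0], split_points=[3])
print(out)
```

With a hash partitioner instead of a range partitioner, each reduce worker's output would still be sorted, but the concatenation would not be.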
22
Other Issues
  • Skipping bad records
  • Debugging

23
MR Claimed Advantages
  • Model easy to use, hides details of
    parallelization, fault recovery
  • Many problems expressible in MR framework
  • Scales to thousands of machines

24
MR Possible Disadvantages
  • 1-input 2-stage data flow rigid, hard to adapt to
    other scenarios
  • Custom code needs to be written even for the most
    common operations, e.g., projection and filtering
  • Opaque nature of map, reduce functions impedes
    optimization

25
Questions
  • Can MR be made more declarative?
  • How can we perform joins?
  • How can we perform approximate grouping?
  • Example: for all keys that are similar, reduce all values for those keys

26
Additional Topics
  • Hadoop: open-source Map-Reduce system
  • Pig: Yahoo system that builds on MR but is more declarative

27
Pig & Pig Latin
  • A layer on top of map-reduce (Hadoop)
  • Pig is the system
  • Pig Latin is the query language
  • Pig Latin is a hybrid between
  • high-level declarative query language in the
    spirit of SQL
  • low-level, procedural programming à la map-reduce.

28
Example
  • Table urls: (url, category, pagerank)
  • Find, for each sufficiently large category, the
    average pagerank of high-pagerank urls in that
    category. In SQL:
  • SELECT category, AVG(pagerank)
    FROM urls WHERE pagerank > 0.2
    GROUP BY category HAVING COUNT(*) > 10^6

29
Example in Pig Latin
  • SELECT category, AVG(pagerank)
    FROM urls WHERE pagerank > 0.2
    GROUP BY category HAVING COUNT(*) > 10^6
  • In Pig Latin:
  • good_urls = FILTER urls BY pagerank > 0.2
    groups = GROUP good_urls BY category
    big_groups = FILTER groups BY COUNT(good_urls) > 10^6
    output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)

30
good_urls = FILTER urls BY pagerank > 0.2
urls: (url, category, pagerank)
good_urls: (url, category, pagerank)
31
groups = GROUP good_urls BY category
good_urls: (url, category, pagerank)
groups: (category, good_urls)
32
big_groups = FILTER groups BY COUNT(good_urls) > 1
groups: (category, good_urls)
big_groups: (category, good_urls)
33
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
big_groups: (category, good_urls)
output: (category, AVG(good_urls.pagerank))
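The four Pig Latin steps can be mimicked one for one with ordinary collections, which makes the intermediate schemas easy to inspect. This is a rough Python analogue, not how Pig executes the program; the sample rows and the `MIN_GROUP_SIZE` threshold (standing in for the group-size cutoff) are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: (url, category, pagerank)
urls = [
    ("x.fbi.gov", ".gov", 0.75),
    ("cia.gov", ".gov", 0.5),
    ("y.yale.edu", ".edu", 0.5),
    ("z.cnn.com", ".com", 0.125),
]
MIN_GROUP_SIZE = 1  # assumption: small stand-in for the size threshold

# good_urls = FILTER urls BY pagerank > 0.2
good_urls = [u for u in urls if u[2] > 0.2]

# groups = GROUP good_urls BY category  ->  (category, bag of good_urls tuples)
groups = defaultdict(list)
for u in good_urls:
    groups[u[1]].append(u)

# big_groups = FILTER groups BY COUNT(good_urls) > MIN_GROUP_SIZE
big_groups = {c: bag for c, bag in groups.items() if len(bag) > MIN_GROUP_SIZE}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
output = [(c, mean(u[2] for u in bag)) for c, bag in big_groups.items()]
print(output)
```

Each intermediate name holds a complete relation, which mirrors the step-by-step dataflow style of the Pig Latin version.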
34
Features
  • Similar to specifying a query execution plan
    (i.e., a dataflow graph), thereby making it
    easier for programmers to understand and control
    how their data processing task is executed.
  • Support for a flexible, fully nested data model
  • Extensive support for user-defined functions
  • Ability to operate over plain input files without
    any schema information.
  • Novel debugging environment useful when dealing
    with enormous data sets.

35
Execution Control Good or Bad?
  • Example:
    spam_urls = FILTER urls BY isSpam(url)
    culprit_urls = FILTER spam_urls BY pagerank > 0.8
  • Should system re-order filters?
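One way to see what is at stake in re-ordering is to count UDF calls under both orders. The Python sketch below assumes that the spam test is expensive and free of side effects, so that swapping the filters cannot change the result; `is_spam` and the sample rows are hypothetical stand-ins.

```python
calls = {"isSpam": 0}

def is_spam(url):
    """Hypothetical stand-in for an expensive UDF; counts how often it runs."""
    calls["isSpam"] += 1
    return "spam" in url

# Hypothetical rows: (url, pagerank)
urls = [("spam-a.com", 0.9), ("ok.org", 0.95), ("spam-b.com", 0.3), ("fine.net", 0.1)]

# Order as written: UDF filter first, then the cheap pagerank filter
calls["isSpam"] = 0
spam_urls = [u for u in urls if is_spam(u[0])]
culprit_urls = [u for u in spam_urls if u[1] > 0.8]
calls_as_written = calls["isSpam"]

# Re-ordered: cheap filter first, so the UDF sees fewer tuples
calls["isSpam"] = 0
high_rank = [u for u in urls if u[1] > 0.8]
culprit_reordered = [u for u in high_rank if is_spam(u[0])]
calls_reordered = calls["isSpam"]

print(calls_as_written, calls_reordered)
```

Both orders return the same tuples here, but the system cannot know in general whether a UDF is cheap, selective, or side-effect free, which is the tension behind the question.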

36
User Defined Functions
  • Example
  • groups = GROUP urls BY category
  • output = FOREACH groups GENERATE category, top10(urls)

Should this be groups.url?
groups:
.gov: (x.fbi.gov, .gov, 0.7) ...
.edu: (y.yale.edu, .edu, 0.5) ...
.com: (z.cnn.com, .com, 0.9) ...
UDF top10 can return a scalar or a set.
output:
.gov: (fbi.gov) (cia.gov) ...
.edu: (yale.edu) ...
.com: (cnn.com) (ibm.com) ...
37
Data Model
  • Atom, e.g., 'alice'
  • Tuple, e.g., ('alice', 'lakers')
  • Bag, e.g., { ('alice', 'lakers'), ('alice', ('iPod', 'apple')) }
  • Map, e.g., [ 'fan of' → { ('lakers'), ('iPod') }, 'age' → 20 ]

Note: bags can currently only hold tuples, so {1, 2, 3} is stored as {(1), (2), (3)}.
38
Expressions in Pig Latin
Should be (1) (2)
See flatten examples ahead
39
Specifying Input Data
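  • queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp)

Callouts: queries is the handle for future use; 'query_log.txt' is the input file; myLoad() is a custom deserializer; (userId, queryString, timestamp) is the output schema.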
40
For Each
  • expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString)
  • See example next slide
  • Note: each tuple is processed independently, which is good for parallelism
  • To remove one level of nesting:
    expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString))

41
ForEach and Flattening
'lakers rumors' is a single string value
plus userId
42
Flattening Example (Fill In)
X: (A, B, C)
Y = FOREACH X GENERATE A, FLATTEN(B), C
43
Flattening Example (Fill In)
Y = FOREACH X GENERATE A, FLATTEN(B), C
Z = FOREACH Y GENERATE A, B, FLATTEN(C)
Is Z = Z', where
Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)?
44
Flattening Example
X: (A, B, C)
Note: first tuple is (a1, b1, b2, {(c1), (c2)})
Y = FOREACH X GENERATE A, FLATTEN(B), C
Flatten is not recursive.
Note: attribute naming gets complicated. For example, $2 for the first tuple is b2; for the third tuple it is {(c1), (c2)}.
45
Flattening Example
Y = FOREACH X GENERATE A, FLATTEN(B), C
Z = FOREACH Y GENERATE A, B, FLATTEN(C)
Note that Z = Z', where
Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)
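The flattening examples can be replayed on a small relation to check that flattening in two steps matches flattening both bags in one statement. In this Python sketch, bags are lists of tuples, `flatten_at` is a hypothetical helper, the data in `X` is made up, and `Z_prime` names the single-statement variant.

```python
def flatten_at(rows, idx):
    """FLATTEN the bag at position idx: one output row per tuple in the bag,
    with that tuple's fields spliced into the row (one level only)."""
    out = []
    for row in rows:
        i = idx % len(row)            # allow negative positions such as -1
        for t in row[i]:
            out.append(row[:i] + t + row[i + 1:])
    return out

# Hypothetical X with schema (A, B, C); B and C are bags (lists) of tuples
X = [
    ("a1", [("b1", "b2")], [("c1",), ("c2",)]),
    ("a2", [("b3",), ("b4",)], [("c3",)]),
]

Y = flatten_at(X, 1)            # Y = FOREACH X GENERATE A, FLATTEN(B), C
Z = flatten_at(Y, -1)           # Z = FOREACH Y GENERATE A, B, FLATTEN(C)

# Single statement: FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)
# Flattening two bags in one statement yields their cross product per row
Z_prime = [(a,) + b + c for a, B, C in X for b in B for c in C]

print(Y[0])
print(Z == Z_prime)
```

The rows of `Y` have different lengths because the tuples inside `B` do, so position 2 holds `b2` in the first row and the `C` bag in the others; this is why the sketch addresses `C` from the end with `-1`.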
46
Filter
  • real_queries = FILTER queries BY userId neq 'bot'
  • real_queries = FILTER queries BY NOT isBot(userId)

isBot is a UDF.
47
Co-Group
  • Two data sets, for example:
  • results: (queryString, url, position)
  • revenue: (queryString, adSlot, amount)
  • grouped_data = COGROUP results BY queryString, revenue BY queryString
  • url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue))
  • Co-Group is more flexible than SQL JOIN

48
CoGroup vs Join
49
Group (Simple CoGroup)
  • grouped_revenue = GROUP revenue BY queryString
  • query_revenues = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue

50
CoGroup Example 1
X: (A, B, C)
Y: (A, B, D)
Z1 = GROUP X BY A
Z1: (A, X)
51
CoGroup Example 1
X: (A, B, C)
Y: (A, B, D)
Z1 = GROUP X BY A
Z1: (A, X)
52
CoGroup Example 2
X: (A, B, C)
Y: (A, B, D)
Syntax not in paper but being added:
Z2 = GROUP X BY (A, B)
Z2: (?, X)
53
CoGroup Example 2
X: (A, B, C)
Y: (A, B, D)
Syntax not in paper but being added:
Z2 = GROUP X BY (A, B)
Z2: (A/B?, X)
54
CoGroup Example 3
X: (A, B, C)
Y: (A, B, D)
Z3 = COGROUP X BY A, Y BY A
Z3: (A, X, Y)
55
CoGroup Example 3
X: (A, B, C)
Y: (A, B, D)
Z3 = COGROUP X BY A, Y BY A
Z3: (A, X, Y)
56
CoGroup Example 4
X: (A, B, C)
Y: (A, B, D)
Z4 = COGROUP X BY A, Y BY B
Z4: (A, X, Y)
57
CoGroup Example 4
X: (A, B, C)
Y: (A, B, D)
Z4 = COGROUP X BY A, Y BY B
Z4: (A, X, Y)
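The COGROUP examples can be reproduced on small inputs to see what each output tuple contains. This is a minimal Python sketch; the `cogroup` helper and the rows of `X` and `Y` are hypothetical, and bags are modeled as lists.

```python
from collections import defaultdict

def cogroup(left, left_key, right, right_key):
    """COGROUP: one output tuple (key, left_bag, right_bag) per distinct key value."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[left_key(t)][0].append(t)
    for t in right:
        groups[right_key(t)][1].append(t)
    return [(k, lb, rb) for k, (lb, rb) in sorted(groups.items())]

# Hypothetical relations with the slide schemas X(A, B, C) and Y(A, B, D)
X = [(1, 2, "c1"), (1, 3, "c2"), (2, 1, "c3")]
Y = [(1, 1, "d1"), (3, 2, "d2")]

Z3 = cogroup(X, lambda t: t[0], Y, lambda t: t[0])   # COGROUP X BY A, Y BY A
Z4 = cogroup(X, lambda t: t[0], Y, lambda t: t[1])   # COGROUP X BY A, Y BY B

for row in Z3:
    print(row)
```

A key that appears in only one input still produces an output tuple, with an empty bag for the other input.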
58
CoGroup With Function Call?
X: (A, B)
Y = GROUP X BY A
Z = GROUP X BY SUM(A)
SUM adds the integers in a tuple.
Y: (A, X)
Z: (?, X)
59
CoGroup With Function Call?
X: (A, B)
Y = GROUP X BY A
Z = GROUP X BY SUM(A)
SUM adds the integers in a tuple.
Y: (A, X)
Z: (SUM(A)/A?, X)
60
Pig Latin Join
  • join_result = JOIN results BY queryString, revenue BY queryString
  • Shorthand for:
  • temp_var = COGROUP results BY queryString, revenue BY queryString
  • join_result = FOREACH temp_var GENERATE FLATTEN(results), FLATTEN(revenue)
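The equivalence between JOIN and COGROUP followed by two FLATTENs can be checked on sample data. In this Python sketch, the rows of `results` and `revenue` are hypothetical, bags are lists, and the cogroup is built inline with a dictionary.

```python
from collections import defaultdict

# Hypothetical rows: results(queryString, url, position), revenue(queryString, adSlot, amount)
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("iPod", "top", 10)]

# temp_var = COGROUP results BY queryString, revenue BY queryString
temp_var = defaultdict(lambda: ([], []))
for t in results:
    temp_var[t[0]][0].append(t)
for t in revenue:
    temp_var[t[0]][1].append(t)

# join_result = FOREACH temp_var GENERATE FLATTEN(results), FLATTEN(revenue)
# Flattening two bags yields their cross product; an empty bag yields no rows
join_result = [r + v for res_bag, rev_bag in temp_var.values()
               for r in res_bag for v in rev_bag]

# Direct nested-loop equi-join for comparison
direct = [r + v for r in results for v in revenue if r[0] == v[0]]

print(len(join_result), join_result == direct)
```

Keys present in only one input ("kings" and "iPod" here) drop out because one of their bags is empty, which gives inner-join behavior.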

61
MapReduce in Pig Latin
  • map_result = FOREACH input GENERATE FLATTEN(map(*))
  • key_groups = GROUP map_result BY $0
  • output = FOREACH key_groups GENERATE reduce(*)

$0: key is first attribute
*: all attributes
62
Store
  • To materialize result in a file
  • STORE query_revenues INTO 'myoutput' USING myStore()

'myoutput' is the output file; myStore() is a custom serializer.
63
Hadoop
  • HDFS: Hadoop file system
  • How to use Hadoop, examples