1
Map-Reduce, Hadoop, and Map-Reduce-Merge
2
Presentation Overview
  • What is map-reduce?
  • input/output data types
  • why is it useful and where is it used?
  • Execution overview
  • Features
  • fault tolerance
  • ordering guarantee
  • other perks and bonuses
  • Hands-on demonstration and follow-along
  • Map-reduce-merge

3
What is map-reduce?
  • Map-reduce is a programming model (and an
    associated implementation) for processing and
    generating large data sets.
  • It consists of two steps: map and reduce.
  • The map step takes a key/value pair and
    produces a set of intermediate key/value pairs.
  • The reduce step takes a key and the list of that
    key's values and merges them into a final,
    typically smaller, list of values.

4
Types
  • map: (k1, v1) → list(k2, v2)
  • reduce: (k2, list(v2)) → list(v2)
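
To make these types concrete, here is a toy, in-memory Java sketch of the
same flow (illustrative only, using no Hadoop APIs): the map step emits
intermediate (word, 1) pairs, grouping collects them per key, and the
reduce step collapses each key's value list.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Toy illustration of map: (k1, v1) -> list(k2, v2) and
    // reduce: (k2, list(v2)) -> list(v2). Not Hadoop code.
    public class TypeFlow {
        public static void main(String[] args) {
            List<String> input = Arrays.asList("a b", "b c");     // (k1, v1): v1 is a line
            Map<String, List<Integer>> grouped = new TreeMap<>(); // (k2, list(v2))
            // map phase: emit an intermediate (word, 1) pair per token
            for (String line : input)
                for (String word : line.split(" "))
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            // reduce phase: collapse each key's value list to a single sum
            for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                int sum = 0;
                for (int v : e.getValue()) sum += v;
                System.out.println(e.getKey() + " -> " + sum);
            }
        }
    }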

5
Why is this useful?
  • Map-reduce jobs are automatically parallelized.
  • Partial failure of the processing cluster is
    expected and tolerable.
  • Redundancy and fault tolerance are built in, so
    the programmer doesn't have to worry about them.
  • It scales very well.
  • Many jobs are naturally expressible in the
    map/reduce paradigm.

6
What are some uses?
  • Word count
  • map: <word, 1>; reduce: <word, sum of counts>
  • Grep
  • map: <file, line>; reduce: identity
  • Inverted index
  • map: <word, docID>; reduce: <word, list(docID)>
  • Distributed sort (a special case)
  • map: <key, record>; reduce: identity
  • Users Google, Yahoo!, Amazon, Facebook, etc.

7
Presentation Overview
  • What is map-reduce?
  • input/output data types
  • why is it useful and where is it used?
  • Execution overview
  • Features
  • fault tolerance
  • ordering guarantee
  • other perks and bonuses
  • Hands-on demonstration and follow-along
  • Map-reduce-merge

8
Execution overview: map
  • The user begins a map-reduce job. One of the
    machines becomes the master.
  • Partition the input into M splits (16-64 MB each)
    and distribute among the machines. A worker
    reads its split and begins work. Upon
    completion, the worker notifies the master.
  • The master partitions the intermediate keyspace
    into R pieces with a partitioning function.

9
Execution overview: reduce
  • When a reduce worker is notified about a job, it
    uses RPC to read the intermediate data from the
    mappers' local disks, then sorts it by key.
  • The reducer processes its job, then writes its
    output to the final output file for its reduce
    partition.
  • When all reducers are finished, the master wakes
    up the user program.

10
What are M and R?
  • M is the number of map pieces. R is the number
    of reduce pieces.
  • Ideally, M and R are much larger than the number
    of workers. This lets one machine perform many
    different tasks, which improves load balancing and
    speeds up recovery. (The original Google paper
    reports typical values of M = 200,000 and R = 5,000
    with 2,000 workers.)
  • The master makes O(M + R) scheduling decisions and
    keeps O(M × R) pieces of state in memory.
  • At least R output files end up being written.

11
Example: counting words
  • We have UTD's fight song:
  • C-O-M-E-T-S! Go!
  • Green, Orange, White!
  • Comets! Go!
  • Strong of will, we fight for right!
  • Let's all show our comet might!
  • We want to count the number of occurrences of
    each word.
  • The next slides show the map and reduce phases.

12
First stage: map
  • Go through the input, and for each word return a
    tuple of (<word>, 1).
  • Output:
  • <C-O-M-E-T-S!, 1>
  • <Go!, 1>
  • <Green,, 1>
  • <Orange,, 1>
  • <White!, 1>
  • <Comets!, 1>
  • <Go!, 1>
  • <Strong, 1>
  • <of, 1>
  • ...

13
Between map and reduce...
  • Between the mapper and the reducer, some gears
    turn within Hadoop, and it groups identical keys
    and sorts by key before starting the reducer.
  • Here's the output:
  • <C-O-M-E-T-S!, 1>
  • <Comets!, 1>
  • <Go!, 1,1>
  • <Green,, 1>
  • <Orange,, 1>
  • <Strong, 1>
  • <White!, 1>
  • <of, 1>
  • ...

14
Second stage: reducer
  • The reducer receives the content, one
    key/value-list pair at a time, and does its own
    processing.
  • For wordcount, it sums the values in each list.
  • Here's the output:
  • <C-O-M-E-T-S!, 1>
  • <Go!, 2>
  • <Green,, 1>
  • <Orange,, 1>
  • Then it writes these tuples to the final files in
    the HDFS.
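
For reference, the mapper and reducer just described look roughly like this
in the classic (0.18-era) org.apache.hadoop.mapred API; this sketch follows
the WordCount example bundled with that release, though details may differ
slightly.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {
        // First stage: emit <word, 1> for every token in the input line.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    output.collect(word, ONE);
                }
            }
        }

        // Second stage: sum the 1s in each word's value list.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }
    }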

15
How can we improve our wordcount? Also, any
questions?
16
Presentation Overview
  • What is map-reduce?
  • input/output data types
  • why is it useful and where is it used?
  • Execution overview
  • Features
  • fault tolerance
  • ordering guarantee
  • other perks and bonuses
  • Hands-on demonstration and follow-along
  • Map-reduce-merge

17
Fault tolerance
  • Worker failure is expected. If a worker fails
    during the map phase, its tasks are reassigned to
    another worker. If a mapper's machine fails during
    the reduce phase, its completed map tasks are also
    re-executed, since their output lives on that
    machine's local disk.
  • Master failure is not expected, though
    checkpointing can be used for recovery.
  • If a particular record causes the mapper or
    reducer to reliably crash, the map-reduce system
    can figure this out, skip the record, and proceed.

18
Ordering guarantee
  • The implementation of map-reduce guarantees that
    within a given partition, the intermediate
    key/value pairs are processed in increasing key
    order.
  • This means that each reduce partition ends up
    with an output file sorted by key.

19
Partitioning function
  • By default, your reduce tasks will be distributed
    evenly by using a hash(intermediate key) mod R
    function.
  • You can specify a custom partitioning function.
  • Useful for locality reasons, such as if the key
    is a URL and you want all URLs belonging to a
    single host to be processed on a single machine.
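
As a sketch of that URL example in the classic API (HostPartitioner is a
hypothetical name, and the host extraction is deliberately simplified):

    import java.net.URI;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Routes all URLs with the same host to the same reduce partition.
    public class HostPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) { /* no configuration needed */ }

        public int getPartition(Text key, Text value, int numReduceTasks) {
            String host;
            try {
                host = URI.create(key.toString()).getHost();
            } catch (IllegalArgumentException e) {
                host = null;  // malformed URL
            }
            if (host == null) host = key.toString();  // fall back to the raw key
            // Mask the sign bit so the partition index is never negative.
            return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

It would be registered on the job with
conf.setPartitionerClass(HostPartitioner.class).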

20
Combiner function
  • After a map phase, the mapper transmits its entire
    intermediate data file over the network to the
    reducers.
  • Sometimes this file is highly compressible.
  • The user can specify a combiner function. It's
    just like a reduce function, except it runs on the
    mapper's local output before the data is handed to
    the reducers.

21
Counters
  • A counter can be associated with any action that
    a mapper or a reducer does. This is in addition
    to default counters such as the number of input
    and output key/value pairs processed.
  • A user can watch the counters in real time to
    see the progress of a job.
  • When the map/reduce job finishes, these counters
    are provided to the user program.
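
A minimal sketch of a user-defined counter in the classic API (the enum name
and the skip condition are illustrative):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // A mapper that counts how many blank lines it skips.
    public class SkippingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        public enum MyCounters { SKIPPED_LINES }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output,
                        Reporter reporter) throws IOException {
            if (value.toString().trim().isEmpty()) {
                reporter.incrCounter(MyCounters.SKIPPED_LINES, 1);  // bump the counter
                return;
            }
            output.collect(value, new LongWritable(1));
        }
    }

After JobClient.runJob(conf) returns, the RunningJob it hands back exposes
the aggregated totals through getCounters().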

22
Presentation Overview
  • What is map-reduce?
  • input/output data types
  • why is it useful and where is it used?
  • Execution overview
  • Features
  • fault tolerance
  • ordering guarantee
  • other perks and bonuses
  • Hands-on demonstration and follow-along
  • Map-reduce-merge

23
What is Hadoop?
  • Hadoop is the implementation of the map/reduce
    design that we will use.
  • Hadoop is released under the Apache License 2.0,
    so it's open source.
  • Hadoop uses the Hadoop Distributed File System,
    HDFS. (In contrast to what we've seen with
    Lucene.)
  • Get the release from
  • http://hadoop.apache.org/core/

24
Preparing Hadoop on your system
  • Configure passwordless public-key SSH on
    localhost
  • Configure Hadoop
  • look at the two configuration files at
    http://utdallas.edu/pmw033000/hadoop/
  • Format the HDFS
  • bin/hadoop namenode -format
  • Start Hadoop
  • cd <hadoop-dir>
  • bin/start-all.sh (and wait 20 seconds)

25
Example: grep
  • Standard Unix 'grep' behavior: run it on the
    command line with the search string as the first
    argument and the list of files or directories as
    the subsequent argument(s).
  • grep HelloWorld file1.c file2.c file3.c
  • file2.c:System.out.println("I say HelloWorld!")

26
Preparing for 'grep' in Hadoop
  • Hadoop's jobs always operate within the HDFS.
  • Hadoop will read its input from HDFS, and will
    write its output to HDFS.
  • Thus, to prepare:
  • Download a free electronic book
  • http://utdallas.edu/pmw033000/hadoop/book.txt
  • Load the file into HDFS
  • bin/hadoop fs -copyFromLocal book.txt /book.txt

27
Using 'grep' within Hadoop
  • bin/hadoop jar \
  • hadoop-0.18.2-examples.jar \
  • grep /book.txt /grep-result \
  • 'search string'
  • bin/hadoop fs -ls /grep-result
  • bin/hadoop fs -cat /grep-result/part-00000
  • A good string to try: Horace de \S
  • Between job runs: bin/hadoop fs -rmr /grep-result

28
How 'grep' in Hadoop works
  • The program runs two map/reduce jobs in sequence.
    The first job counts how many times a matching
    string occurred and the second job sorts matching
    strings by their frequency and stores the output
    in a single output file.
  • Each mapper of the first job takes a line as
    input and matches the user-provided regular
    expression against the line. It extracts all
    matching strings and emits (matching string, 1)
    pairs. Each reducer sums the frequencies of each
    matching string. The output is a set of sequence
    files containing each matching string and its count.
    The reduce phase is optimized by running a combiner
    that sums string frequencies in each mapper's local
    output, so less data needs to be shipped to the
    reduce tasks.
  • The second job takes the output of the first job
    as input. The mapper inverts each <string, count>
    pair to <count, string>, while the reducer is the
    identity. The number of reducers is one, so the
    output is stored in a single file, sorted by count
    in descending order. The output file is plain text;
    each line contains a count and a matching string.
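
A skeleton of this two-job pipeline in the classic API; GrepPipeline and the
temporary path are illustrative names, and the per-job mapper/reducer wiring
is elided with comments rather than reproducing the bundled example's code.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class GrepPipeline {
        public static void main(String[] args) throws Exception {
            Path input = new Path(args[0]);
            Path temp = new Path("/tmp/grep-intermediate");  // illustrative temp dir
            Path output = new Path(args[1]);

            // Job 1: count occurrences of each matching string.
            JobConf count = new JobConf(GrepPipeline.class);
            count.setJobName("grep-count");
            FileInputFormat.setInputPaths(count, input);
            FileOutputFormat.setOutputPath(count, temp);
            // ... set the regex-matching mapper plus the summing
            // combiner/reducer here ...
            JobClient.runJob(count);  // blocks until job 1 completes

            // Job 2: invert <string, count> to <count, string> and sort.
            JobConf sort = new JobConf(GrepPipeline.class);
            sort.setJobName("grep-sort");
            FileInputFormat.setInputPaths(sort, temp);
            FileOutputFormat.setOutputPath(sort, output);
            sort.setNumReduceTasks(1);  // one reducer gives one sorted output file
            // ... set the inverse mapper and identity reducer here ...
            JobClient.runJob(sort);
        }
    }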

29
Another example: word count
  • bin/hadoop jar hadoop-0.18.2-examples.jar \
  • wordcount /book.txt /wc-result
  • bin/hadoop fs -cat /wc-result/part-00000 | sort -n -k 2
  • You can also try passing a -r option to
    increase the number of parallel reducers.
  • Each mapper takes a line as input and breaks it
    into words. It then emits a key/value pair of the
    word and 1. Each reducer sums the counts for each
    word and emits a single key/value with the word
    and sum.
  • As an optimization, the reducer is also used as a
    combiner on the map outputs. This reduces the
    amount of data sent across the network by
    combining each word into a single record.
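
A driver for the WordCount mapper and reducer sketched earlier might look
like this (classic 0.18-era API; the class name and reducer count are
assumptions for illustration). Note the combiner line, which is the
optimization the last bullet describes, and setNumReduceTasks, which is what
the example's -r flag controls.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(WordCount.Map.class);
            conf.setCombinerClass(WordCount.Reduce.class);  // local pre-aggregation
            conf.setReducerClass(WordCount.Reduce.class);

            conf.setNumReduceTasks(4);  // number of parallel reducers

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }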

30
Presentation Overview
  • What is map-reduce?
  • input/output data types
  • why is it useful and where is it used?
  • Execution overview
  • Features
  • fault tolerance
  • ordering guarantee
  • other perks and bonuses
  • Hands-on demonstration and follow-along
  • Map-reduce-merge (a proposal, not implemented)

31
Does map-reduce satisfy all needs?
  • Map-reduce is great for homogeneous data, such as
    grepping a large collection of files or
    word-counting a huge document.
  • Joining heterogeneous databases does not work
    well.
  • As is, we'd need additional map-reduce steps,
    such as map-reducing one database and reading
    from the others on the fly.
  • We want to support relational algebra.

32
Solution
  • The solution to these problems is
    map-reduce-merge. It is map-reduce with a new
    additional merging step.
  • The merge phase makes it easier to process data
    relationships among heterogeneous data sets.
  • Types:
  • map: (k1, v1)α → [(k2, v2)]α
  • reduce: (k2, [v2])α → (k2, [v3])α (notice that
    the output v is a list)
  • merge: ((k2, [v3])α, (k3, [v4])β) → [(k4, v5)]γ
  • If α = β, then the merging step performs a
    self-merge (a self-join in relational algebra).

33
New terms
  • Partition selector: determines which data
    partitions produced by reducers should be
    retrieved for merging.
  • Processor: user-defined logic for processing data
    from an individual source.
  • Merger: user-defined logic for processing data
    merged from two sources where the data satisfies a
    merge condition.
  • Configurable iterator: next slide.

34
Configurable iterators
  • The map and reduce user-defined functions get one
    iterator for the values.
  • The merge function gets two iterators, one for
    each data source.
  • The iterators do not have to move forward; they
    can be instrumented to do whatever the user
    wants.
  • Relational join algorithms have specific patterns
    for the merging step.

35
Map-reduce-merge example
  • Table A (emp-id, dept-id, bonus):
  • 1, B, 100
  • 1, B, 50
  • 2, A, 0
  • 3, A, 150
  • 3, A, 100
  • Table B (dept-id, bonus-adjust):
  • B, 1.15
  • A, 0.95
  • Final table (emp-id, bonus):
  • 2, 0
  • 3, 237.5
  • 1, 172.5
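
Map-reduce-merge was never released as software, so the following is a
purely hypothetical, in-memory Java sketch of the merge step for this
example: it joins the reduced outputs of tables A and B on dept-id and
scales each employee's summed bonus by the department's adjustment factor.

    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class BonusMerge {
        // One reduced row of table A: the employee's dept and summed bonus.
        static class EmpRow {
            final String dept;
            final double bonus;
            EmpRow(String dept, double bonus) { this.dept = dept; this.bonus = bonus; }
        }

        public static void main(String[] args) {
            // Reduce output of table A: emp-id -> (dept-id, total bonus).
            Map<Integer, EmpRow> empBonus = new LinkedHashMap<>();
            empBonus.put(2, new EmpRow("A", 0.0));
            empBonus.put(3, new EmpRow("A", 150.0 + 100.0));
            empBonus.put(1, new EmpRow("B", 100.0 + 50.0));

            // Reduce output of table B: dept-id -> bonus-adjust factor.
            Map<String, Double> adjust = new HashMap<>();
            adjust.put("B", 1.15);
            adjust.put("A", 0.95);

            // Merge: an equi-join on dept-id, emitting (emp-id, adjusted bonus).
            for (Map.Entry<Integer, EmpRow> e : empBonus.entrySet()) {
                EmpRow row = e.getValue();
                System.out.println(e.getKey() + ", " + row.bonus * adjust.get(row.dept));
            }
        }
    }
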
36
Map-reduce-merge diagram