MapReduce for the Cell B. E. Architecture - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: MapReduce for the Cell B. E. Architecture


1
MapReduce for the Cell B. E. Architecture
Marc de Kruijf, University of Wisconsin-Madison
Advised by Professor Sankaralingam
2
MapReduce
  • A model for parallel programming
  • Proposed by Google
  • Large-scale distributed systems: 1,000-node clusters
  • Applications:
  • Distributed sort
  • Distributed grep
  • Indexing
  • Simple, high-level interface
  • Runtime handles:
  • parallelization, scheduling, synchronization, and communication

3
Cell B. E. Architecture
  • A heterogeneous computing platform
  • 1 PPE, 8 SPEs
  • Programming is hard
  • Multi-threading is explicit
  • SPE local memories are software-managed
  • The Cell is like a cluster-on-a-chip

4
Motivation
  • MapReduce
  • Scalable parallel model
  • Simple interface
  • Cell B. E.
  • Complex parallel architecture
  • Hard to program

MapReduce for the Cell B.E. Architecture
5
Overview
  • Motivation
  • MapReduce
  • Cell B.E. Architecture
  • MapReduce Example
  • Design
  • Evaluation
  • Workload Characterization
  • Application Performance
  • Conclusions and Future Work

6
MapReduce Example
  • Counting word occurrences in a set of documents
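As a concrete illustration, the word-count kernel fits the model in a few lines of C. The sketch below is illustrative only: emit_intermediate and emit are generic stand-ins for the runtime's emit calls (the library's actual API is shown on the backup slides), and the stubs exist only so the example compiles on its own.

#include <stdio.h>
#include <string.h>

/* Stand-in stubs so the sketch compiles on its own; in the real system the
 * MapReduce runtime provides the emit calls and groups the pairs by key. */
static void emit_intermediate(const char *key, int value)
{
    printf("map emits (%s, %d)\n", key, value);
}

static void emit(int value)
{
    printf("reduce emits %d\n", value);
}

/* Map: for every word in the document, emit the pair (word, 1). */
static void map(char *contents)
{
    for (char *w = strtok(contents, " \t\n"); w != NULL; w = strtok(NULL, " \t\n"))
        emit_intermediate(w, 1);
}

/* Reduce: sum the list of counts gathered for one word. */
static void reduce(const char *word, const int *counts, int n)
{
    (void)word;
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += counts[i];
    emit(total);
}

int main(void)
{
    char doc[] = "the cell the spe the ppe";
    map(doc);                       /* the runtime would stream many documents */
    int ones[3] = { 1, 1, 1 };
    reduce("the", ones, 3);         /* the runtime groups the values per key   */
    return 0;
}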

7
Overview
  • Motivation
  • MapReduce
  • Cell B.E. Architecture
  • MapReduce Example
  • Design
  • Evaluation
  • Workload Characterization
  • Application Performance
  • Conclusions and Future Work

8
Design
Flow of Execution: five stages (Map, Partition, Quick-sort, Merge-sort, Reduce)
9
Design
Flow of Execution: five stages (Map, Partition, Quick-sort, Merge-sort, Reduce)
1. Map streams key/value pairs
10
Design
  • Flow of Execution
  • Five stages: Map, Partition, Quick-sort, Merge-sort, Reduce
  • 1. Map streams key/value pairs
  • Key grouping implemented as a two-phase external sort:
  • 2. Partition: hash and distribute
  • 3. Quick-sort
  • 4. Merge-sort
13
Design
  • Flow of Execution
  • Five stages: Map, Partition, Quick-sort, Merge-sort, Reduce
  • 1. Map streams key/value pairs
  • Key grouping implemented as a two-phase external sort:
  • 2. Partition: hash and distribute
  • 3. Quick-sort
  • 4. Merge-sort
  • 5. Reduce reduces key/list-of-values pairs to key/value pairs
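Read sequentially, the five stages compose as ordinary code. The following plain-C sketch shows the data flow on one node; every identifier in it (kv, emit_intermediate, NPART, and so on) is illustrative, and on the Cell the stages actually run pipelined across the SPEs over DMA-streamed buffers rather than in a single loop.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NPART 4                          /* number of partitions          */
#define MAXKV 1024                       /* per-partition buffer capacity */

typedef struct { const char *key; int value; } kv;

static kv  part[NPART][MAXKV];           /* partition buffers */
static int nkv[NPART];

/* Stages 1-2: Map emits a pair; the runtime hashes the key and appends the
 * pair to the matching partition buffer. */
static void emit_intermediate(const char *key, int value)
{
    unsigned h = 5381;
    for (const char *c = key; *c; ++c)
        h = h * 33u + (unsigned char)*c;
    unsigned p = h % NPART;
    part[p][nkv[p]++] = (kv){ key, value };
}

static int cmp(const void *a, const void *b)
{
    return strcmp(((const kv *)a)->key, ((const kv *)b)->key);
}

int main(void)
{
    /* Stage 1: Map streams key/value pairs (word count stands in for Map). */
    const char *words[] = { "cell", "spe", "cell", "ppe", "spe", "cell" };
    for (int i = 0; i < 6; ++i)
        emit_intermediate(words[i], 1);

    for (unsigned p = 0; p < NPART; ++p) {
        /* Stages 3-4: quick-sort each buffered run, then merge-sort the runs;
         * with one run per partition a single qsort stands in for both phases
         * of the two-phase external sort. */
        qsort(part[p], nkv[p], sizeof(kv), cmp);

        /* Stage 5: group equal keys and hand key/list-of-values to Reduce
         * (here Reduce just sums and prints the values for each key). */
        for (int i = 0; i < nkv[p]; ) {
            int j = i, sum = 0;
            while (j < nkv[p] && strcmp(part[p][j].key, part[p][i].key) == 0)
                sum += part[p][j++].value;
            printf("%s -> %d\n", part[p][i].key, sum);
            i = j;
        }
    }
    return 0;
}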
14
Overview
  • Motivation
  • MapReduce
  • Cell B.E. Architecture
  • MapReduce Example
  • Design
  • Evaluation
  • Workload Characterization
  • Application Performance
  • Conclusions and Future Work

15
Evaluation Methodology
  • MapReduce Model Characterization
  • Synthetic micro-benchmark with six parameters
  • Run on a 3.2 GHz Cell Blade
  • Measured effect of each parameter on execution
    time
  • Application Performance Comparison
  • Six full applications
  • MapReduce versions run on 3.2 GHz Cell Blade
  • Single-threaded versions run on 2.4 GHz Core 2
    Duo
  • Evaluation
  • Measured speedup by comparing execution times
  • Measured overheads on the Cell by monitoring SPE idle time
  • Measured ideal speedup assuming no Cell overheads

16
MapReduce Model Characterization
  • Model Characteristics and their effect on execution time:
  • Map intensity: execution cycles per input byte to Map
  • Reduce intensity: execution cycles per input byte to Reduce
  • Map fan-out: ratio of input size to output size in Map
  • Reduce fan-in: number of values per key in Reduce
  • Partitions: number of partitions
  • Input size: input size in bytes
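One way to make the six knobs concrete is as a parameter block for the synthetic micro-benchmark; the struct below is purely illustrative (the field names are not the benchmark's actual code), with one field per characteristic listed above.

#include <stddef.h>

/* Hypothetical parameter block for the synthetic micro-benchmark;
 * each field corresponds to one characteristic listed above. */
typedef struct {
    unsigned map_intensity;     /* execution cycles per input byte to Map     */
    unsigned reduce_intensity;  /* execution cycles per input byte to Reduce  */
    double   map_fan_out;       /* ratio of input size to output size in Map  */
    unsigned reduce_fan_in;     /* number of values per key in Reduce         */
    unsigned num_partitions;    /* number of partitions                       */
    size_t   input_size;        /* input size in bytes                        */
} mr_workload_params;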
17
Application Performance
  • Applications
  • histogram: counts RGB occurrences in a bitmap
  • kmeans: k-means clustering algorithm
  • linearReg: least-squares linear regression
  • wordCount: word count
  • NAS_EP: the EP benchmark from the NAS suite
  • distSort: distributed sort

18
Speedup Over Core 2 Duo
19
Runtime Overheads
20
Overview
  • Motivation
  • MapReduce
  • Cell B.E. Architecture
  • MapReduce Example
  • Design
  • Evaluation
  • Workload Characterization
  • Application Performance
  • Conclusions and Future Work

21
Conclusions and Future Work
  • Conclusions
  • Programmability benefits
  • High-performance on computationally intensive
    workloads
  • Not applicable to all application types
  • Future Work
  • Additional performance tuning
  • Extend for clusters of Cell processors
  • Hierarchical MapReduce

22
Questions?
23
Backup Slides
24
MapReduce API
  • void MapReduce_exec(MapReduce_Specification specification)
  • The exec function initializes the MapReduce runtime and executes
    MapReduce according to the user specification.
  • void MapReduce_emitIntermediate(void **key, void **value)
  • void MapReduce_emit(void **value)
  • These two functions are called by the user-defined Map and Reduce
    functions, respectively. They take references to pointers as arguments
    and modify the referenced pointer to point to pre-allocated storage;
    it is then the responsibility of the application to populate this
    storage.
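Reading the description above literally, a Map function first asks the runtime for output slots and then copies its key and value into them. The sketch below shows that convention; the void ** signatures and the usage pattern are a reconstruction from the slide text, not code taken from the library, and the stub exists only so the fragment compiles standalone.

#include <string.h>

/* Stand-in stub so the sketch compiles on its own; the real runtime hands
 * out slots inside its pre-allocated output regions instead. */
static char key_slot[64];
static int  value_slot;
void MapReduce_emitIntermediate(void **key, void **value)
{
    *key   = key_slot;
    *value = &value_slot;
}

/* A Map function emitting (word, 1) pairs under the reconstructed
 * convention: the runtime returns pre-allocated storage through the
 * reference-to-pointer arguments, and the application fills it in. */
void map_word_count(char *contents)
{
    for (char *w = strtok(contents, " \t\n"); w; w = strtok(NULL, " \t\n")) {
        void *key, *value;
        MapReduce_emitIntermediate(&key, &value);  /* obtain output slots  */
        strcpy((char *)key, w);                    /* application fills in */
        *(int *)value = 1;
    }
}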

25
Optimizations
  • Priority work queue:
  • Distributes load
  • Avoids serialization
  • Pipelined execution maximizes concurrency
  • Double-buffering (see the sketch after this list)
  • Application support:
  • Map only
  • Map with sorted output
  • Chaining invocations
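Double-buffering here refers to the usual SPE technique of overlapping DMA with computation. The sketch below shows that generic pattern with the Cell SDK's MFC intrinsics (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all); it is not the runtime's actual code, process() is a hypothetical compute kernel, and the total size is assumed to be a multiple of the chunk size.

#include <spu_mfcio.h>

#define CHUNK 16384                        /* bytes per DMA transfer */

static char buf[2][CHUNK] __attribute__((aligned(128)));

void process(char *data, unsigned n);      /* hypothetical user compute kernel */

/* Stream `total` bytes (assumed a multiple of CHUNK) from effective address
 * `ea`: while the SPU processes buffer `cur`, the MFC fills buffer `1 - cur`. */
void stream_input(unsigned long long ea, unsigned long long total)
{
    int cur = 0;

    /* Prime the pipeline with the first transfer (tag number = buffer index). */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned long long off = 0; off < total; off += CHUNK) {
        int nxt = 1 - cur;

        /* Kick off the next transfer before touching the current buffer. */
        if (off + CHUNK < total)
            mfc_get(buf[nxt], ea + off + CHUNK, CHUNK, nxt, 0, 0);

        /* Wait only for the current buffer's tag, then compute on it. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);

        cur = nxt;
    }
}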

27
Optimizations
  • Balanced merge: n / log(n) better bandwidth utilization as n → ∞
  • Map and Reduce output regions pre-allocated (sketch below):
  • optimal memory alignment
  • bulk memory transfers
  • no user memory management
  • no dynamic allocation overhead
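Pre-allocation and alignment go together because the SPEs' DMA engines need 16-byte alignment and perform best on 128-byte (cache-line) boundaries; the fragment below is a purely illustrative PPE-side setup of such a region using posix_memalign, not the runtime's allocator.

#include <stdlib.h>

/* Illustrative PPE-side allocation of an output region suitable for bulk
 * DMA: 128-byte alignment matches the Cell's cache line and the MFC's
 * preferred transfer alignment (16 bytes is the usual minimum). */
void *alloc_output_region(size_t bytes)
{
    void *region = NULL;
    if (posix_memalign(&region, 128, bytes) != 0)
        return NULL;
    return region;
}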