Transcript and Presenter's Notes

Title: Characterizing and Benchmarking MapReduce Workloads


1
Characterizing and Benchmarking Map-Reduce Workloads
  • Eric Saxe, Sun
  • Andy Konwinski, RadLab
  • Haruki Oh, RadLab
  • Jay Wylie, HP Labs
  • Matei Zaharia, RadLab
  • Owen O'Malley, Yahoo!
  • Khalid Elmeleegy, Yahoo!
  • Randy Katz, RadLab

2
What's the Problem?
  • Performance of M-R is difficult to predict
  • Phases: Distribute, Map, Shuffle, Reduce, Collect (sketched in the code below)
  • Random distribution is good: it breaks up hot spots
  • Random distribution is bad: it disrupts locality and caching
  • Map exploits parallel processing for speedup
  • Shuffle stresses the network with non-uniform access patterns
  • Operational issues (asymmetric loading, laggards, virtualization) all affect
    performance predictability
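To make the phase list above concrete, here is a minimal in-memory sketch of the Distribute/Map/Shuffle/Reduce/Collect flow in Python. It is illustrative only (run_mapreduce and its arguments are hypothetical, not the Hadoop API); it shows where per-record parallelism (Map) and all-to-all data movement (Shuffle) come from.

  from collections import defaultdict

  def run_mapreduce(records, map_fn, reduce_fn, n_reducers=4):
      # Map: each record is processed independently (the parallel-speedup phase).
      mapped = [kv for record in records for kv in map_fn(record)]

      # Shuffle: group by key and route each key to a reducer partition;
      # on a real cluster this is the all-to-all network transfer.
      partitions = [defaultdict(list) for _ in range(n_reducers)]
      for key, value in mapped:
          partitions[hash(key) % n_reducers][key].append(value)

      # Reduce + Collect: aggregate each key's values and gather the results.
      return {key: reduce_fn(key, values)
              for part in partitions for key, values in part.items()}

  # Word count: out == {'a': 2, 'b': 2, 'c': 1}
  out = run_mapreduce(["a b a", "b c"],
                      map_fn=lambda line: [(w, 1) for w in line.split()],
                      reduce_fn=lambda k, vs: sum(vs))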

3
What's the Problem?
  • Developers of M-R code are not the developers of
    the applications that use the M-R code
  • Difficult for developers to obtain realistic
    workload information, instrument the operational
    environment, access the (huge) application
    datasets
  • How do you determine if your modification to the
    M-R code base is likely to improve performance?
  • How do you configure the cluster environment to
    run the M-R workload well?

4
What's the Problem?
  • Develop models of M-R execution, parameterized by a workload description and
    a hardware configuration (a toy sketch follows this list)
  • Given a configuration and a (benchmark) workload, predict performance and
    compare before and after algorithmic changes, e.g., scheduling algorithms
  • Given a workload, generate a configuration that appropriately balances CPU,
    memory, disk, network, and storage system while achieving high utilization
    and acceptable performance
  • Topology knobs: nodes, disks, interconnect
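A minimal sketch of such a parameterized model, assuming a simple bottleneck formulation; predict_job_time, the workload and cluster fields, and all numbers are invented for illustration, not taken from the presentation.

  def predict_job_time(workload, cluster):
      # Toy model: each phase is bounded by its scarcest resource.
      map_cpu   = workload["map_cpu_sec"] / cluster["cores"]
      map_io    = workload["input_bytes"] / (cluster["nodes"] * cluster["disk_bw"])
      shuffle   = workload["shuffle_bytes"] / cluster["bisection_bw"]
      reduce_io = workload["output_bytes"] / (cluster["nodes"] * cluster["disk_bw"])
      return max(map_cpu, map_io) + shuffle + reduce_io

  cluster = {"nodes": 40, "cores": 320, "disk_bw": 80e6, "bisection_bw": 10e9}
  job = {"map_cpu_sec": 6400, "input_bytes": 1e12,
         "shuffle_bytes": 2e11, "output_bytes": 1e11}
  # Predicted seconds; rerun with a modified model or schedule to compare before vs. after.
  print(predict_job_time(job, cluster))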

5
Performance Metrics
  • (Batch) job is the primitive
  • Large dataset analytics/machine learning/pattern
    extraction
  • N-terabyte log processing (e.g., Chukwa),
    aggregation/summarization
  • Periodic processing runs to collect and summarize
    data (e.g., financial processing)
  • N-step transitive closure of social graph
  • Sort-Merge
  • Metric 1: per-job finishing time / fixed finishing time
  • Metric 2: job throughput (both metrics computed in the sketch below)
  • Metrics not currently considered, but could be: energy efficiency
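A minimal sketch of both metrics over a hypothetical job log; the tuple format and numbers are invented for illustration.

  # Hypothetical job log: (job_id, submit_time_s, finish_time_s)
  jobs = [("sort-1", 0, 840), ("scan-2", 60, 300), ("join-3", 120, 1500)]

  # Metric 1: per-job finishing time
  finishing_times = [finish - submit for _, submit, finish in jobs]

  # Metric 2: job throughput over the batch's makespan (jobs/second)
  makespan = max(f for _, _, f in jobs) - min(s for _, s, _ in jobs)
  throughput = len(jobs) / makespan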

6
State of M-R Benchmarking
  • GridMix
  • Small number of relevant M-R application cores
  • Sorts, data scans, monster query, random writer
  • Problems:
  • High variation in performance from run to run
  • Does not scale work to the configuration

7
Some Ideas for Investigation
  • Workload Descriptions
  • More realistic components added to GridMix
  • Profile real workloads in terms of resource
    intensity over time/phase
  • K1·CPU(t) + K2·Mem(t) + K3·Disk(t) + K4·Net(t), where t = M-R phase
  • Synthetically generate this pattern to drive algorithmic studies and
    configurations (a minimal generator sketch follows)
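A minimal sketch of such a profile-driven intensity function. The per-phase weights and K coefficients are made up for illustration; a real profile would come from instrumenting production jobs.

  # Hypothetical resource intensity per M-R phase, normalized to [0, 1].
  profile = {
      "map":     {"cpu": 0.8, "mem": 0.4, "disk": 0.6, "net": 0.1},
      "shuffle": {"cpu": 0.2, "mem": 0.3, "disk": 0.3, "net": 0.9},
      "reduce":  {"cpu": 0.5, "mem": 0.5, "disk": 0.7, "net": 0.2},
  }
  K = {"cpu": 1.0, "mem": 0.5, "disk": 2.0, "net": 1.5}   # K1..K4

  def intensity(phase):
      # K1·CPU(t) + K2·Mem(t) + K3·Disk(t) + K4·Net(t) for phase t
      return sum(K[r] * profile[phase][r] for r in K)

  # A synthetic generator would replay each phase at its computed intensity.
  for t in ("map", "shuffle", "reduce"):
      print(t, intensity(t))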

8
Some Ideas for Investigation
  • Scalable Benchmarks

[Chart: per-job finishing time vs. workload (jobs/time)]
9
Some Ideas for Investigation
[Diagram: a model takes a Configuration plus a Workload and predicts performance;
the same model takes a Performance Goal plus a Workload and suggests a
configuration, with statistical bounds on the performance prediction.
Both directions are sketched below.]
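A minimal sketch of the "suggest configuration" direction, reusing the same toy bottleneck model as the earlier sketch; predict_job_time, suggest_configuration, the candidate fields, the goal, and all numbers are hypothetical.

  def predict_job_time(workload, cluster):
      # Same toy bottleneck model as in the earlier sketch (illustrative only).
      map_t = max(workload["map_cpu_sec"] / cluster["cores"],
                  workload["input_bytes"] / (cluster["nodes"] * cluster["disk_bw"]))
      return (map_t
              + workload["shuffle_bytes"] / cluster["bisection_bw"]
              + workload["output_bytes"] / (cluster["nodes"] * cluster["disk_bw"]))

  def suggest_configuration(workload, goal_seconds, candidates):
      # Return the smallest candidate whose predicted finishing time meets the goal.
      feasible = [c for c in candidates
                  if predict_job_time(workload, c) <= goal_seconds]
      return min(feasible, key=lambda c: c["nodes"]) if feasible else None

  job = {"map_cpu_sec": 6400, "input_bytes": 1e12,
         "shuffle_bytes": 2e11, "output_bytes": 1e11}
  candidates = [{"nodes": n, "cores": 8 * n, "disk_bw": 80e6, "bisection_bw": 10e9}
                for n in (10, 20, 40, 80)]
  print(suggest_configuration(job, goal_seconds=600, candidates=candidates))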