Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
Performance Prediction for Random Write
Reductions A Case Study in Modelling Shared
Memory Programs
  • Ruoming Jin
  • Gagan Agrawal
  • Department of Computer and Information Sciences
  • Ohio State University

2
Outline
  • Motivation
  • Random Write Reductions and Parallelization
    Techniques
  • Problem Definition
  • Analytical Model
  • General Approach
  • Modeling Cache and TLB
  • Modeling waiting for locks and memory contention
  • Experimental Validation
  • Conclusions

3
Motivation
  • Frequently need to mine very large datasets
  • Large and powerful SMP machines are becoming
    available
  • Vendors often target data mining and data
    warehousing as the main market
  • Data mining emerging as an important class of
    applications for SMP machines

4
Common Processing Structure
  • Structure of Common Data Mining Algorithms (a concrete sequential sketch in code follows this slide):

    While () {                      /* Outer Sequential Loop */
      Foreach (element e) {         /* Reduction Loop */
        (i, val) = process(e)
        Reduc(i) = Reduc(i) op val
      }
    }
  • Applies to major association mining, clustering
    and decision tree construction algorithms
  • How to parallelize it on a shared memory
    machine?
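As a concrete illustration of this loop structure, here is a minimal sequential C++ sketch in which process(e) maps each input element to an index into the reduction object and op is addition; the histogram-style computation and all names are hypothetical placeholders for an algorithm-specific step.

    // Minimal sequential sketch of the generic reduction structure above.
    // The histogram-style process() and the '+' combination operator are
    // hypothetical placeholders for an algorithm-specific step.
    #include <cstddef>
    #include <iostream>
    #include <utility>
    #include <vector>

    // process(e): maps an input element to (index into reduction object, value)
    std::pair<std::size_t, double> process(double e, std::size_t num_elems) {
        std::size_t i = static_cast<std::size_t>(e) % num_elems;  // which element to update
        return {i, 1.0};                                          // contribution to combine
    }

    int main() {
        std::vector<double> data = {3.2, 7.9, 1.5, 3.7, 8.8, 1.1};  // input elements, read sequentially
        std::vector<double> reduc(4, 0.0);                          // reduction object, updated randomly

        bool done = false;
        while (!done) {                      // outer sequential loop (single pass here)
            for (double e : data) {          // reduction loop
                auto [i, val] = process(e, reduc.size());
                reduc[i] += val;             // Reduc(i) = Reduc(i) op val, with op = '+'
            }
            done = true;
        }

        for (std::size_t i = 0; i < reduc.size(); ++i)
            std::cout << "Reduc(" << i << ") = " << reduc[i] << '\n';
    }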

5
Challenges in Parallelization
  • Statically partitioning the reduction object to
    avoid race conditions is generally impossible.
  • Runtime preprocessing or scheduling also cannot
    be applied
  • Can't tell which element needs to be updated without processing the input element
  • The size of the reduction object means replication incurs significant memory overheads
  • Locking and synchronization costs could be
    significant because of the fine-grained updates
    to the reduction object.

6
Parallelization Techniques
  • Full Replication: create a copy of the reduction object for each thread
  • Full Locking: associate a lock with each element
  • Optimized Full Locking: put the element and its corresponding lock on the same cache block
  • Cache-Sensitive Locking: one lock for all elements in a cache block

7
Memory Layout for Locking Schemes
[Figure: memory layouts for Optimized Full Locking and Cache-Sensitive Locking, showing how locks and reduction elements are placed within cache blocks]
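A hedged C++ sketch of the two layouts contrasted in the figure; the 64-byte cache block, the lock word type, and the double-valued reduction elements are assumptions made for illustration, not details taken from the slides.

    // Illustrative layouts for the two locking schemes. The 64-byte cache
    // block and the element/lock types are assumptions for this sketch.
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kCacheBlock = 64;  // assumed cache-block size in bytes

    // Optimized full locking: each reduction element is placed in the same
    // cache block as its own lock (the rest of the block is padding).
    struct alignas(kCacheBlock) OptFullLockingElem {
        int32_t lock;   // per-element lock word
        double  value;  // the reduction element itself
    };

    // Cache-sensitive locking: one lock word followed by as many reduction
    // elements as fit in the remainder of the same cache block.
    struct alignas(kCacheBlock) CacheSensitiveBlock {
        int32_t lock;  // single lock shared by all elements in this block
        double  values[(kCacheBlock - sizeof(int32_t)) / sizeof(double)];
    };

    static_assert(sizeof(OptFullLockingElem) == kCacheBlock, "one element + lock per block");
    static_assert(sizeof(CacheSensitiveBlock) == kCacheBlock, "one lock per block of elements");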
8
Relative Experimental Performance
Different techniques can outperform each other, depending upon problem and machine parameters
9
Problem Definition
  • Can we predict the relative performance of different techniques for given machine, algorithm, and dataset parameters?
  • Develop an analytical model capturing the impact
    of memory hierarchy and modeling different
    parallelization overheads
  • Other applications of the model
  • Predicting speedups possible on parallel
    configurations
  • Predicting performance as the output size is
    increased
  • Scheduling and QoS in multiprogrammed
    environments
  • Choosing accuracy of analysis and sampling rate
    in an interactive environment or when mining over
    data streams

10
Context
  • Part of the FREERIDE (Framework for Rapid
    Implementation of Datamining Engines) system
  • Support parallelization on shared-nothing
    configurations
  • Support parallelization on shared memory
    configurations
  • Support processing of large datasets
  • We previously reported our work on parallelization techniques and processing of disk-resident datasets (SDM '01, SDM '02)

11
Analytical Model Overview
  • Input data is read from disks; constant processing time
  • Reduction elements are accessed randomly; their size can vary considerably
  • Factors to model:
  • Cache misses on reduction elements -> capacity and coherence misses
  • TLB misses on reduction elements
  • Waiting time for locks
  • Memory contention

12
Basic Approach
  • Focus on modeling reduction loops
  • T_loop = T_average * N
  • T_average = T_compute + T_reduc
  • T_reduc = T_update + T_wait + T_cache_miss + T_tlb_miss + T_memory_contention
  • Tupdate can be computed by executing the
    loop with a reduction object that fits into L1
    cache
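As a small worked instance of this decomposition, the sketch below plugs hypothetical per-iteration costs (in cycles) into the formulas above; none of the numbers come from the paper.

    // Worked instance of the cost decomposition on this slide; all numeric
    // values are hypothetical placeholders, not measurements from the paper.
    #include <iostream>

    int main() {
        double t_compute           = 120.0;  // cycles in process(e), hypothetical
        double t_update            = 10.0;   // update cost with an L1-resident reduction object
        double t_wait              = 4.0;    // waiting time for locks
        double t_cache_miss        = 30.0;   // cache-miss penalty per iteration
        double t_tlb_miss          = 6.0;    // TLB-miss penalty per iteration
        double t_memory_contention = 2.0;    // memory-contention penalty per iteration

        double t_reduc   = t_update + t_wait + t_cache_miss
                         + t_tlb_miss + t_memory_contention;   // T_reduc
        double t_average = t_compute + t_reduc;                // T_average
        double n         = 1e6;                                // iterations N of the reduction loop
        double t_loop    = t_average * n;                      // T_loop = T_average * N

        std::cout << "Predicted loop time: " << t_loop << " cycles\n";
    }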

13
Modeling Waiting time for Locks
  • The time spent by a thread in one iteration of the loop can be divided into three components:
  • Computing independently (a)
  • Waiting for a lock (T_wait)
  • Holding a lock (b)
  • where b = T_reduc - T_wait
  • Each lock is modeled as an M/D/1 queue
  • The rate at which requests to acquire a given lock are issued is
  • λ = t / ((a + b + T_wait) * m), with t threads and m locks

14
Modeling Waiting Time for Locks
  • Standard result for M/D/1 queues:
  • T_wait = b * U / (2 * (1 - U))
  • where U is the server utilization, given by
  • U = λ * b
  • The resulting expression for T_wait is
  • T_wait = b / (2 * ((a/b + 1) * (m/t) - 1))
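A minimal sketch of this closed-form estimate, assuming (as the closed form implies) that T_wait is dropped from the denominator of the request rate; the parameter values below are hypothetical.

    // Closed-form M/D/1 estimate of lock waiting time from this slide.
    // a: independent compute time, b: lock-holding time (same time unit),
    // t: number of threads, m: number of locks.
    #include <iostream>

    double lock_wait_time(double a, double b, double t, double m) {
        // T_wait = b / (2 * ((a/b + 1) * (m/t) - 1))
        return b / (2.0 * ((a / b + 1.0) * (m / t) - 1.0));
    }

    int main() {
        // Hypothetical parameter values for illustration only.
        std::cout << "Estimated T_wait = "
                  << lock_wait_time(/*a=*/100.0, /*b=*/10.0, /*t=*/8.0, /*m=*/1024.0)
                  << '\n';
    }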

15
Modeling Memory Hierarchy
  • Need to model:
  • L1 cache misses
  • L2 cache misses
  • TLB misses
  • Ignore cold misses
  • Only consider a directly-mapped cache; analyze capacity and conflict misses together
  • Simple analysis for capacity and conflict misses, because accesses to the reduction object are random (one plausible form of such an estimate is sketched below)
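The sketch below shows one plausible form such a simple estimate can take for a directly-mapped cache under uniformly random accesses; this exact expression is an assumption for illustration, not a formula quoted from the paper.

    // One plausible capacity/conflict miss-rate estimate under uniformly
    // random accesses to the reduction object; an illustrative assumption,
    // not the paper's exact expression.
    #include <iostream>

    // cache_bytes: cache capacity available to the reduction object
    // object_bytes: size of the reduction object
    double est_miss_rate(double cache_bytes, double object_bytes) {
        if (object_bytes <= cache_bytes) return 0.0;  // fits entirely; cold misses ignored
        return 1.0 - cache_bytes / object_bytes;      // fraction of accesses to the non-resident part
    }

    int main() {
        // Example: 8 MB reduction object with 1 MB of cache -> ~87.5% misses
        std::cout << est_miss_rate(1.0 * 1024 * 1024, 8.0 * 1024 * 1024) << '\n';
    }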

16
Modeling Coherence Cache Misses
  • A coherence miss occurs when a cache block is invalidated by another CPU
  • Analyze the probability that, between two accesses to a cache block on a processor, the same memory block is accessed and is not updated by one of the other processors in the meantime
  • Details are available in the paper

17
Modeling Memory Contention
  • Input elements displace reduction objects from
    cache
  • This results in a write-back followed by a read operation
  • The memory system on many machines requires extra cycles to switch between write-back and read operations
  • This is a source of contention
  • Modeled using M/D/1 queues, similarly to the waiting time for locks

18
Experimental Platform
  • Small SMP machine
  • Sun Ultra Enterprise 450
  • 4 x 250 MHz Ultra-II processors
  • 1 GB of 4-way interleaved main memory
  • Large SMP machine
  • Sun Fire 6800
  • 24 x 900 MHz Sun UltraSPARC III processors
  • A 96 KB L1 cache and a 64 MB L2 cache per processor
  • 24 GB main memory

19
Impact of Memory Hierarchy, Large SMP
[Chart: measured and predicted performance as the size of the reduction object is scaled, for full replication, optimized full locking, and cache-sensitive locking]
20
Modeling Parallel Performance with Locking,
Large SMP
[Chart: parallel performance with cache-sensitive locking and small reduction object sizes, with 1, 2, 4, 8, and 12 threads]
21
Modeling Parallel Performance, Large SMP
[Chart: performance of optimized full locking with large reduction object sizes, with 1, 2, 4, 8, and 12 threads]
22
How Good is the Model in Predicting Relative Performance? (Large SMP)
[Chart: performance of optimized full locking and cache-sensitive locking with 12 threads]
23
Impact of Memory Hierarchy, Small SMP
[Chart: measured and predicted performance as the size of the reduction object is scaled, for full replication, optimized full locking, and cache-sensitive locking]
24
Parallel Performance, Small SMP
[Chart: performance of optimized full locking with 1, 2, and 3 threads]
25
Summary
  • A new application of performance modeling
  • Choosing among different parallelization
    techniques
  • Detailed analytical model capturing memory
    hierarchy and parallelization overheads
  • Evaluated on two different SMP machines
  • Predicted performance within 20% in almost all cases
  • Effectively captures the impact of both memory hierarchy and parallelization overheads
  • Quite accurate in predicting the relative
    performance of different techniques