Title: Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs
1. Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs
- Ruoming Jin
- Gagan Agrawal
- Department of Computer and Information Sciences
- Ohio State University
2. Outline
- Motivation
- Random Write Reductions and Parallelization Techniques
- Problem Definition
- Analytical Model
- General Approach
- Modeling Cache and TLB
- Modeling waiting for locks and memory contention
- Experimental Validation
- Conclusions
3. Motivation
- Frequently need to mine very large datasets
- Large and powerful SMP machines are becoming available
- Vendors often target data mining and data warehousing as the main market
- Data mining emerging as an important class of applications for SMP machines
4. Common Processing Structure
- Structure of common data mining algorithms:
    Outer Sequential Loop:
      While ()
        Reduction Loop:
          Foreach (element e)
            (i, val) = process(e)
            Reduc(i) = Reduc(i) op val
- Applies to major association mining, clustering and decision tree construction algorithms
- How to parallelize it on a shared memory machine?
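As a sketch, the reduction structure above might look like the following in C. Here `process()`, `NUM_BINS`, and the use of `+` as the reduction `op` are hypothetical stand-ins for algorithm-specific logic:

```c
#include <stddef.h>

#define NUM_BINS 8  /* hypothetical reduction-object size */

/* Stand-in for algorithm-specific processing: maps an input element e
   to a reduction-object index i and a contribution val. */
static void process(int e, size_t *i, int *val)
{
    *i = (size_t)(e % NUM_BINS);
    *val = e;
}

/* The reduction loop: Reduc(i) = Reduc(i) op val, with op = + here. */
static void reduction_loop(const int *data, size_t n, long reduc[NUM_BINS])
{
    for (size_t k = 0; k < n; k++) {
        size_t i;
        int val;
        process(data[k], &i, &val);
        reduc[i] += val;  /* random write to the reduction object */
    }
}
```

Because `i` is only known after `process(e)` runs, which reduction elements each iteration touches cannot be predicted in advance; this is the property the next slides build on.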
5. Challenges in Parallelization
- Statically partitioning the reduction object to avoid race conditions is generally impossible
- Runtime preprocessing or scheduling also cannot be applied
  - Can't tell what you need to update without processing the element
- The size of the reduction object means significant memory overheads for replication
- Locking and synchronization costs could be significant because of the fine-grained updates to the reduction object
6. Parallelization Techniques
- Full Replication: create a copy of the reduction object for each thread
- Full Locking: associate a lock with each element
- Optimized Full Locking: put the element and corresponding lock on the same cache block
- Cache-Sensitive Locking: one lock for all elements in a cache block
7. Memory Layout for Locking Schemes
[Figure: memory layouts for Optimized Full Locking and Cache-Sensitive Locking, showing how locks and reduction elements are placed in cache blocks]
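A minimal C sketch of the two layouts. The cache-block size and the number of elements per block are illustrative assumptions; the right values depend on the platform's cache-block and mutex sizes:

```c
#include <pthread.h>

#define CACHE_BLOCK 64  /* assumed cache-block size in bytes */

/* Optimized full locking: each reduction element shares a cache block
   with its own lock, so an update touches a single block. */
struct ofl_elem {
    pthread_mutex_t lock;
    long value;
} __attribute__((aligned(CACHE_BLOCK)));

/* Cache-sensitive locking: one lock guards all elements packed into
   the same cache block, cutting the memory spent on locks. */
#define ELEMS_PER_BLOCK 3  /* illustrative; tune to mutex + block size */
struct csl_block {
    pthread_mutex_t lock;
    long values[ELEMS_PER_BLOCK];
} __attribute__((aligned(CACHE_BLOCK)));
```

The `aligned` attribute (GCC/Clang syntax) pads each struct to a whole number of cache blocks, so no element-plus-lock pair straddles a block boundary.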
8. Relative Experimental Performance
Different techniques can outperform each other depending upon problem and machine parameters.
9. Problem Definition
- Can we predict the relative performance of different techniques for given machine, algorithm and dataset parameters?
- Develop an analytical model capturing the impact of the memory hierarchy and modeling different parallelization overheads
- Other applications of the model:
  - Predicting speedups possible on parallel configurations
  - Predicting performance as the output size is increased
  - Scheduling and QoS in multiprogrammed environments
  - Choosing accuracy of analysis and sampling rate in an interactive environment or when mining over data streams
10. Context
- Part of the FREERIDE (Framework for Rapid Implementation of Datamining Engines) system:
  - Supports parallelization on shared-nothing configurations
  - Supports parallelization on shared memory configurations
  - Supports processing of large datasets
- Previously reported our work on parallelization techniques and processing of disk-resident datasets (SDM 01, SDM 02)
11. Analytical Model Overview
- Input data is read from disks: constant processing time
- Reduction elements are accessed randomly; their size can vary considerably
- Factors to model:
  - Cache misses on reduction elements -> capacity and coherence
  - TLB misses on reduction elements
  - Waiting time for locks
  - Memory contention
12. Basic Approach
- Focus on modeling reduction loops
- Tloop = Taverage * N
- Taverage = Tcompute + Treduc
- Treduc = Tupdate + Twait + Tcache_miss + Ttlb_miss + Tmemory_contention
- Tupdate can be computed by executing the loop with a reduction object that fits into the L1 cache
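Plugging illustrative numbers into these formulas (all per-iteration costs below are made up, not measurements):

```c
/* Evaluate the slide's cost model for one reduction loop:
     Treduc   = Tupdate + Twait + Tcache_miss + Ttlb_miss + Tmem_contention
     Taverage = Tcompute + Treduc
     Tloop    = Taverage * N                                               */
static double t_loop(double t_compute, double t_update, double t_wait,
                     double t_cache_miss, double t_tlb_miss,
                     double t_mem_contention, double n)
{
    double t_reduc = t_update + t_wait + t_cache_miss + t_tlb_miss
                     + t_mem_contention;
    return (t_compute + t_reduc) * n;
}
```

For example, with Tcompute = 100 cycles, Treduc components summing to 50 cycles, and N = 1000 iterations, the model gives Tloop = 150,000 cycles.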
13. Modeling Waiting Time for Locks
- The time spent by a thread in one iteration of the loop can be divided into three components:
  - Computing independently (a)
  - Waiting for a lock (Twait)
  - Holding a lock (b), where b = Treduc - Twait
- Each lock is an M/D/1 queue
- With t threads and m locks, the rate at which requests to acquire each lock are issued is
    lambda = t / ((a + b + Twait) * m)
14. Modeling Waiting Time for Locks (contd.)
- Standard result on M/D/1 queues:
    Twait = b * U / (2 * (1 - U))
  where U is the server utilization, given by U = lambda * b
- Resulting closed form:
    Twait = b / (2 * ((a/b + 1) * (m/t) - 1))
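As a consistency check, the closed form above follows from the M/D/1 result when U is approximated as t*b / ((a + b) * m), i.e. dropping Twait from the arrival rate. The numbers in the comment are illustrative, not measurements:

```c
/* Closed-form waiting time from the slide:
     Twait = b / (2 * ((a/b + 1) * (m/t) - 1))
   with t threads, m locks, independent-compute time a, and lock-hold
   time b per iteration.  Equivalent to Twait = b*U / (2*(1-U)) with
   server utilization U ~ t*b / ((a+b)*m), since 1/U = (a/b+1)*(m/t). */
static double t_wait(double a, double b, double m, double t)
{
    return b / (2.0 * ((a / b + 1.0) * (m / t) - 1.0));
}
```

For instance, with a = 9, b = 1, m = 2 locks and t = 4 threads, U = 4 / (10 * 2) = 0.2 and both forms give Twait = 0.125.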
15. Modeling Memory Hierarchy
- Need to model:
  - L1 cache
  - L2 cache
  - TLB misses
- Ignore cold misses
- Only consider directly-mapped caches; analyze capacity and conflict misses together
- Simple analysis for capacity and conflict misses because of the random accesses to the reduction object
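One way such a simple analysis could look (this reading is my assumption, not the paper's exact derivation): with uniformly random accesses, a direct-mapped cache of C blocks keeps at most C of the reduction object's R blocks resident, so the expected miss probability is roughly 1 - C/R once R exceeds C:

```c
/* Approximate capacity/conflict miss probability for uniformly random
   accesses to a reduction object of object_blocks cache blocks in a
   direct-mapped cache of cache_blocks blocks.  A sketch under the
   assumptions stated above, not the paper's exact derivation. */
static double miss_prob(double cache_blocks, double object_blocks)
{
    if (object_blocks <= cache_blocks)
        return 0.0;  /* object fits in cache; cold misses are ignored */
    return 1.0 - cache_blocks / object_blocks;
}
```

E.g., a reduction object four times the cache size would miss on roughly three accesses out of four.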
16. Modeling Coherence Cache Misses
- A coherence miss occurs when a cache block is invalidated by another CPU
- Analyze the probability that, between two accesses to a cache block on a processor, the same memory block is accessed and this memory block is not updated by one of the other processors in the meantime
- Details are available in the paper
17. Modeling Memory Contention
- Input elements displace reduction objects from cache
- This results in a write-back followed by a read operation
- The memory system on many machines requires extra cycles to switch between write-back and read operations: a source of contention
- Model using M/D/1 queues, similar to waiting time for locks
18. Experimental Platform
- Small SMP machine:
  - Sun Ultra Enterprise 450
  - 4 x 250 MHz Ultra-II processors
  - 1 GB of 4-way interleaved main memory
- Large SMP machine:
  - Sun Fire 6800
  - 24 x 900 MHz Sun UltraSparc III processors
  - A 96 KB L1 cache and a 64 MB L2 cache per processor
  - 24 GB main memory
19. Impact of Memory Hierarchy, Large SMP
[Figure: measured and predicted performance as the size of the reduction object is scaled, for full replication, optimized full locking, and cache-sensitive locking]
20. Modeling Parallel Performance with Locking, Large SMP
[Figure: parallel performance with cache-sensitive locking for small reduction object sizes, with 1, 2, 4, 8, and 12 threads]
21. Modeling Parallel Performance, Large SMP
[Figure: performance of optimized full locking with large reduction object sizes, with 1, 2, 4, 8, and 12 threads]
22. How Good is the Model in Predicting Relative Performance? (Large SMP)
[Figure: performance of optimized full locking and cache-sensitive locking with 12 threads]
23. Impact of Memory Hierarchy, Small SMP
[Figure: measured and predicted performance as the size of the reduction object is scaled, for full replication, optimized full locking, and cache-sensitive locking]
24. Parallel Performance, Small SMP
[Figure: performance of optimized full locking with 1, 2, and 3 threads]
25. Summary
- A new application of performance modeling: choosing among different parallelization techniques
- Detailed analytical model capturing memory hierarchy and parallelization overheads
- Evaluated on two different SMP machines:
  - Predicted performance within 20% in almost all cases
  - Effectively captures the impact of both the memory hierarchy and parallelization overheads
  - Quite accurate in predicting the relative performance of different techniques