Title: Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs
1. Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs
- Ruoming Jin
- Gagan Agrawal
- Department of Computer and Information Sciences
- Ohio State University
2. Outline
- Motivation
- Random Write Reductions and Parallelization Techniques
- Problem Definition
- Analytical Model
- General Approach
- Modeling Cache and TLB
- Modeling waiting for locks and memory contention
- Experimental Validation
- Conclusions
3. Motivation
- Frequently need to mine very large datasets
- Large and powerful SMP machines are becoming available
- Vendors often target data mining and data warehousing as the main market
- Data mining emerging as an important class of applications for SMP machines
4. Common Processing Structure
- Structure of common data mining algorithms:
    Outer Sequential Loop:
      While ()
        Reduction Loop:
          Foreach (element e)
            (i, val) = process(e)
            Reduc(i) = Reduc(i) op val
- Applies to major association mining, clustering and decision tree construction algorithms
- How to parallelize it on a shared memory machine?
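As a sketch, the reduction structure above might look like the following in C. Here `process()`, `NUM_BINS`, and the use of `+` as the reduction `op` are hypothetical stand-ins for algorithm-specific logic:

```c
#include <stddef.h>

#define NUM_BINS 8  /* hypothetical reduction-object size */

/* Stand-in for algorithm-specific processing: maps an input element e
   to a reduction-object index i and a contribution val. */
static void process(int e, size_t *i, int *val)
{
    *i = (size_t)(e % NUM_BINS);
    *val = e;
}

/* The reduction loop: Reduc(i) = Reduc(i) op val, with op = + here. */
static void reduction_loop(const int *data, size_t n, long reduc[NUM_BINS])
{
    for (size_t k = 0; k < n; k++) {
        size_t i;
        int val;
        process(data[k], &i, &val);
        reduc[i] += val;  /* random write to the reduction object */
    }
}
```

Because `i` is only known after `process(e)` runs, which reduction elements each iteration touches cannot be predicted in advance; this is the property the next slides build on.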
5. Challenges in Parallelization
- Statically partitioning the reduction object to avoid race conditions is generally impossible
- Runtime preprocessing or scheduling also cannot be applied
  - Can't tell what you need to update without processing the element
- The size of the reduction object means significant memory overheads for replication
- Locking and synchronization costs could be significant because of the fine-grained updates to the reduction object
6. Parallelization Techniques
- Full Replication: create a copy of the reduction object for each thread
- Full Locking: associate a lock with each element
- Optimized Full Locking: put the element and corresponding lock on the same cache block
- Cache-Sensitive Locking: one lock for all elements in a cache block
7. Memory Layout for Locking Schemes
[Figure: memory layouts for Optimized Full Locking and Cache-Sensitive Locking, showing how locks and reduction elements are placed in cache blocks]
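A minimal C sketch of the two layouts. The cache-block size and the number of elements per block are illustrative assumptions; the right values depend on the platform's cache-block and mutex sizes:

```c
#include <pthread.h>

#define CACHE_BLOCK 64  /* assumed cache-block size in bytes */

/* Optimized full locking: each reduction element shares a cache block
   with its own lock, so an update touches a single block. */
struct ofl_elem {
    pthread_mutex_t lock;
    long value;
} __attribute__((aligned(CACHE_BLOCK)));

/* Cache-sensitive locking: one lock guards all elements packed into
   the same cache block, cutting the memory spent on locks. */
#define ELEMS_PER_BLOCK 3  /* illustrative; tune to mutex + block size */
struct csl_block {
    pthread_mutex_t lock;
    long values[ELEMS_PER_BLOCK];
} __attribute__((aligned(CACHE_BLOCK)));
```

The `aligned` attribute (GCC/Clang syntax) pads each struct to a whole number of cache blocks, so no element-plus-lock pair straddles a block boundary.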
8. Relative Experimental Performance
Different techniques can outperform each other depending upon problem and machine parameters.
9. Problem Definition
- Can we predict the relative performance of different techniques for given machine, algorithm and dataset parameters?
- Develop an analytical model capturing the impact of the memory hierarchy and modeling different parallelization overheads
- Other applications of the model:
  - Predicting speedups possible on parallel configurations
  - Predicting performance as the output size is increased
  - Scheduling and QoS in multiprogrammed environments
  - Choosing accuracy of analysis and sampling rate in an interactive environment or when mining over data streams
10. Context
- Part of the FREERIDE (Framework for Rapid Implementation of Datamining Engines) system:
  - Supports parallelization on shared-nothing configurations
  - Supports parallelization on shared memory configurations
  - Supports processing of large datasets
- Previously reported our work on parallelization techniques and processing of disk-resident datasets (SDM 01, SDM 02)
11. Analytical Model Overview
- Input data is read from disks: constant processing time
- Reduction elements are accessed randomly; their size can vary considerably
- Factors to model:
  - Cache misses on reduction elements -> capacity and coherence
  - TLB misses on reduction elements
  - Waiting time for locks
  - Memory contention
12. Basic Approach
- Focus on modeling reduction loops
- Tloop = Taverage * N
- Taverage = Tcompute + Treduc
- Treduc = Tupdate + Twait + Tcache_miss + Ttlb_miss + Tmemory_contention
- Tupdate can be computed by executing the loop with a reduction object that fits into the L1 cache
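Plugging illustrative numbers into these formulas (all per-iteration costs below are made up, not measurements):

```c
/* Evaluate the slide's cost model for one reduction loop:
     Treduc   = Tupdate + Twait + Tcache_miss + Ttlb_miss + Tmem_contention
     Taverage = Tcompute + Treduc
     Tloop    = Taverage * N                                               */
static double t_loop(double t_compute, double t_update, double t_wait,
                     double t_cache_miss, double t_tlb_miss,
                     double t_mem_contention, double n)
{
    double t_reduc = t_update + t_wait + t_cache_miss + t_tlb_miss
                     + t_mem_contention;
    return (t_compute + t_reduc) * n;
}
```

For example, with Tcompute = 100 cycles, Treduc components summing to 50 cycles, and N = 1000 iterations, the model gives Tloop = 150,000 cycles.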
13. Modeling Waiting Time for Locks
- The time spent by a thread in one iteration of the loop can be divided into three components:
  - Computing independently (a)
  - Waiting for a lock (Twait)
  - Holding a lock (b), where b = Treduc - Twait
- Each lock is an M/D/1 queue
- With t threads and m locks, the rate at which requests to acquire each lock are issued is
    lambda = t / ((a + b + Twait) * m)
14. Modeling Waiting Time for Locks (contd.)
- Standard result on M/D/1 queues:
    Twait = b * U / (2 * (1 - U))
  where U is the server utilization, given by U = lambda * b
- Resulting closed form:
    Twait = b / (2 * ((a/b + 1) * (m/t) - 1))
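As a consistency check, the closed form above follows from the M/D/1 result when U is approximated as t*b / ((a + b) * m), i.e. dropping Twait from the arrival rate. The numbers in the comment are illustrative, not measurements:

```c
/* Closed-form waiting time from the slide:
     Twait = b / (2 * ((a/b + 1) * (m/t) - 1))
   with t threads, m locks, independent-compute time a, and lock-hold
   time b per iteration.  Equivalent to Twait = b*U / (2*(1-U)) with
   server utilization U ~ t*b / ((a+b)*m), since 1/U = (a/b+1)*(m/t). */
static double t_wait(double a, double b, double m, double t)
{
    return b / (2.0 * ((a / b + 1.0) * (m / t) - 1.0));
}
```

For instance, with a = 9, b = 1, m = 2 locks and t = 4 threads, U = 4 / (10 * 2) = 0.2 and both forms give Twait = 0.125.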
15. Modeling Memory Hierarchy
- Need to model:
  - L1 cache
  - L2 cache
  - TLB misses
- Ignore cold misses
- Only consider directly-mapped caches; analyze capacity and conflict misses together
- Simple analysis for capacity and conflict misses because of the random accesses to the reduction object
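One way such a simple analysis could look (this reading is my assumption, not the paper's exact derivation): with uniformly random accesses, a direct-mapped cache of C blocks keeps at most C of the reduction object's R blocks resident, so the expected miss probability is roughly 1 - C/R once R exceeds C:

```c
/* Approximate capacity/conflict miss probability for uniformly random
   accesses to a reduction object of object_blocks cache blocks in a
   direct-mapped cache of cache_blocks blocks.  A sketch under the
   assumptions stated above, not the paper's exact derivation. */
static double miss_prob(double cache_blocks, double object_blocks)
{
    if (object_blocks <= cache_blocks)
        return 0.0;  /* object fits in cache; cold misses are ignored */
    return 1.0 - cache_blocks / object_blocks;
}
```

E.g., a reduction object four times the cache size would miss on roughly three accesses out of four.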
16. Modeling Coherence Cache Misses
- A coherence miss occurs when a cache block is invalidated by another CPU
- Analyze the probability that, between two accesses to a cache block on a processor, the same memory block is accessed and this memory block is not updated by one of the other processors in the meantime
- Details are available in the paper
17. Modeling Memory Contention
- Input elements displace reduction objects from cache
- This results in a write-back followed by a read operation
- The memory system on many machines requires extra cycles to switch between write-back and read operations: a source of contention
- Model using M/D/1 queues, similar to waiting time for locks
18. Experimental Platform
- Small SMP machine:
  - Sun Ultra Enterprise 450
  - 4 x 250 MHz Ultra-II processors
  - 1 GB of 4-way interleaved main memory
- Large SMP machine:
  - Sun Fire 6800
  - 24 x 900 MHz Sun UltraSparc III processors
  - A 96 KB L1 cache and a 64 MB L2 cache per processor
  - 24 GB main memory
19. Impact of Memory Hierarchy, Large SMP
[Figure: measured and predicted performance as the size of the reduction object is scaled, for full replication, optimized full locking, and cache-sensitive locking]
20. Modeling Parallel Performance with Locking, Large SMP
[Figure: parallel performance with cache-sensitive locking for small reduction object sizes, with 1, 2, 4, 8, and 12 threads]
21. Modeling Parallel Performance, Large SMP
[Figure: performance of optimized full locking with large reduction object sizes, with 1, 2, 4, 8, and 12 threads]
22. How Good is the Model in Predicting Relative Performance? (Large SMP)
[Figure: performance of optimized full locking and cache-sensitive locking with 12 threads]
23. Impact of Memory Hierarchy, Small SMP
[Figure: measured and predicted performance as the size of the reduction object is scaled, for full replication, optimized full locking, and cache-sensitive locking]
24. Parallel Performance, Small SMP
[Figure: performance of optimized full locking with 1, 2, and 3 threads]
25. Summary
- A new application of performance modeling: choosing among different parallelization techniques
- Detailed analytical model capturing memory hierarchy and parallelization overheads
- Evaluated on two different SMP machines:
  - Predicted performance within 20% in almost all cases
  - Effectively captures the impact of both the memory hierarchy and parallelization overheads
  - Quite accurate in predicting the relative performance of different techniques