Title: Replicating Memory Behavior for Performance Skeletons
1. Replicating Memory Behavior for Performance Skeletons
By Aditya Toomula (PC-Doctor Inc., Reno, NV) and Jaspal Subhlok (University of Houston, Houston, TX)
2. Resource Selection for Grid Applications
[Figure: a distributed application with components (Model, Data, GUI, Sim 1, Pre, Stream) to be mapped onto a network. Where is the best performance?]
3. Motivation
- Estimating performance of an application in a dynamically changing grid environment.
  - Estimation based on generic system probes (like NWS) is expensive and error prone.
- Estimating performance for micro-architectural simulations.
  - Executing the full application is prohibitively expensive.
4. Our Approach
[Figure: the same distributed application (Model, Data, GUI, Sim 1, Pre, Stream) on a network.]
Predict application performance by running a small program representative of the actual distributed application.
5. Performance Skeletons
- A synthetically generated, short-running program.
- A skeleton reflects the performance of the application it represents in any execution scenario.
  - E.g., skeleton execution time is always 1/1000th of application execution time.
- An application and its skeleton should have similar execution activities for the above to be true:
  - Communication activity
  - CPU activity
  - Memory access pattern
6. Memory Skeleton
Given an executable application, construct a short-running skeleton program whose memory access behavior is representative of the application.
- An application and its memory skeleton should have similar cache performance for any cache hierarchy.
- Solution approach: create a program that recreates the memory accesses in a sequence of representative slices of the executing program.
7. Challenges in Creating a Memory Skeleton
- The memory trace is prohibitively large even for a few minutes of execution.
  - Solution approach: sampling and compression, lossy if necessary.
- Recreating memory accesses from a trace is difficult.
  - The cache is corrupted by management code.
  - Recreation has substantial overhead: several instructions have to be executed to issue a single memory access request.
  - Solution approach: avoid cache corruption and allow reordering that minimizes overhead per access.
8. Memory Access Behavior of Applications
- Two types of locality:
  - Spatial locality: if one memory location is accessed, then nearby memory locations are also likely to be accessed.
  - Temporal locality: if something is accessed once, it is likely to be accessed again soon.
These locality principles should be preserved in the memory skeleton.
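The two locality types can be illustrated with a hypothetical array walk (the loop below is an illustration, not code from the paper): sequential traversal gives spatial locality, and re-reading the same elements on every pass gives temporal locality.

```c
#include <stddef.h>

/* Sum an array `repeats` times. The sequential i-loop exhibits
 * spatial locality (a[i], a[i+1], ... are neighbors in memory);
 * revisiting the whole array on each pass exhibits temporal
 * locality, since every element is re-accessed soon after its
 * previous use (provided the array fits in cache). */
long sum_with_locality(const int *a, size_t n, int repeats)
{
    long total = 0;
    for (int r = 0; r < repeats; r++)
        for (size_t i = 0; i < n; i++)   /* spatial: neighbors in order */
            total += a[i];               /* temporal: a[i] reused each pass */
    return total;
}
```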
9. Automatic Skeleton Construction Framework
[Figure: the application feeding the framework, which produces a skeleton.]
1. Collect data address trace samples of the application.
2. Summarize the trace samples.
3. Generate the memory skeleton.
10. Automatic Skeleton Construction Framework, Step 1: Collect data address trace samples of the application.
11. Address Trace Collection
- Link the application executable with a Valgrind tool.
  - Generates the address trace of the application.
  - Access to source code is not required.
- Issues:
  - Unacceptable levels of storage space and time overhead.
- Hence, sampling of the address trace must be done during trace collection itself; collecting full traces of applications is prohibitively expensive.
12. Address Trace Collection (contd.)
- Trace sampling:
  - Divide the trace into trace slices: sets of consecutive memory references.
  - The tool can be periodically switched on and off to capture these slices.
  - Slices can be collected at random or uniform intervals.
- Slice size should be at least one order of magnitude greater than the largest cache expected, in order to capture temporal locality.
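Uniform slice sampling can be sketched as a simple decision on each reference's position in the trace. The constants and function name below are illustrative, not from the paper's tool; a slice of SLICE_SIZE consecutive references is kept out of every SAMPLING_RATIO slices.

```c
#include <stddef.h>

#define SLICE_SIZE 1000000UL   /* consecutive references per slice */
#define SAMPLING_RATIO 10UL    /* keep 1 slice out of every 10 */

/* Return nonzero if the reference at position `index` in the full
 * trace falls inside a sampled slice (uniform intervals). */
int in_sampled_slice(unsigned long index)
{
    unsigned long slice = index / SLICE_SIZE;
    return (slice % SAMPLING_RATIO) == 0;
}
```

In a Valgrind-based collector this predicate would gate whether each observed reference is written out, so only 1/10th of the trace is ever stored.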
13. Automatic Skeleton Construction Framework, Step 2: Summarize the trace samples.
14. Trace Compaction
- The recorded trace is still large and expensive to recreate.
- Compress the trace using the following two ideas:
  - The exact address in a trace is not critical; a nearby address will work (may affect spatial locality).
  - Slight reordering of the address trace does not affect performance (may affect temporal locality).
- This is lossy compression, but the impact on locality can be reduced to be negligible.
15. Trace Compaction (contd.)
- Divide the address space into lines the size of a typical cache line; record only the line number, not the full address.
  - Impact on spatial locality should be minimal.
- Divide the temporal sequence of line numbers into clusters. Reordering within a cluster is allowed.
  - Cluster size should be much smaller than the smallest expected cache size, so temporal locality is not affected by reordering.
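The compaction step above can be sketched as follows, assuming 64-byte lines. Addresses within one cluster are mapped to line numbers and merged into (line, frequency) pairs; since reordering within a cluster is allowed, the ordering of distinct lines is not preserved. All names are illustrative.

```c
#include <stddef.h>

#define LINE_BITS 6  /* 64-byte cache lines */

typedef struct {
    unsigned long line;  /* cache line number */
    unsigned long freq;  /* times accessed within the cluster */
} LineFreq;

static unsigned long addr_to_line(unsigned long addr)
{
    return addr >> LINE_BITS;
}

/* Summarize `n` addresses of one cluster into line/frequency pairs.
 * Returns the number of distinct lines written to `out` (which must
 * have room for `n` entries). A quadratic scan is acceptable for the
 * small clusters this sketch assumes. */
size_t summarize_cluster(const unsigned long *addrs, size_t n, LineFreq *out)
{
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long line = addr_to_line(addrs[i]);
        size_t j;
        for (j = 0; j < distinct; j++) {
            if (out[j].line == line) { out[j].freq++; break; }
        }
        if (j == distinct) {
            out[distinct].line = line;
            out[distinct].freq = 1;
            distinct++;
        }
    }
    return distinct;
}
```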
16. Automatic Skeleton Construction Framework, Step 3: Generate the memory skeleton.
17. Memory Skeleton Generation
- Create a C program that synthetically generates the sequence of memory references recorded in the previous step.
- Challenges:
  - Minimizing extraneous address references.
    - Any executing program has memory accesses of its own.
    - Generate a loop structure for each cluster.
    - Reading line-number/frequency pairs once leads to a series of actual memory references from the trace without intervening address reads.
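The per-cluster loop structure can be sketched like this, assuming the compacted trace is an array of (line, frequency) pairs as in the previous step. Each pair is read once, then the corresponding line in a simulated memory region is touched `freq` times, so trace reads do not intervene between the generated references. Names and sizes are illustrative.

```c
#include <stddef.h>

#define LINE_SIZE 64UL
#define MEM_SIZE  (1024UL * LINE_SIZE)  /* small simulated region */

typedef struct { unsigned long line; unsigned long freq; } LineFreq;

/* volatile so the generated reads are not optimized away */
static volatile unsigned char memory_region[MEM_SIZE];

/* Replay one cluster; returns the total number of references issued. */
unsigned long replay_cluster(const LineFreq *pairs, size_t npairs)
{
    unsigned long issued = 0;
    for (size_t i = 0; i < npairs; i++) {
        /* one trace read per pair ... */
        size_t offset = (pairs[i].line % (MEM_SIZE / LINE_SIZE)) * LINE_SIZE;
        /* ... followed by `freq` uninterrupted memory references */
        for (unsigned long k = 0; k < pairs[i].freq; k++) {
            (void)memory_region[offset];
            issued++;
        }
    }
    return issued;
}
```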
18. Skeleton Generation (contd.)
- Eliminating cache corruption:
  - Reading trace data from disk impacts the memory simulation.
  - Use a 2nd machine: read data on one machine and send it through sockets to the other machine, where the simulation is done.
  - The socket buffer is kept to a very small size.
  - This also reduces overhead on the main simulation machine.
19. Skeleton Generation (contd.)
- Allocating memory:
  - The regions of virtual memory that will actually be used are not known prior to simulation.
- Dynamic block allocation:
  - A substantial-size block of memory is allocated when an address reference is made to a location that is not allocated.
  - Maintain a sparse index table to access the blocks.
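Dynamic block allocation behind a sparse index table can be sketched as follows, assuming 1 MB blocks. The table maps the high bits of a simulated address to a lazily allocated block; a block is only allocated the first time an address inside it is referenced. Names and sizes are illustrative, not from the paper's implementation.

```c
#include <stdlib.h>
#include <stddef.h>

#define BLOCK_BITS 20                   /* 1 MB blocks */
#define BLOCK_SIZE (1UL << BLOCK_BITS)
#define TABLE_SLOTS 4096                /* covers a 4 GB simulated space */

static unsigned char *index_table[TABLE_SLOTS]; /* sparse: mostly NULL */

/* Translate a simulated address to real storage, allocating the
 * containing block on first touch. Returns NULL if the address is out
 * of range or allocation fails. */
unsigned char *touch(unsigned long addr)
{
    unsigned long slot = addr >> BLOCK_BITS;
    if (slot >= TABLE_SLOTS)
        return NULL;
    if (index_table[slot] == NULL) {
        index_table[slot] = calloc(BLOCK_SIZE, 1);
        if (index_table[slot] == NULL)
            return NULL;
    }
    return index_table[slot] + (addr & (BLOCK_SIZE - 1));
}

/* Count how many blocks are currently allocated. */
size_t blocks_allocated(void)
{
    size_t n = 0;
    for (size_t i = 0; i < TABLE_SLOTS; i++)
        if (index_table[i] != NULL)
            n++;
    return n;
}
```

Only the blocks the replayed trace actually touches consume real memory, which is why the skeleton's footprint tracks the application's without knowing its address ranges in advance.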
20. Experiments and Results
- Skeletons constructed for Class-W NAS serial benchmarks (and the Class-A IS benchmark).
- Experiments conducted on Intel Xeon dual-CPU 1.7 GHz machines with a 256 KB 8-way set-associative L2 cache and 64-byte lines, running Linux 2.4.7.
- Objectives:
  - Prediction of the cache miss ratio of the corresponding applications.
  - Predictions across different memory hierarchies.
- Trace slices were picked uniformly throughout the trace for all experiments.
21. Prediction of Cache Miss Ratios
- Trace sampling ratio: 10%
- Trace slice size: > 10 million references; number of slices picked: > 10
- Average error: < 5%
- The IS application is an exception.
[Figure: comparison of data cache miss rates of benchmarks and corresponding memory skeletons.]
22. Impact of Trace Slice Selection
- Traces for the IS, BT and MG benchmarks were divided into 100 uniform slices.
- 10 different versions of skeletons were generated, each using a different set of 10 uniformly spaced trace slices.
- MG and BT have similar cache miss rates in all cases.
- IS shows significant variation in cache miss prediction with different sets of slices.
  - Reason: IS execution goes through different phases with different memory access behavior, unlike MG and BT.
[Figure: data cache miss rates for different sets of trace slices in skeletons. Actual data cache miss rates: IS 3.9%, BT 2.76%, MG 1.57%.]
23. Impact of Trace Slice Selection (contd.)
- The greater the number of trace slices in an application trace, the smaller the size of each slice.
- Having a large number of slices in the trace captures the multiphase behavior of applications.
[Figure: data cache miss rates of IS skeletons for different sets of slices and different numbers of slices, compared against the true cache miss ratio.]
24. Impact of Trace Slice Size
- The cache miss ratio prediction error increases rapidly when slice sizes are reduced below a certain point.
[Figure: error in cache miss prediction with memory skeletons for different trace slice sizes for MG.]
25. Prediction Across Hardware Platforms
- Cache miss ratios were predicted fairly accurately, with error < 5% across all machines.
[Figure: cache miss comparison of the CG benchmark and its skeleton across different memory hierarchies.]
26. Conclusion and Discussion
- Presents a methodology to build memory skeletons for prediction of application cache miss ratios across hardware platforms.
- A step towards building good performance skeletons.
- Extends our group's previous work on skeletons to memory characteristics.
- Major contribution: low-overhead generation of memory accesses from a trace.
27. Conclusion and Discussion (contd.)
- Limitations:
  - Instruction references
  - Space and time overhead
  - Timing accuracy
  - Integration with communication and CPU events