Title: Replicating Memory Behavior for Performance Skeletons
1. Replicating Memory Behavior for Performance Skeletons
By Aditya Toomula (PC-Doctor Inc., Reno, NV) and Jaspal Subhlok (University of Houston, Houston, TX)
2. Resource Selection for Grid Applications
[Figure: a distributed application with components (Model, Data, GUI, Sim 1, Pre, Stream) to be mapped onto a network. Where is the best performance?]
3. Motivation
- Estimating performance of an application in a dynamically changing grid environment.
  - Estimation based on generic system probes (like NWS) is expensive and error prone.
- Estimating performance for micro-architectural simulations.
  - Executing the full application is prohibitively expensive.
4. Our Approach
[Figure: the same distributed application (Model, Data, GUI, Sim 1, Pre, Stream) on a network.]
Predict application performance by running a small program representative of the actual distributed application.
5. Performance Skeletons
- A synthetically generated, short-running program.
- A skeleton reflects the performance of the application it represents in any execution scenario.
  - E.g., skeleton execution time is always 1/1000th of application execution time.
- An application and its skeleton should have similar execution activities for the above to be true:
  - Communication activity
  - CPU activity
  - Memory access pattern
6. Memory Skeleton
Given an executable application, construct a short-running skeleton program whose memory access behavior is representative of the application.
- An application and its memory skeleton should have similar cache performance for any cache hierarchy.
- Solution approach: create a program that recreates the memory accesses in a sequence of representative slices of the executing program.
7. Challenges in Creating a Memory Skeleton
- The memory trace is prohibitively large even for a few minutes of execution.
  - Solution approach: sampling and compression, lossy if necessary.
- Recreating memory accesses from a trace is difficult.
  - The cache is corrupted by management code.
  - Recreation has substantial overhead: several instructions have to be executed to issue a single memory access request.
  - Solution approach: avoid cache corruption and allow reordering that minimizes overhead per access.
8. Memory Access Behavior of Applications
- Two types of locality:
  - Spatial locality: if one memory location is accessed, then nearby memory locations are also likely to be accessed.
  - Temporal locality: if something is accessed once, it is likely to be accessed again soon.
These locality principles should be preserved in the memory skeleton.
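The two locality types can be illustrated with a hypothetical array walk (the loop below is an illustration, not code from the paper): sequential traversal gives spatial locality, and re-reading the same elements on every pass gives temporal locality.

```c
#include <stddef.h>

/* Sum an array `repeats` times. The sequential i-loop exhibits
 * spatial locality (a[i], a[i+1], ... are neighbors in memory);
 * revisiting the whole array on each pass exhibits temporal
 * locality, since every element is re-accessed soon after its
 * previous use (provided the array fits in cache). */
long sum_with_locality(const int *a, size_t n, int repeats)
{
    long total = 0;
    for (int r = 0; r < repeats; r++)
        for (size_t i = 0; i < n; i++)   /* spatial: neighbors in order */
            total += a[i];               /* temporal: a[i] reused each pass */
    return total;
}
```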
9. Automatic Skeleton Construction Framework
[Figure: the application feeding the framework, which produces a skeleton.]
1. Collect data address trace samples of the application.
2. Summarize the trace samples.
3. Generate the memory skeleton.
10. Automatic Skeleton Construction Framework, Step 1: Collect data address trace samples of the application.
11. Address Trace Collection
- Link the application executable with a Valgrind tool.
  - Generates the address trace of the application.
  - Access to source code is not required.
- Issues:
  - Unacceptable levels of storage space and time overhead.
- Hence, sampling of the address trace must be done during trace collection itself; collecting full traces of applications is prohibitively expensive.
12. Address Trace Collection (contd.)
- Trace sampling:
  - Divide the trace into trace slices: sets of consecutive memory references.
  - The tool can be periodically switched on and off to capture these slices.
  - Slices can be collected at random or uniform intervals.
- Slice size should be at least one order of magnitude greater than the largest cache expected, in order to capture temporal locality.
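Uniform slice sampling can be sketched as a simple decision on each reference's position in the trace. The constants and function name below are illustrative, not from the paper's tool; a slice of SLICE_SIZE consecutive references is kept out of every SAMPLING_RATIO slices.

```c
#include <stddef.h>

#define SLICE_SIZE 1000000UL   /* consecutive references per slice */
#define SAMPLING_RATIO 10UL    /* keep 1 slice out of every 10 */

/* Return nonzero if the reference at position `index` in the full
 * trace falls inside a sampled slice (uniform intervals). */
int in_sampled_slice(unsigned long index)
{
    unsigned long slice = index / SLICE_SIZE;
    return (slice % SAMPLING_RATIO) == 0;
}
```

In a Valgrind-based collector this predicate would gate whether each observed reference is written out, so only 1/10th of the trace is ever stored.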
13. Automatic Skeleton Construction Framework, Step 2: Summarize the trace samples.
14. Trace Compaction
- The recorded trace is still large and expensive to recreate.
- Compress the trace using the following two ideas:
  - The exact address in a trace is not critical; a nearby address will work (may affect spatial locality).
  - Slight reordering of the address trace does not affect performance (may affect temporal locality).
- This is lossy compression, but the impact on locality can be reduced to be negligible.
15. Trace Compaction (contd.)
- Divide the address space into lines the size of a typical cache line; record only the line number, not the full address.
  - Impact on spatial locality should be minimal.
- Divide the temporal sequence of line numbers into clusters. Reordering within a cluster is allowed.
  - Cluster size should be much smaller than the smallest expected cache size, so temporal locality is not affected by reordering.
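The compaction step above can be sketched as follows, assuming 64-byte lines. Addresses within one cluster are mapped to line numbers and merged into (line, frequency) pairs; since reordering within a cluster is allowed, the ordering of distinct lines is not preserved. All names are illustrative.

```c
#include <stddef.h>

#define LINE_BITS 6  /* 64-byte cache lines */

typedef struct {
    unsigned long line;  /* cache line number */
    unsigned long freq;  /* times accessed within the cluster */
} LineFreq;

static unsigned long addr_to_line(unsigned long addr)
{
    return addr >> LINE_BITS;
}

/* Summarize `n` addresses of one cluster into line/frequency pairs.
 * Returns the number of distinct lines written to `out` (which must
 * have room for `n` entries). A quadratic scan is acceptable for the
 * small clusters this sketch assumes. */
size_t summarize_cluster(const unsigned long *addrs, size_t n, LineFreq *out)
{
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long line = addr_to_line(addrs[i]);
        size_t j;
        for (j = 0; j < distinct; j++) {
            if (out[j].line == line) { out[j].freq++; break; }
        }
        if (j == distinct) {
            out[distinct].line = line;
            out[distinct].freq = 1;
            distinct++;
        }
    }
    return distinct;
}
```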
16. Automatic Skeleton Construction Framework, Step 3: Generate the memory skeleton.
17. Memory Skeleton Generation
- Create a C program that synthetically generates the sequence of memory references recorded in the previous step.
- Challenges:
  - Minimizing extraneous address references.
    - Any executing program has memory accesses of its own.
    - Generate a loop structure for each cluster.
    - Reading line-number/frequency pairs once leads to a series of actual memory references from the trace without intervening address reads.
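The per-cluster loop structure can be sketched like this, assuming the compacted trace is an array of (line, frequency) pairs as in the previous step. Each pair is read once, then the corresponding line in a simulated memory region is touched `freq` times, so trace reads do not intervene between the generated references. Names and sizes are illustrative.

```c
#include <stddef.h>

#define LINE_SIZE 64UL
#define MEM_SIZE  (1024UL * LINE_SIZE)  /* small simulated region */

typedef struct { unsigned long line; unsigned long freq; } LineFreq;

/* volatile so the generated reads are not optimized away */
static volatile unsigned char memory_region[MEM_SIZE];

/* Replay one cluster; returns the total number of references issued. */
unsigned long replay_cluster(const LineFreq *pairs, size_t npairs)
{
    unsigned long issued = 0;
    for (size_t i = 0; i < npairs; i++) {
        /* one trace read per pair ... */
        size_t offset = (pairs[i].line % (MEM_SIZE / LINE_SIZE)) * LINE_SIZE;
        /* ... followed by `freq` uninterrupted memory references */
        for (unsigned long k = 0; k < pairs[i].freq; k++) {
            (void)memory_region[offset];
            issued++;
        }
    }
    return issued;
}
```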
18. Skeleton Generation (contd.)
- Eliminating cache corruption:
  - Reading trace data from disk impacts the memory simulation.
  - Use a 2nd machine: read data on one machine and send it through sockets to the other machine, where the simulation is done.
  - The socket buffer is kept to a very small size.
  - This also reduces overhead on the main simulation machine.
19. Skeleton Generation (contd.)
- Allocating memory:
  - The regions of virtual memory that will actually be used are not known prior to simulation.
- Dynamic block allocation:
  - A substantial-size block of memory is allocated when an address reference is made to a location that is not allocated.
  - Maintain a sparse index table to access the blocks.
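Dynamic block allocation behind a sparse index table can be sketched as follows, assuming 1 MB blocks. The table maps the high bits of a simulated address to a lazily allocated block; a block is only allocated the first time an address inside it is referenced. Names and sizes are illustrative, not from the paper's implementation.

```c
#include <stdlib.h>
#include <stddef.h>

#define BLOCK_BITS 20                   /* 1 MB blocks */
#define BLOCK_SIZE (1UL << BLOCK_BITS)
#define TABLE_SLOTS 4096                /* covers a 4 GB simulated space */

static unsigned char *index_table[TABLE_SLOTS]; /* sparse: mostly NULL */

/* Translate a simulated address to real storage, allocating the
 * containing block on first touch. Returns NULL if the address is out
 * of range or allocation fails. */
unsigned char *touch(unsigned long addr)
{
    unsigned long slot = addr >> BLOCK_BITS;
    if (slot >= TABLE_SLOTS)
        return NULL;
    if (index_table[slot] == NULL) {
        index_table[slot] = calloc(BLOCK_SIZE, 1);
        if (index_table[slot] == NULL)
            return NULL;
    }
    return index_table[slot] + (addr & (BLOCK_SIZE - 1));
}

/* Count how many blocks are currently allocated. */
size_t blocks_allocated(void)
{
    size_t n = 0;
    for (size_t i = 0; i < TABLE_SLOTS; i++)
        if (index_table[i] != NULL)
            n++;
    return n;
}
```

Only the blocks the replayed trace actually touches consume real memory, which is why the skeleton's footprint tracks the application's without knowing its address ranges in advance.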
20. Experiments and Results
- Skeletons constructed for Class-W NAS serial benchmarks (and the Class-A IS benchmark).
- Experiments conducted on Intel Xeon dual-CPU 1.7 GHz machines with a 256 KB 8-way set-associative L2 cache and 64-byte lines, running Linux 2.4.7.
- Objectives:
  - Prediction of the cache miss ratio of the corresponding applications.
  - Predictions across different memory hierarchies.
- Trace slices were picked uniformly throughout the trace for all experiments.
21. Prediction of Cache Miss Ratios
- Trace sampling ratio: 10%
- Trace slice size: > 10 million references; number of slices picked: > 10
- Average error: < 5%
- The IS application is an exception.
[Figure: comparison of data cache miss rates of benchmarks and corresponding memory skeletons.]
22. Impact of Trace Slice Selection
- Traces for the IS, BT and MG benchmarks were divided into 100 uniform slices.
- 10 different versions of skeletons were generated, each using a different set of 10 uniformly spaced trace slices.
- MG and BT have similar cache miss rates in all cases.
- IS shows significant variation in cache miss prediction with different sets of slices.
  - Reason: IS execution goes through different phases with different memory access behavior, unlike MG and BT.
[Figure: data cache miss rates for different sets of trace slices in skeletons. Actual data cache miss rates: IS 3.9%, BT 2.76%, MG 1.57%.]
23. Impact of Trace Slice Selection (contd.)
- The greater the number of trace slices in an application trace, the smaller the size of each slice.
- Having a large number of slices in the trace captures the multiphase behavior of applications.
[Figure: data cache miss rates of IS skeletons for different sets of slices and different numbers of slices, compared against the true cache miss ratio.]
24. Impact of Trace Slice Size
- The cache miss ratio prediction error increases rapidly when slice sizes are reduced below a certain point.
[Figure: error in cache miss prediction with memory skeletons for different trace slice sizes for MG.]
25. Prediction Across Hardware Platforms
- Cache miss ratios were predicted fairly accurately, with error < 5% across all machines.
[Figure: cache miss comparison of the CG benchmark and its skeleton across different memory hierarchies.]
26. Conclusion and Discussion
- Presents a methodology to build memory skeletons for prediction of application cache miss ratios across hardware platforms.
- A step towards building good performance skeletons.
- Extends our group's previous work on skeletons to memory characteristics.
- Major contribution: low-overhead generation of memory accesses from a trace.
27. Conclusion and Discussion (contd.)
- Limitations:
  - Instruction references
  - Space and time overhead
  - Timing accuracy
  - Integration with communication and CPU events