Replicating Memory Behavior for Performance Skeletons - PowerPoint PPT Presentation

Learn more at: https://www2.cs.uh.edu

Transcript
1
Replicating Memory Behavior for Performance
Skeletons
By
Aditya Toomula PC-Doctor Inc. Reno, NV
Jaspal Subhlok University of Houston Houston, TX
2
Resource Selection for Grid Applications
[Diagram: a distributed application with components Model, Data, GUI, Sim 1, Pre, and Stream, to be mapped onto a network - where is the best performance?]
3
Motivation
  • Estimating performance of an application in a
    dynamically changing grid environment.
  • - Estimation based on generic system probes
    (like NWS) is expensive and error-prone.
  • Estimating performance for micro-architectural
    simulations.
  • - Executing the full application is prohibitively
    expensive.

4
Our Approach
[Diagram: the distributed application (Model, Data, GUI, Sim 1, Pre, Stream) mapped onto a network]
PREDICT APPLICATION PERFORMANCE BY RUNNING A
SMALL PROGRAM REPRESENTATIVE OF ACTUAL
DISTRIBUTED APPLICATION
5
Performance Skeletons
  • A synthetically generated, short-running program.
  • The skeleton reflects the performance of the
    application it represents in any execution
    scenario.
  • - E.g., skeleton execution time is always
    1/1000th of application execution time.
  • An application and its skeleton should have
    similar execution activities for the above to be
    true:
  • - Communication activity
  • - CPU activity
  • - Memory access pattern

6
Memory Skeleton
Given an executable application, construct a
short-running skeleton program whose memory
access behavior is representative of the
application.
  • An application and its memory skeleton should
    have similar cache performance for any cache
    hierarchy.
  • Solution approach: create a program that
    recreates memory accesses in a sequence of
    representative slices of the executing program.

7
Challenges in creating a Memory Skeleton
  • The memory trace is prohibitively large even for a
    few minutes of execution.
  • - Solution approach: sampling and compression,
    lossy if necessary.
  • Recreating memory accesses from a trace is
    difficult.
  • - The cache is corrupted by management code.
  • - Recreation has substantial overhead: several
    instructions must be executed to issue each
    memory access request.
  • - Solution approach: avoid cache corruption and
    allow reordering that minimizes overhead per
    access.

8
Memory Access Behavior of Applications
  • Two types of locality:
  • Spatial locality - if one memory location is
    accessed, then nearby memory locations are also
    likely to be accessed.
  • Temporal locality - if a memory location is
    accessed once, it is likely to be accessed again
    soon.

These locality principles should be preserved in
the memory skeleton.
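Both kinds of locality can be seen in a small C sketch (the function and array here are hypothetical illustrations, not code from the paper):

```c
#include <assert.h>
#include <stddef.h>

/* Both kinds of locality in one loop:
 *  - sum lives at one location touched every iteration (temporal locality)
 *  - data[] is walked through consecutive addresses (spatial locality)   */
double sum_array(const double *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];      /* consecutive addresses: spatial locality */
    return sum;              /* sum reused every pass: temporal locality */
}
```

A memory skeleton must reproduce access patterns like these, since both localities directly determine cache hit rates.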
9
Automatic Skeleton Construction Framework
[Diagram: the application's components (Model, Data, GUI, Sim 1, Pre, Stream) feeding the skeleton construction pipeline]
1. Collect data address trace samples of the application.
2. Summarize the trace samples.
3. Generate the memory skeleton.
11
Address Trace Collection
  • Link the application executable with the Valgrind
    tool.
  • - Generates an address trace of the application.
  • - Access to source code is not required.
  • Issues:
  • - Unacceptable level of storage space and time
    overhead.
  • Hence, sampling of the address trace must be done
    during trace collection itself; collection of
    full traces of applications is prohibitively
    expensive.

12
Address Trace Collection (Contd)
  • Trace Sampling
  • - Divide the trace into trace slices: sets of
    consecutive memory references.
  • - The tool can be periodically switched on and off
    to capture these slices.
  • - Slices can be collected at random or uniform
    intervals.
  • Slice size should be at least one order of
    magnitude greater than the largest cache
    expected, in order to capture temporal
    locality.
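The periodic on/off capture described above can be sketched as a simple predicate over the reference index. The constants below are toy values for illustration, not the paper's settings:

```c
/* A minimal sketch of uniform trace-slice sampling: the tracer is "on"
 * for the first SLICE_LEN references of every PERIOD references,
 * giving a 1-in-10 sampling ratio with these toy constants. */
#define SLICE_LEN 10UL    /* references captured per slice */
#define PERIOD    100UL   /* slice period: one slice per 100 references */

/* Returns 1 if the ref_index-th memory reference falls inside a slice. */
int in_slice(unsigned long ref_index)
{
    return (ref_index % PERIOD) < SLICE_LEN;
}
```

In a real run SLICE_LEN would be far larger than the biggest cache of interest, per the sizing rule above.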

14
Trace Compaction
  • The recorded trace is still large and expensive to
    recreate.
  • Compress the trace using the following two ideas:
  • - The exact address in a trace is not critical; a
    nearby address will work (may affect spatial
    locality).
  • - Slight reordering of the address trace does not
    affect performance (may affect temporal
    locality).
  • This is lossy compression, but the impact on locality
    can be reduced to be negligible.

15
Trace Compaction (contd)
  • Divide the address space into lines of the size of a
    typical cache line; record only the line number,
    not the full address.
  • - Impact on spatial locality should be minimal.
  • Divide the temporal sequence of line numbers into
    clusters. Reordering within a cluster is allowed.
  • - Cluster size should be much smaller than the
    smallest expected cache size, so temporal
    locality is not affected by reordering.
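The two compaction steps above can be sketched as follows. The line size matches the machine used later in the experiments; the cluster-compaction routine and its names are illustrative assumptions:

```c
#include <stddef.h>

#define LINE_SIZE 64UL   /* typical cache line, as on the test machines */

/* Step 1: record only the cache line number, not the full address.
 * Dropping the offset within a line barely affects spatial locality. */
unsigned long line_of(unsigned long addr)
{
    return addr / LINE_SIZE;
}

/* Step 2: within one small cluster, reordering is allowed, so the
 * sequence of line numbers collapses to (line, frequency) pairs.
 * Returns the number of distinct pairs written to out_line/out_freq. */
size_t compact_cluster(const unsigned long *lines, size_t n,
                       unsigned long *out_line, size_t *out_freq)
{
    size_t pairs = 0;
    for (size_t i = 0; i < n; i++) {
        size_t j;
        for (j = 0; j < pairs; j++)
            if (out_line[j] == lines[i]) { out_freq[j]++; break; }
        if (j == pairs) {                 /* first occurrence of this line */
            out_line[pairs] = lines[i];
            out_freq[pairs++] = 1;
        }
    }
    return pairs;
}
```

Because clusters are much smaller than the smallest cache, counting repeats instead of preserving their exact order loses essentially no temporal locality.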

17
Memory Skeleton Generation
  • Create a C program that synthetically generates the
    sequence of memory references recorded in the
    previous step.
  • Challenges:
  • Minimizing extraneous address references.
  • - Any executing program has memory accesses of
    its own.
  • - Generate a loop structure for each cluster.
  • - Reading line-number/frequency pairs once leads
    to a series of actual memory references from the
    trace without intervening address reads.
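The loop-per-pair idea above can be sketched like this: one read of a (line, frequency) pair drives many replayed references, so the skeleton's own bookkeeping accesses stay rare relative to the replayed ones. The function and its names are illustrative assumptions:

```c
#include <stddef.h>

#define LINE_SIZE 64UL   /* assumed cache-line granularity */

/* Replay one (line, frequency) pair: issue `freq` references to the
 * same cache line within region `mem`. The volatile qualifier keeps
 * the compiler from optimizing the reads away. Returns the number of
 * references issued. */
unsigned long replay_pair(volatile char *mem, unsigned long line,
                          unsigned long freq)
{
    unsigned long issued = 0;
    for (unsigned long k = 0; k < freq; k++) {
        (void)mem[line * LINE_SIZE];   /* the replayed memory access */
        issued++;
    }
    return issued;
}
```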

18
Skeleton Generation (contd)
  • Eliminating cache corruption
  • - Reading trace data from disk impacts the
    memory simulation.
  • - Use a 2nd machine: read data on one machine and
    send it through sockets to the other machine,
    where the simulation is done.
  • - The socket buffer is kept to a very small size.
  • - This also reduces overhead on the main simulation
    machine.
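Keeping the socket buffer small might be done with a standard `setsockopt` call on the receiving (simulation) side; the function name and buffer size below are assumptions, as the paper does not give its exact settings:

```c
#include <sys/socket.h>

/* Shrink the receive buffer of the socket carrying trace data so that
 * buffered trace bytes disturb the simulation machine's cache as
 * little as possible. Returns 0 on success, -1 on error. */
int shrink_rcvbuf(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes);
}
```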

19
Skeleton Generation (contd)
  • Allocating memory
  • - The regions of virtual memory that will
    actually be used are not known prior to
    simulation.
  • Dynamic block allocation:
  • - A substantial-size block of memory is allocated
    when an address reference is made to a location
    that is not allocated.
  • - A sparse index table is maintained to access the
    blocks.
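A minimal sketch of this allocate-on-first-touch scheme, with block size, table size, and names chosen as assumptions (the slides give no exact figures):

```c
#include <stdlib.h>

#define BLOCK_BITS 20UL                 /* 1 MB blocks */
#define BLOCK_SIZE (1UL << BLOCK_BITS)
#define NUM_BLOCKS 4096UL               /* covers a 4 GB address range */

static char *block_table[NUM_BLOCKS];   /* sparse index table */

/* Map virtual address `addr` to backing storage, allocating the
 * enclosing block lazily on first touch. */
char *resolve(unsigned long addr)
{
    unsigned long b = (addr >> BLOCK_BITS) % NUM_BLOCKS;
    if (!block_table[b])
        block_table[b] = calloc(BLOCK_SIZE, 1);
    return block_table[b] + (addr & (BLOCK_SIZE - 1));
}
```

Only blocks the trace actually touches are ever allocated, so the skeleton's footprint tracks the application's working set rather than its full address range.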

20
Experiments and Results
  • Skeletons constructed for Class-W NAS serial
    benchmarks (and the Class-A IS benchmark).
  • Experiments conducted on Intel Xeon dual-CPU 1.7
    GHz machines with a 256 KB 8-way set-associative L2
    cache and 64-byte lines, running Linux 2.4.7.
  • Objectives:
  • - Prediction of cache miss ratio of corresponding
    applications.
  • - Predictions across different memory hierarchies.

Trace slices were picked uniformly throughout the
trace for all experiments.
21
Prediction of Cache Miss Ratios
Trace sampling ratio: 10
Trace slice size: > 10 million references; No. of slices picked: > 10
Average error: < 5%
Comparison of data cache miss rate of benchmarks
and corresponding memory skeletons.
The IS application is an exception.
22
Impact of trace slice selection
  • Traces for the IS, BT and MG benchmarks divided into
    100 uniform slices.
  • 10 different versions of skeletons generated,
    each using a different set of 10 uniformly spaced
    trace slices.

- MG and BT have similar cache miss rates in all
cases.
- IS has significant variation in cache miss
prediction with different sets of slices.
Reason: IS execution goes through different
phases with different memory access behavior,
unlike MG and BT.
Data cache miss rates for different sets of
trace slices in skeletons.
Actual data cache miss rates: IS 3.9%, BT 2.76%, MG 1.57%.
23
Impact of trace slice selection (contd)
- The greater the number of trace slices in an
application trace, the smaller the size of each trace
slice.
- Having a large number of slices in the trace
captures the multiphase behavior of applications.
Data cache miss rates of IS skeletons with
different sets of slices and for different
numbers of slices; the true cache miss ratio is
shown for reference.
24
Impact of trace slice size
- The cache miss ratio prediction error increases
rapidly when slice sizes are reduced below a
certain point.
Error in cache miss prediction with memory
skeletons for different trace slice sizes for MG.
25
Prediction across hardware platforms
Cache miss ratios were predicted fairly
accurately, with error < 5% across all machines.
Cache miss comparison of the CG benchmark and its
skeleton across different memory hierarchies.
26
Conclusion and Discussions
  • Presents a methodology to build memory skeletons
    for prediction of application cache miss ratios
    across hardware platforms.
  • A step towards building good performance
    skeletons.
  • Extends our group's previous work on skeletons to
    memory characteristics.
  • Major contribution:
  • - Low-overhead generation of memory accesses from a
    trace.

27
Conclusion and Discussions (Contd)
  • Limitations:
  • - Instruction references
  • - Space and time overhead
  • - Timing accuracy
  • - Integration with communication and CPU events