Title: Performance Modeling and Simulation for Tradeoff Analyses in Advanced HPC Systems
1. Performance Modeling and Simulation for Tradeoff Analyses in Advanced HPC Systems
- Dr. Alan D. George, Principal Investigator
- Mr. Eric M. Grobelny, Sr. Research Assistant
- HCS Research Laboratory
- University of Florida
2. Objectives and Motivations
- High-performance computing involves applications that require parallelization for feasible completion times
- HPC systems used for distributed computing
  - Issues with heterogeneity
  - Efficiency of hardware resource usage
  - Execution time
  - Overhead
- Challenge: find the optimum configuration of resources and task distribution for key applications under study
  - Nearly impossible and too expensive to determine experimentally
  - Simulation tools are required
- Challenges with the simulation approach
  - Large, complex systems
  - Balancing speed and fidelity
3. FASE Overview
- FASE: Fast and Accurate Simulation Environment
- Goals
  - Find the optimum system configuration on which to run a specific application
  - Performance analysis of a specific application running on a system
  - Identify bottlenecks in this configuration and optimize the program
- Uses a mixture of pre-simulation and simulation
  - Pre-simulation: extraction of key characteristics of the application and abstraction of less influential components
  - Simulation: use pre-simulation results to determine overall performance on currently unavailable systems
- MLDesigner discrete-event simulation environment
  - Block-oriented, hierarchical modeling paradigm to minimize development time
  - Sacrifices some speed for a user-friendly interface
- Related work conducted at the San Diego Supercomputer Center (PMaC project), European Center for Parallelism of Barcelona (Dimemas), U. of Illinois (SvPablo), U. of Wisconsin (Paradyn), and U. of Oregon (TAU)
4. FASE Process Flow Diagram
- Input parallel program into the Script Generator
  - Code instrumented and executed
  - Scripts created during execution (a hypothetical excerpt follows below)
  - Post-processing conducted
- Script files read in by MLDesigner
  - One script per simulated processor
- When all script files have been completed, the simulation is complete and statistics are reported
(Figure: FASE process flow diagram)
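The slides describe the scripts' role but not their format; purely for illustration, a per-processor script might contain entries like the following (all entry names here are invented, not the actual FASE format):

    # Hypothetical script excerpt for one simulated processor
    # (invented format; the real FASE scripts may differ entirely)
    COMP  0.000132            # timed non-communication block, seconds
    SEND  dst=1  bytes=4096   # traced MPI_Send
    COMP  0.000457
    RECV  src=1  bytes=4096   # traced MPI_Recv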
5. Script Generator
- Extracts key characteristics from the program to drive the simulation environment
- Features
  - Supported languages: C and C++, with MPI and Cray SHMEM
  - Automatic instrumentation of selected MPI and SHMEM functions (see the sketch below)
    - Supported MPI functions: MPI_Send, MPI_Ssend, MPI_Recv, MPI_Alltoall, MPI_Bcast, MPI_Reduce
    - Supported SHMEM functions: shmem_get, shmem_put
    - Foundation in place for easy addition of other functions
  - Non-communication events abstracted by simple timing
    - Times scaled during simulation to represent a different machine
- Scripts generated by running the binary executable
  - Traced events from instrumentation output to files
  - Script files drive the simulation models
- Post-processing
  - Overhead from the timing function calculated and reported during application execution
  - Average overhead subtracted from all non-communication events
(Figure: script generator data flow)
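The instrumentation code itself is not shown in the slides; the following minimal C sketch illustrates one common way such interception is done, using the standard MPI profiling (PMPI) interface. The script format and all names here are illustrative, not the actual FASE implementation.

    /* Minimal sketch of MPI instrumentation via the standard PMPI
     * profiling interface. Each wrapper logs the time spent since the
     * last traced event (a non-communication block) plus the parameters
     * of the communication call itself. Illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    static FILE *script;          /* one script file per process  */
    static double last_event_end; /* end time of previous event   */

    int MPI_Init(int *argc, char ***argv)
    {
        int rc = PMPI_Init(argc, argv);
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char name[64];
        snprintf(name, sizeof name, "script_%d.txt", rank);
        script = fopen(name, "w");
        last_event_end = MPI_Wtime();
        return rc;
    }

    int MPI_Send(const void *buf, int count, MPI_Datatype dt,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        /* Time since the last traced event: a non-communication block. */
        fprintf(script, "COMP %.9f\n", t0 - last_event_end);

        int rc = PMPI_Send(buf, count, dt, dest, tag, comm);

        int size;
        PMPI_Type_size(dt, &size);
        fprintf(script, "SEND dst=%d bytes=%d\n", dest, count * size);
        last_event_end = MPI_Wtime();
        return rc;
    }

In post-processing, the average cost of the MPI_Wtime calls themselves would then be subtracted from each COMP entry, as described above.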
6. Network Models: SCI
- Scalable Coherent Interface
  - Direct and indirect network
  - 1D, 2D, and 3D direct topologies most widely used
- SCI link controller architecture: key components
  - Output queue (red oval): holds waiting, in-transit, or retry request/response packets
  - Input queue (blue oval): holds waiting request/response packets for transmission to the host processor
  - Bypass queue (green oval): holds packets not destined for the host machine
  - Normal SCI data flow shown in the bottom-right figure
- Simulation
  - Packet-level simulation to cut down simulation time
  - Used to analyze system performance using a high-speed interconnect
  - Scale hardware parameters to determine benefits of future generations
(Figures: SCI link controller architecture; SCI data flow)
7. Network Models: InfiniBand
- InfiniBand components
  - Consumers communicate with each other through Host Channel Adapters (HCAs)
  - Subnet Managers (SMs) give the least congested path to a consumer
  - Subnet Management Agent (SMA) gives the SM the least congested Virtual Lane (VL) and the associated port
  - InfiniBand switch (not modeled) connects multiple HCAs and subnets
- HCA (main component)
  - Port queue pool (light blue oval)
  - VL queue pool (purple oval)
  - Queue Pair pool (light green oval)
  - SMA (black oval)
(Figures: high-level InfiniBand diagram; HCA MLD model)
8. Network Models: TCP
- TCP library
  - Layered architecture (application, network, datalink)
  - TCP layer contains most of the library's functionality
- TCP module components
  - Sending segments (grey oval)
  - Receiving segments (purple oval)
  - Allocating buffer memory (orange oval)
  - Generating acknowledgements and calculating timeout values (black oval)
- FASE TCP node architecture
  - TCP module (red oval): relays application data to/from the FASE integration components
  - FASE integration components (blue oval): interface to the TCP library for script-generator functionality
(Figures: FASE TCP node architecture; TCP module)
9. Network Models: RapidIO
- Switched interconnect for embedded systems
- Three-layer architecture: physical, transport, and logical layers
- Model features
  - Message-passing logical layer
  - Parallel physical layer
  - Wide variety of adjustable model parameters across the physical, transport, and logical layers
  - Supports error injection and recovery (CRC failures)
  - Two methods of flow control: transmitter-controlled and receiver-controlled
  - 6- and 8-port central-memory switch model
(Figure: RapidIO four-switch backplane)
10. Network Model Validations
- Ping program (sketched below)
  - Message sizes: 1 byte to 8 MB
  - Average of 1000 iterations for each message size
  - Compare experimental and simulated throughputs
- SCI: average error 3.3%
- InfiniBand: average error 1.211%
- TCP: average error 1.64%
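The slides describe the validation traffic but not the benchmark code; a minimal MPI ping-pong of that shape might look like this (an illustrative sketch, not the actual FASE validation program):

    /* Minimal MPI ping-pong throughput sketch of the kind described
     * above (illustrative). Rank 0 sends a message to rank 1, which
     * echoes it back; throughput is averaged over many iterations. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 1000;          /* matches the 1000 iterations */
        for (long bytes = 1; bytes <= (8L << 20); bytes *= 2) {
            char *buf = malloc(bytes);
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < iters; i++) {
                if (rank == 0) {
                    MPI_Send(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t = MPI_Wtime() - t0;
            if (rank == 0)  /* 2*iters messages moved in time t */
                printf("%ld bytes: %.2f MB/s\n", bytes,
                       (2.0 * iters * bytes) / t / 1e6);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }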
11. Experimental Setup
- System configurations
  - 2 nodes, each dual 2.4 GHz Xeon, 1 GB RAM
  - SCI: 1D ring topology using Dolphin D33X adapters, Scali SSP 4.2.1 for MPI
  - InfiniBand: InfiniCon switch and HCAs, InfiniCon MPI
  - TCP: over Intel Pro/1000 GigE adapters, Cisco Catalyst switch, LAM MPI
  - Red Hat Linux 9 with kernel version 2.4.20-8, or the corresponding patched kernel for SCI/InfiniBand support
  - Both hardware measurements and scripts obtained on the dual-Xeon nodes
- Experiments
  - Average of 25 iterations
  - Matrix multiply: four sizes, 50x50, 100x100, 250x250, and 500x500
  - Bench 12: 9 permutations of main table sizes (2^15, 2^18, 2^20) and substitution table sizes (2^8, 2^9, 2^10)
12. Results: SCI
- Matrix multiply
  - Error higher (7.5%) at smaller sizes due to small time values and large deviations in the experimental data
  - Dominated by computation, so deviations during script generation can greatly affect results
- Bench 12
  - Experimental deviations under 2.5%
  - Effective errors between 2% and 5%
    - Assumes a standard deviation of error incurred during each computational time measurement
- Errors of simulation runs versus experimental variation errors
  - High experimental deviation leads to higher error in simulation
  - Ideally, simulation error should fall within the experimental deviation error
13. Results: InfiniBand
- Matrix multiply
  - High error (25%) at the smallest matrix size, similar to SCI
  - For larger sizes, error is much smaller (less than 4%)
  - Error percentage decreases as size increases
  - Wall-clock simulation times
    - 50x50: slowdown factor of 3650
    - 500x500: slowdown factor of 16
- Bench 12
  - Errors under 1.5% for each case
  - Experimental deviations under 2%
  - Wall-clock simulation times
    - 1200 times greater at main table size 2^15
    - 210 times greater at main table size 2^20
14. Results: TCP
- Matrix multiply
  - Error higher (15%) at smaller sizes, similar to SCI and InfiniBand
  - Larger sizes have simulation errors less than the experimental deviation
  - Wall-clock simulation times
    - 50x50: slowdown factor of 140
    - 500x500: slowdown factor of 6
- Bench 12
  - Error less than 9% for all cases; effective errors less than 4%
  - Simulation error within experimental deviation for table size 2^15
  - Wall-clock simulation times
    - 780 times greater at main table size 2^15
    - 940 times greater at main table size 2^20
15. RC Model
- The RC (reconfigurable computing) arena has been dominated by experimentation, with little done in simulation
- Simulation
  - Predict performance gains on future, more advanced systems
  - Determine optimal workload and data distribution
  - Predict performance in emerging realms of RC
    - Resource management
    - Independent RC fabric communication (i.e., without the host processor)
    - Large-scale HPC (e.g., Cray XD1, SRC MAP processor, and SGI)
- Current models
  - RC fabric with management unit (red oval) and dynamically created and reconfigured functional units (blue oval)
  - Dynamically created RC fabrics (green oval)
  - Interface with FASE for script support (purple oval)
  - Multiple host processor/fabric interconnects supported
    - The RC node figure illustrates a PCI bus interface (orange oval), but RapidIO, InfiniBand, SCI, etc. could be plugged in
  - Support for inter-fabric communication
(Figures: RC fabric; RC node)
16. Experiments and Results: RC Model
- Objective
  - Determine how interconnect enhancements improve RC application performance
- Experiments
  - Blowfish algorithm with FFT
  - Sends a 4096-bit encrypted message to the RC board over the PCI bus in 32-bit chunks
  - Experimental system used a 33 MHz, 32-bit PCI bus
  - Functional unit configurations and initializations not included
- Test cases
  - Varied PCI specifications: 33 MHz/32-bit, 66 MHz/64-bit, and 100 MHz/64-bit data paths
  - Varied bus arbitration penalties: 0, 3, 5, 7 clock cycles
- Analysis of preliminary results
  - The 33 MHz, 32-bit configuration most closely matches the experimental system
    - Results closely match (0.4-1.4% error)
  - Other configurations close to ideal execution time
  - Deduction: the PCI bus is not a major factor in this algorithm, due to the small amount of data transferred (see the arithmetic below)
(Figure: Blowfish with FFT results)
17. Case Study: Low-Locality Applications
- Proposed solutions: reference classification
  - Instruction-based [Tyso95, Gonz95, Wong97]
    - Our case study considers two variations of the instruction-based approach
    - The AMD Opteron takes an instruction-based approach
  - Memory address-based [Memi03, John97]
    - The MTRR is an example of the memory address-based approach
- Current technology: existing support
  - Intel P6 microarchitecture and later [Inno02]
    - Uses Memory Type Range Registers (MTRRs)
    - 5 possible memory-type classifications: write-through, write-back, write-protect, write-combining, uncacheable
    - Specify up to 8 address ranges
    - Accessed through the command prompt or assembly (see the sketch after this list)
  - AMD processors implement the MTRR as well [Adva99]
  - AMD Opteron [Wilk02]
    - Added instructions to the ISA
    - 3 prefetch instructions: fetch into L1 only, L2 only, or into both L1 and L2
    - Streaming store instruction: store directly from the write buffer to memory, without placing the data in cache or replacing cache blocks
- Extensive compiler optimization research exists, but software solutions are not considered here
- Optimizations simulated using SimpleScalar
- Benchmarks
  - Table Toy (Bench 12)
  - Integer Sort (Bench 2)
  - Integer Selection and Summation (Bench 5)
  - Represent a range of good and bad locality
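On Linux (the OS used elsewhere in this work), the MTRRs are exposed through /proc/mtrr; writing a line of the form base=0x... size=0x... type=write-combining to that file (as root) adds a range. A small C sketch that simply lists the current ranges (illustrative, not part of the case study's tooling):

    /* Illustrative sketch: read the current MTRR ranges on Linux via
     * /proc/mtrr. Each line reports a register's base, size, and
     * memory type (write-back, write-combining, uncachable, ...). */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/mtrr", "r");
        if (!f) {
            perror("open /proc/mtrr");
            return 1;
        }
        char line[256];
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);   /* e.g. "reg00: base=0x00000000 ..." */
        fclose(f);
        return 0;
    }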
18. Baseline Results
- Simulation parameters
  - L1 data cache: 16 KB, 2-way set associative, 32 B blocks, 2 cc access latency
  - L2 unified cache: 256 KB, 4-way set associative, 64 B blocks, 7 cc access latency
  - Main memory: 64-bit bus width, 60 cc first-word latency, 2 cc each additional word
- Benchmark parameters
  - Bench 2: integer size 64 b, number of integers 1.5 x 10^7
  - Bench 5: number of integers 10^8
  - Bench 12: main table 2^26, substitution table 2^17
- Metrics to measure
  - CPI is the main statistic of interest
    - Proportional to execution time (see the relation below)
  - L1 and L2 data cache access counts and miss counts
    - Profile memory access behavior
(Figures: baseline CPI; baseline miss rates)
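For reference, the standard relation behind that bullet (the slide's garbled "CLK x CPI x IC"):

    \[
    T_{\text{exec}} = IC \times CPI \times T_{\text{clk}}
    \]

Since the instruction count IC and the clock period T_clk are held fixed across these experiments, any fractional reduction in CPI translates into the same fractional reduction in execution time.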
19. Instruction-based Bypassing: Dynamic
- Incorporate a Dynamic Bypass Table (DBT) before the cache (behavioral sketch below)
  - DBT design based on that found in [Wong97]
  - Small, cache-speed memory that monitors memory-reference instructions
- Table stores n 8-byte entries, 3 fields per entry
  - PC: instruction identifier
  - SC: saturating counter
    - Incremented on a miss, decremented on a hit
    - If it reaches a threshold, mark the instruction non-cacheable (NC)
  - RC: reference counter
    - Used for the replacement policy
    - Incremented each time the instruction is executed
- Hardware requirements
  - (8 x n) bytes of storage at L1 cache speed, plus control logic
- The DBT determines which instructions have poor locality
  - Table size, saturation threshold, and replacement policy affect this determination
  - Replacement policy evicts the least-executed entry (fewest accesses)
(Figure: Dynamic Bypass Table; saturating counter of m bits, m < 9, with threshold 2^m - 1, up to 255)
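A behavioral C sketch of the DBT update logic described above (a simulation-level model under the slide's assumptions, not the actual SimpleScalar modification; field widths and tie-breaking details are illustrative):

    /* Behavioral sketch of the Dynamic Bypass Table. On every
     * memory-reference instruction, look up the PC; bump the saturating
     * counter on a miss, decay it on a hit, and declare the instruction
     * non-cacheable (NC) once the counter saturates. Illustrative only. */
    #include <stdint.h>
    #include <stdbool.h>

    #define DBT_ENTRIES 1024   /* table size used in the experiments   */
    #define THRESHOLD   15     /* saturation threshold (2^m - 1, m = 4) */

    struct dbt_entry {
        uint32_t pc;     /* instruction identifier        */
        uint8_t  sc;     /* saturating miss counter       */
        uint32_t rc;     /* reference count (replacement) */
        bool     valid;
    };

    static struct dbt_entry dbt[DBT_ENTRIES];

    /* Returns true if this reference should bypass the cache (NC). */
    bool dbt_access(uint32_t pc, bool cache_miss)
    {
        /* Lookup: fully associative search by instruction PC. */
        for (int i = 0; i < DBT_ENTRIES; i++) {
            if (dbt[i].valid && dbt[i].pc == pc) {
                dbt[i].rc++;                      /* executed again    */
                if (cache_miss) {
                    if (dbt[i].sc < THRESHOLD) dbt[i].sc++;
                } else if (dbt[i].sc > 0) {
                    dbt[i].sc--;                  /* decay on hit      */
                }
                return dbt[i].sc >= THRESHOLD;    /* saturated -> NC   */
            }
        }
        /* Miss in the table: evict the least-executed entry (lowest RC). */
        int victim = 0;
        for (int i = 0; i < DBT_ENTRIES; i++) {
            if (!dbt[i].valid) { victim = i; break; }
            if (dbt[i].rc < dbt[victim].rc) victim = i;
        }
        dbt[victim] = (struct dbt_entry){ .pc = pc,
                                          .sc = cache_miss ? 1 : 0,
                                          .rc = 1, .valid = true };
        return false;
    }

At 8 bytes per entry, 1024 entries is an 8 KB table per the (8 x n) formula; the next slide's finding that 128 entries suffice corresponds to the 1 KB figure quoted there.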
20. Instruction-based Bypassing: Dynamic (cont.)
- The dynamic solution provides a benefit for Bench 12
  - Degrades performance on Bench 2: too many instructions saturate
  - Bench 5 barely affected: almost no instructions saturate
  - Simulator outputs DBT contents at the end of simulation
    - Check to see which instructions saturated
- Performance with multiple saturation values
  - DBT size held constant at 1024 entries
  - A higher saturation threshold means more tolerance before declaring an instruction NC
  - No further improvement for thresholds above 15
  - A very small threshold is too sensitive
- Effect of varying DBT size
  - Saturation threshold held constant at 15
  - For these benchmarks, 128 entries is enough
  - Fewer entries mean smaller hardware requirements
    - 128 entries: 1 KB of logic
(Figures: variable saturation; variable DBT size)
21. Discussion of Results
- Instruction-based NC approach
  - Bench 2 and 5 are not good candidates for the NC solution
  - Sources of improved performance from NC
    - Reserves cache space for references with good locality
    - No L1/L2 access penalty waiting for the hit/miss decision
- Static approach is analogous to the dynamic one
  - Adds new instructions to the ISA, giving more control over NC
  - No DBT; the programmer chooses NC references before compile time
  - Less versatile, but expected to achieve better performance than the DBT; experiments are well underway
(Figure: main loop of Bench 12, static approach)
- Hardware costs of proposed solutions
  - Dynamic: cache-like memory, around 1 KB of logic
  - Static: new instructions added to the ISA
  - Both require a word-addressable direct memory port
- CPI for faster access latencies (figure: variable NC access latency)
- Reconsidering existing hardware
  - Instruction-based support found in the Opteron, but it only has a non-caching store, no matching load
  - The MTRR could be used in a memory address-based approach, though it would require further research; the MTRR was not intended for this purpose
22. Conclusions
- FASE
  - Pre-simulation process characterizes the application
    - Captures key parameters of communication events (a subset of MPI and SHMEM library calls) to drive simulation
    - Non-communication areas use timing; relies on a scaling factor for modeling other computational components
    - Using hardware for timing is fast
  - Simulation
    - Used to accurately model communication events and scale other events
    - Can use any network model in the FASE library in a system configuration
    - User-definable parameters to customize network and computational-unit settings
- Network models
  - All models produced acceptable results, with errors under 8% in most cases
  - InfiniBand and TCP models showed simulation slowdowns on the order of 10^3
  - Verified RapidIO model, with experiments underway
  - Models will be optimized to increase accuracy while minimizing simulation time
- RC model
  - Support for multiple RC nodes, multiple RC fabrics on each node, and multiple reconfigurable functional units in each fabric
  - Model still in its infancy, but results obtained are promising
  - Need to test other algorithms, vary other parameters, and explore the advantages of inter-fabric/inter-node communication without host intervention
- Low-locality case study
23. Open Issues for FASE
- Dynamic applications
  - Scripts are inherently static and cannot capture any dynamic information (analogy: snapshot vs. live video)
  - Potential aid: hardware-in-the-loop
    - e.g., pause the application and send data to MLD; MLD sends back information; the application resumes
- Instrumentation and processor modeling
  - Currently, instrumented code times computation and affects actual program execution
    - e.g., extra code can cause cache misses that would otherwise not happen
  - Potential solution: eliminate timing and relate computation to source or assembly code
    - e.g., count loop iterations with knowledge of the instruction types and counts in the loop
  - Model the memory hierarchy to capture memory-related performance issues
- Prediction
  - Pre-simulation/script generation results in data for one specific case (e.g., a 2-processor system)
  - Potential aid 1: extract information on task assignment in relation to system size
  - Potential aid 2: analyze trends and extrapolate
    - Scaling factor determined by fitting curves to data points of the specific application (assuming timing of computation is not replaced) with respect to data-set size
  - Potential solution: classify the application in terms of its limiting factor (e.g., memory, CPU, I/O) and use representative benchmarks
24. Future Work
- FASE
  - Support other programming languages
    - More completely support the MPI and SHMEM libraries
    - Devise a scheme to support UPC
  - Model other components
    - Continue enhancing the RC models
    - Potential modeling of the memory hierarchy, storage devices, WAN and grid-computing components, etc.
  - Optimize existing network models for speed without sacrificing accuracy
  - Run experiments with larger systems
  - Devise a roadmap for solving/considering the issues on the previous slide
- Low-locality case study
  - Many simple variations of the proposed solutions
    - Memory address-based classification
    - Monitor additional/different metrics besides NC
  - Extend scalar simulations to include multiprocessor systems
    - Benchmarks are intended to be run on parallel machines
    - An extended memory hierarchy adds another layer of complexity to locality
  - Possible integration with FASE: model the memory hierarchy in the node components
25. References
- [Adva99] Advanced Micro Devices, Inc., "Implementation of Write Allocate in the K86 Processors," AMD whitepaper, Publication No. 21326, Feb. 1999.
- [Aust97] T. Austin, "A User's and Hacker's Guide to the SimpleScalar Architectural Research Tool Set," Intel MicroComputer Research Labs (http://www.simplescalar.com/docs.html), Jan. 1997.
- [Gonz95] A. Gonzalez, C. Aliagas, M. Valero, "A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality," Department of Computer Architecture, Polytechnic University of Catalonia, Barcelona, 1995.
- [Inno02] R. Innocente, "HPC on Linux Clusters: Nodes and Networks Hardware," tutorial for ICTP school, International School for Advanced Studies, Feb. 2002.
- [John97] T. Johnson, M. Merten, W. Hwu, "Run-time Spatial Locality Detection and Optimization," Center for Reliable and High-Performance Computing, University of Illinois, Urbana-Champaign, IL, 1997.
- [John98] T. Johnson, D. Connors, W. Hwu, "Run-time Adaptive Cache Management," Center for Reliable and High-Performance Computing, University of Illinois, Urbana-Champaign, IL, 1998.
- [Memi03] G. Memik, M. Kandemir, A. Choudhary, I. Kadayif, "An Integrated Approach for Improving Cache Behavior," Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2003.
- [Tyso95] G. Tyson, M. Farrens, J. Matthews, A. Pleszkun, "A Modified Approach to Data Cache Management," 1995.
- [Wilk02] T. Wilkens, "Optimizing for the AMD Opteron Processor," AMD Developer Symposium, Oct. 2002.
- [Wong97] W. Wong, "Hardware Techniques for Data Placement," Department of Computer Science and Engineering, University of Washington, Seattle, WA, Aug. 1997.
26. Performance Prediction Setup
- 2 nodes, each 1.33 GHz Athlon, 256 MB RAM
  - 3Com FastE adapter, Nortel PassPort switch, MPICH
  - Red Hat Linux 9 with kernel version 2.4.20-8
- Scaling factor for prediction
  - Sequential version of matrix multiply
  - Sizes analyzed: 50x50, 100x100, 250x250, 500x500, 1000x1000, 2000x2000
  - Average execution time over 25 iterations for each size
  - Scaling factor: average execution time on the Athlon machine divided by the Xeon execution time
  - Data-set size for each case estimated by hand and related to the scaling factor
  - Performance prediction only conducted using substitution table 2^9
- Two scaling factors used
  - Constant factor of 2
  - Variable factor determined by a best-fit quadratic to the data points (evaluated in the sketch below)
    - Equation: ax^2 + bx + c with a = -1.8778e-9, b = 1.96274e-4, c = 1.908562
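A small C sketch of how the variable scaling factor would be evaluated from the fitted quadratic (coefficients from the slide; the reading of x as matrix dimension is one illustrative interpretation of the slides' "data set size," and the function name is ours):

    /* Evaluates the variable scaling factor from the best-fit quadratic
     * reported above: scale(x) = a*x^2 + b*x + c. Coefficients are from
     * the slide; names and the choice of x values are illustrative. */
    #include <stdio.h>

    static double scale_factor(double x)
    {
        const double a = -1.8778e-9;
        const double b =  1.96274e-4;
        const double c =  1.908562;
        return (a * x + b) * x + c;   /* a*x^2 + b*x + c via Horner */
    }

    int main(void)
    {
        /* Sizes analyzed for the sequential matrix multiply. */
        const double sizes[] = { 50, 100, 250, 500, 1000, 2000 };
        for (int i = 0; i < 6; i++)
            printf("size %4.0f -> scale %.4f\n",
                   sizes[i], scale_factor(sizes[i]));
        return 0;
    }

Simulated computation times from the Xeon-generated scripts would then be multiplied by this factor to predict behavior on the Athlon nodes.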
27. Performance Prediction Results
- Performance prediction
  - Large range of error (0.5% to 30%)
  - Matrix multiply errors for the variable scaling factor are relatively low
    - Due to use of its scalar version to determine the scaling factor
    - Error grows with larger matrix size due to the quadratic curve fitting; more complex equation(s) needed
  - Bench 12 errors under 20% and fairly constant for both scaling factors
    - Data-set size estimation may be too rough
    - Matrix multiply and Bench 12 are too different for correlation
  - Different types of applications should be represented by correspondingly bounded programs (i.e., a memory-bound program must use a memory-bound benchmark to get its scaling factor)
  - More research needed in this area