Title: Performance Modeling and Simulation for Tradeoff Analyses in Advanced HPC Systems
1. Performance Modeling and Simulation for Tradeoff Analyses in Advanced HPC Systems
- Dr. Alan D. George, Principal Investigator
- Mr. Eric M. Grobelny, Sr. Research Assistant
- HCS Research Laboratory
- University of Florida
2. Objectives and Motivations
- High-performance computing involves applications that require parallelization for feasible completion times
- HPC systems used for distributed computing
  - Issues with heterogeneity
  - Efficiency of hardware resource usage
  - Execution time
  - Overhead
- Challenge: find the optimum configuration of resources and task distribution for key applications under study
  - Nearly impossible and too expensive to determine experimentally
  - Simulation tools are required
- Challenges with the simulation approach
  - Large, complex systems
  - Balancing speed and fidelity
3. FASE Overview
- FASE: Fast and Accurate Simulation Environment
- Goals
  - Find the optimum system configuration on which to run a specific application
  - Performance analysis of a specific application running on a system
  - Identify bottlenecks in this configuration and optimize the program
- Uses a mixture of pre-simulation and simulation
  - Pre-simulation: extraction of key characteristics of the application and abstraction of less influential components
  - Simulation: use pre-simulation results to determine overall performance on currently unavailable systems
- MLDesigner discrete-event simulation environment
  - Block-oriented, hierarchical modeling paradigm to minimize development time
  - Sacrifices some speed for a user-friendly interface
- Related work conducted at the San Diego Supercomputer Center (PMaC project), European Center for Parallelism of Barcelona (Dimemas), U. of Illinois (SvPablo), U. of Wisconsin (Paradyn), and U. of Oregon (TAU)
4. FASE Process Flow Diagram
- Input parallel program into the Script Generator
  - Code instrumented and executed
  - Scripts created during execution (a hypothetical excerpt follows below)
  - Post-processing conducted
- Script files read in by MLDesigner
  - One script per simulated processor
- When all script files have been completed, the simulation is complete and statistics are reported
(Figure: FASE process flow diagram)
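The slides describe the scripts' role but not their format; purely for illustration, a per-processor script might contain entries like the following (all entry names here are invented, not the actual FASE format):

    # Hypothetical script excerpt for one simulated processor
    # (invented format; the real FASE scripts may differ entirely)
    COMP  0.000132            # timed non-communication block, seconds
    SEND  dst=1  bytes=4096   # traced MPI_Send
    COMP  0.000457
    RECV  src=1  bytes=4096   # traced MPI_Recv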
5. Script Generator
- Extracts key characteristics from the program to drive the simulation environment
- Features
  - Supported languages: C and C++, with MPI and Cray SHMEM
  - Automatic instrumentation of selected MPI and SHMEM functions (see the sketch below)
    - Supported MPI functions: MPI_Send, MPI_Ssend, MPI_Recv, MPI_Alltoall, MPI_Bcast, MPI_Reduce
    - Supported SHMEM functions: shmem_get, shmem_put
    - Foundation in place for easy addition of other functions
  - Non-communication events abstracted by simple timing
    - Times scaled during simulation to represent a different machine
- Scripts generated by running the binary executable
  - Traced events from instrumentation output to files
  - Script files drive the simulation models
- Post-processing
  - Overhead from the timing function calculated and reported during application execution
  - Average overhead subtracted from all non-communication events
(Figure: script generator data flow)
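The instrumentation code itself is not shown in the slides; the following minimal C sketch illustrates one common way such interception is done, using the standard MPI profiling (PMPI) interface. The script format and all names here are illustrative, not the actual FASE implementation.

    /* Minimal sketch of MPI instrumentation via the standard PMPI
     * profiling interface. Each wrapper logs the time spent since the
     * last traced event (a non-communication block) plus the parameters
     * of the communication call itself. Illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    static FILE *script;          /* one script file per process  */
    static double last_event_end; /* end time of previous event   */

    int MPI_Init(int *argc, char ***argv)
    {
        int rc = PMPI_Init(argc, argv);
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char name[64];
        snprintf(name, sizeof name, "script_%d.txt", rank);
        script = fopen(name, "w");
        last_event_end = MPI_Wtime();
        return rc;
    }

    int MPI_Send(const void *buf, int count, MPI_Datatype dt,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        /* Time since the last traced event: a non-communication block. */
        fprintf(script, "COMP %.9f\n", t0 - last_event_end);

        int rc = PMPI_Send(buf, count, dt, dest, tag, comm);

        int size;
        PMPI_Type_size(dt, &size);
        fprintf(script, "SEND dst=%d bytes=%d\n", dest, count * size);
        last_event_end = MPI_Wtime();
        return rc;
    }

In post-processing, the average cost of the MPI_Wtime calls themselves would then be subtracted from each COMP entry, as described above.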
6. Network Models: SCI
- Scalable Coherent Interface
  - Direct and indirect network
  - 1D, 2D, and 3D direct topologies most widely used
- SCI link controller architecture: key components
  - Output queue (red oval): holds waiting, in-transit, or retry request/response packets
  - Input queue (blue oval): holds waiting request/response packets for transmission to the host processor
  - Bypass queue (green oval): holds packets not destined for the host machine
  - Normal SCI data flow shown in the bottom-right figure
- Simulation
  - Packet-level simulation to cut down simulation time
  - Used to analyze system performance using a high-speed interconnect
  - Scale hardware parameters to determine benefits of future generations
(Figures: SCI link controller architecture; SCI data flow)
7. Network Models: InfiniBand
- InfiniBand components
  - Consumers communicate with each other through Host Channel Adapters (HCAs)
  - Subnet Managers (SMs) give the least congested path to a consumer
  - Subnet Management Agent (SMA) gives the SM the least congested Virtual Lane (VL) and the associated port
  - InfiniBand switch (not modeled) connects multiple HCAs and subnets
- HCA (main component)
  - Port queue pool (light blue oval)
  - VL queue pool (purple oval)
  - Queue Pair pool (light green oval)
  - SMA (black oval)
(Figures: high-level InfiniBand diagram; HCA MLD model)
8. Network Models: TCP
- TCP library
  - Layered architecture (application, network, datalink)
  - TCP layer contains most of the library's functionality
- TCP module components
  - Sending segments (grey oval)
  - Receiving segments (purple oval)
  - Allocating buffer memory (orange oval)
  - Generating acknowledgements and calculating timeout values (black oval)
- FASE TCP node architecture
  - TCP module (red oval): relays application data to/from the FASE integration components
  - FASE integration components (blue oval): interface to the TCP library for script-generator functionality
(Figures: FASE TCP node architecture; TCP module)
9. Network Models: RapidIO
- Switched interconnect for embedded systems
- Three-layer architecture: physical, transport, and logical layers
- Model features
  - Message-passing logical layer
  - Parallel physical layer
  - Wide variety of adjustable model parameters across the physical, transport, and logical layers
  - Supports error injection and recovery (CRC failures)
  - Two methods of flow control: transmitter-controlled and receiver-controlled
  - 6- and 8-port central-memory switch model
(Figure: RapidIO four-switch backplane)
10. Network Model Validations
- Ping program (sketched below)
  - Message sizes: 1 byte to 8 MB
  - Average of 1000 iterations for each message size
  - Compare experimental and simulated throughputs
- SCI: average error 3.3%
- InfiniBand: average error 1.211%
- TCP: average error 1.64%
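The slides describe the validation traffic but not the benchmark code; a minimal MPI ping-pong of that shape might look like this (an illustrative sketch, not the actual FASE validation program):

    /* Minimal MPI ping-pong throughput sketch of the kind described
     * above (illustrative). Rank 0 sends a message to rank 1, which
     * echoes it back; throughput is averaged over many iterations. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 1000;          /* matches the 1000 iterations */
        for (long bytes = 1; bytes <= (8L << 20); bytes *= 2) {
            char *buf = malloc(bytes);
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < iters; i++) {
                if (rank == 0) {
                    MPI_Send(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t = MPI_Wtime() - t0;
            if (rank == 0)  /* 2*iters messages moved in time t */
                printf("%ld bytes: %.2f MB/s\n", bytes,
                       (2.0 * iters * bytes) / t / 1e6);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }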
11. Experimental Setup
- System configurations
  - 2 nodes, each dual 2.4 GHz Xeon, 1 GB RAM
  - SCI: 1D ring topology using Dolphin D33X adapters, Scali SSP 4.2.1 for MPI
  - InfiniBand: InfiniCon switch and HCAs, InfiniCon MPI
  - TCP: over Intel Pro/1000 GigE adapters, Cisco Catalyst switch, LAM MPI
  - Red Hat Linux 9 with kernel version 2.4.20-8, or the corresponding patched kernel for SCI/InfiniBand support
  - Both hardware measurements and scripts obtained on the dual-Xeon nodes
- Experiments
  - Average of 25 iterations
  - Matrix multiply: four sizes, 50x50, 100x100, 250x250, and 500x500
  - Bench 12: 9 permutations of main table sizes (2^15, 2^18, 2^20) and substitution table sizes (2^8, 2^9, 2^10)
12. Results: SCI
- Matrix multiply
  - Error higher (7.5%) at smaller sizes due to small time values and large deviations in the experimental data
  - Dominated by computation, so deviations during script generation can greatly affect results
- Bench 12
  - Experimental deviations under 2.5%
  - Effective errors between 2% and 5%
    - Assumes a standard deviation of error incurred during each computational time measurement
- Errors of simulation runs versus experimental variation errors
  - High experimental deviation leads to higher error in simulation
  - Ideally, simulation error should fall within the experimental deviation error
13. Results: InfiniBand
- Matrix multiply
  - High error (25%) at the smallest matrix size, similar to SCI
  - For larger sizes, error is much smaller (less than 4%)
  - Error percentage decreases as size increases
  - Wall-clock simulation times
    - 50x50: slowdown factor of 3650
    - 500x500: slowdown factor of 16
- Bench 12
  - Errors under 1.5% for each case
  - Experimental deviations under 2%
  - Wall-clock simulation times
    - 1200 times greater at main table size 2^15
    - 210 times greater at main table size 2^20
14. Results: TCP
- Matrix multiply
  - Error higher (15%) at smaller sizes, similar to SCI and InfiniBand
  - Larger sizes have simulation errors less than the experimental deviation
  - Wall-clock simulation times
    - 50x50: slowdown factor of 140
    - 500x500: slowdown factor of 6
- Bench 12
  - Error less than 9% for all cases; effective errors less than 4%
  - Simulation error within experimental deviation for table size 2^15
  - Wall-clock simulation times
    - 780 times greater at main table size 2^15
    - 940 times greater at main table size 2^20
15. RC Model
- The RC (reconfigurable computing) arena has been dominated by experimentation, with little done in simulation
- Simulation
  - Predict performance gains on future, more advanced systems
  - Determine optimal workload and data distribution
  - Predict performance in emerging realms of RC
    - Resource management
    - Independent RC fabric communication (i.e., without the host processor)
    - Large-scale HPC (e.g., Cray XD1, SRC MAP processor, and SGI)
- Current models
  - RC fabric with management unit (red oval) and dynamically created and reconfigured functional units (blue oval)
  - Dynamically created RC fabrics (green oval)
  - Interface with FASE for script support (purple oval)
  - Multiple host processor/fabric interconnects supported
    - The RC node figure illustrates a PCI bus interface (orange oval), but RapidIO, InfiniBand, SCI, etc. could be plugged in
  - Support for inter-fabric communication
(Figures: RC fabric; RC node)
16. Experiments and Results: RC Model
- Objective
  - Determine how interconnect enhancements improve RC application performance
- Experiments
  - Blowfish algorithm with FFT
  - Sends a 4096-bit encrypted message to the RC board over the PCI bus in 32-bit chunks
  - Experimental system used a 33 MHz, 32-bit PCI bus
  - Functional unit configurations and initializations not included
- Test cases
  - Varied PCI specifications: 33 MHz/32-bit, 66 MHz/64-bit, and 100 MHz/64-bit data paths
  - Varied bus arbitration penalties: 0, 3, 5, 7 clock cycles
- Analysis of preliminary results
  - The 33 MHz, 32-bit configuration most closely matches the experimental system
    - Results closely match (0.4-1.4% error)
  - Other configurations close to ideal execution time
  - Deduction: the PCI bus is not a major factor in this algorithm, due to the small amount of data transferred (see the arithmetic below)
(Figure: Blowfish with FFT results)
17. Case Study: Low-Locality Applications
- Proposed solutions: reference classification
  - Instruction-based [Tyso95, Gonz95, Wong97]
    - Our case study considers two variations of the instruction-based approach
    - The AMD Opteron takes an instruction-based approach
  - Memory address-based [Memi03, John97]
    - The MTRR is an example of the memory address-based approach
- Current technology: existing support
  - Intel P6 microarchitecture and later [Inno02]
    - Uses Memory Type Range Registers (MTRRs)
    - 5 possible memory-type classifications: write-through, write-back, write-protect, write-combining, uncacheable
    - Specify up to 8 address ranges
    - Accessed through the command prompt or assembly (see the sketch after this list)
  - AMD processors implement the MTRR as well [Adva99]
  - AMD Opteron [Wilk02]
    - Added instructions to the ISA
    - 3 prefetch instructions: fetch into L1 only, L2 only, or into both L1 and L2
    - Streaming store instruction: store directly from the write buffer to memory, without placing the data in cache or replacing cache blocks
- Extensive compiler optimization research exists, but software solutions are not considered here
- Optimizations simulated using SimpleScalar
- Benchmarks
  - Table Toy (Bench 12)
  - Integer Sort (Bench 2)
  - Integer Selection and Summation (Bench 5)
  - Represent a range of good and bad locality
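On Linux (the OS used elsewhere in this work), the MTRRs are exposed through /proc/mtrr; writing a line of the form base=0x... size=0x... type=write-combining to that file (as root) adds a range. A small C sketch that simply lists the current ranges (illustrative, not part of the case study's tooling):

    /* Illustrative sketch: read the current MTRR ranges on Linux via
     * /proc/mtrr. Each line reports a register's base, size, and
     * memory type (write-back, write-combining, uncachable, ...). */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/mtrr", "r");
        if (!f) {
            perror("open /proc/mtrr");
            return 1;
        }
        char line[256];
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);   /* e.g. "reg00: base=0x00000000 ..." */
        fclose(f);
        return 0;
    }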
18. Baseline Results
- Simulation parameters
  - L1 data cache: 16 KB, 2-way set associative, 32 B blocks, 2 cc access latency
  - L2 unified cache: 256 KB, 4-way set associative, 64 B blocks, 7 cc access latency
  - Main memory: 64-bit bus width, 60 cc first-word latency, 2 cc each additional word
- Benchmark parameters
  - Bench 2: integer size 64 b, number of integers 1.5 x 10^7
  - Bench 5: number of integers 10^8
  - Bench 12: main table 2^26, substitution table 2^17
- Metrics to measure
  - CPI is the main statistic of interest
    - Proportional to execution time (see the relation below)
  - L1 and L2 data cache access counts and miss counts
    - Profile memory access behavior
(Figures: baseline CPI; baseline miss rates)
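For reference, the standard relation behind that bullet (the slide's garbled "CLK x CPI x IC"):

    \[
    T_{\text{exec}} = IC \times CPI \times T_{\text{clk}}
    \]

Since the instruction count IC and the clock period T_clk are held fixed across these experiments, any fractional reduction in CPI translates into the same fractional reduction in execution time.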
19. Instruction-based Bypassing: Dynamic
- Incorporate a Dynamic Bypass Table (DBT) before the cache (behavioral sketch below)
  - DBT design based on that found in [Wong97]
  - Small, cache-speed memory that monitors memory-reference instructions
- Table stores n 8-byte entries, 3 fields per entry
  - PC: instruction identifier
  - SC: saturating counter
    - Incremented on a miss, decremented on a hit
    - If it reaches a threshold, mark the instruction non-cacheable (NC)
  - RC: reference counter
    - Used for the replacement policy
    - Incremented each time the instruction is executed
- Hardware requirements
  - (8 x n) bytes of storage at L1 cache speed, plus control logic
- The DBT determines which instructions have poor locality
  - Table size, saturation threshold, and replacement policy affect this determination
  - Replacement policy evicts the least-executed entry (fewest accesses)
(Figure: Dynamic Bypass Table; saturating counter of m bits, m < 9, with threshold 2^m - 1, up to 255)
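A behavioral C sketch of the DBT update logic described above (a simulation-level model under the slide's assumptions, not the actual SimpleScalar modification; field widths and tie-breaking details are illustrative):

    /* Behavioral sketch of the Dynamic Bypass Table. On every
     * memory-reference instruction, look up the PC; bump the saturating
     * counter on a miss, decay it on a hit, and declare the instruction
     * non-cacheable (NC) once the counter saturates. Illustrative only. */
    #include <stdint.h>
    #include <stdbool.h>

    #define DBT_ENTRIES 1024   /* table size used in the experiments   */
    #define THRESHOLD   15     /* saturation threshold (2^m - 1, m = 4) */

    struct dbt_entry {
        uint32_t pc;     /* instruction identifier        */
        uint8_t  sc;     /* saturating miss counter       */
        uint32_t rc;     /* reference count (replacement) */
        bool     valid;
    };

    static struct dbt_entry dbt[DBT_ENTRIES];

    /* Returns true if this reference should bypass the cache (NC). */
    bool dbt_access(uint32_t pc, bool cache_miss)
    {
        /* Lookup: fully associative search by instruction PC. */
        for (int i = 0; i < DBT_ENTRIES; i++) {
            if (dbt[i].valid && dbt[i].pc == pc) {
                dbt[i].rc++;                      /* executed again    */
                if (cache_miss) {
                    if (dbt[i].sc < THRESHOLD) dbt[i].sc++;
                } else if (dbt[i].sc > 0) {
                    dbt[i].sc--;                  /* decay on hit      */
                }
                return dbt[i].sc >= THRESHOLD;    /* saturated -> NC   */
            }
        }
        /* Miss in the table: evict the least-executed entry (lowest RC). */
        int victim = 0;
        for (int i = 0; i < DBT_ENTRIES; i++) {
            if (!dbt[i].valid) { victim = i; break; }
            if (dbt[i].rc < dbt[victim].rc) victim = i;
        }
        dbt[victim] = (struct dbt_entry){ .pc = pc,
                                          .sc = cache_miss ? 1 : 0,
                                          .rc = 1, .valid = true };
        return false;
    }

At 8 bytes per entry, 1024 entries is an 8 KB table per the (8 x n) formula; the next slide's finding that 128 entries suffice corresponds to the 1 KB figure quoted there.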
20. Instruction-based Bypassing: Dynamic (cont.)
- The dynamic solution provides a benefit for Bench 12
  - Degrades performance on Bench 2: too many instructions saturate
  - Bench 5 barely affected: almost no instructions saturate
  - Simulator outputs DBT contents at the end of simulation
    - Check to see which instructions saturated
- Performance with multiple saturation values
  - DBT size held constant at 1024 entries
  - A higher saturation threshold means more tolerance before declaring an instruction NC
  - No further improvement for thresholds above 15
  - A very small threshold is too sensitive
- Effect of varying DBT size
  - Saturation threshold held constant at 15
  - For these benchmarks, 128 entries is enough
  - Fewer entries mean smaller hardware requirements
    - 128 entries: 1 KB of logic
(Figures: variable saturation; variable DBT size)
21. Discussion of Results
- Instruction-based NC approach
  - Bench 2 and 5 are not good candidates for the NC solution
  - Sources of improved performance from NC
    - Reserves cache space for references with good locality
    - No L1/L2 access penalty waiting for the hit/miss decision
- Static approach is analogous to the dynamic one
  - Adds new instructions to the ISA, giving more control over NC
  - No DBT; the programmer chooses NC references before compile time
  - Less versatile, but expected to achieve better performance than the DBT; experiments are well underway
(Figure: main loop of Bench 12, static approach)
- Hardware costs of proposed solutions
  - Dynamic: cache-like memory, around 1 KB of logic
  - Static: new instructions added to the ISA
  - Both require a word-addressable direct memory port
- CPI for faster access latencies (figure: variable NC access latency)
- Reconsidering existing hardware
  - Instruction-based support found in the Opteron, but it only has a non-caching store, no matching load
  - The MTRR could be used in a memory address-based approach, though it would require further research; the MTRR was not intended for this purpose
22. Conclusions
- FASE
  - Pre-simulation process characterizes the application
    - Captures key parameters of communication events (a subset of MPI and SHMEM library calls) to drive simulation
    - Non-communication areas use timing; relies on a scaling factor for modeling other computational components
    - Using hardware for timing is fast
  - Simulation
    - Used to accurately model communication events and scale other events
    - Can use any network model in the FASE library in a system configuration
    - User-definable parameters to customize network and computational-unit settings
- Network models
  - All models produced acceptable results, with errors under 8% in most cases
  - InfiniBand and TCP models showed simulation slowdowns on the order of 10^3
  - Verified RapidIO model, with experiments underway
  - Models will be optimized to increase accuracy while minimizing simulation time
- RC model
  - Support for multiple RC nodes, multiple RC fabrics on each node, and multiple reconfigurable functional units in each fabric
  - Model still in its infancy, but results obtained are promising
  - Need to test other algorithms, vary other parameters, and explore the advantages of inter-fabric/inter-node communication without host intervention
- Low-locality case study
23. Open Issues for FASE
- Dynamic applications
  - Scripts are inherently static and cannot capture any dynamic information (analogy: snapshot vs. live video)
  - Potential aid: hardware-in-the-loop
    - e.g., pause the application and send data to MLD; MLD sends back information; the application resumes
- Instrumentation and processor modeling
  - Currently, instrumented code times computation and affects actual program execution
    - e.g., extra code can cause cache misses that would otherwise not happen
  - Potential solution: eliminate timing and relate computation to source or assembly code
    - e.g., count loop iterations with knowledge of the instruction types and counts in the loop
  - Model the memory hierarchy to capture memory-related performance issues
- Prediction
  - Pre-simulation/script generation results in data for one specific case (e.g., a 2-processor system)
  - Potential aid 1: extract information on task assignment in relation to system size
  - Potential aid 2: analyze trends and extrapolate
    - Scaling factor determined by fitting curves to data points of the specific application (assuming timing of computation is not replaced) with respect to data-set size
  - Potential solution: classify the application in terms of its limiting factor (e.g., memory, CPU, I/O) and use representative benchmarks
24. Future Work
- FASE
  - Support other programming languages
    - More completely support the MPI and SHMEM libraries
    - Devise a scheme to support UPC
  - Model other components
    - Continue enhancing the RC models
    - Potential modeling of the memory hierarchy, storage devices, WAN and grid-computing components, etc.
  - Optimize existing network models for speed without sacrificing accuracy
  - Run experiments with larger systems
  - Devise a roadmap for solving/considering the issues on the previous slide
- Low-locality case study
  - Many simple variations of the proposed solutions
    - Memory address-based classification
    - Monitor additional/different metrics besides NC
  - Extend scalar simulations to include multiprocessor systems
    - Benchmarks are intended to be run on parallel machines
    - An extended memory hierarchy adds another layer of complexity to locality
  - Possible integration with FASE: model the memory hierarchy in the node components
25. References
- [Adva99] Advanced Micro Devices, Inc., "Implementation of Write Allocate in the K86 Processors," AMD whitepaper, Publication No. 21326, Feb. 1999.
- [Aust97] T. Austin, "A User's and Hacker's Guide to the SimpleScalar Architectural Research Tool Set," Intel MicroComputer Research Labs (http://www.simplescalar.com/docs.html), Jan. 1997.
- [Gonz95] A. Gonzalez, C. Aliagas, M. Valero, "A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality," Department of Computer Architecture, Polytechnic University of Catalonia, Barcelona, 1995.
- [Inno02] R. Innocente, "HPC on Linux Clusters: Nodes and Networks Hardware," tutorial for ICTP school, International School for Advanced Studies, Feb. 2002.
- [John97] T. Johnson, M. Merten, W. Hwu, "Run-time Spatial Locality Detection and Optimization," Center for Reliable and High-Performance Computing, University of Illinois, Urbana-Champaign, IL, 1997.
- [John98] T. Johnson, D. Connors, W. Hwu, "Run-time Adaptive Cache Management," Center for Reliable and High-Performance Computing, University of Illinois, Urbana-Champaign, IL, 1998.
- [Memi03] G. Memik, M. Kandemir, A. Choudhary, I. Kadayif, "An Integrated Approach for Improving Cache Behavior," Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2003.
- [Tyso95] G. Tyson, M. Farrens, J. Matthews, A. Pleszkun, "A Modified Approach to Data Cache Management," 1995.
- [Wilk02] T. Wilkens, "Optimizing for the AMD Opteron Processor," AMD Developer Symposium, Oct. 2002.
- [Wong97] W. Wong, "Hardware Techniques for Data Placement," Department of Computer Science and Engineering, University of Washington, Seattle, WA, Aug. 1997.
26. Performance Prediction Setup
- 2 nodes, each 1.33 GHz Athlon, 256 MB RAM
  - 3Com FastE adapter, Nortel PassPort switch, MPICH
  - Red Hat Linux 9 with kernel version 2.4.20-8
- Scaling factor for prediction
  - Sequential version of matrix multiply
  - Sizes analyzed: 50x50, 100x100, 250x250, 500x500, 1000x1000, 2000x2000
  - Average execution time over 25 iterations for each size
  - Scaling factor: average execution time on the Athlon machine divided by the Xeon execution time
  - Data-set size for each case estimated by hand and related to the scaling factor
  - Performance prediction only conducted using substitution table 2^9
- Two scaling factors used
  - Constant factor of 2
  - Variable factor determined by a best-fit quadratic to the data points (evaluated in the sketch below)
    - Equation: ax^2 + bx + c with a = -1.8778e-9, b = 1.96274e-4, c = 1.908562
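A small C sketch of how the variable scaling factor would be evaluated from the fitted quadratic (coefficients from the slide; the reading of x as matrix dimension is one illustrative interpretation of the slides' "data set size," and the function name is ours):

    /* Evaluates the variable scaling factor from the best-fit quadratic
     * reported above: scale(x) = a*x^2 + b*x + c. Coefficients are from
     * the slide; names and the choice of x values are illustrative. */
    #include <stdio.h>

    static double scale_factor(double x)
    {
        const double a = -1.8778e-9;
        const double b =  1.96274e-4;
        const double c =  1.908562;
        return (a * x + b) * x + c;   /* a*x^2 + b*x + c via Horner */
    }

    int main(void)
    {
        /* Sizes analyzed for the sequential matrix multiply. */
        const double sizes[] = { 50, 100, 250, 500, 1000, 2000 };
        for (int i = 0; i < 6; i++)
            printf("size %4.0f -> scale %.4f\n",
                   sizes[i], scale_factor(sizes[i]));
        return 0;
    }

Simulated computation times from the Xeon-generated scripts would then be multiplied by this factor to predict behavior on the Athlon nodes.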
27. Performance Prediction Results
- Performance prediction
  - Large range of error (0.5% to 30%)
  - Matrix multiply errors for the variable scaling factor are relatively low
    - Due to use of its scalar version to determine the scaling factor
    - Error grows with larger matrix size due to the quadratic curve fitting; more complex equation(s) needed
  - Bench 12 errors under 20% and fairly constant for both scaling factors
    - Data-set size estimation may be too rough
    - Matrix multiply and Bench 12 are too different for correlation
  - Different types of applications should be represented by correspondingly bounded programs (i.e., a memory-bound program must use a memory-bound benchmark to get its scaling factor)
  - More research needed in this area