Performance Modeling and Simulation for Tradeoff Analyses in Advanced HPC Systems - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Performance Modeling and Simulation for Tradeoff
Analyses in Advanced HPC Systems
  • Dr. Alan D. George, Principal Investigator
  • Mr. Eric M. Grobelny, Sr. Research Assistant
  • HCS Research Laboratory
  • University of Florida

2
Objectives and Motivations
  • High-performance computing involves applications
    that require parallelization for feasible
    completion time
  • HPC systems used for distributed computing
  • Issues with heterogeneity
  • Efficiency of hardware resource usage
  • Execution time
  • Overhead
  • Challenge: find the optimum configuration of
    resources and task distribution for key
    applications under study
  • Nearly impossible and too expensive to determine
    experimentally
  • Simulation tools are required
  • Challenges with simulation approach
  • Large, complex systems
  • Balance speed and fidelity

3
FASE Overview
  • FASE: Fast and Accurate Simulation Environment
  • Goals
  • Find optimum system configuration on which to run
    specific application
  • Performance analysis of specific application
    running on a system
  • Identify bottlenecks in this configuration and
    optimize program
  • Use mixture of pre-simulation and simulation
  • Pre-simulation: extraction of key application
    characteristics and abstraction of less
    influential components
  • Simulation: use pre-simulation results to
    determine overall performance on currently
    unavailable systems
  • MLDesigner: discrete-event simulation
    environment
  • Block-oriented, hierarchical modeling paradigm to
    minimize development time
  • Sacrifices some speed for user-friendly interface
  • Related work conducted at San Diego Supercomputer
    Center (PMaC project), European Center for
    Parallelism of Barcelona (Dimemas), U of Illinois
    (SvPablo), U of Wisconsin (Paradyn), and U of
    Oregon (TAU)

4
FASE Process Flow Diagram
  • Input parallel program into Script Generator
  • Code instrumented and executed
  • Scripts created during execution
  • Post-processing conducted
  • Script files read in by MLDesigner
  • One script per simulated processor
  • When all script files have been completed,
    simulation is complete and statistics are reported
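The flow above — one script per simulated processor, replayed until all scripts are exhausted — can be sketched as a minimal script-driven loop. This is an illustrative simplification, not FASE/MLDesigner code: the script format, function names, and the per-processor serialization (which ignores cross-processor synchronization) are all assumptions.

```python
# Minimal sketch of a script-driven simulation loop (hypothetical script
# format; the deck does not publish FASE/MLDesigner internals).
# Each simulated processor replays its own script of timed events.

def run_simulation(scripts, comp_scale=1.0, comm_model=None):
    """scripts: {proc_id: [(kind, payload), ...]}.
    'comp' events carry a measured time; 'send'/'recv' events carry
    (peer, nbytes). comp_scale rescales computation for a different
    target machine; comm_model maps message size to simulated time."""
    clock = {p: 0.0 for p in scripts}           # per-processor virtual time
    for proc, events in scripts.items():
        for kind, payload in events:
            if kind == "comp":                  # non-communication event
                clock[proc] += payload * comp_scale
            elif kind in ("send", "recv"):      # communication event:
                peer, nbytes = payload          # delegate to a network model
                clock[proc] += comm_model(nbytes) if comm_model else 0.0
    return max(clock.values())                  # simulated completion time
```

A usage example with a toy latency-plus-bandwidth network model: `run_simulation(scripts, comm_model=lambda n: 1e-6 + n / 1e9)`.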

FASE process flow diagram
5
Script Generator
  • Extracts key characteristics from program to
    drive simulation environment
  • Features
  • Supported languages: C and C++, with MPI and
    Cray SHMEM
  • Automatic instrumentation of selected MPI and
    SHMEM functions
  • Supported MPI functions
  • MPI_Send, MPI_Ssend, MPI_Recv, MPI_Alltoall,
    MPI_Bcast, MPI_Reduce
  • Supported SHMEM functions
  • shmem_get, shmem_put
  • Foundation in place for easy addition of other
    functions
  • Non-communication events abstracted by simple
    timing
  • Times scaled during simulation to represent
    different machine
  • Scripts generated by running binary executable
  • Traced events from instrumentation output to
    files
  • Script files drive simulation models
  • Post-processing
  • Overhead from timing function calculated and
    reported during application execution
  • Average overhead subtracted from all
    non-communication events

Script generator data flow
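The post-processing step above — subtracting the average instrumentation overhead from every non-communication event — might look like the following sketch (function and parameter names are illustrative, not from FASE):

```python
def remove_timing_overhead(comp_times, overhead_samples):
    """Post-processing sketch: subtract the average timing-function
    overhead (measured during application execution) from every
    non-communication event time, clamping at zero."""
    avg = sum(overhead_samples) / len(overhead_samples)
    return [max(t - avg, 0.0) for t in comp_times]
```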
6
Network Models: SCI
  • Scalable Coherent Interface
  • Direct and indirect network
  • 1D, 2D, and 3D direct topologies most widely used
  • SCI link controller architecture: key components
  • Output queue (red oval)
  • Hold waiting, in-transit, or retry
    request/response packets
  • Input queue (blue oval)
  • Hold waiting request/response packets for
    transmission to host processor
  • Bypass queue (green oval)
  • Hold packets not destined for host machine
  • Normal SCI data flow in bottom right figure
  • Simulation
  • Packet-level simulation to cut down simulation
    time
  • Used to analyze system performance using
    high-speed interconnect
  • Scale hardware parameters to determine benefits
    of future generations

SCI link controller architecture
SCI data flow
7
Network Models: InfiniBand
  • InfiniBand components
  • Consumers communicate with each other through
    Host Channel Adapters (HCAs)
  • Subnet Managers (SMs) give least congested path
    to consumer
  • Subnet Management Agent (SMA) gives SM least
    congested Virtual Link (VL) and the associated
    port
  • InfiniBand Switch (not modeled) connects multiple
    HCAs and subnets
  • HCA (main component)
  • Port queue pool (light blue oval)
  • VL queue pool (purple oval)
  • Queue Pair pool (light green oval)
  • SMA (black oval)

High-level InfiniBand diagram
HCA MLD model
8
Network Models: TCP
  • TCP Library
  • Layered architecture (application, network,
    datalink)
  • TCP layer contains most of the library's
    functionality
  • TCP module components
  • Sending segments (grey oval)
  • Receiving segments (purple oval)
  • Allocating buffer memory (orange oval)
  • Generating acknowledgements and calculating
    timeout values (black oval)
  • FASE TCP Node Architecture
  • TCP module (red oval)
  • Relays application data to/from the FASE
    integration components
  • FASE integration components (blue oval)
  • Interface for the TCP library for script
    generator functionality

FASE TCP node architecture
TCP module
9
Network Models: RapidIO
  • Embedded systems switched interconnect
  • Three layer architecture
  • Physical, transport, logical layers
  • Model features
  • Message-passing logical layer
  • Parallel physical layer
  • Wide variety of adjustable model parameters
  • Physical, logical, transport layers
  • Supports error injection and recovery (CRC
    failures)
  • Two methods of flow control
  • Transmitter-controlled
  • Receiver-controlled
  • 6- and 8-port central memory switch model

RapidIO four-switch backplane
10
Network Model Validations
  • Ping program
  • Message sizes: 1 byte to 8 MB
  • Average of 1000 iterations for each message size
  • Compare experimental and simulation throughputs
  • SCI
  • Average error: 3.3%
  • InfiniBand
  • Average error: 1.211%
  • TCP
  • Average error: 1.64%
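The validation metric above — comparing experimental and simulation throughputs over all message sizes — is presumably a mean absolute percent error; a sketch of that computation (the exact metric used in the deck is an assumption):

```python
def avg_percent_error(experimental, simulated):
    """Mean absolute percent error between experimental and simulated
    throughput, averaged across all message sizes."""
    errs = [abs(s - e) / e * 100.0 for e, s in zip(experimental, simulated)]
    return sum(errs) / len(errs)
```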

11
Experimental Setup
  • System configurations
  • 2-node dual 2.4 GHz Xeon, 1GB RAM
  • SCI: 1D ring topology using Dolphin D33X
    adapter, Scali SSP 4.2.1 for MPI
  • InfiniBand: InfiniCon switch and HCAs, InfiniCon
    MPI
  • TCP: over Intel Pro/1000 GigE adapter, Cisco
    Catalyst switch, LAM MPI
  • Red Hat Linux 9 with kernel version 2.4.20-8 or
    corresponding patched kernel for SCI/InfiniBand
    support
  • Both hardware measurements and scripts obtained
    with dual Xeon nodes
  • Experiments
  • Average of 25 iterations
  • Matrix multiply
  • Four sizes: 50x50, 100x100, 250x250, and 500x500
  • Bench 12
  • 9 permutations of main table size (2^15, 2^18,
    2^20) and substitution table size (2^8, 2^9, 2^10)

12
Results: SCI
  • Matrix Multiply
  • Error higher (7.5%) at smaller sizes due to
    small time values and large deviations in the
    experimental data
  • Dominated by computation, therefore deviations
    during script generation can greatly affect
    results
  • Bench 12
  • Experimental deviations under 2.5%
  • Effective errors between 2% and 5%
  • Assumes standard deviation of error incurred
    during each computational time measurement
  • Errors of simulation runs versus experimental
    variation errors
  • High experimental deviation leads to higher error
    in simulation
  • Ideally want simulation error within experimental
    deviation error

13
Results: InfiniBand
  • Matrix Multiply
  • High error (25%) at small matrix size, similar
    to SCI
  • For larger sizes, error is much smaller (less
    than 4%)
  • Error percentage decreases as size increases
  • Wall clock simulation times
  • 50x50 slowdown factor of 3650
  • 500x500 slowdown factor of 16
  • Bench 12
  • Errors under 1.5% for each case
  • Experimental deviations under 2%
  • Wall clock simulation times
  • 1200 times greater at main table size 2^15
  • 210 times greater at main table size 2^20

14
Results: TCP
  • Matrix Multiply
  • Error higher (15%) at smaller sizes, similar to
    SCI and InfiniBand
  • Larger sizes have simulation errors less than the
    experimental deviation
  • Wall clock simulation times
  • 50x50 slowdown factor of 140
  • 500x500 slowdown factor of 6
  • Bench 12
  • Error less than 9% for all cases and effective
    errors less than 4%
  • Simulation error within experimental deviation
    for table size 2^15
  • Wall clock simulation times
  • 780 times greater at main table size 2^15
  • 940 times greater at main table size 2^20

15
RC Model
  • The RC (reconfigurable computing) arena has been
    dominated by experimentation, with little done in
    simulation
  • Simulation
  • Predict performance gains on future and more
    advanced systems
  • Determine optimal workload and data distribution
  • Predict performance in emerging realms of RC
  • Resource management
  • Independent RC fabric communication (i.e. without
    processor)
  • Large-scale HPC (e.g. Cray XD1, SRC MAP
    processor, and SGI)
  • Current models
  • RC fabric with management unit (red oval) and
    dynamically created and reconfigured functional
    units (blue oval)
  • Dynamically created RC fabrics (green oval)
  • Interface with FASE for script support (purple
    oval)
  • Multiple host processor/fabric interconnects
    supported
  • RC node figure illustrates PCI bus interface
    (orange oval), but could plug in RIO, IBA, SCI,
    etc
  • Support for inter-fabric communication

RC fabric
RC node
16
Experiments and Results: RC Model
Blowfish with FFT results
  • Objective
  • Determine how interconnect enhancements improve
    RC application performance
  • Experiments
  • Blowfish algorithm with FFT
  • Sends 4096-bit encrypted message to RC board over
    PCI bus in 32-bit chunks
  • Experimental system used 33MHz, 32-bit PCI bus
  • Functional unit configurations and
    initializations not included
  • Test Cases
  • Varied PCI specifications
  • 33MHz/32-bit data path
  • 66MHz/64-bit data path
  • 100MHz/64-bit data path
  • Varied bus arbitration penalties
  • 0, 3, 5, 7 clock cycles
  • Analysis of preliminary results
  • 33MHz, 32-bit configuration most closely matches
    experimental system
  • Results closely match (0.4% to 1.4% error)
  • Other configurations close to ideal execution
    time
  • Deduction: PCI bus not a major factor in this
    algorithm
  • Due to small amount of data transferred

17
Case Study: Low-Locality Applications
  • Proposed Solutions
  • Reference Classification
  • Instruction-based [Tyso95], [Gonz95], [Wong97]
  • Our case study considers two variations of
    instruction-based approach
  • AMD Opteron takes instruction-based approach
  • Memory address-based [Memi03], [John97]
  • MTRR is an example of memory address-based
    approach
  • Current Technology: Existing Support
  • Intel P6 µarch and later [Inno02]
  • Uses Memory Type Range Registers (MTRR)
  • 5 possible memory-type classifications
  • Write-through, write-back, write-protect,
    write-combining, uncacheable
  • Specify up to 8 address ranges
  • Accessed through command prompt, assembly
  • AMD processors implement the MTRR as well
    [Adva99]
  • AMD Opteron [Wilk02]
  • Added instructions to ISA
  • 3 prefetch instructions
  • Fetch into L1 only, L2 only, or into L1 and L2
  • Streaming store instruction
  • Store directly from write buffer to memory
  • Do not place in cache or replace cache blocks
  • Extensive compiler optimization research
  • Not considering software solutions
  • Optimizations simulated using SimpleScalar
  • Benchmarks include
  • Table Toy (Bench 12)
  • Integer Sort (Bench 2)
  • Integer Selection and Summation (Bench 5)
  • Represent a range of good and bad locality

18
Baseline Results
  • Simulation Parameters
  • L1 data cache
  • 16KB, 2-way set associative, 32B blocks
  • 2 cc access latency
  • L2 unified
  • 256KB, 4-way set associative, 64B blocks
  • 7 cc access latency
  • Main memory
  • 64-bit bus width
  • 60 cc first-word latency, 2 cc each word
  • Benchmark Parameters
  • Bench 2: integer size 64b, number of integers
    1.5x10^7
  • Bench 5: number of integers 10^8
  • Bench 12: main table 2^26, sub. table 2^17
  • Metrics to measure
  • CPI is main statistic of interest
  • Proportional to execution time (CLK x CPI x IC)
  • L1 and L2 data cache access count, miss count
  • Profile memory access behavior

Baseline CPI
Baseline miss rates
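The relation noted above (execution time proportional to CLK x CPI x IC) is the classic processor performance equation; a trivial sketch with illustrative function names:

```python
def cpi(total_cycles, instruction_count):
    """Cycles per instruction, the main statistic of interest."""
    return total_cycles / instruction_count

def execution_time(instruction_count, cpi_value, clock_hz):
    """Execution time = IC x CPI x clock period (= IC x CPI / frequency)."""
    return instruction_count * cpi_value / clock_hz
```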
19
Instruction-based Bypassing: Dynamic
  • Incorporate Dynamic Bypass Table (DBT) before
    cache
  • DBT design based on that found in [Wong97]
  • Small, cache-speed memory monitors memory
    reference instructions
  • Table to store n 8B entries
  • 3 fields per entry
  • PC: instruction identifier
  • SC: saturating counter
  • Increment on miss, decrement on hit
  • If it reaches the threshold, mark the
    instruction NC (non-cacheable)
  • RC: reference counter
  • Used for replacement policy
  • Incremented each time instruction is executed
  • Hardware requirements
  • (8 x n) bytes storage, L1 cache speed
  • Plus control logic
  • DBT determines instructions with poor locality
  • Table size, saturation threshold, and
    replacement policy affect this determination
  • Replacement policy evicts the least-executed
    entry (fewest accesses)

Saturating counter: m bits (m < 9); threshold = 2^m - 1 (at most 255)
Dynamic Bypass Table
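The DBT mechanism described above can be sketched in a few lines. This is an illustrative software model of the hardware structure, not the simulator's code: the class name, dictionary layout, and default parameters (1024 entries, threshold 15, matching the experiments later in the deck) are assumptions.

```python
# Sketch of the Dynamic Bypass Table (DBT), after the design in [Wong97]:
# a small table indexed by the PC of memory-reference instructions.
# SC (saturating counter) increments on a cache miss and decrements on a
# hit; at the threshold the instruction is marked NC (non-cacheable).
# RC (reference counter) drives the least-executed replacement policy.

class DynamicBypassTable:
    def __init__(self, entries=1024, threshold=15):
        self.entries = entries
        self.threshold = threshold
        self.table = {}                  # pc -> {"sc": int, "rc": int}

    def access(self, pc, hit):
        if pc not in self.table:
            if len(self.table) >= self.entries:
                # replacement policy: evict the least-executed entry
                victim = min(self.table, key=lambda p: self.table[p]["rc"])
                del self.table[victim]
            self.table[pc] = {"sc": 0, "rc": 0}
        e = self.table[pc]
        e["rc"] += 1                     # this instruction executed once more
        if hit:
            e["sc"] = max(e["sc"] - 1, 0)
        else:
            e["sc"] = min(e["sc"] + 1, self.threshold)

    def is_non_cacheable(self, pc):
        e = self.table.get(pc)
        return e is not None and e["sc"] >= self.threshold
```

Note how a higher threshold gives more tolerance before an instruction is declared NC, and a smaller table trades accuracy for hardware cost — the two sweeps shown on the next slide.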
20
Instruction-based Bypassing: Dynamic
  • Dynamic solution provides benefit for Bench 12
  • Degrades performance on Bench 2
  • Too many instructions saturate
  • Bench 5 barely affected
  • Almost no instructions saturate
  • Simulator outputs DBT contents at end of
    simulation
  • Check to see which instructions saturated
  • Performance w/ multiple saturation values
  • DBT size constant at 1024 entries
  • Higher saturation threshold means more tolerance
    before declaring NC
  • No more improvement for thresholds above 15
  • Very small threshold too sensitive
  • Effect of varying DBT size
  • Saturate threshold constant at 15
  • For these benchmarks, 128 entries is enough
  • Fewer entries mean smaller hardware requirements
  • 128 entries: about 1KB of logic

Variable saturation
Variable DBT size
21
Discussion of Results
  • Instruction-based NC approach
  • Bench 2 and 5 not good candidates for NC solution
  • Sources of improved performance from NC
  • Reserve cache space for references with good
    locality
  • No L1/L2 access penalty waiting for hit/miss
    decision
  • Static approach is analogous to the dynamic one
  • Add new instructions to ISA, more control over NC
  • No DBT, programmer chooses NC refs pre-compile
  • Static approach less versatile but expected to
    achieve better performance than the DBT;
    experiments are well underway

Main loop of Bench 12 (static approach)
  • Hardware costs of proposed solutions
  • Dynamic
  • Cache-like memory, around 1 KB logic
  • Static
  • New instructions added to ISA
  • Both require word-addressable direct memory port
  • CPI for faster access latencies
  • Reconsidering existing hardware
  • Instruction-based support found in Opteron
  • Only has non-caching store, no matching load
  • MTRR can be used in memory-based approach
  • Would require further research, though MTRR not
    intended for this purpose

Variable NC access latency
22
Conclusions
  • FASE
  • Pre-simulation process characterizes application
  • Captures key parameters of communication events
    (subset of MPI and SHMEM library calls) to drive
    simulation
  • Non-communication areas use timing; a scaling
    factor models other computational components
  • Uses hardware timers, making it fast
  • Simulation
  • Used to accurately model communication events and
    scale other events
  • Can use any network model in FASE library in a
    system configuration
  • User-definable parameters to customize network
    and computational unit settings
  • Network models
  • All models produced acceptable results, with
    errors under 8% in most cases
  • InfiniBand and TCP models showed simulation
    slowdowns on the order of 10^3
  • Verification of the RapidIO model through
    experiments is underway
  • Models will be optimized to increase accuracy
    while minimizing simulation time
  • RC model
  • Support for multiple RC nodes, multiple RC
    fabrics on each node, and multiple,
    reconfigurable functional units in each fabric
  • Model still in infancy stage, but results
    obtained are promising
  • Need to test other algorithms, vary other
    parameters, and explore advantages of
    inter-fabric/inter-node communication potential
    without host intervention
  • Low-locality case study

23
Open Issues for FASE
  • Dynamic applications
  • Scripts are inherently static and cannot capture
    dynamic information (analogy: snapshot vs. live
    video)
  • Potential aid: hardware-in-the-loop
  • e.g., pause the application and send data to
    MLD; MLD sends back information; the application
    resumes
  • Instrumentation and processor modeling
  • Current instrumented code times computation and
    affects actual program execution
  • e.g. Extra code can cause cache misses that
    would otherwise not happen
  • Potential solution: eliminate timing and relate
    computation to source or assembly code
  • e.g. Count loop iterations with knowledge of
    instruction types and number in loop
  • Model memory hierarchy to capture memory related
    performance issues
  • Prediction
  • Pre-simulation/script generation results in data
    for one specific case (e.g. 2 processor system)
  • Potential aid 1: extract information on task
    assignment in relation to system size
  • Potential aid 2: analyze trends and extrapolate
  • Scaling factor determined by fitting curves to
    data points of specific application (assuming
    timing of computation is not replaced) with
    respect to data set size
  • Potential solution: classify the application in
    terms of its limiting factor (e.g., memory, CPU,
    I/O) and use representative benchmarks

24
Future Work
  • FASE
  • Support other programming languages
  • More completely support the MPI and SHMEM
    programming libraries
  • Devise scheme to support UPC
  • Model other components
  • Continue enhancing RC models
  • Potential modeling of memory hierarchy, storage
    devices, WAN and grid computing components, etc.
  • Optimize existing network models for speed
    without sacrificing accuracy
  • Run experiments with larger systems
  • Devise roadmap for solving/considering issues in
    previous slide
  • Low-locality case study
  • Many simple variations of proposed solutions
  • Memory address-based classification
  • Monitor additional/different metrics besides NC
  • Extend scalar simulations to include
    multi-processor systems
  • Benchmarks intended to be run on parallel
    machines
  • Extended memory hierarchy adds another layer of
    complexity to locality
  • Possible integration with FASE, model memory
    hierarchy in node components

25
References
  • [Adva99] Advanced Micro Devices, Inc.,
    "Implementation of Write Allocate in the K86
    Processors," AMD Whitepaper, Publication No.
    21326, Feb. 1999.
  • [Aust97] T. Austin, "A User's and Hacker's Guide
    to the SimpleScalar Architectural Research Tool
    Set," Intel MicroComputer Research Labs
    (http://www.simplescalar.com/docs.html), Jan.
    1997.
  • [Gonz95] A. Gonzalez, C. Aliagas, M. Valero, "A
    Data Cache with Multiple Caching Strategies
    Tuned to Different Types of Locality,"
    Department of Computer Architecture, Polytechnic
    University of Catalunya, Barcelona, 1995.
  • [Inno02] R. Innocente, "HPC on Linux clusters:
    Nodes and Networks hardware," Tutorial for
    ICTP-School, International School for Advanced
    Studies, Feb. 2002.
  • [John97] T. Johnson, M. Merten, W. Hwu, "Runtime
    Spatial Locality Detection and Optimization,"
    Center for Reliable and High-Performance
    Computing, University of Illinois,
    Urbana-Champaign, IL, 1997.
  • [John98] T. Johnson, D. Connors, W. Hwu,
    "Run-time Adaptive Cache Management," Center for
    Reliable and High-Performance Computing,
    University of Illinois, Urbana-Champaign, IL,
    1998.
  • [Memi03] G. Memik, M. Kandemir, A. Choudary, I.
    Kadayif, "An Integrated Approach for Improving
    Cache Behavior," Proceedings of the Design,
    Automation and Test in Europe Conference and
    Exhibition, 2003.
  • [Tyso95] G. Tyson, M. Farrens, J. Matthews, A.
    Pleszkun, "A Modified Approach to Data Cache
    Management," 1995.
  • [Wilk02] T. Wilkens, "Optimizing for the AMD
    Opteron Processor," AMD Developer Symposium, Oct.
    2002.
  • [Wong97] W. Wong, "Hardware Techniques for Data
    Placement," Department of Computer Science and
    Engineering, University of Washington, Seattle,
    WA, Aug. 1997.

26
Performance Prediction Setup
  • 2-node 1.33 GHz Athlon, 256MB RAM
  • 3Com FastE adapter, Nortel PassPort switch, MPICH
  • Red Hat Linux 9 with kernel version 2.4.20-8
  • Scaling factor for prediction
  • Sequential version of matrix multiply
  • Sizes analyzed: 50x50, 100x100, 250x250,
    500x500, 1000x1000, 2000x2000
  • Average execution time over 25 iterations for
    each size
  • Scaling factor: average execution time on the
    Athlon machine divided by the Xeon execution time
  • Data set size for each case estimated by hand and
    related to scaling factor
  • Performance prediction only conducted using
    substitution table size 2^9
  • Two scaling factors used
  • Constant factor: 2
  • Variable factor: determined by a best-fit
    quadratic equation to the data points
  • Equation: ax^2 + bx + c
  • a = -1.8778e-9, b = 1.96274e-4, c = 1.908562
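The variable scaling factor above is just the best-fit quadratic evaluated at the data set size; a sketch using the coefficients from this slide (the function name and the interpretation of x as data set size are assumptions):

```python
# Coefficients of the best-fit quadratic a*x^2 + b*x + c from this slide.
A, B, C = -1.8778e-9, 1.96274e-4, 1.908562

def scaling_factor(x):
    """Variable Athlon-to-Xeon scaling factor as a function of data set
    size x, per the best-fit quadratic above."""
    return A * x * x + B * x + C
```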

27
Performance Prediction Results
  • Performance prediction
  • Large range of error (0.5% to 30%)
  • Matrix multiply errors for variable scaling
    factor relatively low
  • Due to use of its scalar version to determine
    scaling factor
  • Error larger with larger matrix size due to
    quadratic curve fitting
  • More complex equation(s) needed
  • Bench 12 errors under 20% and fairly constant
    for both scaling factors
  • Data set size estimation may be too rough
  • Matrix Multiply and Bench 12 too different for
    correlation
  • Different types of applications must be
    represented by matching bounded programs (i.e., a
    memory-bound program must use a memory-bound
    benchmark to obtain its scaling factor)
  • More research needed in this area