Title: Analytical Modeling of High Performance Reconfigurable Computers: Prediction and Analysis of System Performance
1. Analytical Modeling of High Performance Reconfigurable Computers: Prediction and Analysis of System Performance
- A Dissertation Proposal for the Doctor of Philosophy Degree in Electrical Engineering
- Melissa C. Smith
- March 6, 2002
Research Partially Supported by Air Force Research Laboratory (AFRL) Center for Information Technology Research (CITR)
2. Outline
- Introduction, Background, Related Work
- Model Methodology Development
- Model Validation
- Model Applications
- Status and Remaining Work
3. HPC, RC, HPRC
- High Performance Computing (HPC): Advanced architectures (vector supercomputers, MPPs, NOWs, etc.) designed to work collectively on a common problem. My focus is narrowed to distributed-memory, MIMD-class machines.
- Reconfigurable Computing (RC): Integration of reconfigurable logic with a processor to achieve hardware-like performance with software-like flexibility.
- High Performance Reconfigurable Computing (HPRC): Marriage of HPC and RC elements.
4. HPRC Introduction
- Independently, HPC and RC demonstrate performance advantages for many applications
- Individually, HPC and RC are challenging to program and utilize
5. Problem Statement
HPRC performance analysis must address new issues (compared to traditional HPC)
Rich design space of HPRC yields potentially complex performance analysis
- Proposed research will bridge the analysis gap between the HPC and RC domains by developing
  - Analytical model for characterizing RC system performance
  - Analytical model for HPRC platform
6. Modeling Framework
What is it? What can it do?
- Performance analysis and design tradeoffs of architecture
- Equations to estimate/predict performance
- Optimization cost functions
- Potential performance metrics and design tradeoffs include
7. Modeling Techniques
- Simulation
  - Behavior model driven with abstracted workload or trace data
  - Also used to validate other models
- Measurement
  - HW/SW monitors
  - Often used to calibrate/validate other models
  - Probes often perturb behavior
- Analytical Modeling
  - Mathematical model
  - Can be difficult to solve
  - Queuing, Petri Nets, Markov
- Why analytical?
  - Even simple models can yield accurate results
  - Useful for trend analysis
  - Even intractable models can provide valuable insight
8. HPC Background and Related Work
- Architecture to a large extent determines system performance (memory, processing nodes, interconnection network)
- My focus is on multiprocessor MIMD architectures with distributed memory systems (examples: MPPs, grid computing, and Beowulf clusters)
- HPC performance analysis: many studies exist in the literature
- Common metrics are speedup and efficiency
- Peterson, Atallah, Noble, others
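The two metrics named above can be computed as in the following minimal sketch. It assumes the conventional definitions (speedup is serial runtime divided by parallel runtime; efficiency is speedup divided by the number of processors). The function names and the sample runtimes are illustrative assumptions, not values from this work.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Conventional speedup: serial runtime divided by parallel runtime."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("runtimes must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_procs: int) -> float:
    """Conventional efficiency: speedup per processor (1.0 is ideal)."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return speedup(t_serial, t_parallel) / n_procs


if __name__ == "__main__":
    # Hypothetical runtimes in seconds, chosen only for illustration.
    t1, tp, p = 120.0, 20.0, 8
    print(f"speedup    = {speedup(t1, tp):.2f}")
    print(f"efficiency = {efficiency(t1, tp, p):.2f}")
```

Whether these definitions carry over unchanged to RC and HPRC systems is one of the open questions raised later in the talk.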
9. RC Background and Related Work
- ASIC-like performance with software-like flexibility
- Example systems
  - Annapolis (Wildforce/Firebird)
  - Pilchard (configuration offline)
  - ISI (SLAAC)
  - Nallatech
  - Virtual Computer
  - PipeRench
  - Many others
- RC performance-related literature is limited
- Are speedup and efficiency definitions the same as for HPC? What does efficiency mean for RC?
- Embedded users are often concerned with power, area, cost, etc.
10. HPRC Background and Related Work
- Cluster of processing nodes with RC units connected by an interconnection network
- Architecture options include
  - RC coupling to processor
  - Number of RC units
  - Size of FPGAs
  - Number of nodes
  - Network bandwidth
  - Configuration latency
  - Dedicated FPGA network
  - Memory for FPGAs
- New fertile territory
- Combine HPC and RC metrics, or do we need new ones?
- Plan is for an analytical modeling approach
- CHAMPION and NetSolve tools expanded with model for HPRC use
11. Modeling Methodology
- Phase 1: Isolate HPC issues (point-to-point communication, network setup, synchronization, serial overhead, load imbalance)
- Phase 2: Isolate RC issues (FPGA config/setup, data distribution, HW/SW compute time and load imbalance)
- Phase 3: Combine HPC and RC models (load balance studies and overall accuracy)
- Iterate for accuracy and model generalizations
12. HPC Studies
- Goal: Isolate communication, synchronization, and relative workstation speed/performance issues
- Initial measurements conducted on the UT ECE vlsi cluster
13. RC Model (1)
- Single node running a synchronous iterative algorithm
- Goal: Model interaction between processor and RC unit
- HW/SW trees can be arbitrarily complex (simple here)
14. RC Model (2)
- Runtime for a given iteration is equal to the time for the last task to complete (HW or SW) plus total overhead
- Model each as a random value and assume
  - Each iteration requires roughly the same amount of computation
  - Random variables are independent and identically distributed (iid)
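The runtime statement above can be illustrated with a short sketch. It assumes that all SW and HW tasks of an iteration start together and run concurrently, and that the total overhead is a single number per iteration. The function names and argument layout are hypothetical and are not the notation of the model itself.

```python
def rc_iteration_runtime(sw_task_times, hw_task_times, overhead):
    """Runtime of one iteration on a single RC node.

    Assumes (for this sketch) that all SW and HW tasks start together and
    run concurrently, so the iteration ends when the slowest task ends.
    """
    tasks = list(sw_task_times) + list(hw_task_times)
    if not tasks:
        raise ValueError("at least one task is required")
    return max(tasks) + overhead


def rc_total_runtime(iterations, overhead_per_iteration):
    """Total runtime: sum of the per-iteration runtimes.

    `iterations` is a sequence of (sw_task_times, hw_task_times) pairs.
    """
    return sum(
        rc_iteration_runtime(sw, hw, overhead_per_iteration)
        for sw, hw in iterations
    )
```

For example, `rc_iteration_runtime([4.0], [2.5, 3.0], 0.5)` evaluates to 4.5: the SW task finishes last at 4.0, and the overhead adds 0.5.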
15. RC Model (3)
- Rewrite HW/SW tasks in terms of total work (assume HW and SW tasks take the same time, t_avg_task)
- Account for HW/SW load imbalance
16. RC Model (4)
- SW-only runtime on a single processor (HW acceleration factor s)
17. HPRC Model (1)
- Limit study to synchronous iterative algorithms (focus on communication and synchronization)
- Begin with a dedicated homogeneous system (i.e., no background load)
18. HPRC Model (2)
- Runtime for a given iteration is equal to the time for the last task to complete plus total overhead
- Model each as a random value and assume
  - Each iteration requires roughly the same amount of computation
  - Random variables are iid
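Because an iteration ends only when the last task finishes, the expected iteration time grows with the number of concurrent iid tasks. The following Monte Carlo sketch illustrates that effect. The uniform task-time distribution, the parameter values, and the function names are all assumptions made for illustration; they are not taken from the model or from measurements.

```python
import random


def hprc_iteration_runtime(task_times, overhead):
    """One iteration: the last task to finish (on any node) plus overhead."""
    return max(task_times) + overhead


def mean_iteration_runtime(n_tasks, overhead, n_iterations=2000, seed=0):
    """Estimate the mean iteration runtime by sampling iid task times.

    Task times are drawn from Uniform(0.9, 1.1); this distribution is an
    assumption made for the sketch, not one taken from the source.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_iterations):
        times = [rng.uniform(0.9, 1.1) for _ in range(n_tasks)]
        total += hprc_iteration_runtime(times, overhead)
    return total / n_iterations


if __name__ == "__main__":
    for n in (1, 6, 16):  # hypothetical numbers of concurrent tasks
        print(n, round(mean_iteration_runtime(n, overhead=0.05), 3))
```

Sampling is used here only because it keeps the sketch short; an analytical model would instead work with the distribution of the maximum directly.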
19. HPRC Model (3)
- Rewrite tasks in terms of total work
- Account for application and RC load imbalance
20. HPRC Model (4)
- SW-only runtime on a single processor (HW acceleration s_k)
21. Validation Method
AFRL cluster of Pentium/Firebird nodes
UT cluster of Pentium/Pilchard nodes
- CHAMPION Hi-Pass and START Demos
  - Relatively simple algorithms allowing isolation of RC interface issues
  - Already implemented on Wildforce, allowing focus to be on the RC system, model, and measurements rather than debugging
  - Will need port to Pilchard and Firebird architectures
  - Elementary parallel application to study data distribution and synchronization/communication parameters
Demos: CHAMPION, K-means, Holography
- Classification Algorithm
  - K-means clustering used for data organization and analysis
  - Hardware implementation exists with fixed precision and Manhattan distance calculation; it is adaptable to our platforms
  - Permits study of load balance issues due to large amount of data
- Holography Reconstruction Algorithm
  - Uses FFT in the digital reconstruction of off-axis holograms
  - Only exists in software; need C/VHDL version of 2-D FFT
  - Permits study and validation of complete model
22. Validation Measurements
- Possible Sources of Errors
- Parameter measurement errors due to probe effects
- Model assumptions or methods
- Representation of total work
- Representation of load balance
- Need more data
- Un-modeled effects
- Caching
- Packet size optimization
- Other API optimization techniques
- 1st pass model validation using Wildforce Board
on microsys8 during minimal background load
conditions
23. Model Applications
- Performance characterization of RC systems
- Performance evaluation for HPRC platform
- Tool for constructing and optimizing cost functions (e.g., power, size, cost)
- Building block for other CAD tools (e.g., task scheduling, load balancing, CHAMPION, NetSolve)
- SoC design performance analysis
24. Status and Remaining Work
- Three papers published; others submitted and planned
- Phase 1
  - Initial communication measurements completed
  - Other ongoing work should provide more input
- Phase 2
  - Sample application used to gather characterization data
  - CHAMPION demos used for validation
  - Need port to Firebird and Pilchard for more measurement results
- Phase 3
  - 1st pass of HPRC model mathematically formulated
  - Need access to hardware to begin parameter measurements, demo development, and model validation measurements
- Demo Development (assistance from other students)
  - CHAMPION: need port to Firebird/Pilchard and parallel versions
  - K-means: need implementation on our HW and parallel version
  - Holography: need C/VHDL for 2-D FFT, HW implementation, parallel version
25. Model Status
26. Remaining Work
- Develop prediction model for load balance factors, building on previous work by Peterson
- Plans to generalize model for heterogeneous processing sets
- Review and revise as necessary the speedup and efficiency definitions for RC and HPRC systems
- Complete validation with demos
27. Beyond Scope
- FPGA reconfiguration latency studies
- RC dedicated configurable network
- Automated task scheduling and load balancing
- Integration with CHAMPION and NetSolve
28. Extra Slides
29. Development Environment
- AFRL: heterogeneous HPC cluster of four Pentium nodes populated with Firebird boards
- UT platform: cluster of eight Pentium nodes populated with Pilchard boards
- Currently no specific CAD tools available for HPRC
30. RC Plots
- Fixed work; vary number of tasks and load balance factor
- Fixed work; vary configuration time and load balance factor
31. HPRC Plots
- 1 RC unit/node; 6, 11, 16 nodes
- 2 RC units/node; 6, 11, 16 nodes
32. Conferences and Journals
- Published
  - Smith, M. C., Drager, S. L., Pochet, Lt. L., and Peterson, G. D., "High Performance Reconfigurable Computing Systems," Proceedings of 2001 IEEE Midwest Symposium on Circuits and Systems, 2001.
  - Smith, M. C. and Peterson, G. D., "Programming High Performance Reconfigurable Computers (HPRC)," SPIE International Symposium ITCom 2001, 8-19-2001, Denver, CO.
  - Peterson, G. D. and Smith, M. C., "Programming High Performance Reconfigurable Computers," SSGRR 2001, Rome, Italy.
- Submitted
  - Smith, M. C. and Peterson, G. D., "Analytical Modeling for High Performance Reconfigurable Computers," The 2002 International Symposium on Performance Evaluation of Computer and Telecommunication Systems, SPECTS 2002, July 14-19, 2002, San Diego, CA.
- Planned
  - Journal on Parallel and Distributed Computing Practices: Algorithms, Systems and Tools for High Performance Computing on Heterogeneous Networks, Submission Due April 30, 2002
  - IEEE Design and Test Special Issue: Platform-Based Design of System-on-Chip, Submission Due May 1, 2002 for November-December 2002 Publication.
33. High Pass Demo
- 3x3 high pass filter
- Output pixel value depends on the input pixel and its 8 neighbor pixels
- For hardware implementation, a mask of -1/8 is used
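A 3x3 high pass filter of this kind can be sketched in software as shown below. The slide gives only the -1/8 mask value, so the full mask used here (weight 1 on the center pixel, -1/8 on each of the 8 neighbors) and the border handling are assumptions for illustration. The function name is hypothetical, and this is not the CHAMPION hardware implementation.

```python
def high_pass_3x3(image):
    """Apply a 3x3 high pass filter to a 2-D list of pixel values.

    Assumed mask for this sketch: weight 1 on the center pixel and -1/8 on
    each of its 8 neighbors. Border pixels, which lack a full neighborhood,
    are set to 0.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            neighbor_sum = sum(
                image[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            out[r][c] = image[r][c] - neighbor_sum / 8.0
    return out
```

Under this assumed mask a uniform region maps to 0, while a pixel that differs from its neighbors produces a nonzero response, which is the behavior expected of a high pass filter.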
34. START ATR Demo (1)
35. START ATR Demo (2)
36. START ATR Demo (3)
37. K-means Demo
- Iteration over fixed number of clusters
- Maintain class center at mean position of member samples
- Computations
  - Distance between samples and center
  - Recalculation of mean
- Simplifications needed for hardware implementation
- Initialize
- Loop until termination condition is met
  - For each sample, assign that sample to a class such that the distance from the sample to the center of that class is minimized
  - For each class, recalculate the means of the class based on the samples in that class
- End loop
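The loop above can be written out as a short software sketch. The Manhattan distance follows the hardware implementation mentioned earlier in the talk. The initialization (first k samples as centers), the termination condition (no assignment changes, or an iteration cap), and all function names are assumptions made for illustration.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_means(samples, k, max_iterations=100):
    """Cluster `samples` (a list of equal-length tuples) into k classes.

    Returns (centers, assignments). Assumptions made for this sketch: the
    first k samples seed the centers, and the loop terminates when no
    assignment changes or after max_iterations passes.
    """
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    centers = [tuple(float(v) for v in s) for s in samples[:k]]  # Initialize
    assignments = None
    for _ in range(max_iterations):
        # Assign each sample to the class with the nearest center.
        new_assignments = [
            min(range(k), key=lambda j: manhattan(s, centers[j]))
            for s in samples
        ]
        if new_assignments == assignments:
            break  # termination condition: assignments are stable
        assignments = new_assignments
        # Recalculate each class center as the mean of its member samples.
        for j in range(k):
            members = [s for s, a in zip(samples, assignments) if a == j]
            if members:  # an empty class keeps its previous center
                centers[j] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centers, assignments
```

For example, `k_means([(0, 0), (10, 10), (0, 1), (10, 11)], 2)` groups the first and third samples into one class and the second and fourth into the other.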
38. K-means Demo Hardware
- Hardware implementation by Lesser et al.
- Hardware
  - Assign pixels to clusters
  - Accumulation step of cluster center calculation
- Software
  - Compute cluster center
39. Holography Demo (1)
- Holography image reconstruction signal processing block diagram
- Holography image can be represented by
40. Holography Demo (2)
- FFT result: autocorrelation and sidebands
- Isolate the information in the sideband by centering one of the sidebands and filtering
41. Gantt Chart