Title: Some Challenges in Parallel and Distributed Hardware Design and Programming. Ganesh Gopalakrishnan, School of Computing, University of Utah, Salt Lake City, UT
1 Some Challenges in Parallel and Distributed Hardware Design and Programming
Ganesh Gopalakrishnan, School of Computing, University of Utah, Salt Lake City, UT
Past work supported in part by SRC Contract 1031.001, NSF Award 0219805, and an equipment grant from Intel Corporation
2 Background: Shared Memory and Distributed Processors
(Photo courtesy LLNL / IBM)
Released in 2000:
-- Peak performance: 12.3 teraflops
-- Processors used: IBM RS/6000 SP Power3s at 375 MHz
-- There are 8,192 of these processors
-- The total amount of RAM is 6 TB
-- Two hundred cabinets, covering the area of two basketball courts
http://www.theinquirer.net/?article=12145
By Nebojsa Novakovic, Thursday 16 October 2003, 06:49: "NOVA HAS been to the Microprocessor Forum and captured this picture of POWER5 chief scientist Balaram Sinharoy holding this eight-way POWER5 MCM with a staggering 144MB of cache. Sheesh Kebab!"
8 x 2 CPUs x 2-way SMT = 32 shared-memory CPUs on the palm
3 Background: Motivation for (Weak) Shared Memory Consistency Models
A hardware perspective:
- Cannot afford to do eager updates across large SMP systems
- Delayed updates allow considerable latitude in memory consistency protocol design
  - fewer bugs in protocols
  - more complex shared memory consistency models
(Diagram: chip-level protocols within a node, intra-cluster protocols across directories (dir), and inter-cluster protocols across memories (mem))
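The latitude that delayed updates buy can be seen on the classic store-buffering litmus test. The sketch below is a toy simulation we add for illustration (not anything from the talk): it enumerates the sequentially consistent interleavings, then contrasts them with an execution where both writes are still sitting in per-CPU store buffers.

```python
from itertools import permutations

# Classic "store buffering" litmus test.
#   T0:  x = 1;  r0 = y        T1:  y = 1;  r1 = x
# Under sequential consistency (eager updates) the result (0, 0) is
# impossible; a store buffer (delayed updates) makes it observable,
# which is the latitude -- and the added model complexity -- at issue.

def sc_outcomes():
    """All (r0, r1) results over sequentially consistent interleavings."""
    outcomes = set()
    # An interleaving is an ordering of thread ids that preserves each
    # thread's program order (write first, then read).
    for order in set(permutations([0, 0, 1, 1])):
        mem = {"x": 0, "y": 0}
        pc = [0, 0]
        regs = {}
        for t in order:
            step, pc[t] = pc[t], pc[t] + 1
            if t == 0:
                if step == 0:
                    mem["x"] = 1           # T0's write
                else:
                    regs["r0"] = mem["y"]  # T0's read
            else:
                if step == 0:
                    mem["y"] = 1           # T1's write
                else:
                    regs["r1"] = mem["x"]  # T1's read
        outcomes.add((regs["r0"], regs["r1"]))
    return outcomes

def store_buffer_outcome():
    """Both writes still sit in per-cpu store buffers when the reads run."""
    mem = {"x": 0, "y": 0}
    buf0, buf1 = {"x": 1}, {"y": 1}   # buffered, not yet drained to memory
    r0 = buf0.get("y", mem["y"])      # read forwards from own buffer, else memory
    r1 = buf1.get("x", mem["x"])
    return (r0, r1)

print(sorted(sc_outcomes()))      # [(0, 1), (1, 0), (1, 1)] -- no (0, 0)
print(store_buffer_outcome())     # (0, 0)
```

The verifier's burden grows accordingly: the property "(0, 0) never occurs" holds of the SC model but not of the weaker one.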
4 Background: Programming Models for Supercomputers
(Diagram courtesy LLNL / IBM)
A likely programming model for ASCI White is four MPI tasks per node, with four threads per MPI task. This model exploits both the number of CPUs per node and each node's switch adapter bandwidth. Job limits are 4,096 MPI tasks for the US (high speed) protocol and 8,192 MPI tasks for IP (lower speed).
5 Some Challenges in Shared Memory Processor Design and SMP / Distributed Programming
- Model checking cache coherency / shared memory consistency protocols -- ongoing work in our group
- Model checking distributed memory programs used for scientific simulations (MPI programs) -- incipient in our group
- Runtime checking under limited observability -- spent some time on it during a sabbatical
6 Solved Problems in FV for Shared Memory Consistency
- Modeling and verification of directory protocols for cache coherency, for small configurations
Unsolved:
- Scaling industrial coherence protocol verification beyond 4 nodes
  - State explosion
- Parameterized verification with reasonable automation
  - Invariant discovery
- Many decidability results are unknown
  - Inadequate general interest in the community
- Small-configuration verification of shared memory consistency, even for midscale benchmarks
  - Added complexity of the property being verified
See the tutorial on Shared Memory Consistency Models and Protocols by Chou, German, and Gopalakrishnan, available from http://www.cs.utah.edu/~ganesh/presentations/fmcad04_tutorial2
7 Challenges in Producing Dependable and Fast MPI / Threads Programs
- Threads style
  - Deal with locks, condition variables, re-entrancy, thread cancellation, ...
- MPI
  - Deal with the complexity of Single-Program Multiple-Data (SPMD) programming
  - Performance optimizations to reduce communication costs
  - Deal with the complexity of MPI (MPI-1 has 130 calls; MPI-2 has 180; various flavors of sends / receives)
- Threads and MPI are often used together
- MPI libraries are threaded
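One concrete reason the many send flavors matter: two ranks that each send to the other before receiving can deadlock under synchronous delivery (MPI_Send is allowed to block until the receiver is ready) but complete under buffered delivery (MPI_Bsend). The toy scheduler below is our own illustration of that semantic gap, not real MPI; names like `run` and the bounded loop are assumptions.

```python
from collections import deque

# Toy model of a head-to-head exchange:
#   rank 0:  send(1); recv()      rank 1:  send(0); recv()
# With buffered sends both ranks finish; with synchronous delivery each
# rank's send waits for the peer to reach its recv, and neither ever does.

def run(buffered):
    """Each rank: send(peer); recv(). Returns True iff both ranks finish."""
    inbox = [deque(), deque()]
    # program counters: 0 = about to send, 1 = about to recv, 2 = done
    pc = [0, 0]
    for _ in range(10):                 # bounded round-robin scheduler
        progressed = False
        for r in (0, 1):
            peer = 1 - r
            if pc[r] == 0:
                if buffered:
                    # buffered send completes immediately
                    inbox[peer].append(("msg", r)); pc[r] = 1; progressed = True
                elif pc[peer] == 1:
                    # synchronous send completes only when peer is at recv
                    inbox[peer].append(("msg", r)); pc[r] = 1; progressed = True
            elif pc[r] == 1 and inbox[r]:
                inbox[r].popleft(); pc[r] = 2; progressed = True
        if not progressed:
            break                       # no rank can move: deadlock (or done)
    return pc == [2, 2]

print(run(buffered=True))    # True  -- completes
print(run(buffered=False))   # False -- both ranks stuck in their sends
```

Reasoning about which flavor the runtime may legally pick, across 130+ calls, is exactly the complexity the bullet points name.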
8 Solved and Unsolved Problems in MPI / Thread Programming
- Solved problems (Avrunin and Siegel (MPI), as well as our group)
  - Modeling the MPI library in Promela
  - Model-checking simple MPI programs
- Unsolved problems: a rather long list, with some being
  - Model extraction
  - Handling mixed-paradigm programs
  - Formal methods to find / justify optimizations
  - Verifying reactive aspects / computational aspects
9 Needs of an HPC Programmer (learned by working with a domain expert, Prof. Kirby)
- A typical HPC program development cycle consists of:
  - Understand what is being simulated (the physics, biology, etc.)
  - Develop a mathematical model of the relevant "features" of interest
  - Generate a numerical model
  - Solve the numerical model
    - Usually begins as serial code
    - Later, the numerical model (not the serial code) is parallelized
    - Often best to develop a numerical model that's amenable to parallelization
  - At every step, check consistency (e.g., conservation of energy)
  - Tune for load balancing; make the code adaptive
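The "check consistency" step above can be made concrete with a toy numerical model. The sketch below is ours (a unit harmonic oscillator, not Prof. Kirby's code): it checks conservation of energy under two integrators, catching a discretization that silently injects energy.

```python
# Consistency check for a numerical model: the continuous system (a unit
# harmonic oscillator, x'' = -x) conserves E = v^2/2 + x^2/2 exactly, so
# the discretization should keep E (nearly) constant too.

def energy(x, v):
    return 0.5 * v * v + 0.5 * x * x

def integrate(symplectic, steps=10000, dt=0.01):
    x, v = 1.0, 0.0
    for _ in range(steps):
        if symplectic:          # semi-implicit Euler: update v, then x with new v
            v -= x * dt
            x += v * dt
        else:                   # explicit Euler: energy drifts upward
            x_old = x
            x += v * dt
            v -= x_old * dt
    return energy(x, v)

e0 = energy(1.0, 0.0)                   # initial energy: 0.5
print(abs(integrate(True) - e0))        # small: invariant nearly conserved
print(abs(integrate(False) - e0))       # large: numerical model inconsistent
```

A check like this, run at every step of the cycle, flags the bad discretization long before the (much harder) parallel debugging begins.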
10 Another Domain Expert (Berzins): Adaptive Mesh-Refinement Code is Hard!
(Photo courtesy NHTSA)
11 Under Construction at Utah (students Palmer, Yang, Barrus)
MPI library model (Promela):
proctype MPI_Send(chan out; int c) { out!c }
proctype MPI_Bsend(chan out; int c) { out!c }
proctype MPI_Isend(chan out; int c) { out!c }
typedef MPI_Status { int MPI_SOURCE; int MPI_TAG; int MPI_ERROR }
Program model (Promela):
int y;
active proctype T1() {
  int x; x = 1;
  if :: x == 0 -> x = 2 :: else fi;
  y = x
}
active proctype T2() {
  int x; x = 2;
  if :: y == x -> y = 0 :: else fi;
  assert(y == 0)
}
(Architecture diagram: CIL / MPICC feeds a Model Extractor, producing a program model and an environment model; with abstraction refinement, the models go to the Zing model checker, organized as an MC Server with a Result Analyzer farming work out to many MC Clients; outputs are error visualization / simulation, or OK)
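At the heart of the Zing / MC Client pipeline sits an explicit-state search. A minimal sketch of that core loop follows; the deliberately buggy two-process "mutex" fed to it is our toy example, not the extracted MPI model.

```python
# Explicit-state safety checking in miniature: depth-first search over a
# transition system with a visited set, reporting any reachable state
# that violates the property.  (Real checkers like Zing or SPIN add
# partial-order reduction, state compression, and distributed workers.)

def reachable_violation(init, transitions, safe):
    """DFS over states; return a violating state, or None if safe."""
    stack, seen = [init], {init}
    while stack:
        s = stack.pop()
        if not safe(s):
            return s                 # counterexample state found
        for s2 in transitions(s):
            if s2 not in seen:
                seen.add(s2)
                stack.append(s2)
    return None

# Toy input: two processes cycling idle -> trying -> critical -> idle,
# with no lock at all -- so mutual exclusion is violated.
def transitions(s):
    step = {"idle": "trying", "trying": "critical", "critical": "idle"}
    succs = []
    for i in (0, 1):
        pc = list(s)
        pc[i] = step[s[i]]
        succs.append(tuple(pc))
    return succs

mutex = lambda s: s != ("critical", "critical")
print(reachable_violation(("idle", "idle"), transitions, mutex))
# -> ('critical', 'critical')
```

The MC Server / MC Client split in the diagram distributes exactly this frontier exploration across machines to fight state explosion.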
12 Where Post-Si Verification Fits in the Hardware Verification Flow
- Specification validation
- Design verification
- Testing for fabrication faults
- Post-silicon verification
(Flow diagram: spec to product; specification validation and design verification are pre-manufacture, testing for fabrication faults and post-silicon verification are post-manufacture)
Does functionality match designed behavior?
13 Post-Si Verification for Cache Protocol Execution
- Future:
  - CANNOT assume there is a front-side bus
  - CANNOT record all link traffic
  - CAN ONLY generate sets of possible cache states
  - HOW BEST can one match against designed behavior?
(Diagram: four CPUs exchanging invisible and visible miss traffic)
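"Generate sets of possible cache states" can be sketched as follows: with some miss traffic invisible, a checker tracks a set of candidate global states, closes it under unobserved actions, and prunes it against each visible event using the designed protocol. The tiny MSI protocol, the event encoding, and the single silent action below are our simplifying assumptions for illustration.

```python
# One cache line, two cpus, MSI states.  Silent evictions (S -> I) are
# invisible; read/write misses are visible.  The checker keeps the set of
# global states consistent with everything seen so far; an empty set
# means the observed trace does not match designed behavior.

def invisible_closure(states):
    """All states reachable via unobserved actions (here: silent S -> I)."""
    out, frontier = set(states), list(states)
    while frontier:
        s = frontier.pop()
        for i in range(len(s)):
            if s[i] == "S":
                s2 = s[:i] + ("I",) + s[i + 1:]
                if s2 not in out:
                    out.add(s2); frontier.append(s2)
    return out

def step_visible(states, event):
    """Keep states where `event` is enabled; apply its designed effect."""
    who, kind = event                       # e.g. (0, "read_miss")
    nxt = set()
    for s in invisible_closure(states):
        if kind == "read_miss" and s[who] == "I":
            others = tuple("S" if x == "M" else x for x in s)  # M downgrades
            nxt.add(others[:who] + ("S",) + others[who + 1:])
        elif kind == "write_miss" and s[who] != "M":
            inv = tuple("I" for _ in s)                        # invalidate all
            nxt.add(inv[:who] + ("M",) + inv[who + 1:])
    return nxt

states = {("I", "I")}
for ev in [(0, "read_miss"), (1, "write_miss"), (0, "read_miss")]:
    states = step_visible(states, ev)
    if not states:
        print("trace inconsistent with designed behavior")
        break
print(states)   # {('S', 'S')}
```

Matching against designed behavior then reduces to asking whether the candidate set ever becomes empty, which is where the constraint-solving challenge of the next slides comes in.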
14 Back to Our Specific Problem Domain...
- Verify the operation of systems at runtime, when we can't see all transactions
- Could also be offline analysis of a partial log of activities
(Diagram: a partial event log a, x, c, d, y, b)
15 Required Constraint-Solving Approaches
- Constraint solving in the context of coupled reactive processes
(Diagram: an observed event within a web of events a through e, and the candidate "likely cause" orderings consistent with it)
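One simple constraint of this kind: given the design's causal dependency graph, an observed event forces all of its causal ancestors to have occurred, even if they were never logged. The graph and event names below are illustrative placeholders, echoing the a-through-e events of the diagram rather than any real protocol.

```python
# Partial-log reasoning: from a dependency graph (event -> events that
# must precede it) and an observed event, compute every unlogged event
# that is nonetheless forced to have happened -- its "likely causes".

deps = {
    "e": {"a", "d"},
    "d": {"b", "c"},
    "b": {"a"},
    "c": set(),
    "a": set(),
}

def forced_causes(observed):
    """Transitive closure of predecessors of the observed events."""
    forced, frontier = set(), list(observed)
    while frontier:
        ev = frontier.pop()
        for cause in deps.get(ev, set()):
            if cause not in forced:
                forced.add(cause)
                frontier.append(cause)
    return forced

print(sorted(forced_causes({"e"})))   # ['a', 'b', 'c', 'd']
```

Real runs add the harder part: multiple candidate graphs (one per protocol behavior) must be solved jointly with the visible-event constraints, which is why general constraint solving over coupled reactive processes is needed.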
16 Contributions That We Can Make
- Create benchmark problems
- Define tangible measures of success in each domain
- Work with industry
- Contribute tools and work with other expert groups
17 Formal Methods at Utah
- Principal faculty:
  - Konrad Slind (deductive verification)
  - Ganesh Gopalakrishnan (algorithmic verification)