Title: Some Challenges in Parallel and Distributed Hardware Design and Programming. Ganesh Gopalakrishnan, School of Computing, University of Utah, Salt Lake City, UT
1 Some Challenges in Parallel and Distributed Hardware Design and Programming
Ganesh Gopalakrishnan, School of Computing, University of Utah, Salt Lake City, UT
Past work supported in part by SRC Contract 1031.001, NSF Award 0219805, and an equipment grant from Intel Corporation
2 Background: Shared Memory and Distributed Processors
(Photo courtesy LLNL / IBM)
Released in 2000:
-- Peak performance: 12.3 teraflops
-- Processors used: IBM RS/6000 SP Power3s at 375 MHz
-- There are 8,192 of these processors
-- The total amount of RAM is 6 TB
-- Two hundred cabinets, covering the area of two basketball courts
http://www.theinquirer.net/?article=12145
By Nebojsa Novakovic, Thursday 16 October 2003, 06:49: "NOVA HAS been to the Microprocessor Forum and captured this picture of POWER5 chief scientist Balaram Sinharoy holding this eight-way POWER5 MCM with a staggering 144MB of cache. Sheesh Kebab!"
8 x 2 CPUs x 2-way SMT = 32 shared-memory CPUs on the palm
3 Background: Motivation for (Weak) Shared Memory Consistency Models
A hardware perspective:
- Cannot afford to do eager updates across large SMP systems
- Delayed updates allow considerable latitude in memory consistency protocol design
  - fewer bugs in protocols
  - more complex shared memory consistency models
(Diagram: chip-level protocols within a node, intra-cluster protocols across directories (dir), and inter-cluster protocols across memories (mem))
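The latitude that delayed updates buy can be seen on the classic store-buffering litmus test. The sketch below is a toy simulation we add for illustration (not anything from the talk): it enumerates the sequentially consistent interleavings, then contrasts them with an execution where both writes are still sitting in per-CPU store buffers.

```python
from itertools import permutations

# Classic "store buffering" litmus test.
#   T0:  x = 1;  r0 = y        T1:  y = 1;  r1 = x
# Under sequential consistency (eager updates) the result (0, 0) is
# impossible; a store buffer (delayed updates) makes it observable,
# which is the latitude -- and the added model complexity -- at issue.

def sc_outcomes():
    """All (r0, r1) results over sequentially consistent interleavings."""
    outcomes = set()
    # An interleaving is an ordering of thread ids that preserves each
    # thread's program order (write first, then read).
    for order in set(permutations([0, 0, 1, 1])):
        mem = {"x": 0, "y": 0}
        pc = [0, 0]
        regs = {}
        for t in order:
            step, pc[t] = pc[t], pc[t] + 1
            if t == 0:
                if step == 0:
                    mem["x"] = 1           # T0's write
                else:
                    regs["r0"] = mem["y"]  # T0's read
            else:
                if step == 0:
                    mem["y"] = 1           # T1's write
                else:
                    regs["r1"] = mem["x"]  # T1's read
        outcomes.add((regs["r0"], regs["r1"]))
    return outcomes

def store_buffer_outcome():
    """Both writes still sit in per-cpu store buffers when the reads run."""
    mem = {"x": 0, "y": 0}
    buf0, buf1 = {"x": 1}, {"y": 1}   # buffered, not yet drained to memory
    r0 = buf0.get("y", mem["y"])      # read forwards from own buffer, else memory
    r1 = buf1.get("x", mem["x"])
    return (r0, r1)

print(sorted(sc_outcomes()))      # [(0, 1), (1, 0), (1, 1)] -- no (0, 0)
print(store_buffer_outcome())     # (0, 0)
```

The verifier's burden grows accordingly: the property "(0, 0) never occurs" holds of the SC model but not of the weaker one.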
4 Background: Programming Models for Supercomputers
(Diagram courtesy LLNL / IBM)
A likely programming model for ASCI White is four MPI tasks per node, with four threads per MPI task. This model exploits both the number of CPUs per node and each node's switch adapter bandwidth. Job limits are 4,096 MPI tasks for the US (high speed) protocol and 8,192 MPI tasks for IP (lower speed).
5 Some Challenges in Shared Memory Processor Design and SMP / Distributed Programming
- Model checking cache coherency / shared memory consistency protocols -- ongoing work in our group
- Model checking distributed memory programs used for scientific simulations (MPI programs) -- incipient in our group
- Runtime checking under limited observability -- spent some time on it during a sabbatical
6 Solved Problems in FV for Shared Memory Consistency
- Modeling and verification of directory protocols for cache coherency, for small configurations
Unsolved:
- Scaling industrial coherence protocol verification beyond 4 nodes
  - State explosion
- Parameterized verification with reasonable automation
  - Invariant discovery
- Many decidability results are unknown
  - Inadequate general interest in the community
- Small-configuration verification of shared memory consistency, even for midscale benchmarks
  - Added complexity of the property being verified
See the tutorial on Shared Memory Consistency Models and Protocols by Chou, German, and Gopalakrishnan, available from http://www.cs.utah.edu/~ganesh/presentations/fmcad04_tutorial2
7 Challenges in Producing Dependable and Fast MPI / Threads Programs
- Threads style
  - Deal with locks, condition variables, re-entrancy, thread cancellation, ...
- MPI
  - Deal with the complexity of Single-Program Multiple-Data (SPMD) programming
  - Performance optimizations to reduce communication costs
  - Deal with the complexity of MPI (MPI-1 has 130 calls; MPI-2 has 180; various flavors of sends / receives)
- Threads and MPI are often used together
- MPI libraries are threaded
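One concrete reason the many send flavors matter: two ranks that each send to the other before receiving can deadlock under synchronous delivery (MPI_Send is allowed to block until the receiver is ready) but complete under buffered delivery (MPI_Bsend). The toy scheduler below is our own illustration of that semantic gap, not real MPI; names like `run` and the bounded loop are assumptions.

```python
from collections import deque

# Toy model of a head-to-head exchange:
#   rank 0:  send(1); recv()      rank 1:  send(0); recv()
# With buffered sends both ranks finish; with synchronous delivery each
# rank's send waits for the peer to reach its recv, and neither ever does.

def run(buffered):
    """Each rank: send(peer); recv(). Returns True iff both ranks finish."""
    inbox = [deque(), deque()]
    # program counters: 0 = about to send, 1 = about to recv, 2 = done
    pc = [0, 0]
    for _ in range(10):                 # bounded round-robin scheduler
        progressed = False
        for r in (0, 1):
            peer = 1 - r
            if pc[r] == 0:
                if buffered:
                    # buffered send completes immediately
                    inbox[peer].append(("msg", r)); pc[r] = 1; progressed = True
                elif pc[peer] == 1:
                    # synchronous send completes only when peer is at recv
                    inbox[peer].append(("msg", r)); pc[r] = 1; progressed = True
            elif pc[r] == 1 and inbox[r]:
                inbox[r].popleft(); pc[r] = 2; progressed = True
        if not progressed:
            break                       # no rank can move: deadlock (or done)
    return pc == [2, 2]

print(run(buffered=True))    # True  -- completes
print(run(buffered=False))   # False -- both ranks stuck in their sends
```

Reasoning about which flavor the runtime may legally pick, across 130+ calls, is exactly the complexity the bullet points name.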
8 Solved and Unsolved Problems in MPI / Thread Programming
- Solved problems (Avrunin and Siegel (MPI), as well as our group)
  - Modeling the MPI library in Promela
  - Model-checking simple MPI programs
- Unsolved problems: a rather long list, with some being
  - Model extraction
  - Handling mixed-paradigm programs
  - Formal methods to find / justify optimizations
  - Verifying reactive aspects / computational aspects
9 Needs of an HPC Programmer (learned by working with a domain expert, Prof. Kirby)
- A typical HPC program development cycle consists of:
  - Understand what is being simulated (the physics, biology, etc.)
  - Develop a mathematical model of the relevant "features" of interest
  - Generate a numerical model
  - Solve the numerical model
    - Usually begins as serial code
    - Later, the numerical model (not the serial code) is parallelized
    - Often best to develop a numerical model that's amenable to parallelization
  - At every step, check consistency (e.g., conservation of energy)
  - Tune for load balancing; make the code adaptive
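The "check consistency" step above can be made concrete with a toy numerical model. The sketch below is ours (a unit harmonic oscillator, not Prof. Kirby's code): it checks conservation of energy under two integrators, catching a discretization that silently injects energy.

```python
# Consistency check for a numerical model: the continuous system (a unit
# harmonic oscillator, x'' = -x) conserves E = v^2/2 + x^2/2 exactly, so
# the discretization should keep E (nearly) constant too.

def energy(x, v):
    return 0.5 * v * v + 0.5 * x * x

def integrate(symplectic, steps=10000, dt=0.01):
    x, v = 1.0, 0.0
    for _ in range(steps):
        if symplectic:          # semi-implicit Euler: update v, then x with new v
            v -= x * dt
            x += v * dt
        else:                   # explicit Euler: energy drifts upward
            x_old = x
            x += v * dt
            v -= x_old * dt
    return energy(x, v)

e0 = energy(1.0, 0.0)                   # initial energy: 0.5
print(abs(integrate(True) - e0))        # small: invariant nearly conserved
print(abs(integrate(False) - e0))       # large: numerical model inconsistent
```

A check like this, run at every step of the cycle, flags the bad discretization long before the (much harder) parallel debugging begins.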
10 Another Domain Expert (Berzins): Adaptive Mesh-Refinement Code is Hard!
(Photo courtesy NHTSA)
11 Under Construction at Utah (students Palmer, Yang, Barrus)
MPI library model (Promela):
proctype MPI_Send(chan out; int c) { out!c }
proctype MPI_Bsend(chan out; int c) { out!c }
proctype MPI_Isend(chan out; int c) { out!c }
typedef MPI_Status { int MPI_SOURCE; int MPI_TAG; int MPI_ERROR }
Program model (Promela):
int y;
active proctype T1() {
  int x; x = 1;
  if :: x == 0 -> x = 2 :: else fi;
  y = x
}
active proctype T2() {
  int x; x = 2;
  if :: y == x -> y = 0 :: else fi;
  assert(y == 0)
}
(Architecture diagram: CIL / MPICC feeds a Model Extractor, producing a program model and an environment model; with abstraction refinement, the models go to the Zing model checker, organized as an MC Server with a Result Analyzer farming work out to many MC Clients; outputs are error visualization / simulation, or OK)
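At the heart of the Zing / MC Client pipeline sits an explicit-state search. A minimal sketch of that core loop follows; the deliberately buggy two-process "mutex" fed to it is our toy example, not the extracted MPI model.

```python
# Explicit-state safety checking in miniature: depth-first search over a
# transition system with a visited set, reporting any reachable state
# that violates the property.  (Real checkers like Zing or SPIN add
# partial-order reduction, state compression, and distributed workers.)

def reachable_violation(init, transitions, safe):
    """DFS over states; return a violating state, or None if safe."""
    stack, seen = [init], {init}
    while stack:
        s = stack.pop()
        if not safe(s):
            return s                 # counterexample state found
        for s2 in transitions(s):
            if s2 not in seen:
                seen.add(s2)
                stack.append(s2)
    return None

# Toy input: two processes cycling idle -> trying -> critical -> idle,
# with no lock at all -- so mutual exclusion is violated.
def transitions(s):
    step = {"idle": "trying", "trying": "critical", "critical": "idle"}
    succs = []
    for i in (0, 1):
        pc = list(s)
        pc[i] = step[s[i]]
        succs.append(tuple(pc))
    return succs

mutex = lambda s: s != ("critical", "critical")
print(reachable_violation(("idle", "idle"), transitions, mutex))
# -> ('critical', 'critical')
```

The MC Server / MC Client split in the diagram distributes exactly this frontier exploration across machines to fight state explosion.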
12 Where Post-Si Verification Fits in the Hardware Verification Flow
- Specification validation
- Design verification
- Testing for fabrication faults
- Post-silicon verification
(Flow diagram: spec to product; specification validation and design verification are pre-manufacture, testing for fabrication faults and post-silicon verification are post-manufacture)
Does functionality match designed behavior?
13 Post-Si Verification for Cache Protocol Execution
- Future:
  - CANNOT assume there is a front-side bus
  - CANNOT record all link traffic
  - CAN ONLY generate sets of possible cache states
  - HOW BEST can one match against designed behavior?
(Diagram: four CPUs exchanging invisible and visible miss traffic)
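"Generate sets of possible cache states" can be sketched as follows: with some miss traffic invisible, a checker tracks a set of candidate global states, closes it under unobserved actions, and prunes it against each visible event using the designed protocol. The tiny MSI protocol, the event encoding, and the single silent action below are our simplifying assumptions for illustration.

```python
# One cache line, two cpus, MSI states.  Silent evictions (S -> I) are
# invisible; read/write misses are visible.  The checker keeps the set of
# global states consistent with everything seen so far; an empty set
# means the observed trace does not match designed behavior.

def invisible_closure(states):
    """All states reachable via unobserved actions (here: silent S -> I)."""
    out, frontier = set(states), list(states)
    while frontier:
        s = frontier.pop()
        for i in range(len(s)):
            if s[i] == "S":
                s2 = s[:i] + ("I",) + s[i + 1:]
                if s2 not in out:
                    out.add(s2); frontier.append(s2)
    return out

def step_visible(states, event):
    """Keep states where `event` is enabled; apply its designed effect."""
    who, kind = event                       # e.g. (0, "read_miss")
    nxt = set()
    for s in invisible_closure(states):
        if kind == "read_miss" and s[who] == "I":
            others = tuple("S" if x == "M" else x for x in s)  # M downgrades
            nxt.add(others[:who] + ("S",) + others[who + 1:])
        elif kind == "write_miss" and s[who] != "M":
            inv = tuple("I" for _ in s)                        # invalidate all
            nxt.add(inv[:who] + ("M",) + inv[who + 1:])
    return nxt

states = {("I", "I")}
for ev in [(0, "read_miss"), (1, "write_miss"), (0, "read_miss")]:
    states = step_visible(states, ev)
    if not states:
        print("trace inconsistent with designed behavior")
        break
print(states)   # {('S', 'S')}
```

Matching against designed behavior then reduces to asking whether the candidate set ever becomes empty, which is where the constraint-solving challenge of the next slides comes in.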
14 Back to Our Specific Problem Domain...
- Verify the operation of systems at runtime, when we can't see all transactions
- Could also be offline analysis of a partial log of activities
(Diagram: a partial event log a, x, c, d, y, b)
15 Required Constraint-Solving Approaches
- Constraint solving in the context of coupled reactive processes
(Diagram: an observed event within a web of events a through e, and the candidate "likely cause" orderings consistent with it)
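One simple constraint of this kind: given the design's causal dependency graph, an observed event forces all of its causal ancestors to have occurred, even if they were never logged. The graph and event names below are illustrative placeholders, echoing the a-through-e events of the diagram rather than any real protocol.

```python
# Partial-log reasoning: from a dependency graph (event -> events that
# must precede it) and an observed event, compute every unlogged event
# that is nonetheless forced to have happened -- its "likely causes".

deps = {
    "e": {"a", "d"},
    "d": {"b", "c"},
    "b": {"a"},
    "c": set(),
    "a": set(),
}

def forced_causes(observed):
    """Transitive closure of predecessors of the observed events."""
    forced, frontier = set(), list(observed)
    while frontier:
        ev = frontier.pop()
        for cause in deps.get(ev, set()):
            if cause not in forced:
                forced.add(cause)
                frontier.append(cause)
    return forced

print(sorted(forced_causes({"e"})))   # ['a', 'b', 'c', 'd']
```

Real runs add the harder part: multiple candidate graphs (one per protocol behavior) must be solved jointly with the visible-event constraints, which is why general constraint solving over coupled reactive processes is needed.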
16 Contributions That We Can Make
- Create benchmark problems
- Define tangible measures of success in each domain
- Work with industry
- Contribute tools and work with other expert groups
17 Formal Methods at Utah
- Principal faculty:
  - Konrad Slind (deductive verification)
  - Ganesh Gopalakrishnan (algorithmic verification)