Some Challenges in Parallel and Distributed Hardware Design and Programming
Ganesh Gopalakrishnan, School of Computing, University of Utah, Salt Lake City, UT

Transcript and Presenter's Notes

1
Some Challenges in Parallel and Distributed
Hardware Design and Programming
Ganesh Gopalakrishnan, School of Computing,
University of Utah, Salt Lake City, UT
Past work supported in part by SRC Contract
1031.001, NSF Award 0219805, and an equipment
grant from Intel Corporation
2
Background: Shared Memory and Distributed
Processors
(Photo courtesy LLNL / IBM)
Released in 2000. -- Peak performance 12.3
teraflops. -- Processors used: IBM RS6000 SP
Power3's, 375 MHz. -- There are 8,192 of these
processors. -- The total amount of RAM is 6 TB.
-- Two hundred cabinets, covering the area of two
basketball courts.
http://www.theinquirer.net/?article=12145
By Nebojsa Novakovic, Thursday 16 October 2003,
06:49: NOVA HAS been to the Microprocessor Forum
and captured this picture of POWER5 chief
scientist Balaram Sinharoy holding this eight-way
POWER5 MCM with a staggering 144 MB of cache.
Sheesh Kebab! 8 x 2 cpus x 2-way SMT = "32
shared memory cpus" on the palm.
3
Background: Motivation for (Weak) Shared Memory
Consistency Models
A Hardware Perspective
  • Cannot afford to do eager updates across large
    SMP systems
  • Delayed updates allow considerable latitude in
    memory consistency protocol design
    → fewer bugs in protocols
    → more complex shared memory consistency models
      (a litmus-test sketch follows the diagram)

(Diagram: a protocol hierarchy -- chip-level protocols,
intra-cluster protocols, and inter-cluster protocols --
connecting caches to directories (dir) and memories (mem).)
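To make the delayed-update point concrete, here is a minimal
store-buffering litmus test in C with pthreads (an illustration added
here, not taken from the slides): under sequential consistency at least
one thread must observe the other's write, but with delayed updates
(store buffers, as on x86-TSO and weaker models) both r1 and r2 can end
up 0.

    /* Store-buffering litmus test (illustrative sketch, not from the slides).
     * Sequentially consistent executions cannot end with r1 == 0 && r2 == 0;
     * hardware that delays updates via store buffers can. */
    #include <pthread.h>
    #include <stdio.h>

    volatile int x = 0, y = 0;
    volatile int r1, r2;

    void *t1(void *arg) { x = 1; r1 = y; return NULL; }
    void *t2(void *arg) { y = 1; r2 = x; return NULL; }

    int main(void) {
        for (int i = 0; i < 100000; i++) {
            pthread_t a, b;
            x = 0; y = 0;
            pthread_create(&a, NULL, t1, NULL);
            pthread_create(&b, NULL, t2, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            if (r1 == 0 && r2 == 0)
                printf("weak outcome (r1 == r2 == 0) at iteration %d\n", i);
        }
        return 0;
    }

Whether the weak outcome actually shows up depends on the machine and the
scheduler; a model checker that explores all interleavings and reorderings
would report it unconditionally.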
4
Background: Programming Models for
Supercomputers
(Diagram courtesy LLNL / IBM)
A likely programming model for ASCI White is four
MPI tasks per node, with four threads per MPI
task. This model exploits both the number of CPUs
per node and each node's switch adapter
bandwidth. Job limits are 4,096 MPI tasks for US
(high speed) protocol and 8,192 MPI tasks for IP
(lower speed).
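A minimal sketch of the hybrid model just described, in C with MPI plus
OpenMP threads (the four-tasks-by-four-threads shape is from the slide;
everything else, including the choice of OpenMP for the threads, is an
illustrative assumption):

    /* Hybrid sketch: several MPI tasks per node, four threads per task. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, nranks;
        /* Request thread support because each MPI task will run threads. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        #pragma omp parallel num_threads(4)   /* four threads per MPI task */
        printf("MPI task %d of %d, thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }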
5
Some Challenges in Shared Memory Processor
Design and SMP / Distributed Programming
  • Model Checking Cache Coherence / Shared Memory
    Consistency protocols -- ongoing work in our group
  • Model Checking Distributed Memory programs used
    for Scientific Simulations (MPI programs) --
    incipient in our group
  • Runtime Checking under Limited Observability --
    spent some time on it during a sabbatical

6
Solved Problems in FV for Shared Memory
Consistency
  • Modeling and Verification of Directory Protocols
    for Cache Coherence, for small configurations

Unsolved
  • Scaling industrial coherence protocol verification
    beyond 4 nodes
    - State explosion (a counting sketch follows the
      tutorial pointer below)
  • Parameterized verification with reasonable
    automation
    - Invariant discovery
  • Many decidability results are unknown
    - Inadequate general interest in the community
  • Small-configuration verification of Shared Memory
    Consistency, even for mid-scale benchmarks
    - Added complexity of the property being verified

See the tutorial on Shared Memory Consistency Models
and Protocols, by Chou, German, and Gopalakrishnan,
available from
http://www.cs.utah.edu/ganesh/presentations/fmcad04_tutorial2
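As a back-of-the-envelope illustration of the state-explosion bullet above
(a toy added here, not an industrial protocol): even if each of N caches
only tracks a line as Modified, Shared, or Invalid, the raw state space is
3^N, and the coherent subset still grows exponentially -- the reason
explicit-state verification stalls beyond a handful of nodes.

    /* Count the assignments of {M,S,I} to N caches that satisfy the
     * coherence invariant "at most one M, and M excludes S".
     * Deliberately naive: the point is the exponential growth. */
    #include <stdio.h>

    enum { I_, S_, M_ };

    static long count_coherent(int n) {
        long total = 1, coherent = 0;
        for (int i = 0; i < n; i++) total *= 3;     /* 3^n raw states */
        for (long code = 0; code < total; code++) {
            long c = code;
            int m = 0, s = 0;
            for (int i = 0; i < n; i++) {
                int st = c % 3; c /= 3;
                if (st == M_) m++;
                if (st == S_) s++;
            }
            if (m <= 1 && !(m == 1 && s > 0))       /* coherence invariant */
                coherent++;
        }
        return coherent;
    }

    int main(void) {
        for (int n = 1; n <= 12; n++)
            printf("%2d caches: %10ld coherent states\n", n, count_coherent(n));
        return 0;
    }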
7
Challenges in producing Dependable and Fast MPI
/ Threads programs
  • Threads style
    - Deal with locks, condition variables,
      re-entrancy, thread cancellation, ...
  • MPI
    - Deal with the complexity of Single-Program
      Multiple-Data (SPMD) programming
    - Performance optimizations to reduce
      communication costs
    - Deal with the complexity of MPI itself
      (MPI-1 has 130 calls, MPI-2 has 180;
      various flavors of sends / receives --
      see the sketch below)
  • Threads and MPI are often used together
    - MPI libraries are themselves threaded
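A hedged sketch of the "flavors of sends / receives" point (illustrative
C, not from the slides): the blocking head-to-head exchange shown in the
comments may or may not deadlock depending on whether the MPI
implementation buffers the message -- exactly the kind of schedule- and
platform-dependent behavior that makes these programs hard to get right.
The nonblocking variant below it is safe.

    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv) {
        int rank, peer;
        double sendbuf[N] = {0}, recvbuf[N];
        MPI_Request req;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;          /* assumes exactly two ranks */

        /* Unsafe pattern: both ranks block in MPI_Send before receiving.   */
        /* MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);        */
        /* MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &st);   */

        /* Safe pattern: nonblocking send overlapped with the receive. */
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
        MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &st);
        MPI_Wait(&req, &st);

        MPI_Finalize();
        return 0;
    }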

8
Solved and Unsolved Problems in MPI/Thread
programming
  • Solved Problems (Avrunin and Siegel (MPI), as
    well as our group)
    - Modeling the MPI library in Promela
    - Model-checking simple MPI programs
  • Unsolved Problems: a rather long list, with some
    being
    - Model extraction
    - Handling mixed-paradigm programs
    - Formal methods to find / justify optimizations
    - Verifying reactive aspects / computational
      aspects

9
Needs of an HPC programmer (learned by working
with a domain expert, Prof. Kirby)
  • A typical HPC program development cycle consists of:
    - Understand what is being simulated (the physics,
      biology, etc.)
    - Develop a mathematical model of the relevant
      "features" of interest
    - Generate a numerical model
    - Solve the numerical model
      - Usually begins as serial code
      - Later, the numerical model (not the serial code)
        is parallelized
      - Often best to develop a numerical model that's
        amenable to parallelization
    - At every step, check consistency (e.g.,
      conservation of energy) -- a toy check follows
      this list
    - Tune for load balancing; make the code adaptive
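The "check consistency at every step" item can be made concrete with a
toy example (added here; the harmonic oscillator, timestep, and tolerance
are illustrative assumptions, not Prof. Kirby's code): integrate a simple
model and verify after each timestep that total energy stays close to its
initial value, failing fast when the numerics drift.

    /* Toy consistency check: conserve E = (v*v + x*x)/2 for an oscillator. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 1.0, v = 0.0;           /* position, velocity             */
        double dt = 1e-3, tol = 2e-3;      /* timestep, allowed energy drift */
        double e0 = 0.5 * (v * v + x * x);

        for (int i = 0; i < 100000; i++) {
            v -= dt * x;                   /* symplectic Euler step          */
            x += dt * v;
            double e = 0.5 * (v * v + x * x);
            if (fabs(e - e0) / e0 > tol) {
                printf("energy drift at step %d: E=%.6f vs E0=%.6f\n", i, e, e0);
                return 1;                  /* the numerical model is wrong   */
            }
        }
        printf("energy conserved to within %.1e over all steps\n", tol);
        return 0;
    }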

10
Another Domain Expert (Berzins): Adaptive
Mesh-Refinement Code is Hard!
(Photo courtesy NHTSA)
11
Under Construction at Utah (students Palmer,
Yang, Barrus)
MPI Library Model (in Promela):

proctype MPI_Send(chan out; int c)  { out!c }
proctype MPI_Bsend(chan out; int c) { out!c }
proctype MPI_Isend(chan out; int c) { out!c }

typedef MPI_Status { int MPI_SOURCE; int MPI_TAG; int MPI_ERROR }
Program Model (extracted via CIL / MPICC):

int y;
active proctype T1() {
  int x = 1;
  if :: x == 0 -> x = 2 :: else -> skip fi;
  y = x
}
active proctype T2() {
  int x = 2;
  if :: y == x + 1 -> y = 0 :: else -> skip fi;
  assert(y == 0)
}
(Tool-flow diagram: a Model Extractor built on CIL / MPICC produces the
Program Model; together with the MPI Library Model and an Environment
Model it is checked by a Zing MC Server that farms work out to many MC
Clients; a Result Analyzer either reports OK or drives Error
Visualization / Simulation and Abstraction Refinement.)
12
Where Post-Si Verification Fits in the Hardware
Verification Flow
(Flow diagram: from Spec to product -- Specification Validation and
Design Verification happen pre-manufacture; Testing for Fabrication
Faults and Post-Silicon Verification happen post-manufacture. Post-Si
verification asks: does functionality match designed behavior?)
13
Post-Si Verification for Cache Protocol Execution
  • In the future:
    - CANNOT assume there is a front-side bus
    - CANNOT record all link traffic
    - CAN ONLY generate sets of possible cache states
    - HOW BEST can one match against designed
      behavior? (a toy matching sketch follows the
      diagram)

(Diagram: several cpus on point-to-point links; some miss traffic is
visible to the observer, some is invisible.)
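A toy sketch of the matching question (added here as an assumption-laden
illustration, not the actual post-Si methodology): treat the designed
behavior as a coarse MESI transition relation for a single cache line and
ask whether each pair of successively observed states is connected by some
sequence of possibly invisible protocol steps.

    #include <stdio.h>

    enum { M, E, S, I, NSTATES };

    /* Designed behavior: step_ok[a][b] == 1 iff state b can follow state a
     * in one protocol step (a deliberately coarse MESI abstraction). */
    static const int step_ok[NSTATES][NSTATES] = {
        /*        M  E  S  I */
        /* M */ { 1, 0, 1, 1 },
        /* E */ { 1, 1, 1, 1 },
        /* S */ { 1, 0, 1, 1 },
        /* I */ { 1, 1, 1, 1 },
    };

    /* Can state b be reached from a in one or more steps? (tiny DFS) */
    static int reachable(int a, int b) {
        int seen[NSTATES] = {0}, stack[NSTATES], top = 0;
        stack[top++] = a;
        while (top > 0) {
            int s = stack[--top];
            for (int t = 0; t < NSTATES; t++)
                if (step_ok[s][t] && !seen[t]) {
                    if (t == b) return 1;
                    seen[t] = 1;
                    stack[top++] = t;
                }
        }
        return 0;
    }

    int main(void) {
        /* A partial observation of one line's states (visible misses only). */
        int observed[] = { I, E, M, S, I };
        int n = sizeof observed / sizeof observed[0];
        for (int i = 0; i + 1 < n; i++)
            if (!reachable(observed[i], observed[i + 1])) {
                printf("observed transition %d -> %d is not explainable\n",
                       observed[i], observed[i + 1]);
                return 1;
            }
        printf("observed trace is consistent with designed behavior\n");
        return 0;
    }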
14
Back to our specific problem domain...
  • Verify the operation of systems at runtime,
    when we can't see all transactions
  • Could also be offline analysis of a partial log
    of activities

(Example partial log: a x c d y b)
15
Required Constraint-Solving Approaches
  • Constraint Solving in the context of Coupled
    Reactive Processes (see the sketch after the
    diagram)

(Diagram: events a-e of coupled reactive processes, with one event
marked as the observed event and a chain of earlier events marked as
its likely cause.)
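One very small way to read the picture (an illustrative brute-force
sketch added here, not the constraint-solving machinery the slide has in
mind): fix the per-process event orders of two coupled processes, then
enumerate the interleavings that contain the partially observed events in
the observed order; each surviving interleaving is a candidate "likely
cause". The processes P0, P1 and the observation are made-up placeholders.

    #include <stdio.h>

    static const char *P0 = "abc", *P1 = "de";
    static const char *observed = "db";      /* partial, in-order observation */

    /* Does `full` contain `sub` as a (not necessarily contiguous) subsequence? */
    static int subseq(const char *sub, const char *full) {
        while (*sub && *full)
            if (*sub == *full++) sub++;
        return *sub == '\0';
    }

    /* Recursively build every interleaving of P0[i..] and P1[j..]. */
    static void interleave(int i, int j, char *out, int len) {
        if (!P0[i] && !P1[j]) {
            out[len] = '\0';
            if (subseq(observed, out))
                printf("consistent explanation: %s\n", out);
            return;
        }
        if (P0[i]) { out[len] = P0[i]; interleave(i + 1, j, out, len + 1); }
        if (P1[j]) { out[len] = P1[j]; interleave(i, j + 1, out, len + 1); }
    }

    int main(void) {
        char out[8];
        interleave(0, 0, out, 0);
        return 0;
    }

In the real setting the per-process behaviors, the observation alphabet,
and the notion of "likely" would come from the protocol and the
instrumentation, and plain enumeration would be replaced by a constraint
solver.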
16
Contributions that we can make
  • Create benchmark problems
  • Can define tangible measures of success in each
    domain
  • Can work with the industry
  • Contribute tools and work with other expert
    groups

17
Formal Methods
  • Principal faculty
    - Konrad Slind (does deductive verification)
    - Ganesh Gopalakrishnan (does algorithmic
      verification)

18
Background Shared Memory and Distributed
Processors
(Photo courtesy LLNL / IBM)
Released in 2000 -- Peak Performance 12.3
teraflops. -- Processors used IBM RS6000 SP
Power3's - 375 MHz. -- There are 8,192 of these
processors -- The total amount of RAM is 6Tb.
-- Two hundred cabinets - area of two basket
ball courts.
http//www.theinquirer.net/?article12145
By Nebojsa Novakovic Thursday 16 October 2003,
0649 NOVA HAS been to the Microprocessor Forum
and captured this picture of POWER5 chief
scientist Balaram Sinharoy holding this eight way
POWER5 MCM with a staggering 144MB of cache.
Sheesh Kebab! 8 x 2 cpus x 2-way SMT 32
shared memory cpus on the palm
19
19
20
Another Domain Expert (Berzins) Adaptive
Mesh-refinement Code is Hard!
(Photo courtesy NHTSA)
21
Under Construction at Utah (students Palmer,
Yang, Barrus)
proctype MPI_Send(chan out, int c)
out!c proctype MPI_Bsend(chan out, int c)
out!c proctype MPI_Isend(chan out, int c)
out!c typedef MPI_Status int MPI_SOURCE
int MPI_TAG int MPI_ERROR
MPI LibraryModel
int y active proctype T1() int x x 1
if x 0 x 2 fi y
x active proctype T2() int x x 2
if y x 1 y 0 fi assert( y
0 )
CIL / MPICC
ProgramModel
Model Extractor

Environment Model
Error Visualization Simulation
Abstraction Refinement
Zing
MC Server
Result Analyzer
MC Client
MC Client
MC Client
MC Client
MC Client
MC Client

OK
MC Client
MC Client
MC Client
Write a Comment
User Comments (0)
About PowerShow.com