Title: The Parallel Computing Laboratory: A Research Agenda based on the Berkeley View
1. The Parallel Computing Laboratory: A Research Agenda based on the Berkeley View
Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Kathy Yelick
April 28, 2008
2. Outline
- Overview of Par Lab
  - Motivation and Scope
  - Driving Applications
  - Need for Parallel Libraries and Frameworks
- Parallel Libraries
  - Success Metric
  - High performance (speed and accuracy)
  - Autotuning
  - Required Functionality
  - Ease of use
- Summary of meeting goals, other talks
- Identify opportunities for collaboration
4. A Parallel Revolution, Ready or Not
- Old Moore's Law is over
  - No more doubling the speed of sequential code every 18 months
- New Moore's Law is here
  - 2X processors (cores) per chip every technology generation, but same clock rate
- Sea change for HW and SW industries, since it changes the model of programming and debugging
5. "Motif" Popularity (Red Hot → Blue Cool)
- How do compelling apps relate to 13 motifs?
  [Figure: heat map of which motifs each application uses]
7. Choosing Driving Applications
- Who needs 100 cores to run M/S Word?
  - Need compelling apps that use 100s of cores
- How did we pick applications?
  - Enthusiastic expert application partner, leader in field, promise to help design, use, evaluate our technology
  - Compelling in terms of likely market or social impact, with short-term feasibility and longer-term potential
  - Requires significant speedup, or a smaller, more efficient platform, to work as intended
- As a whole, applications cover the most important
  - Platforms (handheld, laptop, games)
  - Markets (consumer, business, health)
8. Compelling Client Applications
- Music/Hearing
- Robust Speech Input
- Parallel Browser
- Personal Health
10. Par Lab Research Overview
Goal: easy to write correct programs that run efficiently on manycore
- Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser
- Motifs
- Productivity Layer: Composition & Coordination Language (CCL), CCL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Static Verification, Type Systems
- Efficiency Layer: Efficiency Languages, Efficiency Language Compilers, Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives
- Correctness: Directed Testing, Dynamic Checking, Debugging with Replay
- Diagnosing Power/Performance
- OS: Legacy OS, OS Libraries & Services, Hypervisor
- Arch.: Multicore/GPGPU, RAMP Manycore
12. Developing Parallel Software
- 2 types of programmers → 2 layers
- Efficiency Layer (10% of today's programmers)
  - Expert programmers build Frameworks & Libraries, Hypervisors, ...
  - "Bare metal" efficiency possible at Efficiency Layer
- Productivity Layer (90% of today's programmers)
  - Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
  - Frameworks & libraries composed to form app frameworks
- Effective composition techniques allow the efficiency programmers to be highly leveraged → create language for Composition and Coordination (C&C)
- Talk by Kathy Yelick
15. Success Metric - Impact
- LAPACK and ScaLAPACK are widely used
  - Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, ...
  - >86M web hits at Netlib (incl. CLAPACK, LAPACK95)
  - 35K hits/day
- Xiaoye Li: Sparse LU
16. High Performance (Speed and Accuracy)
- Matching Algorithms to Architectures (8 talks)
  - Autotuning: generate fast algorithm automatically, depending on architecture and problem
  - Communication-Avoiding Linear Algebra: avoiding latency and bandwidth costs
- Faster Algorithms (2 talks)
  - Symmetric eigenproblem (O(n^2) instead of O(n^3))
  - Sparse LU factorization
- More accurate algorithms (2 talks)
  - Either at usual speed, or at any cost
- Structure-exploiting algorithms
  - Roots(p) (O(n^2) instead of O(n^3))
18. Automatic Performance Tuning
- Writing high performance software is hard
  - Ideal: get high fraction of peak performance from one algorithm
  - Reality: best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler, ...
    - Best choice can depend on knowing a lot of applied mathematics and computer science
    - Changes with each new hardware, compiler release
- Goal: Automation
  - Generate and search a space of algorithms
  - Past successes: PHiPAC, ATLAS, FFTW, Spiral
  - Many conferences, DOE projects, ...
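The generate-and-search loop above can be sketched in a few lines. This is a toy illustration (plain Python), where the candidate "algorithms" are just two loop orders of a naive matrix multiply; real autotuners such as PHiPAC and ATLAS search far richer spaces of blockings and data structures:

```python
import time

def matmul_ijk(A, B):
    # Naive triple loop, ijk order
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_ikj(A, B):
    # Same result, ikj order: streams through rows of B
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik, rowB = A[i][k], B[k]
            for j in range(p):
                C[i][j] += aik * rowB[j]
    return C

def autotune(variants, A, B):
    """Time each generated variant on this machine/input; keep the fastest."""
    best, best_t = None, float("inf")
    for f in variants:
        t0 = time.perf_counter()
        f(A, B)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = f, dt
    return best

n = 60
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i - j) for j in range(n)] for i in range(n)]
winner = autotune([matmul_ijk, matmul_ikj], A, B)
print(winner.__name__)  # the fastest variant on this machine
```

The point of the slide is exactly that which variant wins depends on the machine, so the winner is discovered by measurement, not chosen in advance.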
19The Difficulty of Tuning SpMV
- // y lt-- y Ax
- for all nonzero A(i,j)
- y(i) A(i,j) x(j)
- // Compressed sparse row (CSR)
- for each row i
- t 0
- for krowi to rowi1-1
- t Ak xJk
- yi t
- Exploit 8x8 dense blocks
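A minimal executable version of the CSR kernel above (plain Python; the array names `val`, `col`, `rowptr` are the usual CSR names, standing in for the slide's `A`, `J`, `row`):

```python
def spmv_csr(val, col, rowptr, x):
    """y = A*x for A stored in compressed sparse row (CSR) form.
    val[k]    : k-th stored nonzero
    col[k]    : its column index (the slide's J)
    rowptr[i] : index in val/col where row i starts (the slide's row)
    """
    n_rows = len(rowptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        t = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            t += val[k] * x[col[k]]
        y[i] = t
    return y

# 3x3 example:  [[2, 0, 1],
#                [0, 3, 0],
#                [4, 0, 5]]
val    = [2.0, 1.0, 3.0, 4.0, 5.0]
col    = [0, 2, 1, 0, 2]
rowptr = [0, 2, 3, 5]
print(spmv_csr(val, col, rowptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

The indirect load `x[col[k]]` is what makes this kernel hard to tune: memory behavior depends entirely on the matrix's nonzero pattern.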
20. Speedups on Itanium 2: The Need for Search
22. SpMV Performance: raefsky3
23. More surprises tuning SpMV
- More complex example
- Example: 3x3 blocking
  - Logical grid of 3x3 cells
24. Extra Work Can Improve Efficiency
- More complex example
- Example: 3x3 blocking
  - Logical grid of 3x3 cells
  - Pad with zeros
  - Fill ratio = 1.5
- 1.5x as many flops
- On Pentium III
  - 1.5x speedup! (2/3 time)
  - 1.5^2 = 2.25x flop rate
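The padding trade-off above is easy to quantify: register blocking stores every touched r-by-c block densely, so the "fill ratio" is stored entries divided by true nonzeros. A small sketch (plain Python; `fill_ratio` is an illustrative helper, not part of any library):

```python
def fill_ratio(nonzeros, r, c):
    """Stored entries / true nonzeros when each touched r-by-c block
    is padded out with explicit zeros (register blocking)."""
    touched = {(i // r, j // c) for (i, j) in nonzeros}
    return len(touched) * r * c / len(nonzeros)

# 6 nonzeros that all fall inside one 3x3 block:
# 9 stored entries for 6 true nonzeros -> fill ratio 1.5
pattern = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 0), (2, 2)]
print(fill_ratio(pattern, 3, 3))  # 1.5, i.e. 1.5x as many flops, as on the slide
```

Whether those extra flops pay off is machine-dependent, which is exactly why the autotuner searches over (r, c) instead of fixing one block shape.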
25. Autotuned Performance of SpMV
- Clovertown was already fully populated with DIMMs
- Gave Opteron as many DIMMs as Clovertown
- Firmware update for Niagara2
- Array padding to avoid inter-thread conflict misses
- PPEs use 1/3 of Cell chip area
26. Autotuning SpMV
- Large search space of possible optimizations
- Large speedups possible
  - Parallelism adds more!
- Later talks
  - Sam Williams on tuning SpMV for a variety of multicore, other platforms
  - Ankit Jain on easy-to-use system for incorporating autotuning into applications
  - Kaushik Datta on tuning special case of stencils
  - Rajesh Nishtala on tuning collective communications
- But don't you still have to write difficult code to generate the search space?
27. Program Synthesis
- Best implementation/data structure hard to write, identify
- Don't do this by hand
  - Sketching: code generation using 2QBF
- Spec: simple implementation (3-loop 3D stencil)
- Optimized code (tiled, prefetched, time skewed)
- Talk by Armando Solar-Lezama / Ras Bodik on program synthesis by sketching, applied to stencils
28. Communication-Avoiding Linear Algebra (CALU)
- Exponentially growing gaps between
  - Floating point time << 1/Network BW << Network Latency
    - Improving 59%/year vs 26%/year vs 15%/year
  - Floating point time << 1/Memory BW << Memory Latency
    - Improving 59%/year vs 23%/year vs 5.5%/year
- Goal: reorganize linear algebra to avoid communication
  - Not just hiding communication (speedup ≤ 2x)
  - Arbitrary speedups possible
- Possible for Dense and Sparse Linear Algebra
29. CALU Summary (1/4)
- QR or LU decomposition of m x n matrix, m >> n
- Parallel implementation
  - Conventional: O( n log p ) messages
  - New: O( log p ) messages - optimal
  - Performance: QR 5x faster on cluster, LU 7x faster on cluster
- Serial implementation with fast memory of size W
  - Conventional: O( mn/W ) moves of data from slow to fast memory
    - mn/W = how many times larger the matrix is than fast memory
  - New: O(1) moves of data
  - Performance: OOC QR only 2x slower than having infinite DRAM
- Expect gains with Multicore as well
- Price
  - Some redundant computation (but flops are cheap!)
  - Different representation of answer for QR (tree structured)
  - LU stable in practice so far, but not GEPP
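The "tree structured" QR mentioned above can be sketched in a few lines: factor each block of rows independently, stack the small R factors, and factor once more; in a parallel run only the small R factors would ever need to be communicated. A plain-Python sketch, where `mgs_qr` (modified Gram-Schmidt) is an illustrative stand-in for whatever local QR each processor runs:

```python
def mgs_qr(A):
    """Modified Gram-Schmidt QR of a full-rank matrix (list of rows).
    Returns Q (orthonormal columns) and R (upper triangular, positive diagonal)."""
    m, n = len(A), len(A[0])
    Q = [row[:] for row in A]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = sum(Q[i][j] ** 2 for i in range(m)) ** 0.5
        for i in range(m):
            Q[i][j] /= R[j][j]
        for k in range(j + 1, n):
            R[j][k] = sum(Q[i][j] * Q[i][k] for i in range(m))
            for i in range(m):
                Q[i][k] -= R[j][k] * Q[i][j]
    return Q, R

def tsqr_R(blocks):
    """Tree-structured QR of a tall matrix given as row blocks:
    QR each block locally, then QR the stacked small R factors."""
    stacked = []
    for B in blocks:
        _, R = mgs_qr(B)
        stacked.extend(R)
    _, R = mgs_qr(stacked)
    return R

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0],
     [2.0, 1.0], [4.0, 3.0], [6.0, 5.0], [8.0, 7.0]]
R_direct = mgs_qr(A)[1]          # one global factorization
R_tree = tsqr_R([A[:4], A[4:]])  # two independent local ones + one combine
# Both equal the unique positive-diagonal R factor of A, up to roundoff
```

With p row blocks the combines form a binary tree, which is where the O(log p) message count on the slide comes from.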
30. CALU Summary (2/4)
- QR or LU decomposition of n x n matrix
  - Communication lower by factor of b = block size
  - Lots of speedup possible (modeled and measured)
- Modeled speedups of new QR over ScaLAPACK
  - IBM Power 5 (512 procs): up to 9.7x
  - Petascale (8K procs): up to 22.9x
  - Grid (128 procs): up to 11x
- Measured and modeled speedups of new LU over ScaLAPACK
  - IBM Power 5 (Bassi): up to 2.3x speedup (measured)
  - Cray XT4 (Franklin): up to 1.8x speedup (measured)
  - Petascale (8K procs): up to 80x (modeled)
  - Speed limit: Cholesky? Matmul?
- Extends to sparse LU
  - Communication more dominant, so payoff may be higher
  - Speed limit: Sparse Cholesky?
  - Talk by Xiaoye Li on alternative
31. CALU Summary (3/4)
- Take k steps of Krylov subspace method
  - GMRES, CG, Lanczos, Arnoldi
  - Assume matrix well-partitioned, with modest surface-to-volume ratio
- Parallel implementation
  - Conventional: O(k log p) messages
  - New: O(log p) messages - optimal
- Serial implementation
  - Conventional: O(k) moves of data from slow to fast memory
  - New: O(1) moves of data - optimal
- Can incorporate some preconditioners
  - Need to be able to compress interactions between distant i, j
  - Hierarchical, semiseparable matrices
- Lots of speedup possible (modeled and measured)
  - Price: some redundant computation
- Talks by Marghoob Mohiyuddin, Mark Hoemmen
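The O(log p) bound above can be made concrete on a 1D stencil (a well-partitioned matrix with small surface-to-volume ratio): each processor fetches a depth-k ghost region once, then computes all k applications of A locally with no further communication. A plain-Python sketch, assuming the tridiagonal stencil y_i = x_{i-1} - 2*x_i + x_{i+1} with zero boundaries (all names are illustrative):

```python
def apply_A(x):
    """One application of the stencil y_i = x_{i-1} - 2*x_i + x_{i+1}, zero boundaries."""
    n = len(x)
    return [(x[i-1] if i > 0 else 0.0) - 2.0 * x[i] + (x[i+1] if i < n-1 else 0.0)
            for i in range(n)]

def k_steps_local(x, lo, hi, k):
    """Compute (A^k x)[lo:hi] from a single depth-k ghost fetch:
    grab x[lo-k : hi+k] once, iterate locally, and trim one invalid
    cell per step at each edge that is not a true domain boundary."""
    n = len(x)
    a, b = max(0, lo - k), min(n, hi + k)
    buf = x[a:b]
    for _ in range(k):
        nxt = apply_A(buf)
        s = 0 if a == 0 else 1              # a == 0 / b == n: real boundary, keep edge
        e = len(nxt) if b == n else len(nxt) - 1
        buf = nxt[s:e]
        if a != 0:
            a += 1
        if b != n:
            b -= 1
    return buf[lo - a: hi - a]

x = [float(i * i % 7) for i in range(12)]
k = 3
# Two "processors" own [0,6) and [6,12); one ghost exchange instead of k
combined = k_steps_local(x, 0, 6, k) + k_steps_local(x, 6, 12, k)
reference = x
for _ in range(k):
    reference = apply_A(reference)
# combined matches reference entry for entry
```

The redundant work is the recomputation inside the ghost regions, which is the "price: some redundant computation" on the slide.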
32. CALU Summary (4/4)
- Lots of related work
  - Some going back to 1960s
  - Reports discuss this comprehensively; we will not
- Our contributions
  - Several new algorithms, improvements on old ones
  - Unifying parallel and sequential approaches to avoiding communication
  - Time for these algorithms has come, because of growing communication costs
  - Systematic examination of as much of linear algebra as we can
- Why just linear algebra?
33. Linear Algebra on GPUs
- Important part of architectural space to explore
- Talk by Vasily Volkov
  - NVIDIA has licensed our BLAS (SGEMM)
  - Fastest implementations of dense LU, Cholesky, QR
    - 80-90% of peak
  - Require various optimizations special to GPU
    - Use CPU for BLAS1 and BLAS2, GPU for BLAS3
    - In LU, replace TRSM by TRTRI + GEMM (stable as GEPP)
35. Faster Algorithms (Highlights)
- MRRR algorithm for symmetric eigenproblem
  - Talk by Osni Marques / B. Parlett / I. Dhillon / C. Voemel
  - 2006 SIAM Linear Algebra Prize for Parlett, Dhillon
- Parallel Sparse LU
  - Talk by Xiaoye Li
- Up to 10x faster HQR
  - R. Byers / R. Mathias / K. Braman
  - 2003 SIAM Linear Algebra Prize
- Extensions to QZ
  - B. Kågström / D. Kressner / T. Adlerborn
- Faster Hessenberg, tridiagonal, bidiagonal reductions
  - R. van de Geijn / E. Quintana-Orti
  - C. Bischof / B. Lang
  - G. Howell / C. Fulton
36. Collaborators
- UC Berkeley
  - Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Yozo Hida, Jason Riedy, Vasily Volkov, Christof Voemel, David Bindel, undergrads
- U Tennessee, Knoxville
  - Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak
- Other Academic Institutions
  - UT Austin, UC Davis, CU Denver, Florida IT, Georgia Tech, U Kansas, U Maryland, North Carolina SU, UC Santa Barbara
  - TU Berlin, ETH, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
- Research Institutions
  - INRIA, LBL
- Industrial Partners (predating ParLab)
  - Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, NVIDIA
38. More Accurate Algorithms
- Motivation
  - User requests, debugging
- Iterative refinement for Ax=b, least squares
  - Promise the right answer for O(n^2) additional cost
  - Talk by Jason Riedy
- Arbitrary precision versions of everything
  - Using your favorite multiple precision package
  - Talk by Yozo Hida
- Jacobi-based SVD
  - Faster than QR, can be arbitrarily more accurate
  - Drmac / Veselic
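Iterative refinement as promised above can be sketched directly: solve with a cheap (here, simulated single-precision) factorization, compute the residual at full precision, solve again for a correction, and repeat. Plain-Python sketch with no libraries; all helper names are illustrative, and `f32` rounds through IEEE single to simulate the low-precision solver:

```python
import struct

def f32(v):
    """Round a double to the nearest IEEE single (simulated low precision)."""
    return struct.unpack('f', struct.pack('f', v))[0]

def solve(A, b):
    """Gaussian elimination with partial pivoting (double precision)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for j in range(n):
        p = max(range(j, n), key=lambda i: abs(M[i][j]))
        M[j], M[p] = M[p], M[j]
        for i in range(j + 1, n):
            f = M[i][j] / M[j][j]
            for k in range(j, n + 1):
                M[i][k] -= f * M[j][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def refine(A, b, steps=5):
    A32 = [[f32(v) for v in row] for row in A]            # the "cheap" solver's matrix
    x = solve(A32, [f32(v) for v in b])                   # low-precision solve
    for _ in range(steps):
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))   # residual in double: O(n^2)
             for row, bi in zip(A, b)]
        d = solve(A32, r)                                 # correction, cheap again
        x = [xi + di for xi, di in zip(x, d)]
    return x

# 4x4 Hilbert matrix, true solution = all ones
n = 4
A = [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]
b = [sum(row) for row in A]
x = refine(A, b)
```

For a matrix that is not too ill-conditioned relative to the low precision, each pass contracts the error, so a few O(n^2) residual-and-correct steps recover a full-precision answer.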
39. What could go into linear algebra libraries?
For all linear algebra problems
  For all matrix/problem structures
    For all data types
      For all architectures and networks
        For all programming interfaces
          Produce best algorithm(s) w.r.t. performance and accuracy (including condition estimates, etc.)
Need to prioritize, automate, enlist help!
40. What do users want? (1/2)
- Performance, ease of use, functionality, portability
- Composability
  - On multicore, expect to implement dense codes via DAG scheduling (Dongarra's PLASMA)
  - Talk by Krste Asanovic / Heidi Pan on threads
- Reproducibility
  - Made challenging by nonassociativity of floating point
- Ongoing collaborations on Driving Apps
  - Jointly analyzing needs
  - Talk by T. Keaveny on Medical Application
  - Other apps so far mostly dense and sparse linear algebra, FFTs
    - Some interesting structured needs emerging
41. What do users want? (2/2)
- DOE/NSF User Survey
  - Small but interesting sample at www.netlib.org/lapack-dev
  - What matrix sizes do you care about?
    - 1000s: 34%
    - 10,000s: 26%
    - 100,000s or 1Ms: 26%
  - How many processors, on distributed memory?
    - >10: 34%, >100: 31%, >1000: 19%
  - Do you use more than double precision?
    - Sometimes or frequently: 16%
- New graduate program in CSE with 106 faculty from 18 departments
  - New needs may emerge
42. Highlights of New Dense Functionality
- Updating / downdating of factorizations
  - Stewart, Langou
- More generalized SVDs
  - Bai, Wang
- More generalized Sylvester/Lyapunov eqns
  - Kågström, Jonsson, Granat
- Structured eigenproblems
  - Selected matrix polynomials
  - Mehrmann
43. Organizing Linear Algebra
www.netlib.org/lapack
www.netlib.org/scalapack
gams.nist.gov
www.cs.utk.edu/dongarra/etemplates
www.netlib.org/templates
44. Improved Ease of Use
A \ B
CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )
CALL PDGESVX( FACT, TRANS, N, NRHS, A, IA, JA, DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO )
45. Ease of Use: One approach
- Easy interfaces vs access to details
  - Some users want access to all details, because
    - Peak performance matters
    - Control over memory allocation
  - Other users want simpler interface
    - Automatic allocation of workspace
  - No universal agreement across systems on easiest interface
    - Leave decision to higher level packages
- Keep expert driver / simple driver / computational routines
- Add wrappers for other languages
  - Fortran95, Java, Matlab, Python, even C
  - Automatic allocation of workspace
- Add wrappers to convert to best parallel layout
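One way to keep the expert driver while adding the "A \ B" feel is exactly the wrapper pattern above: the simple driver allocates workspace and fills in defaults, while the expert routine keeps every knob. A hypothetical sketch in plain Python; `expert_solve` is a toy Gaussian elimination standing in for a routine like PDGESVX, and all names here are illustrative:

```python
def expert_solve(A, b, work, equilibrate=False):
    """Expert driver: caller supplies workspace and every option."""
    n = len(A)
    if len(work) < n * (n + 1):
        raise ValueError("workspace too small: need n*(n+1)")
    # Pack the (optionally row-equilibrated) augmented system into the
    # caller-provided workspace, LAPACK-style.
    for i in range(n):
        scale = 1.0 / max(abs(v) for v in A[i]) if equilibrate else 1.0
        for j in range(n):
            work[i * (n + 1) + j] = A[i][j] * scale
        work[i * (n + 1) + n] = b[i] * scale
    M = [work[i * (n + 1):(i + 1) * (n + 1)] for i in range(n)]
    for j in range(n):                      # elimination with partial pivoting
        p = max(range(j, n), key=lambda i: abs(M[i][j]))
        M[j], M[p] = M[p], M[j]
        for i in range(j + 1, n):
            f = M[i][j] / M[j][j]
            for k in range(j, n + 1):
                M[i][k] -= f * M[j][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):          # back substitution
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def solve(A, b, **options):
    """Simple driver: allocate workspace, pick safe defaults,
    forward any expert options the caller does care about."""
    n = len(A)
    return expert_solve(A, b, work=[0.0] * (n * (n + 1)), **options)

print(solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 4.0]))  # [1.0, 1.0]
```

The expert path stays available for users who care about memory placement; everyone else never sees the workspace argument.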
47. Some goals for the meeting
- Introduce ParLab
- Describe numerical library efforts in detail
- Exchange information
  - User needs, tools, goals
- Identify opportunities for collaboration
48. Summary of other talks (1)
- Monday, April 28 (531 Cory)
  - 12:00 - 12:45 Jim Demmel - Overview of Par Lab / Numerical Libraries
  - 12:45 - 1:00 Avneesh Sud (Microsoft) - Introduction to library effort at Microsoft
  - 1:00 - 1:45 Sam Williams / Ankit Jain - Tuning sparse matrix-vector multiply / Parallel OSKI
  - 1:45 - 1:50 Break
  - 1:50 - 2:20 Marghoob Mohiyuddin - Avoiding communication in SpMV-like kernels
  - 2:20 - 2:50 Mark Hoemmen - Avoiding communication in Krylov subspace methods
  - 2:50 - 3:00 Break
  - 3:00 - 3:30 Rajesh Nishtala - Tuning collective communication
  - 3:30 - 4:00 Yozo Hida - High accuracy linear algebra
  - 4:00 - 4:25 Jason Riedy - Iterative refinement in linear algebra
  - 4:25 - 4:30 Break
  - 4:30 - 5:00 Tony Keaveny - Medical image analysis in Par Lab
  - 5:00 - 5:30 Ras Bodik / Armando Solar-Lezama - Program synthesis by sketching
  - 5:30 - 6:00 Vasily Volkov - Linear algebra on GPUs
49. Summary of other talks (2)
- Tuesday, April 29 (Wozniak Lounge)
  - 9:00 - 10:00 Kathy Yelick - Programming systems for Par Lab
  - 10:00 - 10:30 Kaushik Datta - Tuning stencils
  - 10:30 - 11:00 Xiaoye Li - Parallel sparse LU factorization
  - 11:00 - 11:30 Osni Marques - Parallel symmetric eigensolvers
  - 11:30 - 12:00 Krste Asanovic / Heidi Pan - Thread system
50. Extra Slides
51. P.S. Parallel Revolution May Fail
- John Hennessy, President, Stanford University, 1/07: "...when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. ... I would be panicked if I were in industry."
  - "A Conversation with Hennessy & Patterson," ACM Queue Magazine, 4(10), 1/07
- 100% failure rate of Parallel Computer Companies
  - Convex, Encore, Inmos (Transputer), MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Thinking Machines, ...
- What if IT goes from a growth industry to a replacement industry?
  - If SW can't effectively use 32, 64, ... cores per chip → SW no faster on new computer → only buy if computer wears out
52. 5 Themes of Par Lab
- Applications
  - Compelling apps drive top-down research agenda
- Identify Common Computational Patterns
  - Breaking through disciplinary boundaries
- Developing Parallel Software with Productivity, Efficiency, and Correctness
  - 2 Layers + Coordination & Composition Language + Autotuning
- OS and Architecture
  - Composable primitives, not packaged solutions
  - Deconstruction, fast barrier synchronization, partitions
- Diagnosing Power/Performance Bottlenecks
54. Compelling Laptop/Handheld Apps (David Wessel)
- Musicians have an insatiable appetite for computation
  - More channels, instruments, more processing, more interaction!
  - Latency must be low (5 ms)
  - Must be reliable (no clicks)
- Music Enhancer
  - Enhanced sound delivery systems for home sound systems using large microphone and speaker arrays
  - Laptop/Handheld recreate 3D sound over ear buds
- Hearing Augmenter
  - Laptop/Handheld as accelerator for hearing aid
- Novel Instrument User Interface
  - New composition and performance systems beyond keyboards
  - Input device for Laptop/Handheld
- Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: 10-inch-diameter icosahedron incorporating 120 tweeters.
55. Content-Based Image Retrieval (Kurt Keutzer)
- Pipeline: Query by example → Similarity Metric → Candidate Results → Relevance Feedback → Final Result, over an Image Database of 1000s of images
- Built around key characteristics of personal databases
  - Very large number of pictures (>5K)
  - Non-labeled images
  - Many pictures of few people
  - Complex pictures including people, events, places, and objects
56. Coronary Artery Disease (Tony Keaveny)
- [Figure: artery before and after treatment]
- Modeling to help patient compliance?
  - 450k deaths/year, 16M w. symptoms, 72M w. high BP
- Massively parallel, real-time variations
  - CFD + FE solid (non-linear), fluid (Newtonian), pulsatile
  - Blood pressure, activity, habitus, cholesterol
57. Compelling Laptop/Handheld Apps (Nelson Morgan)
- Meeting Diarist
  - Laptops/handhelds at meeting coordinate to create speaker-identified, partially transcribed text diary of meeting
- Teleconference speaker identifier, speech helper
  - L/Hs used for teleconference, identifies who is speaking, "closed caption" hint of what is being said
58. Compelling Laptop/Handheld Apps
- Health Coach
  - Since laptop/handheld always with you: record images of all meals, weigh plate before and after, analyze calories consumed so far
    - "What if I order a pizza for my next meal? A salad?"
  - Since laptop/handheld always with you: record amount of exercise so far, show how body would look if you maintain this exercise and diet pattern for the next 3 months
    - "What would I look like if I regularly ran less? Further?"
- Face Recognizer / Name Whisperer
  - Laptop/handheld scans faces, matches image database, whispers name in ear (relies on Content-Based Image Retrieval)
59. Parallel Browser (Ras Bodik)
- Goal: desktop-quality browsing on handhelds
  - Enabled by 4G networks, better output devices
- Bottlenecks to parallelize
  - Parsing, Rendering, Scripting
- SkipJax
  - Parallel replacement for JavaScript/AJAX
  - Based on Brown's FlapJax
60. Theme 2: Use design patterns instead of benchmarks? (Kurt Keutzer)
- How invent parallel systems of the future when tied to old code, programming models, CPUs of the past?
- Look for common design patterns (see "A Pattern Language," Christopher Alexander, 1975)
  - Embedded Computing (42 EEMBC benchmarks)
  - Desktop/Server Computing (28 SPEC2006)
  - Data Base / Text Mining Software
  - Games/Graphics/Vision
  - Machine Learning
  - High Performance Computing (Original 7 Dwarfs)
- Result: 13 Motifs (use "motif" instead of "dwarf" when going from 7 to 13)
61. "Motif" Popularity (Red Hot → Blue Cool)
- How do compelling apps relate to 13 motifs?
  [Figure: heat map of which motifs each application uses]
62. 4 Valuable Roles of Motifs
- "Anti-benchmarks"
  - Motifs not tied to code or language artifacts → encourage innovation in algorithms, languages, data structures, and/or hardware
- Universal, understandable vocabulary, at least at high level
  - To talk across disciplinary boundaries
- Bootstrapping: "parallelize parallel research"
  - Allow analysis of HW & SW design without waiting years for full apps
- Targets for libraries (see later)
63. Themes 1 and 2 Summary
- Application-Driven Research (top down) vs. CS Solution-Driven Research (bottom up)
- Drill down on (initially) 5 app areas to guide research agenda
- Motifs to guide design of apps, and implement via programming framework per motif
- Motifs help break through traditional interfaces
  - Benchmarking, multidisciplinary conversations, parallelizing parallel research, and target for libraries
65. Theme 3: Developing Parallel SW (Kurt Keutzer and Kathy Yelick)
- Observation: use Motifs as design patterns
- Design patterns are implemented as 2 kinds of frameworks
  - Traditional numerical parallel library
  - Library where a supplied function is applied to data in parallel
- Numerical Libraries
  - Dense matrices
  - Sparse matrices
  - Spectral
  - Combinational
  - Finite state machines
- Function Application Libraries
  - MapReduce
  - Dynamic programming
  - Backtracking/B&B
  - N-Body
  - (Un)Structured Grid
  - Graph traversal, graphical models
- Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce library, mapping 1D FFTs and then transposing (generalized reduce)
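The second kind of framework above, "apply a supplied function to data in parallel," can be sketched in a few lines. A plain-Python illustration: `par_map_reduce` is an illustrative name, and a thread pool stands in for whatever parallel substrate the efficiency layer provides; the user supplies only the mapper and an associative reducer:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def par_map_reduce(data, mapper, reducer, workers=4):
    """Framework side: owns the parallelism; user supplies mapper/reducer.
    The reducer must be associative so the framework may regroup the reduction."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(mapper, data))
    return reduce(reducer, mapped)

# User side: word count over document fragments
def count_words(fragment):
    counts = {}
    for w in fragment.split():
        counts[w] = counts.get(w, 0) + 1
    return counts

def merge(a, b):
    out = dict(a)
    for w, c in b.items():
        out[w] = out.get(w, 0) + c
    return out

docs = ["to be or not to be", "be fast or be correct"]
print(par_map_reduce(docs, count_words, merge))
# {'to': 2, 'be': 4, 'or': 2, 'not': 1, 'fast': 1, 'correct': 1}
```

The separation is the whole point: the productivity-layer user writes two small sequential functions, while the framework (written once by efficiency-layer experts) decides how to schedule and regroup the work.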
67. C&C Language Requirements (Kathy Yelick)
- Applications specify C&C language requirements
- Constructs for creating application frameworks
- Primitive parallelism constructs
  - Data parallelism
  - Divide-and-conquer parallelism
  - Event-driven execution
- Constructs for composing programming frameworks
  - Frameworks require independence
  - Independence is proven at instantiation with a variety of techniques
- Needs to have low runtime overhead and ability to measure and control performance
68. Ensuring Correctness (Koushik Sen)
- Productivity Layer
  - Enforce independence of tasks using decomposition (partitioning) and copying operators
  - Goal: remove chance for concurrency errors (e.g., nondeterminism from execution order, not just low-level data races)
- Efficiency Layer: check for subtle concurrency bugs (races, deadlocks, and so on)
  - Mixture of verification and automated directed testing
  - Error detection on frameworks with sequential code as specification
  - Automatic detection of races, deadlocks
69. 21st Century Code Generation (Demmel, Yelick)
- Search space for block sizes (dense matrix):
  - Axes are block dimensions
  - Temperature is speed
- Problem: generating optimal code is like searching for a needle in a haystack
  - Manycore → even more diverse
- New approach: Auto-tuners
  - 1st generate program variations of combinations of optimizations (blocking, prefetching, ...) and data structures
  - Then compile and run to heuristically search for the best code for that computer
- Examples: PHiPAC (BLAS), Atlas (BLAS), Spiral (DSP), FFT-W (FFT)
- Example: Sparse Matrix (SpMV) for 4 multicores
  - Fastest SpMV; optimizations: BCOO vs. BCSR data structures, NUMA, 16-bit vs. 32-bit indices, ...
71. Theme 4: OS and Architecture (Krste Asanovic, John Kubiatowicz)
- Traditional OSes brittle, insecure, memory hogs
  - Traditional monolithic OS image uses lots of precious memory 100s - 1000s of times (e.g., AIX uses GBs of DRAM / CPU)
- How can novel architectural support improve productivity, efficiency, and correctness for scalable hardware?
  - "Efficiency" instead of "performance" to capture energy as well as performance
  - Other challenges: power limit, design and verification costs, low yield, higher error rates
- How prototype ideas fast enough to run real SW?
72. Deconstructing Operating Systems
- Resurgence of interest in virtual machines
  - Hypervisor: thin SW layer between guest OS and HW
- Future: OS libraries where only the functions needed are linked into the app, on top of a thin hypervisor providing protection and sharing of resources
- Opportunity for OS innovation
- Leverage HW partitioning support for very thin hypervisors, and to allow software full access to hardware within a partition
73. HW Solution: Small is Beautiful
- Want Software Composable Primitives, Not Hardware Packaged Solutions
  - "You're not going fast if you're headed in the wrong direction"
  - Transactional Memory is usually a Packaged Solution
- Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector, SIMD PEs
  - Small cores not much slower than large cores
- Parallel is energy-efficient path to performance: power ∝ CV^2F
  - Lower threshold and supply voltages lowers energy per op
- Configurable Memory Hierarchy (Cell vs. Clovertown)
  - Can configure on-chip memory as cache or local RAM
  - Programmable DMA to move data without occupying CPU
  - Cache coherence: mostly HW, but SW handlers for complex cases
  - Hardware logging of memory writes to allow rollback
74. Build Academic Manycore from FPGAs
- As ~10 CPUs will fit in a Field Programmable Gate Array (FPGA), 1000-CPU system from ~100 FPGAs?
  - 8 32-bit simple "soft core" RISC at 100MHz in 2004 (Virtex-II)
  - FPGA generations every 1.5 yrs → ~2X CPUs, ~1.2X clock rate
- HW research community does logic design ("gate shareware") to create out-of-the-box Manycore
  - E.g., 1000-processor, standard-ISA, binary-compatible, 64-bit, cache-coherent supercomputer at ~150 MHz/CPU in 2007
  - Ideal for heterogeneous chip architectures
- RAMPants: 10 faculty at Berkeley, CMU, MIT, Stanford, Texas, and Washington
- "Research Accelerator for Multiple Processors": a vehicle to lure more researchers to the parallel challenge and decrease time to parallel solution
75. 1008 Core RAMP Blue (Wawrzynek, Krasnov, ... at Berkeley)
- 1008 = 12 32-bit RISC cores / FPGA, 4 FPGAs/board, 21 boards
  - Simple MicroBlaze soft cores at 90 MHz
- Full star-connection between modules
- NASA Advanced Supercomputing (NAS) Parallel Benchmarks (all class S)
  - UPC versions (C plus shared-memory abstraction): CG, EP, IS, MG
- RAMPants creating HW & SW for manycore community using next-gen FPGAs
  - Chuck Thacker & Microsoft designing next boards
  - 3rd party to manufacture and sell boards: 1H08
  - Gateware, Software: BSD open source
- RAMP Gold for Par Lab: new CPU
77. Theme 5: Diagnosing Power/Performance Bottlenecks (Jim Demmel)
- Collect data on Power/Performance bottlenecks
  - Aid autotuner, scheduler, OS in adapting system
- Turn data into useful information that can help the efficiency-level programmer improve the system?
  - E.g., % peak power, % peak memory BW, % CPU, % network
  - E.g., sample traces of critical paths
- Turn data into useful information that can help the productivity-level programmer improve the app?
  - Where am I spending my time in my program?
  - If I change it like this, impact on Power/Performance?
  - Rely on machine learning to find correlations in data, predict Power/Performance?
78. Physical Par Lab - 5th Floor Soda
79. Impact of Automatic Performance Tuning
- Widely used in performance tuning of kernels
  - ATLAS (PHiPAC) - www.netlib.org/atlas
    - Dense BLAS, now in Matlab, many other releases
  - FFTW - www.fftw.org
    - Fast Fourier Transform and similar transforms; Wilkinson Software Prize
  - Spiral - www.spiral.net
    - Digital Signal Processing
  - Communication Collectives (UCB, UTK)
  - Rose (LLNL), Bernoulli (Cornell), Telescoping Languages (Rice), UHFFT (Houston), POET (UTSA), ...
- More projects (PERI, TOPS2, CScADS), conferences, government reports, ...