Title: The Parallel Computing Laboratory: A Research Agenda based on the Berkeley View
1. The Parallel Computing Laboratory: A Research Agenda based on the Berkeley View
Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Kathy Yelick
April 28, 2008
2. Outline
- Overview of Par Lab
  - Motivation and Scope
  - Driving Applications
  - Need for Parallel Libraries and Frameworks
- Parallel Libraries
  - Success Metric
  - High performance (speed and accuracy)
  - Autotuning
  - Required Functionality
  - Ease of use
- Summary of meeting goals, other talks
- Identify opportunities for collaboration
4. A Parallel Revolution, Ready or Not
- Old Moore's Law is over
  - No more doubling the speed of sequential code every 18 months
- New Moore's Law is here
  - 2X processors (cores) per chip every technology generation, but same clock rate
- Sea change for HW and SW industries, since it changes the model of programming and debugging
5. "Motif" Popularity (Red Hot → Blue Cool)
- How do compelling apps relate to 13 motifs?
  [Figure: heat map of which motifs each application uses]
7. Choosing Driving Applications
- Who needs 100 cores to run M/S Word?
  - Need compelling apps that use 100s of cores
- How did we pick applications?
  - Enthusiastic expert application partner, leader in field, promise to help design, use, evaluate our technology
  - Compelling in terms of likely market or social impact, with short-term feasibility and longer-term potential
  - Requires significant speedup, or a smaller, more efficient platform, to work as intended
- As a whole, applications cover the most important
  - Platforms (handheld, laptop, games)
  - Markets (consumer, business, health)
8. Compelling Client Applications
- Music/Hearing
- Robust Speech Input
- Parallel Browser
- Personal Health
10. Par Lab Research Overview
Goal: easy to write correct programs that run efficiently on manycore
- Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser
- Motifs
- Productivity Layer: Composition & Coordination Language (CCL), CCL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Static Verification, Type Systems
- Efficiency Layer: Efficiency Languages, Efficiency Language Compilers, Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives
- Correctness: Directed Testing, Dynamic Checking, Debugging with Replay
- Diagnosing Power/Performance
- OS: Legacy OS, OS Libraries & Services, Hypervisor
- Arch.: Multicore/GPGPU, RAMP Manycore
12. Developing Parallel Software
- 2 types of programmers → 2 layers
- Efficiency Layer (10% of today's programmers)
  - Expert programmers build Frameworks & Libraries, Hypervisors, ...
  - "Bare metal" efficiency possible at Efficiency Layer
- Productivity Layer (90% of today's programmers)
  - Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
  - Frameworks & libraries composed to form app frameworks
- Effective composition techniques allow the efficiency programmers to be highly leveraged → create language for Composition and Coordination (C&C)
- Talk by Kathy Yelick
15. Success Metric - Impact
- LAPACK and ScaLAPACK are widely used
  - Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, ...
  - >86M web hits at Netlib (incl. CLAPACK, LAPACK95)
  - 35K hits/day
- Xiaoye Li: Sparse LU
16. High Performance (Speed and Accuracy)
- Matching Algorithms to Architectures (8 talks)
  - Autotuning: generate fast algorithm automatically, depending on architecture and problem
  - Communication-Avoiding Linear Algebra: avoiding latency and bandwidth costs
- Faster Algorithms (2 talks)
  - Symmetric eigenproblem (O(n^2) instead of O(n^3))
  - Sparse LU factorization
- More accurate algorithms (2 talks)
  - Either at usual speed, or at any cost
- Structure-exploiting algorithms
  - Roots(p) (O(n^2) instead of O(n^3))
18. Automatic Performance Tuning
- Writing high performance software is hard
  - Ideal: get high fraction of peak performance from one algorithm
  - Reality: best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler, ...
    - Best choice can depend on knowing a lot of applied mathematics and computer science
    - Changes with each new hardware, compiler release
- Goal: Automation
  - Generate and search a space of algorithms
  - Past successes: PHiPAC, ATLAS, FFTW, Spiral
  - Many conferences, DOE projects, ...
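The generate-and-search loop above can be sketched in a few lines. This is a toy illustration (plain Python), where the candidate "algorithms" are just two loop orders of a naive matrix multiply; real autotuners such as PHiPAC and ATLAS search far richer spaces of blockings and data structures:

```python
import time

def matmul_ijk(A, B):
    # Naive triple loop, ijk order
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_ikj(A, B):
    # Same result, ikj order: streams through rows of B
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik, rowB = A[i][k], B[k]
            for j in range(p):
                C[i][j] += aik * rowB[j]
    return C

def autotune(variants, A, B):
    """Time each generated variant on this machine/input; keep the fastest."""
    best, best_t = None, float("inf")
    for f in variants:
        t0 = time.perf_counter()
        f(A, B)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = f, dt
    return best

n = 60
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i - j) for j in range(n)] for i in range(n)]
winner = autotune([matmul_ijk, matmul_ikj], A, B)
print(winner.__name__)  # the fastest variant on this machine
```

The point of the slide is exactly that which variant wins depends on the machine, so the winner is discovered by measurement, not chosen in advance.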
19The Difficulty of Tuning SpMV
- // y lt-- y Ax
- for all nonzero A(i,j)
- y(i) A(i,j) x(j)
- // Compressed sparse row (CSR)
- for each row i
- t 0
- for krowi to rowi1-1
- t Ak xJk
- yi t
- Exploit 8x8 dense blocks
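A minimal executable version of the CSR kernel above (plain Python; the array names `val`, `col`, `rowptr` are the usual CSR names, standing in for the slide's `A`, `J`, `row`):

```python
def spmv_csr(val, col, rowptr, x):
    """y = A*x for A stored in compressed sparse row (CSR) form.
    val[k]    : k-th stored nonzero
    col[k]    : its column index (the slide's J)
    rowptr[i] : index in val/col where row i starts (the slide's row)
    """
    n_rows = len(rowptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        t = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            t += val[k] * x[col[k]]
        y[i] = t
    return y

# 3x3 example:  [[2, 0, 1],
#                [0, 3, 0],
#                [4, 0, 5]]
val    = [2.0, 1.0, 3.0, 4.0, 5.0]
col    = [0, 2, 1, 0, 2]
rowptr = [0, 2, 3, 5]
print(spmv_csr(val, col, rowptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

The indirect load `x[col[k]]` is what makes this kernel hard to tune: memory behavior depends entirely on the matrix's nonzero pattern.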
20. Speedups on Itanium 2: The Need for Search
22. SpMV Performance: raefsky3
23. More surprises tuning SpMV
- More complex example
- Example: 3x3 blocking
  - Logical grid of 3x3 cells
24. Extra Work Can Improve Efficiency
- More complex example
- Example: 3x3 blocking
  - Logical grid of 3x3 cells
  - Pad with zeros
  - Fill ratio = 1.5
- 1.5x as many flops
- On Pentium III
  - 1.5x speedup! (2/3 time)
  - 1.5^2 = 2.25x flop rate
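The padding trade-off above is easy to quantify: register blocking stores every touched r-by-c block densely, so the "fill ratio" is stored entries divided by true nonzeros. A small sketch (plain Python; `fill_ratio` is an illustrative helper, not part of any library):

```python
def fill_ratio(nonzeros, r, c):
    """Stored entries / true nonzeros when each touched r-by-c block
    is padded out with explicit zeros (register blocking)."""
    touched = {(i // r, j // c) for (i, j) in nonzeros}
    return len(touched) * r * c / len(nonzeros)

# 6 nonzeros that all fall inside one 3x3 block:
# 9 stored entries for 6 true nonzeros -> fill ratio 1.5
pattern = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 0), (2, 2)]
print(fill_ratio(pattern, 3, 3))  # 1.5, i.e. 1.5x as many flops, as on the slide
```

Whether those extra flops pay off is machine-dependent, which is exactly why the autotuner searches over (r, c) instead of fixing one block shape.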
25. Autotuned Performance of SpMV
- Clovertown was already fully populated with DIMMs
- Gave Opteron as many DIMMs as Clovertown
- Firmware update for Niagara2
- Array padding to avoid inter-thread conflict misses
- PPEs use 1/3 of Cell chip area
26. Autotuning SpMV
- Large search space of possible optimizations
- Large speedups possible
  - Parallelism adds more!
- Later talks
  - Sam Williams on tuning SpMV for a variety of multicore, other platforms
  - Ankit Jain on easy-to-use system for incorporating autotuning into applications
  - Kaushik Datta on tuning special case of stencils
  - Rajesh Nishtala on tuning collective communications
- But don't you still have to write difficult code to generate the search space?
27. Program Synthesis
- Best implementation/data structure hard to write, identify
- Don't do this by hand
  - Sketching: code generation using 2QBF
- Spec: simple implementation (3-loop 3D stencil)
- Optimized code (tiled, prefetched, time skewed)
- Talk by Armando Solar-Lezama / Ras Bodik on program synthesis by sketching, applied to stencils
28. Communication-Avoiding Linear Algebra (CALU)
- Exponentially growing gaps between
  - Floating point time << 1/Network BW << Network Latency
    - Improving 59%/year vs 26%/year vs 15%/year
  - Floating point time << 1/Memory BW << Memory Latency
    - Improving 59%/year vs 23%/year vs 5.5%/year
- Goal: reorganize linear algebra to avoid communication
  - Not just hiding communication (speedup ≤ 2x)
  - Arbitrary speedups possible
- Possible for Dense and Sparse Linear Algebra
29. CALU Summary (1/4)
- QR or LU decomposition of m x n matrix, m >> n
- Parallel implementation
  - Conventional: O( n log p ) messages
  - New: O( log p ) messages - optimal
  - Performance: QR 5x faster on cluster, LU 7x faster on cluster
- Serial implementation with fast memory of size W
  - Conventional: O( mn/W ) moves of data from slow to fast memory
    - mn/W = how many times larger the matrix is than fast memory
  - New: O(1) moves of data
  - Performance: OOC QR only 2x slower than having infinite DRAM
- Expect gains with Multicore as well
- Price
  - Some redundant computation (but flops are cheap!)
  - Different representation of answer for QR (tree structured)
  - LU stable in practice so far, but not GEPP
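The "tree structured" QR mentioned above can be sketched in a few lines: factor each block of rows independently, stack the small R factors, and factor once more; in a parallel run only the small R factors would ever need to be communicated. A plain-Python sketch, where `mgs_qr` (modified Gram-Schmidt) is an illustrative stand-in for whatever local QR each processor runs:

```python
def mgs_qr(A):
    """Modified Gram-Schmidt QR of a full-rank matrix (list of rows).
    Returns Q (orthonormal columns) and R (upper triangular, positive diagonal)."""
    m, n = len(A), len(A[0])
    Q = [row[:] for row in A]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = sum(Q[i][j] ** 2 for i in range(m)) ** 0.5
        for i in range(m):
            Q[i][j] /= R[j][j]
        for k in range(j + 1, n):
            R[j][k] = sum(Q[i][j] * Q[i][k] for i in range(m))
            for i in range(m):
                Q[i][k] -= R[j][k] * Q[i][j]
    return Q, R

def tsqr_R(blocks):
    """Tree-structured QR of a tall matrix given as row blocks:
    QR each block locally, then QR the stacked small R factors."""
    stacked = []
    for B in blocks:
        _, R = mgs_qr(B)
        stacked.extend(R)
    _, R = mgs_qr(stacked)
    return R

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0],
     [2.0, 1.0], [4.0, 3.0], [6.0, 5.0], [8.0, 7.0]]
R_direct = mgs_qr(A)[1]          # one global factorization
R_tree = tsqr_R([A[:4], A[4:]])  # two independent local ones + one combine
# Both equal the unique positive-diagonal R factor of A, up to roundoff
```

With p row blocks the combines form a binary tree, which is where the O(log p) message count on the slide comes from.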
30. CALU Summary (2/4)
- QR or LU decomposition of n x n matrix
  - Communication lower by factor of b = block size
  - Lots of speedup possible (modeled and measured)
- Modeled speedups of new QR over ScaLAPACK
  - IBM Power 5 (512 procs): up to 9.7x
  - Petascale (8K procs): up to 22.9x
  - Grid (128 procs): up to 11x
- Measured and modeled speedups of new LU over ScaLAPACK
  - IBM Power 5 (Bassi): up to 2.3x speedup (measured)
  - Cray XT4 (Franklin): up to 1.8x speedup (measured)
  - Petascale (8K procs): up to 80x (modeled)
  - Speed limit: Cholesky? Matmul?
- Extends to sparse LU
  - Communication more dominant, so payoff may be higher
  - Speed limit: Sparse Cholesky?
  - Talk by Xiaoye Li on alternative
31. CALU Summary (3/4)
- Take k steps of Krylov subspace method
  - GMRES, CG, Lanczos, Arnoldi
  - Assume matrix well-partitioned, with modest surface-to-volume ratio
- Parallel implementation
  - Conventional: O(k log p) messages
  - New: O(log p) messages - optimal
- Serial implementation
  - Conventional: O(k) moves of data from slow to fast memory
  - New: O(1) moves of data - optimal
- Can incorporate some preconditioners
  - Need to be able to compress interactions between distant i, j
  - Hierarchical, semiseparable matrices
- Lots of speedup possible (modeled and measured)
  - Price: some redundant computation
- Talks by Marghoob Mohiyuddin, Mark Hoemmen
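The O(log p) bound above can be made concrete on a 1D stencil (a well-partitioned matrix with small surface-to-volume ratio): each processor fetches a depth-k ghost region once, then computes all k applications of A locally with no further communication. A plain-Python sketch, assuming the tridiagonal stencil y_i = x_{i-1} - 2*x_i + x_{i+1} with zero boundaries (all names are illustrative):

```python
def apply_A(x):
    """One application of the stencil y_i = x_{i-1} - 2*x_i + x_{i+1}, zero boundaries."""
    n = len(x)
    return [(x[i-1] if i > 0 else 0.0) - 2.0 * x[i] + (x[i+1] if i < n-1 else 0.0)
            for i in range(n)]

def k_steps_local(x, lo, hi, k):
    """Compute (A^k x)[lo:hi] from a single depth-k ghost fetch:
    grab x[lo-k : hi+k] once, iterate locally, and trim one invalid
    cell per step at each edge that is not a true domain boundary."""
    n = len(x)
    a, b = max(0, lo - k), min(n, hi + k)
    buf = x[a:b]
    for _ in range(k):
        nxt = apply_A(buf)
        s = 0 if a == 0 else 1              # a == 0 / b == n: real boundary, keep edge
        e = len(nxt) if b == n else len(nxt) - 1
        buf = nxt[s:e]
        if a != 0:
            a += 1
        if b != n:
            b -= 1
    return buf[lo - a: hi - a]

x = [float(i * i % 7) for i in range(12)]
k = 3
# Two "processors" own [0,6) and [6,12); one ghost exchange instead of k
combined = k_steps_local(x, 0, 6, k) + k_steps_local(x, 6, 12, k)
reference = x
for _ in range(k):
    reference = apply_A(reference)
# combined matches reference entry for entry
```

The redundant work is the recomputation inside the ghost regions, which is the "price: some redundant computation" on the slide.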
32. CALU Summary (4/4)
- Lots of related work
  - Some going back to 1960s
  - Reports discuss this comprehensively; we will not
- Our contributions
  - Several new algorithms, improvements on old ones
  - Unifying parallel and sequential approaches to avoiding communication
  - Time for these algorithms has come, because of growing communication costs
  - Systematic examination of as much of linear algebra as we can
- Why just linear algebra?
33. Linear Algebra on GPUs
- Important part of architectural space to explore
- Talk by Vasily Volkov
  - NVIDIA has licensed our BLAS (SGEMM)
  - Fastest implementations of dense LU, Cholesky, QR
    - 80-90% of peak
  - Require various optimizations special to GPU
    - Use CPU for BLAS1 and BLAS2, GPU for BLAS3
    - In LU, replace TRSM by TRTRI + GEMM (stable as GEPP)
35. Faster Algorithms (Highlights)
- MRRR algorithm for symmetric eigenproblem
  - Talk by Osni Marques / B. Parlett / I. Dhillon / C. Voemel
  - 2006 SIAM Linear Algebra Prize for Parlett, Dhillon
- Parallel Sparse LU
  - Talk by Xiaoye Li
- Up to 10x faster HQR
  - R. Byers / R. Mathias / K. Braman
  - 2003 SIAM Linear Algebra Prize
- Extensions to QZ
  - B. Kågström / D. Kressner / T. Adlerborn
- Faster Hessenberg, tridiagonal, bidiagonal reductions
  - R. van de Geijn / E. Quintana-Orti
  - C. Bischof / B. Lang
  - G. Howell / C. Fulton
36. Collaborators
- UC Berkeley
  - Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Yozo Hida, Jason Riedy, Vasily Volkov, Christof Voemel, David Bindel, undergrads
- U Tennessee, Knoxville
  - Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak
- Other Academic Institutions
  - UT Austin, UC Davis, CU Denver, Florida IT, Georgia Tech, U Kansas, U Maryland, North Carolina SU, UC Santa Barbara
  - TU Berlin, ETH, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
- Research Institutions
  - INRIA, LBL
- Industrial Partners (predating ParLab)
  - Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, NVIDIA
38. More Accurate Algorithms
- Motivation
  - User requests, debugging
- Iterative refinement for Ax=b, least squares
  - Promise the right answer for O(n^2) additional cost
  - Talk by Jason Riedy
- Arbitrary precision versions of everything
  - Using your favorite multiple precision package
  - Talk by Yozo Hida
- Jacobi-based SVD
  - Faster than QR, can be arbitrarily more accurate
  - Drmac / Veselic
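Iterative refinement as promised above can be sketched directly: solve with a cheap (here, simulated single-precision) factorization, compute the residual at full precision, solve again for a correction, and repeat. Plain-Python sketch with no libraries; all helper names are illustrative, and `f32` rounds through IEEE single to simulate the low-precision solver:

```python
import struct

def f32(v):
    """Round a double to the nearest IEEE single (simulated low precision)."""
    return struct.unpack('f', struct.pack('f', v))[0]

def solve(A, b):
    """Gaussian elimination with partial pivoting (double precision)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for j in range(n):
        p = max(range(j, n), key=lambda i: abs(M[i][j]))
        M[j], M[p] = M[p], M[j]
        for i in range(j + 1, n):
            f = M[i][j] / M[j][j]
            for k in range(j, n + 1):
                M[i][k] -= f * M[j][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def refine(A, b, steps=5):
    A32 = [[f32(v) for v in row] for row in A]            # the "cheap" solver's matrix
    x = solve(A32, [f32(v) for v in b])                   # low-precision solve
    for _ in range(steps):
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))   # residual in double: O(n^2)
             for row, bi in zip(A, b)]
        d = solve(A32, r)                                 # correction, cheap again
        x = [xi + di for xi, di in zip(x, d)]
    return x

# 4x4 Hilbert matrix, true solution = all ones
n = 4
A = [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]
b = [sum(row) for row in A]
x = refine(A, b)
```

For a matrix that is not too ill-conditioned relative to the low precision, each pass contracts the error, so a few O(n^2) residual-and-correct steps recover a full-precision answer.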
39. What could go into linear algebra libraries?
For all linear algebra problems
  For all matrix/problem structures
    For all data types
      For all architectures and networks
        For all programming interfaces
          Produce best algorithm(s) w.r.t. performance and accuracy (including condition estimates, etc.)
Need to prioritize, automate, enlist help!
40. What do users want? (1/2)
- Performance, ease of use, functionality, portability
- Composability
  - On multicore, expect to implement dense codes via DAG scheduling (Dongarra's PLASMA)
  - Talk by Krste Asanovic / Heidi Pan on threads
- Reproducibility
  - Made challenging by nonassociativity of floating point
- Ongoing collaborations on Driving Apps
  - Jointly analyzing needs
  - Talk by T. Keaveny on Medical Application
  - Other apps so far mostly dense and sparse linear algebra, FFTs
    - Some interesting structured needs emerging
41. What do users want? (2/2)
- DOE/NSF User Survey
  - Small but interesting sample at www.netlib.org/lapack-dev
  - What matrix sizes do you care about?
    - 1000s: 34%
    - 10,000s: 26%
    - 100,000s or 1Ms: 26%
  - How many processors, on distributed memory?
    - >10: 34%, >100: 31%, >1000: 19%
  - Do you use more than double precision?
    - Sometimes or frequently: 16%
- New graduate program in CSE with 106 faculty from 18 departments
  - New needs may emerge
42. Highlights of New Dense Functionality
- Updating / downdating of factorizations
  - Stewart, Langou
- More generalized SVDs
  - Bai, Wang
- More generalized Sylvester/Lyapunov eqns
  - Kågström, Jonsson, Granat
- Structured eigenproblems
  - Selected matrix polynomials
  - Mehrmann
43. Organizing Linear Algebra
www.netlib.org/lapack
www.netlib.org/scalapack
gams.nist.gov
www.cs.utk.edu/dongarra/etemplates
www.netlib.org/templates
44. Improved Ease of Use
A \ B
CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )
CALL PDGESVX( FACT, TRANS, N, NRHS, A, IA, JA, DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO )
45. Ease of Use: One approach
- Easy interfaces vs access to details
  - Some users want access to all details, because
    - Peak performance matters
    - Control over memory allocation
  - Other users want simpler interface
    - Automatic allocation of workspace
  - No universal agreement across systems on easiest interface
    - Leave decision to higher level packages
- Keep expert driver / simple driver / computational routines
- Add wrappers for other languages
  - Fortran95, Java, Matlab, Python, even C
  - Automatic allocation of workspace
- Add wrappers to convert to best parallel layout
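One way to keep the expert driver while adding the "A \ B" feel is exactly the wrapper pattern above: the simple driver allocates workspace and fills in defaults, while the expert routine keeps every knob. A hypothetical sketch in plain Python; `expert_solve` is a toy Gaussian elimination standing in for a routine like PDGESVX, and all names here are illustrative:

```python
def expert_solve(A, b, work, equilibrate=False):
    """Expert driver: caller supplies workspace and every option."""
    n = len(A)
    if len(work) < n * (n + 1):
        raise ValueError("workspace too small: need n*(n+1)")
    # Pack the (optionally row-equilibrated) augmented system into the
    # caller-provided workspace, LAPACK-style.
    for i in range(n):
        scale = 1.0 / max(abs(v) for v in A[i]) if equilibrate else 1.0
        for j in range(n):
            work[i * (n + 1) + j] = A[i][j] * scale
        work[i * (n + 1) + n] = b[i] * scale
    M = [work[i * (n + 1):(i + 1) * (n + 1)] for i in range(n)]
    for j in range(n):                      # elimination with partial pivoting
        p = max(range(j, n), key=lambda i: abs(M[i][j]))
        M[j], M[p] = M[p], M[j]
        for i in range(j + 1, n):
            f = M[i][j] / M[j][j]
            for k in range(j, n + 1):
                M[i][k] -= f * M[j][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):          # back substitution
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def solve(A, b, **options):
    """Simple driver: allocate workspace, pick safe defaults,
    forward any expert options the caller does care about."""
    n = len(A)
    return expert_solve(A, b, work=[0.0] * (n * (n + 1)), **options)

print(solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 4.0]))  # [1.0, 1.0]
```

The expert path stays available for users who care about memory placement; everyone else never sees the workspace argument.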
47. Some goals for the meeting
- Introduce ParLab
- Describe numerical library efforts in detail
- Exchange information
  - User needs, tools, goals
- Identify opportunities for collaboration
48. Summary of other talks (1)
- Monday, April 28 (531 Cory)
  - 12:00 - 12:45 Jim Demmel - Overview of Par Lab / Numerical Libraries
  - 12:45 - 1:00 Avneesh Sud (Microsoft) - Introduction to library effort at Microsoft
  - 1:00 - 1:45 Sam Williams / Ankit Jain - Tuning sparse matrix-vector multiply / Parallel OSKI
  - 1:45 - 1:50 Break
  - 1:50 - 2:20 Marghoob Mohiyuddin - Avoiding communication in SpMV-like kernels
  - 2:20 - 2:50 Mark Hoemmen - Avoiding communication in Krylov subspace methods
  - 2:50 - 3:00 Break
  - 3:00 - 3:30 Rajesh Nishtala - Tuning collective communication
  - 3:30 - 4:00 Yozo Hida - High accuracy linear algebra
  - 4:00 - 4:25 Jason Riedy - Iterative refinement in linear algebra
  - 4:25 - 4:30 Break
  - 4:30 - 5:00 Tony Keaveny - Medical image analysis in Par Lab
  - 5:00 - 5:30 Ras Bodik / Armando Solar-Lezama - Program synthesis by sketching
  - 5:30 - 6:00 Vasily Volkov - Linear algebra on GPUs
49. Summary of other talks (2)
- Tuesday, April 29 (Wozniak Lounge)
  - 9:00 - 10:00 Kathy Yelick - Programming systems for Par Lab
  - 10:00 - 10:30 Kaushik Datta - Tuning stencils
  - 10:30 - 11:00 Xiaoye Li - Parallel sparse LU factorization
  - 11:00 - 11:30 Osni Marques - Parallel symmetric eigensolvers
  - 11:30 - 12:00 Krste Asanovic / Heidi Pan - Thread system
50. Extra Slides
51. P.S. Parallel Revolution May Fail
- John Hennessy, President, Stanford University, 1/07: "...when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. ... I would be panicked if I were in industry."
  - "A Conversation with Hennessy & Patterson," ACM Queue Magazine, 4(10), 1/07
- 100% failure rate of Parallel Computer Companies
  - Convex, Encore, Inmos (Transputer), MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Thinking Machines, ...
- What if IT goes from a growth industry to a replacement industry?
  - If SW can't effectively use 32, 64, ... cores per chip → SW no faster on new computer → only buy if computer wears out
52. 5 Themes of Par Lab
- Applications
  - Compelling apps drive top-down research agenda
- Identify Common Computational Patterns
  - Breaking through disciplinary boundaries
- Developing Parallel Software with Productivity, Efficiency, and Correctness
  - 2 Layers + Coordination & Composition Language + Autotuning
- OS and Architecture
  - Composable primitives, not packaged solutions
  - Deconstruction, fast barrier synchronization, partitions
- Diagnosing Power/Performance Bottlenecks
54. Compelling Laptop/Handheld Apps (David Wessel)
- Musicians have an insatiable appetite for computation
  - More channels, instruments, more processing, more interaction!
  - Latency must be low (5 ms)
  - Must be reliable (no clicks)
- Music Enhancer
  - Enhanced sound delivery systems for home sound systems using large microphone and speaker arrays
  - Laptop/Handheld recreate 3D sound over ear buds
- Hearing Augmenter
  - Laptop/Handheld as accelerator for hearing aid
- Novel Instrument User Interface
  - New composition and performance systems beyond keyboards
  - Input device for Laptop/Handheld
- Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: 10-inch-diameter icosahedron incorporating 120 tweeters.
55. Content-Based Image Retrieval (Kurt Keutzer)
- Pipeline: Query by example → Similarity Metric → Candidate Results → Relevance Feedback → Final Result, over an Image Database of 1000s of images
- Built around key characteristics of personal databases
  - Very large number of pictures (>5K)
  - Non-labeled images
  - Many pictures of few people
  - Complex pictures including people, events, places, and objects
56. Coronary Artery Disease (Tony Keaveny)
- [Figure: artery before and after treatment]
- Modeling to help patient compliance?
  - 450k deaths/year, 16M w. symptoms, 72M w. high BP
- Massively parallel, real-time variations
  - CFD + FE solid (non-linear), fluid (Newtonian), pulsatile
  - Blood pressure, activity, habitus, cholesterol
57. Compelling Laptop/Handheld Apps (Nelson Morgan)
- Meeting Diarist
  - Laptops/handhelds at meeting coordinate to create speaker-identified, partially transcribed text diary of meeting
- Teleconference speaker identifier, speech helper
  - L/Hs used for teleconference, identifies who is speaking, "closed caption" hint of what is being said
58. Compelling Laptop/Handheld Apps
- Health Coach
  - Since laptop/handheld always with you: record images of all meals, weigh plate before and after, analyze calories consumed so far
    - "What if I order a pizza for my next meal? A salad?"
  - Since laptop/handheld always with you: record amount of exercise so far, show how body would look if you maintain this exercise and diet pattern for the next 3 months
    - "What would I look like if I regularly ran less? Further?"
- Face Recognizer / Name Whisperer
  - Laptop/handheld scans faces, matches image database, whispers name in ear (relies on Content-Based Image Retrieval)
59. Parallel Browser (Ras Bodik)
- Goal: desktop-quality browsing on handhelds
  - Enabled by 4G networks, better output devices
- Bottlenecks to parallelize
  - Parsing, Rendering, Scripting
- SkipJax
  - Parallel replacement for JavaScript/AJAX
  - Based on Brown's FlapJax
60. Theme 2: Use design patterns instead of benchmarks? (Kurt Keutzer)
- How invent parallel systems of the future when tied to old code, programming models, CPUs of the past?
- Look for common design patterns (see "A Pattern Language," Christopher Alexander, 1975)
  - Embedded Computing (42 EEMBC benchmarks)
  - Desktop/Server Computing (28 SPEC2006)
  - Data Base / Text Mining Software
  - Games/Graphics/Vision
  - Machine Learning
  - High Performance Computing (Original 7 Dwarfs)
- Result: 13 Motifs (use "motif" instead of "dwarf" when going from 7 to 13)
61. "Motif" Popularity (Red Hot → Blue Cool)
- How do compelling apps relate to 13 motifs?
  [Figure: heat map of which motifs each application uses]
62. 4 Valuable Roles of Motifs
- "Anti-benchmarks"
  - Motifs not tied to code or language artifacts → encourage innovation in algorithms, languages, data structures, and/or hardware
- Universal, understandable vocabulary, at least at high level
  - To talk across disciplinary boundaries
- Bootstrapping: "parallelize parallel research"
  - Allow analysis of HW & SW design without waiting years for full apps
- Targets for libraries (see later)
63. Themes 1 and 2 Summary
- Application-Driven Research (top down) vs. CS Solution-Driven Research (bottom up)
- Drill down on (initially) 5 app areas to guide research agenda
- Motifs to guide design of apps, and implement via programming framework per motif
- Motifs help break through traditional interfaces
  - Benchmarking, multidisciplinary conversations, parallelizing parallel research, and target for libraries
65. Theme 3: Developing Parallel SW (Kurt Keutzer and Kathy Yelick)
- Observation: use Motifs as design patterns
- Design patterns are implemented as 2 kinds of frameworks
  - Traditional numerical parallel library
  - Library where a supplied function is applied to data in parallel
- Numerical Libraries
  - Dense matrices
  - Sparse matrices
  - Spectral
  - Combinational
  - Finite state machines
- Function Application Libraries
  - MapReduce
  - Dynamic programming
  - Backtracking/B&B
  - N-Body
  - (Un)Structured Grid
  - Graph traversal, graphical models
- Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce library, mapping 1D FFTs and then transposing (generalized reduce)
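The second kind of framework above, "apply a supplied function to data in parallel," can be sketched in a few lines. A plain-Python illustration: `par_map_reduce` is an illustrative name, and a thread pool stands in for whatever parallel substrate the efficiency layer provides; the user supplies only the mapper and an associative reducer:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def par_map_reduce(data, mapper, reducer, workers=4):
    """Framework side: owns the parallelism; user supplies mapper/reducer.
    The reducer must be associative so the framework may regroup the reduction."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(mapper, data))
    return reduce(reducer, mapped)

# User side: word count over document fragments
def count_words(fragment):
    counts = {}
    for w in fragment.split():
        counts[w] = counts.get(w, 0) + 1
    return counts

def merge(a, b):
    out = dict(a)
    for w, c in b.items():
        out[w] = out.get(w, 0) + c
    return out

docs = ["to be or not to be", "be fast or be correct"]
print(par_map_reduce(docs, count_words, merge))
# {'to': 2, 'be': 4, 'or': 2, 'not': 1, 'fast': 1, 'correct': 1}
```

The separation is the whole point: the productivity-layer user writes two small sequential functions, while the framework (written once by efficiency-layer experts) decides how to schedule and regroup the work.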
67. C&C Language Requirements (Kathy Yelick)
- Applications specify C&C language requirements
- Constructs for creating application frameworks
- Primitive parallelism constructs
  - Data parallelism
  - Divide-and-conquer parallelism
  - Event-driven execution
- Constructs for composing programming frameworks
  - Frameworks require independence
  - Independence is proven at instantiation with a variety of techniques
- Needs to have low runtime overhead and ability to measure and control performance
68. Ensuring Correctness (Koushik Sen)
- Productivity Layer
  - Enforce independence of tasks using decomposition (partitioning) and copying operators
  - Goal: remove chance for concurrency errors (e.g., nondeterminism from execution order, not just low-level data races)
- Efficiency Layer: check for subtle concurrency bugs (races, deadlocks, and so on)
  - Mixture of verification and automated directed testing
  - Error detection on frameworks with sequential code as specification
  - Automatic detection of races, deadlocks
69. 21st Century Code Generation (Demmel, Yelick)
- Search space for block sizes (dense matrix):
  - Axes are block dimensions
  - Temperature is speed
- Problem: generating optimal code is like searching for a needle in a haystack
  - Manycore → even more diverse
- New approach: Auto-tuners
  - 1st generate program variations of combinations of optimizations (blocking, prefetching, ...) and data structures
  - Then compile and run to heuristically search for the best code for that computer
- Examples: PHiPAC (BLAS), Atlas (BLAS), Spiral (DSP), FFT-W (FFT)
- Example: Sparse Matrix (SpMV) for 4 multicores
  - Fastest SpMV; optimizations: BCOO vs. BCSR data structures, NUMA, 16-bit vs. 32-bit indices, ...
71. Theme 4: OS and Architecture (Krste Asanovic, John Kubiatowicz)
- Traditional OSes brittle, insecure, memory hogs
  - Traditional monolithic OS image uses lots of precious memory 100s - 1000s of times (e.g., AIX uses GBs of DRAM / CPU)
- How can novel architectural support improve productivity, efficiency, and correctness for scalable hardware?
  - "Efficiency" instead of "performance" to capture energy as well as performance
  - Other challenges: power limit, design and verification costs, low yield, higher error rates
- How prototype ideas fast enough to run real SW?
72. Deconstructing Operating Systems
- Resurgence of interest in virtual machines
  - Hypervisor: thin SW layer between guest OS and HW
- Future: OS libraries where only the functions needed are linked into the app, on top of a thin hypervisor providing protection and sharing of resources
- Opportunity for OS innovation
- Leverage HW partitioning support for very thin hypervisors, and to allow software full access to hardware within a partition
73. HW Solution: Small is Beautiful
- Want Software Composable Primitives, Not Hardware Packaged Solutions
  - "You're not going fast if you're headed in the wrong direction"
  - Transactional Memory is usually a Packaged Solution
- Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector, SIMD PEs
  - Small cores not much slower than large cores
- Parallel is energy-efficient path to performance: power ∝ CV^2F
  - Lower threshold and supply voltages lowers energy per op
- Configurable Memory Hierarchy (Cell vs. Clovertown)
  - Can configure on-chip memory as cache or local RAM
  - Programmable DMA to move data without occupying CPU
  - Cache coherence: mostly HW, but SW handlers for complex cases
  - Hardware logging of memory writes to allow rollback
74. Build Academic Manycore from FPGAs
- As ~10 CPUs will fit in a Field Programmable Gate Array (FPGA), 1000-CPU system from ~100 FPGAs?
  - 8 32-bit simple "soft core" RISC at 100MHz in 2004 (Virtex-II)
  - FPGA generations every 1.5 yrs → ~2X CPUs, ~1.2X clock rate
- HW research community does logic design ("gate shareware") to create out-of-the-box Manycore
  - E.g., 1000-processor, standard-ISA, binary-compatible, 64-bit, cache-coherent supercomputer at ~150 MHz/CPU in 2007
  - Ideal for heterogeneous chip architectures
- RAMPants: 10 faculty at Berkeley, CMU, MIT, Stanford, Texas, and Washington
- "Research Accelerator for Multiple Processors": a vehicle to lure more researchers to the parallel challenge and decrease time to parallel solution
75. 1008 Core RAMP Blue (Wawrzynek, Krasnov, ... at Berkeley)
- 1008 = 12 32-bit RISC cores / FPGA, 4 FPGAs/board, 21 boards
  - Simple MicroBlaze soft cores at 90 MHz
- Full star-connection between modules
- NASA Advanced Supercomputing (NAS) Parallel Benchmarks (all class S)
  - UPC versions (C plus shared-memory abstraction): CG, EP, IS, MG
- RAMPants creating HW & SW for manycore community using next-gen FPGAs
  - Chuck Thacker & Microsoft designing next boards
  - 3rd party to manufacture and sell boards: 1H08
  - Gateware, Software: BSD open source
- RAMP Gold for Par Lab: new CPU
77. Theme 5: Diagnosing Power/Performance Bottlenecks (Jim Demmel)
- Collect data on Power/Performance bottlenecks
  - Aid autotuner, scheduler, OS in adapting system
- Turn data into useful information that can help the efficiency-level programmer improve the system?
  - E.g., % peak power, % peak memory BW, % CPU, % network
  - E.g., sample traces of critical paths
- Turn data into useful information that can help the productivity-level programmer improve the app?
  - Where am I spending my time in my program?
  - If I change it like this, impact on Power/Performance?
  - Rely on machine learning to find correlations in data, predict Power/Performance?
78. Physical Par Lab - 5th Floor Soda
79. Impact of Automatic Performance Tuning
- Widely used in performance tuning of kernels
  - ATLAS (PHiPAC) - www.netlib.org/atlas
    - Dense BLAS, now in Matlab, many other releases
  - FFTW - www.fftw.org
    - Fast Fourier Transform and similar transforms; Wilkinson Software Prize
  - Spiral - www.spiral.net
    - Digital Signal Processing
  - Communication Collectives (UCB, UTK)
  - Rose (LLNL), Bernoulli (Cornell), Telescoping Languages (Rice), UHFFT (Houston), POET (UTSA), ...
- More projects (PERI, TOPS2, CScADS), conferences, government reports, ...