The Parallel Computing Laboratory: A Research Agenda based on the Berkeley View

1
The Parallel Computing Laboratory: A Research
Agenda based on the Berkeley View
  • Krste Asanovic, Ras Bodik, Jim Demmel, Tony
    Keaveny, Kurt Keutzer, John Kubiatowicz, Edward
    Lee, Nelson Morgan, George Necula, Dave
    Patterson, Koushik Sen, John Wawrzynek, David
    Wessel, and Kathy Yelick

April 28, 2008
2
Outline
  • Overview of Par Lab
  • Motivation & Scope
  • Driving Applications
  • Need for Parallel Libraries and Frameworks
  • Parallel Libraries
  • Success Metric
  • High performance (speed and accuracy)
  • Autotuning
  • Required Functionality
  • Ease of use
  • Summary of meeting goals, other talks
  • Identify opportunities for collaboration

4
A Parallel Revolution, Ready or Not
  • Old Moore's Law is over
  • No more doubling speed of sequential code every
    18 months
  • New Moore's Law is here
  • 2X processors (cores) per chip every technology
    generation, but same clock rate
  • Sea change for HW & SW industries since changing
    the model of programming and debugging

5
"Motif" Popularity (Red Hot → Blue Cool)
  • How do compelling apps relate to 13 motifs?

7
Choosing Driving Applications
  • Who needs 100 cores to run M/S Word?
  • Need compelling apps that use 100s of cores
  • How did we pick applications?
  • Enthusiastic expert application partner, leader
    in field, promise to help design, use, evaluate
    our technology
  • Compelling in terms of likely market or social
    impact, with short term feasibility and longer
    term potential
  • Requires significant speed-up, or a smaller,
    more efficient platform to work as intended
  • As a whole, applications cover the most important
  • Platforms (handheld, laptop, games)
  • Markets (consumer, business, health)

8
Compelling Client Applications
Music/Hearing
Robust Speech Input
Parallel Browser
Personal Health
10
Par Lab Research Overview
Goal: easy to write correct programs that run
efficiently on manycore
[Diagram: Par Lab research stack, top to bottom]
  • Applications: Personal Health, Image Retrieval,
    Hearing/Music, Speech, Parallel Browser
  • Motifs
  • Productivity Layer: Composition & Coordination
    Language (CCL), CCL Compiler/Interpreter,
    Parallel Libraries, Parallel Frameworks
  • Efficiency Layer: Efficiency Languages,
    Sketching, Autotuners, Legacy Code, Schedulers,
    Communication & Synch. Primitives, Efficiency
    Language Compilers
  • Correctness: Static Verification, Type Systems,
    Directed Testing, Dynamic Checking, Debugging
    with Replay
  • Diagnosing Power/Performance
  • OS: Legacy OS, OS Libraries & Services,
    Hypervisor
  • Arch.: Multicore/GPGPU, RAMP Manycore
12
Developing Parallel Software
  • 2 types of programmers → 2 layers
  • Efficiency Layer (10% of today's programmers)
  • Expert programmers build Frameworks & Libraries,
    Hypervisors, ...
  • Bare metal efficiency possible at Efficiency
    Layer
  • Productivity Layer (90% of today's programmers)
  • Domain experts / Naïve programmers productively
    build parallel apps using frameworks & libraries
  • Frameworks & libraries composed to form app
    frameworks
  • Effective composition techniques allow the
    efficiency programmers to be highly leveraged →
    Create language for Composition and Coordination
    (C&C)
  • Talk by Kathy Yelick

15
Success Metric - Impact
  • LAPACK and ScaLAPACK are widely used
  • Adopted by Cray, Fujitsu, HP, IBM, IMSL,
    MathWorks, NAG, NEC, SGI,
  • >86M web hits @ Netlib (incl. CLAPACK, LAPACK95)
  • 35K hits/day

Xiaoye Li Sparse LU
16
High Performance (Speed and Accuracy)
  • Matching Algorithms to Architectures (8 talks)
  • Autotuning: generate fast algorithm
    automatically depending on architecture and
    problem
  • Communication-Avoiding Linear Algebra: avoiding
    latency and bandwidth costs
  • Faster Algorithms (2 talks)
  • Symmetric eigenproblem (O(n^2) instead of O(n^3))
  • Sparse LU factorization
  • More accurate algorithms (2 talks)
  • Either at usual speed, or at any cost
  • Structure-exploiting algorithms
  • Roots(p) (O(n^2) instead of O(n^3))

18
Automatic Performance Tuning
  • Writing high performance software is hard
  • Ideal: get high fraction of peak performance from
    one algorithm
  • Reality: best algorithm (and its implementation)
    can depend strongly on the problem, computer
    architecture, compiler, ...
  • Best choice can depend on knowing a lot of
    applied mathematics and computer science
  • Changes with each new hardware, compiler release
  • Goal: Automation
  • Generate and search a space of algorithms
  • Past successes: PHiPAC, ATLAS, FFTW, Spiral
  • Many conferences, DOE projects, ...
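
The "generate and search" idea can be sketched in a few lines. This is a minimal illustration, not the method used by PHiPAC, ATLAS, FFTW, or Spiral: it generates variants of a blocked matrix multiply that differ only in block size, times each one on the current machine, and keeps the fastest. The names `blocked_matmul` and `autotune` and the candidate block sizes are assumptions made for this sketch.

```python
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square blocking."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune(n, candidates, repeats=2):
    """Time every candidate block size; return (best_block, timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best_block, timings = autotune(n=48, candidates=[4, 8, 16, 48])
    for block, seconds in sorted(timings.items()):
        print(f"block={block:3d}  time={seconds:.4f}s")
    print("selected block size:", best_block)
```

The selected block size can differ from machine to machine and run to run, which is the point of searching rather than fixing one choice by hand.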

19
The Difficulty of Tuning SpMV
  • // y <-- y + A*x
  • for all nonzero A(i,j):
  •   y(i) += A(i,j) * x(j)
  • // Compressed sparse row (CSR)
  • for each row i
  •   t = 0
  •   for k = row[i] to row[i+1]-1
  •     t += A[k] * x[J[k]]
  •   y[i] = t
  • Exploit 8x8 dense blocks
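
The CSR loop above translates directly into runnable code. In this sketch, `row_ptr`, `col_idx`, and `values` play the roles of `row`, `J`, and `A` on the slide; those names and the small test matrix are assumptions for illustration. No blocking or other tuning is applied.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Return y = A*x for a matrix A stored in compressed sparse row form."""
    y = []
    for i in range(len(row_ptr) - 1):
        t = 0.0
        # Entries of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            t += values[k] * x[col_idx[k]]
        y.append(t)
    return y


if __name__ == "__main__":
    # A = [[4, 0, 1],
    #      [0, 3, 0],
    #      [2, 0, 5]]
    row_ptr = [0, 2, 3, 5]
    col_idx = [0, 2, 1, 0, 2]
    values = [4.0, 1.0, 3.0, 2.0, 5.0]
    print(spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0]))
```

The indirect access `x[col_idx[k]]` is what makes this kernel hard to tune: its memory behavior depends on the sparsity pattern, not just on the matrix size.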

20
Speedups on Itanium 2: The Need for Search
22
SpMV Performance: raefsky3
23
More surprises tuning SpMV
  • More complex example
  • Example: 3x3 blocking
  • Logical grid of 3x3 cells

24
Extra Work Can Improve Efficiency
  • More complex example
  • Example: 3x3 blocking
  • Logical grid of 3x3 cells
  • Pad with zeros
  • Fill ratio = 1.5
  • 1.5x as many flops
  • On Pentium III
  • 1.5x speedup! (2/3 time)
  • 1.5^2 = 2.25x flop rate
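
The arithmetic on this slide can be checked with a short sketch. `fill_ratio` counts how many entries are stored, padding zeros included, when a sparsity pattern is covered by aligned r x c blocks; `flop_rate_gain` combines the extra flops with the shorter run time. Both function names and the 6x6 pattern are hypothetical and chosen only to reproduce a fill ratio of 1.5.

```python
def fill_ratio(nonzeros, r, c):
    """Stored entries (including padding zeros) divided by true nonzeros
    when the matrix is stored in aligned r x c blocks."""
    blocks = {(i // r, j // c) for i, j in nonzeros}
    return len(blocks) * r * c / len(nonzeros)


def flop_rate_gain(fill, speedup):
    """Relative flop rate when doing `fill` times the flops
    in 1/speedup of the original time."""
    return fill * speedup


if __name__ == "__main__":
    # Hypothetical 6x6 pattern: 12 nonzeros that touch two 3x3 blocks,
    # so 18 entries are stored for 12 true nonzeros.
    pattern = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 0), (2, 2),
               (3, 3), (3, 4), (4, 4), (4, 5), (5, 3), (5, 5)]
    print("fill ratio:", fill_ratio(pattern, 3, 3))
    print("flop rate gain:", flop_rate_gain(1.5, 1.5))
```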

25
Autotuned Performance of SpMV
  • Clovertown was already fully populated with DIMMs
  • Gave Opteron as many DIMMs as Clovertown
  • Firmware update for Niagara2
  • Array padding to avoid inter-thread conflict
    misses
  • PPEs use 1/3 of Cell chip area

26
Autotuning SpMV
  • Large search space of possible optimizations
  • Large speed ups possible
  • Parallelism adds more!
  • Later talks
  • Sam Williams on tuning SpMV for a variety of
    multicore, other platforms
  • Ankit Jain on easy-to-use system for
    incorporating autotuning into applications
  • Kaushik Datta on tuning special case of stencils
  • Rajesh Nishtala on tuning collective
    communications
  • But don't you still have to write difficult code
    to generate search space?

27
Program Synthesis
  • Best implementation/data structure hard to write,
    identify
  • Don't do this by hand
  • Sketching: code generation using 2QBF

Spec: simple implementation (3-loop 3D stencil)
Optimized code (tiled, prefetched, time skewed)
  • Talk by Armando Solar-Lezama / Ras Bodik on
    program synthesis by sketching, applied to
    stencils

28
Communication-Avoiding Linear Algebra (CALU)
  • Exponentially growing gaps between
  • Floating point time << 1/Network BW << Network
    Latency
  • Improving 59%/year vs 26%/year vs 15%/year
  • Floating point time << 1/Memory BW << Memory
    Latency
  • Improving 59%/year vs 23%/year vs
    5.5%/year
  • Goal: reorganize linear algebra to avoid
    communication
  • Not just hiding communication (speedup ≤ 2x)
  • Arbitrary speedups possible
  • Possible for Dense and Sparse Linear Algebra
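
To see why the gaps are called exponentially growing, the annual improvement rates on this slide can be compounded. The sketch below is plain arithmetic on the slide's percentages; the function name `relative_gap` and the ten-year horizon are assumptions for illustration.

```python
def relative_gap(fast_rate, slow_rate, years):
    """Factor by which the ratio between two quantities grows after `years`,
    given their annual improvement rates (0.59 means 59% per year)."""
    return ((1.0 + fast_rate) / (1.0 + slow_rate)) ** years


if __name__ == "__main__":
    rates = {
        "flops vs network bandwidth": (0.59, 0.26),
        "flops vs network latency": (0.59, 0.15),
        "flops vs memory bandwidth": (0.59, 0.23),
        "flops vs memory latency": (0.59, 0.055),
    }
    for label, (fast, slow) in rates.items():
        gap = relative_gap(fast, slow, 10)
        print(f"{label}: gap grows about {gap:.0f}x per decade")
```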

29
CALU Summary (1/4)
  • QR or LU decomposition of m x n matrix, m >> n
  • Parallel implementation
  • Conventional: O(n log p) messages
  • New: O(log p) messages, optimal
  • Performance
  • QR: 5x faster on cluster, LU: 7x faster on
    cluster
  • Serial implementation with fast memory of size W
  • Conventional: O(mn/W) moves of data from slow
    to fast memory
  • mn/W = how many times larger matrix is than fast
    memory
  • New: O(1) moves of data
  • Performance
  • OOC QR only 2x slower than having ∞ DRAM
  • Expect gains with Multicore as well
  • Price
  • Some redundant computation (but flops are cheap!)
  • Different representation of answer for QR (tree
    structured)
  • LU stable in practice so far, but not GEPP

30
CALU Summary (2/4)
  • QR or LU decomposition of n x n matrix
  • Communication lower by factor of b = block size
  • Lots of speedup possible (modeled and measured)
  • Modeled speedups of new QR over ScaLAPACK
  • IBM Power 5 (512 procs): up to 9.7x
  • Petascale (8K procs): up to 22.9x
  • Grid (128 procs): up to 11x
  • Measured and modeled speedups of new LU over
    ScaLAPACK
  • IBM Power 5 (Bassi): up to 2.3x speedup
    (measured)
  • Cray XT4 (Franklin): up to 1.8x speedup
    (measured)
  • Petascale (8K procs): up to 80x (modeled)
  • Speed limit: Cholesky? Matmul?
  • Extends to sparse LU
  • Communication more dominant, so payoff may be
    higher
  • Speed limit: Sparse Cholesky?
  • Talk by Xiaoye Li on alternative

31
CALU Summary (3/4)
  • Take k steps of Krylov subspace method
  • GMRES, CG, Lanczos, Arnoldi
  • Assume matrix well-partitioned, with modest
    surface-to-volume ratio
  • Parallel implementation
  • Conventional: O(k log p) messages
  • New: O(log p) messages, optimal
  • Serial implementation
  • Conventional: O(k) moves of data from slow to
    fast memory
  • New: O(1) moves of data, optimal
  • Can incorporate some preconditioners
  • Need to be able to compress interactions
    between distant i, j
  • Hierarchical, semiseparable matrices
  • Lots of speedup possible (modeled and measured)
  • Price: some redundant computation
  • Talks by Marghoob Mohiyuddin, Mark Hoemmen
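
The trade of redundant computation for fewer messages can be illustrated on the simplest well-partitioned case, a constant-coefficient tridiagonal matrix. In the sketch below each block reads a ghost zone of width k from its neighbors once, then computes A x, A^2 x, ..., A^k x for its own entries without further exchange. This is a toy model under stated assumptions (1D partition, zero boundary, block size at least k, hypothetical function names), not the kernels presented in the talks.

```python
def tridiag_apply(v, lo, di, hi):
    """One SpMV with a constant-coefficient tridiagonal matrix (zero boundary)."""
    n = len(v)
    return [
        (lo * v[i - 1] if i > 0 else 0.0)
        + di * v[i]
        + (hi * v[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


def powers_reference(x, k, lo, di, hi):
    """[A x, A^2 x, ..., A^k x] computed globally, one SpMV at a time."""
    out, v = [], x
    for _ in range(k):
        v = tridiag_apply(v, lo, di, hi)
        out.append(v)
    return out


def powers_blocked(x, k, lo, di, hi, nblocks):
    """Same vectors, but each block fetches its k-wide ghost zone only once."""
    n = len(x)
    size = -(-n // nblocks)  # ceiling division
    out = [[0.0] * n for _ in range(k)]
    fetches = 0
    for start in range(0, n, size):
        stop = min(start + size, n)
        g0, g1 = max(start - k, 0), min(stop + k, n)
        fetches += int(g0 < start) + int(g1 > stop)  # one fetch per neighbor side
        local = x[g0:g1]
        for step in range(k):
            # Ghost entries near the window edge go stale one cell per step,
            # but after k steps the stale region has not reached owned entries.
            local = tridiag_apply(local, lo, di, hi)
            out[step][start:stop] = local[start - g0:stop - g0]
    return out, fetches


if __name__ == "__main__":
    x = [float(i * i % 7) for i in range(20)]
    blocked, fetches = powers_blocked(x, 3, 1.0, -2.0, 1.0, nblocks=4)
    print("matches reference:", blocked == powers_reference(x, 3, 1.0, -2.0, 1.0))
    print("ghost-zone fetches:", fetches)
```

With 4 blocks and k = 3 the sketch performs 6 ghost-zone fetches, where exchanging one ghost value per neighbor at every step would perform 18.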

32
CALU Summary (4/4)
  • Lots of related work
  • Some going back to 1960s
  • Reports discuss this comprehensively, we will not
  • Our contributions
  • Several new algorithms, improvements on old ones
  • Unifying parallel and sequential approaches to
    avoiding communication
  • Time for these algorithms has come, because of
    growing communication costs
  • Systematic examination of as much of linear
    algebra as we can
  • Why just linear algebra?

33
Linear Algebra on GPUs
  • Important part of architectural space to explore
  • Talk by Vasily Volkov
  • NVIDIA has licensed our BLAS (SGEMM)
  • Fastest implementations of dense LU, Cholesky, QR
  • 80-90% of peak
  • Require various optimizations special to GPU
  • Use CPU for BLAS1 and BLAS2, GPU for BLAS3
  • In LU, replace TRSM by TRTRI + GEMM (stable as
    GEPP)

35
Faster Algorithms (Highlights)
  • MRRR algorithm for symmetric eigenproblem
  • Talk by Osni Marques / B. Parlett / I. Dhillon
    / C. Voemel
  • 2006 SIAM Linear Algebra Prize for Parlett,
    Dhillon
  • Parallel Sparse LU
  • Talk by Xiaoye Li
  • Up to 10x faster HQR
  • R. Byers / R. Mathias / K. Braman
  • 2003 SIAM Linear Algebra Prize
  • Extensions to QZ
  • B. Kågström / D. Kressner / T. Adlerborn
  • Faster Hessenberg, tridiagonal, bidiagonal
    reductions
  • R. van de Geijn / E. Quintana-Orti
  • C. Bischof / B. Lang
  • G. Howell / C. Fulton

36
Collaborators
  • UC Berkeley
  • Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett,
    Xiaoye Li, Osni Marques, Yozo Hida, Jason
    Riedy, Vasily Volkov, Christof Voemel, David
    Bindel, undergrads
  • U Tennessee, Knoxville
  • Jack Dongarra, Julien Langou, Julie Langou, Piotr
    Luszczek, Stan Tomov, Alfredo Buttari,
    Jakub Kurzak
  • Other Academic Institutions
  • UT Austin, UC Davis, CU Denver, Florida IT,
    Georgia Tech, U Kansas, U Maryland, North
    Carolina SU, UC Santa Barbara
  • TU Berlin, ETH, U Electrocomm. (Japan), FU Hagen,
    U Carlos III Madrid, U
    Manchester, U Umeå, U Wuppertal, U Zagreb
  • Research Institutions
  • INRIA, LBL
  • Industrial Partners (predating ParLab)
  • Cray, HP, Intel, Interactive Supercomputing,
    MathWorks, NAG, NVIDIA

38
More Accurate Algorithms
  • Motivation
  • User requests, debugging
  • Iterative refinement for Ax=b, least squares
  • Promise the right answer for O(n^2) additional
    cost
  • Talk by Jason Riedy
  • Arbitrary precision versions of everything
  • Using your favorite multiple precision package
  • Talk by Yozo Hida
  • Jacobi-based SVD
  • Faster than QR, can be arbitrarily more accurate
  • Drmac / Veselic
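
Iterative refinement for Ax=b can be sketched as follows. The sketch uses a mixed-precision variant, which is an assumption and may differ from the scheme in the talk: the O(n^3) factorization is done once in simulated single precision, and each refinement step costs O(n^2), computing the residual in double precision and reusing the factors. All function names are hypothetical.

```python
import struct


def to_single(v):
    """Round a Python float to IEEE single precision."""
    return struct.unpack("f", struct.pack("f", v))[0]


def lu_factor_single(a):
    """LU with partial pivoting, every operation rounded to single precision."""
    n = len(a)
    lu = [[to_single(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = to_single(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = to_single(lu[i][j] - to_single(lu[i][k] * lu[k][j]))
    return lu, perm


def lu_solve_single(lu, perm, b):
    """Forward and back substitution in single precision."""
    n = len(lu)
    y = [to_single(b[p]) for p in perm]
    for i in range(n):
        for j in range(i):
            y[i] = to_single(y[i] - to_single(lu[i][j] * y[j]))
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] = to_single(y[i] - to_single(lu[i][j] * y[j]))
        y[i] = to_single(y[i] / lu[i][i])
    return y


def refine(a, b, steps=5):
    """Solve Ax=b: factor once in low precision, then refine in double."""
    lu, perm = lu_factor_single(a)          # O(n^3), done once
    x = lu_solve_single(lu, perm, b)
    for _ in range(steps):                  # each step is O(n^2)
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))
             for row, bi in zip(a, b)]      # residual in double precision
        d = lu_solve_single(lu, perm, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    a = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]]
    x_true = [1.0 / 3.0, -2.0 / 7.0, 0.1]
    b = [sum(aij * xj for aij, xj in zip(row, x_true)) for row in a]
    x = refine(a, b)
    print("max error:", max(abs(u - v) for u, v in zip(x, x_true)))
```

For a well-conditioned matrix like this one, a few steps recover close to double-precision accuracy from a single-precision factorization; an ill-conditioned matrix may not converge.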

39
What could go into linear algebra libraries?
For all linear algebra problems
For all matrix/problem structures
For all data types
For all architectures and networks
For all programming interfaces
Produce best algorithm(s) w.r.t.
performance and accuracy (including condition
estimates, etc)
Need to prioritize, automate, enlist help!
40
What do users want? (1/2)
  • Performance, ease of use, functionality,
    portability
  • Composability
  • On multicore, expect to implement dense codes via
    DAG scheduling (Dongarra's PLASMA)
  • Talk by Krste Asanovic / Heidi Pan on threads
  • Reproducibility
  • Made challenging by nonassociativity of floating
    point
  • Ongoing collaborations on Driving Apps
  • Jointly analyzing needs
  • Talk by T. Keaveny on Medical Application
  • Other apps so far mostly dense and sparse linear
    algebra, FFTs
  • some interesting structured needs emerging
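
The reproducibility point can be demonstrated directly: floating point addition is not associative, so a parallel reduction that changes the summation order can change the result. The sketch below is a generic illustration, not taken from the talk. It also shows that Python's `math.fsum`, which returns a correctly rounded sum on typical IEEE-754 builds, gives the same answer regardless of order.

```python
import math
import random


def sum_in_order(values):
    """Plain left-to-right floating point summation."""
    total = 0.0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    print("(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3):",
          (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

    random.seed(0)
    data = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
            for _ in range(10000)]
    shuffled = data[:]
    random.shuffle(shuffled)
    print("difference between orderings:",
          sum_in_order(data) - sum_in_order(shuffled))
    print("fsum agrees across orderings:",
          math.fsum(data) == math.fsum(shuffled))
```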

41
What do users want? (2/2)
  • DOE/NSF User Survey
  • Small but interesting sample at
    www.netlib.org/lapack-dev
  • What matrix sizes do you care about?
  • 1000s: 34%
  • 10,000s: 26%
  • 100,000s or 1Ms: 26%
  • How many processors, on distributed memory?
  • >10: 34%, >100: 31%, >1000: 19%
  • Do you use more than double precision?
  • Sometimes or frequently: 16%
  • New graduate program in CSE with 106 faculty from
    18 departments
  • New needs may emerge

42
Highlights of New Dense Functionality
  • Updating / downdating of factorizations
  • Stewart, Langou
  • More generalized SVDs
  • Bai, Wang
  • More generalized Sylvester/Lyapunov eqns
  • Kågström, Jonsson, Granat
  • Structured eigenproblems
  • Selected matrix polynomials
  • Mehrmann

43
Organizing Linear Algebra
www.netlib.org/lapack
www.netlib.org/scalapack
gams.nist.gov
www.cs.utk.edu/dongarra/etemplates
www.netlib.org/templates
44
Improved Ease of Use
  • Which do you prefer?

A \ B
CALL PDGESV( N ,NRHS, A, IA, JA, DESCA, IPIV, B,
IB, JB, DESCB, INFO)
CALL PDGESVX( FACT, TRANS, N ,NRHS, A, IA, JA,
DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C,
B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR,
BERR, WORK, LWORK, IWORK, LIWORK, INFO)
45
Ease of Use: One approach
  • Easy interfaces vs access to details
  • Some users want access to all details, because
  • Peak performance matters
  • Control over memory allocation
  • Other users want simpler interface
  • Automatic allocation of workspace
  • No universal agreement across systems on easiest
    interface
  • Leave decision to higher level packages
  • Keep expert driver / simple driver /
    computational routines
  • Add wrappers for other languages
  • Fortran95, Java, Matlab, Python, even C
  • Automatic allocation of workspace
  • Add wrappers to convert to best parallel layout
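
The split between an expert driver and a simple driver can be sketched as a wrapper. In the sketch below, `gesv_expert` imitates the calling style of a low-level routine (flat column-major arrays, leading dimensions, caller-supplied pivot array, integer `info` return), and `solve` is a simple driver that allocates everything itself. Both functions are hypothetical illustrations written for this example; they are not the LAPACK or ScaLAPACK interfaces.

```python
def gesv_expert(n, nrhs, a, lda, ipiv, b, ldb):
    """Hypothetical expert-style driver: solves A X = B in place.

    a and b are flat column-major lists that are overwritten; ipiv receives
    the pivot rows. Returns info: 0 on success, k > 0 if pivot k is zero.
    """
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(a[r + k * lda]))
        ipiv[k] = p
        if a[p + k * lda] == 0.0:
            return k + 1
        if p != k:
            for j in range(n):
                a[k + j * lda], a[p + j * lda] = a[p + j * lda], a[k + j * lda]
            for j in range(nrhs):
                b[k + j * ldb], b[p + j * ldb] = b[p + j * ldb], b[k + j * ldb]
        for i in range(k + 1, n):
            m = a[i + k * lda] / a[k + k * lda]
            for j in range(k, n):
                a[i + j * lda] -= m * a[k + j * lda]
            for j in range(nrhs):
                b[i + j * ldb] -= m * b[k + j * ldb]
    for j in range(nrhs):  # back substitution on the upper triangle
        for i in reversed(range(n)):
            s = b[i + j * ldb]
            for c in range(i + 1, n):
                s -= a[i + c * lda] * b[c + j * ldb]
            b[i + j * ldb] = s / a[i + i * lda]
    return 0


def solve(A, B):
    """Simple driver: copies inputs, allocates workspace, checks info."""
    n, nrhs = len(A), len(B[0])
    a = [A[i][j] for j in range(n) for i in range(n)]      # column-major copy
    b = [B[i][j] for j in range(nrhs) for i in range(n)]
    info = gesv_expert(n, nrhs, a, n, [0] * n, b, n)
    if info != 0:
        raise ValueError(f"matrix is singular at pivot {info}")
    return [[b[i + j * n] for j in range(nrhs)] for i in range(n)]


if __name__ == "__main__":
    print(solve([[2.0, 1.0], [1.0, 3.0]], [[3.0], [5.0]]))
```

Users who need control over memory and layout call the expert routine directly; everyone else gets a one-line call, at the cost of extra copies.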

47
Some goals for the meeting
  • Introduce ParLab
  • Describe numerical library efforts in detail
  • Exchange information
  • User needs, tools, goals
  • Identify opportunities for collaboration

48
Summary of other talks (1)
  • Monday, April 28 (531 Cory)
  • 12:00 - 12:45 Jim Demmel - Overview of PAR Lab /
    Numerical Libraries
  • 12:45 - 1:00 Avneesh Sud (Microsoft) -
    Introduction to library effort at Microsoft
  • 1:00 - 1:45 Sam Williams / Ankit Jain - Tuning
    Sparse-matrix-vector multiply / Parallel OSKI
  • 1:45 - 1:50 Break
  • 1:50 - 2:20 Marghoob Mohiyuddin - Avoiding
    Communication in SpMV-like kernels
  • 2:20 - 2:50 Mark Hoemmen - Avoiding communication
    in Krylov Subspace Methods
  • 2:50 - 3:00 Break
  • 3:00 - 3:30 Rajesh Nishtala - Tuning collective
    communication
  • 3:30 - 4:00 Yozo Hida - High accuracy linear
    algebra
  • 4:00 - 4:25 Jason Riedy - Iterative Refinement in
    linear algebra
  • 4:25 - 4:30 Break
  • 4:30 - 5:00 Tony Keaveny - Medical Image Analysis
    in PAR Lab
  • 5:00 - 5:30 Ras Bodik / Armando Solar-Lezama -
    Program synthesis by Sketching
  • 5:30 - 6:00 Vasily Volkov - Linear Algebra on
    GPUs

49
Summary of other talks (2)
  • Tuesday, April 29 (Wozniak Lounge)
  • 9:00 - 10:00 Kathy Yelick - Programming Systems
    for PAR Lab
  • 10:00 - 10:30 Kaushik Datta - Tuning Stencils
  • 10:30 - 11:00 Xiaoye Li - Parallel sparse LU
    factorization
  • 11:00 - 11:30 Osni Marques - Parallel Symmetric
    Eigensolvers
  • 11:30 - 12:00 Krste Asanovic / Heidi Pan - Thread
    system

50
Extra Slides
51
P.S. Parallel Revolution May Fail
  • John Hennessy, President, Stanford University,
    1/07: "when we start talking about parallelism
    and ease of use of truly parallel computers,
    we're talking about a problem that's as hard as
    any that computer science has faced. I would
    be panicked if I were in industry."
  • "A Conversation with Hennessy & Patterson," ACM
    Queue Magazine, 4:10, 1/07.
  • 100% failure rate of Parallel Computer Companies
  • Convex, Encore, Inmos (Transputer), MasPar,
    NCUBE, Kendall Square Research, Sequent,
    (Silicon Graphics), Thinking Machines, ...
  • What if IT goes from a growth industry to
    a replacement industry?
  • If SW can't effectively use 32, 64, ... cores
    per chip → SW no faster on new computer → Only
    buy if computer wears out

52
5 Themes of Par Lab
  • Applications
  • Compelling apps drive top-down research agenda
  • Identify Common Computational Patterns
  • Breaking through disciplinary boundaries
  • Developing Parallel Software with Productivity,
    Efficiency, and Correctness
  • 2 Layers + Coordination & Composition Language
    + Autotuning
  • OS and Architecture
  • Composable primitives, not packaged solutions
  • Deconstruction, Fast barrier synchronization,
    Partitions
  • Diagnosing Power/Performance Bottlenecks

54
Compelling Laptop/Handheld Apps (David Wessel)
  • Musicians have an insatiable appetite for
    computation
  • More channels, instruments, more processing,
    more interaction!
  • Latency must be low (5 ms)
  • Must be reliable (No clicks)
  • Music Enhancer
  • Enhanced sound delivery systems for home sound
    systems using large microphone and speaker arrays
  • Laptop/Handheld recreate 3D sound over ear buds
  • Hearing Augmenter
  • Laptop/Handheld as accelerator for hearing aid
  • Novel Instrument User Interface
  • New composition and performance systems beyond
    keyboards
  • Input device for Laptop/Handheld

Berkeley Center for New Music and Audio
Technology (CNMAT) created a compact loudspeaker
array: a 10-inch-diameter icosahedron
incorporating 120 tweeters.
55
Content-Based Image Retrieval (Kurt Keutzer)
Relevance Feedback
Query by example
Similarity Metric
Candidate Results
Image Database
Final Result
  • Built around Key Characteristics of personal
    databases
  • Very large number of pictures (>5K)
  • Non-labeled images
  • Many pictures of few people
  • Complex pictures including people, events,
    places, and objects

1000s of images
56
Coronary Artery Disease (Tony Keaveny)
After
Before
  • Modeling to help patient compliance?
  • 450k deaths/year, 16M w. symptom, 72M?BP
  • Massively parallel, Real-time variations
  • CFD FE solid (non-linear), fluid (Newtonian),
    pulsatile
  • Blood pressure, activity, habitus, cholesterol

57
Compelling Laptop/Handheld Apps (Nelson Morgan)
  • Meeting Diarist
  • Laptops/ Handhelds at meeting coordinate to
    create speaker identified, partially transcribed
    text diary of meeting
  • Teleconference speaker identifier, speech
    helper
  • L/Hs used for teleconference, identifies who is
    speaking, closed caption hint of what being
    said

58
Compelling Laptop/Handheld Apps
  • Health Coach
  • Since laptop/handheld always with you, Record
    images of all meals, weigh plate before and
    after, analyze calories consumed so far
  • What if I order a pizza for my next meal? A
    salad?
  • Since laptop/handheld always with you, record
    amount of exercise so far, show how body would
    look if maintain this exercise and diet pattern
    next 3 months
  • What would I look like if I regularly ran less?
    Further?
  • Face Recognizer/Name Whisperer
  • Laptop/handheld scans faces, matches image
    database, whispers name in ear (relies on Content
    Based Image Retrieval)

59
Parallel Browser (Ras Bodik)
  • Goal: Desktop-quality browsing on handhelds
  • Enabled by 4G networks, better output devices
  • Bottlenecks to parallelize
  • Parsing, Rendering, Scripting
  • SkipJax
  • Parallel replacement for JavaScript/AJAX
  • Based on Brown's FlapJax

60
Theme 2. Use design patterns instead of
benchmarks? (Kurt Keutzer)
  • How invent parallel systems of future when tied
    to old code, programming models, CPUs of the
    past?
  • Look for common design patterns (see A Pattern
    Language, Christopher Alexander, 1975)
  • Embedded Computing (42 EEMBC benchmarks)
  • Desktop/Server Computing (28 SPEC2006)
  • Data Base / Text Mining Software
  • Games/Graphics/Vision
  • Machine Learning
  • High Performance Computing (Original 7 Dwarfs)
  • Result: 13 "Motifs" (use "motif" instead of
    "dwarf" when going from 7 to 13)

62
4 Valuable Roles of Motifs
  • Anti-benchmarks
  • Motifs not tied to code or language artifacts →
    encourage innovation in algorithms, languages,
    data structures, and/or hardware
  • Universal, understandable vocabulary, at least at
    high level
  • To talk across disciplinary boundaries
  • Bootstrapping: "Parallelize" parallel research
  • Allow analysis of HW & SW design without waiting
    years for full apps
  • Targets for libraries (see later)

63
Themes 1 and 2 Summary
  • Application-Driven Research (top down) vs. CS
    Solution-Driven Research (bottom up)
  • Drill down on (initially) 5 app areas to guide
    research agenda
  • Motifs to guide design of apps and implement via
    programming framework per motif
  • Motifs help break through traditional interfaces
  • Benchmarking, multidisciplinary conversations,
    parallelizing parallel research, and target for
    libraries

65
Theme 3: Developing Parallel SW (Kurt Keutzer
and Kathy Yelick)
  • Observation: Use Motifs as design patterns
  • Design patterns are implemented as 2 kinds of
    frameworks
  • Traditional numerical parallel library
  • Library that applies a supplied function to data
    in parallel
  • Numerical Libraries
  • Dense matrices
  • Sparse matrices
  • Spectral
  • Combinational
  • Finite state machines
  • Function Application Libraries
  • MapReduce
  • Dynamic programming
  • Backtracking/B&B
  • N-Body
  • (Un) Structured Grid
  • Graph traversal, graphical models
  • Computations may be viewed at multiple levels
    e.g., an FFT library may be built by
    instantiating a Map-Reduce library, mapping 1D
    FFTs and then transposing (generalized reduce)

67
C&C Language Requirements (Kathy Yelick)
  • Applications specify C&C language requirements
  • Constructs for creating application frameworks
  • Primitive parallelism constructs
  • Data parallelism
  • Divide-and-conquer parallelism
  • Event-driven execution
  • Constructs for composing programming frameworks
  • Frameworks require independence
  • Independence is proven at instantiation with a
    variety of techniques
  • Needs to have low runtime overhead and ability to
    measure and control performance

68
Ensuring Correctness (Koushik Sen)
  • Productivity Layer
  • Enforce independence of tasks using decomposition
    (partitioning) and copying operators
  • Goal: Remove chance for concurrency errors (e.g.,
    nondeterminism from execution order, not just
    low-level data races)
  • Efficiency Layer: Check for subtle concurrency
    bugs (races, deadlocks, and so on)
  • Mixture of verification and automated directed
    testing
  • Error detection on frameworks with sequential
    code as specification
  • Automatic detection of races, deadlocks

69
21st Century Code Generation (Demmel, Yelick)
  • Search space for block sizes (dense matrix)
  • Axes are block
    dimensions
  • Temperature is speed
  • Problem: generating optimal code is like
    searching for needle in a haystack
  • Manycore → even more diverse
  • New approach: Auto-tuners
  • 1st generate program variations of combinations
    of optimizations (blocking, prefetching, ...) and
    data structures
  • Then compile and run to heuristically search for
    best code for that computer
  • Examples: PHiPAC (BLAS), Atlas (BLAS), Spiral
    (DSP), FFT-W (FFT)
  • Example: Sparse Matrix (SpMV) for 4 multicores
  • Fastest SpMV Optimizations: BCOO v. BCSR data
    structures, NUMA, 16b vs. 32b indices, ...

71
Theme 4 OS and Architecture (Krste Asanovic,
John Kubiatowicz)
  • Traditional OSes brittle, insecure, memory hogs
  • Traditional monolithic OS image uses lots of
    precious memory, 100s - 1000s of times over
    (e.g., AIX uses GBs of DRAM / CPU)
  • How can novel architectural support improve
    productivity, efficiency, and correctness for
    scalable hardware?
  • "Efficiency" instead of "performance" to capture
    energy as well as performance
  • Other challenges: power limit, design and
    verification costs, low yield, higher error rates
  • How to prototype ideas fast enough to run real SW?

72
Deconstructing Operating Systems
  • Resurgence of interest in virtual machines
  • Hypervisor: thin SW layer between guest OS and HW
  • Future: OS libraries where only functions needed
    are linked into app, on top of thin hypervisor
    providing protection and sharing of resources
  • Opportunity for OS innovation
  • Leverage HW partitioning support for very thin
    hypervisors, and to allow software full access to
    hardware within partition

73
HW Solution: Small is Beautiful
  • Want Software Composable Primitives, Not
    Hardware Packaged Solutions
  • You're not going fast if you're headed in the
    wrong direction
  • Transactional Memory is usually a Packaged
    Solution
  • Expect modestly pipelined (5- to 9-stage) CPUs,
    FPUs, vector, SIMD PEs
  • Small cores not much slower than large cores
  • Parallel is energy-efficient path to
    performance: CV²F
  • Lower threshold and supply voltages lower energy
    per op
  • Configurable Memory Hierarchy (Cell v.
    Clovertown)
  • Can configure on-chip memory as cache or local
    RAM
  • Programmable DMA to move data without occupying
    CPU
  • Cache coherence: mostly HW but SW handlers for
    complex cases
  • Hardware logging of memory writes to allow
    rollback
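The CV²F bullet refers to dynamic switching power, P = C · V² · F. A small worked example in Python shows why several slower cores can be the energy-efficient route to the same throughput. The numbers are normalized and illustrative, not from the slides, and the 0.7x supply-voltage figure at half frequency is an assumption chosen only to make the arithmetic concrete.

```python
def dynamic_power(c, v, f):
    """Dynamic switching power: P = C * V^2 * F."""
    return c * v * v * f


# Normalized, made-up baseline: one core at full voltage and frequency.
C, V, F = 1.0, 1.0, 1.0
one_fast_core = dynamic_power(C, V, F)

# Two cores, each at half the frequency. ASSUMPTION: the lower clock lets
# the supply voltage drop to 0.7x. Aggregate throughput matches the
# baseline only if the work parallelizes perfectly across both cores.
two_slow_cores = 2 * dynamic_power(C, 0.7 * V, F / 2)

if __name__ == "__main__":
    print(f"one fast core : {one_fast_core:.2f}")
    print(f"two slow cores: {two_slow_cores:.2f}")
```

Under these assumptions the two-core configuration draws roughly half the dynamic power of the single fast core, because power falls with the square of the voltage while throughput is recovered through parallelism.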

74
Build Academic Manycore from FPGAs
  • As ≈ 10 CPUs will fit in Field Programmable Gate
    Array (FPGA), 1000-CPU system from ≈ 100 FPGAs?
  • 8 32-bit simple soft core RISC at 100MHz in
    2004 (Virtex-II)
  • FPGA generations every 1.5 yrs: ≈ 2X CPUs, ≈ 1.2X
    clock rate
  • HW research community does logic design (gate
    shareware) to create out-of-the-box, Manycore
  • E.g., 1000 processor, standard ISA
    binary-compatible, 64-bit, cache-coherent
    supercomputer @ ≈ 150 MHz/CPU in 2007
  • Ideal for heterogeneous chip architectures
  • RAMPants: 10 faculty at Berkeley, CMU, MIT,
    Stanford, Texas, and Washington
  • Research Accelerator for Multiple Processors
    as a vehicle to lure more researchers to
    parallel challenge and decrease time to parallel
    solution

75
1008-Core RAMP Blue (Wawrzynek, Krasnov, ... at
Berkeley)
  • 1008 = 12 32-bit RISC cores / FPGA, 4
    FPGAs/board, 21 boards
  • Simple MicroBlaze soft cores @ 90 MHz
  • Full star-connection between modules
  • NASA Advanced Supercomputing (NAS)
    Parallel Benchmarks (all class S)
  • UPC versions (C plus shared-memory abstraction)
    CG, EP, IS, MG
  • RAMPants creating HW & SW for many-core
    community using next gen FPGAs
  • Chuck Thacker & Microsoft designing next boards
  • 3rd party to manufacture and sell boards: 1H08
  • Gateware, Software: BSD open source
  • RAMP Gold for Par Lab: new CPU

76
Par Lab Research Overview
Easy to write correct programs that run
efficiently on manycore
[Figure: Par Lab research stack diagram, repeated from slide 70]
77
Theme 5: Diagnosing Power/Performance
Bottlenecks (Jim Demmel)
  • Collect data on Power/Performance bottlenecks
  • Aid autotuner, scheduler, OS in adapting system
  • Turn data into useful information that can help
    efficiency-level programmer improve system?
  • E.g., % peak power, % peak memory BW, % CPU,
    % network
  • E.g., sample traces of critical paths
  • Turn data into useful information that can help
    productivity-level programmer improve app?
  • Where am I spending my time in my program?
  • If I change it like this, impact on
    Power/Performance?
  • Rely on machine learning to find correlations in
    data, predict Power/Performance?
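For the productivity-level question "Where am I spending my time in my program?", a minimal sketch of the kind of data collection involved is shown below. `PhaseTimer` is a hypothetical helper, not a Par Lab tool; it measures wall-clock time only and says nothing about power, which would need hardware counters or external measurement.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock time per named program phase."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        t0 = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the phase raises.
            self.totals[name] += time.perf_counter() - t0

    def report(self):
        """Return each phase's share of the total measured time."""
        total = sum(self.totals.values())
        return {
            name: (t / total if total > 0 else 0.0)
            for name, t in self.totals.items()
        }


if __name__ == "__main__":
    timer = PhaseTimer()
    with timer.phase("setup"):
        data = list(range(200_000))
    with timer.phase("compute"):
        result = sum(x * x for x in data)
    for name, share in timer.report().items():
        print(f"{name:8s} {share:6.1%}")
```

Per-phase shares like these are the raw material that an autotuner, scheduler, or a learned performance model could consume.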

78
Physical Par Lab - 5th Floor Soda
79
Impact of Automatic Performance Tuning
  • Widely used in performance tuning of Kernels
  • ATLAS (PHiPAC) - www.netlib.org/atlas
  • Dense BLAS, now in Matlab, many other releases
  • FFTW - www.fftw.org
  • Fast Fourier Transform and similar transforms,
    Wilkinson Software Prize
  • Spiral - www.spiral.net
  • Digital Signal Processing
  • Communication Collectives (UCB, UTK)
  • Rose (LLNL), Bernoulli (Cornell), Telescoping
    Languages (Rice), UHFFT (Houston), POET (UTSA), ...
  • More projects (PERI, TOPS2, CScADS), conferences,
    government reports, ...