CS 267 Dense Linear Algebra: Possible Class Projects

1
CS 267 Dense Linear Algebra: Possible Class Projects
  • James Demmel
  • www.cs.berkeley.edu/~demmel/cs267_Spr09

2
Kinds of class projects
  • Try tuning existing (widely used) codes in LAPACK, ScaLAPACK, or possible future versions
  • Possible impact: help many people run faster
  • Add missing functionality to these libraries
  • Possible impact: lots of users want it
  • Experiment with algorithms on new architectures
  • Possible impact: What do we need to do differently for performance on these platforms? Are there any bottlenecks or other problems in the architecture? Could they be fixed?
  • Experiment with new software approaches
  • Possible impact: Is it easier to write these algorithms while getting most of the performance? Should we produce future versions of the libraries this way?
  • Experiment with new algorithms
  • Possible impact: Find a better one!

3
Challenges to Libraries (and parallel SW in general)
  • Minimizing communication costs
  • Cost of bandwidth and latency (to main memory or over a network) growing exponentially compared to arithmetic (a simple cost model follows this list)
  • Heterogeneous platforms
  • Different communication costs depending on destination
  • Same chip vs. different socket vs. different board
  • CPU vs. GPU
  • Perform different operations at very different rates
  • Dynamic scheduling, load balancing
  • Can't always assume each core/processor makes constant progress on your task
  • May be faster to grab the next available task than to use a predesigned, perfectly balanced schedule
  • OS may give and take away resources on the fly
  • Fault tolerance: how to recover when one processor fails
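
To make "growing exponentially compared to arithmetic" concrete, a standard latency-bandwidth cost model (the symbols are conventional choices, not from the slide) estimates running time as

    T = γ · #flops + β · #words moved + α · #messages

where γ (time per flop) improves much faster across hardware generations than β (inverse bandwidth) and α (latency), so the durable fix is to design algorithms that minimize the β and α terms, not just the flop count.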

4
Strassen's Matmul on Multicore or GPU
  • Why is there no Strassen in most libraries?
  • See "Baleful Effect of Benchmarks" by Prof. Kahan
  • Likely to be faster for modest-to-large matrix sizes
  • Where is the crossover?
  • May want a hybrid: switch to the O(n^3) algorithm below a certain size (a sketch follows this list)
  • Autotuning?
  • Lots of blocking opportunities, as for standard matmul
  • What is the least amount of data movement possible?
  • How well does it work for the rectangular matmuls in LU, QR, and Cholesky?
  • Do we need to modify LU, QR, or Cholesky to take advantage of Strassen (by using a variant that multiplies different-size matrices)?
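
A minimal NumPy sketch of this hybrid, assuming square matrices with power-of-2 dimension; the CROSSOVER value is a made-up placeholder that autotuning would have to determine, and a real project would target multicore/GPU kernels rather than NumPy:

    import numpy as np

    CROSSOVER = 128  # machine-dependent; the autotuning target

    def strassen(A, B):
        # Hybrid Strassen: recurse while blocks are large, then fall back
        # to the standard O(n^3) multiply. Assumes n is a power of 2.
        n = A.shape[0]
        if n <= CROSSOVER:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # 7 recursive multiplies instead of 8
        M1 = strassen(A11 + A22, B11 + B22)
        M2 = strassen(A21 + A22, B11)
        M3 = strassen(A11, B12 - B22)
        M4 = strassen(A22, B21 - B11)
        M5 = strassen(A11 + A12, B22)
        M6 = strassen(A21 - A11, B11 + B12)
        M7 = strassen(A12 - A22, B21 + B22)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

One way to probe the crossover question: time strassen against A @ B for n = 256, 512, 1024, ... while varying CROSSOVER.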

5
Review: Alternative recursive GE formulation
  • Toledo (1997)
  • Describe without pivoting for simplicity
  • Do left half of matrix, then right half

function [L,U] = RLU(A)                    ... assume A is m by n
  if (n = 1)
    L = A / A(1,1),  U = A(1,1)
  else
    [L1,U1] = RLU( A(1:m, 1:n/2) )         ... do left half of A
                                           ... let L11 denote the top n/2 rows of L1
    A(1:n/2, n/2+1:n) = L11^(-1) * A(1:n/2, n/2+1:n)
                                           ... update top n/2 rows of right half of A
    A(n/2+1:m, n/2+1:n) = A(n/2+1:m, n/2+1:n)
                        - A(n/2+1:m, 1:n/2) * A(1:n/2, n/2+1:n)
                                           ... update rest of right half of A
    [L2,U2] = RLU( A(n/2+1:m, n/2+1:n) )   ... do right half of A
    return [ L1, [0; L2] ] and [ U1, A(.,.); 0, U2 ]
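
For readers who want to execute the recursion, a minimal NumPy translation (out of place, no pivoting, n a power of 2; a sketch, not production code):

    import numpy as np

    def rlu(A):
        # Recursive LU, Toledo-style, without pivoting. A is m x n, m >= n.
        # Returns L (m x n, unit lower trapezoidal) and U (n x n, upper triangular).
        m, n = A.shape
        A = A.copy()
        if n == 1:
            return A / A[0, 0], A[:1, :1].copy()
        h = n // 2
        L1, U1 = rlu(A[:, :h])                       # do left half of A
        L11 = L1[:h, :]                              # top h rows of L1
        A[:h, h:] = np.linalg.solve(L11, A[:h, h:])  # update top of right half
        A[h:, h:] -= L1[h:, :] @ A[:h, h:]           # update rest of right half
        L2, U2 = rlu(A[h:, h:])                      # do right half of A
        L = np.hstack([L1, np.vstack([np.zeros((h, h)), L2])])
        U = np.vstack([np.hstack([U1, A[:h, h:]]),
                       np.hstack([np.zeros((h, h)), U2])])
        return L, U

    # Check on a random 8 x 4 matrix (without pivoting this can fail for unlucky A,
    # hence the diagonal shift):
    A = np.random.default_rng(0).standard_normal((8, 4)) + 4 * np.eye(8, 4)
    L, U = rlu(A)
    print(np.max(np.abs(L @ U - A)))   # should be ~1e-15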
6
Register-file resident Linear Algebra on GPUs
  • Vasily's results for LU, QR, and Cholesky on the GPU target single large matrices, too large to fit just in the fast memory (shared memory, registers) of the GPU
  • There is also demand for solving many smaller problems in parallel, e.g. A(i) x(i) = b(i) for many different A(1), ..., A(k) and b(1), ..., b(k)
  • Project: Design linear algebra algorithms that operate on many different matrices in parallel, each small enough to fit in the 64 KB register set of each multiprocessor
  • e.g. a single-precision square matrix of dimension n = 128 (see the batched sketch after this list)
  • Question: Does the possible need to branch differently on each multiprocessor (because of different pivot orders) matter? If so, is QR better than LU?
  • Question: Do we need BLAS3 code versions for such small matrices, or is BLAS2 enough?
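
To make the target workload concrete, a tiny NumPy sketch of the batched problem itself; NumPy only simulates the batching, the project would map one small factorization onto each GPU multiprocessor, and the sizes here are illustrative:

    import numpy as np

    k, n = 1000, 128                  # many small problems; n = 128 in single
    rng = np.random.default_rng(0)    #   precision is 128*128*4 B = 64 KB
    A = rng.standard_normal((k, n, n)).astype(np.float32)   # A(1), ..., A(k)
    b = rng.standard_normal((k, n)).astype(np.float32)      # b(1), ..., b(k)

    # Solve all k systems A(i) x(i) = b(i) as one batched call. On the GPU the
    # goal is one small factorization per multiprocessor, entirely in registers.
    x = np.linalg.solve(A, b[:, :, None])[:, :, 0]

    # Residual check over the whole batch.
    print(np.max(np.abs(np.einsum('kij,kj->ki', A, x) - b)))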

7
Extend Vasily's GPU analysis, code to ATI
  • Vasily's Best Student Paper Award from SC08 had two parts
  • Analyzed bottlenecks and speedup possibilities in the NVIDIA architecture
  • Applied the lessons to a reorganization of LU, QR, Cholesky
  • What about the ATI GPU?
  • Both of the above aspects are interesting
  • An ATI GPU is available in ParLab
  • What are the pros and cons of the ATI and NVIDIA architectures? Others?
  • Do we need to reorganize algorithms differently for each, or does one algorithm (perhaps with different block sizes or other parameters) work for both (which would be simpler)?
  • Other BLAS-like operations on the GPU
  • Needed for finite-element analysis

8
Missing Drivers in Sca/LAPACK
9
More missing drivers
10
Missing matrix types in ScaLAPACK
  • Symmetric, Hermitian, triangular
    • Band, Packed
  • Positive Definite
    • Packed
  • Orthogonal, Unitary
    • Packed

11
Tuning the data layout
Layout depends on the block size b and the processor grid Pr x Pc.
Simple layouts are easy for the user, but bad for performance.
Speedups from using a 2D processor grid range from 2x to 8x.
Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory.
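
For reference, a minimal sketch of the 2D block-cyclic map that such layout tuning ranges over (an illustration of the standard ScaLAPACK-style layout; the function name and example values are made up):

    def owner(i, j, b, Pr, Pc):
        # Grid coordinates of the processor owning matrix entry (i, j)
        # under a 2D block-cyclic layout with b x b blocks on a Pr x Pc grid.
        return (i // b) % Pr, (j // b) % Pc

    # Example: with b = 2 on a 2 x 3 grid, entry (5, 7) is in block (2, 3),
    # which maps to processor (2 % 2, 3 % 3) = (0, 0).
    print(owner(5, 7, b=2, Pr=2, Pc=3))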
12
Cost of tuning the data layout, compared to runtime
The cost of redistributing the matrix to the optimal layout is small.
Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory.
Possible project: build a wrapper that chooses the fastest layout, decides whether to convert back and forth, and hides the details from the user.
13
Parallel Eigenvalue Algorithms on GPU
  • Harder to use all-BLAS3 algorithms than for solving Ax = b or least squares
  • Symmetric eigenvalue problem for A = A^T (SVD similar)
  • Find orthogonal Q to transform A = Q T Q^T, where T = T^T is tridiagonal (nonzero on the main diagonal and right above and below it)
  • Find eigenvalues Λ = diag(λ1, ..., λn) and orthogonal eigenvectors U of T: T = U Λ U^T
  • Good parallel algorithms exist; cheaper than the first step
  • Then A = (QU) Λ (QU)^T, so the orthogonal eigenvectors are QU and the eigenvalues are Λ
  • A = Q T Q^T is the proposed challenge
  • Use Successive Band Reduction (Sun, Bischof et al.)
  • Go from A to a wide band matrix B via A = V B V^T, V orthogonal
  • All BLAS3, fast on GPU
  • Go from B to tridiagonal T via B = W T W^T, W orthogonal
  • BLAS1 and BLAS2; do it on the CPU
  • Find T = U Λ U^T as above; then A = (VWU) Λ (VWU)^T (a functional sketch of this pipeline follows this list)
  • Prospect of minimizing communication in theory
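
Using SciPy's one-stage tridiagonal reduction as a stand-in for the two-stage SBR approach proposed above (for symmetric A, Hessenberg form is tridiagonal), the whole pipeline fits in a few lines; a functional sketch only, without the BLAS3/GPU structure the project is after:

    import numpy as np
    from scipy.linalg import hessenberg, eigh_tridiagonal

    rng = np.random.default_rng(0)
    M = rng.standard_normal((300, 300))
    A = (M + M.T) / 2                        # symmetric test matrix, A = A^T

    # Step 1: reduce A to tridiagonal T with orthogonal Q, A = Q T Q^T.
    # (SBR would do this in two stages: A -> band B on the GPU, B -> T on the CPU.)
    T, Q = hessenberg(A, calc_q=True)
    d, e = np.diag(T), np.diag(T, 1)         # main and first off-diagonal of T

    # Step 2: eigendecomposition of the tridiagonal matrix, T = U Λ U^T.
    lam, U = eigh_tridiagonal(d, e)

    # Step 3: back-transform: eigenvectors of A are Q @ U, eigenvalues lam.
    V = Q @ U
    print(np.max(np.abs(A @ V - V * lam)))   # small residual => correct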

14
Experiment with PLASMA for Multicore
  • PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
  • icl.cs.utk.edu/plasma/

15
Fork-Join vs. Dynamic Execution on Multicore
[Figure: execution timelines comparing fork-join parallel BLAS with DAG-based dynamic scheduling; the DAG-based schedule finishes sooner ("time saved"). Source: Jack Dongarra. Experiments on Intel's quad-core Clovertown, 2 sockets / 8 threads.]
16
Experiment with PLASMA for Multicore
  • PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
  • icl.cs.utk.edu/plasma/
  • Experiment with PLASMA (a toy DAG-scheduling sketch follows this list)
  • Implement other factorizations
  • Compare performance
  • To LAPACK with parallel BLAS
  • To ScaLAPACK
  • Evaluate expressiveness for eigenvalue problems
  • Study the interaction of its scheduler with the higher-level scheduler being designed in ParLab
  • Can PLASMA gracefully accept and give up resources?
  • Perform analogous experiments with UPC, Titanium, or other PGAS languages
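
To illustrate the DAG idea PLASMA is built around, here is a toy right-looking tiled Cholesky expressed as dependency-driven tasks; this is my own illustration, not PLASMA's API, and the ready queue is executed serially to keep the sketch short:

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def tiled_cholesky_dag(A, nb):
        # Tasks fire as soon as their dependencies finish; the ready queue is
        # what a dynamic scheduler would hand to worker threads.
        n = A.shape[0] // nb
        def T(i, j): return A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]

        def potrf(k): T(k, k)[:] = cholesky(T(k, k), lower=True)
        def trsm(i, k): T(i, k)[:] = solve_triangular(T(k, k), T(i, k).T, lower=True).T
        def upd(i, j, k): T(i, j)[:] -= T(i, k) @ T(j, k).T
        ops = {'potrf': potrf, 'trsm': trsm, 'upd': upd}

        # Build the DAG: each task maps to the set of tasks it must wait for.
        deps = {}
        for k in range(n):
            deps[('potrf', k)] = {('upd', k, k, k - 1)} if k else set()
            for i in range(k + 1, n):
                deps[('trsm', i, k)] = ({('potrf', k)}
                                        | ({('upd', i, k, k - 1)} if k else set()))
                for j in range(k + 1, i + 1):
                    deps[('upd', i, j, k)] = ({('trsm', i, k), ('trsm', j, k)}
                                              | ({('upd', i, j, k - 1)} if k else set()))
        finished = set()
        while deps:                        # dynamic scheduling: run whatever is ready
            ready = [t for t, d in deps.items() if d <= finished]
            for t in ready:
                ops[t[0]](*t[1:])
                finished.add(t)
                del deps[t]
        return np.tril(A)

    # Usage: factor a small SPD matrix and check the residual.
    rng = np.random.default_rng(1)
    M = rng.standard_normal((8, 8))
    A = M @ M.T + 8 * np.eye(8)
    L = tiled_cholesky_dag(A.copy(), nb=2)
    print(np.max(np.abs(L @ L.T - A)))     # should be ~1e-14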

17
Investigate role of the Dense Motif in ParLab Apps
  • An initial study showed Dense Linear Algebra in the Image, Speech, and Music applications
  • Determine what is really needed
  • Functions, problem sizes, performance requirements
  • What do we still need to optimize?