Runtime Data Flow Scheduling of Matrix Computations
1
Runtime Data Flow Scheduling of Matrix
Computations
  • Ernie Chan

2
Motivation
  • Solving Linear Systems
  • Solve for x
  • A x = b
  • Factorize A : O(n³)
  • P A = L U
  • Forward and backward substitution : O(n²)
  • L y = P b
  • U x = y
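
The two steps above map directly onto standard LAPACK routines. A minimal sketch in C, assuming a Fortran LAPACK library is linked (e.g., -llapack); the 3×3 system is made up for illustration:

  /* Solve A x = b via LU with partial pivoting: dgetrf computes
     P A = L U, then dgetrs performs the pivoted forward and
     backward substitutions.                                       */
  #include <stdio.h>

  void dgetrf_( int* m, int* n, double* A, int* lda,
                int* ipiv, int* info );
  void dgetrs_( char* trans, int* n, int* nrhs, double* A, int* lda,
                int* ipiv, double* B, int* ldb, int* info );

  int main( void )
  {
    /* Column-major 3x3 matrix and right-hand side. */
    double A[9] = { 2, 4, 8,   1, 3, 7,   1, 3, 9 };
    double b[3] = { 4, 10, 24 };
    int    n = 3, nrhs = 1, ipiv[3], info;

    dgetrf_( &n, &n, A, &n, ipiv, &info );   /* P A = L U,  O(n^3) */
    dgetrs_( "N", &n, &nrhs, A, &n, ipiv,
             b, &n, &info );                 /* L y = P b; U x = y, O(n^2) */

    printf( "x = [ %g %g %g ]\n", b[0], b[1], b[2] );  /* 1 1 1 */
    return 0;
  }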

3
Goals
  • Programmability
  • Use tools provided by FLAME
  • Parallelism
  • Directed acyclic graph (DAG) scheduling

4
Outline
  • LU Factorization with Partial Pivoting
  • Algorithm-by-Blocks
  • SuperMatrix Runtime System
  • Performance
  • Conclusion
  • P A = L U

5
LU Factorization with Partial Pivoting
  • Formal Linear Algebra Methods Environment (FLAME)
  • High-level abstractions for expressing linear
    algebra algorithms
  • Application programming interfaces (APIs) for
    seamlessly implementing algorithms in code
  • Library of commonly used linear algebra
    operations in libflame

6
[Figure slide; no transcript]
7
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 1

[Figure: 2×2 block partition into A11, A12, A21, A22]
8
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 1

[Figure: LUpiv factors the current panel (A11; A21); A12 and A22 are untouched]
9
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 1

[Figure: PIV applies the panel's row interchanges to (A12; A22)]
10
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 1

[Figure: TRSM updates A12 with the unit lower-triangular solve A12 := inv(L11) A12]
11
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 1

[Figure: GEMM updates A22 := A22 - A21 A12]
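
These four kernels make up one iteration. A hedged sketch of iteration 1 in C using the standard LAPACK/BLAS calls (the function name, fixed block size b, and pointer arithmetic are illustrative, not the FLAME implementation):

  /* One iteration of blocked LU with partial pivoting on an n x n
     column-major matrix A (leading dimension lda, block size b).  */
  #include <cblas.h>

  void dgetrf_( int*, int*, double*, int*, int*, int* );
  void dlaswp_( int*, double*, int*, int*, int*, int*, int* );

  void lu_blocked_iter1( int n, int b, double* A, int lda, int* ipiv )
  {
    int info, one = 1, m = n, ntrail = n - b;
    double *A11 = A,                    /* b x b diagonal block     */
           *A21 = A + b,                /* blocks below the panel   */
           *A12 = A + (size_t)b * lda,  /* blocks right of A11      */
           *A22 = A12 + b;              /* trailing submatrix       */

    /* LUpiv: factor the m x b panel (A11; A21) in place.          */
    dgetrf_( &m, &b, A, &lda, ipiv, &info );

    /* PIV: apply the same row swaps to the trailing columns.      */
    dlaswp_( &ntrail, A12, &lda, &one, &b, ipiv, &one );

    /* TRSM: A12 := inv(L11) A12 (unit lower-triangular solve).    */
    cblas_dtrsm( CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                 CblasUnit, b, ntrail, 1.0, A11, lda, A12, lda );

    /* GEMM: A22 := A22 - A21 A12.                                 */
    cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                 ntrail, ntrail, b, -1.0, A21, lda, A12, lda,
                 1.0, A22, lda );
  }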
12
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 2

[Figure: 3×3 block partition, A00 through A22; work continues on the bottom-right quadrant]
13
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 2

[Figure: LUpiv factors the second panel (A11; A21)]
14
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 2

[Figure: PIV applies the row interchanges to the blocks left of the panel (A10; A20) and right of it (A12; A22)]
15
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 2

[Figure: TRSM updates A12 := inv(L11) A12]
16
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 2

[Figure: GEMM updates A22 := A22 - A21 A12]
17
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 3

[Figure: 2×2 view; only the bottom-right block A11 remains to be factored]
18
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 3

[Figure: LUpiv factors the final diagonal block]
19
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 3

[Figure: PIV applies the final row interchanges to the blocks left of the panel]
20
Outline
  • LU Factorization with Partial Pivoting
  • Algorithm-by-Blocks
  • SuperMatrix Runtime System
  • Performance
  • Conclusion
  • P A = L U

21
Algorithm-by-Blocks
  • FLASH
  • Storage-by-blocks
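
With FLASH, a matrix is stored hierarchically: the elements of the top-level object are themselves matrix objects, each owning a contiguous block. A minimal hand-rolled sketch of the idea in C (illustrative types, not the actual FLASH object):

  /* Storage-by-blocks: a matrix of submatrix descriptors, each
     holding its own contiguous b x b buffer.                      */
  #include <stdlib.h>

  typedef struct {
    int     m, n;       /* block dimensions         */
    double* buffer;     /* contiguous block storage */
  } Block;

  typedef struct {
    int    mb, nb;      /* blocks per dimension            */
    Block* blocks;      /* row-major array of mb*nb blocks */
  } BlockedMatrix;

  BlockedMatrix* create_blocked( int m, int n, int b )
  {
    BlockedMatrix* M = malloc( sizeof *M );
    M->mb = ( m + b - 1 ) / b;
    M->nb = ( n + b - 1 ) / b;
    M->blocks = malloc( (size_t)M->mb * M->nb * sizeof *M->blocks );
    for ( int i = 0; i < M->mb * M->nb; i++ ) {
      M->blocks[i].m = M->blocks[i].n = b;
      M->blocks[i].buffer = calloc( (size_t)b * b, sizeof( double ) );
    }
    return M;
  }

Because every block is a contiguous, separately identifiable unit, a task that reads or writes a block names exactly the data it touches, which is what lets a runtime infer dependencies between tasks.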

22
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );
  FLA_Part_2x1( p,    &pT,
                      &pB,            0, FLA_TOP );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */    /* ******************* */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    FLA_Repart_2x1_to_3x1( pT,      &p0,
                        /* ** */    /* ** */
                                    &p1,
                           pB,      &p2,       1, FLA_BOTTOM );
    /*------------------------------------------------------------*/

    /* Merge the diagonal block and the blocks below it into the
       current panel, then factor the panel with partial pivoting. */
    FLA_Merge_2x1( A11,
                   A21,    &AB1 );
    FLASH_LU_piv( AB1, p1 );

    /* ... pivot application, triangular solve, and trailing-matrix
       update on the remaining blocks (not shown on this slide) ... */

    /*------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,  A00, A01, /**/ A02,
                                                A10, A11, /**/ A12,
                           /* ************* */  /* ***************** */
                              &ABL, /**/ &ABR,  A20, A21, /**/ A22,
                              FLA_TL );
    FLA_Cont_with_3x1_to_2x1( &pT,      p0,
                                        p1,
                           /* ** */     /* ** */
                              &pB,      p2,     FLA_TOP );
  }
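
Here each Aij is itself a block (or hierarchy of blocks) in the FLASH storage scheme, and repartitioning peels off one block row and column per iteration. Under SuperMatrix, a call such as FLASH_LU_piv does not execute immediately; it enqueues a task whose operands record which blocks it reads and writes, deferring execution to the runtime described next.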

23
Algorithm-by-Blocks
  • LU Factorization with Partial Pivoting
  • Iteration 1

[Figure: iteration-1 tasks on the 3×3 blocked matrix: LUpiv0 factors the first block column; PIV1/PIV2 pivot and TRSM3/TRSM4 solve the first block row; GEMM5-GEMM8 update the trailing 2×2 quadrant]
24
Algorithm-by-Blocks
  • LU Factorization with Partial Pivoting
  • Iteration 2

[Figure: iteration-2 tasks: LUpiv9 factors the second block column; PIV10/PIV11 pivot, TRSM12 solves, and GEMM13 updates the remaining block]
25
Algorithm-by-Blocks
  • LU Factorization with Partial Pivoting
  • Iteration 3

[Figure: iteration-3 tasks: LUpiv14 factors the last block; PIV15/PIV16 apply the final row interchanges]
26
[Figure: the full task DAG: LUpiv0 enables PIV1/PIV2, which enable TRSM3/TRSM4 and then GEMM5-GEMM8; GEMM5/GEMM6 enable LUpiv9, followed by PIV10/PIV11, TRSM12, GEMM13, and finally LUpiv14 with PIV15/PIV16]
27
Outline
  • LU Factorization with Partial Pivoting
  • Algorithm-by-Blocks
  • SuperMatrix Runtime System
  • Performance
  • Conclusion
  • P A = L U

28
SuperMatrix Runtime System
  • Separation of Concerns
  • Analyzer
  • Decomposes subproblems into component tasks
  • Stores tasks sequentially in a global task queue
  • Internally calculates all dependencies between
    tasks, which form a directed acyclic graph (DAG),
    using only the input and output parameters of each
    task (see the sketch after this list)
  • Dispatcher
  • Spawns threads
  • Schedules and dispatches tasks to threads in parallel
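
A hedged sketch of the dependence analysis in C (illustrative data structures, not the SuperMatrix internals): a later task depends on an earlier one whenever both touch the same block and at least one of them writes it.

  /* Tasks name the blocks they read and write; comparing operand
     lists in program order yields the DAG edges (flow, anti, and
     output dependences all force an edge).                        */
  #include <stdbool.h>

  #define MAX_OPS 4

  typedef struct {
    const void* ops[MAX_OPS];     /* blocks this task touches      */
    bool        writes[MAX_OPS];  /* true if the operand is output */
    int         n_ops;
  } TaskDesc;

  static bool conflict( const TaskDesc* a, const TaskDesc* b )
  {
    for ( int i = 0; i < a->n_ops; i++ )
      for ( int j = 0; j < b->n_ops; j++ )
        if ( a->ops[i] == b->ops[j] &&
             ( a->writes[i] || b->writes[j] ) )
          return true;          /* same block, at least one write */
    return false;
  }

  /* dep[i*n + j] != 0 means task j must wait for earlier task i. */
  void analyze( const TaskDesc* tasks, int n, bool* dep )
  {
    for ( int j = 0; j < n; j++ )
      for ( int i = 0; i < j; i++ )
        dep[i * n + j] = conflict( &tasks[i], &tasks[j] );
  }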

29
SuperMatrix Runtime System
  • Dispatcher: Single Queue
  • Set of all ready and available tasks
  • FIFO, priority

[Figure: processing elements PE0, PE1, ..., PEp-1 pulling tasks from one shared queue]
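
A hedged sketch of that dispatch loop in C with OpenMP (the deck's threading model); the queue layout and all names are illustrative, not the SuperMatrix code:

  /* Single shared ready queue drained by all threads.  Each task
     tracks its unmet dependencies; finishing a task decrements its
     successors' counts, and a count reaching zero makes that
     successor ready.  The caller seeds `ready` with every task
     whose n_deps is already zero, then calls dispatch().          */
  #include <omp.h>

  #define MAX_TASKS 64

  typedef struct {
    void (*run)( int id );           /* the task body               */
    int  succ[MAX_TASKS], n_succ;    /* outgoing DAG edges          */
    int  n_deps;                     /* unmet incoming dependencies */
  } Node;

  static Node tasks[MAX_TASKS];
  static int  ready[MAX_TASKS], head = 0, tail = 0;
  static int  done = 0, n_tasks = 0;

  void dispatch( void )
  {
    #pragma omp parallel
    for ( ;; ) {
      int id = -1, finished;

      #pragma omp critical (queue)   /* pop one ready task (FIFO)   */
      {
        if ( head < tail ) id = ready[head++];
        finished = ( done == n_tasks );
      }
      if ( id < 0 ) {
        if ( finished ) break;       /* all tasks executed          */
        continue;                    /* spin until work appears     */
      }

      tasks[id].run( id );           /* execute outside the lock    */

      #pragma omp critical (queue)   /* release dependent tasks     */
      {
        for ( int s = 0; s < tasks[id].n_succ; s++ )
          if ( --tasks[ tasks[id].succ[s] ].n_deps == 0 )
            ready[tail++] = tasks[id].succ[s];
        done++;
      }
    }
  }

Swapping the FIFO pop for a priority rule (for example, by DAG height, sketched after the scheduling slide below) changes the schedule without touching the task code.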
30
[Figure: the task DAG, shown again in the context of the dispatcher]
31
SuperMatrix Runtime System
  • Lookahead
  • Schedule GEMM5 and GEMM6 tasks first so LUpiv9
    can be computed ahead in parallel with GEMM7
    and GEMM8
  • Implemented directly within the code, which
    increases complexity and detracts from
    programmability
  • High-Performance LINPACK

32
SuperMatrix Runtime System
  • Scheduling
  • Sorting tasks by their height in the DAG mimics
    lookahead (see the sketch after this list)
  • Multiple queues
  • Data affinity
  • Work stealing
  • Macroblocks
  • Tasks that overwrite more than one block at a time
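
A hedged sketch of the height heuristic in C (illustrative, not the SuperMatrix scheduler): a task's height is the length of the longest path from it to a sink of the DAG, and the dispatcher prefers the ready task with the greatest height, pulling critical-path work (the next panel factorization) forward.

  /* height(t) = 1 + max height of t's successors; sinks get 1.
     Memoized recursion over the DAG, computed once before
     dispatch begins.                                              */
  #define MAX_TASKS 64

  typedef struct {
    int succ[MAX_TASKS], n_succ;
    int height;                    /* 0 means not yet computed */
  } TaskNode;

  static int height( TaskNode* all, int id )
  {
    TaskNode* t = &all[id];
    if ( t->height == 0 ) {
      int h = 0;
      for ( int s = 0; s < t->n_succ; s++ ) {
        int hs = height( all, t->succ[s] );
        if ( hs > h ) h = hs;
      }
      t->height = 1 + h;
    }
    return t->height;
  }

  /* Priority pop: scan the ready set for the tallest task instead
     of taking them in FIFO order.                                 */
  static int pick_ready( const TaskNode* all, const int* ready,
                         int n_ready )
  {
    int best = 0;
    for ( int i = 1; i < n_ready; i++ )
      if ( all[ ready[i] ].height > all[ ready[best] ].height )
        best = i;
    return ready[best];
  }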

33
Outline
  • LU Factorization with Partial Pivoting
  • Algorithm-by-Blocks
  • SuperMatrix Runtime System
  • Performance
  • Conclusion
  • P A L U

34
Performance
  • Implementations
  • SuperMatrix + serial BLAS
  • Partial and incremental pivoting
  • LAPACK dgetrf + multithreaded BLAS
  • Multithreaded dgetrf
  • Multithreaded dgemm
  • Double-precision real floating-point arithmetic
  • Block size tuned per problem size

35
Performance
  • Target Architecture: Linux
  • 4 socket 2.3 GHz AMD Opteron Quad-Core
  • ranger.tacc.utexas.edu
  • 3936 SMP nodes
  • 16 cores per node
  • 2 MB shared L3 cache per socket
  • OpenMP
  • Intel compiler 10.1
  • BLAS
  • GotoBLAS2 1.00

36
Performance
[Performance graph]
37
Performance
[Performance graph]
38
Performance
  • Target Architecture: Windows
  • 4 socket 2.4 GHz Intel Xeon E7330 Quad-Core
  • Windows Server 2008 R2 Enterprise
  • 16 core UMA machine
  • Two 3 MB shared L2 caches per socket
  • OpenMP
  • Microsoft Visual C++ 2010
  • BLAS
  • Intel MKL 10.2

39
Performance
[Performance graph]
40
Performance
  • Results
  • SuperMatrix is competitive with GotoBLAS and MKL
  • Incremental pivoting ramps up in performance
    faster but partial pivoting provides better
    asymptotic performance
  • Linux and Windows platforms attain similar
    performance curves

41
Performance
  • Target Architecture: Windows and Linux
  • 4 socket 2.66 GHz Intel Dunnington, 24 cores
  • Windows Server 2008 R2 Enterprise
  • Red Hat 4.1.2-46
  • 16 MB shared L3 cache per socket
  • OpenMP
  • Intel compiler 11.1
  • BLAS
  • Intel MKL 11.1, 10.2

42
Performance
[Performance graphs spanning several slides]
52
Outline
  • LU Factorization with Partial Pivoting
  • Algorithm-by-Blocks
  • SuperMatrix Runtime System
  • Performance
  • Conclusion
  • P A = L U

53
Conclusion
  • Separation of Concerns
  • Programmability
  • Allows us to experiment with different scheduling
    algorithms

54
Acknowledgements
  • Andrew Chapman, Robert van de Geijn
  • I thank the other members of the FLAME team for
    their support
  • Funding
  • Microsoft
  • NSF grants
  • CCF-0540926
  • CCF-0702714

55
Conclusion
  • More Information
  • http://www.cs.utexas.edu/flame
  • Questions?
  • echan@cs.utexas.edu