Title: Implementing Parallel CG Algorithm on the EARTH Multi-threaded Architecture

1. Implementing Parallel CG Algorithm on the EARTH Multi-threaded Architecture
- Fei Chen, Kevin Theobald, Ziang Hu, Haiping Wu, Yan Xie, Guang R. Gao
- CAPSL, Electrical and Computer Engineering, University of Delaware
- MASPLAS '03, Mid-Atlantic Student Workshop on Programming Languages and Systems, Saturday, April 26th, 2003
2. Outline
- Introduction
- Algorithm
- Scalability Results
- Conclusion
- Future Work
3. Introduction
The Conjugate Gradient (CG) method is the most popular iterative method for solving large systems of linear equations Ax = b [1].
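The CG iteration can be sketched in a few lines. The following is a generic textbook serial version in Python (my own sketch for orientation, not the Threaded-C implementation the slides describe):

```python
def cg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A (dense list-of-lists)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                       # residual r = b - A x (x starts at 0)
    p = r[:]                       # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        # Matrix-vector multiply (MVM) -- the dominant cost per iteration
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        # Vector-vector products (VVP) make up the rest
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# Example: a small 2x2 SPD system
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cg(A, b)
```

In exact arithmetic CG converges in at most n iterations; for this 2x2 system it terminates after two, with x ≈ (1/11, 7/11).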
4. Introduction (continued)
The matrix A is usually large and sparse, and as previous studies showed, the matrix-vector multiply (MVM) accounts for about 95% of the CPU time, with the remaining 5% spent on vector-vector products (VVP) [2].
Parallel CG Algorithm (figure): each iteration proceeds as follows:
- Distribute A and x among the nodes (blocks A_1 x_1, A_2 x_2, ..., A_n x_n)
- Local MVM on each node
- Reduction and broadcast of the scalar variables
- Calculate the new local vectors
- Redistribute the new local vectors
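The distribution step and the local MVM above can be simulated sequentially. In the Python sketch below the node count and row partitioning are illustrative assumptions; only the data layout of the horizontal distribution is shown, not the reduction/broadcast of scalars:

```python
def distributed_mvm(A, x, num_nodes):
    """Simulate row-block distribution of A and a per-node local MVM."""
    n = len(x)
    rows_per = n // num_nodes                  # assume num_nodes divides n
    partials = []
    for node in range(num_nodes):
        # Each "node" owns a horizontal block A_i of rows and computes A_i * x.
        block = A[node * rows_per:(node + 1) * rows_per]
        partials.append([sum(row[j] * x[j] for j in range(n)) for row in block])
    # "Redistribute": concatenate the local results into the full vector.
    return [v for part in partials for v in part]

A = [[1.0, 0, 0, 0], [0, 2.0, 0, 0], [0, 0, 3.0, 0], [0, 0, 0, 4.0]]
x = [1.0, 1.0, 1.0, 1.0]
y = distributed_mvm(A, x, num_nodes=2)
```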
5. Introduction (continued)
EARTH (Efficient Architecture for Running Threads) architecture [3]
- EARTH supports fibers, which are non-preemptive and scheduled in response to dataflow-like synchronization operations.
- Data and control dependences among those fibers are explicitly programmed (in Threaded-C) with EARTH operations [4].
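A toy Python model of the fiber idea (a deliberate simplification, not the real Threaded-C API): each fiber carries a sync count initialized to the number of inputs it waits for; each synchronization operation decrements it, and the fiber fires when the count reaches zero.

```python
class Fiber:
    """Toy model: a non-preemptive fiber gated by a dataflow-style sync count."""
    def __init__(self, sync_count, body):
        self.count = sync_count    # number of signals still awaited
        self.body = body
    def sync(self):
        self.count -= 1
        if self.count == 0:
            self.body()            # fire: run to completion, no preemption

results = []
consumer = Fiber(2, lambda: results.append("consumer ran"))
consumer.sync()                    # first producer signals: not ready yet
consumer.sync()                    # second producer signals: fiber fires
```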
6. Algorithm
- Design objectives
- Find a matrix blocking method that reduces the overall communication cost.
- Overlap communication and computation to further reduce communication cost.
- We proposed a two-dimensional pipelined method combined with the EARTH multi-threading technique.
7. Algorithm (continued)
Horizontal Blocking Method (figure: A is split into horizontal row blocks 1-4, one per node, multiplied against the vector pieces 1-4)
P is the number of nodes and N is the size of the matrix.
Inner-phase communication cost: Cn = 0
Inter-phase communication cost: Ct = P × (P − 1) × (N / P) = N(P − 1)
Overall communication cost: C = Cn + Ct = N(P − 1) ≈ NP
14. Algorithm (continued)
Vertical Blocking Method (Pipelined) (figure: A is split into vertical column blocks 1-4, one per node; each node's partial result for the whole output vector is pipelined among the nodes)
P is the number of nodes and N is the size of the matrix.
Inner-phase communication cost: Cn = (N / P) × P × P = NP
Inter-phase communication cost: Ct = 0
Overall communication cost: C = Cn + Ct = NP
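The column-block computation can be checked against a serial MVM. The sketch below simulates it sequentially in Python (illustrative only; the pipelining of partial sums among nodes is collapsed into a loop over an accumulator):

```python
def vertical_mvm(A, x, num_nodes):
    """Simulate vertical (column) blocking: node k owns column block k of A
    and the matching piece of x, and contributes a partial product to the
    whole output vector, which is accumulated as it passes down the pipeline."""
    n = len(x)
    cols_per = n // num_nodes          # assume num_nodes divides n
    y = [0.0] * n                      # accumulator travelling the pipeline
    for node in range(num_nodes):
        lo, hi = node * cols_per, (node + 1) * cols_per
        for i in range(n):
            y[i] += sum(A[i][j] * x[j] for j in range(lo, hi))
    return y

A = [[1.0, 2.0, 0, 0], [3.0, 4.0, 0, 0], [0, 0, 5.0, 6.0], [0, 0, 7.0, 8.0]]
x = [1.0, 1.0, 1.0, 1.0]
y = vertical_mvm(A, x, num_nodes=2)
```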
29. Algorithm (continued)
Two-dimensional Pipelined Method (figure: A is blocked in both dimensions over a sqrt(P) × sqrt(P) grid of nodes; blocks labeled 1-4)
P is the number of nodes and N is the size of the matrix.
Inner-phase communication cost: Cn = (N / P) × sqrt(P) × P = N × sqrt(P)
Inter-phase communication cost: Ct = (N / P) × sqrt(P) × P = N × sqrt(P)
Overall communication cost: C = Cn + Ct = 2N × sqrt(P)
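The three cost models can be put side by side. The quick Python check below uses the formulas from the slides; the N and P values are arbitrary examples:

```python
import math

# Communication-cost models for P nodes and matrix size N, from the slides:
def cost_horizontal(N, P):
    return N * (P - 1)             # C = N(P - 1), roughly NP

def cost_vertical(N, P):
    return N * P                   # C = NP

def cost_2d(N, P):
    return 2 * N * math.sqrt(P)    # C = 2 N sqrt(P)

# For P = 16, the 2-D method costs 2/sqrt(16) = 1/2 of the vertical method.
ratio = cost_2d(1000, 16) / cost_vertical(1000, 16)
```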
41. Algorithm (continued)
Multi-threading Technique (figure: the local MVM is split into two halves so that sending out the first half of the result vector overlaps computation of the second half)
- There is no data dependence between the two halves.
- When the first-half MVM finishes in the Execution Unit (EU) and the request to send out the first half of the result vector is written to the Event Queue (EQ), the second-half MVM can start on the EU immediately.
- The EARTH system dedicates a Synchronization Unit (SU) to handling communication requests across the network, hence communication and computation can be overlapped.
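The overlap described above can be mimicked with an ordinary thread standing in for the SU (illustrative Python, not Threaded-C; `send` is a caller-supplied stand-in for the network request):

```python
import threading

def mvm_rows(A, x, lo, hi):
    """Compute rows lo..hi-1 of A * x."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(lo, hi)]

def overlapped_mvm(A, x, send):
    n = len(x)
    first = mvm_rows(A, x, 0, n // 2)            # "EU" computes the first half
    t = threading.Thread(target=send, args=(first,))
    t.start()                                    # "SU" ships it out...
    second = mvm_rows(A, x, n // 2, n)           # ...while the EU computes the rest
    t.join()                                     # first half is sent by now
    send(second)
    return first + second

sent = []
y = overlapped_mvm([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], sent.append)
```

The join before the second send preserves the order in which the halves of the result vector leave the node.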
42. Scalability Results
- Test platforms
- Chiba City at Argonne National Laboratory (ANL): a cluster with 256 dual-CPU nodes connected by Fast Ethernet
- SEMi: a MANNA machine simulator [5]
- We used the same matrices as the NAS parallel CG benchmark [6]
43. Scalability Results (continued)
Threaded-C implementation scalability results on Chiba City
44. Scalability Results (continued)
NAS CG (MPI) benchmark scalability results on Chiba City
45. Scalability Results (continued)
Scalability comparison with NAS parallel CG on Chiba City
46. Scalability Results (continued)
Threaded-C implementation scalability results on SEMi
47. Conclusion
- With the two-dimensional pipelined method, the overall communication cost is reduced to 2/sqrt(P) of that of a one-dimensional blocking method (vertical or horizontal).
- The underlying EARTH system, an adaptive, event-driven multi-threaded execution model, makes it possible to overlap communication and computation in our implementation.
- Notable scalability improvement was achieved by implementing the two-dimensional pipelined method on the EARTH multi-threaded architecture.
48. Future Work
- Port the EARTH runtime system to clusters with Myrinet interconnect.
- Investigate how to use the two-dimensional pipelined method and EARTH system support to improve the performance of parallel scientific computing tools.
49. References
[1] Jonathan R. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.
[2] P. Kloos, P. Blaise, F. Mathey, OpenMP and MPI Programming with a CG Algorithm, page 5, CEA, http://www.epcc.ed.ac.uk/ewomp2000/Presentations/KloosSlides.pdf.
[3] Herbert H. J. Hum, Olivier Maquelin, Kevin B. Theobald, Xinmin Tian, Guang R. Gao, and Laurie J. Hendren, A Study of the EARTH-MANNA Multi-threaded System, International Journal of Parallel Programming, 24(4):319-347, August 1996.
[4] Kevin B. Theobald, EARTH: An Efficient Architecture for Running Threads, PhD thesis, McGill University, Montreal, Quebec, May 1999.
[5] Kevin B. Theobald, SEMi: A Simulator for EARTH, MANNA, and i860 (Version 0.23), CAPSL Technical Memo 27, March 1, 1999. ftp://ftp.capsl.udel.edu/pub/doc/memos.
[6] R. C. Agarwal, B. Alpern, L. Carter, F. G. Gustavson, D. J. Klepacki, R. Lawrence, M. Zubair, High-Performance Parallel Implementations of the NAS Kernel Benchmarks on the IBM SP2, IBM Systems Journal, Vol. 34, No. 2, 1995.