Title: Implementing Parallel CG Algorithm on the EARTH Multi-threaded Architecture

1. Implementing Parallel CG Algorithm on the EARTH Multi-threaded Architecture
- Fei Chen, Kevin Theobald, Ziang Hu, Haiping Wu, Yan Xie, Guang R. Gao
- CAPSL, Electrical and Computer Engineering, University of Delaware
- MASPLAS '03, Mid-Atlantic Student Workshop on Programming Languages and Systems, Saturday, April 26th, 2003
2. Outline
- Introduction
- Algorithm
- Scalability Results
- Conclusion
- Future Work
3. Introduction
The Conjugate Gradient (CG) method is the most popular iterative method for solving large systems of linear equations Ax = b [1].
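The CG iteration can be sketched in a few lines. The following is a generic textbook serial version in Python (my own sketch for orientation, not the Threaded-C implementation the slides describe):

```python
def cg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A (dense list-of-lists)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                       # residual r = b - A x (x starts at 0)
    p = r[:]                       # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        # Matrix-vector multiply (MVM) -- the dominant cost per iteration
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        # Vector-vector products (VVP) make up the rest
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# Example: a small 2x2 SPD system
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cg(A, b)
```

In exact arithmetic CG converges in at most n iterations; for this 2x2 system it terminates after two, with x ≈ (1/11, 7/11).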
4. Introduction (continued)
The matrix A is usually large and sparse, and as previous studies showed, the matrix-vector multiply (MVM) accounts for about 95% of the CPU time, with the remaining 5% spent on vector-vector products (VVP) [2].
Parallel CG Algorithm (figure): each iteration proceeds as follows:
- Distribute A and x among the nodes (blocks A_1 x_1, A_2 x_2, ..., A_n x_n)
- Local MVM on each node
- Reduction and broadcast of the scalar variables
- Calculate the new local vectors
- Redistribute the new local vectors
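The distribution step and the local MVM above can be simulated sequentially. In the Python sketch below the node count and row partitioning are illustrative assumptions; only the data layout of the horizontal distribution is shown, not the reduction/broadcast of scalars:

```python
def distributed_mvm(A, x, num_nodes):
    """Simulate row-block distribution of A and a per-node local MVM."""
    n = len(x)
    rows_per = n // num_nodes                  # assume num_nodes divides n
    partials = []
    for node in range(num_nodes):
        # Each "node" owns a horizontal block A_i of rows and computes A_i * x.
        block = A[node * rows_per:(node + 1) * rows_per]
        partials.append([sum(row[j] * x[j] for j in range(n)) for row in block])
    # "Redistribute": concatenate the local results into the full vector.
    return [v for part in partials for v in part]

A = [[1.0, 0, 0, 0], [0, 2.0, 0, 0], [0, 0, 3.0, 0], [0, 0, 0, 4.0]]
x = [1.0, 1.0, 1.0, 1.0]
y = distributed_mvm(A, x, num_nodes=2)
```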
5. Introduction (continued)
EARTH (Efficient Architecture for Running Threads) architecture [3]
- EARTH supports fibers, which are non-preemptive and scheduled in response to dataflow-like synchronization operations.
- Data and control dependences among those fibers are explicitly programmed (in Threaded-C) with EARTH operations [4].
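A toy Python model of the fiber idea (a deliberate simplification, not the real Threaded-C API): each fiber carries a sync count initialized to the number of inputs it waits for; each synchronization operation decrements it, and the fiber fires when the count reaches zero.

```python
class Fiber:
    """Toy model: a non-preemptive fiber gated by a dataflow-style sync count."""
    def __init__(self, sync_count, body):
        self.count = sync_count    # number of signals still awaited
        self.body = body
    def sync(self):
        self.count -= 1
        if self.count == 0:
            self.body()            # fire: run to completion, no preemption

results = []
consumer = Fiber(2, lambda: results.append("consumer ran"))
consumer.sync()                    # first producer signals: not ready yet
consumer.sync()                    # second producer signals: fiber fires
```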
6. Algorithm
- Design objectives
- Find a matrix blocking method that reduces the overall communication cost.
- Overlap communication and computation to further reduce communication cost.
- We proposed a two-dimensional pipelined method combined with the EARTH multi-threading technique.
7. Algorithm (continued)
Horizontal Blocking Method (figure: A is split into horizontal row blocks 1-4, one per node, multiplied against the vector pieces 1-4)
P is the number of nodes and N is the size of the matrix.
Inner-phase communication cost: Cn = 0
Inter-phase communication cost: Ct = P × (P − 1) × (N / P) = N(P − 1)
Overall communication cost: C = Cn + Ct = N(P − 1) ≈ NP
14. Algorithm (continued)
Vertical Blocking Method (Pipelined) (figure: A is split into vertical column blocks 1-4, one per node; each node's partial result for the whole output vector is pipelined among the nodes)
P is the number of nodes and N is the size of the matrix.
Inner-phase communication cost: Cn = (N / P) × P × P = NP
Inter-phase communication cost: Ct = 0
Overall communication cost: C = Cn + Ct = NP
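The column-block computation can be checked against a serial MVM. The sketch below simulates it sequentially in Python (illustrative only; the pipelining of partial sums among nodes is collapsed into a loop over an accumulator):

```python
def vertical_mvm(A, x, num_nodes):
    """Simulate vertical (column) blocking: node k owns column block k of A
    and the matching piece of x, and contributes a partial product to the
    whole output vector, which is accumulated as it passes down the pipeline."""
    n = len(x)
    cols_per = n // num_nodes          # assume num_nodes divides n
    y = [0.0] * n                      # accumulator travelling the pipeline
    for node in range(num_nodes):
        lo, hi = node * cols_per, (node + 1) * cols_per
        for i in range(n):
            y[i] += sum(A[i][j] * x[j] for j in range(lo, hi))
    return y

A = [[1.0, 2.0, 0, 0], [3.0, 4.0, 0, 0], [0, 0, 5.0, 6.0], [0, 0, 7.0, 8.0]]
x = [1.0, 1.0, 1.0, 1.0]
y = vertical_mvm(A, x, num_nodes=2)
```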
29. Algorithm (continued)
Two-dimensional Pipelined Method (figure: A is blocked in both dimensions over a sqrt(P) × sqrt(P) grid of nodes; blocks labeled 1-4)
P is the number of nodes and N is the size of the matrix.
Inner-phase communication cost: Cn = (N / P) × sqrt(P) × P = N × sqrt(P)
Inter-phase communication cost: Ct = (N / P) × sqrt(P) × P = N × sqrt(P)
Overall communication cost: C = Cn + Ct = 2N × sqrt(P)
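The three cost models can be put side by side. The quick Python check below uses the formulas from the slides; the N and P values are arbitrary examples:

```python
import math

# Communication-cost models for P nodes and matrix size N, from the slides:
def cost_horizontal(N, P):
    return N * (P - 1)             # C = N(P - 1), roughly NP

def cost_vertical(N, P):
    return N * P                   # C = NP

def cost_2d(N, P):
    return 2 * N * math.sqrt(P)    # C = 2 N sqrt(P)

# For P = 16, the 2-D method costs 2/sqrt(16) = 1/2 of the vertical method.
ratio = cost_2d(1000, 16) / cost_vertical(1000, 16)
```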
41. Algorithm (continued)
Multi-threading Technique (figure: the local MVM is split into two halves so that sending out the first half of the result vector overlaps computation of the second half)
- There is no data dependence between the two halves.
- When the first-half MVM finishes in the Execution Unit (EU) and the request to send out the first half of the result vector is written to the Event Queue (EQ), the second-half MVM can start on the EU immediately.
- The EARTH system dedicates a Synchronization Unit (SU) to handling communication requests across the network, hence communication and computation can be overlapped.
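The overlap described above can be mimicked with an ordinary thread standing in for the SU (illustrative Python, not Threaded-C; `send` is a caller-supplied stand-in for the network request):

```python
import threading

def mvm_rows(A, x, lo, hi):
    """Compute rows lo..hi-1 of A * x."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(lo, hi)]

def overlapped_mvm(A, x, send):
    n = len(x)
    first = mvm_rows(A, x, 0, n // 2)            # "EU" computes the first half
    t = threading.Thread(target=send, args=(first,))
    t.start()                                    # "SU" ships it out...
    second = mvm_rows(A, x, n // 2, n)           # ...while the EU computes the rest
    t.join()                                     # first half is sent by now
    send(second)
    return first + second

sent = []
y = overlapped_mvm([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], sent.append)
```

The join before the second send preserves the order in which the halves of the result vector leave the node.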
42. Scalability Results
- Test platforms
- Chiba City at Argonne National Laboratory (ANL): a cluster with 256 dual-CPU nodes connected by Fast Ethernet
- SEMi: a MANNA machine simulator [5]
- We used the same matrices as the NAS parallel CG benchmark [6]
43. Scalability Results (continued)
Threaded-C implementation scalability results on Chiba City
44. Scalability Results (continued)
NAS CG (MPI) benchmark scalability results on Chiba City
45. Scalability Results (continued)
Scalability comparison with NAS parallel CG on Chiba City
46. Scalability Results (continued)
Threaded-C implementation scalability results on SEMi
47. Conclusion
- With the two-dimensional pipelined method, the overall communication cost is reduced to 2/sqrt(P) of that of a one-dimensional blocking method (vertical or horizontal).
- The underlying EARTH system, an adaptive, event-driven multi-threaded execution model, makes it possible to overlap communication and computation in our implementation.
- Notable scalability improvement was achieved by implementing the two-dimensional pipelined method on the EARTH multi-threaded architecture.
48. Future Work
- Port the EARTH runtime system to clusters with Myrinet interconnect.
- Investigate how to use the two-dimensional pipelined method and EARTH system support to improve the performance of parallel scientific computing tools.
49. References
[1] Jonathan R. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.
[2] P. Kloos, P. Blaise, F. Mathey, OpenMP and MPI Programming with a CG Algorithm, page 5, CEA, http://www.epcc.ed.ac.uk/ewomp2000/Presentations/KloosSlides.pdf.
[3] Herbert H. J. Hum, Olivier Maquelin, Kevin B. Theobald, Xinmin Tian, Guang R. Gao, and Laurie J. Hendren, A Study of the EARTH-MANNA Multi-threaded System, International Journal of Parallel Programming, 24(4):319-347, August 1996.
[4] Kevin B. Theobald, EARTH: An Efficient Architecture for Running Threads, PhD thesis, McGill University, Montreal, Quebec, May 1999.
[5] Kevin B. Theobald, SEMi: A Simulator for EARTH, MANNA, and i860 (Version 0.23), CAPSL Technical Memo 27, March 1, 1999. ftp://ftp.capsl.udel.edu/pub/doc/memos.
[6] R. C. Agarwal, B. Alpern, L. Carter, F. G. Gustavson, D. J. Klepacki, R. Lawrence, M. Zubair, High-Performance Parallel Implementations of the NAS Kernel Benchmarks on the IBM SP2, IBM Systems Journal, Vol. 34, No. 2, 1995.