Implementing Parallel CG Algorithm on the EARTH Multithreaded Architecture - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Implementing Parallel CG Algorithm on the EARTH
Multi-threaded Architecture
  • Fei Chen, Kevin Theobald
  • Ziang Hu, Haiping Wu, Yan Xie
  • Guang R. Gao
  • CAPSL
  • Electrical and Computer Engineering
  • University of Delaware

MASPLAS '03, Mid-Atlantic Student Workshop on Programming Languages and Systems, Saturday, April 26th, 2003
2
Outline
  • Introduction
  • Algorithm
  • Scalability Results
  • Conclusion
  • Future Work

3
Introduction
The Conjugate Gradient (CG) method is the most popular iterative method for solving large systems of linear equations Ax = b [1].
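For reference, the CG iteration itself is short; the following is a minimal dense NumPy sketch of the textbook algorithm (not the Threaded-C implementation from the talk):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = np.zeros_like(b)
    r = b - A @ x                      # initial residual
    p = r.copy()                       # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p                     # the matrix-vector multiply (MVM)
        alpha = rs_old / (p @ Ap)      # step length
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r                 # vector-vector product (VVP)
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # new conjugate search direction
        rs_old = rs_new
    return x
```

In each iteration the single MVM (`A @ p`) dominates the work, which is why the talk focuses on distributing that step.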
4
Introduction (continued)
The matrix A is usually large and sparse, and as previous studies have shown, the matrix-vector multiply (MVM) accounts for about 95% of the CPU time, with the remaining 5% spent on vector-vector products (VVP) [2].
Parallel CG Algorithm
[Diagram: per-iteration structure]
  • Distribute A and x among the nodes (A1x1, A2x2, ..., Anxn)
  • Each node performs its local MVM
  • Scalar variables are combined by reduction and broadcast
  • Each node calculates its new local vectors
  • The new local vectors are redistributed
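The distribute/local-MVM/gather steps can be simulated on a single machine; a small sketch (pure NumPy, no real message passing, all names illustrative) of the row-distributed MVM:

```python
import numpy as np

def distributed_mvm(A, x, P):
    """Simulate the row-distributed MVM step: node i holds a block of
    rows A_i and the full vector x, computes y_i = A_i @ x locally,
    and the local results are gathered (here, concatenated) into the
    global result vector y."""
    blocks = np.array_split(A, P, axis=0)       # distribute A among P nodes
    local_results = [Ai @ x for Ai in blocks]   # local MVM on each "node"
    return np.concatenate(local_results)        # gather / redistribute
```

The sketch only shows the data layout; on a real cluster the gather step is exactly the communication the blocking methods below try to minimize.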
5
Introduction (continued)
EARTH (Efficient Architecture for Running Threads) architecture [3]
  • EARTH supports fibers, which are non-preemptive and scheduled in response to dataflow-like synchronization operations.
  • Data and control dependences are explicitly programmed (in Threaded-C) with EARTH operations among those fibers [4].

6
Algorithm
  • Design objective
  • Find a matrix blocking method which can reduce the overall communication cost.
  • Overlap communication and computation to further reduce the communication cost.
  • We propose a two-dimensional pipelined method combined with the EARTH multi-threading technique.

7
Algorithm (continued)
Horizontal Blocking Method
[Slides 7-13 animate this method: the matrix is partitioned into P = 4 horizontal blocks of rows, one block per node; between iterations each node must send its segment of the result vector to every other node. Repeated animation frames omitted.]
P is the number of nodes, and N is the size of the matrix.
Inner-phase communication cost: Cn = 0
Inter-phase communication cost: Ct = P × (P − 1) × (N / P) = N(P − 1)
Overall communication cost: C = Cn + Ct = N(P − 1) ≈ NP
14
Algorithm (continued)
Vertical Blocking Method (Pipelined)
[Slides 14-28 animate this method: the matrix is partitioned into P = 4 vertical blocks of columns, one block per node; within each MVM the partial result vector is passed from node to node in pipeline fashion, stage by stage. Repeated animation frames omitted.]
P is the number of nodes, and N is the size of the matrix.
Inner-phase communication cost: Cn = (N / P) × P × P = NP
Inter-phase communication cost: Ct = 0
Overall communication cost: C = Cn + Ct = NP
29
Algorithm (continued)
Two-dimensional Pipelined Method
[Slides 29-40 animate this method: the matrix is partitioned into a sqrt(P) × sqrt(P) grid of blocks, and partial results are pipelined among the blocks along both dimensions of the grid. Repeated animation frames omitted.]
P is the number of nodes, and N is the size of the matrix.
Inner-phase communication cost: Cn = (N / P) × sqrt(P) × P = N sqrt(P)
Inter-phase communication cost: Ct = (N / P) × sqrt(P) × P = N sqrt(P)
Overall communication cost: C = Cn + Ct = 2N sqrt(P)
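The overall communication costs of the three blocking schemes can be compared directly; a small sketch (assuming N is divisible by P and P is a perfect square):

```python
import math

def comm_cost_horizontal(N, P):
    """Horizontal blocking: C = N * (P - 1), roughly NP."""
    return N * (P - 1)

def comm_cost_vertical(N, P):
    """Vertical (pipelined) blocking: C = (N / P) * P * P = NP."""
    return (N // P) * P * P

def comm_cost_2d(N, P):
    """Two-dimensional pipelined: C = 2 * N * sqrt(P)."""
    return 2 * N * math.isqrt(P)
```

For N = 1024 and P = 16 the two-dimensional method costs 8192 against 16384 for vertical blocking, i.e. the 2 / sqrt(P) ratio quoted in the conclusion.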
41
Algorithm (continued)
Multi-threading Technique
[Figure: the local MVM is split into two halves so that sending the first half of the result overlaps computing the second half.]
  • There is no data dependence between the two halves.
  • When the first half of the MVM finishes in the Execution Unit (EU) and the request to send out the first half of the result vector is written to the Event Queue (EQ), the second half of the MVM can start on the EU immediately.
  • The EARTH system dedicates a Synchronization Unit (SU) to handling communication requests across the network, hence communication and computation can be overlapped.
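A sketch of this overlap using plain Python threads (illustrative only: the sender thread stands in for EARTH's SU, and all function names are hypothetical):

```python
import threading

def overlapped_mvm(compute_first, compute_second, send):
    """Compute the first half of the MVM, hand its result to a sender
    thread (playing the role of EARTH's Synchronization Unit), and
    compute the second half while the send is in flight."""
    first = compute_first()                          # first half on the "EU"
    sender = threading.Thread(target=send, args=(first,))
    sender.start()                                   # "SU" ships the result...
    second = compute_second()                        # ...while the "EU" computes
    sender.join()
    send(second)                                     # ship the second half
    return first + second
```

Because the two halves have no data dependence, the send of the first half and the computation of the second half proceed concurrently, just as in the EU/SU split described above.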

42
Scalability Results
  • Test platform
  • Chiba City at Argonne National Laboratory (ANL) - a cluster of 256 dual-CPU nodes connected with Fast Ethernet
  • SEMi - a MANNA machine simulator [5]
  • We used the same matrices as the NAS parallel CG benchmark [6]

43
Scalability Results (continued)
Threaded-C implementation scalability results on Chiba City
44
Scalability Results (continued)
NAS CG (MPI) benchmark scalability results on Chiba City
45
Scalability Results (continued)
Scalability Comparison with NAS Parallel CG on
Chiba City
46
Scalability Results (continued)
Threaded-C implementation scalability results on SEMi
47
Conclusion
  • With the two-dimensional pipelined method, the overall communication cost can be reduced to 2 / sqrt(P) of that of a one-dimensional blocking method (vertical or horizontal).
  • The underlying EARTH system, which is an adaptive, event-driven multi-threaded execution model, makes it possible to overlap communication and computation in our implementation.
  • A notable scalability improvement was achieved by implementing the two-dimensional pipelined method on the EARTH multi-threaded architecture.

48
Future Work
  • Port the EARTH runtime system to clusters with a Myrinet interconnect.
  • Investigate how to use the two-dimensional pipelined method and EARTH system support to improve the performance of parallel scientific computing tools.

49
References
[1] Jonathan R. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.
[2] P. Kloos, P. Blaise, F. Mathey, OpenMP and MPI Programming with a CG Algorithm, page 5, CEA, http://www.epcc.ed.ac.uk/ewomp2000/Presentations/KloosSlides.pdf
[3] Herbert H. J. Hum, Olivier Maquelin, Kevin B. Theobald, Xinmin Tian, Guang R. Gao, and Laurie J. Hendren, A Study of the EARTH-MANNA Multi-threaded System, International Journal of Parallel Programming, 24(4):319-347, August 1996.
[4] Kevin B. Theobald, EARTH: An Efficient Architecture for Running Threads, PhD thesis, McGill University, Montreal, Quebec, May 1999.
[5] Kevin B. Theobald, SEMi: A Simulator for EARTH, MANNA, and i860 (Version 0.23), CAPSL Technical Memo 27, March 1, 1999. ftp://ftp.capsl.udel.edu/pub/doc/memos
[6] R. C. Agarwal, B. Alpern, L. Carter, F. G. Gustavson, D. J. Klepacki, R. Lawrence, M. Zubair, High-Performance Parallel Implementations of the NAS Kernel Benchmarks on the IBM SP2, IBM Systems Journal, Vol 34, No 2, 1995.