Title: From EARTH to HTMT: The Evolution of a Multithreaded Architecture Model
1 From EARTH to HTMT: The Evolution of a Multithreaded Architecture Model
- Guang R. Gao
- Computer Architecture and Parallel Systems Laboratory (CAPSL)
- University of Delaware
2 Outline
- Introduction
- The EARTH Execution and Architecture Model
- The EARTH Programming Model and Threaded-C
- Application Studies and Performance Evaluation
- Related Work and Conclusions
3 Main Challenges
- Scalable high-performance parallel systems for both Class A and Class B applications
4 Challenges: The Killer Latency Problem
- Latency due to communication, synchronization, task spawning, and load balancing
[Figure: two nodes, each with components labeled P, NI, C, and M, connected through a Network]
SP2 is hard enough; PC clusters are much worse!
5 Meeting High-End Application Challenges
- Observation I: Many such applications have bad latencies, demanding good support for adaptive fine-grain parallelism
(Petaflop-2 Conference, 99-2)
6 Here Comes the Surprise! (Theobald's Ph.D. thesis, May 1999)
- Observation II: It is not necessarily too hard to generate and program fine-grain threads! However, it may be hard to statically group them into coarse-grain threads!
7 A Base Adaptive Fine-Grain Multithreaded Execution Model
- C1 (Abundance): a very large pool of threads
- C2 (Ultra-light weight): threads can be spawned as easily and as quickly as possible
- C3 (Mobility): threads can be adaptively migrated as easily and as quickly as possible
8 Motivation of the EARTH Project
How can fine-grain multithreading be exploited on a parallel system built from off-the-shelf microprocessors?
9 Two Types of Fine-Grain Threads
- A parallel function invocation
- Strand/fiber: a function body can be divided into several strands/fibers
10 Threads and Fibers
- A fiber becomes enabled once it has received all of its input signals
- An enabled fiber may be selected for execution when the required hardware resource has been allocated
- When a fiber finishes execution, a signal is sent to each destination fiber to update the corresponding sync slot
Note: the role of the strand!
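The enabling rule on this slide can be sketched as a small simulation. This is a minimal Python sketch under stated assumptions, not Threaded-C and not the EARTH runtime: the `Fiber` class, the `run_fibers` function, and the three-fiber example are hypothetical names introduced only for illustration. Each fiber carries a sync-slot counter, becomes enabled when that counter reaches zero, and signals its destination fibers when it finishes.

```python
from collections import deque


class Fiber:
    """Hypothetical fiber: a body plus a sync slot counting awaited signals."""

    def __init__(self, name, sync_count, body, targets=()):
        self.name = name
        self.sync_count = sync_count  # signals still awaited before enabling
        self.body = body              # runs to completion once started
        self.targets = list(targets)  # fibers signalled after completion


def run_fibers(fibers, initial_signals):
    """Run fibers in dependence order and return the names in firing order."""
    ready = deque()
    order = []

    def signal(name):
        fiber = fibers[name]
        fiber.sync_count -= 1         # update the destination's sync slot
        if fiber.sync_count == 0:     # all input signals received: enabled
            ready.append(fiber)

    for name in initial_signals:
        signal(name)
    while ready:
        fiber = ready.popleft()       # select an enabled fiber for execution
        fiber.body()
        order.append(fiber.name)
        for target in fiber.targets:
            signal(target)
    return order


values = {}
fibers = {
    "load_x": Fiber("load_x", 1, lambda: values.update(x=2), targets=["add"]),
    "load_y": Fiber("load_y", 1, lambda: values.update(y=3), targets=["add"]),
    "add": Fiber("add", 2, lambda: values.update(total=values["x"] + values["y"])),
}
order = run_fibers(fibers, ["load_x", "load_y"])
print(order, values["total"])  # ['load_x', 'load_y', 'add'] 5
```

The `add` fiber waits on two signals, so it cannot fire until both loads have completed, regardless of the order in which the loads are selected.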
11 The Execution Model of Fibers
- Dependence-driven firing rule for fibers
- A fiber is atomic and ultra-lightweight
- Relation to the dataflow model (Dennis72)
12 The Threaded C Language
- Threaded C: ANSI C extensions for multithreading
- Extensions include threaded functions, threaded synchronization, support for global addresses, and data transfer primitives
- Threaded C is the instruction set of the EARTH processor and a target language for high-level compilers
[Figure labels: Users; C; FORTRAN; High-Level Language Translation; Threaded C; Threaded C Compiler; EARTH Platforms]
14 An Evolutionary Path for EARTH
[Figure labels: SU-int; SU-ext; CPU / SU; CPU; SU; MANNA-dual/spn; parallel machines, PC clusters, ...; SEMi simulation platform (Theobald99)]
15 Platforms for EARTH
- MANNA: an architecture testbed from GMD and a benchmarking platform for fine-grain multithreading
- EARTH-SP2
- EARTH-Beowulf (Linux based)
- EARTH-SUN/SMP/Cluster
16 Unique Advantages of the EARTH-MANNA Platform
- We can push the OS completely out of the way!
- We can design the EARTH runtime system from a very low level up
- The invaluable experience and lessons learned from EARTH-MANNA are essential for the successful migration of the EARTH model to other platforms (e.g. the IBM SP-2 story, etc.)
19 Performance of EARTH-MANNA on N-Queens(12)
- 1,637,099 tokens are generated! On average, 30 tokens are maintained per processor
- Paraffin, protein folding, etc.
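To make the notion of a token pool concrete, here is a minimal Python sketch of N-Queens driven by a pool of fine-grain work items. It is an illustration under assumptions, not the EARTH implementation: the function name `count_queens_with_tokens` and the choice of one token per valid partial placement are hypothetical, and the pool is drained sequentially rather than spread across processors. The token count of this sketch is therefore not expected to match the figure reported on the slide.

```python
from collections import deque


def count_queens_with_tokens(n):
    """Count N-Queens solutions by draining a pool of tokens.

    Each token is a tuple of column positions, one per row filled so far.
    Returns (number of solutions, number of tokens generated).
    """
    pool = deque([()])       # start with a single token: the empty board
    tokens_generated = 1
    solutions = 0
    while pool:
        placed = pool.pop()
        row = len(placed)
        if row == n:         # every row holds a queen: one full solution
            solutions += 1
            continue
        for col in range(n):
            # A new queen at (row, col) must not share a column or a
            # diagonal with any queen already placed.
            if all(col != c and abs(col - c) != row - r
                   for r, c in enumerate(placed)):
                pool.append(placed + (col,))  # spawn a child token
                tokens_generated += 1
    return solutions, tokens_generated


solutions, tokens = count_queens_with_tokens(8)
print(solutions)  # 92 solutions for N = 8
```

Because every token is independent of its siblings, any idle processor could take one from the pool, which is what makes the benchmark a natural fit for adaptive load balancing.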
20 Summary of Recent Experimental Results (Kevin99)
- Impressive speedup and scalability (scalable even with high-overhead fine-grain parallel programs, e.g. fib)
- Enhanced programmability (N-queen-p example)
- Broad applicability
21 Experiments
- Example 1 (assorted benchmarks): fib, nqueen, paraffin, tomcatv, matrix-multiply, etc.
- Example 2: Adaptive unstructured grids
- Example 3: Wavelet computation
24 Performance of N-Queens(12) (Theobald99)
- 117.8-fold speedup on a 120-node simulation!
- 1,637,099 tokens are generated!
- On average, 30 tokens are maintained per processor
- N-Queens is a useful HTMT benchmark after all! (Phil Murkey)
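The speedup figure can be read as a parallel efficiency, that is, speedup divided by node count. The short Python sketch below works through that arithmetic for the 117.8-fold result on 120 nodes and, for comparison, the 116-fold result on 120 nodes that the deck reports for Cannon's matrix multiply. The function name `parallel_efficiency` is introduced here only for illustration.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup that was actually achieved."""
    return speedup / nodes


nqueens_efficiency = parallel_efficiency(117.8, 120)
cannon_efficiency = parallel_efficiency(116, 120)

print(f"N-Queens(12): {nqueens_efficiency:.1%}")  # prints 98.2%
print(f"Cannon:       {cannon_efficiency:.1%}")   # prints 96.7%
```

In both cases less than 4% of the ideal linear speedup is lost.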
28 Coarse-Grain Applications
- A 116-fold speedup on a 120-node machine is achieved for Cannon's matrix multiply algorithm!
- Deep software systolic-style implementation to exploit parallelism
- Fine-grain mechanisms
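Cannon's algorithm, named on this slide, can be sketched sequentially as follows. This is a minimal Python sketch under assumptions, not the EARTH implementation: it assumes a q x q grid of logical processors each holding one square block, the function names are illustrative, and the systolic shifts that a real implementation would perform as neighbor-to-neighbor communication are modeled here by re-indexing lists.

```python
def split_blocks(m, q):
    """Split an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    bs = len(m) // q
    return [[[row[j * bs:(j + 1) * bs] for row in m[i * bs:(i + 1) * bs]]
             for j in range(q)] for i in range(q)]


def block_multiply_add(c, a, b):
    """c += a * b for square blocks stored as lists of lists."""
    n = len(a)
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]


def cannon_multiply(a, b, q):
    """Multiply n x n matrices a and b on a simulated q x q processor grid."""
    n = len(a)
    assert n % q == 0, "matrix size must be divisible by the grid size"
    bs = n // q
    A, B = split_blocks(a, q), split_blocks(b, q)
    C = [[[[0] * bs for _ in range(bs)] for _ in range(q)] for _ in range(q)]
    # Initial skew: shift row i of A left by i, column j of B up by j.
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                block_multiply_add(C[i][j], A[i][j], B[i][j])
        # Systolic step: every A block moves one left, every B block one up.
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Reassemble the block grid into a single n x n result matrix.
    return [[x for j in range(q) for x in C[i][j][r]]
            for i in range(q) for r in range(bs)]


product = cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]], q=2)
print(product)  # [[19, 22], [43, 50]]
```

After the initial skew, each processor only ever exchanges blocks with its immediate neighbors, which is what gives the algorithm its systolic character.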
29 Example 2 --- Adaptive Unstructured Mesh Computation
- Observation: the critical part of the framework is mesh adaptation and load balancing
- The partitioning problem is in better shape; the remapping problem remains open
31 Initial Results
- About 3000 lines of Threaded-C code
- Migration > 70 (good)
- Unbiased variance 3 - 5 (very good)
- A good speedup on EARTH-MANNA has been observed
32 Performance of 2D Mesh Adaptation: EARTH-MANNA vs SP-2
[Figure: two panels, EARTH-MANNA and SP-2]
33 Example 3 --- Adaptive Wavelet Transformation
- Load evolution pattern is dynamically changing, but is statically predictable
- Need adaptive load redistribution/grouping
- Mapping onto EARTH (IPPS99)
34 HTMT Facility (Perspective)
35 HTMT Architecture
36 Extensions to the Current EARTH Model
- Percolation model
- Memory model: Location Consistency
- Load balancing and percolation
37 HTMT Percolation Model
[Figure labels: CRYOGENIC AREA; I-Pool; A-Pool; T-Pool; D-Pool; Parcel Invocation Termination; Parcel Assembly Disassembly; Parcel Dispatcher Dispenser; Run Time System; DMA; SRAM-PIM]
38 The System Software Architecture
Note:
- The Threaded-C compiler has part of its functions embedded in the RTS
- The RTS will work with the architecture and OS layers to provide the PXM interface
- The performance models are defined across all layers
39 Evolution of Multithreaded Architecture Models
[Figure: genealogy chart of multithreaded architecture models; the recoverable entries are listed below]
Non-dataflow based
- CHoPP77
- CHoPP87
- MASA (Halstead, 1986)
- Alewife (Agarwal, 1989-96)
- XMT (Vishkin)
- HEP (B. Smith, 1978)
- Tera (B. Smith, 1990-)
- CDC 6600 (1964)
- J-Machine (Dally, 1988-93)
- M-Machine (Dally, 1994-98)
- Flynn's Processor (1969)
- Cosmic Cube (Seitz, 1985)
- Others: Multiscalar (1994), SMT (1995), etc.
Dataflow model inspired
- Monsoon (Papadopoulos and Culler, 1988)
- T/Start-NG (MIT/Motorola, 1991-)
- P-RISC (Nikhil and Arvind, 1989)
- MIT TTDA (Arvind, 1980)
- Cilk (Leiserson)
- TAM (Culler, 1990)
- Iannucci's (1988-92)
- Static Dataflow (Dennis, 1972, MIT)
- EM-5/4/X, RWC-1 (1992-97)
- Manchester (Gurd and Watson, 1982)
- SIGMA-I (Shimada, 1988)
- Arg-Fetching Dataflow (Dennis and Gao, 1987-88)
- MTA (Hum, Theobald, and Gao, 94)
- EARTH (PACT95, ISCA96, Theobald99)
- MDFA (Gao, 1989-93)
40 Acknowledgement (Incomplete List)
NSERC, FCAR, DARPA, NSA, NSF, NASA
- Erik Altman
- Haiying Cai
- Nasser Elmasri
- Gerd Heber
- Laurie J. Hendren
- Herbert Hum
- Alberto Jimenez
- Prasad Kakulavarapu
- Cheng Li
- Olivier Maquelin
- Andres Marquez
- Shashank Nemawarkar
- Zach Ruiz
- Sean Ryan
- V.C. Sreedhar
- Xinan Tang
- Kevin Theobald
- Ruppa Thulasiram
- Parimala Thulasiraman
- Xinmin Tian
- Yingchun Zhu
- J. Nelson Amaral