From EARTH to HTMT: The Evolution of a Multithreaded Architecture Model

1
From EARTH to HTMT: The Evolution of a Multithreaded Architecture Model

Guang R. Gao
Computer Architecture & Parallel Systems Laboratory (CAPSL)
University of Delaware

2
Outline

Introduction
The EARTH Execution and Architecture Model
The EARTH Programming Model and Threaded-C
Application Studies and Performance Evaluation
Related Work and Conclusions

3
Main Challenges

Scalable High-Performance Parallel Systems for both Class A and Class B Applications
4
Challenges: The Killer Latency Problem
Latency due to: communication, synchronization, task spawning, load balancing
[Figure: two nodes, each labeled P, C, M, and NI, connected through a Network]
SP2 is hard enough; PC clusters are much worse!
5
Meeting High-End Application Challenges

Observation I: Many such applications have bad latencies, demanding good support for adaptive fine-grain parallelism

Petaflop-2 Conference, 99-2
6
Here Comes the Surprise! (Theobald's Ph.D. thesis, May 1999)

Observation II: It is not necessarily too hard to generate and program fine-grain threads!

However, it may be hard to statically group them into coarse-grain threads!
7
A Base Adaptive Fine-Grain Multithreaded Execution Model
C1 (Abundance): a very large pool of threads
C2 (Ultra-light weight): threads can be spawned as easily and as quickly as possible
C3 (Mobility): threads can be adaptively migrated as easily and as quickly as possible
8
Motivation of The EARTH Project
How to exploit fine-grain multithreading on a parallel system, given off-the-shelf microprocessors?
9
Two Types of Fine-Grain Threads

A parallel function invocation
Strand/Fiber: a function body can be divided into several strands/fibers

10
Threads and Fibers

A fiber becomes enabled once it has received all of its input signals
An enabled fiber may be selected for execution when the required hardware resource has been allocated

After a fiber finishes execution, a signal is sent to each destination fiber to update the corresponding sync slot

Note: the role of the strand!
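The enabling rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EARTH runtime: the `Fiber` class, the `pending` counter standing in for a sync slot, and the `run_fibers` scheduler are hypothetical names, and the single ready queue stands in for "the required hardware resource".

```python
from collections import deque


class Fiber:
    """Hypothetical fiber record: a body plus a sync slot counting pending signals."""

    def __init__(self, name, pending, body):
        self.name = name
        self.pending = pending  # signals still awaited before the fiber is enabled
        self.body = body        # callable that runs to completion without blocking
        self.targets = []       # destination fibers signaled after this one finishes


def run_fibers(enabled):
    """Run fibers in dependence order, starting from those already enabled."""
    ready = deque(enabled)
    trace = []
    while ready:
        fiber = ready.popleft()
        fiber.body()
        trace.append(fiber.name)
        # On completion, signal every destination fiber's sync slot.
        for target in fiber.targets:
            target.pending -= 1
            if target.pending == 0:  # all input signals received: now enabled
                ready.append(target)
    return trace


results = {}
load_x = Fiber("load_x", 0, lambda: results.update(x=3))
load_y = Fiber("load_y", 0, lambda: results.update(y=4))
add = Fiber("add", 2, lambda: results.update(total=results["x"] + results["y"]))
load_x.targets.append(add)
load_y.targets.append(add)

trace = run_fibers([load_x, load_y])
```

Here `add` waits on two signals, so it cannot run until both loads have finished and decremented its counter.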
11
The Execution Model of Fibers

Dependence-driven firing rule for fibers
A fiber is atomic and ultra-lightweight
Relation to the dataflow model (Dennis72)

12
The Threaded C Language

Threaded C: ANSI C extensions for multithreading
Extensions include:
Threaded functions
Threaded synchronization
Support for global addresses
Data transfer primitives
Threaded C is:
The instruction set of the EARTH processor
A target language for high-level compilers

[Figure: compilation flow with labels C, FORTRAN, High-Level Language Translation, Users, Threaded C, Threaded C Compiler, EARTH Platforms]
14
An Evolutionary Path for EARTH
[Figure: evolutionary path with labels SU-int, SU-ext, CPU/SU, CPU, SU; platforms MANNA-dual/spn, parallel machines, PC clusters, ...; SEMi Simulation Platform (Theobald99)]
15
Platforms for EARTH

MANNA
MANNA is an architecture testbed from GMD
Benchmarking platform for fine-grain multithreading
EARTH-SP2
EARTH-Beowulf (Linux based)
EARTH-SUN/SMP/Cluster

16
Unique Advantages of EARTH-MANNA Platform

We can push the OS completely out of the way!
We can design the EARTH runtime system from a very low level up
The invaluable experience and lessons learned from EARTH-MANNA are essential for the successful migration of the EARTH model to other platforms (e.g. the IBM SP-2 story, etc.)

19
Performance of EARTH-MANNA on N-Queens(12)
(1,637,099 tokens are generated! On average, 30 tokens are maintained per processor)
Paraffin, protein folding, etc.
20
Summary of Recent Experimental Results (Kevin99)

Impressive speedup and scalability (scalable even with high-overhead fine-grain parallel programs, e.g. fib)
Enhanced programmability (N-queen-p example)
Broad applicability

21
Experiments

Example 1 (assorted benchmarks): fib, nqueen, paraffin, tomcatv, matrix-multiply, etc.
Example 2: Adaptive unstructured grids
Example 3: Wavelet computation

24
Performance of N-Queens(12) (Theobald99)

117.8-fold speedup on a 120-node simulation!
1,637,099 tokens are generated!
On average, 30 tokens are maintained per processor
N-Queens is a useful HTMT benchmark after all! (Phil Murkey)
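To put the speedup figure in perspective, parallel efficiency is speedup divided by the number of nodes. The short sketch below applies that standard definition to the numbers on this slide; the function name `parallel_efficiency` is a hypothetical helper, not something from the EARTH tools.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / nodes


# Figures quoted on the slide: 117.8-fold speedup on a 120-node simulation.
nqueens_efficiency = parallel_efficiency(117.8, 120)
print(f"N-Queens(12) efficiency: {nqueens_efficiency:.1%}")
```

The result is roughly 98% of ideal linear speedup.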

28
Coarse-Grain Applications

A 116-fold speedup on a 120-node machine is achieved for Cannon's matrix multiply algorithm!
Deep software systolic-style implementation to exploit parallelism
Fine-grain mechanisms
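For readers unfamiliar with Cannon's algorithm, the following is a minimal sequential simulation of its systolic data movement. It is a sketch under simplifying assumptions, not the Threaded-C implementation measured here: each simulated processor holds a single matrix element rather than a block, the shifts are plain list rotations instead of network transfers, and `cannon_matmul` is a hypothetical name.

```python
def cannon_matmul(a, b):
    """Multiply two n x n matrices using the data movement of Cannon's algorithm."""
    n = len(a)
    # Initial skew: row i of A is rotated left by i, column j of B rotated up by j.
    a_loc = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b_loc = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Each simulated processor (i, j) multiplies the pair it currently holds.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Systolic step: rotate every row of A left by one, every column of B up by one.
        a_loc = [[a_loc[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b_loc = [[b_loc[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


product = cannon_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

After the initial skew, every processor only ever exchanges data with its nearest neighbors, which is what makes the algorithm a natural fit for a systolic-style implementation.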

29
Example 2: Adaptive Unstructured Mesh Computation

Observation:
The critical part of the framework is mesh adaptation and load balancing
The partitioning problem is in better shape; the remapping problem remains open

31
Initial Results

About 3000 lines of Threaded-C code
Migration > 70 (good)
Unbiased variance 3 - 5 (very good)
A good speedup on EARTH-MANNA has been observed

32
Performance of 2D Mesh Adaptation: EARTH-MANNA vs SP-2
[Figure: two panels, EARTH-MANNA and SP-2]
33
Example 3: Adaptive Wavelet Transformation

The load evolution pattern changes dynamically, but is statically predictable
Need adaptive load redistribution/grouping
Mapping onto EARTH (IPPS99)

34
HTMT Facility (Perspective)
35
HTMT Architecture
36
Extensions to the Current EARTH Model

Percolation Model
Memory Model: Location Consistency
Load balancing and percolation

37
HTMT Percolation Model
[Figure: percolation model with labels CRYOGENIC AREA; I-Pool, A-Pool, T-Pool, D-Pool; Parcel Invocation, Termination; Parcel Assembly, Disassembly; Parcel Dispatcher, Dispenser; Run Time System; DMA; SRAM-PIM]
38
The System Software Architecture

Note:
The Threaded-C compiler has part of its functions embedded in the RTS
The RTS will work with the architecture and OS layers to provide the PXM interface
The performance models are defined across all layers

39
Evolution of Multithreaded Architecture Models
[Figure: timeline of architecture models in two groups, "Non-dataflow based" and "Dataflow model inspired"; entries listed in extraction order]
CHoPP77
CHoPP87
MASA (Halstead, 1986)
Alewife (Agarwal, 1989-96)
XMT (Vishkin)
HEP (B. Smith, 1978)
Tera (B. Smith, 1990-)
CDC 6600 (1964)
J-Machine (Dally, 1988-93)
M-Machine (Dally, 1994-98)
Flynn's Processor (1969)
Cosmic Cube (Seitz, 1985)
Others: Multiscalar (1994), SMT (1995), etc.
Monsoon (Papadopoulos, Culler, 1988)
T/Start-NG (MIT/Motorola, 1991-)
P-RISC (Nikhil, Arvind, 1989)
MIT TTDA (Arvind, 1980)
Cilk (Leiserson)
TAM (Culler, 1990)
Iannucci's (1988-92)
Static Dataflow (Dennis, 1972, MIT)
EM-5/4/X, RWC-1 (1992-97)
Manchester (Gurd, Watson, 1982)
SIGMA-I (Shimada, 1988)
Arg-Fetching Dataflow (Dennis, Gao, 1987-88)
MTA (Hum, Theobald, Gao, 94)
EARTH (PACT95, ISCA96, Theobald99)
MDFA (Gao, 1989-93)
40
Acknowledgement (Incomplete List)
NSERC, FCAR, DARPA, NSA, NSF, NASA