From EARTH to HTMT: The Evolution of a Multithreaded Architecture Model

Guang R. Gao, Computer Architecture & Parallel Systems Laboratory (CAPSL), University of Delaware



1
From EARTH to HTMT: The Evolution of a Multithreaded Architecture Model
  • Guang R. Gao
  • Computer Architecture & Parallel Systems Laboratory (CAPSL)
  • University of Delaware

2
Outline
  • Introduction
  • The EARTH Execution and Architecture Model
  • The EARTH Programming Model and Threaded-C
  • Application Studies and Performance Evaluation
  • Related Work and Conclusions

3
Main Challenges
  • Scalable High-Performance Parallel Systems for both Class A and Class B Applications
4
Challenges: The Killer Latency Problem
Latency due to:
  • Communication
  • Synchronization
  • Task spawning
  • Load balancing

[Figure: two nodes, each labeled P, NI, C, and M, connected through a Network]
SP2 is hard enough; PC clusters are much worse!
5
Meeting High-End Application Challenges
  • Observation I: Many such applications have bad latencies, demanding good
    support for adaptive fine-grain parallelism

Petaflop-2 Conference, 99-2
6
Here Comes the Surprise! (Theobald's Ph.D. thesis, May 1999)
  • Observation II: It is not necessarily too hard to
    generate and program fine-grain threads!

However, it may be hard to statically group them
into coarse-grain threads!
7
A Base Adaptive Fine-Grain Multithreaded Execution Model
  • C1 (Abundance): a very large pool of threads
  • C2 (Ultra-lightweight): threads can be spawned as easily and as quickly
    as possible
  • C3 (Mobility): threads can be migrated adaptively, as easily and as
    quickly as possible
8
Motivation of The EARTH Project
How can fine-grain multithreading be exploited on a
parallel system built from off-the-shelf
microprocessors?
9
Two Types of Fine-Grain Threads
  • A parallel function invocation
  • Strand/Fiber - a function body can be divided
    into several strands/fibers

10
Threads and Fibers
  • A fiber becomes enabled once it has received all of its
    input signals
  • An enabled fiber may be selected for execution when the required
    hardware resource has been allocated
  • After a fiber finishes execution, a signal is sent to all
    destination fibers to update the corresponding
    sync slots

Note: The role of the strand!
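The enabling rule above can be illustrated with a small simulation. This is only a sketch in Python, not Threaded-C and not the EARTH runtime: the `Fiber` and `Scheduler` classes, the `pending` counter standing in for a sync slot, and the `demo` function are all hypothetical names chosen for illustration.

```python
from collections import deque


class Fiber:
    """Hypothetical fiber: a body guarded by a sync slot (a signal counter)."""

    def __init__(self, name, body, sync_count):
        self.name = name
        self.body = body            # callable executed once when the fiber fires
        self.pending = sync_count   # signals still required before enabling
        self.successors = []        # fibers signaled when this one finishes


class Scheduler:
    """Hypothetical single-processor scheduler with a ready queue."""

    def __init__(self):
        self.ready = deque()
        self.trace = []             # names of fibers in execution order

    def add(self, fiber):
        # A fiber that needs no signals is enabled immediately.
        if fiber.pending == 0:
            self.ready.append(fiber)

    def signal(self, fiber):
        # Update the sync slot; the fiber is enabled when it reaches zero.
        fiber.pending -= 1
        if fiber.pending == 0:
            self.ready.append(fiber)

    def run(self):
        while self.ready:
            fiber = self.ready.popleft()
            fiber.body()            # runs to completion, never preempted
            self.trace.append(fiber.name)
            for successor in fiber.successors:
                self.signal(successor)


def demo():
    results = {}

    def produce_x():
        results["x"] = 2

    def produce_y():
        results["y"] = 3

    def add_xy():
        results["sum"] = results["x"] + results["y"]

    sched = Scheduler()
    a = Fiber("a", produce_x, sync_count=0)
    b = Fiber("b", produce_y, sync_count=0)
    c = Fiber("c", add_xy, sync_count=2)   # waits for both producers
    a.successors.append(c)
    b.successors.append(c)
    for fiber in (c, a, b):
        sched.add(fiber)
    sched.run()
    return results["sum"], sched.trace
```

Calling `demo()` returns `(5, ["a", "b", "c"])`: fiber `c` is registered first but only fires after both `a` and `b` have signaled its sync slot.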
11
The Execution Model of Fibers
  • Dependence-driven firing rule for fibers
  • A fiber is atomic and ultra-lightweight
  • Relation to the dataflow model (Dennis72)

12
The Threaded C Language
  • Threaded C: ANSI C extensions for
    multithreading
  • Extensions include:
      • Threaded functions
      • Threaded synchronization
      • Support for global addresses
      • Data transfer primitives
  • Threaded C is:
      • The instruction set of the EARTH processor
      • A target language for high-level compilers

[Figure: compilation flow with labels C, FORTRAN, High-Level Language Translation, Users, Threaded C, Threaded C Compiler, EARTH Platforms]
13
(No Transcript)
14
An Evolutionary Path for EARTH
[Figure: evolutionary path with labels SU-int, SU-ext, CPU / SU, CPU, SU, MANNA-dual/spn, parallel machines, PC clusters, ..., and the SEMi Simulation Platform (Theobald99)]
15
Platforms for EARTH
  • MANNA
      • MANNA is an architecture testbed from GMD
      • Benchmarking platform for fine-grain
        multithreading
  • EARTH-SP2
  • EARTH-Beowulf (Linux-based)
  • EARTH-SUN/SMP/Cluster

16
Unique Advantages of the EARTH-MANNA Platform
  • We can push the OS completely out of the way!
  • We can design the EARTH runtime system from a very
    low level up
  • The invaluable experience and lessons learned from
    EARTH-MANNA are essential for the successful
    migration of the EARTH model to other platforms
    (e.g., the IBM SP-2 story)

17
(No Transcript)
18
(No Transcript)
19
Performance of EARTH-MANNA on N-Queens(12)
(1,637,099 tokens are generated! On average, 30
tokens are maintained per processor)
Paraffin, protein folding, etc.
20
Summary of Recent Experimental Results (Kevin99)
  • Impressive speedup and scalability (scalable even
    with high-overhead fine-grain parallel programs,
    e.g., fib)
  • Enhanced Programmability (N-queen-p example)
  • Broad applicability

21
Experiments
  • Example 1 (assorted benchmarks): fib, nqueen,
    paraffin, tomcatv, matrix-multiply, etc.
  • Example 2: Adaptive unstructured grids
  • Example 3: Wavelet computation

22
(No Transcript)
23
(No Transcript)
24
Performance of N-Queens(12) (Theobald99)
  • 117.8-fold speedup on a 120-node simulation!
  • 1,637,099 tokens are generated!
  • On average, 30 tokens are maintained per processor
  • N-Queens is a useful HTMT benchmark after all!
    (Phil Murkey)
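To put the speedup figure in perspective, parallel efficiency is conventionally defined as speedup divided by processor count. The helper below is a minimal sketch of that arithmetic; the function name `parallel_efficiency` is an illustrative choice, and only the 117.8 and 120 values come from the slide.

```python
def parallel_efficiency(speedup, processors):
    """Return efficiency E = S / p for speedup S on p processors."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Figures quoted on the slide: 117.8-fold speedup on 120 simulated nodes.
print(f"{parallel_efficiency(117.8, 120):.1%}")  # prints 98.2%
```

By this measure the simulated run keeps each node doing useful work about 98% of the time.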

25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
Coarse-Grain Applications
  • A 116-fold speedup on a 120-node machine is achieved
    for Cannon's matrix multiply algorithm!
  • Deep software systolic-style implementation to
    exploit parallelism
  • Fine-grain mechanisms
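For readers unfamiliar with Cannon's algorithm, the sketch below simulates the textbook version sequentially in Python, with one matrix element per simulated processor. It is not the EARTH or Threaded-C implementation measured on the slide; the function name `cannon_multiply` and the one-element-per-processor layout are assumptions made to keep the example short.

```python
def cannon_multiply(A, B):
    """Multiply square matrices A and B with a simulated Cannon's algorithm.

    Each grid position (i, j) plays the role of one processor holding a
    single element of A and of B (an assumption made for brevity).
    """
    n = len(A)
    # Initial skew: shift row i of A left by i and column j of B up by j.
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]

    for _ in range(n):
        # Every processor multiplies its local pair and accumulates.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # Systolic step: shift A one position left and B one position up.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c
```

For example, `cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`. The nearest-neighbor shifts are what make the algorithm a natural fit for a systolic-style implementation.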

29
Example 2 --- Adaptive Unstructured Mesh
Computation
  • Observation
      • The critical part of the framework is mesh
        adaptation and load balancing
      • The partitioning problem is in better shape; the
        remapping problem remains open

30
(No Transcript)
31
Initial Results
  • About 3000 lines of Threaded-C code
  • Migration > 70 (good)
  • Unbiased variance 3 - 5 (very good)
  • A good speedup on EARTH-MANNA has been observed

32
Performance of 2D Mesh Adaptation: EARTH-MANNA vs. SP-2
[Figure: two panels, EARTH-MANNA and SP-2]
33
Example 3 --- Adaptive Wavelet Transformation
  • The load evolution pattern changes dynamically,
    but is statically predictable
  • Adaptive load redistribution/grouping is needed
  • Mapping onto EARTH (IPPS99)

34
HTMT Facility (Perspective)
35
HTMT Architecture
36
Extensions to the Current EARTH Model
  • Percolation Model
  • Memory Model: Location Consistency
  • Load balancing and percolation

37
HTMT Percolation Model
[Figure: percolation model with labels CRYOGENIC AREA, I-Pool, A-Pool, T-Pool, D-Pool, Parcel Invocation Termination, Parcel Assembly Disassembly, Parcel Dispatcher Dispenser, Run Time System, DMA, SRAM-PIM]
38
The System Software Architecture
  • Note
      • The Threaded-C compiler has part of its functionality
        embedded in the RTS
      • The RTS will work with the architecture and OS layers
        to provide the PXM interface
      • The performance models are defined across all
        layers

39
Evolution of Multithreaded Architecture Models
[Figure: genealogy of multithreaded architecture models, with the labels below]
Non-dataflow based
  • CHoPP77
  • CHoPP87
  • MASA (Halstead, 1986)
  • Alewife (Agarwal, 1989-96)
  • XMT (Vishkin)
  • HEP (B. Smith, 1978)
  • Tera (B. Smith, 1990-)
  • CDC 6600 (1964)
  • J-Machine (Dally, 1988-93)
  • M-Machine (Dally, 1994-98)
  • Flynn's Processor (1969)
  • Cosmic Cube (Seitz, 1985)
  • Others: Multiscalar (1994), SMT (1995), etc.
Dataflow model inspired
  • Monsoon (Papadopoulos, Culler, 1988)
  • *T/StarT-NG (MIT/Motorola, 1991-)
  • P-RISC (Nikhil, Arvind, 1989)
  • MIT TTDA (Arvind, 1980)
  • Cilk (Leiserson)
  • TAM (Culler, 1990)
  • Iannucci's (1988-92)
  • Static Dataflow (Dennis, 1972, MIT)
  • EM-5/4/X, RWC-1 (1992-97)
  • Manchester (Gurd, Watson, 1982)
  • SIGMA-I (Shimada, 1988)
  • Arg-Fetching Dataflow (Dennis, Gao, 1987-88)
  • MTA (Hum, Theobald, Gao, 94)
  • EARTH (PACT95, ISCA96, Theobald99)
  • MDFA (Gao, 1989-93)
40
Acknowledgement (Incomplete List)
NSERC, FCAR, DARPA, NSA, NSF, NASA
  • Erik Altman
  • Haiying Cai
  • Nasser Elmasri
  • Gerd Heber
  • Laurie J. Hendren
  • Herbert Hum
  • Alberto Jimenez
  • Prasad Kakulavarapu
  • Cheng Li
  • Olivier Maquelin
  • Andres Marquez
  • Shashank Nemawarkar
  • Zach Ruiz
  • Sean Ryan
  • V.C. Sreedhar
  • Xinan Tang
  • Kevin Theobald
  • Ruppa Thulasiram
  • Parimala Thulasiraman
  • Xinmin Tian
  • Yingchun Zhu
  • J. Nelson Amaral