1. On Recent Development in High-Performance Computer Architectures and Compilation Techniques (Multithreading)
- Prof. Pen-Chung Yew
- Dept. of Computer Science and Engineering
- University of Minnesota
- http://www.cs.umn.edu/Research/Agassiz
2. Outline
- Superscalar/superpipeline architectures
- Ways to further improve performance
- Multithreaded processor architectures
- Performance enhancement through control speculation
- Performance enhancement through value prediction
- Performance enhancement through dynamic recompilation (not included)
3. Current Directions in Computer Design
- Continue to strive for higher performance
- Low-power design to improve the performance/power ratio
- Improve dependability: small feature sizes increase error rates
4. Trends in Hardware Technology
- Available transistors on a single chip continue to grow
- More functionality on a single chip, but need to consider performance/power tradeoffs (Fig. 1-3)
- Exploiting more parallelism (ILP/TLP)
- Clock rate increase limited more by wire delay than by gate delay
- Current single, large-window, wide-issue architectures are not scalable
- Use deeper pipelines, multiple simple processing units on a chip, or multithreaded architectures
- New high-performance special-purpose (embedded) architectures
5. Software Technology Trends
- Rely more on compiler technology to simplify hardware complexity and to improve efficiency
- Profile-based techniques and dynamic compilation are gaining significance
- Fine-grained (instruction-level) parallelism is no longer sufficient for large issue rates
- Need to exploit medium-grained (loop-level or procedure-level) parallelism
6. Strength and Limitation of Existing Approaches - Superscalar
- Strength
- Multiple issue per clock cycle
- Accurate and effective runtime dependence checking
- Out-of-order execution for high ILP
- HW-supported instruction-level speculation on values, dependences, and branch outcomes
7. Strength and Limitation of Existing Approaches - Superscalar
- Limitations
- A single instruction window is not easy to scale (limited by wire delay)
- Limited to instruction-level parallelism only (not enough parallelism)
- Compilers exploit mostly innermost loops
8. Strength and Limitation of Existing Approaches - Multiprocessor on a Chip (MOAC)
- Strength
- More scalable (allows shorter wires)
- More flexible (various interconnect schemes)
- Exploits primarily loop-level parallelism
- Very good parallelizing compiler technology available
9. Strength and Limitation of Existing Approaches - Multiprocessor on a Chip (MOAC)
- Limitations
- Communication and synchronization through memory
- Higher overhead
- Need larger granularity
- Exploit parallel loops (Doall) only
- May need a coherent cache
- Primarily for scientific, not general-purpose, applications
10. Strength and Limitation of Existing Approaches - VLIW
- Strength
- Relatively simple HW support
- Uses the compiler to resolve dependences at compile time and schedule instructions
- Trace scheduling
- Software pipelining
- Extensive use of profiling information in the compiler
11. Strength and Limitation of Existing Approaches - VLIW
- Limitations
- Code is not portable and may expand to a large size
- Independent instructions are executed in lockstep
- Limited to ILP only
- Compiler exploits only innermost loops
- Figures
12. Challenges in General-Purpose Applications
- Mostly Do-while loops
- Need thread-level speculation
- Not available in current multiprocessors
- Parallelism exists in outer loops
- Need thread-level support
- N/A in superscalar
- Good in multiprocessors
- Pointers/aliases complicate dependence analysis
- Need runtime dependence checks both within and between threads, or use data speculation
- N/A in multiprocessor and superscalar
13. Challenges in General-Purpose Applications (Cont.)
- Small basic blocks in straight-line code
- Need instruction-level speculative/predicated execution, branch prediction, profiling-based compilation, and out-of-order execution
- Very well developed in superscalars
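As a concrete illustration of the branch prediction that superscalars rely on, here is a minimal sketch of a generic 2-bit saturating-counter predictor. The scheme, table size, and loop-branch example are standard textbook assumptions for illustration, not details taken from these slides.

```python
# A minimal 2-bit saturating-counter branch predictor (illustrative only).
class TwoBitPredictor:
    def __init__(self, table_size=1024):
        self.table = [1] * table_size  # states 0-3; start "weakly not taken"

    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = pc % len(self.table)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

p = TwoBitPredictor()
outcomes = [True] * 9 + [False]   # a loop branch: taken 9 times, then exits
correct = 0
for _ in range(3):                # execute the loop three times
    for taken in outcomes:
        correct += (p.predict(0x40) == taken)
        p.update(0x40, taken)
print(correct, "of", 3 * len(outcomes), "predicted correctly")
```

After warm-up the counter mispredicts only the loop exits, which is why even this tiny scheme handles the small basic blocks of general-purpose code well.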
14. Challenges in General-Purpose Applications (Cont.)
- Many small loops, Doacross loops, and parallel sections
- Need fast, low-overhead communication/synchronization support
- Communication directly between register files, or through special memory buffers, without going through memory
- Not available in current MOAC
15. Multithreaded Architectures (To Improve Single-Program Speedup)
- A hybrid of superscalar and MOAC
- More scalable than superscalar
- More general-purpose than MOAC
- Supports multiple threads (multiple program counters)
- Handles loop-level parallelism (or parallel sections) more easily
- Not limited to innermost loops
16. Multithreaded Architectures (To Improve Single-Program Speedup)
- Support thread-level speculation (on data and/or control dependences)
- Handle Do-while loops and Doacross loops
- Support runtime dependence checking both within and between threads, or support data speculation
- Handle pointer/alias problems
- Allow fast communication/synchronization
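The thread-level speculation and runtime cross-thread checking listed above can be sketched in miniature. The Python model below is an assumed simplification, not any real machine's interface: tasks run speculatively against committed memory, each committing task's stores are compared against later tasks' load sets, and a violation squashes the offending task and everything after it (the simple squashing model). The op format and the `addcopy` operation are invented for illustration.

```python
# Sketch of thread-level data speculation with runtime dependence checking
# between threads (assumed model; op encodings are invented for this example).
def execute(task, mem):
    """Run one task against committed memory, buffering its stores.
    Ops: ('store', addr, val) or ('addcopy', src, dst, k): dst = src + k."""
    loads, stores = set(), {}
    for op in task:
        if op[0] == 'store':
            _, addr, val = op
            stores[addr] = val
        else:
            _, src, dst, k = op
            loads.add(src)
            stores[dst] = stores.get(src, mem.get(src, 0)) + k
    return loads, stores

def run(tasks, init):
    mem = dict(init)
    results = [execute(t, mem) for t in tasks]   # all tasks run speculatively
    squashes = 0
    for i in range(len(tasks)):
        loads, stores = results[i]
        mem.update(stores)                       # task i commits in program order
        for j in range(i + 1, len(tasks)):       # runtime cross-thread check
            if results[j][0] & set(stores):      # j loaded something i wrote
                squashes += len(tasks) - j       # squash j and everything after
                for k in range(j, len(tasks)):
                    results[k] = execute(tasks[k], mem)
                break
    return mem, squashes

tasks = [[('store', 'a', 5)],
         [('addcopy', 'a', 'b', 1)],   # b = a + 1
         [('addcopy', 'b', 'c', 1)]]   # c = b + 1
mem, squashes = run(tasks, {})
print(mem, squashes)                   # correct final values despite squashes
```

Even with every task initially reading stale data, the in-order commit plus runtime check converges to the sequential result, at the cost of re-executions.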
17. Multithreaded Architectures (To Improve Single-Program Speedup)
- Multiscalar
- Superthreaded processors
- NEC multithreaded processor
- Multiprocessor on a chip (MOAC)
- Trace processor
- Single-Program Speculative Multithreading (SPSM)
18. Other Multithreaded Architectures
- Simultaneous multithreading (SMT)
- Uses multiprogramming to improve throughput
- Does not consider scalability
- Use multithreading to hide memory latency (a single thread active at a time, e.g., the Tera computer)
- MaxPar tool set to study thread-level parallelism
19. Multiscalar
- Exploits thread-level parallelism through
- Control speculation (not stopped by branch instructions) - w/ hardware
- Data speculation (not stopped by data dependences between threads) - w/ hardware
- Relies on as much hardware support as possible
- References
- Multiscalar Processors, by Sohi, Breach, and Vijaykumar, Intl. Symp. on Computer Architecture (ISCA-22), 1995
- Speculative Versioning Cache, by Gopal, Vijaykumar, Smith, and Sohi, Intl. Symp. on High-Performance Computer Architecture (HPCA-4), 1998
20. Speculative Versioning Cache
- Speculative Versioning Cache, by Gopal, Vijaykumar, Smith, and Sohi, Intl. Symp. on High-Performance Computer Architecture (HPCA-4), 1998
- A design to support data dependence speculation
- Data dependences
- Memory dependences
- Register dependences: the compiler handles them
- Need to compute addresses before determining whether dependences exist
- Need to distinguish execution order vs. program order
21. Handling Data Dependences in a Single-Threaded Processor
- Load
- Need to know all previous stores before issuing a load (to avoid violating RAW dependences)
- Store
- Need to know (1) its address and (2) that all previous instructions are free of exceptions (to avoid changing machine state)
- Only a single version is maintained in the system
- Determining addresses before loads/stores forces a strict execution order and degrades ILP performance
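A small timing model makes the cost of this strict ordering visible. In the sketch below (an assumed model, not from the slides), a load may not issue until every earlier store's address has resolved, so one late-resolving store delays even independent loads.

```python
# Conservative (non-speculative) load issue: a load waits for the addresses
# of ALL earlier stores, even those to different locations. Assumed model.
def conservative_load_issue(insts):
    """insts: list of ('load'|'store', addr_ready_cycle), in program order.
    Returns {instruction index: cycle at which the load may issue}."""
    issue = {}
    latest_store_addr = 0
    for i, (kind, ready) in enumerate(insts):
        if kind == 'store':
            latest_store_addr = max(latest_store_addr, ready)
        else:
            issue[i] = max(ready, latest_store_addr)
    return issue

insts = [('store', 8),   # store whose address resolves late, at cycle 8
         ('load', 1),    # independent load, address ready at cycle 1
         ('load', 2)]    # independent load, address ready at cycle 2
res = conservative_load_issue(insts)
print(res)               # both loads are held until cycle 8
```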
22. To Improve ILP Performance
- Allow loads to bypass stores if they are to different memory locations
- Still need to know all their addresses before issuing loads/stores, which degrades ILP performance
- Issue a load as soon as its address is known, even before all previous stores are completed
- Data dependence speculation: assume all loads are to addresses different from stores
- Allow loads and stores to be executed out of order
- Maintain multiple versions of data in the same location
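With data dependence speculation, by contrast, a load issues as soon as its own address is ready and is replayed only when an earlier store later resolves to the same address. The sketch below uses a minimal assumed timing model (not from the slides) to show the difference.

```python
# Speculative load issue (assumed model): issue eagerly, squash and replay
# a load only on an actual same-address conflict with an earlier store.
def speculative_load_issue(insts):
    """insts: list of ('load'|'store', addr, addr_ready_cycle), program order.
    Returns {load index: (issue_cycle, replayed?)}."""
    out = {}
    for i, (kind, addr, ready) in enumerate(insts):
        if kind != 'load':
            continue
        issue, replay = ready, False       # speculate: no conflict
        for kindj, addrj, readyj in insts[:i]:
            # an earlier store to the same address resolving after we issued
            # forces a squash-and-replay after that store
            if kindj == 'store' and addrj == addr and readyj > issue:
                issue, replay = readyj + 1, True
        out[i] = (issue, replay)
    return out

insts = [('store', 0x100, 8),
         ('load', 0x200, 1),   # different address: issues at cycle 1
         ('load', 0x100, 2)]   # true conflict: replayed after the store
res = speculative_load_issue(insts)
print(res)
```

The independent load now issues at cycle 1 instead of waiting until cycle 8; only the genuinely conflicting load pays a replay penalty.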
23. Architecture Support Needed for Such Improvement
- A reordering buffer (queue) and an address comparison mechanism to maintain the program order of all loads and stores
- This is one simple form of speculative versioning in the single-threaded execution model
- Multithreaded execution model
- A hierarchical execution model
- Multiple streams of execution
- Intra-stream (same as in a single stream)
- Inter-stream (similar to the ARB; each entry maintains all versions of the same memory location)
24. Speculative Versioning Cache (SVC) - Introduction
- Limitations of the ARB
- A single shared queue for all threads (needs very large bandwidth for the queue)
- When a task completes, writeback is needed
- Creates bursty traffic
- Delays new task initiation
- SVC
- Uses a private-cache coherence protocol
- Cache hits: no bus traffic
- Task commits: no bursty writeback needed
25. Speculative Versioning Cache (SVC) - Introduction
- SVC tracks loads/stores from all threads
- A load must eventually read the data from the most recent store to the same location
- A load must be squashed and re-executed if it violates such an order
- All stores to the same location that follow the load must be buffered until the load is done (to guard against possible violations)
- Speculative versions of a memory location must eventually be committed in original program order
26. SVC Load Operation
- A load is executed as soon as its address becomes available
- Speculate that all stores are to different memory locations
- The closest version of the memory location from the same/different tasks is supplied to the load
- The load is recorded to check for a potential violation by an earlier store
- All stores after the load must be kept to prepare for a possible violation of the load (squashing)
27. SVC Store and Commit
- When a store is executed, it broadcasts the data to all later active tasks
- Any later task detecting a violation of any load (to the same memory location) must be squashed
- All tasks after the violating task are also squashed (simple squashing model)
- The oldest (non-speculative) task commits its memory state to architectural storage
- Commits involve logically copying from the SVC to architectural storage
- Example: Figure 2
- Solid arrowhead: program order; hollow arrowhead: execution order
- 1st snapshot: before task 1 executes a store and detects the violation
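The load, store, squash, and commit rules above can be condensed into a toy versioned memory. This Python sketch is an assumed simplification of SVC, not the paper's actual protocol: one version slot per task per location, loads recorded per task, and a store by an earlier task squashing any later task that already loaded the location.

```python
# Toy model of speculative versioning (assumed simplification, not the real
# SVC protocol). Locations must be initialized up front.
class VersionedMemory:
    def __init__(self, n_tasks, init):
        # versions[addr] maps task id -> value; task -1 holds committed state
        self.versions = {a: {-1: v} for a, v in init.items()}
        self.load_log = [set() for _ in range(n_tasks)]  # loads per task
        self.squashed = set()

    def load(self, task, addr):
        self.load_log[task].add(addr)    # record load for violation checks
        earlier = [t for t in self.versions[addr] if t <= task]
        return self.versions[addr][max(earlier)]  # closest earlier version

    def store(self, task, addr, val):
        self.versions[addr][task] = val
        # any later task that already read this location saw a stale version
        for t in range(task + 1, len(self.load_log)):
            if addr in self.load_log[t]:
                self.squashed.add(t)

    def commit(self, task):
        # the oldest task's versions become the committed (-1) state
        for vs in self.versions.values():
            if task in vs:
                vs[-1] = vs.pop(task)

vm = VersionedMemory(3, {'x': 0})
print(vm.load(2, 'x'))    # task 2 speculatively reads the committed value 0
vm.store(1, 'x', 7)       # an earlier task then stores: task 2 is squashed
print(vm.squashed)        # task 2 must be re-executed
print(vm.load(2, 'x'))    # on re-execution, task 2 sees version 7
```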
28. SVC Design
- Table 1: memory operations/commit/squash
- Multiprocessor caches maintain only one version of each datum across all caches; SVC allows multiple versions of the same datum across caches
- Use a version ordering list (VOL) to maintain program order among multiple versions of the same datum
- Use pointers, similar to SCI (Scalable Coherent Interface), to maintain a linked list implementing the VOL
29. SVC Design
- Figure 3: an SMP coherent cache system
- A cache line has three states: Invalid, Clean, Dirty (Store)
- Hit
- Load: to a clean/dirty cache line
- Store: to a dirty line
- Miss
- Otherwise, including a store hit to a clean copy
- A BusRead (BusWrite) request for a load (store) is issued to the bus
- Replacement
- BusWBack to cast out dirty cache lines
30. SVC Design
- Example: Figure 4
- Empty boxes: no data
- 1st snapshot: before Z issues a load
- 2nd snapshot: after the load is completed
- 3rd snapshot: Y issues a store
- 4th snapshot: Y replaces a cache line
31. Base SVC Design
- Figure 5
- Assumption: one word per cache line
- A load bit is added (Figure 6)
- In the initial invalid state, the load bit is set
- Used to detect load-before-store violations
- A pointer is added to point to the cache/processor that has the next copy/version of the same data, as described in the VOL; i.e., the VOL is implemented as a linked list
- Pointers point to caches/processors, not to tasks
32. Base SVC Design
- Add version control logic (VCL) to support the VOL
- Hit: no need to consult the VCL
- Miss: consult the VCL (Figure 5)
- Behaves like the ARB
- Load misses
- The VCL locates the closest previous version by searching the VOL in reverse order from the requestor
- The task assignment is consulted to determine the requestor's pointer into the VOL
- Example: Figure 7
- Task 2 issues a load causing a miss
- Y (not W) provides the data
33. Base SVC Design
- Task commits
- All dirty lines are immediately written back
- Need to maintain a list of stores executed by the task
- All other lines are invalidated
- Task squashes
- All cache lines are squashed
- Drawbacks of such a scheme
- Bursty writeback traffic is still not taken care of (need to distribute traffic over time)
- Clean cache lines are invalidated after commits, resulting in a cold cache for the next new task (should retain read-only data)
34. Superthreaded Processors
- Exploit thread-level parallelism through
- Control speculation (not stopped by branch instructions) - w/ hardware
- Data synchronization (not stopped by data dependences between threads) - w/ compiler
- Rely on as much compiler/software support as possible
- Reference
- The Superthreaded Processor Architecture, by Tsai, et al., IEEE Trans. on Computers, Sept. 1999
35. A Possible Hardware Implementation of Multithreaded Processors
- A 1GIPS 1W Single-Chip Tightly-Coupled Four-Way Multiprocessor with Architecture Support for Multiple Control Flow Execution, by Torii, et al., 2000 IEEE Intl. Solid-State Circuits Conf.
- One possible implementation of a hybrid of Multiscalar and Superthreaded
36. Single-Program Speculative Multithreading (SPSM)
- Single-Program Speculative Multithreading (SPSM) Architecture: Compiler-Assisted Fine-Grained Multithreading, by P.K. Dubey, et al., PACT 1995
- Advantages
- Disadvantages
37. Multiprocessor on a Chip (MOAC)
- Speculative Multithreaded Processors, by Marcuello, Gonzalez, and Tubella, ICS '98
- A Scalable Approach to Thread-Level Speculation, by Steffan, Colohan, Zhai, and Mowry, ISCA 2000
- Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor, by Krishnan and Torrellas, ICS '98
38. Simultaneous Multithreading (SMT) Architectures
- Simultaneous Multithreading: Maximizing On-Chip Parallelism, by Tullsen, Eggers, and Levy, ISCA-22, 1995
- Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading, by Eggers, et al., ACM TOCS, Aug. 1997
- Tuning Compiler Optimizations for Simultaneous Multithreading, by Lo, et al., MICRO-30, 1997
39. Latency Hiding Using Multithreading: Tera Computer
- Trading parallelism for memory latency
- Needs very large memory bandwidth to support this scheme
- Trades memory latency for memory bandwidth
- Tera is too ambitious
- New architecture, new device technology, and new compiler technology
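The parallelism-for-latency trade can be quantified with a standard back-of-the-envelope model. The formula below is a textbook-style assumption for illustration, not a figure from Tera documentation: a processor switches threads on every memory reference, with R compute cycles between references and L cycles of memory latency.

```python
# Back-of-the-envelope latency-hiding model (assumed, not from the slides):
# R = compute cycles between memory references, L = memory latency (cycles),
# N = thread count. Utilization = min(1, N*R / (R + L)).
import math

def utilization(n_threads, run_length, latency):
    return min(1.0, n_threads * run_length / (run_length + latency))

def threads_to_hide(run_length, latency):
    # smallest N with N*R >= R + L
    return math.ceil((run_length + latency) / run_length)

print(utilization(1, 10, 90))    # a single thread stalls 90% of the time
print(threads_to_hide(10, 90))   # threads needed to hide the latency fully
print(utilization(10, 10, 90))   # with that many threads, utilization is 1.0
```

The model also shows why the scheme trades latency tolerance for bandwidth: at full utilization the memory system must sustain one reference every R cycles.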
40. How Much Potential Thread-Level Parallelism Is in SPEC Benchmarks?
- MaxPar simulation tools
- How do they work?
- Different machine models suitable to be simulated by such an approach
- Some simulation results
- Useful tools for
- Identifying thread-level parallelism
- Studying various thread execution models using real benchmarks
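One plausible way such a limit study can measure inherent parallelism (an assumed mechanism for illustration; the slides leave the question open): schedule every operation at the earliest cycle its data dependences allow, then take total operations divided by the critical-path length.

```python
# Dataflow-limited parallelism measurement (assumed MaxPar-style mechanism):
# each op finishes one cycle after its latest-finishing dependence.
def inherent_parallelism(deps):
    """deps[i] = list of operation indices (< i) that op i depends on."""
    finish = []
    for preds in deps:
        finish.append(1 + max((finish[p] for p in preds), default=0))
    return len(deps) / max(finish)   # ops per cycle on the critical path

# Six operations: op0-op2 are independent; op3 -> op4 -> op5 chain off op0.
deps = [[], [], [], [0], [3], [4]]
print(inherent_parallelism(deps))   # 6 ops / 4-cycle critical path
```

Replaying a real benchmark's dynamic dependence trace through such a scheduler gives an upper bound on what any thread execution model could extract.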
41. Other Issues Related to Multithreaded Architectures
- Value Prediction for Speculative Multithreaded Architectures, by Marcuello, Tubella, and Gonzalez, MICRO, 1999
- Correctly Implementing Value Prediction in Microprocessors that Support Multithreading or Multiprocessing, by Martin, et al., MICRO 2001
- The Need for Fast Communication in Hardware-Based Speculative Chip Multiprocessors, by Krishnan and Torrellas, PACT 1999