1. On Recent Development in High-Performance Computer Architectures and Compilation Techniques (Multithreading)
- Prof. Pen-Chung Yew
- Dept. of Computer Science and Engineering
- University of Minnesota
- http://www.cs.umn.edu/Research/Agassiz
2. Outline
- Superscalar/superpipeline architectures
- Ways to further improve performance
- Multithreaded processor architectures
- Performance enhancement through control speculation
- Performance enhancement through value prediction
- Performance enhancement through dynamic recompilation (not included)
3. Current Directions in Computer Design
- Continue to strive for higher performance
- Low-power design to improve the performance/power ratio
- Improve dependability: small feature sizes increase error rates
4. Trends in Hardware Technology
- Available transistors on a single chip continue to grow
- More functionality on a single chip, but need to consider performance/power tradeoffs (Fig. 1-3)
- Exploiting more parallelism (ILP/TLP)
- Clock rate increase limited more by wire delay than by gate delay
- Current single, large-window, wide-issue architectures are not scalable
- Use deeper pipelines, multiple simple processing units on a chip, or multithreaded architectures
- New high-performance special-purpose (embedded) architectures
5. Software Technology Trends
- Rely more on compiler technology to simplify hardware complexity and to improve efficiency
- Profile-based techniques and dynamic compilation are gaining significance
- Fine-grained (instruction-level) parallelism is no longer sufficient for large issue rates
- Need to exploit medium-grained (loop-level or procedure-level) parallelism
6. Strength and Limitation of Existing Approaches - Superscalar
- Strength
- Multiple issue per clock cycle
- Accurate and effective runtime dependence checking
- Out-of-order execution for high ILP
- HW-supported instruction-level speculation on values, dependences, and branch outcomes
7. Strength and Limitation of Existing Approaches - Superscalar
- Limitations
- A single instruction window is not easy to scale (limited by wire delay)
- Limited to instruction-level parallelism only (not enough parallelism)
- Compilers exploit mostly innermost loops
8. Strength and Limitation of Existing Approaches - Multiprocessor on a Chip (MOAC)
- Strength
- More scalable (allows shorter wires)
- More flexible (various interconnect schemes)
- Exploits primarily loop-level parallelism
- Very good parallelizing compiler technology available
9. Strength and Limitation of Existing Approaches - Multiprocessor on a Chip (MOAC)
- Limitations
- Communication and synchronization through memory
- Higher overhead
- Need larger granularity
- Exploit parallel loops (Doall) only
- May need a coherent cache
- Primarily for scientific, not general-purpose, applications
10. Strength and Limitation of Existing Approaches - VLIW
- Strength
- Relatively simple HW support
- Uses the compiler to resolve dependences at compile time and schedule instructions
- Trace scheduling
- Software pipelining
- Extensive use of profiling information in the compiler
11. Strength and Limitation of Existing Approaches - VLIW
- Limitations
- Code is not portable and may expand to a large size
- Independent instructions are executed in lockstep
- Limited to ILP only
- Compiler exploits only innermost loops
- Figures
12. Challenges in General-Purpose Applications
- Mostly Do-while loops
- Need thread-level speculation
- Not available in current multiprocessors
- Parallelism exists in outer loops
- Need thread-level support
- N/A in superscalar
- Good in multiprocessors
- Pointers/aliases complicate dependence analysis
- Need runtime dependence checks both within and between threads, or use data speculation
- N/A in multiprocessor and superscalar
13. Challenges in General-Purpose Applications (Cont.)
- Small basic blocks in straight-line code
- Need instruction-level speculative/predicated execution, branch prediction, profiling-based compilation, and out-of-order execution
- Very well developed in superscalars
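As a concrete illustration of the branch prediction that superscalars rely on, here is a minimal sketch of a generic 2-bit saturating-counter predictor. The scheme, table size, and loop-branch example are standard textbook assumptions for illustration, not details taken from these slides.

```python
# A minimal 2-bit saturating-counter branch predictor (illustrative only).
class TwoBitPredictor:
    def __init__(self, table_size=1024):
        self.table = [1] * table_size  # states 0-3; start "weakly not taken"

    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = pc % len(self.table)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

p = TwoBitPredictor()
outcomes = [True] * 9 + [False]   # a loop branch: taken 9 times, then exits
correct = 0
for _ in range(3):                # execute the loop three times
    for taken in outcomes:
        correct += (p.predict(0x40) == taken)
        p.update(0x40, taken)
print(correct, "of", 3 * len(outcomes), "predicted correctly")
```

After warm-up the counter mispredicts only the loop exits, which is why even this tiny scheme handles the small basic blocks of general-purpose code well.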
14. Challenges in General-Purpose Applications (Cont.)
- Many small loops, Doacross loops, and parallel sections
- Need fast, low-overhead communication/synchronization support
- Communication directly between register files, or through special memory buffers, without going through memory
- Not available in current MOAC
15. Multithreaded Architectures (To Improve Single-Program Speedup)
- A hybrid of superscalar and MOAC
- More scalable than superscalar
- More general-purpose than MOAC
- Supports multiple threads (multiple program counters)
- Handles loop-level parallelism (or parallel sections) more easily
- Not limited to innermost loops
16. Multithreaded Architectures (To Improve Single-Program Speedup)
- Support thread-level speculation (on data and/or control dependences)
- Handle Do-while loops and Doacross loops
- Support runtime dependence checking both within and between threads, or support data speculation
- Handle pointer/alias problems
- Allow fast communication/synchronization
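The thread-level speculation and runtime cross-thread checking listed above can be sketched in miniature. The Python model below is an assumed simplification, not any real machine's interface: tasks run speculatively against committed memory, each committing task's stores are compared against later tasks' load sets, and a violation squashes the offending task and everything after it (the simple squashing model). The op format and the `addcopy` operation are invented for illustration.

```python
# Sketch of thread-level data speculation with runtime dependence checking
# between threads (assumed model; op encodings are invented for this example).
def execute(task, mem):
    """Run one task against committed memory, buffering its stores.
    Ops: ('store', addr, val) or ('addcopy', src, dst, k): dst = src + k."""
    loads, stores = set(), {}
    for op in task:
        if op[0] == 'store':
            _, addr, val = op
            stores[addr] = val
        else:
            _, src, dst, k = op
            loads.add(src)
            stores[dst] = stores.get(src, mem.get(src, 0)) + k
    return loads, stores

def run(tasks, init):
    mem = dict(init)
    results = [execute(t, mem) for t in tasks]   # all tasks run speculatively
    squashes = 0
    for i in range(len(tasks)):
        loads, stores = results[i]
        mem.update(stores)                       # task i commits in program order
        for j in range(i + 1, len(tasks)):       # runtime cross-thread check
            if results[j][0] & set(stores):      # j loaded something i wrote
                squashes += len(tasks) - j       # squash j and everything after
                for k in range(j, len(tasks)):
                    results[k] = execute(tasks[k], mem)
                break
    return mem, squashes

tasks = [[('store', 'a', 5)],
         [('addcopy', 'a', 'b', 1)],   # b = a + 1
         [('addcopy', 'b', 'c', 1)]]   # c = b + 1
mem, squashes = run(tasks, {})
print(mem, squashes)                   # correct final values despite squashes
```

Even with every task initially reading stale data, the in-order commit plus runtime check converges to the sequential result, at the cost of re-executions.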
17. Multithreaded Architectures (To Improve Single-Program Speedup)
- Multiscalar
- Superthreaded processors
- NEC multithreaded processor
- Multiprocessor on a chip (MOAC)
- Trace processor
- Single-Program Speculative Multithreading (SPSM)
18. Other Multithreaded Architectures
- Simultaneous multithreading (SMT)
- Uses multiprogramming to improve throughput
- Does not consider scalability
- Use multithreading to hide memory latency (a single thread active at a time, e.g., the Tera computer)
- MaxPar tool set to study thread-level parallelism
19. Multiscalar
- Exploits thread-level parallelism through
- Control speculation (not stopped by branch instructions) - w/ hardware
- Data speculation (not stopped by data dependences between threads) - w/ hardware
- Relies on as much hardware support as possible
- References
- Multiscalar Processors, by Sohi, Breach, and Vijaykumar, Intl. Symp. on Computer Architecture (ISCA-22), 1995
- Speculative Versioning Cache, by Gopal, Vijaykumar, Smith, and Sohi, Intl. Symp. on High-Performance Computer Architecture (HPCA-4), 1998
20. Speculative Versioning Cache
- Speculative Versioning Cache, by Gopal, Vijaykumar, Smith, and Sohi, Intl. Symp. on High-Performance Computer Architecture (HPCA-4), 1998
- A design to support data dependence speculation
- Data dependences
- Memory dependences
- Register dependences: the compiler handles them
- Need to compute addresses before determining whether dependences exist
- Need to distinguish execution order vs. program order
21. Handling Data Dependences in a Single-Threaded Processor
- Load
- Need to know all previous stores before issuing a load (to avoid violating RAW dependences)
- Store
- Need to know (1) its address and (2) that all previous instructions are free of exceptions (to avoid changing machine state)
- Only a single version is maintained in the system
- Determining addresses before loads/stores forces a strict execution order and degrades ILP performance
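A small timing model makes the cost of this strict ordering visible. In the sketch below (an assumed model, not from the slides), a load may not issue until every earlier store's address has resolved, so one late-resolving store delays even independent loads.

```python
# Conservative (non-speculative) load issue: a load waits for the addresses
# of ALL earlier stores, even those to different locations. Assumed model.
def conservative_load_issue(insts):
    """insts: list of ('load'|'store', addr_ready_cycle), in program order.
    Returns {instruction index: cycle at which the load may issue}."""
    issue = {}
    latest_store_addr = 0
    for i, (kind, ready) in enumerate(insts):
        if kind == 'store':
            latest_store_addr = max(latest_store_addr, ready)
        else:
            issue[i] = max(ready, latest_store_addr)
    return issue

insts = [('store', 8),   # store whose address resolves late, at cycle 8
         ('load', 1),    # independent load, address ready at cycle 1
         ('load', 2)]    # independent load, address ready at cycle 2
res = conservative_load_issue(insts)
print(res)               # both loads are held until cycle 8
```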
22. To Improve ILP Performance
- Allow loads to bypass stores if they are to different memory locations
- Still need to know all their addresses before issuing loads/stores, which degrades ILP performance
- Issue a load as soon as its address is known, even before all previous stores are completed
- Data dependence speculation: assume all loads are to addresses different from stores
- Allow loads and stores to be executed out of order
- Maintain multiple versions of data in the same location
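With data dependence speculation, by contrast, a load issues as soon as its own address is ready and is replayed only when an earlier store later resolves to the same address. The sketch below uses a minimal assumed timing model (not from the slides) to show the difference.

```python
# Speculative load issue (assumed model): issue eagerly, squash and replay
# a load only on an actual same-address conflict with an earlier store.
def speculative_load_issue(insts):
    """insts: list of ('load'|'store', addr, addr_ready_cycle), program order.
    Returns {load index: (issue_cycle, replayed?)}."""
    out = {}
    for i, (kind, addr, ready) in enumerate(insts):
        if kind != 'load':
            continue
        issue, replay = ready, False       # speculate: no conflict
        for kindj, addrj, readyj in insts[:i]:
            # an earlier store to the same address resolving after we issued
            # forces a squash-and-replay after that store
            if kindj == 'store' and addrj == addr and readyj > issue:
                issue, replay = readyj + 1, True
        out[i] = (issue, replay)
    return out

insts = [('store', 0x100, 8),
         ('load', 0x200, 1),   # different address: issues at cycle 1
         ('load', 0x100, 2)]   # true conflict: replayed after the store
res = speculative_load_issue(insts)
print(res)
```

The independent load now issues at cycle 1 instead of waiting until cycle 8; only the genuinely conflicting load pays a replay penalty.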
23. Architecture Support Needed for Such Improvement
- A reordering buffer (queue) and an address comparison mechanism to maintain the program order of all loads and stores
- This is one simple form of speculative versioning in the single-threaded execution model
- Multithreaded execution model
- A hierarchical execution model
- Multiple streams of execution
- Intra-stream (same as in a single stream)
- Inter-stream (similar to the ARB; each entry maintains all versions of the same memory location)
24. Speculative Versioning Cache (SVC) - Introduction
- Limitations of the ARB
- A single shared queue for all threads (needs very large bandwidth for the queue)
- When a task completes, writeback is needed
- Creates bursty traffic
- Delays new task initiation
- SVC
- Uses a private-cache coherence protocol
- Cache hits: no bus traffic
- Task commits: no bursty writeback needed
25. Speculative Versioning Cache (SVC) - Introduction
- SVC tracks loads/stores from all threads
- A load must eventually read the data from the most recent store to the same location
- A load must be squashed and re-executed if it violates such an order
- All stores to the same location that follow the load must be buffered until the load is done (to guard against possible violations)
- Speculative versions of a memory location must eventually be committed in original program order
26. SVC Load Operation
- A load is executed as soon as its address becomes available
- Speculate that all stores are to different memory locations
- The closest version of the memory location from the same/different tasks is supplied to the load
- The load is recorded to check for a potential violation by an earlier store
- All stores after the load must be kept to prepare for a possible violation of the load (squashing)
27. SVC Store and Commit
- When a store is executed, it broadcasts the data to all later active tasks
- Any later task detecting a violation of any load (to the same memory location) must be squashed
- All tasks after the violating task are also squashed (simple squashing model)
- The oldest (non-speculative) task commits its memory state to architectural storage
- Commits involve logically copying from the SVC to architectural storage
- Example: Figure 2
- Solid arrowhead: program order; hollow arrowhead: execution order
- 1st snapshot: before task 1 executes a store and detects the violation
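The load, store, squash, and commit rules above can be condensed into a toy versioned memory. This Python sketch is an assumed simplification of SVC, not the paper's actual protocol: one version slot per task per location, loads recorded per task, and a store by an earlier task squashing any later task that already loaded the location.

```python
# Toy model of speculative versioning (assumed simplification, not the real
# SVC protocol). Locations must be initialized up front.
class VersionedMemory:
    def __init__(self, n_tasks, init):
        # versions[addr] maps task id -> value; task -1 holds committed state
        self.versions = {a: {-1: v} for a, v in init.items()}
        self.load_log = [set() for _ in range(n_tasks)]  # loads per task
        self.squashed = set()

    def load(self, task, addr):
        self.load_log[task].add(addr)    # record load for violation checks
        earlier = [t for t in self.versions[addr] if t <= task]
        return self.versions[addr][max(earlier)]  # closest earlier version

    def store(self, task, addr, val):
        self.versions[addr][task] = val
        # any later task that already read this location saw a stale version
        for t in range(task + 1, len(self.load_log)):
            if addr in self.load_log[t]:
                self.squashed.add(t)

    def commit(self, task):
        # the oldest task's versions become the committed (-1) state
        for vs in self.versions.values():
            if task in vs:
                vs[-1] = vs.pop(task)

vm = VersionedMemory(3, {'x': 0})
print(vm.load(2, 'x'))    # task 2 speculatively reads the committed value 0
vm.store(1, 'x', 7)       # an earlier task then stores: task 2 is squashed
print(vm.squashed)        # task 2 must be re-executed
print(vm.load(2, 'x'))    # on re-execution, task 2 sees version 7
```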
28. SVC Design
- Table 1: memory operations/commit/squash
- Multiprocessor caches maintain only one version of each datum across all caches; SVC allows multiple versions of the same datum across caches
- Use a version ordering list (VOL) to maintain program order among multiple versions of the same datum
- Use pointers, similar to SCI (Scalable Coherent Interface), to maintain a linked list implementing the VOL
29. SVC Design
- Figure 3: an SMP coherent cache system
- A cache line has three states: Invalid, Clean, Dirty (Store)
- Hit
- Load: to a clean/dirty cache line
- Store: to a dirty line
- Miss
- Otherwise, including a store hit to a clean copy
- A BusRead (BusWrite) request for a load (store) is issued to the bus
- Replacement
- BusWBack to cast out dirty cache lines
30. SVC Design
- Example: Figure 4
- Empty boxes: no data
- 1st snapshot: before Z issues a load
- 2nd snapshot: after the load is completed
- 3rd snapshot: Y issues a store
- 4th snapshot: Y replaces a cache line
31. Base SVC Design
- Figure 5
- Assumption: one word per cache line
- A load bit is added (Figure 6)
- In the initial invalid state, the load bit is set
- Used to detect load-before-store violations
- A pointer is added to point to the cache/processor that has the next copy/version of the same data, as described in the VOL; i.e., the VOL is implemented as a linked list
- Pointers point to caches/processors, not to tasks
32. Base SVC Design
- Add version control logic (VCL) to support the VOL
- Hit: no need to consult the VCL
- Miss: consult the VCL (Figure 5)
- Behaves like the ARB
- Load misses
- The VCL locates the closest previous version by searching the VOL in reverse order from the requestor
- The task assignment is consulted to determine the requestor's pointer into the VOL
- Example: Figure 7
- Task 2 issues a load causing a miss
- Y (not W) provides the data
33. Base SVC Design
- Task commits
- All dirty lines are immediately written back
- Need to maintain a list of stores executed by the task
- All other lines are invalidated
- Task squashes
- All cache lines are squashed
- Drawbacks of such a scheme
- Bursty writeback traffic is still not taken care of (need to distribute traffic over time)
- Clean cache lines are invalidated after commits, resulting in a cold cache for the next new task (should retain read-only data)
34. Superthreaded Processors
- Exploit thread-level parallelism through
- Control speculation (not stopped by branch instructions) - w/ hardware
- Data synchronization (not stopped by data dependences between threads) - w/ compiler
- Rely on as much compiler/software support as possible
- Reference
- The Superthreaded Processor Architecture, by Tsai, et al., IEEE Trans. on Computers, Sept. 1999
35. A Possible Hardware Implementation of Multithreaded Processors
- A 1GIPS 1W Single-Chip Tightly-Coupled Four-Way Multiprocessor with Architecture Support for Multiple Control Flow Execution, by Torii, et al., 2000 IEEE Intl. Solid-State Circuits Conf.
- One possible implementation of a hybrid of Multiscalar and Superthreaded
36. Single-Program Speculative Multithreading (SPSM)
- Single-Program Speculative Multithreading (SPSM) Architecture: Compiler-Assisted Fine-Grained Multithreading, by P.K. Dubey, et al., PACT 1995
- Advantages
- Disadvantages
37. Multiprocessor on a Chip (MOAC)
- Speculative Multithreaded Processors, by Marcuello, Gonzalez, and Tubella, ICS '98
- A Scalable Approach to Thread-Level Speculation, by Steffan, Colohan, Zhai, and Mowry, ISCA 2000
- Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor, by Krishnan and Torrellas, ICS '98
38. Simultaneous Multithreading (SMT) Architectures
- Simultaneous Multithreading: Maximizing On-Chip Parallelism, by Tullsen, Eggers, and Levy, ISCA-22, 1995
- Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading, by Eggers, et al., ACM TOCS, Aug. 1997
- Tuning Compiler Optimizations for Simultaneous Multithreading, by Lo, et al., MICRO-30, 1997
39. Latency Hiding Using Multithreading: Tera Computer
- Trading parallelism for memory latency
- Needs very large memory bandwidth to support this scheme
- Trades memory latency for memory bandwidth
- Tera is too ambitious
- New architecture, new device technology, and new compiler technology
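The parallelism-for-latency trade can be quantified with a standard back-of-the-envelope model. The formula below is a textbook-style assumption for illustration, not a figure from Tera documentation: a processor switches threads on every memory reference, with R compute cycles between references and L cycles of memory latency.

```python
# Back-of-the-envelope latency-hiding model (assumed, not from the slides):
# R = compute cycles between memory references, L = memory latency (cycles),
# N = thread count. Utilization = min(1, N*R / (R + L)).
import math

def utilization(n_threads, run_length, latency):
    return min(1.0, n_threads * run_length / (run_length + latency))

def threads_to_hide(run_length, latency):
    # smallest N with N*R >= R + L
    return math.ceil((run_length + latency) / run_length)

print(utilization(1, 10, 90))    # a single thread stalls 90% of the time
print(threads_to_hide(10, 90))   # threads needed to hide the latency fully
print(utilization(10, 10, 90))   # with that many threads, utilization is 1.0
```

The model also shows why the scheme trades latency tolerance for bandwidth: at full utilization the memory system must sustain one reference every R cycles.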
40. How Much Potential Thread-Level Parallelism Is in SPEC Benchmarks?
- MaxPar simulation tools
- How do they work?
- Different machine models suitable to be simulated by such an approach
- Some simulation results
- Useful tools for
- Identifying thread-level parallelism
- Studying various thread execution models using real benchmarks
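One plausible way such a limit study can measure inherent parallelism (an assumed mechanism for illustration; the slides leave the question open): schedule every operation at the earliest cycle its data dependences allow, then take total operations divided by the critical-path length.

```python
# Dataflow-limited parallelism measurement (assumed MaxPar-style mechanism):
# each op finishes one cycle after its latest-finishing dependence.
def inherent_parallelism(deps):
    """deps[i] = list of operation indices (< i) that op i depends on."""
    finish = []
    for preds in deps:
        finish.append(1 + max((finish[p] for p in preds), default=0))
    return len(deps) / max(finish)   # ops per cycle on the critical path

# Six operations: op0-op2 are independent; op3 -> op4 -> op5 chain off op0.
deps = [[], [], [], [0], [3], [4]]
print(inherent_parallelism(deps))   # 6 ops / 4-cycle critical path
```

Replaying a real benchmark's dynamic dependence trace through such a scheduler gives an upper bound on what any thread execution model could extract.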
41. Other Issues Related to Multithreaded Architectures
- Value Prediction for Speculative Multithreaded Architectures, by Marcuello, Tubella, and Gonzalez, MICRO, 1999
- Correctly Implementing Value Prediction in Microprocessors that Support Multithreading or Multiprocessing, by Martin, et al., MICRO 2001
- The Need for Fast Communication in Hardware-Based Speculative Chip Multiprocessors, by Krishnan and Torrellas, PACT 1999