1
On Recent Development in High-Performance
Computer Architectures and Compilation
Techniques (Multithreading)
  • Prof. Pen-Chung Yew
  • Dept. of Computer Science and Engineering
  • University of Minnesota
  • http://www.cs.umn.edu/Research/Agassiz

2
Outline
  • Superscalar/superpipeline architectures
  • Ways to further improve performance
  • Multithreaded Processor Architectures
  • Performance enhancement through control
    speculation
  • Performance enhancement through value prediction
  • Performance enhancement through dynamic
    recompilation (not included)

3
Current Directions in Computer Design
  • Continue to strive for higher performance
  • Low power design to improve performance/power
    ratio
  • Improve dependability: smaller feature sizes
    increase error rates

4
Trends in Hardware Technology
  • Available transistors on a single chip continue
    to grow
  • More functionalities on a single chip, but need
    to consider performance/power tradeoffs (Fig.
    1-3)
  • Exploring more parallelism (ILP/TLP)
  • Clock rate increase limited more by wire delay
    than by gate delay
  • Current single, large-window, wide-issue
    architectures not scalable
  • Use deeper pipeline, multiple simple processing
    units on a chip, or multithreaded architectures.
  • New high-performance special purpose (embedded)
    architectures

5
Software Technology Trends
  • Rely more on compiler technology to simplify
    hardware complexity and to improve efficiency
  • Profile-based techniques and dynamic compilation
    are gaining significance
  • Fine-grained (instruction-level) parallelism is
    no longer sufficient for large issue rates - need
    to exploit medium-grained (loop-level or
    procedure-level) parallelism (sketched below)
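
A minimal C sketch of this point (array names and sizes are my own,
for illustration): the inner loop offers only fine-grained ILP, while
each outer-loop iteration is an independent, thread-sized unit of
medium-grained, loop-level parallelism.

    #include <stdio.h>
    #define N 1024

    int main(void) {
        static double a[N][N], sum[N];
        /* Each i iteration touches a disjoint row a[i][] and a
         * disjoint sum[i], so the outer loop could run as threads. */
        for (int i = 0; i < N; i++) {     /* medium-grained candidate */
            double s = 0.0;
            for (int j = 0; j < N; j++)   /* fine-grained ILP only    */
                s += a[i][j];
            sum[i] = s;
        }
        printf("%f\n", sum[0]);
        return 0;
    }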

6
Strengths and Limitations of Existing Approaches -
Superscalar
  • Strengths
  • Multiple-issue per clock cycle
  • Accurate and effective runtime dependence check
  • Out-of-order execution for high ILP
  • HW supported instruction-level speculation on
    values, dependences, and branch outcomes

7
Strengths and Limitations of Existing Approaches -
Superscalar
  • Limitations
  • A single instruction window is not easy to scale
    (limited by wire delay)
  • Limited to only instruction-level parallelism
    (not enough parallelism)
  • Compilers exploit mostly innermost loops

8
Strengths and Limitations of Existing Approaches -
Multiprocessor on a Chip (MOAC)
  • Strengths
  • More scalable (allow shorter wires)
  • More flexible (various interconnect schemes)
  • Exploit primarily loop-level parallelism
  • Very good parallelizing compiler technology
    available

9
Strengths and Limitations of Existing Approaches -
Multiprocessor on a Chip (MOAC)
  • Limitations
  • Communication and sync through memory
  • Higher overhead
  • Need larger granularity
  • Exploit parallel loops (Doall) only
  • May need a coherent cache
  • Primarily for scientific, not general-purpose
    applications

10
Strengths and Limitations of Existing
Approaches - VLIW
  • Strengths
  • Relatively simple HW support
  • Using compiler to resolve dependences at compile
    time and schedule instructions
  • Trace scheduling
  • Software pipelining
  • Extensive use of profiling information in compiler

11
Strengths and Limitations of Existing
Approaches - VLIW
  • Limitations
  • Code not portable, may expand to a large size
  • Independent instructions are executed in
    lockstep
  • Limited to ILP only
  • Compiler exploits only innermost loops
  • Figures

12
Challenges in General-Purpose Applications
  • Mostly Do-while loops
  • Need thread-level speculation
  • not available in current multiprocessors
  • Parallelism exists in outer loops
  • Need thread-level support
  • N/A in superscalar
  • Good in multiprocessors
  • Pointers/aliases complicate dependence analysis
    (sketched below)
  • Need runtime dependence check both within and
    between threads, or use data speculation
  • N/A in multiprocessor and superscalar
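
A minimal C sketch of the kind of loop meant here (names are mine): a
pointer-chasing Do-while whose trip count and cross-iteration
dependences cannot be resolved at compile time, which is why
thread-level and data speculation are needed.

    #include <stdio.h>

    struct node { int val; struct node *next; };

    /* Precondition: p != NULL (a Do-while body runs at least once). */
    int sum_list(struct node *p) {
        int s = 0;
        do {                     /* trip count unknown at compile time */
            s += p->val;         /* p may alias other heap data, so    */
            p = p->next;         /* cross-iteration independence can't */
        } while (p != NULL);     /* be proven statically               */
        return s;
    }

    int main(void) {
        struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
        printf("%d\n", sum_list(&a));   /* prints 6 */
        return 0;
    }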

13
Challenges in General-Purpose Applications (Cont.)
  • Small basic blocks in straight-line code
  • Need instruction-level speculative/predicated
    execution, branch prediction, profiling-based
    compilation, out-of-order execution
  • very well developed in superscalars

14
Challenges in General-Purpose Applications (Cont.)
  • Many small loops, Doacross loops and parallel
    sections
  • Need fast, low-overhead
    communication/synchronization support
  • Communication directly between register files, or
    special memory buffers, without going through
    memory (Doacross example sketched below)
  • not available in current MOAC
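
A minimal Doacross sketch in C (values are illustrative): iteration i
consumes a value produced by iteration i-1, so threads running
successive iterations must forward one value per iteration; forwarding
through registers or special buffers, rather than memory, keeps the
overhead low enough to profit.

    #include <stdio.h>
    #define N 8

    int main(void) {
        double a[N] = { 1.0 };
        for (int i = 1; i < N; i++)          /* cross-iteration RAW:  */
            a[i] = a[i - 1] * 0.5 + 1.0;     /* a[i] needs a[i-1]     */
        printf("%f\n", a[N - 1]);
        return 0;
    }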

15
Multithreaded Architectures (To Improve
Single-Program Speedup)
  • A hybrid of superscalar and MOAC
  • More scalable than superscalar
  • More general-purpose than MOAC
  • Supporting multiple threads (multiple program
    counters)
  • Handle loop-level parallelism (or parallel
    sections) more easily
  • Not limited to innermost loops

16
Multithreaded Architectures (To Improve
Single-Program Speedup)
  • Support thread-level speculation (on data and/or
    control dependences)
  • Handle Do-while loops and Doacross loops
  • Support runtime dependence checking both within
    and between threads or support data speculation
  • Handle pointer/alias problems
  • Allow fast communication/synchronization

17
Multithreaded Architectures (To Improve Single
Program Speedup)
  • Multiscalar
  • Superthreaded Processors
  • NEC multithreaded processor
  • Multiprocessor on a chip (MOAC)
  • Trace processor
  • Single-Program Speculative Multithreading (SPSM)

18
Other Multithreaded Architectures
  • Simultaneous multithreading (SMT)
  • Use multiprogramming to improve throughput
  • Does not consider scalability
  • Use multithreading to hide memory latency (single
    thread active at a time, e.g., Tera Computer)
  • MaxPar tool set to study thread-level
    parallelism

19
Multiscalar
  • Exploit thread-level parallelism through
  • Control speculation (not stopped by branch
    instructions) - w/ hardware
  • Data speculation (not stopped by data dependences
    between threads) - w/ hardware
  • Rely on as much hardware support as possible
  • References
  • Multiscalar Processors, by Sohi, Breach, and
    Vijaykumar, Int'l Symp. on Computer Architecture
    (ISCA-22), 1995
  • Speculative Versioning Cache, by Gopal, Vijaykumar,
    Smith, and Sohi, Int'l Symp. on High-Performance
    Computer Architecture (HPCA-4), 1998

20
Speculative Versioning Cache
  • Speculative Versioning Cache, by Gopal, Vijaykumar,
    Smith, and Sohi, Int'l Symp. on High-Performance
    Computer Architecture (HPCA-4), 1998
  • A design to support data dependence speculation
  • Data dependences
  • Memory dependences
  • Register dependences: the compiler handles them
  • Need to compute addresses before determining
    whether dependences exist
  • Need to distinguish execution order vs. program
    order

21
Handling DD in Single-Threaded Processor
  • Load
  • Need to know all previous stores before issuing a
    load (avoid violating RAW dependences)
  • Store
  • Need to know (1) its address and (2) that all
    previous instructions complete without exceptions
    (avoid changing machine state)
  • Only a single version is maintained in the system
  • Determining addresses before loads/stores forces a
    strict execution order, which degrades ILP
    performance (RAW sketch below)
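
A minimal C illustration of the RAW-through-memory problem (function
and variable names are mine): if p and q happen to refer to the same
location, the load must observe the earlier store, yet the addresses
are known only at run time, so hardware must compare them before
letting the load proceed.

    #include <stdio.h>

    void raw(int *p, int *q, int *out) {
        *p = 42;          /* store                              */
        *out = *q + 1;    /* load: must see the 42 when q == p  */
    }

    int main(void) {
        int x = 0, y = 0;
        raw(&x, &x, &y);              /* aliased case: y becomes 43 */
        printf("%d\n", y);
        return 0;
    }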

22
To Improve ILP Performance
  • Allow loads to bypass stores if they are to
    different memory locations
  • Still need to know all their addresses before
    issuing loads/stores, which degrades ILP performance
  • Issue a load as soon as its address is known even
    before all previous stores are completed
  • Data dependence speculation: assume all loads are
    to addresses different from those of pending stores
    (sketched below)
  • Allow loads and stores to be executed out of
    order
  • Maintaining multiple versions of data in the same
    location
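
A toy software model of data dependence speculation (my sketch, not
the hardware itself): issue the load early assuming no conflict with
pending stores, then verify once the store address resolves; on a
match, squash and re-execute the load.

    #include <stdio.h>
    #include <stdbool.h>

    int mem[16];

    int main(void) {
        int store_addr = 5, store_val = 99;
        int load_addr  = 5;

        /* 1. Speculative load, issued before the store address is
         *    known (assume no dependence). */
        int spec_val = mem[load_addr];

        /* 2. The earlier store now resolves and executes. */
        mem[store_addr] = store_val;

        /* 3. Verify: did the store overlap the speculated load? */
        bool misspeculated = (store_addr == load_addr);
        int val = misspeculated ? mem[load_addr]  /* squash, redo    */
                                : spec_val;       /* guess was right */
        printf("loaded %d (misspeculated: %d)\n", val, misspeculated);
        return 0;
    }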

23
Architecture Support Needed for Such Improvement
  • A re-ordering buffer (queue) and address
    comparison mechanism to maintain program order of
    all loads and stores
  • This is one simple form of speculative versioning
    in a single-threaded execution model
  • Multi-threaded execution model
  • A hierarchical execution model
  • Multiple streams of execution
  • Intra-stream (same as in a single stream)
  • Inter-stream (similar to the ARB, or Address
    Resolution Buffer, where each entry maintains all
    versions of the same memory location; entry
    sketched below)
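
A hypothetical C layout of one such inter-stream entry (field names
are mine, modeled on the ARB idea): one row per memory address, with
per-task load/store bits and one buffered version per task, so
program order among tasks can be checked on every access.

    #define MAX_TASKS 4

    struct arb_entry {
        unsigned      addr;                /* memory address of entry */
        unsigned char loaded[MAX_TASKS];   /* task issued a load here */
        unsigned char stored[MAX_TASKS];   /* task buffered a version */
        int           value[MAX_TASKS];    /* one version per task    */
    };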

24
Speculative Versioning Cache (SVC): Introduction
  • Limitations of the ARB
  • A single shared queue for all threads (needs very
    large bandwidth for the queue)
  • When a task completes, writeback is needed
  • Creates bursty traffic
  • Delays new task initiation
  • SVC
  • Uses a private cache coherence protocol
  • Cache hits: no bus traffic
  • Task commits: no need for bursty writeback

25
Speculative Versioning Cache (SVC): Introduction
  • SVC tracks loads/stores from all threads
  • A load must eventually read the data from the
    most recent store to the same location
  • A load must be squashed and re-executed if it
    violates such an order
  • All stores to the same location that follow the
    load must be buffered until the load is done (to
    guard against possible violation)
  • Speculative versions of a memory location must
    eventually be committed in its original program
    order

26
SVC Load Operation
  • A load is executed as soon as its address becomes
    available
  • Speculate that all stores are to different memory
    locations
  • The closest version of the memory location from
    the same/different tasks is supplied to the load
  • The load is recorded to check for potential
    violation from an earlier store
  • All stores after the load must be buffered in case
    the load turns out to be violated (squashing); a
    load-rule sketch follows
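
A minimal C sketch of the load rule, under my own simplifying
assumptions (tasks numbered in program order, at most one buffered
version per task): the load records itself via a load bit, then takes
the closest version from a task at or before itself.

    #include <stdio.h>
    #define TASKS 4

    int versions[TASKS];   /* buffered version per task, if any */
    int valid[TASKS];      /* task holds a buffered version     */
    int load_bit[TASKS];   /* task loaded this location early   */

    int svc_load(int task) {
        load_bit[task] = 1;              /* record the early read    */
        for (int t = task; t >= 0; t--)  /* closest previous version */
            if (valid[t]) return versions[t];
        return 0;                        /* fall back to committed state */
    }

    int main(void) {
        valid[0] = 1; versions[0] = 7;   /* task 0 stored a version  */
        printf("%d\n", svc_load(2));     /* task 2 loads: sees 7     */
        return 0;
    }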

27
SVC Store and Commit
  • When a store is executed, it broadcasts the data
    to all later active tasks
  • Any later task that detects a violation of one of
    its loads (to the same memory location) must be
    squashed
  • All tasks after the violating task are also
    squashed (simple squashing model)
  • The oldest (non-speculative) task commits its
    memory state to architectural storage
  • Commits involve logically copying from the SVC to
    architectural storage
  • Example: Figure 2
  • Solid arrowheads: program order; hollow
    arrowheads: execution order
  • 1st snapshot: before task 1 executes a store and
    detects a violation (store/squash sketch below)
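
A matching C sketch of the store rule (same simplified model as the
load sketch above): the store buffers a new version, then checks the
load bit of every later task; the first later task that already read
this location used a stale value, so it and all tasks after it are
squashed.

    #include <stdio.h>
    #define TASKS 4

    int versions[TASKS], valid[TASKS], load_bit[TASKS];

    int svc_store(int task, int val) {
        versions[task] = val;                /* buffer new version */
        valid[task] = 1;
        for (int t = task + 1; t < TASKS; t++)
            if (load_bit[t])
                return t;   /* violation: squash tasks t..TASKS-1 */
        return -1;          /* no violation detected              */
    }

    int main(void) {
        load_bit[2] = 1;                    /* task 2 loaded early  */
        printf("squash from task %d\n",     /* earlier task stores: */
               svc_store(1, 42));           /* prints 2             */
        return 0;
    }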

28
SVC Design
  • Table 1: memory operations/commit/squash
  • Multiprocessor caches maintain only one version of
    each datum across all caches; SVC allows multiple
    versions of the same datum across caches
  • Use a version ordering list (VOL) to maintain
    program order among multiple versions of the same
    data
  • Use pointers, similar to SCI (Scalable Coherent
    Interface), to maintain a linked list that
    implements the VOL

29
SVC Design
  • Figure 3: an SMP coherent cache system
  • A cache line has three states: Invalid, Clean,
    and Dirty (entered by a store)
  • Hit
  • Load: to a Clean/Dirty cache line
  • Store: to a Dirty line
  • Miss
  • Otherwise, including a store to a Clean copy
  • A BusRead (BusWrite) request is issued on the bus
    for a load (store) miss (hit rule sketched below)
  • Replacement
  • BusWBack to cast out dirty cache lines
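
The hit rule above, restated as a minimal C predicate (my phrasing of
the bullets, not a full protocol model): loads hit on Clean or Dirty
lines, stores hit only on Dirty lines; everything else is a miss that
issues a BusRead (load) or BusWrite (store), and replacing a Dirty
line issues BusWBack.

    #include <stdio.h>

    typedef enum { INVALID, CLEAN, DIRTY } line_state;

    int is_hit(line_state s, int is_store) {
        return is_store ? (s == DIRTY)                /* store hit */
                        : (s == CLEAN || s == DIRTY); /* load hit  */
    }

    int main(void) {
        printf("%d %d\n", is_hit(CLEAN, 0),   /* load to Clean: hit   */
                          is_hit(CLEAN, 1));  /* store to Clean: miss */
        return 0;
    }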

30
SVC Design
  • Example: Figure 4
  • Empty boxes: no data
  • 1st snapshot: before Z issues a load
  • 2nd snapshot: after the load completes
  • 3rd snapshot: Y issues a store
  • 4th snapshot: Y replaces a cache line

31
Base SVC Design
  • Figure 5
  • Assumption: one word per cache line
  • A load bit is added (Figure 6)
  • In the initial Invalid state, the load bit is set
  • Used to detect load-before-store violations
  • A pointer is added that points to the
    cache/processor holding the next copy/version of
    the same data, as described by the VOL, i.e., the
    VOL is implemented as a linked list (line layout
    sketched below)
  • Pointers point to caches/processors, not to tasks
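
One way to picture the base SVC line in C (field names are mine; one
word per line, per the slide's assumption): the usual state plus a
load bit and a link to the cache/processor holding the next version,
so that the VOL forms a linked list across caches.

    struct svc_line {
        unsigned      tag;        /* line address tag               */
        unsigned char state;      /* Invalid / Clean / Dirty        */
        unsigned char load_bit;   /* set on load; flags violations  */
        int           data;       /* one word per line (assumption) */
        int           next_cache; /* VOL link: cache w/ next version */
    };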

32
Base SVC Design
  • Add version control logic (VCL) to support the VOL
  • Hit: no need to consult the VCL
  • Miss: consult the VCL (Figure 5)
  • Behaves like the ARB
  • Load misses
  • The VCL locates the closest previous version by
    searching the VOL in reverse order from the
    requestor
  • The task assignment is consulted to determine the
    requestor's position in the VOL
  • Example: Figure 7
  • Task 2 issues a load causing a miss
  • Y (not W) provides the data

33
Base SVC Design
  • Task commits
  • All dirty lines are immediately written back
  • Need to maintain a list of stores executed by the
    task
  • All other lines are invalidated
  • Task squashes
  • All cache lines are squashed
  • Drawbacks of such a scheme
  • Bursty writeback traffic is still not addressed
    (need to distribute traffic over time)
  • Clean cache lines are invalidated after a commit,
    resulting in a cold cache for the next task
    (should retain read-only data)

34
Superthreaded Processors
  • Exploit thread-level parallelism
  • Control speculation (not stopped by branch
    instructions) - w/ hardware
  • Data synchronization (not stopped by data
    dependences between threads) - w/ compiler
  • Rely on as much compiler/software support as
    possible
  • Reference
  • The Superthreaded Processor Architecture, by Tsai
    et al., IEEE Trans. on Computers, Sept. 1999

35
A Possible Hardware Implementation of Multithreaded
Processors
  • A 1GIPS 1W Single-Chip Tightly-Coupled Four-Way
    Multiprocessor with Architecture Support for
    Multiple Control Flow Execution, by Torii et al.,
    2000 IEEE Int'l Solid-State Circuits Conf.
  • One possible implementation of a hybrid of
    Multiscalar and Superthreaded

36
Single-Program Speculative Multithreading (SPSM)
  • Single-Program Speculative Multithreading (SPSM)
    Architecture: Compiler-Assisted Fine-Grained
    Multithreading, by P.K. Dubey et al., PACT 1995
  • Advantages
  • Disadvantages

37
Multiprocessor on a Chip (MOAC)
  • Speculative Multithreaded Processors, by Marcuello,
    Gonzalez, and Tubella, ICS '98
  • A Scalable Approach to Thread-Level Speculation,
    by Steffan, Colohan, Zhai, and Mowry, ISCA 2000
  • Hardware and Software Support for Speculative
    Execution of Sequential Binaries on a
    Chip-Multiprocessor, by Krishnan and Torrellas,
    ICS '98

38
Simultaneous Multithreading (SMT) Architectures
  • Simultaneous Multithreading: Maximizing On-Chip
    Parallelism, by Tullsen, Eggers, and Levy, ISCA-22,
    1995
  • Converting Thread-Level Parallelism to
    Instruction-Level Parallelism via Simultaneous
    Multithreading, by Eggers et al., ACM TOCS, Aug.
    1997
  • Tuning Compiler Optimizations for Simultaneous
    Multithreading, by Lo et al., MICRO-30, 1997

39
Latency Hiding Using Multithreading: Tera Computer
  • Trading parallelism for memory latency
  • Need very large memory bandwidth to support this
    scheme
  • Trading memory latency for memory bandwidth
  • Tera is too ambitious
  • New architecture, new device technology, and new
    compiler technology

40
How Much Potential Thread-Level Parallelism in
SPEC Benchmarks?
  • MaxPar simulation tools
  • How does it work?
  • Different machine models that can be simulated
    with such an approach
  • Some simulation results
  • Useful tools for
  • Identifying thread-level parallelism
  • Studying various thread execution models using
    real benchmarks

41
Other Issues Related to Multithreaded
Architectures
  • Value Prediction for Speculative Multithreaded
    Architectures, by Marcuello, Tubella, and Gonzalez,
    MICRO, 1999
  • Correctly Implementing Value Prediction in
    Microprocessors that Support Multithreading or
    Multiprocessing, by Martin et al., MICRO, 2001
  • The Need for Fast Communication in Hardware-Based
    Speculative Chip Multiprocessors, by Krishnan and
    Torrellas, PACT 1999