Lecture 3: Multithreaded Processors, an Overview

1
Lecture 3: Multithreaded Processors, an Overview
  • Instructor: Jun Yang

2
Motivation
  • To improve processor resource utilization.
  • Processor idle cycles mean idle functional
    units.
  • ILP isn't enough → exploit thread-level
    parallelism (TLP).
  • Run multiple threads on the same processor.

3
Categories
[Figure: execution-slot diagrams comparing Superscalar,
Fine-Grained, Coarse-Grained, Simultaneous Multithreading,
and Multiprocessing; the vertical axis is time (processor
cycles), and each issue slot is filled by one of Threads 1-5
or left as an idle slot.]
4
Resource Sharing and Context Switching
5
Categories
  • Independent threads are identified by the
    compiler.
  • Fine-Grained Multithreading (FGMT)
  • Switch to a different thread on every cycle.
  • Sacrifices single-thread performance for overall
    throughput.
  • CDC 6600 (1960s), Denelcor HEP (1970s), Tera MTA
    (more recent).
  • Coarse-Grained Multithreading (CGMT)
  • Switch contexts only when the current thread
    stalls on a long-latency event.
  • Makes the most sense on an in-order processor.
  • Switching penalty: shadowing all pipeline latches
    vs. not doing so.
  • Prevent starvation of threads by preemptively
    forcing a thread switch or by prioritizing threads.
  • IBM Northstar/Pulsar: 30% instruction
    throughput improvement at 10% area cost.
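
The CGMT switch-on-stall policy above can be sketched as a toy simulator. All names, the 10-cycle stall threshold, and the 2-cycle switch penalty are illustrative assumptions, not values from the lecture:

```python
# Sketch of coarse-grained multithreading (CGMT): run one thread
# until it stalls on a long-latency event, then switch contexts.

class Thread:
    def __init__(self, tid, trace):
        self.tid = tid        # thread id
        self.trace = trace    # per-instruction latency in cycles
        self.pc = 0           # index of the next instruction

    def done(self):
        return self.pc >= len(self.trace)


def run_cgmt(threads, stall_threshold=10, switch_penalty=2):
    """Run until all threads finish; switch only on long-latency stalls."""
    cycles = 0
    current = 0
    while any(not t.done() for t in threads):
        t = threads[current]
        if t.done():
            # Round-robin past finished threads.
            current = (current + 1) % len(threads)
            continue
        latency = t.trace[t.pc]
        t.pc += 1
        if latency >= stall_threshold:
            # Long-latency event (e.g. a cache miss): one cycle to issue,
            # then pay the switch penalty and run another thread while the
            # miss is (optimistically) serviced in the background.
            cycles += 1 + switch_penalty
            current = (current + 1) % len(threads)
        else:
            cycles += latency
    return cycles
```

Short-latency instructions never trigger a switch, which is exactly why CGMT preserves single-thread performance better than FGMT but needs the preemptive-switch safeguard above to avoid starving other threads.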

6
Categories
  • Simultaneous Multithreading (SMT, 1994-95)
  • Process instructions from different threads in
    every cycle.
  • Based on a modern out-of-order processor.
  • OOO execution allows instructions from different
    threads to mingle, maximizing resource utilization
    in the pipeline stages, reorder buffer, issue
    queue, load/store queue, etc.
  • Logical registers are renamed onto a shared
    physical register pool, removing the need to track
    threads when resolving data dependences.
  • SMT resource sharing
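
The shared-physical-pool renaming idea can be sketched as follows; the class and its capacity parameters are hypothetical, chosen only to show why downstream dependence checks need no thread IDs:

```python
# Sketch: per-thread rename tables mapping logical registers onto
# one shared physical register pool, as in an SMT core.

class Renamer:
    def __init__(self, num_threads, num_physical):
        self.free = list(range(num_physical))          # shared free list
        # One rename table per thread: logical reg -> physical reg.
        self.table = [dict() for _ in range(num_threads)]

    def rename_dest(self, tid, logical):
        """Allocate a fresh physical register for a destination."""
        phys = self.free.pop(0)       # drawn from the shared pool
        self.table[tid][logical] = phys
        return phys

    def rename_src(self, tid, logical):
        """Look up the current mapping for a source operand."""
        return self.table[tid].get(logical)
```

Because each (thread, logical register) pair maps to a distinct physical register, the issue queue and bypass logic compare only physical register numbers; inter-thread false dependences cannot arise.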

7
SMT Resource Sharing Alternatives
[Figure: alternative SMT pipeline organizations, varying which
stages (Fetch0/Fetch1, Decode, Rename, Issue, Ex, Mem,
Retire0/Retire1) are replicated per thread and which are shared
between threads.]
8
SMT Sharing of Pipeline Stages
  • Fetch
  • Time-share a single-ported I-cache: within a
    single stage, fetch multiple instructions from
    different threads.
  • Or dedicate a fetch stage to each thread.
  • Sharing the branch predictor would degrade
    performance greatly: global history and the return
    address stack are all mixed up by interleaved
    threads, → it is beneficial to replicate the branch
    predictor for each thread.
  • Decode
  • Threads are compiled with no dependences between
    them, → it makes sense to separate the decoders.
    However, this might hurt single-thread execution,
    which might need one large decoder (high ILP)
    instead of a partitioned one.
  • Renaming
  • Logical register names are disjoint across threads,
    → the rename table can be partitioned. Again, this
    might limit single-thread performance.
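
As one concrete fetch-sharing policy, the "Exploiting Choice" paper on the reading list proposes ICOUNT, which gives the shared fetch port each cycle to the thread with the fewest instructions in the pre-issue pipeline stages. A minimal sketch, with the function name and data layout assumed:

```python
# Sketch of an ICOUNT-style fetch policy: prioritize the thread
# that currently occupies the least decode/rename/issue capacity,
# so no single thread can clog the shared front end.

def pick_fetch_thread(in_flight):
    """in_flight: dict mapping tid -> count of pre-issue instructions.

    Returns the tid that should use the single fetch port this cycle.
    """
    return min(in_flight, key=lambda tid: in_flight[tid])
```

A thread stalled behind a cache miss accumulates in-flight instructions and so automatically loses fetch priority, steering the shared front end toward threads that are making progress.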

9
SMT Sharing of Pipeline Stages
  • Issue: wakeup-and-select
  • Wake up instructions that are data-ready, and
    select from the data-ready pool.
  • Wakeup is intra-thread.
  • Select must consider instructions from more than
    one thread.
  • Execution
  • The bypass (data-forwarding) network can be
    simplified: e.g., the cycle-time-critical
    ALU-output-to-ALU-input bypass can be relieved by
    executing instructions from different threads.
    This may compromise single-thread performance.
  • Memory
  • Sharing a load/store queue that resolves memory
    dependences is complex.
  • Certain applications may not permit forwarding
    data from a store in one thread to an aliased load
    in another thread, → the queue must be enhanced to
    be thread-aware.
  • Alternatively, provide a separate load/store queue
    for each thread, restricting sharing and
    per-thread performance.
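
The thread-aware forwarding rule from the Memory bullets can be sketched like this; the tuple layout and function are illustrative, not a real LSQ design:

```python
# Sketch of a thread-aware store-queue lookup: a load may forward
# only from an older store to the same address *in its own thread*;
# a matching address in another thread must not forward.

def forward_from_store_queue(store_queue, load_tid, load_addr):
    """store_queue: list of (tid, addr, value), oldest first.

    Returns the forwarded value, or None to read from cache/memory.
    """
    result = None
    for tid, addr, value in store_queue:
        if tid == load_tid and addr == load_addr:
            result = value   # youngest matching same-thread store wins
    return result
```

The extra `tid` compare on every entry is exactly the "thread-aware" enhancement the slide mentions; partitioning the queue per thread removes the compare at the cost of statically splitting the capacity.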

10
SMT Sharing of Pipeline Stages
  • Retire
  • Update the rename mapping, write results into
    logical registers, etc.
  • Can use a separate unit per thread, or a single
    unit shared in a fine-grained or coarse-grained
    manner.
  • Research to date does not make a clear case for
    any of the resource-sharing alternatives.
  • Pentium 4 SMT design (Hyper-Threading)
  • Supports 2 threads.
  • Shares most of the issue, execute, and memory
    stages.
  • Fine-grained sharing of the front-end and retire
    stages.
  • 16% to 28% throughput improvement for the Pentium 4
    design when running server workloads with
    abundant TLP.

11
Reading list
  • Simultaneous Multithreading: Maximizing On-Chip
    Parallelism
  • Dean Tullsen, Susan Eggers, and Henry Levy,
    Proceedings of the 22nd Annual International
    Symposium on Computer Architecture, June 1995,
    pages 392-403.
  • Exploiting Choice: Instruction Fetch and Issue on
    an Implementable Simultaneous Multithreading
    Processor
  • Dean Tullsen, Susan Eggers, Joel Emer, Henry
    Levy, Jack Lo, and Rebecca Stamm, Proceedings of
    the 23rd Annual International Symposium on
    Computer Architecture, May 1996, pages 191-202.
  • Converting Thread-Level Parallelism Into
    Instruction-Level Parallelism via Simultaneous
    Multithreading
  • Jack Lo, Susan Eggers, Joel Emer, Henry Levy,
    Rebecca Stamm, and Dean Tullsen, ACM Transactions
    on Computer Systems, August 1997, pages 322-354.
  • Exploiting Thread-Level Parallelism on
    Simultaneous Multithreaded Processors
  • Jack Lo's PhD thesis, 1998.

12
Reading list (continued)
  • Variability in Architectural Simulations of
    Multi-threaded Workloads
  • Alaa Alameldeen and David Wood, HPCA'03.
  • Mini-threads: Increasing TLP on Small-Scale SMT
    Processors
  • Joshua Redstone, Susan Eggers, and Henry Levy,
    HPCA'03.
  • Instruction Fetch Deferral using Static Slack
  • Gregory A. Muthler, David Crowe, Sanjay J. Patel,
    and Steven S. Lumetta, MICRO-35, 2002, pages 51-61.
  • Pointer Cache Assisted Prefetching
  • Jamison Collins, Suleyman Sair, Brad Calder, and
    Dean M. Tullsen, MICRO-35, 2002, pages 62-73.
  • Dynamic Speculative Precomputation
  • Jamison Collins, Dean Tullsen, Hong Wang, and
    John Shen, MICRO-34, 2001.
  • Handling Long-latency Loads in a Simultaneous
    Multithreading Processor
  • Dean M. Tullsen and Jeffery A. Brown, MICRO-34,
    2001.

13
Project 5
  • Develop a simple coarse-grained MT processor.
  • Simple ISA.
  • Four threads, separate register files, simple
    scheduling, a state machine, and context switching.
  • Single-issue, 5-6 stage pipeline.
  • Based on the existing 5-stage pipeline we
    developed in 203A; make the necessary modifications.
  • Based on the Intel IXP1200 Microengine. Will cover
    more in the next lecture.
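
One possible starting point for the scheduling state machine is sketched below; the state names, method names, and per-thread register-file layout are illustrative assumptions, not the actual project skeleton:

```python
# Sketch of a four-thread CGMT scheduler: each thread has its own
# register file; the state machine switches on long-latency stalls.

READY, RUNNING, STALLED = "READY", "RUNNING", "STALLED"

class Scheduler:
    def __init__(self, num_threads=4, num_regs=32):
        self.state = [READY] * num_threads
        # Separate architectural register file per thread.
        self.regfile = [[0] * num_regs for _ in range(num_threads)]
        self.current = 0
        self.state[0] = RUNNING

    def on_stall(self, tid):
        """Running thread hit a long-latency event: pick the next READY
        thread round-robin. Returns the new running tid, or None if all
        threads are stalled (the pipeline idles)."""
        self.state[tid] = STALLED
        n = len(self.state)
        for i in range(1, n + 1):
            cand = (tid + i) % n
            if self.state[cand] == READY:
                self.state[cand] = RUNNING
                self.current = cand
                return cand
        return None

    def on_wakeup(self, tid):
        """Stall resolved (e.g. the miss returned): thread is READY again."""
        if self.state[tid] == STALLED:
            self.state[tid] = READY
```

Keeping one register file per thread means a context switch only redirects which file the pipeline reads and writes; no register state is saved or restored, which is what makes the switch cheap enough to hide a cache miss.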