Lecture 3: Multithreaded Processors, an Overview

1
Lecture 3: Multithreaded Processors, an Overview
  • Instructor: Jun Yang

2
Motivation
  • To improve processor resource utilization.
  • Processor idle cycles mean idle functional
    units.
  • ILP isn't enough → exploit thread-level
    parallelism (TLP).
  • Run multiple threads on the same processor.

3
Categories
[Figure: execution-slot diagrams comparing Superscalar,
Fine-Grained, Coarse-Grained, Simultaneous Multithreading,
and Multiprocessing; the vertical axis is time (processor
cycles), and each issue slot is filled by one of Threads 1-5
or left as an idle slot.]
4
Resource Sharing and Context Switching
5
Categories
  • Independent threads are identified by the
    compiler.
  • Fine-Grained Multithreading (FGMT)
  • Switch to a different thread on every cycle.
  • Sacrifices single-thread performance for overall
    throughput.
  • CDC 6600 (1960s), Denelcor HEP (1970s), Tera MTA
    (more recent).
  • Coarse-Grained Multithreading (CGMT)
  • Switch contexts only when the current thread
    stalls on a long-latency event.
  • Makes the most sense on an in-order processor.
  • Switching penalty: shadowing all pipeline latches
    vs. not doing so.
  • Prevent starvation of threads by preemptively
    forcing a thread switch or by prioritizing threads.
  • IBM Northstar/Pulsar: 30% instruction
    throughput improvement at 10% area cost.
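
The CGMT switch-on-stall policy above can be sketched as a toy simulator. All names, the 10-cycle stall threshold, and the 2-cycle switch penalty are illustrative assumptions, not values from the lecture:

```python
# Sketch of coarse-grained multithreading (CGMT): run one thread
# until it stalls on a long-latency event, then switch contexts.

class Thread:
    def __init__(self, tid, trace):
        self.tid = tid        # thread id
        self.trace = trace    # per-instruction latency in cycles
        self.pc = 0           # index of the next instruction

    def done(self):
        return self.pc >= len(self.trace)


def run_cgmt(threads, stall_threshold=10, switch_penalty=2):
    """Run until all threads finish; switch only on long-latency stalls."""
    cycles = 0
    current = 0
    while any(not t.done() for t in threads):
        t = threads[current]
        if t.done():
            # Round-robin past finished threads.
            current = (current + 1) % len(threads)
            continue
        latency = t.trace[t.pc]
        t.pc += 1
        if latency >= stall_threshold:
            # Long-latency event (e.g. a cache miss): one cycle to issue,
            # then pay the switch penalty and run another thread while the
            # miss is (optimistically) serviced in the background.
            cycles += 1 + switch_penalty
            current = (current + 1) % len(threads)
        else:
            cycles += latency
    return cycles
```

Short-latency instructions never trigger a switch, which is exactly why CGMT preserves single-thread performance better than FGMT but needs the preemptive-switch safeguard above to avoid starving other threads.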

6
Categories
  • Simultaneous Multithreading (SMT, 1994-95)
  • Process instructions from different threads in
    every cycle.
  • Based on a modern out-of-order processor.
  • OOO execution allows instructions from different
    threads to mingle, maximizing resource utilization
    in the pipeline stages, reorder buffer, issue
    queue, load/store queue, etc.
  • Logical registers are renamed onto a shared
    physical register pool, removing the need to track
    threads when resolving data dependences.
  • SMT resource sharing
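
The shared-physical-pool renaming idea can be sketched as follows; the class and its capacity parameters are hypothetical, chosen only to show why downstream dependence checks need no thread IDs:

```python
# Sketch: per-thread rename tables mapping logical registers onto
# one shared physical register pool, as in an SMT core.

class Renamer:
    def __init__(self, num_threads, num_physical):
        self.free = list(range(num_physical))          # shared free list
        # One rename table per thread: logical reg -> physical reg.
        self.table = [dict() for _ in range(num_threads)]

    def rename_dest(self, tid, logical):
        """Allocate a fresh physical register for a destination."""
        phys = self.free.pop(0)       # drawn from the shared pool
        self.table[tid][logical] = phys
        return phys

    def rename_src(self, tid, logical):
        """Look up the current mapping for a source operand."""
        return self.table[tid].get(logical)
```

Because each (thread, logical register) pair maps to a distinct physical register, the issue queue and bypass logic compare only physical register numbers; inter-thread false dependences cannot arise.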

7
SMT Resource Sharing Alternatives
[Figure: alternative SMT pipeline organizations, varying which
stages (Fetch0/Fetch1, Decode, Rename, Issue, Ex, Mem,
Retire0/Retire1) are replicated per thread and which are shared
between threads.]
8
SMT Sharing of Pipeline Stages
  • Fetch
  • Time-share a single-ported I-cache: within a
    single stage, fetch multiple instructions from
    different threads.
  • Or dedicate a fetch stage to each thread.
  • Sharing the branch predictor would degrade
    performance greatly: global history and the return
    address stack are all mixed up by interleaved
    threads, → it is beneficial to replicate the branch
    predictor for each thread.
  • Decode
  • Threads are compiled with no dependences between
    them, → it makes sense to separate the decoders.
    However, this might hurt single-thread execution,
    which might need one large decoder (high ILP)
    instead of a partitioned one.
  • Renaming
  • Logical register names are disjoint across threads,
    → the rename table can be partitioned. Again, this
    might limit single-thread performance.
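
As one concrete fetch-sharing policy, the "Exploiting Choice" paper on the reading list proposes ICOUNT, which gives the shared fetch port each cycle to the thread with the fewest instructions in the pre-issue pipeline stages. A minimal sketch, with the function name and data layout assumed:

```python
# Sketch of an ICOUNT-style fetch policy: prioritize the thread
# that currently occupies the least decode/rename/issue capacity,
# so no single thread can clog the shared front end.

def pick_fetch_thread(in_flight):
    """in_flight: dict mapping tid -> count of pre-issue instructions.

    Returns the tid that should use the single fetch port this cycle.
    """
    return min(in_flight, key=lambda tid: in_flight[tid])
```

A thread stalled behind a cache miss accumulates in-flight instructions and so automatically loses fetch priority, steering the shared front end toward threads that are making progress.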

9
SMT Sharing of Pipeline Stages
  • Issue: wakeup-and-select
  • Wake up instructions that are data-ready, and
    select from the data-ready pool.
  • Wakeup is intra-thread.
  • Select must consider instructions from more than
    one thread.
  • Execution
  • The bypass (data-forwarding) network can be
    simplified: e.g., the cycle-time-critical
    ALU-output-to-ALU-input bypass can be relieved by
    executing instructions from different threads.
    This may compromise single-thread performance.
  • Memory
  • Sharing a load/store queue that resolves memory
    dependences is complex.
  • Certain applications may not permit forwarding
    data from a store in one thread to an aliased load
    in another thread, → the queue must be enhanced to
    be thread-aware.
  • Alternatively, provide a separate load/store queue
    for each thread, restricting sharing and
    per-thread performance.
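
The thread-aware forwarding rule from the Memory bullets can be sketched like this; the tuple layout and function are illustrative, not a real LSQ design:

```python
# Sketch of a thread-aware store-queue lookup: a load may forward
# only from an older store to the same address *in its own thread*;
# a matching address in another thread must not forward.

def forward_from_store_queue(store_queue, load_tid, load_addr):
    """store_queue: list of (tid, addr, value), oldest first.

    Returns the forwarded value, or None to read from cache/memory.
    """
    result = None
    for tid, addr, value in store_queue:
        if tid == load_tid and addr == load_addr:
            result = value   # youngest matching same-thread store wins
    return result
```

The extra `tid` compare on every entry is exactly the "thread-aware" enhancement the slide mentions; partitioning the queue per thread removes the compare at the cost of statically splitting the capacity.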

10
SMT Sharing of Pipeline Stages
  • Retire
  • Update the rename mapping, write results into
    logical registers, etc.
  • Can use a separate unit per thread, or a single
    unit shared in a fine-grained or coarse-grained
    manner.
  • Research to date does not make a clear case for
    any of the resource-sharing alternatives.
  • Pentium 4 SMT design (Hyper-Threading)
  • Supports 2 threads.
  • Shares most of the issue, execute, and memory
    stages.
  • Fine-grained sharing of the front-end and retire
    stages.
  • 16% to 28% throughput improvement for the Pentium 4
    design when running server workloads with
    abundant TLP.

11
Reading list
  • Simultaneous Multithreading: Maximizing On-Chip
    Parallelism
  • Dean Tullsen, Susan Eggers, and Henry Levy,
    Proceedings of the 22nd Annual International
    Symposium on Computer Architecture, June 1995,
    pages 392-403.
  • Exploiting Choice: Instruction Fetch and Issue on
    an Implementable Simultaneous Multithreading
    Processor
  • Dean Tullsen, Susan Eggers, Joel Emer, Henry
    Levy, Jack Lo, and Rebecca Stamm, Proceedings of
    the 23rd Annual International Symposium on
    Computer Architecture, May 1996, pages 191-202.
  • Converting Thread-Level Parallelism Into
    Instruction-Level Parallelism via Simultaneous
    Multithreading
  • Jack Lo, Susan Eggers, Joel Emer, Henry Levy,
    Rebecca Stamm, and Dean Tullsen, ACM Transactions
    on Computer Systems, August 1997, pages 322-354.
  • Exploiting Thread-Level Parallelism on
    Simultaneous Multithreaded Processors
  • Jack Lo's PhD thesis, 1998.

12
Reading list (continued)
  • Variability in Architectural Simulations of
    Multi-threaded Workloads
  • Alaa Alameldeen and David Wood, HPCA'03.
  • Mini-threads: Increasing TLP on Small-Scale SMT
    Processors
  • Joshua Redstone, Susan Eggers, and Henry Levy,
    HPCA'03.
  • Instruction Fetch Deferral using Static Slack
  • Gregory A. Muthler, David Crowe, Sanjay J. Patel,
    and Steven S. Lumetta, MICRO-35, 2002, pages 51-61.
  • Pointer Cache Assisted Prefetching
  • Jamison Collins, Suleyman Sair, Brad Calder, and
    Dean M. Tullsen, MICRO-35, 2002, pages 62-73.
  • Dynamic Speculative Precomputation
  • Jamison Collins, Dean Tullsen, Hong Wang, and
    John Shen, MICRO-34, 2001.
  • Handling Long-latency Loads in a Simultaneous
    Multithreading Processor
  • Dean M. Tullsen and Jeffery A. Brown, MICRO-34,
    2001.

13
Project 5
  • Develop a simple coarse-grained MT processor.
  • Simple ISA.
  • Four threads, separate register files, simple
    scheduling, a state machine, and context switching.
  • Single-issue, 5-6 stage pipeline.
  • Based on the existing 5-stage pipeline we
    developed in 203A; make the necessary modifications.
  • Based on the Intel IXP1200 Microengine. Will cover
    more in the next lecture.
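
One possible starting point for the scheduling state machine is sketched below; the state names, method names, and per-thread register-file layout are illustrative assumptions, not the actual project skeleton:

```python
# Sketch of a four-thread CGMT scheduler: each thread has its own
# register file; the state machine switches on long-latency stalls.

READY, RUNNING, STALLED = "READY", "RUNNING", "STALLED"

class Scheduler:
    def __init__(self, num_threads=4, num_regs=32):
        self.state = [READY] * num_threads
        # Separate architectural register file per thread.
        self.regfile = [[0] * num_regs for _ in range(num_threads)]
        self.current = 0
        self.state[0] = RUNNING

    def on_stall(self, tid):
        """Running thread hit a long-latency event: pick the next READY
        thread round-robin. Returns the new running tid, or None if all
        threads are stalled (the pipeline idles)."""
        self.state[tid] = STALLED
        n = len(self.state)
        for i in range(1, n + 1):
            cand = (tid + i) % n
            if self.state[cand] == READY:
                self.state[cand] = RUNNING
                self.current = cand
                return cand
        return None

    def on_wakeup(self, tid):
        """Stall resolved (e.g. the miss returned): thread is READY again."""
        if self.state[tid] == STALLED:
            self.state[tid] = READY
```

Keeping one register file per thread means a context switch only redirects which file the pipeline reads and writes; no register state is saved or restored, which is what makes the switch cheap enough to hide a cache miss.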