1
Simultaneous Multithreading: Maximising On-Chip
Parallelism
Dean Tullsen, Susan Eggers, Henry Levy
Department of Computer Science, University of
Washington, Seattle
Proceedings of ISCA '95, Italy
Presented by Amit Gaur
2
Overview
  • Instruction Level Parallelism vs. Thread Level
    Parallelism
  • Motivation
  • Simulation Environment and Workload
  • Simultaneous Multithreading Models
  • Performance Analysis
  • Extensions in Design
  • Single Chip Multiprocessing
  • Summary
  • Current Implementations
  • Retrospective

3
Instruction Level Parallelism
  • Superscalar processors
  • Shortcomings
  • a) Instruction dependencies
  • b) Long latencies within a single thread

4
Thread Level Parallelism
  • Traditional multithreaded architectures
  • Exploit parallelism at the application level
  • Multiple threads provide inherent parallelism
  • Attack vertical waste: memory and functional unit
    latencies
  • E.g. server applications, online transaction
    processing, web services

5
Need for Simultaneous Multithreading
  • Attack vertical as well as horizontal waste
  • Fetch instructions from multiple threads each
    cycle
  • Exploit all available parallelism for full
    utilization of execution resources
  • Decrease in wasted issue slots
  • Comparison with a superscalar, a fine-grain
    multithreaded processor, and single-chip,
    multiple-issue multiprocessors
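The vertical/horizontal waste distinction can be made concrete with a small sketch. The 4-wide machine and the issue trace below are invented for illustration, not taken from the paper:

```python
# Classify wasted issue slots on a hypothetical 4-wide superscalar.
# Each entry is the number of instructions issued that cycle (0..4).
# A cycle issuing nothing is vertical waste (the whole cycle is lost);
# a partially filled cycle contributes horizontal waste (unused slots).

ISSUE_WIDTH = 4

def waste_breakdown(issued_per_cycle):
    vertical = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
    horizontal = sum(ISSUE_WIDTH - n for n in issued_per_cycle
                     if 0 < n < ISSUE_WIDTH)
    total_slots = ISSUE_WIDTH * len(issued_per_cycle)
    return vertical / total_slots, horizontal / total_slots

# Invented trace: zeros model long memory latencies, partial cycles
# model limited ILP within a single thread.
trace = [0, 0, 2, 4, 0, 1, 3, 0]
v, h = waste_breakdown(trace)
print(f"vertical waste {v:.0%}, horizontal waste {h:.0%}")
```

Multithreading alone fills only the zero cycles (vertical waste); simultaneous multithreading also fills the leftover slots of partial cycles.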

6
Simulation Environment
  • Emulation-based instruction-level simulation
  • Modeled on the Alpha AXP 21164, extended for wide
    superscalar execution and multithreaded execution
  • Support for increased single-stream
    parallelism, more flexible instruction issue,
    improved branch prediction, and larger, higher
    bandwidth caches
  • Code generated using the Multiflow trace-scheduling
    compiler (static scheduling)

7
Simulation Environment(Continued)
  • 10 functional units (4 integer, 2 floating point,
    3 load/store, 1 branch)
  • All units pipelined
  • In-order issue of dependence-free instructions
    from an 8-instruction-per-thread window
  • L1 and L2 caches are on-chip
  • 2048-entry branch prediction history table with
    2-bit counters
  • Support for up to 8 hardware contexts
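The 2048-entry, 2-bit table can be sketched as an array of saturating counters indexed by branch-address bits. The indexing scheme and the loop-branch example below are generic illustrations, not the paper's exact design:

```python
# 2-bit saturating-counter branch predictor, as in the simulated
# 2048-entry branch history table. States 0-1 predict not-taken,
# 2-3 predict taken; each resolved branch nudges its counter by one.

TABLE_SIZE = 2048

class TwoBitPredictor:
    def __init__(self):
        self.table = [1] * TABLE_SIZE          # start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % TABLE_SIZE          # low-order address bits

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # True = taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch taken 9 times, then not taken at exit: the 2-bit
# counter mispredicts only twice (once warming up, once at the exit),
# which is why 2-bit counters beat 1-bit ones on loop branches.
bp = TwoBitPredictor()
misses = 0
for taken in [True] * 9 + [False]:
    if bp.predict(0x400) != taken:
        misses += 1
    bp.update(0x400, taken)
```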

8
Workload Specifications
  • SPEC92 benchmark suite simulated
  • To obtain TLP, a distinct program is allocated to
    each thread: a parallel workload based on
    multiprogramming
  • For each program, the executable with the lowest
    single-thread execution time is used

9
Limitations of Superscalar Processors
10
Superscalar Performance Degradation
  • Wasted cycles have many overlapping causes
  • Completely eliminating any one cause will not
    result in a large performance increase
  • 61% vertical waste and 39% horizontal waste
  • Tackle both using simultaneous multithreading

11
Simultaneous Multithreading Models
  • Fine-Grain Multithreading: one thread issues
    instructions in each cycle
  • SM: Full Simultaneous Issue: all eight threads
    compete for each issue slot, each cycle ->
    maximum flexibility
  • SM: Single Issue, SM: Dual Issue, SM: Four Issue:
    limit the number of instructions each thread can
    issue, or have active in the scheduling window,
    each cycle
  • SM: Limited Connection: each hardware context is
    connected to exactly one of each type of functional
    unit -> least dynamic of all models
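A toy issue-slot allocator makes the difference between the models visible. The per-thread ready counts are invented, and real behavior also depends on dependencies, latencies, and functional unit types:

```python
# Toy allocator contrasting fine-grain multithreading with the
# SM:Four Issue and SM:Full Simultaneous Issue models on an
# 8-wide machine (a sketch, not the paper's scheduler).

ISSUE_WIDTH = 8

def issue_cycle(ready, per_thread_cap):
    """ready[i] = instructions thread i could issue this cycle."""
    slots = ISSUE_WIDTH
    issued = 0
    for r in ready:                      # threads in priority order
        take = min(r, per_thread_cap, slots)
        issued += take
        slots -= take
    return issued

ready = [3, 1, 0, 5, 2, 0, 4, 1]         # invented 8-context snapshot

# Fine-grain MT: one thread owns the whole cycle -> 3 of 8 slots used.
fine_grain = min(ready[0], ISSUE_WIDTH)
# SM:Four Issue: at most 4 per thread, but threads combine to fill
# all 8 slots.
four_issue = issue_cycle(ready, 4)
# SM:Full Simultaneous Issue: no per-thread cap.
full_issue = issue_cycle(ready, ISSUE_WIDTH)
```

The capped models lose little here because several threads together still cover the issue width, matching the paper's observation that four-issue performs close to full simultaneous issue.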

12
Hardware Complexities of Models
13
Design Challenges in SMT processors
  • Issue-slot usage limited by imbalances in
    resource needs and resource availability
  • Number of active threads, limitations on buffer
    sizes, instruction mix from multiple threads
  • Hardware complexity: need to implement
    superscalar issue along with thread-level parallelism
  • Use of priority threads can result in throughput
    reduction, as the pipeline is less likely to have an
    instruction mix from different threads
  • Mixing many threads also compromises the performance
    of individual threads
  • Tradeoff: a small number of active threads, and an
    even smaller number of preferred threads

14
From Superscalar to SMT
  • SMT is an out-of-order superscalar extended with
    hardware to support multiple threads
  • Multiple-thread support:
  • a) per-thread program counters
  • b) per-thread return stacks
  • c) per-thread bookkeeping for instruction
    retirement, traps, and instruction dispatch from the
    prefetch queue
  • d) thread identifiers, e.g. with BTB and TLB
    entries
  • Should SMT processors speculate?
  • Determine the role of instruction speculation in SMT

15
Instruction Speculation
  • Speculation executes probable instructions to
    hide branch latencies
  • The processor fetches based on a hardware prediction
  • Correct prediction - keep going
  • Incorrect prediction - roll back
  • SMT has 2 ways to deal with branch-delay stalls:
  • a) Speculation
  • b) Fetch/issue from other threads
  • SMT and speculation:
  • Speculation can be wasteful on SMT, as one
    thread's speculative instructions can compete
    with, or replace, another thread's non-speculative
    instructions

16
Performance Evaluation of SMT
17
Performance Evaluation(Contd.)
  • Fine-grain MT: max speedup is 2.1x; no gain in
    vertical waste reduction after 4 threads
  • SMT models: speedup ranges from 3.5x to 4.2x, with
    the issue rate reaching 6.3 IPC
  • The 4-issue model gets nearly the same performance
    as full issue; dual issue is at 94% of full issue at
    8 threads
  • As the ratio of threads to issue slots increases,
    the performance of the models increases
  • Tradeoff between the number of hardware contexts
    and hardware complexity
  • Adverse effect of competition for shared
    resources -> the lowest-priority thread runs slowest
  • More strain on caches due to reduced locality:
    increase in I- and D-cache misses
  • Overall increase in instruction throughput

18
Extensions Alternative cache Design for SMT
  • Comparison of private per-thread L1 caches to
    shared caches, for instructions and data
  • Shared caches optimize for a small number of
    threads
  • The shared D-cache outperforms a private D-cache in
    all configurations
  • Private I-caches perform better at a high number
    of threads
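Why shared caches favor a small number of threads can be shown with a crude capacity argument. The cache size and working-set sizes below are invented, and real behavior also depends on associativity and inter-thread conflict misses:

```python
# Crude capacity comparison of shared vs. private L1 caches
# (sizes invented for illustration; conflict misses not modeled).

CACHE_KB = 64        # hypothetical total L1 capacity
NUM_CONTEXTS = 8

def working_sets_fit(working_sets_kb, shared):
    if shared:
        # A shared cache lets a few active threads use all capacity.
        return sum(working_sets_kb) <= CACHE_KB
    # Private caches give each context a fixed 1/8 slice,
    # even when most contexts are idle.
    slice_kb = CACHE_KB // NUM_CONTEXTS
    return all(ws <= slice_kb for ws in working_sets_kb)

# Two active threads, 24 KB each: the shared cache holds both
# (48 <= 64 KB), while each 8 KB private slice thrashes.
two_threads = [24, 24]
shared_ok = working_sets_fit(two_threads, shared=True)    # True
private_ok = working_sets_fit(two_threads, shared=False)  # False
```

With all eight contexts active and small working sets, the private partition fits too, which is consistent with private I-caches doing better at high thread counts (no cross-thread interference).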

19
Speculation in SMT
20
SMT vs. Single chip Multiprocessing
  • Similarities: use of multiple register sets,
    multiple functional units, and the need for high issue
    bandwidth on a single chip
  • Differences: the multiprocessor uses static
    allocation of resources; the SM processor allows
    resource allocation to change every cycle
  • Same configuration used for testing performance:
  • a) 8 KB private I-cache and D-cache
  • b) 256 KB 4-way set-associative L2 cache
  • c) 2 MB direct-mapped L3 cache
  • Attempt to bias the tests in favor of the MP

21
Test Results
22
Test Results(Contd.)
  • Tests A, B, C: a high ratio of FUs and threads to
    issue bandwidth gives greater opportunity to utilize
    the issue bandwidth
  • Test D repeats A, but the SMT processor has 10 FUs;
    it still outperforms the multiprocessor
  • Tests E, F: the MP is allowed greater issue bandwidth;
    even then, the SMT processor shows better performance
  • Test G: both have 8 FUs and 8 issues per
    cycle; however, the SMT processor has 8 contexts and
    the multiprocessor has 2 processors (2 register
    sets) - the SMT processor has 2.5x greater performance

23
Summary
  • Simultaneous multithreading combines the facilities
    of superscalar and multithreaded
    architectures
  • It has the ability to boost utilization of
    resources by dynamically scheduling functional
    units among multiple threads
  • Several models of SMT have been compared
    with wide superscalar, fine-grain
    multithreaded, and single-chip, multiple-issue
    multiprocessing architectures
  • The simulation results show that:
  • a) a simultaneous multithreaded architecture
    with the proper configuration can achieve 4 times the
    instruction throughput of a single-threaded wide
    superscalar with the same issue width
  • b) simultaneous multithreading outperforms
    fine-grain multithreading by a factor of 2
  • c) simultaneous multithreading is superior in
    performance to a multiple-issue multiprocessor,
    given the same hardware resources

24
Commercial Machines
  • MemoryLogix: an SMT processor for mobile devices
  • Sun Microsystems has announced a 4-SMT-processor
    CMP
  • Hyper-Threading Technology (Intel Xeon
    architecture)
  • Clearwater Networks, a Los Gatos-based startup,
    was building an 8-context SMT network processor
  • Compaq Computer Corp. designed a 4-context SMT
    processor, the Alpha 21464 (EV-8)

25
In Retrospect
  • The design of the SMT architecture was influenced by
    previous projects like the Tera, MIT Alewife, and
    M-Machine
  • SMT differed from previous projects in that it
    addressed a more complete goal than
    previous designs
  • The idea was to use thread-level parallelism
    to compensate for the lack of instruction-level
    parallelism
  • The aim was to target mainstream processor designs
    like the Alpha 21164