1
Hyper-Threading, Chip multiprocessors and both
  • Zoran Jovanovic

2
To Be Tackled in Multithreading
  • Review of Threading Algorithms
  • Hyper-Threading Concepts
  • Hyper-Threading Architecture
  • Advantages/Disadvantages

3
Threading Algorithms
  • Time-slicing
  • The processor switches between threads at fixed
    time intervals.
  • High overhead, especially if one of the threads
    is in the wait state. (Fine grain)
  • Switch-on-event
  • Task switching in case of long pauses
  • While waiting for data from a relatively slow
    source, CPU resources are given to other
    threads. (Coarse grain)

4
Threading Algorithms (cont.)
  • Multiprocessing
  • Distribute the load over many processors
  • Adds extra cost
  • Simultaneous multi-threading
  • Multiple threads execute on a single processor
    without switching.
  • Basis of Intel's Hyper-Threading technology.

5
Hyper-Threading Concept
  • At any point in time, only part of the processor's
    resources is used to execute a thread's code.
  • The unused resources can be kept busy, for example,
    by executing another thread/application in parallel
    (see the sketch below).
  • Extremely useful in desktop and server
    applications where many threads are used.
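
A minimal sketch of the concept in C++ (the workloads and sizes are
illustrative assumptions, not from the slides): two threads with
complementary demands, one integer-heavy and one FP-heavy, give an SMT
core a chance to keep both kinds of execution units busy at once.

    #include <cstdio>
    #include <thread>

    // Integer-heavy loop: mostly exercises the integer ALUs.
    static unsigned long long integer_work(unsigned long long n) {
        unsigned long long acc = 1;
        for (unsigned long long i = 1; i <= n; ++i)
            acc = acc * 6364136223846793005ULL + i;  // integer multiply-add
        return acc;
    }

    // FP-heavy loop: mostly exercises the floating-point units.
    static double fp_work(unsigned long long n) {
        double acc = 0.0;
        for (unsigned long long i = 1; i <= n; ++i)
            acc += 1.0 / static_cast<double>(i);     // FP divide and add
        return acc;
    }

    int main() {
        unsigned long long r1 = 0;
        double r2 = 0.0;
        // Run both concurrently; the two logical processors of one
        // physical core can occupy different execution units.
        std::thread t1([&] { r1 = integer_work(100000000ULL); });
        std::thread t2([&] { r2 = fp_work(100000000ULL); });
        t1.join();
        t2.join();
        std::printf("%llu %f\n", r1, r2);
    }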

6
Quick Recall: Many Resources IDLE!
For an 8-way superscalar.
From Tullsen, Eggers, and Levy, "Simultaneous
Multithreading: Maximizing On-chip Parallelism,"
ISCA 1995.
7
(No Transcript)
8
  • A superscalar processor with no multithreading
  • A superscalar processor with coarse-grain
    multithreading
  • A superscalar processor with fine-grain
    multithreading
  • A superscalar processor with simultaneous
    multithreading (SMT)

9
Simultaneous Multithreading (SMT)
  • Example: the new Pentium with Hyper-Threading
  • Key Idea: Exploit ILP across multiple threads!
  • i.e., convert thread-level parallelism into more
    ILP
  • exploit the following features of modern processors
  • multiple functional units
  • modern processors typically have more functional
    units available than a single thread can utilize
  • register renaming and dynamic scheduling
  • multiple instructions from independent threads
    can co-exist and co-execute!

10
Hyper-Threading Architecture
  • First used in the Intel Xeon MP processor
  • Makes a single physical processor appear as
    multiple logical processors (see the sketch below).
  • Each logical processor has a copy of the
    architecture state.
  • Logical processors share a single set of physical
    execution resources.
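
A small illustration in C++: std::thread::hardware_concurrency() reports
the number of logical processors the OS exposes, so on a Hyper-Threading
machine it typically returns twice the physical core count (the exact
value is platform-dependent, and 0 means the value is unknown).

    #include <cstdio>
    #include <thread>

    int main() {
        // One logical processor per copy of the architecture state:
        // e.g., a 4-core Hyper-Threading CPU typically reports 8.
        unsigned n = std::thread::hardware_concurrency();
        std::printf("logical processors: %u\n", n);
    }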

11
Hyper-Threading Architecture
  • Operating systems and user programs can schedule
    processes or threads to logical processors as if
    they were physical processors in a multiprocessor
    system (see the sketch below).
  • From an architecture perspective, we have to worry
    about the logical processors sharing resources:
  • Caches, execution units, branch predictors,
    control logic, and buses.
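
A hedged sketch of such scheduling in C++ (Linux/glibc-specific; the CPU
numbers 0 and 1 are assumptions, since which logical processors share a
physical core depends on the machine's topology):

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>
    #include <thread>

    // Pin the calling thread to one logical processor (Linux only).
    static void pin_to(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main() {
        // The OS schedules these onto logical processors 0 and 1 exactly
        // as it would onto two physical CPUs.
        std::thread a([] { pin_to(0); /* ... thread work ... */ });
        std::thread b([] { pin_to(1); /* ... thread work ... */ });
        a.join();
        b.join();
        std::puts("done");
    }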

12
Power5 dataflow ...
  • Why only two threads?
  • With 4, one of the shared resources (physical
    registers, cache, memory bandwidth) would likely
    become a bottleneck
  • Cost
  • The Power5 core is about 24% larger than the
    Power4 core because of the addition of SMT support

13
Advantages
  • The extra architecture state adds only about 5% to
    the total die area.
  • No performance loss if only one thread is active.
    Increased performance with multiple threads.
  • Better resource utilization.

14
Disadvantages
  • To take advantage of Hyper-Threading's performance,
    serial execution cannot be used.
  • Threads are non-deterministic and involve extra
    design effort
  • Threads have increased overhead
  • Shared resource conflicts

15
Multicore
  • Multiprocessors on a single chip

16
Basic Shared Memory Architecture
  • Processors all connected to a large shared memory
  • Where are caches?

[Figure: processors P1, P2, ..., Pn connected through an interconnect
to a shared memory]
  • Now let's take a closer look at structure, costs,
    limits, and programming

17
What About Caching???
  • Want high performance for shared memory? Use
    caches!
  • Each processor has its own cache (or multiple
    caches)
  • Place data from memory into cache
  • Write-back cache: don't send all writes over the
    bus to memory
  • Caches reduce average latency
  • Automatic replication closer to the processor
  • More important for multiprocessors than
    uniprocessors: latencies are longer
  • Normal uniprocessor mechanisms to access data
  • Loads and stores form a very low-overhead
    communication primitive
  • Problem: Cache Coherence! (see the sketch below)
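
A minimal sketch of loads and stores as a communication primitive in
C++ (the value 42 and the flag name are illustrative): the producer's
stores become visible to the consumer because the coherence hardware
propagates them between caches; std::atomic supplies the ordering that
C++ requires on top of that.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int>  data{0};
    std::atomic<bool> ready{false};

    int main() {
        // Producer: two ordinary-looking stores publish a value.
        std::thread producer([] {
            data.store(42, std::memory_order_relaxed);
            ready.store(true, std::memory_order_release);
        });
        // Consumer: loads spin until the flag's new value reaches
        // this core's cache.
        std::thread consumer([] {
            while (!ready.load(std::memory_order_acquire)) { /* spin */ }
            std::printf("got %d\n", data.load(std::memory_order_relaxed));
        });
        producer.join();
        consumer.join();
    }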

18
Example: Cache Coherence Problem
[Figure: processors P1, P2, P3 with private caches on a shared bus to
memory and I/O devices; numbered events read and write a shared
variable u]
  • Things to note
  • Processors could see different values for u after
    event 3 (see the sketch below)
  • With write-back caches, the value written back to
    memory depends on the happenstance of which cache
    flushes or writes back its value, and when
  • How to fix: with a bus, use a Coherence Protocol
  • Use the bus to broadcast writes or invalidations
  • Simple protocols rely on the presence of a
    broadcast medium
  • Bus not scalable beyond about 64 processors (max)
  • Capacity, bandwidth limitations
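
A hedged C++ rendering of the scenario (the initial value 5, the new
value 7, and the event order are assumptions for illustration): without
coherence at the hardware level, or synchronization at the language
level, each reader may see either value; in C++ this unsynchronized
access is formally a data race.

    #include <cstdio>
    #include <thread>

    int u = 5;  // the shared variable from the figure

    int main() {
        // One processor writes u while two others read it; which value
        // each reader observes depends on timing and on when the caches
        // are updated.  (Deliberately racy, for illustration only.)
        std::thread p3([] { u = 7; });
        std::thread p1([] { std::printf("P1 sees u = %d\n", u); });
        std::thread p2([] { std::printf("P2 sees u = %d\n", u); });
        p3.join();
        p1.join();
        p2.join();
    }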

19
Limits of Bus-Based Shared Memory
  • Assume
  • 1 GHz processor w/o cache
  • => 4 GB/s inst BW per processor (32-bit)
  • => 1.2 GB/s data BW at 30% load-store
  • Suppose 98% inst hit rate and 95% data hit rate
  • => 80 MB/s inst BW per processor
  • => 60 MB/s data BW per processor
  • 140 MB/s combined BW
  • Assuming 1 GB/s bus bandwidth
  • ∴ 8 processors will saturate the bus (worked out
    below)
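
The same arithmetic as a small C++ check (all numbers come from the
slide):

    #include <cstdio>

    int main() {
        // Bus traffic left over after the caches filter most accesses.
        double inst_bw  = 4.0e9 * (1.0 - 0.98);  // 98% inst hits -> 80 MB/s
        double data_bw  = 1.2e9 * (1.0 - 0.95);  // 95% data hits -> 60 MB/s
        double per_proc = inst_bw + data_bw;     // 140 MB/s per processor
        double bus_bw   = 1.0e9;                 // 1 GB/s shared bus
        // Prints about 7.1, so roughly 8 processors saturate the bus.
        std::printf("processors to saturate: %.1f\n", bus_bw / per_proc);
    }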

[Figure: two processors with private caches on a shared bus to memory
and I/O; each processor needs 5.2 GB/s from its cache but only
140 MB/s from the shared bus]
20
(No Transcript)
21
Cache Organizations for Multi-cores
  • L1 caches are always private to a core
  • L2 caches can be private or shared
  • Advantages of a shared L2 cache
  • efficient dynamic allocation of space to each
    core
  • data shared by multiple cores is not replicated
  • every block has a fixed home; hence, it is easy to
    find the latest copy
  • Advantages of a private L2 cache
  • quick access to private L2; good for small
    working sets
  • private bus to private L2 → less contention

22
A Reminder: SMT (Simultaneous Multi-Threading)
SMT vs. CMP
23
"A Single-Chip Multiprocessor," L. Hammond et al.
(Stanford), IEEE Computer, 1997
[Figure: floorplans of a Superscalar (SS) chip and a CMP chip]
  • For the same area (the area of a billion-transistor
    DRAM)
  • Superscalar and SMT: Very complex
  • Wide
  • Advanced Branch prediction
  • Register Renaming
  • OOO Instruction Issue
  • Non-Blocking data caches

24
SS and SMT vs. CMP
  • CPU Cores: Three main hardware design problems
    (of SS and SMT)
  • Area increases quadratically with core complexity
  • Number of registers: O(instruction window size)
  • Register ports: O(issue width)
  • CMP solves this problem (area linear in issue
    width; see the toy calculation below)
  • Longer Cycle Times
  • Long Wires, many MUXes and crossbars
  • Large buffers, queues and register files
  • Clustering (decreases ILP) or deep pipelining
    (branch misprediction penalties)
  • CMP allows small cycle time (with little effort)
  • Small and fast
  • Relies on software to schedule
  • Poor ILP
  • Complex Design and Verification
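
A toy calculation of the scaling argument (the widths are made-up
assumptions; only the quadratic-vs-linear shapes come from the slide):
one wide core pays quadratically for its register ports, while several
narrow cores of the same total issue width pay roughly linearly.

    #include <cstdio>

    int main() {
        // Model register-file area as (issue width)^2 per core.
        int wide = 12 * 12;      // one 12-issue core: ~144 area units
        int cmp  = 4 * (3 * 3);  // four 3-issue cores: ~36 area units
        std::printf("wide: %d units, cmp: %d units\n", wide, cmp);
    }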

25
SS and SMT vs. CMP
  • Memory
  • A 12-issue SS or SMT requires a multiported data
    cache (4-6 ports)
  • 2 x 128 KB (2-cycle latency)
  • CMP: 16 x 16 KB (single-cycle latency), but the
    secondary cache is slower (multiported)
  • Shared memory with write-through caches

26
Performance comparison
  • Compress (integer app): Low ILP and no TLP
  • Mpeg-2 (multimedia app): High ILP and TLP,
    moderate memory requirements (parallelized by
    hand)
  • SMT utilizes core resources better
  • But CMP has 16 issue slots instead of 12
  • Tomcatv (FP app): Large loop-level parallelism
    and large memory bandwidth (TLP extracted by the
    compiler)
  • CMP has large memory bandwidth from its primary
    caches; SMT's fundamental problem is its unified,
    slower cache
  • Multiprogram: Integer multiprogramming workload,
    all computation-intensive (Low ILP, High PLP)

27
CMP Motivation
  • How to utilize available silicon?
  • Speculation (aggressive superscalar)
  • Simultaneous Multithreading (SMT,
    Hyperthreading)
  • Several processors on a single chip
  • What is a CMP (Chip MultiProcessor)?
  • Several processors (several masters)
  • Both shared and distributed memory architectures
  • Both homogenous and heterogeneous processor
    types
  • Why?
  • Wire Delays
  • Diminishing returns of uniprocessors
  • Very long design and verification times for
    modern processors

28
"A Single-Chip Multiprocessor," L. Hammond et al.
(Stanford), IEEE Computer, 1997
  • TLP and PLP will become widespread in future
    applications
  • Various Multimedia applications
  • Compilers and OS
  • Favours CMP
  • CMP
  • Better performance with simple hardware
  • Higher clock rates, better memory bandwidth
  • Shorter pipelines
  • SMT has better utilization, but CMP has more
    resources (no wide-issue logic)
  • Although CMP is bad when there is no TLP and
    little ILP (compress), SMT and SS are not much
    better

29
A Reminder: SMT (Simultaneous Multi-Threading)
SMT
  • Pool of execution units (wide machine)
  • Several logical processors
  • Copy of state for each
  • Multiple threads are running concurrently
  • Better utilization and latency tolerance
CMP
  • Simple cores
  • Moderate amount of parallelism
  • Threads are running concurrently on different
    cores

30
SMT Dual-core: all four threads can run concurrently
[Figure: two SMT cores side by side, each with its own BTB and I-TLB,
decoder, trace cache, uCode ROM, rename/alloc, uop queues, schedulers,
integer and floating-point units, L1 D-cache and D-TLB, and L2 cache
and control, connected to the bus; four threads run two per core]