Nicolas Tjioe, CSE 520, Wednesday 11/12/2008: Hyper-Threading in NetBurst Microarchitecture (David Koufaty)

1
Nicolas Tjioe
CSE 520
Wednesday 11/12/2008
Hyper-Threading in NetBurst Microarchitecture
David Koufaty, Deborah T. Marr (Intel)
Published by the IEEE Computer Society
Volume 23, Issue 2, March-April 2003, Page(s) 56-65
2
Traditional Processor Design

Higher Clock Speed
Pipeline the microarchitecture to finer granularities (super-pipelining).
Instruction Level Parallelism (ILP)
In-order vs. out-of-order execution.
Cache Hierarchy
Data held in the cache reduces the frequency of accesses to the slower main memory.

3
Design Cont.

Existing techniques add die-size and power costs.
Chip multiprocessing (CMP): each processor has a full set of execution and architectural resources.
Time-slice multithreading.
Simultaneous multithreading (SMT).

4
Hyper-Threading (HT)

HT introduces the SMT approach to the Intel architecture.
A single physical processor appears as multiple logical processors: there is one copy of the architecture state for each logical processor, and the logical processors share a single set of physical execution resources.
HW: instructions from more than one thread execute at once. SW: can schedule more threads.
HT added less than 5% to the relative chip size and maximum power requirements.

5
Microarchitecture choice tradeoffs

Partition
Dedicates equal resources to each logical processor.
Simple, with low complexity.
Good for structures whose utilization is high and unpredictable.
E.g., pipeline queues.
Threshold
Flexible resource sharing with a limit on the maximum resource usage.
Ideal for small structures where the resource utilization is bursty and predictable.
E.g., processor schedulers.
Full Sharing
Flexible resource sharing with no limit on the maximum resource usage.
Good for large structures in which the working-set sizes are variable.
E.g., processor caches.
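
The three policies above differ only in the cap placed on how many entries one logical processor may hold. A minimal Python sketch of that idea follows. The class name SharedStructure, the helper fill, and the sizes used are hypothetical illustrations, not taken from the paper or from any real hardware description.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStructure:
    """A structure with `size` entries used by two logical processors (hypothetical model)."""
    size: int
    policy: str                # "partition", "threshold", or "full"
    threshold: int = 0         # per-thread cap, used only by the "threshold" policy
    used: dict = field(default_factory=lambda: {0: 0, 1: 0})

    def limit(self) -> int:
        """Maximum number of entries one logical processor may hold."""
        if self.policy == "partition":
            return self.size // 2      # equal, dedicated halves
        if self.policy == "threshold":
            return self.threshold      # flexible, but capped
        if self.policy == "full":
            return self.size           # no per-thread cap
        raise ValueError(f"unknown policy: {self.policy}")

    def try_allocate(self, thread: int) -> bool:
        """Give one entry to `thread` if the policy and free space allow it."""
        free = self.size - sum(self.used.values())
        if free == 0 or self.used[thread] >= self.limit():
            return False
        self.used[thread] += 1
        return True


def fill(structure: SharedStructure, thread: int) -> int:
    """Allocate for `thread` until refused; return how many entries it obtained."""
    count = 0
    while structure.try_allocate(thread):
        count += 1
    return count
```

With 8 entries, a greedy thread 0 is held to half under "partition", to the cap under "threshold", and can take everything under "full", which is why full sharing is reserved for large structures such as caches.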

6
Shared vs Partitioned Queue
Figure legend: dark color = slower thread, light color = faster thread.
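
The effect this figure illustrates can be reproduced with a toy simulation: in a shared queue, entries from a slow thread pile up and leave no room for the fast thread, while a partitioned queue reserves half the entries for each. The sketch below is only an illustration under stated assumptions: the function simulate, the drain period of the slow thread, and the insertion order (slow thread first, a worst case) are all hypothetical choices, not the paper's model.

```python
def simulate(capacity: int, partitioned: bool, cycles: int = 100, slow_period: int = 10):
    """Return [fast_drained, slow_drained] after `cycles` cycles.

    Thread 0 is fast (one entry leaves the queue every cycle).
    Thread 1 is slow (one entry leaves only every `slow_period` cycles).
    """
    occupancy = [0, 0]   # queue entries currently held by each thread
    drained = [0, 0]     # uops that have left the queue, per thread
    limit = capacity // 2 if partitioned else capacity

    for cycle in range(cycles):
        # Drain stage: the consumer removes entries at each thread's own pace.
        if occupancy[0] > 0:
            occupancy[0] -= 1
            drained[0] += 1
        if occupancy[1] > 0 and cycle % slow_period == 0:
            occupancy[1] -= 1
            drained[1] += 1

        # Fill stage: each thread tries to insert one entry per cycle.
        # Assumption: the slow thread inserts first, which is the worst case.
        for thread in (1, 0):
            if sum(occupancy) < capacity and occupancy[thread] < limit:
                occupancy[thread] += 1

    return drained


shared = simulate(capacity=8, partitioned=False)
split = simulate(capacity=8, partitioned=True)
print("shared queue      fast/slow drained:", shared)
print("partitioned queue fast/slow drained:", split)
```

In this model the slow thread makes the same progress either way, while the fast thread is starved once the shared queue fills with slow-thread entries.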
7
HT Resources

Duplicated
Register Renaming Logic
Instruction Pointer
ITLB
Return Stack Predictor
Partitioned
Reorder Buffer (ROB)
Load/Store Buffer
Scheduling queues, uop queues.
Shared
Caches: trace cache, L2.
Execution units.
Microarchitectural Registers.

8
Front-End Pipeline
Execution Trace Cache: The Trace Cache (TC) stores decoded instructions, called micro-operations (uops).
Microcode ROM: For complex instructions, the TC sends a microcode instruction pointer to the Microcode ROM.
Instruction Translation Lookaside Buffer (ITLB): On a trace cache miss, the ITLB receives the request from the TC to deliver new instructions and translates the next instruction pointer address to a physical address. Each streaming buffer is 64 bytes.
IA-32 Instruction Decode: Decoding is only needed for instructions that miss the TC. The decoder alternates between threads, so both threads share one copy of the decode logic, which keeps two copies of the decode state.
Uop Queue: Each logical processor has only half the entries (partitioned). The queue sends uops from the front-end pipeline to the out-of-order execution engine.
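
The alternation between threads at shared front-end stages can be sketched as a per-cycle arbiter. This is a minimal illustration, not the actual hardware logic: the function name select_thread is hypothetical, and giving the slot to the other thread when the preferred one has nothing to do is an assumption added here (the slides only say the stages alternate).

```python
from typing import Optional, Tuple


def select_thread(cycle: int, ready: Tuple[bool, bool]) -> Optional[int]:
    """Pick which logical processor uses a shared front-end stage this cycle.

    `ready[t]` says whether logical processor t has work for the stage.
    """
    preferred = cycle % 2          # alternate the favored thread every cycle
    if ready[preferred]:
        return preferred
    other = 1 - preferred
    if ready[other]:               # assumption: an idle slot is not wasted
        return other
    return None                    # neither thread can use the stage


# Both threads busy: strict alternation. One thread stalled: the other runs every cycle.
print([select_thread(c, (True, True)) for c in range(4)])
print([select_thread(c, (False, True)) for c in range(4)])
```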
9
Out-of-Order Execution Engine
Each logical processor can use at most 63 ROB entries, 24 load buffers and 12 store buffers.
Allocator: Alternates between the logical processors every clock cycle when selecting uops. Signals a stall if a limit is reached.
Register Rename: Renames the IA-32 registers (8) onto the machine's physical registers (128). This allows an instruction to run at the same time as another instruction that uses the same IA-32 registers. Uses the Register Alias Table (RAT) to keep track of the registers.
Instruction Scheduling: Four uop schedulers schedule different types of uops for the different execution units. Each scheduler has its own queue of 8-12 entries.
Retirement: The retirement logic alternates between the two logical processors to track which uops are ready to be retired. Data is written to the L1 data cache.
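
The allocator behavior described on this slide (alternate between logical processors each cycle, stall a thread that has reached its limit) can be sketched as follows. The per-thread limits are the ones quoted on the slide. The class Allocator, its method names, and the choice to serve the other thread when the preferred one is stalled are assumptions made for illustration.

```python
# Per-logical-processor limits quoted on the slide.
LIMITS = {"rob": 63, "load": 24, "store": 12}


class Allocator:
    """Toy model of per-thread resource allocation (hypothetical, not Intel's design)."""

    def __init__(self):
        # Entries currently held by logical processors 0 and 1.
        self.in_use = [dict.fromkeys(LIMITS, 0) for _ in range(2)]

    def can_allocate(self, thread, needs):
        """True if `thread` stays within every limit after taking `needs`."""
        return all(self.in_use[thread][res] + n <= LIMITS[res]
                   for res, n in needs.items())

    def allocate(self, cycle, pending):
        """Allocate for at most one uop this cycle.

        `pending[t]` is a dict of resources needed by thread t's next uop,
        or None if thread t has no uop waiting.
        Returns the thread served, or None if both threads stall.
        """
        first = cycle % 2                      # alternate priority each cycle
        for thread in (first, 1 - first):
            needs = pending[thread]
            if needs is not None and self.can_allocate(thread, needs):
                for res, n in needs.items():
                    self.in_use[thread][res] += n
                return thread
        return None

    def release(self, thread, needs):
        """Return resources when a uop of `thread` retires."""
        for res, n in needs.items():
            self.in_use[thread][res] -= n
```

For example, a thread that issues only stores is stalled after 12 of them, even though ROB entries remain, and the other thread can still allocate.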
10
Dispatch Execution Units

A maximum of 6 microinstructions can be dispatched per clock cycle:
Two microinstructions on Port 0.
Two microinstructions on Port 1.
One microinstruction on Port 2.
One microinstruction on Port 3.
On the same port, a fast unit is combined with a slow unit.
Ports 2 and 3 are used for memory operations (load and store, respectively).
After execution, uops are placed in the ROB.
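
The figure of 6 is simply the sum of the per-port widths listed above. The short sketch below checks that arithmetic and shows how port limits constrain which ready uops can leave in one cycle. The function names and the oldest-first selection order are illustrative assumptions.

```python
# Microinstructions each port can accept per clock cycle, as listed on the slide.
PORT_WIDTH = {0: 2, 1: 2, 2: 1, 3: 1}


def max_dispatch_per_cycle(port_width=PORT_WIDTH):
    """Upper bound on uops dispatched in one cycle: the sum over all ports."""
    return sum(port_width.values())


def dispatch(ready_uops):
    """Choose which ready uops dispatch this cycle.

    `ready_uops` is a list of target port numbers, oldest uop first.
    Returns the list positions of the uops that dispatch.
    """
    remaining = dict(PORT_WIDTH)
    issued = []
    for position, port in enumerate(ready_uops):
        if remaining.get(port, 0) > 0:
            remaining[port] -= 1
            issued.append(position)
    return issued


print(max_dispatch_per_cycle())
print(dispatch([0, 0, 0, 1, 2, 2, 3]))
```

In the second call, the third Port 0 uop and the second Port 2 uop must wait for a later cycle because their ports are already full.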

11
Single Task (ST), Multi Task (MT) Mode

Two types of ST mode: ST0 and ST1.
Only one logical processor is active; the halted one is in a low-power state.
Resources that were partitioned in MT mode are recombined to give the single active logical processor all of the resources.
The HALT instruction is used to transition from MT to ST mode.
It is a privileged instruction: only ring-0 code, such as the OS, can execute it.
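
The mode transitions on this slide can be sketched as a small state machine. This is a hedged illustration: the class Processor and its methods are hypothetical, the return to MT mode through an interrupt is an assumption not stated on the slide, and the ROB total of 126 entries is assumed to be twice the per-thread limit of 63 quoted earlier.

```python
ROB_TOTAL = 126  # assumption: 2 x 63, the per-thread limit quoted earlier


class Processor:
    """Toy model of MT/ST mode switching for two logical processors."""

    def __init__(self):
        self.active = {0, 1}          # both logical processors running: MT mode

    @property
    def mode(self):
        if self.active == {0, 1}:
            return "MT"
        if self.active == {0}:
            return "ST0"              # only logical processor 0 is active
        if self.active == {1}:
            return "ST1"              # only logical processor 1 is active
        return "halted"

    def halt(self, lp, ring):
        """Execute HALT on logical processor `lp` at privilege level `ring`."""
        if ring != 0:
            raise PermissionError("HALT is privileged: ring 0 only")
        self.active.discard(lp)

    def interrupt(self, lp):
        """Assumption: an interrupt wakes a halted logical processor."""
        self.active.add(lp)

    def rob_entries(self, lp):
        """ROB entries available to `lp`: half in MT mode, all in ST mode."""
        if lp not in self.active:
            return 0
        return ROB_TOTAL // 2 if self.mode == "MT" else ROB_TOTAL


cpu = Processor()
cpu.halt(1, ring=0)                   # the OS halts logical processor 1
print(cpu.mode, cpu.rob_entries(0))   # logical processor 0 now owns the whole ROB
```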

12
Experiment Setup
13
Result
Cache hit rate and overall performance impact for
a fully shared cache normalized against values
for a partitioned cache
14
Multithreading and Multitasking Performance
HT Performance on Multithreaded Software Package
HT Performance on Multitasking workloads
15
Conclusions

HT improves multithreaded applications by having each logical processor run software threads from the same application.
HT speeds up multitasking workloads by having each logical processor run threads from different applications.
Nehalem (Intel Core i7) is planned for release in Q4 2008. It scales up to 8 physical cores (16 logical processors).

16
Additional References

Hyper-Threading Technology Architecture and Microarchitecture
ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf
http://www.hardwaresecrets.com/article/235/6
