TLP on Chip: SMT and CMP

Transcript and Presenter's Notes
1
TLP on Chip: SMT and CMP
2
SMT
  • Discussed simultaneous multithreading (SMT) in the last lecture
  • Basic goal is to run multiple threads at the same time
  • Helps hide large memory latency: even if one thread is blocked on a cache miss, ready instructions from other threads can still be scheduled, without incurring the overhead of a context switch
  • Improves memory-level parallelism (MLP)
  • Overall, improves resource utilization enormously compared to a superscalar processor
  • Latency of a particular thread may not improve, but the overall throughput of the system increases (i.e., the average number of retired instructions per cycle)

3
Multi-threading
  • Three design choices for single-core hardware multithreading (contrasted in the sketch after this list)
  • Coarse-grain multithreading: execute one thread at a time; when the running thread is blocked on a long-latency event (e.g., a cache miss), swap in a new thread; this swap can take place in hardware (needs extra support and extra cycles for flushing the pipe and saving register values, unless renamed registers remain pinned)
  • Fine-grain multithreading: fetch, decode, rename, issue, and execute instructions from the threads in round-robin fashion; utilization improves across cycles, but the problem remains within a cycle, and if a thread blocks on a long-latency event its slots go wasted for many cycles
  • Simultaneous multithreading (SMT): mix instructions from all threads every cycle, for maximum utilization of resources
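
A toy model makes the contrast concrete. This is only a hedged sketch, not any real machine's policy: the 4-wide issue width, the Thread fields, and the numbers are invented for illustration.

```python
from dataclasses import dataclass

ISSUE_WIDTH = 4

@dataclass
class Thread:
    blocked: bool = False   # stalled on a long-latency event?
    ready: int = 0          # ready instructions this cycle

def coarse_grain(threads, current):
    # One thread runs until it blocks; then hardware swaps in a ready thread
    # (the pipe-flush / register-save cost is not modeled here).
    if threads[current].blocked:
        candidates = [i for i, t in enumerate(threads) if not t.blocked]
        current = candidates[0] if candidates else current
    n = 0 if threads[current].blocked else min(ISSUE_WIDTH, threads[current].ready)
    return [current] * n, current

def fine_grain(threads, cycle):
    # Strict round-robin: this cycle belongs to one thread; if that thread
    # is blocked, all of the cycle's slots go wasted.
    owner = cycle % len(threads)
    t = threads[owner]
    return [] if t.blocked else [owner] * min(ISSUE_WIDTH, t.ready)

def smt(threads):
    # Fill slots with ready instructions from *all* threads in the same cycle.
    slots = []
    for i, t in enumerate(threads):
        if not t.blocked:
            take = min(t.ready, ISSUE_WIDTH - len(slots))
            slots += [i] * take
    return slots

# One blocked thread: coarse and fine-grain lose slots; SMT still fills them.
ts = [Thread(blocked=True), Thread(ready=2), Thread(ready=3)]
print(fine_grain(ts, cycle=0))  # -> [] (thread 0's turn is wasted)
print(smt(ts))                  # -> [1, 1, 2, 2]
```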

4
Problems of SMT
  • Offers a processor that can deliver reasonably good multithreaded performance with fine-grained, fast communication through the cache
  • Although it is possible to design an SMT processor with a small die area increase (about 5% in Pentium 4), good performance requires rethinking the resource allocation policies at various stages of the pipe
  • Also, verifying an SMT processor is much harder than verifying the basic underlying superscalar design
  • Must think about various deadlock/livelock possibilities, since the threads interact with each other through shared resources on a per-cycle basis
  • Why not exploit the transistors available today to just replicate existing superscalar cores and design a single-chip multiprocessor (CMP)?

5
CMP
  • CMP is the mantra of today's microprocessor industry
  • Intel's dual-core Pentium 4: each core is still hyperthreaded (just reuses existing cores)
  • Intel's quad-core Whitefield is coming up in a year or so
  • For the server market, Intel has announced a dual-core Itanium 2 (code-named Montecito); again, each core is 2-way threaded
  • AMD released the dual-core Opteron in 2005
  • IBM released its first dual-core processor, POWER4, circa 2001; the next-generation POWER5 also uses two cores, but each core is additionally 2-way threaded
  • Sun's UltraSPARC IV (released in early 2004) is a dual-core processor and integrates two UltraSPARC III cores

6
Why CMP?
  • Today microprocessor designers can afford to have a lot of transistors on the die
  • Ever-shrinking feature size leads to dense packing
  • What would you do with so many transistors?
  • Can invest some in caches, but beyond a certain point that doesn't help
  • The natural choice was to think about a greater level of integration
  • A few chip designers decided to bring the memory and coherence controllers, along with the router, onto the die
  • The next obvious choice was to replicate the entire core; it is fairly simple: just use the existing cores and connect them through a coherent interconnect

7
Moore's law
  • The number of transistors on a die doubles every 18-24 months
  • Exponential growth in the available transistor count
  • If transistor utilization were constant, this would lead to exponential performance growth, but life is slightly more complicated
  • Wires don't scale with transistor technology, so wire delay becomes the bottleneck
  • Short wires are good; this dictates localized logic design
  • But superscalar processors exercise centralized control, requiring long wires (or pipelined long wires)
  • However, to utilize the transistors well, we need to overcome the memory wall problem
  • To hide memory latency we need to extract more independent instructions, i.e., more ILP

8
Moore's law
  • Extracting more ILP directly requires more available in-flight instructions
  • But for that we need a bigger ROB, which in turn requires a bigger register file
  • We also need bigger issue queues to be able to find more parallelism
  • None of these structures scales well; the main problem is wiring
  • So the best solution to utilize these transistors effectively at low cost must not require long wires and must be able to leverage existing technology; CMP satisfies these goals exactly (use existing processors and invest the transistors in having more of them on chip, instead of trying to scale the existing processor for more ILP)

9
Moore's law
10
Power consumption?
  • Hey, didn't I just make my power consumption roughly N-fold by putting N cores on the die?
  • Yes, if you do not scale down voltage or frequency
  • Usually CMPs are clocked at a lower frequency
  • Oops! My games run slower!
  • Voltage scaling happens due to smaller process technology
  • Overall, roughly cubic dependence of power on voltage or frequency (a numeric sketch follows this list)
  • Need to talk about different metrics
  • Performance/Watt (the same as the reciprocal of energy, for a fixed amount of work)
  • More generally, Performance^(k+1)/Watt (k > 0)
  • Need smarter techniques to further improve these metrics
  • Online voltage/frequency scaling
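
A back-of-the-envelope calculation shows why this works out. Dynamic power is roughly C·V²·f, and since achievable frequency scales roughly with voltage, power grows roughly as the cube of voltage/frequency. All constants below are invented; only the scaling trends matter.

```python
# Classic CMOS dynamic power model: P ~ C * V^2 * f. Since achievable f
# scales roughly with V, scaling both down by s cuts power by ~s^3.
def power(v, f, c=1.0):
    return c * v**2 * f

p_one  = power(1.0, 1.0)        # one core at nominal voltage/frequency
p_quad = 4 * power(0.8, 0.8)    # four cores scaled down 20% -> 4 * 0.8^3

perf_one  = 1.0                 # assume performance tracks frequency
perf_quad = 4 * 0.8             # and the workload is perfectly parallel

print(p_quad / p_one)                             # ~2.05x the power ...
print((perf_quad / p_quad) / (perf_one / p_one))  # ... ~1.56x perf/Watt
```

So four cores at 80% voltage/frequency draw about twice the power of one fast core but deliver about 1.56x the performance per watt, provided the workload parallelizes; a single thread indeed runs slower.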

11
Clustered arch.
  • An alternative to CMP is the clustered microarchitecture
  • Still tries to extract ILP and runs a single thread
  • But divides the execution unit into clusters, where each cluster has a separate register file
  • The number of ports per register file goes down dramatically, reducing complexity
  • Can even replicate/partition the caches
  • Big disadvantage: keeping the register file and cache partitions coherent may need global wires
  • Key factor: frequency of communication
  • Also, the standard problems of single-threaded execution remain: branch prediction, fetch bandwidth, etc.

12
Clustered arch.
May want to steer dependent instructions to the
same cluster to minimize communication (one
possible heuristic is sketched below)
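
One plausible steering heuristic, not taken from any specific design: send each instruction to the cluster that produces most of its source operands, and break ties toward the less loaded cluster. The register-to-cluster map and load counters below are invented for illustration.

```python
NUM_CLUSTERS = 2
reg_home = {}                    # register -> cluster that produces it
cluster_load = [0] * NUM_CLUSTERS

def steer(dest_reg, src_regs):
    votes = [0] * NUM_CLUSTERS
    for r in src_regs:
        if r in reg_home:
            votes[reg_home[r]] += 1          # prefer where the sources live
    best = max(range(NUM_CLUSTERS),
               key=lambda c: (votes[c], -cluster_load[c]))
    reg_home[dest_reg] = best                # the result is produced here
    cluster_load[best] += 1
    return best

# A dependent chain stays in one cluster; independent work goes to the other.
print(steer("r1", []), steer("r2", ["r1"]), steer("r3", ["r1", "r2"]))  # 0 0 0
print(steer("r9", []))                                                  # 1
```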
13
ABCs of CMP
  • Where to put the interconnect?
  • Do not want to access the interconnect too frequently, because these wires are slow
  • It probably does not make much sense to share the L1 cache among the cores: that requires very high bandwidth and may necessitate a redesign of the L1 cache and the surrounding load/store unit, which we do not want to do; so settle for private L1 caches, one per core
  • Makes more sense to share the L2 or L3 caches
  • Need a coherence protocol at the L2 interface to keep the private L1 caches coherent; may use a high-speed custom-designed snoopy bus connecting the L1 controllers, or may use a simple directory protocol
  • An entirely different design choice is not to share the cache hierarchy at all (dual-core AMD and Intel); this rids you of the on-chip coherence protocol, but offers no gain in communication latency

14
Shared cache design
  • Needs to be banked
  • How many coherence engines per bank?
  • Notion of a home bank? What does a miss in the home bank mean?
  • Snoop or directory?
  • COMA with a home bank?

15
Hierarchical MP
  • SMT and CMP add a couple more levels to hierarchical multiprocessor design
  • If you just have an SMT processor, you can do shared-memory multiprocessing among the threads with possibly the fastest communication; you can connect the SMT processors over a snoopy bus to build an SMP; you can connect these SMP nodes over a network with a directory protocol
  • Can do the same thing with CMP; the only difference is that you need to design the on-chip coherence logic (which is not automatically enforced as in SMT)
  • If you have a CMP with each core being an SMT, then you really have a tall hierarchy of shared memory: communication becomes costlier as you go up the hierarchy, and it also becomes very non-uniform

16
IBM POWER4
17
IBM POWER4
  • Dual-core chip multiprocessor

18
4-chip 8-way NUMA
19
32-way ring bus
20
POWER4 core
  • 8-wide fetch, 8-wide issue, 5-wide commit
  • Features out-of-order issue with renaming and branch prediction (bimodal+gshare hybrid)
  • Allows 20 groups of at most 5 instructions each to be in flight beyond dispatch (100 instructions)

21
POWER4 pipeline
  • Relatively short pipe
  • Clocked at more than 1 GHz in 0.18 µm technology
  • Minimum 15 cycles for integer instructions
  • Minimum 12-cycle branch misprediction penalty
  • 11 small parallel issue queues (divided into four groups) for fast selection
  • Back-to-back issue of dependent instructions is not allowed (slow bypass, or bypass absent?); requires at least a one-cycle gap
  • Out-of-order load issue, with load-load and load-store replay; load-load replay is optimized with a load queue snoop bit
  • Write-through, write-no-allocate private L1 data cache; at most 8 outstanding L1 load misses
  • Inclusion maintained between L2 and L1

22
POWER4 pipeline
23
POWER4 caches
  • Private L1 instruction and data caches (on chip)
  • L1 icache: 64 KB / direct-mapped / 128-byte lines
  • L1 dcache: 32 KB / 2-way associative / 128-byte lines / LRU
  • No M state in the L1 data cache (write-through)
  • On-chip shared L2 (the on-chip coherence point)
  • 1.5 MB / 8-way associative / 128-byte lines / pseudo-LRU
  • For on-chip coherence, the L2 tag is augmented with a two-bit sharer vector, used to invalidate L1 copies on another core's write (sketched below)
  • Three L2 controllers, each with four local coherence units; each L2 controller handles roughly 512 KB of data divided into four SRAM partitions
  • For off-chip coherence, each L2 controller has four snoop engines; executes an enhanced MESI protocol with seven states
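
A minimal sketch of the sharer-vector mechanism described above. Because the L1s are write-through, L2 always holds current data, so a write only needs to invalidate the other core's L1 copy. The class layout and method names are invented; only the mechanism follows the slide.

```python
NUM_CORES = 2    # POWER4 has two cores, hence a two-bit sharer vector

class L2Line:
    def __init__(self):
        self.sharers = [False] * NUM_CORES   # which L1s hold this line

    def l1_read_miss(self, core):
        self.sharers[core] = True            # that core's L1 now caches it

    def l1_write(self, core, invalidate_l1):
        # The write-through reaches L2; kill the copies in the *other* L1s.
        for c in range(NUM_CORES):
            if c != core and self.sharers[c]:
                invalidate_l1(c)             # back-invalidation to core c
                self.sharers[c] = False
        self.sharers[core] = True

line = L2Line()
line.l1_read_miss(0); line.l1_read_miss(1)
line.l1_write(0, lambda c: print(f"invalidate L1 of core {c}"))
```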

24
POWER4 L2 cache
25
POWER4 L3 cache
  • On-chip tag (IBM calls it a directory), off-chip data
  • 32 MB / 8-way associative / 512-byte lines
  • Contains eight coherence/snoop controllers
  • Does not maintain inclusion with L2; this requires L3 to snoop the fabric interconnect as well
  • Maintains five coherence states
  • Putting the L3 cache on the other side of the fabric requires every L2 cache miss (even a local miss) to cross the fabric, which increases latency quite a bit

26
POWER4 L3 cache
27
POWER4 die photo
28
IBM POWER5
29
IBM POWER5
  • Carries POWER4 on to the next generation
  • Each core of the dual-core chip is 2-way SMT, at about 24% area growth per core
  • More than two threads would not only add complexity, but might not provide extra performance benefit; in fact, performance may degrade because of resource contention and cache thrashing, unless all shared resources are scaled up accordingly (hits a complexity wall)
  • The L3 cache is moved to the processor side so that the L2 cache can talk to it directly; this reduces the bandwidth demand on the interconnect (L3 hits at least do not go on the bus)
  • This change enabled POWER5 designers to scale to 64-processor systems (i.e., 32 chips with a total of 128 threads)
  • Bigger L2 and L3 caches: 1.875 MB L2, 36 MB L3
  • On-chip memory controller

30
IBM POWER5
Reproduced from IEEE Micro
31
IBM POWER5
  • Same pipeline structure as POWER4
  • Added SMT facility
  • Like Pentium 4, fetches from each thread in alternate cycles (8-instruction fetch per cycle, just like POWER4)
  • Threads share the ITLB and ICache
  • Increased register file size compared to POWER4 to support the two threads: 120 integer and 120 floating-point registers (POWER4 has 80 integer and 72 floating-point registers); this improves single-thread performance compared to POWER4; the smaller technology (0.13 µm) made it possible to access the bigger register file in the same or shorter time, leading to the same pipeline as POWER4
  • Doubled the associativity of the L1 caches to reduce conflict misses: the icache is 2-way and the dcache is 4-way

32
IBM POWER5
Reproduced from IEEE Micro
33
IBM POWER5
  • Thread priority
  • Software can set the priority of a thread, and the hardware (essentially the decoder) reads these priority registers to decide which thread to process in a given cycle
  • The higher-priority thread gets more decode cycles in the long run, i.e., injects more instructions into the pipe (one plausible arbitration scheme is sketched below)
  • Eight priority levels for each thread; level 0 means idle
  • Real-time tasks get higher priority, while a thread looping on a spin lock gets lower priority
  • Level 1 is the lowest priority for an active thread; if both threads are running at level 1, the processor throttles the overall decode rate to save dynamic power
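
One plausible decode-arbitration scheme consistent with the description above; the actual POWER5 ratio logic is not given here, so the proportional-weights policy and the throttling factor are assumptions.

```python
import random

def pick_decode_thread(prio0, prio1, throttle=4):
    # Both threads at level 1: decode only 1 cycle in `throttle` to save power.
    if prio0 == prio1 == 1 and random.randrange(throttle):
        return None
    weights = [prio0, prio1]        # level 0 (idle) contributes no weight
    if sum(weights) == 0:
        return None
    return random.choices([0, 1], weights=weights)[0]

# Priority 6 vs. 2 -> thread 0 gets ~3x the decode cycles in the long run.
picks = [pick_decode_thread(6, 2) for _ in range(100000)]
print(picks.count(0) / picks.count(1))   # ~3.0
```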

34
IBM POWER5
  • Adaptive resource balancing
  • Mainly three hardware mechanisms are used by POWER5 to make sure that one thread is not hogging too much
  • If one thread is found to consume too many GCT entries, i.e., has too many in-flight instructions (one GCT entry holds at most 5 instructions), that thread gets fewer decode cycles until GCT occupancy reaches a balanced state (note the difference from ICOUNT; see the sketch below)
  • If a thread has too many outstanding L2 cache misses, that thread is given fewer decode cycles (why?)
  • If a thread is executing a sync, all instructions belonging to that thread that are waiting in the pipe at the dispatch stage are flushed, and fetching from that thread is inhibited until the sync finishes (why?)
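
A sketch of the first two balancing rules as a decode gate; the thresholds are invented. Note the contrast with ICOUNT, which every cycle *selects* the thread with the fewest in-flight instructions, rather than gating an offending thread until it drains. The answer to the first "why?" appears as a comment.

```python
GCT_ENTRIES = 20    # 20 groups in flight, each holding up to 5 instructions

def decode_allowed(thread, gct_used, l2_misses, share=0.75, miss_limit=4):
    if gct_used[thread] > share * GCT_ENTRIES:
        return False    # hogging the GCT: gate decode until occupancy balances
    if l2_misses[thread] > miss_limit:
        return False    # its instructions would sit in the GCT for hundreds
                        # of cycles without completing, starving the other
                        # thread -- so stop feeding it
    return True

print(decode_allowed(0, gct_used=[16, 3], l2_misses=[0, 0]))   # False
print(decode_allowed(1, gct_used=[16, 3], l2_misses=[0, 0]))   # True
```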

35
IBM POWER5
  • Dynamic power management
  • With SMT and CMP, the average number of switching events per cycle increases, leading to more power consumption
  • Need to reduce power consumption without losing performance; the simple solution is to clock at a lower frequency, but that hurts performance
  • POWER5 employs fine-grain clock-gating: in every cycle, the power management logic decides whether a given latch will be used in the next cycle; if not, it disables or gates the clock for that latch so that it does not switch unnecessarily in the next cycle
  • The clock-gating and power management logic itself should be very simple
  • If both threads are running at priority level 1, the processor switches to a low-power mode in which it dispatches instructions at a much slower pace

36
POWER5 die photo
37
Intel Montecito
38
Features
  • Dual-core Itanium 2, each core dual-threaded
  • 1.7 billion transistors, 21.5 mm x 27.7 mm die
  • 27 MB of on-chip cache across three levels
  • Not shared among the cores
  • 1.8 GHz, 100 W
  • Single-thread enhancements
  • An extra shifter improves the performance of crypto codes by 100%
  • Improved branch prediction
  • Improved data and control speculation recovery
  • Separate L2 instruction and data caches buy a 7% improvement over Itanium 2; the L2I is four times bigger (1 MB)
  • Asynchronous 12 MB L3 cache

39
Overview
Reproduced from IEEE Micro
40
Dual threads
  • SMT only for the caches, not for the core resources
  • Simulations showed high resource utilization at the core level, but low utilization of the caches
  • The branch predictor is still shared, but uses thread-id tags
  • A thread switch is implemented by flushing the pipe
  • More like coarse-grain multithreading
  • Five thread-switch events:
  • L3 cache miss (immense impact on an in-order pipe) / L3 cache refill
  • Time quantum expiry
  • Spin lock / ALAT invalidation
  • Software-directed switch
  • Execution in low-power mode

41
Thread urgency
  • Each thread has eight urgency levels
  • Every L3 miss decrements the urgency by one
  • Every L3 refill increments the urgency by one, until the urgency reaches 5
  • A switch due to time quantum expiry sets the urgency of the switched-out thread to 7
  • Arrival of an asynchronous interrupt for a background thread sets the urgency level of that thread to 6
  • A switch due to an L3 miss also requires the urgency levels to be compared (these rules are sketched below)
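
The urgency rules above, written as a small state machine. The encoding and the strict ">" comparison on a miss-induced switch are assumptions; the update rules themselves come from the slide.

```python
class Urgency:
    def __init__(self, level=5):
        self.level = level

    def l3_miss(self):
        self.level = max(self.level - 1, 0)    # each L3 miss lowers urgency

    def l3_refill(self):
        self.level = min(self.level + 1, 5)    # refills raise it, capped at 5

    def quantum_expiry(self):
        self.level = 7                         # switched-out thread

    def async_interrupt(self):
        self.level = 6                         # wake a background thread

def switch_on_l3_miss(foreground, background):
    # A miss-induced switch also compares urgency levels ('>' is assumed).
    return background.level > foreground.level

fore, back = Urgency(), Urgency()
fore.l3_miss(); fore.l3_miss()
print(switch_on_l3_miss(fore, back))   # True: 5 > 3
```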

42
Thread urgency
Reproduced from IEEE Micro
43
Core arbiter
Reproduced from IEEE Micro
44
Power efficiency
  • Foxton technology
  • Blind replication of Itanium 2 cores at 90 nm would lead to roughly 300 W peak power consumption (Itanium 2 consumes 130 W peak at 130 nm)
  • When power consumption is below the ceiling, the voltage is increased, leading to higher frequency and performance
  • 10% boost for enterprise applications
  • Software or the OS can also dictate a frequency change if power saving is required
  • 100 ms response time for the feedback loop (sketched below)
  • Frequency control is achieved by 24 voltage sensors distributed across the chip; the entire chip runs at a single frequency (other than the asynchronous L3)
  • Clock gating found limited application in Montecito
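
A Foxton-style feedback loop in miniature: an embedded controller periodically reads the power sensors and nudges the voltage (and, with it, the frequency) to track the power ceiling. All numbers, including the step size, are invented.

```python
CEILING_W = 100.0
V_MIN, V_MAX = 0.9, 1.2

def control_step(v, measured_power, step=0.0125):
    if measured_power > CEILING_W:
        v = max(v - step, V_MIN)      # over budget: back off
    else:
        v = min(v + step, V_MAX)      # headroom: boost frequency/performance
    return v                          # chip frequency follows v

v = 1.0
for watts in [95, 98, 104, 101, 97]:  # sensor readings, one per interval
    v = control_step(v, watts)
    print(round(v, 4))
```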

45
Foxton technology
Reproduced from IEEE Micro
  • Embedded microcontroller runs a real-time
    scheduler to execute various tasks

46
Die photo
47
Sun Niagara (UltraSPARC T1)
48
Features
  • Eight pipelines or cores, each shared by 4 threads
  • 32-way multithreading on a single chip
  • Starting frequency of 1.2 GHz, consumes 60 W
  • Shared 3 MB L2 cache, 4-way banked, 12-way set associative, 200 GB/s bandwidth
  • Single-issue, six-stage pipe
  • Target market is web services, where ILP is limited but TLP is huge (independent transactions)
  • Throughput matters

49
Pipeline details
Reproduced from IEEE Micro
50
Pipeline details
  • Four threads share a six-stage pipeline
  • Shared L1 caches and TLBs
  • Dedicated register file per thread
  • Fetches two instructions every cycle from a selected thread
  • Thread-select logic also determines which thread's instruction should be fed into the pipe (see the sketch after this list)
  • Although the pipe is in-order, there is an 8-entry store buffer per thread (why?)
  • Instructions come with predecoded bits to facilitate thread selection
  • Threads may run into structural hazards due to the limited number of FUs
  • The divider is granted to the least recently executed thread
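
A sketch of the thread-select idea: each cycle, pick among threads that are not waiting on a long-latency event, and grant a contended unit such as the divider to the least recently executed thread. Using the least-recently-executed tiebreak for ordinary selection too is an assumption beyond the slide.

```python
def select_thread(ready, last_exec_cycle):
    # ready: per-thread "not stalled on load/divide/trap/etc." flags
    # last_exec_cycle: the last cycle in which each thread issued
    candidates = [t for t, ok in enumerate(ready) if ok]
    if not candidates:
        return None                      # all 4 threads stalled this cycle
    return min(candidates, key=lambda t: last_exec_cycle[t])

def grant_divider(requesters, last_exec_cycle):
    # Structural hazard on the divider: least recently executed thread wins.
    return min(requesters, key=lambda t: last_exec_cycle[t])

print(select_thread([True, False, True, True], [10, 3, 7, 9]))  # -> 2
```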

51
Cache hierarchy
  • L1 instruction cache
  • 16 KB / 4-way / 32-byte lines / random replacement
  • Fetches two instructions every cycle
  • If both instructions are useful, the next cycle is free for icache refill
  • L1 data cache
  • 8 KB / 4-way / 16-byte lines / write-through, no-allocate
  • On average, a 10% miss rate for the target benchmarks
  • The L2 cache extends the tag to maintain a directory for keeping the cores' L1s coherent
  • The L2 cache is write-back with silent eviction of clean lines

52
Thread selection
  • Based on long-latency events such as loads, divides, multiplies, and branches
  • Also based on pipeline stalls due to cache misses, traps, or structural hazards
  • Instructions dependent on a speculative load are issued with low priority