Title: Dynamically Trading Frequency for Complexity in a GALS Microprocessor
1Dynamically Trading Frequency for Complexity in a
GALS Microprocessor
- Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott
- University of Rochester
2The gist of the paper
Radical idea: trade off frequency and hardware complexity dynamically at runtime rather than statically at design time
The new twist: a Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture is key to making this worthwhile
3Application phase behavior
- Varying behavior over time
- Can exploit to save power
[Figure: gcc behavior per interval: L2 misses, E per interval, L1I misses, L1D misses, branch mispredictions, IPC] Sherwood, Sair, Calder, ISCA 2003
[Figure: adaptive issue queue] Buyuktosunoglu, et al., GLSVLSI 2001
4What about performance?
Relative delay versus number of entries:
entries      32    24    16    8
RAM delay    1.0   0.77  0.52  0.31
CAM delay    1.0   0.77  0.55  0.34
Lower power and faster access time!
Buyuktosunoglu, GLSVLSI 2001
5What about performance?
How do we exploit the faster speed?
Variable latency
Increase frequency when downsizing
Decrease frequency when upsizing
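As a rough illustration of the "increase frequency when downsizing" option, the sketch below turns the relative RAM delays from slide 4 into an upper bound on the clock gain. It assumes, purely for illustration, that the resized structure alone determines the cycle time of its clock; the name `clock_gain_bound` is hypothetical and not from the talk.

```python
# Relative RAM access delay versus number of entries, from the table on slide 4.
RAM_REL_DELAY = {32: 1.0, 24: 0.77, 16: 0.52, 8: 0.31}


def clock_gain_bound(rel_delay):
    """Upper bound on the relative clock frequency.

    Assumption: the structure is the only critical path, so the cycle
    time shrinks in proportion to its access delay.
    """
    return 1.0 / rel_delay


for entries, delay in RAM_REL_DELAY.items():
    print(f"{entries:2d} entries: up to {clock_gain_bound(delay):.2f}x clock")
```

In a real design other paths limit the clock as well, so the achievable gain would be smaller than this bound.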
6What about performance?
[Figure: fully synchronous processor driven by a single clock. Blocks shown: L1 I-Cache, Fetch Unit, Br Pred; Dispatch, Rename, ROB; integer and FP Issue Queues with ALUs and RF; Ld/St Unit, L1 D-Cache, L2 Cache; Main Memory]
Albonesi, ISCA 1998
7What about performance?
Albonesi, ISCA 1998
8Enter GALS
[Figure: GALS processor partitioned into Front-end, Integer, FP, Memory, and External clock domains. Blocks shown: L1 I-Cache, Fetch Unit, Br Pred; Dispatch, Rename, ROB; Issue Queues with ALUs and RF; Ld/St Unit, L1 D-Cache, L2 Cache; Main Memory]
Semeraro et al., HPCA 2002
Iyer and Marculescu, ISCA 2002
9Outline
- Motivation and background
- Adaptive GALS microarchitecture
- Control mechanisms
- Evaluation methodology
- Results
- Conclusions and future work
10Adaptive GALS microarchitecture
[Figure: adaptive GALS processor with Front-end, Integer, FP, Memory, and External domains. Labels repeated in the figure: L1 I-Cache, Br Pred, L2 Cache, L1 D-Cache, Issue Queue. Other blocks: Fetch Unit; Dispatch, Rename, ROB; Ld/St Unit; ALUs and RF]
11Adaptive GALS operation
[Figure: adaptive GALS processor in operation, with Front-end, Integer, FP, Memory, and External domains. Labels repeated in the figure: L1 I-Cache, Br Pred, L2 Cache, L1 D-Cache, Issue Queue. Other blocks: Fetch Unit; Dispatch, Rename, ROB; Ld/St Unit; ALUs and RF]
12Resizable cache organization
- Access the A part first, then the B part on a miss
- Swap the A and B blocks on an A miss, B hit
- Select the A/B split according to application phase behavior
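The access sequence above can be sketched for a single cache set. This is a minimal model under stated assumptions, not the authors' hardware: it tracks only the MRU order of tags, treats the first `a_ways` positions as part A, and fills a missing block into the MRU position while evicting the LRU block (the fill policy is not given on the slide). The class name `ABSet` is hypothetical.

```python
class ABSet:
    """One cache set kept in MRU order; the first a_ways entries form part A."""

    def __init__(self, ways, a_ways):
        self.ways = ways
        self.a_ways = a_ways
        self.order = []  # tags, most recently used first

    def access(self, tag):
        if tag in self.order:
            pos = self.order.index(tag)
            # Move the block to the MRU position. For a B hit this pushes the
            # least recently used A block into B, which models the A/B swap.
            self.order.remove(tag)
            self.order.insert(0, tag)
            return "A hit" if pos < self.a_ways else "B hit"
        # Miss: assumed policy is to fill at MRU and evict the LRU block.
        self.order.insert(0, tag)
        del self.order[self.ways:]
        return "miss"


s = ABSet(ways=4, a_ways=1)
for t in "ABCD":
    s.access(t)          # four cold misses
print(s.access("D"))     # D is the MRU block, so it sits in A
print(s.access("A"))     # A is at the LRU position, so it sits in B
print(s.access("A"))     # after the swap, A is now in part A
```

Changing `a_ways` changes only where the A/B boundary falls in the MRU order, which is what makes the split selectable per phase.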
13Resizable cache control
MRU State
[Figure: MRU state of a 4-way set holding blocks A, B, C, D, ordered from MRU (position 0) to LRU (position 3); example accesses update hit counters MRU0 through MRU3]
- Config A1 B3
- hitsA = MRU0
- hitsB = MRU1 + MRU2 + MRU3
- Config A2 B2
- hitsA = MRU0 + MRU1
- hitsB = MRU2 + MRU3
- Config A3 B1
- hitsA = MRU0 + MRU1 + MRU2
- hitsB = MRU3
- Config A4 B0
- hitsA = MRU0 + MRU1 + MRU2 + MRU3
- hitsB = 0
- Calculate the cost for each possible configuration
A access costs = (hitsA + hitsB + misses) × CostA
B access costs = (hitsB + misses) × CostB
Miss access costs = misses × CostMiss
Total access cost = A + B + Miss (normalized to frequency)
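The cost calculation on this slide can be sketched as follows. The per-configuration costs and relative frequencies in `CONFIGS` are hypothetical values chosen only for illustration, and the sketch assumes that "normalized to frequency" means dividing a cycle count by the relative clock frequency of that configuration; the talk does not spell this out.

```python
def total_cost(mru_hits, misses, a_ways, cost_a, cost_b, cost_miss, rel_freq):
    """Access cost of one A/B configuration, following the slide's formulas.

    mru_hits[i] counts hits at MRU position i during the interval.
    Costs are taken to be in cycles (an assumption); dividing by rel_freq
    converts cycles to relative time.
    """
    hits_a = sum(mru_hits[:a_ways])
    hits_b = sum(mru_hits[a_ways:])
    a_cost = (hits_a + hits_b + misses) * cost_a
    b_cost = (hits_b + misses) * cost_b
    miss_cost = misses * cost_miss
    return (a_cost + b_cost + miss_cost) / rel_freq


def best_config(mru_hits, misses, configs):
    """Return the number of A ways with the lowest total access cost."""
    return min(configs, key=lambda k: total_cost(mru_hits, misses, k, **configs[k]))


# Hypothetical parameters: larger A partitions run at a lower relative clock.
CONFIGS = {
    1: dict(cost_a=2, cost_b=5, cost_miss=50, rel_freq=1.0),
    2: dict(cost_a=2, cost_b=5, cost_miss=50, rel_freq=0.9),
    3: dict(cost_a=2, cost_b=5, cost_miss=50, rel_freq=0.8),
    4: dict(cost_a=2, cost_b=0, cost_miss=50, rel_freq=0.7),  # no B partition
}

print(best_config([900, 50, 30, 20], misses=10, configs=CONFIGS))
print(best_config([250, 250, 250, 250], misses=10, configs=CONFIGS))
```

Because the MRU counters are independent of the current A/B split, one interval of counts is enough to score every configuration. With these made-up numbers, hits concentrated at MRU position 0 favor the smallest, fastest A partition, while evenly spread hits favor the largest.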
14Resizable issue queue control
- Measures the exploitable ILP for each queue size
- Timestamp counter is reset at the start of an interval and incremented each cycle
- During rename, a destination register is given a timestamp equal to the timestamp of its slowest source operand plus the execution latency
- The maximum timestamp, MAX_N, is maintained for each of the four possible queue sizes over N fetched instructions (N = 16, 32, 48, 64)
- ILP is estimated as N/MAX_N
- Queue size with highest ILP (normalized to frequency) is selected
Read the paper
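The timestamp scheme on this slide can be sketched in a few lines. This is a simplified model under stated assumptions: instructions are given as hypothetical `(dest, sources, latency)` tuples in rename order, registers not yet written in the interval are treated as ready at time 0, every latency is at least 1, and the relative frequencies in the usage example are made-up values. Normalizing to frequency is modeled as multiplying the ILP estimate by the relative clock of each queue size.

```python
def estimate_ilp(instrs, sizes=(16, 32, 48, 64)):
    """Estimate ILP as N / MAX_N for each candidate queue size N.

    instrs: sequence of (dest, sources, latency) tuples in rename order.
    """
    ready = {}    # register -> timestamp at which its value is available
    max_ts = 0    # largest timestamp seen so far in this window
    ilp = {}
    for count, (dest, sources, latency) in enumerate(instrs, start=1):
        slowest = max((ready.get(s, 0) for s in sources), default=0)
        ts = slowest + latency
        if dest is not None:
            ready[dest] = ts
        max_ts = max(max_ts, ts)
        if count in sizes:
            ilp[count] = count / max_ts   # MAX_N over the first N instructions
    return ilp


def pick_queue_size(ilp, rel_freq):
    """Select the size with the highest frequency-normalized ILP."""
    return max(ilp, key=lambda n: ilp[n] * rel_freq[n])


# Hypothetical relative clock frequency for each queue size.
REL_FREQ = {16: 1.0, 32: 0.9, 48: 0.8, 64: 0.7}

independent = [(f"r{i}", (), 1) for i in range(64)]   # no dependences
chain = [("r1", ("r1",), 1)] * 64                     # fully serial chain

print(pick_queue_size(estimate_ilp(independent), REL_FREQ))
print(pick_queue_size(estimate_ilp(chain), REL_FREQ))
```

With these inputs the independent stream benefits from the largest queue despite its slower clock, while the serial chain gains nothing from extra entries and the smallest, fastest queue is selected.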
15Resizable hardware some details
- Front end domain
  - Icache A: 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way
  - Branch predictor sized with Icache
    - gshare PHT: 16KB-64KB
    - Local BHT: 2KB-8KB
    - Local PHT: 1024 entries
    - Meta: 16KB-64KB
- Load/store domain
  - Dcache A: 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way
  - L2 cache A sized with Dcache
    - 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way
- Integer and floating point domains
  - Issue queue: 16, 32, 48, or 64 entries
16Evaluation methodology
- SimpleScalar and Cacti
- 40 benchmarks from SPEC, Mediabench, and Olden
- Baseline: best overall performing fully synchronous 21264-like design found out of 1,024 simulated options
- Adaptive MCD costs imposed
  - Additional branch penalty of 2 integer domain cycles and 1 front end domain cycle (overpipelined)
  - Frequency penalty as much as 31%
  - Mean PLL locking time of 15 µsec
- Program-Adaptive: profile the application and pick the best adaptive configuration for the whole program
- Phase-Adaptive: use online cache and issue queue control mechanisms
17Performance improvement
[Figure: performance improvement per benchmark, grouped by Mediabench, Olden, and SPEC]
18Phase behavior art
[Figure: issue queue entries over a 100 million instruction window]
19Phase behavior apsi
[Figure: Dcache A size (32KB, 64KB, 128KB, 256KB) over a 100 million instruction window]
20Performance summary
- Program Adaptive: 17% performance improvement
- Phase Adaptive: 20% performance improvement
  - Automatic
  - Never degrades performance for 40 applications
  - Few phases in chosen application windows; could perhaps do better
- Distribution of chosen configurations for Program Adaptive (Integer IQ, FP IQ, D/L2 Cache, Icache)
IQ entries: 16: 85%, 32: 5%, 48: 5%, 64: 5%
D/L2 Cache: 32KB/256KB: 50%, 64KB/512KB: 18%, 128KB/1MB: 23%, 256KB/2MB: 10%
Icache: 16KB: 55%, 32KB: 18%, 48KB: 8%, 64KB: 20%
IQ entries: 16: 73%, 32: 15%, 48: 8%, 64: 5%
21Domain frequency versus IQ size
22Conclusions
- Application phase behavior can be exploited to improve performance in addition to power savings
- GALS approach is key to localizing the impact of slowing the clock
- Cache and queue control mechanisms can evaluate all possible configurations within a single interval
- Phase adaptive approach improves performance by as much as 48% and by an average of 20%
23Future work
- Explore multiple adaptive structures in each domain
- Better take into account the branch predictor
- Resize the instruction cache by sets rather than ways
- Explore better issue queue design alternatives
- Build circuits
- Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores