Dynamically Trading Frequency for Complexity in a GALS Microprocessor

About This Presentation

Slides: 25

Provided by: davidha91

Learn more at: https://microarch.org

Transcript and Presenter's Notes

Title: Dynamically Trading Frequency for Complexity in a GALS Microprocessor

1
Dynamically Trading Frequency for Complexity in a
GALS Microprocessor

Steven Dropsho, Greg Semeraro, David H. Albonesi,
Grigorios Magklis, Michael L. Scott
University of Rochester

2
The gist of the paper
Radical idea: Trade off frequency and hardware complexity dynamically at runtime rather than statically at design time
The new twist: A Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture is key to making this worthwhile
3
Application phase behavior

Varying behavior over time
Can exploit to save power
[Figure: gcc behavior over time, with panels for L2 misses, E per interval, L1I misses, L1D misses, branch mispred, and IPC; adaptive issue queue]
Sherwood, Sair, Calder, ISCA 2003
Buyuktosunoglu, et al., GLSVLSI 2001
4
What about performance?
RAM delay
entries:         32    24    16    8
relative delay:  1.0   0.77  0.52  0.31
CAM delay
entries:         32    24    16    8
relative delay:  1.0   0.77  0.55  0.34
Lower power and faster access time!
Buyuktosunoglu, GLSVLSI 2001
5
What about performance?
How do we exploit the faster speed?
Variable latency
Increase frequency when downsizing
Decrease frequency when upsizing
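
As a rough illustration of why downsizing can pay off, the sketch below treats performance as IPC times frequency and uses the relative RAM delays from the earlier slide. It assumes the resized structure alone sets the cycle time, which gives an upper bound on the frequency gain; the function names and the 20% IPC loss are hypothetical.

```python
# Relative delays from the RAM table on the earlier slide (32 entries = 1.0).
RAM_RELATIVE_DELAY = {32: 1.0, 24: 0.77, 16: 0.52, 8: 0.31}


def relative_frequency(relative_delay):
    """Upper bound on clock speedup if this structure alone sets the cycle time."""
    return 1.0 / relative_delay


def relative_performance(relative_ipc, relative_delay):
    """Performance ~ IPC x frequency, both relative to the largest configuration."""
    return relative_ipc * relative_frequency(relative_delay)


# Hypothetical example: shrinking to 16 entries costs 20% of the IPC.
perf = relative_performance(relative_ipc=0.8,
                            relative_delay=RAM_RELATIVE_DELAY[16])
print(f"relative performance at 16 entries: {perf:.2f}")
```

Under these assumptions, downsizing is worthwhile whenever the relative IPC stays above the relative delay.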
6
What about performance?
[Figure: processor block diagram with a single clock. Blocks: L1 I-Cache, Fetch Unit, Br Pred; Dispatch, Rename, ROB; integer and FP Issue Queues, each with ALUs and RF; Ld/St Unit, L1 D-Cache, L2 Cache; Main Memory]
Albonesi, ISCA 1998
7
What about performance?
Albonesi, ISCA 1998
8
Enter GALS
[Figure: GALS processor partitioned into clock domains. Front-end Domain: L1 I-Cache, Fetch Unit, Br Pred, Dispatch, Rename, ROB. Integer Domain: Issue Queue, ALUs, RF. FP Domain: Issue Queue, ALUs, RF. Memory Domain: Ld/St Unit, L1 D-Cache, L2 Cache. External Domain: Main Memory]
Semeraro et al., HPCA 2002
Iyer and Marculescu, ISCA 2002
9
Outline

Motivation and background
Adaptive GALS microarchitecture
Control mechanisms
Evaluation methodology
Results
Conclusions and future work

10
Adaptive GALS microarchitecture
[Figure: GALS block diagram with the adaptive structures drawn at several sizes: L1 I-Cache and Br Pred in the Front-end Domain, Issue Queues in the Integer and FP Domains, and L1 D-Cache and L2 Cache in the Memory Domain. Fixed blocks: Fetch Unit; Dispatch, Rename, ROB; Ld/St Unit; ALUs, RF]
11
Adaptive GALS operation
[Figure: the adaptive GALS block diagram in operation, with the L1 I-Cache, Br Pred, L2 Cache, Issue Queues, and L1 D-Cache each shown at multiple sizes]
12
Resizable cache organization

Access A part first, then B part on a miss
Swap A and B blocks on an A miss, B hit
Select A/B split according to application phase behavior
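
The access policy above can be sketched for a single cache set. This is a minimal model, not the hardware design: it assumes blocks are kept in MRU order and that the first `a_ways` positions form the A partition, so promoting a block on a B hit moves it into A and pushes A's least recently used block into B. The class and method names are hypothetical.

```python
class ABSet:
    """One cache set whose blocks are kept in MRU order (index 0 = most recent).

    Assumption: the first `a_ways` positions form the A partition and the
    remaining positions form the B partition.
    """

    def __init__(self, ways=4, a_ways=1):
        self.ways = ways
        self.a_ways = a_ways
        self.blocks = []  # tags, most recently used first

    def access(self, tag):
        """Return 'A', 'B', or 'miss' and update the MRU order."""
        if tag in self.blocks:
            pos = self.blocks.index(tag)
            self.blocks.pop(pos)
            # Promote to the front; on a B hit this swaps the block into A.
            self.blocks.insert(0, tag)
            return "A" if pos < self.a_ways else "B"
        # Miss: fill the new block into A and evict the LRU block if full.
        self.blocks.insert(0, tag)
        if len(self.blocks) > self.ways:
            self.blocks.pop()
        return "miss"


cache_set = ABSet(ways=4, a_ways=1)
results = [cache_set.access(tag) for tag in ["x", "y", "x", "x"]]
print(results)
```

With `a_ways=1`, the second access to `x` finds it in B (it was displaced by `y`), and the third finds it in A.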

13
Resizable cache control
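MRU State
[Figure: example accesses to blocks A, B, C, D, showing the MRU ordering of the set from (MRU) to (LRU) and which counter, MRU0 to MRU3, each access increments]

Config A1 B3: hitsA = MRU0; hitsB = MRU1 + MRU2 + MRU3
Config A2 B2: hitsA = MRU0 + MRU1; hitsB = MRU2 + MRU3
Config A3 B1: hitsA = MRU0 + MRU1 + MRU2; hitsB = MRU3
Config A4 B0: hitsA = MRU0 + MRU1 + MRU2 + MRU3; hitsB = 0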

Calculate the cost for each possible configuration

A access costs = (hitsA + hitsB + misses) * CostA
B access costs = (hitsB + misses) * CostB
Miss access costs = misses * CostMiss
Total access cost = A + B + Miss (normalized to frequency)
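
The cost calculation can be sketched as follows, assuming the terms combine as (hitsA + hitsB + misses) * CostA, (hitsB + misses) * CostB, and misses * CostMiss. The function name, the dictionary layout, and the reading of "normalized to frequency" as dividing a cycle count by the relative clock frequency of each configuration are assumptions for illustration.

```python
def config_costs(mru_hits, misses, cost_a, cost_b, cost_miss, freq):
    """Return {a_ways: total access cost} for every A/B split.

    mru_hits[i] is the number of hits at MRU position i in the interval.
    cost_a, cost_b and freq are dicts keyed by a_ways (ways in the A part);
    costs are in cycles and freq is a relative clock frequency (assumed).
    """
    ways = len(mru_hits)
    totals = {}
    for a_ways in range(1, ways + 1):
        hits_a = sum(mru_hits[:a_ways])   # hits the A part would capture
        hits_b = sum(mru_hits[a_ways:])   # hits that fall through to B
        a_cost = (hits_a + hits_b + misses) * cost_a[a_ways]
        b_cost = (hits_b + misses) * cost_b[a_ways]
        miss_cost = misses * cost_miss
        # Convert cycles to time so configurations at different clocks compare.
        totals[a_ways] = (a_cost + b_cost + miss_cost) / freq[a_ways]
    return totals


# Hypothetical interval: nearly all hits land in the MRU0 position.
costs = config_costs(
    mru_hits=[900, 50, 30, 20], misses=10,
    cost_a={1: 2, 2: 2, 3: 2, 4: 2},
    cost_b={1: 6, 2: 6, 3: 6, 4: 6},
    cost_miss=40,
    freq={1: 1.0, 2: 0.9, 3: 0.8, 4: 0.7},
)
best = min(costs, key=costs.get)
print(best, costs)
```

Because one set of MRU counters yields hitsA and hitsB for every split, all configurations can be costed from a single interval of measurements.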
14
Resizable issue queue control

Measures the exploitable ILP for each queue size
Timestamp counter is reset at the start of an interval and incremented each cycle
During rename, a destination register is given a timestamp based on the timestamp + execution latency of its slowest source operand
The maximum timestamp, MAX_N, is maintained for each of the four possible queue sizes over N fetched instructions (N = 16, 32, 48, 64)
ILP is estimated as N/MAX_N
Queue size with highest ILP (normalized to frequency) is selected
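
The timestamp scheme can be sketched in software. This is a simplified model rather than the hardware mechanism: it treats any register not yet written in the interval as ready at time 0, assumes every latency is at least 1, and reads "normalized to frequency" as scaling ILP by a relative frequency per queue size. The function names and instruction tuple format are hypothetical.

```python
def estimate_ilp(instructions, sizes=(16, 32, 48, 64)):
    """Estimate ILP = N / MAX_N for each window size N.

    instructions: list of (dest_reg, src_regs, latency) in fetch order.
    Registers not written earlier in the interval are treated as ready
    at time 0 (a simplifying assumption).
    """
    reg_ts = {}   # timestamp at which each register value is available
    max_ts = 0    # running maximum timestamp over fetched instructions
    ilp = {}
    for count, (dest, srcs, latency) in enumerate(instructions, start=1):
        # The slowest source operand determines when execution can start.
        ready = max((reg_ts.get(s, 0) for s in srcs), default=0)
        ts = ready + latency
        reg_ts[dest] = ts
        max_ts = max(max_ts, ts)
        if count in sizes:
            ilp[count] = count / max_ts   # snapshot MAX_N at N instructions
    return ilp


def select_queue_size(ilp, freq):
    """Pick the size with the highest ILP scaled by relative frequency."""
    return max(ilp, key=lambda n: ilp[n] * freq[n])


# Hypothetical relative frequencies: smaller queues clock faster.
FREQ = {16: 1.0, 32: 0.9, 48: 0.8, 64: 0.7}
chain = [("r1", ["r1"], 1)] * 64                      # fully dependent code
independent = [(f"r{i}", [], 1) for i in range(64)]   # fully parallel code
print(select_queue_size(estimate_ilp(chain), FREQ))
print(select_queue_size(estimate_ilp(independent), FREQ))
```

A dependent chain gains nothing from a larger window, so the smallest, fastest queue wins; fully independent code favors the largest queue despite its slower clock.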

Read the paper
15
Resizable hardware some details

Front end domain
  Icache A: 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way
  Branch predictor sized with Icache
    gshare PHT: 16KB-64KB
    Local BHT: 2KB-8KB
    Local PHT: 1024 entries
    Meta: 16KB-64KB
Load/store domain
  Dcache A: 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way
  L2 cache A sized with Dcache: 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way
Integer and floating point domains
  Issue queue: 16, 32, 48, or 64 entries

16
Evaluation methodology

SimpleScalar and Cacti
40 benchmarks from SPEC, Mediabench, and Olden
Baseline: best overall performing fully synchronous 21264-like design found out of 1,024 simulated options
Adaptive MCD costs imposed
  Additional branch penalty of 2 integer domain cycles and 1 front end domain cycle (overpipelined)
  Frequency penalty as much as 31%
  Mean PLL locking time of 15 µsec
Program-Adaptive: profile application and pick the best adaptive configuration for the whole program
Phase-Adaptive: use online cache and issue queue control mechanisms

17
Performance improvement
[Figure: performance improvement per benchmark, grouped by suite: Mediabench, Olden, SPEC]
18
Phase behavior: art
[Figure: issue queue entries over a 100 million instruction window]
19
Phase behavior: apsi
[Figure: Dcache A size (32KB, 64KB, 128KB, 256KB) over a 100 million instruction window]
20
Performance summary

Program-Adaptive: 17% performance improvement
Phase-Adaptive: 20% performance improvement
  Automatic
  Never degrades performance for 40 applications
  Few phases in chosen application windows could perhaps do better
Distribution of chosen configurations for Program-Adaptive (Integer IQ, FP IQ, D/L2 Cache, Icache)

Issue queue entries: 16: 85%, 32: 5%, 48: 5%, 64: 5%
D/L2 Cache: 32KB/256KB: 50%, 64KB/512KB: 18%, 128KB/1MB: 23%, 256KB/2MB: 10%
Icache: 16KB: 55%, 32KB: 18%, 48KB: 8%, 64KB: 20%
Issue queue entries: 16: 73%, 32: 15%, 48: 8%, 64: 5%
21
Domain frequency versus IQ size
22
Conclusions

Application phase behavior can be exploited to improve performance in addition to power savings
GALS approach is key to localizing the impact of slowing the clock
Cache and queue control mechanisms can evaluate all possible configurations within a single interval
Phase-adaptive approach improves performance by as much as 48% and by an average of 20%

23
Future work

Explore multiple adaptive structures in each domain
Better take into account the branch predictor
Resize the instruction cache by sets rather than ways
Explore better issue queue design alternatives
Build circuits
Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores

24
Dynamically Trading Frequency for Complexity in a
GALS Microprocessor