Dynamically Trading Frequency for Complexity in a GALS Microprocessor

Learn more at: https://microarch.org
1
Dynamically Trading Frequency for Complexity in a
GALS Microprocessor
  • Steven Dropsho, Greg Semeraro, David H. Albonesi,
    Grigorios Magklis, Michael L. Scott
  • University of Rochester

2
The gist of the paper
Radical idea: trade off frequency and hardware
complexity dynamically at runtime rather than
statically at design time.
The new twist: a Globally-Asynchronous,
Locally-Synchronous (GALS) microarchitecture is
key to making this worthwhile.
3
Application phase behavior
  • Varying behavior over time
  • Can exploit to save power

[Chart: energy per interval, IPC, L1I/L1D/L2 misses,
and branch mispredictions over time for gcc with an
adaptive issue queue]
Sherwood, Sair, Calder, ISCA 2003
Buyuktosunoglu, et al., GLSVLSI 2001
4
What about performance?
RAM delay
  entries:         32     24     16      8
  relative delay:  1.00   0.77   0.52   0.31

CAM delay
  entries:         32     24     16      8
  relative delay:  1.00   0.77   0.55   0.34
Lower power and faster access time!
Buyuktosunoglu, GLSVLSI 2001
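The delay numbers above suggest how much headroom downsizing creates: if the resized structure sets the domain's critical path, the clock can speed up roughly in inverse proportion to its relative delay. A minimal sketch of that arithmetic (the inverse-linear scaling and the function name are our illustrative assumptions, not the paper's circuit model):

```python
# Relative RAM delay vs. issue-queue size, from the table above.
RELATIVE_DELAY = {32: 1.00, 24: 0.77, 16: 0.52, 8: 0.31}

def scaled_frequency(base_freq_ghz: float, entries: int) -> float:
    """Attainable domain frequency after resizing.

    Assumes the resized structure is the domain's critical path,
    so frequency scales inversely with its relative delay.
    """
    return base_freq_ghz / RELATIVE_DELAY[entries]

# Downsizing from 32 to 16 entries nearly doubles the attainable clock.
print(round(scaled_frequency(1.0, 16), 2))  # → 1.92
```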
5
What about performance?
How do we exploit the faster speed?
  • Variable latency
  • Increase frequency when downsizing
  • Decrease frequency when upsizing
6
What about performance?
[Diagram: fully synchronous pipeline driven by a single
global clock: Fetch Unit (L1 I-Cache, Br Pred);
Dispatch, Rename, ROB; integer and FP issue queues
feeding ALUs/RF; Ld/St Unit with L1 D-Cache; L2 Cache;
Main Memory]
Albonesi, ISCA 1998
7
What about performance?
Albonesi, ISCA 1998
8
Enter GALS
[Diagram: the same pipeline partitioned into clock
domains: Front-end (Fetch Unit, L1 I-Cache, Br Pred,
Dispatch/Rename/ROB); Integer (issue queue, ALUs/RF);
FP (issue queue, ALUs/RF); Memory (Ld/St Unit,
L1 D-Cache, L2 Cache); External (Main Memory)]
Semeraro et al., HPCA 2002
Iyer and Marculescu, ISCA 2002
9
Outline
  • Motivation and background
  • Adaptive GALS microarchitecture
  • Control mechanisms
  • Evaluation methodology
  • Results
  • Conclusions and future work

10
Adaptive GALS microarchitecture
[Diagram: adaptive GALS microarchitecture: the same
domains, each built around resizable structures: a
four-configuration L1 I-Cache and branch predictor in
the front end, resizable integer and FP issue queues,
and a four-configuration L1 D-Cache and L2 cache in
the memory domain]
11
Adaptive GALS operation
[Diagram: adaptive GALS operation: each domain
independently selects a structure configuration and
adjusts its clock to match, running faster with
downsized structures and slower with upsized ones]
12
Resizable cache organization
  • Access the A part first, then the B part on a miss
  • Swap A and B blocks on an A miss / B hit
  • Select the A/B split according to application phase
    behavior

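The A/B access policy above can be sketched for a single set. This is an illustrative model of one set's behavior, not the paper's hardware: the class and method names are ours, and the A-victim choice and miss-fill shift are simplifications.

```python
class ResizableSet:
    """One set of a 4-way cache split into a fast A part and a slow B part.

    ways[0:a_ways] form the A part (probed first); the rest form B.
    """
    def __init__(self, a_ways: int):
        self.a_ways = a_ways
        self.ways = [None] * 4          # tags, A part first

    def lookup(self, tag):
        if tag in self.ways[:self.a_ways]:
            return "A hit"
        if tag in self.ways[self.a_ways:]:
            # Swap the B-hit block with an A block so hot blocks
            # migrate into the fast A part.
            b = self.ways.index(tag)
            a = self.a_ways - 1         # A victim slot (illustrative choice)
            self.ways[a], self.ways[b] = self.ways[b], self.ways[a]
            return "B hit"
        # Miss: fill into A, shifting older blocks toward B (simplified).
        self.ways = [tag] + self.ways[:-1]
        return "miss"

s = ResizableSet(a_ways=2)
print(s.lookup("x"))   # → miss
print(s.lookup("x"))   # → A hit
```

A block that hits in B gets swapped forward, so a second access to it lands in the fast A part.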
13
Resizable cache control
Track hits by MRU position: on each cache hit, increment
a counter for the block's MRU position (MRU0 = most
recently used through MRU3 = least recently used). For a
split with k ways in A, positions 0 to k-1 are A hits and
the rest are B hits:
  • Config A1 B3: hitsA = MRU0; hitsB = MRU1 + MRU2 + MRU3
  • Config A2 B2: hitsA = MRU0 + MRU1; hitsB = MRU2 + MRU3
  • Config A3 B1: hitsA = MRU0 + MRU1 + MRU2; hitsB = MRU3
  • Config A4 B0: hitsA = MRU0 + MRU1 + MRU2 + MRU3; hitsB = 0

[Figure: example access sequence over blocks A, B, C, D
showing the MRU state after each access]

  • Calculate the cost for each possible configuration:

A access cost = (hitsA + hitsB + misses) × CostA
B access cost = (hitsB + misses) × CostB
Miss cost = misses × CostMiss
Total access cost = A + B + Miss (normalized to frequency)
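The cost comparison above can be sketched directly from the MRU hit counters. The function name and the per-configuration cost values are hypothetical placeholders; the logic follows the slide's formulas: every access probes A, A misses also probe B, and only full misses go to memory.

```python
def best_config(mru, misses, cost_a, cost_b, cost_miss):
    """Pick the A/B split with the lowest total access cost.

    mru[i] counts hits whose block sat at MRU position i (0 = MRU).
    Config k puts the k most-recently-used ways in the fast A part,
    so positions 0..k-1 are A hits and the rest are B hits.
    cost_a[k]/cost_b[k] are per-access latencies for that split
    (cost_b[4] = 0, since config A4/B0 has no B part).
    """
    best = None
    for k in (1, 2, 3, 4):                                # A1/B3 .. A4/B0
        hits_a = sum(mru[:k])
        hits_b = sum(mru[k:])
        total = ((hits_a + hits_b + misses) * cost_a[k]   # every access probes A
                 + (hits_b + misses) * cost_b[k]          # A misses also probe B
                 + misses * cost_miss)                    # full misses hit memory
        if best is None or total < best[1]:
            best = (k, total)
    return best[0]

# Hit pattern heavily skewed toward MRU0 favors a small, fast A part.
cfg = best_config(mru=[80, 10, 5, 5], misses=10,
                  cost_a={1: 1.0, 2: 1.2, 3: 1.4, 4: 1.6},
                  cost_b={1: 1.4, 2: 1.2, 3: 1.0, 4: 0.0},
                  cost_miss=20.0)
print(cfg)  # → 1
```

Because the MRU counters summarize an entire interval, all four splits can be costed at once at the interval boundary, which is what lets the controller evaluate every configuration without trying each one.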
14
Resizable issue queue control
  • Measures the exploitable ILP for each queue size
  • A timestamp counter is reset at the start of an
    interval and incremented each cycle
  • During rename, a destination register is given a
    timestamp equal to the timestamp plus execution
    latency of its slowest source operand
  • The maximum timestamp, MAXN, is maintained for
    each of the four possible queue sizes over the
    first N fetched instructions (N = 16, 32, 48, 64)
  • ILP is estimated as N / MAXN
  • The queue size with the highest ILP (normalized to
    frequency) is selected

Read the paper
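The timestamp scheme above amounts to measuring the dependence-critical path through each window of N instructions. A minimal sketch under our own simplifications (instructions as (dest, sources, latency) tuples, unlimited issue width, latencies of at least one cycle):

```python
def estimate_ilp(instructions, sizes=(16, 32, 48, 64)):
    """Estimate exploitable ILP for each candidate queue size.

    Each instruction is (dest, sources, latency). A register's
    timestamp is the cycle its value is ready:
    ready = max(source timestamps) + latency.
    ILP over the first N instructions is N / MAXN, where MAXN is
    the largest timestamp seen so far (the critical path length).
    """
    ready = {}                      # register -> ready timestamp
    max_ts = 0
    ilp = {}
    for n, (dest, srcs, lat) in enumerate(instructions, start=1):
        ts = max((ready.get(s, 0) for s in srcs), default=0) + lat
        ready[dest] = ts
        max_ts = max(max_ts, ts)
        if n in sizes:
            ilp[n] = n / max_ts
    return ilp

# Two independent dependence chains: ILP of 2 at either window size.
insts = [("r1", (), 1), ("r2", (), 1),
         ("r3", ("r1",), 1), ("r4", ("r2",), 1)]
print(estimate_ilp(insts, sizes=(2, 4)))  # → {2: 2.0, 4: 2.0}
```

The controller would then pick the queue size whose estimated ILP, normalized to that size's attainable frequency, is highest.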
15
Resizable hardware some details
  • Front end domain
    • Icache A: 16KB 1-way, 32KB 2-way, 48KB 3-way,
      64KB 4-way
    • Branch predictor sized with the Icache:
      gshare PHT 16KB-64KB, local BHT 2KB-8KB,
      local PHT 1024 entries, meta 16KB-64KB
  • Load/store domain
    • Dcache A: 32KB 1-way, 64KB 2-way, 128KB 4-way,
      256KB 8-way
    • L2 cache A sized with the Dcache:
      256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way
  • Integer and floating point domains
    • Issue queue: 16, 32, 48, or 64 entries

16
Evaluation methodology
  • SimpleScalar and Cacti
  • 40 benchmarks from SPEC, Mediabench, and Olden
  • Baseline: the best overall performing fully
    synchronous 21264-like design out of 1,024
    simulated options
  • Adaptive MCD costs imposed:
    • Additional branch penalty of 2 integer domain
      cycles and 1 front end domain cycle
      (overpipelined)
    • Frequency penalty of as much as 31%
    • Mean PLL locking time of 15 µsec
  • Program-Adaptive: profile the application and pick
    the best adaptive configuration for the whole
    program
  • Phase-Adaptive: use the online cache and issue queue
    control mechanisms

17
Performance improvement
[Chart: performance improvement for the Mediabench,
Olden, and SPEC benchmarks]
18
Phase behavior: art
[Chart: selected issue queue entries over a 100 million
instruction window]
19
Phase behavior: apsi
[Chart: selected Dcache A size (32KB-256KB) over a
100 million instruction window]
20
Performance summary
  • Program-Adaptive: 17% performance improvement
  • Phase-Adaptive: 20% performance improvement
    • Automatic
    • Never degrades performance across the 40
      applications
    • A few phases in the chosen application windows
      could perhaps do better
  • Distribution of configurations chosen by
    Program-Adaptive:

Integer IQ:  16 entries 85%, 32 5%, 48 5%, 64 5%
FP IQ:       16 entries 73%, 32 15%, 48 8%, 64 5%
D/L2 Cache:  32KB/256KB 50%, 64KB/512KB 18%,
             128KB/1MB 23%, 256KB/2MB 10%
Icache:      16KB 55%, 32KB 18%, 48KB 8%, 64KB 20%
21
Domain frequency versus IQ size
22
Conclusions
  • Application phase behavior can be exploited to
    improve performance in addition to saving power
  • The GALS approach is key to localizing the impact of
    slowing the clock
  • The cache and queue control mechanisms can evaluate
    all possible configurations within a single
    interval
  • The phase-adaptive approach improves performance by
    as much as 48% and by an average of 20%

23
Future work
  • Explore multiple adaptive structures in each
    domain
  • Better account for the branch predictor
  • Resize the instruction cache by sets rather than
    ways
  • Explore better issue queue design alternatives
  • Build circuits
  • Dynamically customized heterogeneous multi-core
    architectures using phase-adaptive GALS cores

24
Dynamically Trading Frequency for Complexity in a
GALS Microprocessor
  • Steven Dropsho, Greg Semeraro, David H. Albonesi,
    Grigorios Magklis, Michael L. Scott
  • University of Rochester