Energy Awareness and Uncertainty in Design at Microarchitectural Level


Diana Marculescu, Dept. of Electrical and Computer Engineering. Austin Conference on Energy Efficient Design, March 1, 2005.


1
Energy Awareness and Uncertainty in Design at
Microarchitectural Level

Diana Marculescu
Dept. of Electrical and Computer Engineering
Carnegie Mellon University
dianam_at_ece.cmu.edu
http://www.ece.cmu.edu/enyac

2
Why energy aware?
Courtesy of Deo Singh, Intel Cool Chips Tutorial, MICRO-32

Moderate performance improvements come at significant power costs
Wide variation, within and across applications, of:
Application behavior
User profile
Environment characteristics

3
The Road Towards Energy Awareness

Need for synergistic approaches that tie lower-level technology capabilities (supply and threshold voltage scaling, speed scaling) to the knobs available at higher levels of abstraction (microarchitecture, architecture, system level)
Need to cope with design complexity through localized, fine-grain power management
Possible solution: voltage/frequency islands that may run at lower speed and power for a prescribed workload

4
The Road Towards Energy Awareness

One possible solution: Globally Asynchronous, Locally Synchronous (GALS) processing

5
Outline

Why energy awareness via GALS?
Impact of variability in design
Joint energy and variability metrics
Case study: Globally Asynchronous, Locally Synchronous (GALS) vs. fully synchronous architectures
Looking ahead

6
Why energy awareness via GALS?
Microarchitectural (uarch) variability

Many approaches that target energy awareness may not be complexity-effective, and may thus indirectly worsen overall variability
Our claim: GALS designs may enable lower microarchitecture-driven variability

7
Energy-variability interactions
8
Faster clock speeds ? Less variability

Deeper pipelining worsens the impact of random variation
Total variation impact is insensitive to pipeline depth
Variations worsen as the number of critical paths increases
Most performance-enhancing techniques increase the number of critical paths; the simplest (superpipelining) increases overall random variation

Source: Shekhar Borkar, Intel, 2004

9
Energy awareness ? Less variability

Deeper pipelining worsens the impact of random variation
Total variation impact is insensitive to pipeline depth
Variations worsen as the number of critical paths increases
Many techniques for achieving application-driven energy awareness increase variability: selective voltage scaling, adaptive resource scaling, etc.
E.g., they exploit existing slack to hide voltage scaling latencies

Source: Shekhar Borkar, Intel, 2004

10
If Energy is the question, is GALS the answer?
Synchronous
GALS

Potential for fine-grain adaptability
Different speeds among synchronous blocks
Different voltages, and hence potential for lower
power consumption
Can be used with on-the-fly application-driven
adaptation
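
The link between lower voltage and lower power can be illustrated with the standard CMOS dynamic power relation, P = activity * C * Vdd^2 * f. The sketch below is only an illustration: the function name, the activity factor, the capacitance and the two operating points are assumed values, not figures from the slides.

```python
def dynamic_power(vdd, freq_hz, cap_farads=1e-9, activity=0.5):
    """Standard CMOS dynamic power estimate: P = activity * C * Vdd^2 * f."""
    return activity * cap_farads * vdd ** 2 * freq_hz

# Hypothetical island operating points (values chosen for illustration only).
full_speed = dynamic_power(vdd=1.8, freq_hz=1.0e9)
slow_island = dynamic_power(vdd=1.2, freq_hz=0.5e9)
ratio = slow_island / full_speed
print(f"slow island uses {ratio:.2f}x the dynamic power of the full-speed one")
```

Because voltage enters quadratically, an island that can tolerate a lower clock benefits twice: once from the lower frequency and once from the lower supply voltage that the slower clock permits.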

11
If Variability is the question, is GALS the
answer?
Synchronous
GALS

Potential for less process- or system-parameter-induced variability
Local clock domains are characterized by tighter
static or dynamic variations
Enable faster local speeds AND better overall
energy efficiency

12
More on GALS systems
13
GALS design issues

Metastability resolution
Always a problem when interfacing clock domains
Synchronizers and arbiters
Possible solution: mixed-timing FIFOs (Chelcea et al., DAC 2001)
Local clock generation
Ring oscillators (Muttersbach et al., ASYNC 2000)
Failure modeling
Synchronization failures can happen
The goal is to maximize Mean Time Between Failures (MTBF)
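
To make the MTBF goal concrete, the following sketch uses the common textbook synchronizer model, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). This model is a general one and is not taken from the slides; the function name and every device parameter below are assumptions chosen for illustration.

```python
import math

def synchronizer_mtbf(resolve_time_s, tau_s, window_s, f_clk_hz, f_data_hz):
    """Textbook synchronizer model: MTBF = exp(t_r / tau) / (T_w * f_clk * f_data)."""
    return math.exp(resolve_time_s / tau_s) / (window_s * f_clk_hz * f_data_hz)

# All device parameters below are assumed for illustration only.
one_cycle = synchronizer_mtbf(1.0e-9, 50e-12, 20e-12, 1.0e9, 100e6)
two_cycles = synchronizer_mtbf(2.0e-9, 50e-12, 20e-12, 1.0e9, 100e6)
print(f"MTBF with 1 ns to resolve: {one_cycle:.3e} s")
print(f"MTBF with 2 ns to resolve: {two_cycles:.3e} s")
```

The exponential dependence on resolution time is why an extra synchronization cycle buys a large MTBF improvement, at the cost of added latency on every cross-domain transfer.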

14
GALS problems of interest

Granularity of clock domain partitioning
How many clock domains?
Architecture definition: where to decouple?
Fine-grain, dynamic control strategy for
adjusting the voltage/clock speed
Occupancy-based, threshold control algorithms
Attack-decay control algorithms
Plus using cross-domain dependency information

15
How many frequency islands?
18
Architecture definition: where to decouple?

Automatic partitioning into clock domains (Hemani et al., DAC 1999)
Applied only to random logic; not helpful for high-end processors
A typical superscalar core exhibits natural decoupling
Front-end (I-cache and branch prediction hardware)
Decode, register renaming
Integer, FP and memory domains
Local clock grids usually follow the same type of partitioning

19
Performance increase coefficient

Significant speed-up can be achieved by increasing the clock speed in the Fetch or Memory partitions, followed by the Integer and FP partitions

20
How many frequency islands?

Performance penalty increases with the number of
asynchronous interfaces

21
Impact on energy reduction

Due to the increase in execution time, total
energy per task required by GALS processors can
actually increase
For more than 5 clock domains, a GALS design is
no longer cost-effective

22
Energy and performance trends

Average power decreases, but overall energy may increase
Reasons:
Performance drops due to asynchronous communication
Hence longer execution time and more overhead for unused modules
Longer branch recovery pipeline
Consequently, more speculative execution
Higher fetch-to-commit time for each instruction
Higher occupancies of rename tables and issue queues
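
A small worked example shows how average power can fall while energy per task rises, since energy is average power multiplied by execution time. The numbers below are hypothetical and are not measurements from the presentation.

```python
def task_energy(avg_power_w, exec_time_s):
    """Energy per task is average power multiplied by execution time."""
    return avg_power_w * exec_time_s

# Hypothetical numbers: the GALS core draws 5% less average power
# but needs 10% longer to finish the same task.
sync_energy = task_energy(avg_power_w=20.0, exec_time_s=1.00)
gals_energy = task_energy(avg_power_w=19.0, exec_time_s=1.10)
print(f"synchronous: {sync_energy:.1f} J, GALS: {gals_energy:.1f} J")
```

In this assumed scenario the slowdown outweighs the power saving, so the GALS design ends up spending more energy per task.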

23
Where Does Energy Go?

Longer execution times translate into larger
energy costs

24
Delay - voltage dependency

Fine-grain voltage reduction can be beneficial in
GALS systems
Each clock domain can be run at a different speed
Vdd is the supply voltage
Vt is the threshold voltage
α is a technology-defined constant between 1 and 2
α is 1.2-1.6 for the present generation (Chen et al., 1998)
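
The symbols listed above match the widely used alpha-power law, in which gate delay is proportional to Vdd / (Vdd - Vt)^α. Assuming that is the dependency meant here, the sketch below shows how delay grows as the supply voltage of a clock domain is lowered. The function name is hypothetical, Vdd = 1.8 V and Vt = 0.2 V are borrowed from the microarchitecture settings slide, and α = 1.4 is an assumed midpoint of the quoted 1.2-1.6 range.

```python
def relative_delay(vdd, vt=0.2, alpha=1.4, vdd_ref=1.8):
    """Gate delay relative to vdd_ref under the alpha-power law,
    where delay is proportional to Vdd / (Vdd - Vt)**alpha."""
    def delay(v):
        return v / (v - vt) ** alpha
    return delay(vdd) / delay(vdd_ref)

# Sweep a few supply voltages inside the 0.7 V - 1.8 V range.
for vdd in (1.8, 1.4, 1.0, 0.7):
    print(f"Vdd = {vdd:.1f} V -> delay x{relative_delay(vdd):.2f}")
```

The delay penalty grows faster than linearly as Vdd approaches Vt, which bounds how far a slow domain can usefully be scaled down.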

25
A possible solution: dynamic control strategy

Threshold-based algorithm (Iyer et al., 2002)
Assumes two operating modes
Selects the appropriate mode based on the Issue Queue occupancy
Attack-decay algorithm (Semeraro et al., 2002)
Assumes several (tens of) operating modes
Tries to preserve the same Issue Queue occupancy, while more aggressively pursuing the best power/performance trade-off
Additional improvements (Talpes et al., 2003)
Fetch clock speed scaled to match commit rates
Use cross-domain dependency information to eliminate false positives in low occupancy rates
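
The threshold-based idea can be sketched as a two-mode controller driven by issue queue occupancy. This is a minimal illustration under stated assumptions, not the published algorithm: the function name, the hysteresis behavior and the sample trace are hypothetical. The default thresholds 9 and 12 echo the Integer-domain DVS thresholds in the microarchitecture settings, and treating them as a low/high pair is an assumption.

```python
def threshold_dvs(occupancies, low_threshold=9, high_threshold=12, start_mode="high"):
    """Two-mode controller with hysteresis driven by issue queue occupancy.

    Drops to the low-speed mode when occupancy falls below low_threshold and
    returns to the high-speed mode when it rises above high_threshold.
    Occupancies between the two thresholds leave the mode unchanged.
    """
    mode = start_mode
    trace = []
    for occupancy in occupancies:
        if occupancy < low_threshold:
            mode = "low"
        elif occupancy > high_threshold:
            mode = "high"
        trace.append(mode)
    return trace

# Hypothetical per-interval issue queue occupancies for one clock domain.
samples = [14, 10, 8, 10, 11, 13, 7]
print(threshold_dvs(samples))
```

The gap between the two thresholds keeps the domain from toggling modes on every small occupancy change, which matters because each voltage/frequency switch has a latency cost.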

26
Impact on energy reduction (Talpes et al., 2003)

By using DVS, an average energy reduction of up to 25% can be achieved at the expense of a 10% penalty in performance
Note that these are pessimistic results: variability is not taken into account

27
Joint Variability-Energy Characterization
28
GALS and variability

Our claim: GALS (or frequency island-based) design allows for less process-induced variability
While globally enabling better performance and/or better power/performance trade-offs
Intuitive observation on the role of microarchitectural decisions
The number of critical paths per clock domain is smaller
Hence, less variability
Beneficial impact on other parameters as well
Will include smaller variations in temperature per clock domain

29
Variability modeling (Bowman et al., 2002)

Assume normal distributions for critical path delay (Tcp,nom: nominal critical path delay)
Maximum critical path delay distribution (f: probability density function, F: cumulative distribution function, Ncp: number of critical paths)
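
For Ncp independent, identically distributed path delays, the density of the slowest path follows the standard order-statistics result Ncp * f(t) * F(t)^(Ncp - 1). The sketch below assumes that form, with delays normalized to a nominal value of 1.0 and an assumed 5% standard deviation, and integrates it numerically to show how the mean of the maximum delay grows with Ncp. Function names and parameter values are illustrative, not taken from the slides.

```python
from statistics import NormalDist

def max_delay_pdf(t, n_cp, mean=1.0, sigma=0.05):
    """Density of the slowest of n_cp independent, identically distributed
    normal path delays: n_cp * f(t) * F(t)**(n_cp - 1)."""
    path = NormalDist(mean, sigma)
    return n_cp * path.pdf(t) * path.cdf(t) ** (n_cp - 1)

def mean_max_delay(n_cp, mean=1.0, sigma=0.05, steps=4000):
    """Mean of the maximum delay by midpoint integration over mean +/- 8 sigma."""
    lo, hi = mean - 8 * sigma, mean + 8 * sigma
    dt = (hi - lo) / steps
    points = (lo + (i + 0.5) * dt for i in range(steps))
    return sum(t * max_delay_pdf(t, n_cp, mean, sigma) * dt for t in points)

# Normalized delay units; sigma = 5% of nominal is an assumed value.
for n_cp in (1, 10, 100, 1000):
    print(f"Ncp = {n_cp:4d}: mean max delay = {mean_max_delay(n_cp):.4f}")
```

The mean of the maximum rises steadily with the number of critical paths, which is the intuition behind expecting a smaller clock domain, with fewer critical paths, to close timing at a faster clock.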

30
Putting energy, performance and variability
together

A possible probabilistic design metric that needs to be maximized (FMAX: clock speed distribution) (Borkar, 2004)
However, in the case of high-end processors
Clock speed does not necessarily translate into performance
Moreover, IPC-increasing artifacts affect variability
Proposed joint metric that must be minimized
The goal is to include variability in the maximum critical path or minimum clock speed, with and without temperature modeling (θ: temperature, ncp: logic depth) (Basu et al., 2004)

31
Modeling details

Assume only within-die (WID) effects
WID variations mostly affect the mean of the FMAX distribution
Die-to-die (D2D) variations affect the spread
Concerned only with:
Static gate length variations
Dynamic temperature variations
Here we do not look at the impact on leakage variability
Use device counts per module/clock domain to estimate:
Total die area or clock domain area
Number of critical paths Ncp per die or clock domain
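
The different roles of the two variation sources can be illustrated with a small Monte Carlo experiment on the slowest-path delay, whose inverse sets FMAX. This is a simplified sketch, not the model used in the study: it assumes normally distributed variations, a nominal delay normalized to 1.0, a D2D offset shared by all paths on a die, and an independent WID term per critical path. All names and parameter values are hypothetical.

```python
import random
import statistics

def simulate_max_delay(n_cp, sigma_wid, sigma_d2d, trials=2000, seed=0):
    """Sample the slowest-path delay of a die (nominal delay normalized to 1.0).

    The D2D term is one offset shared by all paths on a die; the WID term is
    drawn independently for each critical path.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        die_offset = rng.gauss(0.0, sigma_d2d)
        slowest = max(rng.gauss(0.0, sigma_wid) for _ in range(n_cp))
        samples.append(1.0 + die_offset + slowest)
    return statistics.mean(samples), statistics.stdev(samples)

# Assumed 5% standard deviation for each source, applied one at a time.
wid_mean, wid_std = simulate_max_delay(n_cp=100, sigma_wid=0.05, sigma_d2d=0.0)
d2d_mean, d2d_std = simulate_max_delay(n_cp=100, sigma_wid=0.0, sigma_d2d=0.05)
print(f"WID only: mean {wid_mean:.3f}, std {wid_std:.3f}")
print(f"D2D only: mean {d2d_mean:.3f}, std {d2d_std:.3f}")
```

Under these assumptions, WID variation mainly shifts the mean of the slowest-path delay because the maximum over many paths is taken, while D2D variation leaves the mean near nominal and widens the spread.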

32
Microarchitecture settings

Pipeline: 16 stages, 4-way out-of-order
Instruction Window: 64 entries (32 Int, 16 FP, 16 Mem)
Load/Store Queue: 32 entries
I-Cache: 32K, 2-way set-associative, 1-cycle hit time, LRU replacement
D-Cache: 32K, 4-way set-associative, 2-cycle hit time, LRU replacement
L2 Cache: unified, 256K, 4-way set-associative, LRU replacement, 10-cycle access time
Memory access time: 100 cycles
Functional Units: 4 Integer ALUs, 2 Integer MUL/DIV, 2 Memory ports, 2 FP Adders, 1 FP MUL/DIV
Branch Prediction: G-share, 11 bits history, 2048 entries
Technology: 0.13 µm STMicro technology (high speed)
Vdd: 1.8V, Vt: 0.2V
Normalized leakage current per device: 80 nA
Clock Speed / Vdd: 250MHz - 1000MHz, 0.7V - 1.8V
DVS Thresholds: Integer 9, 12; Memory 9, 12; FP 6, 9
DVS Speed levels: Integer High 1GHz, Low 750MHz

33
Case study: GALS vs. fully synchronous processors
34
Critical path delay distribution without Temp

Locally clocked domains have a mean maximum critical path delay that is 2-12% smaller than that of the fully synchronous baseline

35
Critical path delay distribution with Temp

Locally clocked domains have a mean maximum critical path delay that is 8-18% smaller than that of the fully synchronous baseline

36
Q metric distribution with and without Temp

Using local speed/voltage scaling per clock domain decreases Q by 26% compared to the synchronous baseline

37
Q metric probability with and without Temp

GALS-T-DVS eliminates most of the high-Q bin when
compared to the synchronous baseline

38
Instead of a summary

Microarchitectural modeling of process variability effects is possible
In conjunction with fine-grain DVS, minimally-clocked machines provide a better joint energy/performance/variability metric than their fully synchronous counterparts
Considered only WID-induced gate length effects and temperature-induced effects
Ahead:
Both WID and D2D variability
Leakage variations
True microarchitecture design exploration with variability in mind

39
Thank you!