Energy Awareness and Uncertainty in Design at Microarchitectural Level

Diana Marculescu. Dept. of Electrical and Computer Engineering ... 2005 Diana Marculescu. Austin Conference on Energy Efficient Design - March 1, 2005 ...

1
Energy Awareness and Uncertainty in Design at
Microarchitectural Level
  • Diana Marculescu
  • Dept. of Electrical and Computer Engineering
  • Carnegie Mellon University
  • dianam_at_ece.cmu.edu
  • http://www.ece.cmu.edu/enyac

2
Why energy aware?
Courtesy of Deo Singh, Intel Cool Chips Tutorial,
MICRO-32
  • Moderate performance improvements come at
    significant power costs
  • Wide variation of application behavior, user
    profile, and environment characteristics, within
    and across applications

3
The Road Towards Energy Awareness
  • Need for synergistic approaches that tie
    lower-level technology capabilities (such as
    supply and threshold voltage or speed scaling)
    to knobs available at higher levels of
    abstraction (microarchitecture, architecture,
    system level)
  • Need to cope with design complexity issues
    through localized, fine-grain power management
  • Possible solution: the use of voltage/frequency
    islands that may run at lower speed/power for a
    prescribed workload

4
The Road Towards Energy Awareness

One possible solution: Globally Asynchronous,
Locally Synchronous processing
5
Outline
  • Why energy awareness via GALS?
  • Impact of variability in design
  • Joint energy and variability metrics
  • Case study
  • Globally Asynchronous, Locally Synchronous (GALS)
    vs. fully synchronous architectures
  • Ahead

6
Why energy aware via GALS?
uarch variability
  • Many approaches targeting energy awareness may
    not be complexity effective
  • And thus may indirectly worsen overall
    variability
  • Our claim is that GALS designs may enable lower
    uarchitecture-driven variability

7
Energy Variability interactions
8
Faster clock speeds ? Less variability
  • Deeper pipelining worsens random variation impact
  • Total variation impact insensitive to pipeline
    depth
  • Variations worsen with increasing number of
    critical paths
  • Most performance-enhancing techniques increase
    the number of critical paths; the simplest
    (superpipelining) increases overall random
    variations

Source: Shekhar Borkar, Intel, 2004
9
Energy awareness ? Less variability
  • Deeper pipelining worsens random variation impact
  • Total variation impact insensitive to pipeline
    depth
  • Variations worsen with increasing number of
    critical paths
  • Many techniques for achieving application-driven
    energy awareness increase variability: selective
    voltage scaling, adaptive resource scaling, etc.
    (e.g., they exploit existing slack to hide
    voltage scaling latencies)

Source: Shekhar Borkar, Intel, 2004
10
If Energy is the question, is GALS the answer?
Synchronous
GALS
  • Potential for fine-grain adaptability
  • Different speeds among synchronous blocks
  • Different voltages, and hence potential for lower
    power consumption
  • Can be used with on-the-fly application-driven
    adaptation

11
If Variability is the question, is GALS the
answer?
Synchronous
GALS
  • Potential for less process- or system
    parameter-induced variability
  • Local clock domains are characterized by tighter
    static or dynamic variations
  • Enable faster local speeds AND better overall
    energy efficiency

12
More on GALS systems
13
GALS design issues
  • Metastability resolution
  • Always a problem when interfacing clock domains
  • Synchronizers and arbiters
  • Possible solution: mixed-timing FIFOs (Chelcea et
    al., DAC 2001)
  • Local clock generation
  • Ring oscillators (Muttersbach et al., ASYNC 2000)
  • Failure modeling
  • Synchronization failures can happen
  • The goal is to maximize Mean Time Between Failures
    (MTBF)
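
The MTBF goal can be made concrete with the commonly used first-order synchronizer model, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). The slides do not give this formula, so it is an assumption here, and every parameter name and number below is hypothetical and chosen only for illustration.

```python
import math


def synchronizer_mtbf(resolve_time_s, tau_s, window_s, clock_hz, data_hz):
    """First-order synchronizer MTBF estimate, in seconds.

    resolve_time_s: time allowed for metastability to resolve (t_r)
    tau_s:          regeneration time constant of the flip-flop (tau)
    window_s:       metastability window of the flip-flop (T_w)
    clock_hz:       frequency of the receiving clock domain
    data_hz:        rate of asynchronous data transitions
    """
    return math.exp(resolve_time_s / tau_s) / (window_s * clock_hz * data_hz)


# Hypothetical device numbers, not taken from the presentation.
TAU = 50e-12
WINDOW = 20e-12

# One clock period of resolution time versus two (an extra synchronizer stage).
mtbf_one_period = synchronizer_mtbf(1e-9, TAU, WINDOW, 1e9, 250e6)
mtbf_two_periods = synchronizer_mtbf(2e-9, TAU, WINDOW, 1e9, 250e6)

print(f"one period:  {mtbf_one_period:.3g} s")
print(f"two periods: {mtbf_two_periods:.3g} s")
```

Under this model, MTBF grows exponentially with the allowed resolution time, which is why extra synchronizer stages buy reliability at the cost of added cross-domain latency.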

14
GALS problems of interest
  • Granularity of clock domain partitioning
  • How many clock domains?
  • Architecture definition: where to decouple?
  • Fine-grain, dynamic control strategy for
    adjusting the voltage/clock speed
  • Occupancy-based, threshold control algorithms
  • Attack-decay control algorithms
  • Plus using cross-domain dependency information

15
How many frequency islands?
16
How many frequency islands?
17
How many frequency islands?
18
Architecture definition: where to decouple?
  • Automatic partitioning into clock domains (Hemani
    et al., DAC 1999)
  • Applied only to random logic, not helpful for
    high-end processors
  • A typical superscalar core exhibits natural
    decoupling
  • Front-end (I-cache and BP hardware)
  • Decode, Register renaming
  • Integer, FP and memory domains
  • Local clock grids usually follow the same type of
    partitioning

19
Performance increase coefficient
  • Significant speed-up can be achieved by increasing
    clock speed in the Fetch or Memory partitions,
    followed by the Integer and FP partitions

20
How many frequency islands?
  • Performance penalty increases with the number of
    asynchronous interfaces

21
Impact on energy reduction
  • Due to the increase in execution time, total
    energy per task required by GALS processors can
    actually increase
  • For more than 5 clock domains, a GALS design is
    no longer cost-effective

22
Energy and performance trends
  • Average power decreases, but overall energy may
    increase
  • Reasons
  • Performance drops due to asynchronous
    communication
  • Hence longer execution time and more overhead for
    unused modules
  • Longer branch recovery pipeline
  • Consequently, more speculative execution
  • Higher fetch to commit time for each instruction
  • Higher occupancies of rename tables and issue
    queues
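
The point that average power can fall while energy per task rises follows from energy = average power x execution time. The sketch below works through that arithmetic; the ratios are hypothetical and are not measurements from this study.

```python
def relative_energy(power_ratio, time_ratio):
    """Energy per task relative to a baseline.

    power_ratio: average power of the new design / baseline average power
    time_ratio:  execution time of the new design / baseline execution time
    """
    return power_ratio * time_ratio


def break_even_slowdown(power_ratio):
    """Largest execution-time ratio for which energy per task does not increase."""
    return 1.0 / power_ratio


# Hypothetical case: 10% lower average power, 15% longer execution time.
print(relative_energy(0.90, 1.15))   # above 1.0, so energy per task went up
print(break_even_slowdown(0.90))     # slowdown beyond this costs energy
```

In other words, a design that saves 10% average power can tolerate only about an 11% slowdown before the energy per task exceeds the baseline.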

23
Where Does Energy Go?
  • Longer execution times translate into larger
    energy costs

24
Delay - voltage dependency
  • Fine-grain voltage reduction can be beneficial in
    GALS systems
  • Each clock domain can be run at a different speed
  • Vdd is the supply voltage
  • Vt is the threshold voltage
  • α is a technology-defined constant between 1 and
    2
  • α is 1.2-1.6 for the present generation (Chen et
    al., 1998)
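
The delay equation itself did not survive in this transcript. The symbols listed match the widely used alpha-power law, in which gate delay is proportional to Vdd / (Vdd - Vt)^alpha; the sketch below assumes that form. The function names and the proportionality constant `k` are hypothetical, and the 1.8 V / 0.2 V values are the nominal settings quoted later in this deck.

```python
def gate_delay(vdd, vt, alpha, k=1.0):
    """Alpha-power-law delay estimate: k * Vdd / (Vdd - Vt)**alpha."""
    if vdd <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return k * vdd / (vdd - vt) ** alpha


def relative_frequency(vdd, vdd_nominal, vt, alpha):
    """Achievable clock speed at vdd, as a fraction of the speed at vdd_nominal."""
    return gate_delay(vdd_nominal, vt, alpha) / gate_delay(vdd, vt, alpha)


# Assumed exponent of 1.4, inside the 1.2-1.6 range quoted above.
for vdd in (1.8, 1.5, 1.2, 0.9):
    print(vdd, round(relative_frequency(vdd, 1.8, 0.2, 1.4), 3))
```

Because delay rises steeply as Vdd approaches Vt, a clock domain that can tolerate a slower clock can run at a noticeably lower supply voltage, which is what makes per-domain voltage scaling attractive.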

25
A possible solution: Dynamic control strategy
  • Threshold-based algorithm (Iyer et al., 2002)
  • Assumes two operating modes
  • Selects the appropriate mode based on the Issue
    Queue occupancy
  • Attack-decay algorithm (Semeraro et al., 2002)
  • Assumes several (tens of) operating modes
  • Tries to preserve the same Issue Queue occupancy,
    while more aggressively pursuing the best
    power/performance trade-off
  • Additional improvements (Talpes et al., 2003)
  • Fetch clock speed scaled to match commit rates
  • Use cross-domain dependency information to
    eliminate false positives in low occupancy rates
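
The attack-decay idea can be illustrated with a heavily simplified sketch: a large change in issue queue occupancy between sampling intervals triggers a proportionally large frequency change (attack), while a stable occupancy lets the frequency drift down slowly (decay). This is not the published algorithm; the function name, the parameter values, and the 32-entry queue size are assumptions, and only the 250-1000 MHz range comes from the settings in this deck.

```python
def attack_decay_step(freq_mhz, prev_occupancy, occupancy, *,
                      queue_size=32, deviation=0.05,
                      attack=0.06, decay=0.01,
                      f_min=250.0, f_max=1000.0):
    """Return the next domain frequency given two consecutive occupancy samples."""
    # Occupancy change expressed as a fraction of the queue capacity.
    change = (occupancy - prev_occupancy) / queue_size
    if change > deviation:
        freq_mhz *= 1.0 + attack      # queue filling up: speed the domain up
    elif change < -deviation:
        freq_mhz *= 1.0 - attack      # queue draining: slow the domain down
    else:
        freq_mhz *= 1.0 - decay       # stable: probe for a lower power point
    # Keep the result inside the supported frequency range.
    return min(f_max, max(f_min, freq_mhz))


# Hypothetical occupancy trace for one domain, starting at 800 MHz.
freq = 800.0
history = []
samples = [10, 10, 20, 21, 8, 8]
for prev, cur in zip(samples, samples[1:]):
    freq = attack_decay_step(freq, prev, cur)
    history.append(round(freq, 1))
print(history)
```

Compared with a two-mode threshold scheme, a controller of this shape moves through many operating points, so it needs a finer-grained set of voltage/frequency pairs to choose from.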

26
Impact on energy reduction (Talpes et al., 2003)
  • By using DVS, an average energy reduction of up
    to 25% can be achieved at the expense of a 10%
    penalty in performance
  • Note that these are pessimistic results:
    variability is not taken into account

27
Joint Variability-Energy Characterization
28
GALS and variability
  • Our claim: GALS (or frequency island-based)
    design allows for less process-induced
    variability
  • While globally enabling better performance
    and/or better power/performance trade-offs
  • Intuitive observation on the role of uarch
    decisions
  • The number of critical paths per clock domain is
    smaller
  • Hence, less variability
  • Beneficial impact on other parameters as well
  • Will include smaller variations in temperature
    per clock domain

29
Variability modeling (Bowman et al., 2002)
  • Assume normal distributions for critical path
    delay (Tcp,nom: nominal critical path delay)
  • Maximum critical path delay distribution (f:
    probability density, F: cumulative probability
    function, Ncp: number of critical paths)
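
The distribution formulas on this slide were lost in transcription. Assuming the Ncp critical paths are independent and share the same normal delay distribution (independence is an assumption made here), the maximum delay follows the standard order-statistic result f_max(t) = Ncp * f(t) * F(t)^(Ncp - 1). The sketch below evaluates it numerically; the function names and the 5% sigma are hypothetical.

```python
import math


def normal_pdf(t, mean, sigma):
    """Density f(t) of a single critical path delay."""
    z = (t - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(t, mean, sigma):
    """Cumulative probability F(t) of a single critical path delay."""
    return 0.5 * (1.0 + math.erf((t - mean) / (sigma * math.sqrt(2.0))))


def max_delay_pdf(t, mean, sigma, ncp):
    """Density of the slowest of ncp independent, identical critical paths."""
    return ncp * normal_pdf(t, mean, sigma) * normal_cdf(t, mean, sigma) ** (ncp - 1)


def mean_max_delay(mean, sigma, ncp, steps=4000):
    """Expected maximum delay, by midpoint integration over mean +/- 8 sigma."""
    lo = mean - 8.0 * sigma
    dt = 16.0 * sigma / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * dt
        total += t * max_delay_pdf(t, mean, sigma, ncp)
    return total * dt


# Nominal delay normalized to 1.0 with a hypothetical 5% standard deviation.
for ncp in (1, 10, 100, 1000):
    print(ncp, round(mean_max_delay(1.0, 0.05, ncp), 4))
```

The expected maximum delay grows with Ncp, which is the intuition behind the claim that a clock domain with fewer critical paths can be clocked closer to its nominal speed.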

30
Putting energy, performance and variability
together
  • A possible probabilistic design metric that needs
    to be maximized (FMAX: clock speed distribution)
    (Borkar, 2004)
  • However, in the case of high-end processors
  • Clock speed does not necessarily translate into
    performance
  • Moreover, IPC-increasing artifacts affect
    variability
  • Proposed joint metric that must be minimized
  • The goal is to include variability in the maximum
    critical path or minimum clock speed, with and
    without temperature modeling (q: temperature,
    ncp: logic depth) (Basu et al., 2004)

31
Modeling details
  • Assume only within-die (WID) effects
  • WID variations mostly affect the mean of the FMAX
    distribution
  • Die-to-die (D2D) variations affect the spread
  • Concerned only with
  • Static gate length variations
  • Dynamic temperature variations
  • Here we do not look at the impact on leakage
    variability
  • Use device counts per module/clock domain to
    estimate
  • Total die area or clock domain area
  • Number of critical paths Ncp per die or clock
    domain

32
Microarchitecture settings
  • Pipeline: 16 stages, 4-way out-of-order
  • Instruction window: 64 entries (32 Int, 16 FP,
    16 Mem)
  • Load/store queue: 32 entries
  • I-cache: 32K, 2-way set-associative, 1-cycle hit
    time, LRU replacement
  • D-cache: 32K, 4-way set-associative, 2-cycle hit
    time, LRU replacement
  • L2 cache: unified, 256K, 4-way set-associative,
    LRU replacement, access time 10 cycles
  • Memory access time: 100 cycles
  • Functional units: 4 integer ALUs, 2 integer
    MUL/DIV, 2 memory ports, 2 FP adders, 1 FP MUL/DIV
  • Branch prediction: G-share, 11 bits of history,
    2048 entries
  • Technology: 0.13 um STMicro technology (high
    speed)
  • Vdd: 1.8 V, Vt: 0.2 V
  • Normalized leakage 80 nA
  • current per device 1
  • Clock speed / Vdd: 250 MHz - 1000 MHz, 0.7 V - 1.8 V
  • DVS thresholds: Integer - 9, 12; Memory - 9, 12;
    FP - 6, 9
  • DVS speed levels: Integer - High 1 GHz, Low
    750 MHz

33
Case study: GALS vs. fully synchronous processors
34
Critical path delay distribution without Temp
  • Locally clocked domains have a mean value for the
    maximum critical path delay that is 2-12% smaller
    than for the fully synchronous baseline

35
Critical path delay distribution with Temp
  • Locally clocked domains have a mean value for the
    maximum critical path delay that is 8-18% smaller
    than for the fully synchronous baseline

36
Q metric distribution with and without Temp
  • Using local speed/voltage scaling per clock
    domain decreases Q by 26% when compared to the
    synchronous baseline

37
Q metric probability with and without Temp
  • GALS-T-DVS eliminates most of the high-Q bin when
    compared to the synchronous baseline

38
Instead of a summary
  • Microarchitectural modeling of process
    variability effects is possible
  • In conjunction with fine-grain DVS,
    minimally-clocked machines provide a better joint
    energy/performance/variability metric than their
    fully synchronous counterparts
  • Considered only WID-induced gate length effects
    and temperature-induced effects
  • Ahead
  • both WID and D2D variability
  • leakage variations
  • true microarchitecture design exploration with
    variability in mind

39
Thank you!
  • More information
  • CMU's Energy Aware Computing Group
  • http://www.ece.cmu.edu/enyac