Single-ISA Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction

Provided by: mesl2

1
Single-ISA Heterogeneous Multi-core Architectures:
The Potential for Processor Power Reduction
  • Rakesh Kumar (UCSD)
  • Keith Farkas (HP Labs)
  • Norman Jouppi (HP Labs)
  • Partha Ranganathan (HP Labs)
  • Dean Tullsen (UCSD)

2
Clock Gating
  • Power reduction
  • Gating
  • Voltage/frequency scaling

3
Clock Gating
4
Clock Gating
5
Little core, little power
6
  • But doesn't adding cores waste real estate?

7
It does, but not much
All cores put together are only 15% larger than the
EV8- core by itself
8
Single-ISA Heterogeneous Multi-Core Architectures
  • Have multiple heterogeneous cores on the same die
  • Match workload (or workload phase) to core that
    achieves best efficiency according to some
    objective function
  • Matching is achieved by the ability to
    dynamically switch between cores
  • Power down the unused cores completely

9
Outline of Talk
  • Motivation
  • A Heterogeneous Multi-core Architecture
  • Assumptions
  • Decisions
  • Methodology
  • Results
  • Conclusions

10
A Heterogeneous Multi-core Architecture
EV8-
EV6
EV5
EV4
3.5MB, 7 way
All cores assumed to be implemented in 0.1 micron
and clocked at 2.1 GHz, input voltage 1.2 V
11
Properties of the Cores
EV8- consumes 18 times more power than EV4! It is
more than 85 times bigger too!
12
Core-switching
  • Goal: keep the application on the best core at
    all times, "best" according to an objective
    function
  • Switching is enabled by the OS-level scheduler
  • If the OS decides a switch is in order, it:
  • Powers up the new core
  • Triggers a cache flush to save dirty lines to L2
  • Signals the new core to start at a
    predefined point
  • The new core powers down the old core and returns
    from the timer interrupt handler
  • Hence, only one core is powered on at a time
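
The switch sequence above can be sketched as a small state model. This is only an illustration of the ordering of the steps; the `Core` class, the `dirty_lines` counter, and the list standing in for the shared L2 are hypothetical names, not part of the presented system.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    name: str
    powered: bool = False
    dirty_lines: int = 0                      # hypothetical count of dirty L1 lines
    log: list = field(default_factory=list)   # hypothetical event log


def switch_core(old: Core, new: Core, shared_l2: list) -> Core:
    """Model the switch steps from the slide; returns the core now running."""
    new.powered = True                               # 1. power up the new core
    shared_l2.append((old.name, old.dirty_lines))    # 2. flush dirty lines to L2
    old.dirty_lines = 0
    new.log.append("start at predefined point")      # 3. signal the new core
    old.powered = False                              # 4. new core powers down the old one
    return new


l2 = []                                  # stand-in for the shared L2 cache
ev6 = Core("EV6", powered=True, dirty_lines=12)
ev4 = Core("EV4")
running = switch_core(ev6, ev4, l2)
```

After the call, only `ev4` is powered, which matches the "one core powered on at a time" property.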

13
Power modeling
  • Power depends on
  • Circuit design style
  • Process Parameters
  • Activity
  • No existing model replicated reported results
  • Developed a hybrid model
  • Used estimates from Wattch for getting
    activity-based dissipation
  • Used Wattch numbers to scale between known values
  • Details in the paper

14
Simulation Methodology
  • Simulator used: a multiprocessor derivative of
    SMTSIM
  • Integrated our power model into it
  • Benchmarks used: 14 chosen randomly from the
    SPEC2000 suite
  • Fast-forwarded for 2 billion instructions, then
    simulated for 1 billion instructions
  • Finest granularity for switching set to 1 million
    instructions

15
Outline
  • Motivation
  • A Heterogeneous Multi-core Architecture
  • Assumptions
  • Decisions
  • Methodology
  • Results
  • oracle dynamic
  • realistic dynamic
  • Conclusions

16
Objective Functions
  • Typically a composition of energy and performance
  • Can be fixed, or can change over time or within
    an application; it is simply an OS policy
  • We evaluate ones based on energy and energy-delay
    product (both under performance constraints)
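
A minimal sketch of how such an objective function might be applied to one interval, assuming performance loss is measured as extra delay relative to a baseline core. The function names and all numbers are made up for illustration; they are not results from the talk.

```python
def pick_core(samples, baseline, metric, max_perf_loss):
    """Return the core that minimises `metric` within the allowed slowdown.

    samples: {core: (energy, delay)} measured over one interval.
    baseline: core whose delay defines "no performance loss" (an assumption).
    """
    base_delay = samples[baseline][1]
    allowed = {
        core: (e, d)
        for core, (e, d) in samples.items()
        if d <= base_delay * (1.0 + max_perf_loss)
    }
    return min(allowed, key=lambda c: metric(*allowed[c]))


def energy(e, d):
    return e


def energy_delay(e, d):
    return e * d


# Made-up per-interval numbers, for illustration only.
interval = {
    "EV8-": (10.0, 1.00),
    "EV6": (4.0, 1.08),
    "EV5": (2.0, 1.40),
    "EV4": (1.0, 2.50),
}
least_energy = pick_core(interval, "EV8-", energy, 0.10)
best_ed = pick_core(interval, "EV8-", energy_delay, 0.50)
```

With these sample numbers the energy objective under a 10% loss bound picks `EV6`, while the energy-delay objective under a 50% bound picks `EV5`; changing the objective or the bound is just a policy change.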

17
Dynamically choosing the Core with Least Energy
(perf. loss < 10%)
18
Dynamically choosing the Core with Least Energy
(perf. loss < 10%)
All cores get used
19
Dynamically choosing the Core with Least Energy
(perf. loss < 10%): Summary of results
  • Switches are relatively infrequent
  • EV8- and EV6 emerge as the dominant cores

20
Dynamically choosing the Core with Best
Energy-Delay Product (perf. loss < 50%)
21
Dynamically choosing the Core with Best
Energy-Delay Product (perf. loss < 50%)
Alternates between EV4 and EV6.
22
Dynamically choosing the Core with Best
Energy-Delay Product (perf. loss < 50%): Summary of
Results
  • Switches are relatively infrequent
  • EV6 and EV4 emerge as the dominant cores

23
More on the Oracle Results
  • Even for present improvements, beats chip-wide
    voltage scaling handsomely (50.6% ED² improvement)
  • Effects that make oracle results unrealistic:
  • Core switching policy must be based on historical
    behavior, not on future behavior
  • Cost of switching cores must be accounted for

24
Low core switching overhead
  • Maximum granularity of switching is quite coarse
  • Switches are infrequent: long phases and dominant
    cores
  • Shared L2 cache reduces the cold-start effect
  • Also, though not used in the paper, various
    techniques for reducing switching overhead are
    possible

25
Realistic heuristics for minimizing energy-delay
product
  • neighbor: randomly sample one of the two
    neighboring cores in the performance continuum;
    choose the one with lower ED-product
  • neighbor-global: similar to neighbor, except the
    expected accumulated ED-product is used to make
    the decision
  • random: one other randomly chosen core is
    sampled; choose the one with lower ED-product
  • all: all other cores are sampled; choose the one
    with the lowest ED-product
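
The sampling heuristics above can be sketched roughly as follows. The core ordering along the performance continuum, the function names, and the ED-product values are assumptions for illustration; neighbor-global is left out because the slide gives only a one-line description of its accumulated ED-product estimate.

```python
import random

# Assumed ordering along the performance continuum, slowest to fastest.
CORES = ["EV4", "EV5", "EV6", "EV8-"]


def candidates(policy, current, rng):
    """Return the cores to sample for one decision under the given heuristic."""
    i = CORES.index(current)
    others = [c for c in CORES if c != current]
    if policy == "neighbor":
        near = [CORES[j] for j in (i - 1, i + 1) if 0 <= j < len(CORES)]
        return [rng.choice(near)]
    if policy == "random":
        return [rng.choice(others)]
    if policy == "all":
        return others
    raise ValueError(f"unknown policy: {policy}")


def next_core(policy, current, measure_ed, rng):
    """Sample the candidates and keep whichever core has the lowest ED-product."""
    pool = [current] + candidates(policy, current, rng)
    return min(pool, key=measure_ed)


# Made-up ED-products for one interval, for illustration only.
ed = {"EV4": 2.5, "EV5": 2.8, "EV6": 4.3, "EV8-": 10.0}
rng = random.Random(0)
chosen_all = next_core("all", "EV8-", ed.get, rng)
chosen_neighbor = next_core("neighbor", "EV8-", ed.get, rng)
```

Starting from `EV8-`, the all heuristic reaches the core with the lowest ED-product in one step, while neighbor can only move one position along the continuum per decision, so it samples fewer cores at the cost of slower convergence.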

26
Realistic heuristics for minimizing energy-delay
product
Achieves up to 93% of the oracle EDP savings
27
Addressing practical implications
  • Heterogeneity => higher design effort
  • Limit core diversity; still works well
  • Use off-the-shelf designs
  • Switching overhead
  • LOW: note the granularity of switching
  • Small ISA differences between the cores
  • Compile programs to the least common denominator
  • Use software traps

28
Conclusion
  • A single-ISA heterogeneous multi-core
    architecture has enormous potential for
    power-savings
  • 39% energy savings for 3% performance loss
  • 63% energy-delay savings for 22% performance loss
  • Realistic heuristics can achieve much of the
    savings potential
  • Up to 93%
  • This architecture is very implementable by using
    off-the-shelf designs

29
  • Backup Slides

30
Followup research by other groups (IBM)
31
Followup research by other groups (IBM)
Gentoo Linux 2.6.7 modified for scheduling
32
Followup research by other groups (IBM)
33
Followup research by other groups (IBM)
Similar to our results
34
Followup research by other groups (Intel)