Title: Single-ISA Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction
1 Single-ISA Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction
- Rakesh Kumar (UCSD)
- Keith Farkas (HP Labs)
- Norman Jouppi (HP Labs)
- Partha Ranganathan (HP Labs)
- Dean Tullsen (UCSD)
2 Clock Gating
- Power reduction
- Gating
- Voltage/frequency scaling
3 Clock Gating
4 Clock Gating
5 Little core, little power
6 But doesn't adding cores waste real estate?
7 It does, but not much
All cores put together are only 15% larger than the EV8- core by itself
8 Single-ISA Heterogeneous Multi-Core Architectures
- Have multiple heterogeneous cores on the same die
- Match workload (or workload phase) to the core that achieves best efficiency according to some objective function
- Matching is achieved by the ability to dynamically switch between cores
- Power down the unused cores completely
9 Outline of Talk
- Motivation
- A Heterogeneous Multi-core Architecture
- Assumptions
- Decisions
- Methodology
- Results
- Conclusions
10 A Heterogeneous Multi-core Architecture
EV8-
EV6
EV5
EV4
3.5 MB, 7-way
All cores assumed to be implemented in 0.1 micron technology and clocked at 2.1 GHz, with an input voltage of 1.2 V
11 Properties of the Cores
EV8- consumes 18 times more power than EV4! It is more than 85 times bigger too!
12 Core-switching
- Goal: keep the application on the best core at all points in time, where "best" is defined by an objective function
- Switching enabled by OS-level scheduler
- If the OS decides a switch is in order, it:
- Powers up the new core
- Triggers a cache flush to save dirty lines to L2
- Signals the new core to start at a predefined point
- New core powers down the old core and returns from the timer interrupt handler
- Hence, only one core powered on at a time
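The switch sequence on this slide can be sketched as a toy model. This is only an illustration under stated assumptions: the `Core` and `Chip` classes, the `switch_to` method, the `make_chip` helper, and the event log are hypothetical names invented here, and caches are reduced to sets of line addresses. It is not the OS scheduler code from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    name: str
    powered: bool = False
    dirty_lines: set = field(default_factory=set)  # dirty lines in this core's private cache


@dataclass
class Chip:
    cores: dict                                # core name -> Core
    active: str                                # name of the one powered-on core
    l2: set = field(default_factory=set)       # lines held in the shared L2 (toy model)
    log: list = field(default_factory=list)    # ordered record of switch steps

    def switch_to(self, target: str) -> None:
        """Move execution to `target`, following the four steps listed on the slide."""
        if target == self.active:
            return
        old, new = self.cores[self.active], self.cores[target]
        new.powered = True                       # 1. power up the new core
        self.log.append(f"power up {new.name}")
        self.l2 |= old.dirty_lines               # 2. flush dirty lines to the shared L2
        old.dirty_lines.clear()
        self.log.append(f"flush {old.name}")
        self.log.append(f"start {new.name}")     # 3. new core starts at a predefined point
        old.powered = False                      # 4. new core powers down the old core
        self.log.append(f"power down {old.name}")
        self.active = target


def make_chip(active: str = "EV8-") -> Chip:
    """Build the four-core chip with only `active` powered on (hypothetical helper)."""
    cores = {n: Core(n) for n in ("EV4", "EV5", "EV6", "EV8-")}
    cores[active].powered = True
    return Chip(cores=cores, active=active)
```

For example, calling `make_chip("EV8-").switch_to("EV6")` leaves EV6 as the only powered core once the switch completes, matching the last bullet of the slide.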
13 Power modeling
- Power depends on
- Circuit design style
- Process Parameters
- Activity
- No existing model replicated reported results
- Developed a hybrid model
- Used estimates from Wattch for getting activity-based dissipation
- Used Wattch numbers to scale between known values
- Details in the paper
14 Simulation Methodology
- Simulator used: a multiprocessor derivative of SMTSIM
- Integrated our power model into it
- Benchmarks used: 14 chosen randomly out of the SPEC2000 suite
- Fast-forwarded for 2 billion instructions, simulated for 1 billion instructions
- Finest granularity for switching set to 1 million instructions
15 Outline
- Motivation
- A Heterogeneous Multi-core Architecture
- Assumptions
- Decisions
- Methodology
- Results
- oracle dynamic
- realistic dynamic
- Conclusions
16 Objective Functions
- Typically a composition of energy and performance
- Can be fixed, or can change over time or within an application: it is simply an OS policy
- We evaluate ones based on energy and energy-delay product (both under performance constraints)
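The two objective functions evaluated here can be sketched in a few lines. The sketch assumes energy is power times execution time, energy-delay product is energy times execution time, and performance loss is the relative drop in 1/time against a baseline core. The function names and the sample numbers are invented for illustration (only the 18x power ratio between EV8- and EV4 echoes the slides); they are not measurements from the paper.

```python
def energy(sample):
    """Energy in joules for one interval: average power times execution time."""
    return sample["power_w"] * sample["time_s"]


def energy_delay(sample):
    """Energy-delay product: energy times execution time."""
    return energy(sample) * sample["time_s"]


def pick_core(samples, objective, baseline, max_perf_loss):
    """Return the core minimizing `objective` among cores within the performance bound.

    Performance is taken as 1/time, so the loss relative to the baseline core
    is 1 - base_time / time (an assumption of this sketch).
    """
    base_time = samples[baseline]["time_s"]
    eligible = {
        name: s
        for name, s in samples.items()
        if 1.0 - base_time / s["time_s"] <= max_perf_loss
    }
    return min(eligible, key=lambda name: objective(eligible[name]))


# Made-up per-interval samples, for illustration only.
SAMPLES = {
    "EV4": {"power_w": 5.0, "time_s": 4.0},
    "EV5": {"power_w": 10.0, "time_s": 1.9},
    "EV6": {"power_w": 20.0, "time_s": 1.05},
    "EV8-": {"power_w": 90.0, "time_s": 1.0},
}
```

With these invented numbers, `pick_core(SAMPLES, energy, "EV8-", 0.50)` and `pick_core(SAMPLES, energy_delay, "EV8-", 0.50)` choose different cores, which is the point of treating the objective function as a swappable OS policy.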
17 Dynamically choosing the Core with Least Energy (perf. loss < 10%)
18 Dynamically choosing the Core with Least Energy (perf. loss < 10%)
All cores get used
19 Dynamically choosing the Core with Least Energy (perf. loss < 10%): Summary of results
- Switches are relatively infrequent
- EV8- and EV6 emerge as the dominant cores
20 Dynamically choosing the Core with Best Energy-Delay Product (perf. loss < 50%)
21 Dynamically choosing the Core with Best Energy-Delay Product (perf. loss < 50%)
Alternates between EV4 and EV6.
22 Dynamically choosing the Core with Best Energy-Delay Product (perf. loss < 50%): Summary of Results
- Switches are relatively infrequent
- EV6 and EV4 emerge as the dominant cores
23 More on the Oracle Results
- Even for present improvements, beats chip-wide voltage scaling handsomely (50.6% ED² improvement)
- Effects that make oracle results unrealistic
- Core switching policy must be based on historical behavior, not on future behavior
- Cost of switching cores must be accounted for
24 Low core switching overhead
- Maximum granularity of switching quite coarse
- Infrequent switches: long phases and dominant cores
- Shared L2 cache reduces cold-start effect
- Also, even though not used in the paper, various techniques for reducing switching overhead are possible
25 Realistic heuristics for minimizing energy-delay product
- neighbor: sample randomly one of the two neighboring cores in the performance continuum; choose the one with lower ED-product
- neighbor-global: similar to neighbor, except expected accumulated ED-product used to make decision
- random: one other randomly chosen core is sampled; choose the one with lower ED-product
- all: all other cores are sampled; choose the one with lower ED-product
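A minimal sketch of three of these heuristics (neighbor, random, all) follows. It assumes the performance continuum is ordered EV4, EV5, EV6, EV8-, and that a caller-supplied `measure_edp` function returns the ED-product observed when sampling a given core. Both `measure_edp` and the function names are hypothetical. The neighbor-global variant is left out because the slide does not say how the expected accumulated ED-product is computed.

```python
import random

# Assumed ordering of the cores along the performance continuum (slowest to fastest).
CORES = ["EV4", "EV5", "EV6", "EV8-"]


def cores_to_sample(heuristic, current, rng):
    """Return the list of cores (other than `current`) that the heuristic samples."""
    i = CORES.index(current)
    others = [c for c in CORES if c != current]
    if heuristic == "neighbor":
        # One or two neighbors exist depending on position; pick one at random.
        neighbors = [CORES[j] for j in (i - 1, i + 1) if 0 <= j < len(CORES)]
        return [rng.choice(neighbors)]
    if heuristic == "random":
        return [rng.choice(others)]
    if heuristic == "all":
        return others
    raise ValueError(f"unknown heuristic: {heuristic}")


def next_core(heuristic, current, measure_edp, rng):
    """Sample cores per the heuristic and keep whichever has the lowest ED-product."""
    candidates = [current] + cores_to_sample(heuristic, current, rng)
    return min(candidates, key=measure_edp)
```

With a table of observed values such as `edp = {"EV4": 80.0, "EV5": 36.1, "EV6": 22.05, "EV8-": 90.0}` (invented numbers), `next_core("all", "EV8-", edp.get, random.Random(0))` moves to EV6, while `neighbor` from EV4 can only reach EV5 in one step. That is the trade-off between sampling cost and how quickly the policy finds the best core.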
26 Realistic heuristics for minimizing energy-delay product
Achieves up to 93% of the oracle EDP savings
27 Addressing practical implications
- Heterogeneity => higher design effort
- Limit core diversity; it still works well
- Use off-the-shelf designs
- Switching overhead
- LOW: note the granularity of switching
- Small ISA differences between the cores
- Compile programs to the least common denominator
- Use software traps
28 Conclusion
- A single-ISA heterogeneous multi-core architecture has enormous potential for power savings
- 39% energy savings for 3% performance loss
- 63% energy-delay savings for 22% performance loss
- Realistic heuristics can achieve much of the savings potential
- Up to 93%
- This architecture is very implementable by using off-the-shelf designs
29
30 Follow-up research by other groups (IBM)
31 Follow-up research by other groups (IBM)
Gentoo Linux 2.6.7 modified for scheduling
32 Follow-up research by other groups (IBM)
33 Follow-up research by other groups (IBM)
Similar to our results
34 Follow-up research by other groups (Intel)