Title: Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data
1 Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data
- Canturk Isci, Margaret Martonosi
- Princeton University
MICRO-36
2 Motivation
- Power is important!
- Measurement/modeling techniques help guide the design of power-aware and temperature-aware systems
- Real-system measurements:
- Help observe long time periods
- Help guide on-the-fly adaptive management
- Our work: live, run-time, per-component power/thermal measures
3 Simulation vs. Multimeters
- Multimeter: (+) Fast; (+) Fairly accurate; (+/-) Existing systems; (-) No on-chip detail
- Simulation: (+) Arbitrary detail; (+) Common base; (-) Slow; (-) Possibly inaccurate
- Counter-Based Power Estimation: (+) Fast (real-time); (+) Offers estimated view of on-chip detail
- But: 1) Are the right counters available? 2) How accurate is it?
4 Questions We Answer
- To what extent can CPU performance counters act as proxies to estimate CPU power?
- Can counter-based power estimations offer useful accuracy compared to multimeter measurements?
- What are some interesting uses of this measurement approach?
5 Power Simulation Overview
- Idealized view: for all components in a processor chip,
  $\mathrm{Power}_i = \mathrm{MaxPower}_i \times \mathrm{ArchScaling}_i \times \mathrm{AccessRate}_i$
- MaxPower_i: from die area and capacitance estimates
- ArchScaling_i, AccessRate_i: a cycle-level simulator gathers event counts and activity factors, and adjusts scaling
6 Counter-Based Power Estimation: An Overview of Our Approach
- Idealized view: for all components in a processor chip,
  $\mathrm{Power}_i = \mathrm{MaxPower}_i \times \mathrm{ArchScaling}_i \times \mathrm{AccessRate}_i$
- MaxPower_i: from die area and stressmarks
- ArchScaling_i: from microarchitectural properties
- AccessRate_i: from CPU performance counters!
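As a minimal sketch of the idealized model above, the following Python computes each component's power as MaxPower × ArchScaling × AccessRate and sums the results. All identifiers (`Component`, `component_power`, `total_power`) and every numeric value are illustrative assumptions, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    max_power_w: float   # assumed maximum power of the component (W)
    arch_scaling: float  # assumed microarchitectural scaling factor


def component_power(comp: Component, access_rate: float) -> float:
    """Idealized per-component power: MaxPower * ArchScaling * AccessRate."""
    return comp.max_power_w * comp.arch_scaling * access_rate


def total_power(components, access_rates):
    """Sum component powers; access_rates maps component name -> access rate."""
    return sum(component_power(c, access_rates[c.name]) for c in components)


# Hypothetical values, for illustration only.
units = [Component("retire", 4.0, 1.0), Component("l1_cache", 8.0, 0.5)]
rates = {"retire": 0.25, "l1_cache": 0.5}
print(total_power(units, rates))  # 3.0
```

In a counter-based setup the access rates would come from performance-counter readings per sampling interval rather than from constants.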
7 Counter-Based Power Estimation: General Implementation
Power Model
Multimeter
8 Specific Example: Intel Pentium 4
Define P4 Components
Define P4 Events
P4 Perf. Cntr Reader
Power Modeling
Real Power Measurement
9 Complete Example: Retirement Logic
- Initial MaxPower: area-based estimate
- MaxPower = Area × Max Processor Power = 4.7 W
- Estimates after stressmark tuning:
- MaxPower = 1.5 W, ClkPower = 2 W
10 Questions We Answer
- To what extent can CPU performance counters act as proxies to estimate CPU power?
- Can counter-based power estimations offer useful accuracy compared to multimeter measurements?
- Validation setup
- Stressmark measurements
- SPEC measurements
- Component-wise validation
- What are some interesting uses of this measurement approach?
11 Validation Measurement Setup
POWER SERVER
- Counter-based access rates sent over Ethernet
POWER CLIENT
- Convert voltage to measured power
- Convert access rates to modeled powers
- Sync together in time window
13 Counter-Based Power Estimation: Validation Step 1
Branch exercise (Taken rate 1)
High-Low
L1Dcache (Hit Rate 0.1)
Fast
Using die area proportions for max power
14 Counter-Based Power Estimation: Validation Step 2
Adjusting max power coefficients on stressmarks
16 Validation for Accuracy: SPEC Results
17 Average SPEC Total Powers
- 1st set: overall power; 2nd set: non-idle power
- Average difference between measurement and estimation: 3 W
- Worst case: Equake (5.8 W)
19 Per-Component Validation
- What we would like to do:
- Compare against a gold-standard simulator, or
- Compare against detailed published per-component measured data
- What we can do:
- Validate using per-component stressmarks
- Sanity-check trends in real benchmarks
20 Validation for Fidelity: Benchmark Power Breakdowns
22 Counter-Based Power Estimation: Uses and Big Picture
Real Power Measurement
Performance Counters
Per-Component Power Estimation
Thermal Modeling
Power Phases
23 Conclusions
- Contributions:
- Performance-counter-based runtime power model and runtime verification with synchronous real power measurement, for arbitrarily long timescales!
- Physical-component-based power estimates for the processor, which can be used in power phase analyses and thermal modeling
- Outcome:
- We can perform reasonably accurate runtime power estimates without inducing any significant overhead on the power profile
25 Live CPU Performance Monitoring with Hardware Counters
[Sprunt, 2002]
- Most CPUs have hardware performance counters
- Intel P4 performance monitoring hardware:
- 18 event counters to count hundreds of possible events
- How to count? 18 Counter Config Control Registers
- What to count? 45 Event Selection Control Registers
- Plus additional control registers
26 Our Event Counter Performance Reader
- Performance Reader implemented as a Linux Loadable Kernel Module
- Implements 6 syscalls:
- select_events()
- reset_event_counter()
- start_event_counter()
- stop_event_counter()
- get_event_counts()
- set_replay_MSRs()
- User-level interface:
- Defines the events → Starts counters → Stops counters → Reads counters and TSC
27 Multimeter Measurements: Stressmarks
28 Component Breakdowns
[Figure: component breakdowns for branch_exercise, with colors for 4 CPU subsystems; surviving legend labels: Execution, Issue - Retire]
29 SPEC2000 Results Example
- Initialization and computation phases
- FP intensive mesh computation phase
- Initialization with high complex IA32 instructions
30 Desktop Applications
- We aim to track low power utilizations as well
- Desktop applications are usually low power with intermittent power bursts
- 3 applications, with common operations such as open/close application, web, streaming media, text editing, save to disk, and statistical computations
31 Related Work
- Implementing counter readers:
- PCL [Berrendorf 1998], Intel VTune, Brink & Abyss [Sprunt 2002]
- Using counters for power:
- CASTLE [Joseph 2001], power profilers
- Event-driven OS / cruise control [Bellosa 2000, 2002]
- Real power measurement:
- Compiler optimizations [Seng 2003]
- Cycle-accurate measurement with switch caps [Chang 2002]
- Power management and modeling support:
- Instruction-level energy [Tiwari 1994]
- PowerScope: procedure-level energy [Flinn 1999]
- Event-counter-driven energy coprocessor [Haid 2003]
- Virtual energy counters for memory [Kadayif 2001]
- ECOsystem: OS energy accounting [Ellis 2002]
32 Our Work in Comparison
- Power estimation for a complex, aggressively clock-gated processor
- Component power estimates with physical binding to die layout
- Laying the groundwork for thermal modeling
- Portable implementation with current probe and power server LKM
- Power-oriented phase analysis with acquired power vectors
34 EXTRA SLIDES
- Counter Access Heuristics
- The complete set of event metrics for access rates
- Power Numbers
- Area ratios, area-based and final max powers, clocks
- Description of Phase Analysis
- A run-through with similarity analysis metrics
- Grouping matrices, thresholding algorithm
- Final reconstructed powers
- Current & Future Research
- The Real Big Picture
- Phase Analysis
- Thermal Modeling
- Questions & Rebuttals
- Verification?
- Events wish list
- Also provides some answers to reviewers' questions
35 Presentation Part 2
- If not in any of these slides, it is probably in Presentation Part II
36 Counter Access Heuristics
- 1) BUS CONTROL
- No 3rd-level cache → BSQ allocations IOQ allocations
- Metric 1: Bus accesses from all agents
- Event: IOQ_allocation
- Counts various types of bus transactions
- Should account for BSQ as well
- Access based rather than duration based
- Mask:
- Default req. type, all read (128B) and write (64B) types, include OWN, OTHER and PREFETCH
- Metric 2: Bus utilization (the % of time the bus is utilized)
- Event: FSB_data_activity
- Counts DataReaDY and DataBuSY events on the bus
- Mask:
- Count when processor or other agents drive/read/reserve the bus
- Expression: FSB_data_activity × BusRatio / Clocks Elapsed
- To account for clock ratios
- Final access rate
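The bus-utilization expression on this slide (FSB_data_activity × BusRatio / Clocks Elapsed) can be sketched in Python as follows. The function name, argument names, and the sample counts are hypothetical; in particular the bus ratio of 4 is only an illustrative CPU-to-bus clock ratio, not a value stated in the slides.

```python
def bus_utilization(fsb_data_activity: int, bus_ratio: float,
                    clocks_elapsed: int) -> float:
    """Fraction of CPU cycles during which the bus was active.

    fsb_data_activity is counted in bus clocks, so it is scaled by the
    CPU-to-bus clock ratio before dividing by elapsed CPU clocks.
    """
    if clocks_elapsed <= 0:
        raise ValueError("clocks_elapsed must be positive")
    return fsb_data_activity * bus_ratio / clocks_elapsed


# Hypothetical counts for one sampling interval.
print(bus_utilization(250_000, 4, 2_000_000))  # 0.5
```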
37 Counter Access Heuristics
- 2) L2 Cache
- Metric: 2nd-level cache references
- Event: BSQ_cache_reference
- Counts cache references as seen by the bus unit
- Mask:
- All MESI reads (LD and RFO) and 2nd-level WR misses
- Final expression
- 3) 2nd-Level BPU
- Metric 1: Instructions fetched from L2 (predict)
- Event: ITLB_Reference
- Counts ITLB translations
- Mask:
- All hits,
- Expression: 8 × ITLB_Reference
- Min. 8 instructions per 128B L2 line
- Metric 2: Branches retired (history update)
- Event: branch_retired
38 (Max IA32 Instruction Length for 2nd-Level BPU)
- What this boils down to is that the "official" maximum instruction length is:
- 4 prefix
- 2 opcode
- 1 modrm
- 1 sib
- 4 displacement
- 4 immediate
- ------------------------
- 16 bytes
- ...though one may want to assume a 24-byte
instruction length to avoid possible buffer
overruns when disassembling instructions with
more than 4 prefix bytes.
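The byte budget above, and the "min. 8 instructions per 128B L2 line" figure used for the 2nd-level BPU metric, can be checked with a few lines of Python. The dictionary and constant names are chosen here for illustration only.

```python
# Field sizes (bytes) of the "official" maximum-length IA32 instruction.
FIELD_BYTES = {
    "prefix": 4,
    "opcode": 2,
    "modrm": 1,
    "sib": 1,
    "displacement": 4,
    "immediate": 4,
}

max_instr_bytes = sum(FIELD_BYTES.values())

# A 128-byte L2 line therefore holds at least this many instructions.
L2_LINE_BYTES = 128
min_instrs_per_line = L2_LINE_BYTES // max_instr_bytes

print(max_instr_bytes, min_instrs_per_line)  # 16 8
```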
39 Counter Access Heuristics
- 4) ITLB & I-Fetch
- Metric 1: ITLB translations performed
- Event: ITLB_Reference
- Counts ITLB translations
- Mask:
- All hits, misses
- Metric 2: Instruction fetch requests by the front-end BPU
- Event: BPU_fetch_requests
- Counts I-fetch requests from the BPU
- Mask:
- TC lookup misses <ALL that's available for now>
- Final expression
40 Counter Access Heuristics
- 5) L1 Cache
- Metric 1: Loads and stores retired
- Event: Front End Event
- Counts tagged uops that retired
- Mask:
- Count also speculatives (BOGUS)
- Supporting event: Uop type
- Tags load and store instructions
- Mask:
- Tag both loads and stores
- Metric 2: Replays (for data speculation)
- Events:
- 1) LD port replay
- Counts replayed events at the load port
- Mask:
- Split LD <ALL that is available for now>
- 2) ST port replay (same as Memory Complete, mask SSC)
- Counts replayed events at the store port
- Mask
41 Counter Access Heuristics
- 6) MOB
- Metric 1: LD replays triggered by MOB
- Event: MOB load replay
- Counts the load operations replayed by MOB
- Mask:
- All replays due to unknown address/data, partial data match, misaligned addresses
- No metric for MOB accesses!
- Final expression
42 Counter Access Heuristics
- 7) Memory Control
- Metric 1: Non-idle cycles
- Event: Machine Clear
- Counts cycles when the pipeline is flushed
- Mask:
- Machine clears due to any cause
- Expression: TSC count - Machine Clear cycles
- Final expression
43 Counter Access Heuristics
- 8) DTLB
- Metric 1: Accesses to either L1 or MOB
- Expression: L1 accesses + MOB accesses
- Final expression
44 Counter Access Heuristics
- 10) FP Execution
- Metric: FP instructions executed
- Event 1: packed_SP_uop
- Counts packed single-precision uops
- Event 2: packed_DP_uop
- Counts packed double-precision uops
- Event 3: scalar_SP_uop
- Counts scalar single-precision uops
- Event 4: scalar_DP_uop
- Counts scalar double-precision uops
- Event 5: 64bit_MMX_uop
- Counts MMX uops with 64-bit SIMD operands
- Event 6: 128bit_MMX_uop
- Counts integer SSE2 uops with 128-bit SIMD operands
- Event 7: x87_FP_UOP
- Counts x87 FP uops
- Masks 1-7: Count ALL <only available option>
- Event 8: x87_SIMD_moves_uop
- Counts x87, FP, MMX, SSE, SSE2 ld/st/mov uops
45 Counter Access Heuristics
- 10) FP Execution
- Final Expression
46 Counter Access Heuristics
- 9) Integer Execution
- Metric 1: Integer uops executed
- Event: <No associated event>
- Substitute metric: Total speculative uops executed
- Event: Uop queue writes
- Number of uops written to the uop queue in front of the TC
- Mask:
- All uops from TC, Decoder and Microcode ROM
- Expression: Uop rate - FP uop rate
- Postfix for simple vs. complex ALU operations
- Final expression
- (Actually, I rescale FP uop rates, as packed, SIMD and MMX uops do multiple concurrent FP operations)
47 Counter Access Heuristics
- 11) Integer Regfile
- Metric: Integer uops executed
- No direct metric for total physical regfile accesses
- Final expression
- 12) FP Regfile
- Metric: FP uops executed
- No direct metric for total physical regfile accesses
- Final expression
48 Counter Access Heuristics
- 13) Instruction Decode
- Metric 1: Cycles spent in trace building
- Event: TC Deliver Mode
- Counts the cycles the processor spends in the specified mode
- Mask:
- Logical processor 0 in build mode
- Final expression
49 Counter Access Heuristics
- 14, 15) Trace Cache
- Metric: Uop queue writes from either mode
- Event: Uop queue writes
- Counts the number of uops written to the uop queue in front of the TC
- Mask:
- All uops from TC, Decoder and ROM
- Final expression
50 Counter Access Heuristics
- 16) 1st-Level BPU
- Metric 1: Branches retired
- Event: branch_retired
- Counts branches retired
- Mask:
- Count all Taken/NT/Predicted/MissP
- Final expression
51 Counter Access Heuristics
- 17) Microcode ROM
- Metric 1: Uops originating from ROM
- Event: Uop queue writes
- Counts the number of uops written to the uop queue in front of the TC
- Mask:
- Uops only from microcode ROM
- Final expression
52 Counter Access Heuristics
- 18) Allocation, 19) Rename, 20) Instruction queue 1, 21) Schedule, 22) Instruction queue 2
- Metric: Uops that started their flight
- Event: Uop queue writes
- Counts the number of uops written to the uop queue
- Mask:
- All uops from TC, Decoder and ROM
- Final expression
53 Counter Access Heuristics
- 23) Retirement Logic
- Metric: Uops that arrive at retirement
- Event: Uops retired
- Counts the number of uops retired in a cycle
- Mask:
- Consider also speculative
- Final expression
54 Evolution of Power Numbers
Unit      | Area (%) | Area-Based Max Power Estimate (W) | Max Power after Tuning (W) | Conditional Clk Power (W)
L2 BPU    | 3.4      | 2.5                               | 15.5                       | -
L1 BPU    | 4.9      | 3.5                               | 10.5                       |
Tr. Cache | 8.6      | 6.2                               | 4.0                        | 2.0
L1 Cache  | 5.8      | 4.2                               | 12.4 (/2)                  |
L2 Cache  | 14.7     | 10.6                              | 300.6 (/7)                 |
Int EXE   | 2.0      | 1.4                               | 3.4                        |
FP EXE    | 6.2      | 4.5                               | 4.5                        |
Rename    | 2.3      | 1.7                               | 0.4                        | 1.5
Retire    | 6.5      | 4.7 (/3)                          | 0.5                        | 2.0
55 Our Power Phase Analysis
- Goal:
- Identify phases in program power behavior
- Determine execution points that correspond to these phases
- Define a small set of power signatures that represent overall power behavior
Description of Phase Analysis
56 Our Approach
- Our approach, in outline:
- Collect samples of estimated power values for processor sub-units <Power Vectors> at application runtime
- Define a power vector similarity metric
- Group sampled program execution into phases
- Determine execution points and representative signature vectors for each phase group
- Analyze the accuracy of our approximation
57 Power Vector Similarity Metric
- How to quantify the power behavior dissimilarity between two execution points?
- 1) Consider solely the total power difference?
- 2) Consider the Manhattan distance between the corresponding 2 vectors?
- 3) Consider the Manhattan distance between the corresponding 2 vectors, normalized?
- 4) Consider a combination of (2) and (3)?
- Construct a similarity matrix to represent similarity among all pairs of execution points
- Each entry in the similarity matrix
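A minimal Python sketch of the two Manhattan-distance options above, building one matrix over the raw power vectors and one over normalized vectors. Normalizing each vector by its own total power is an assumption made here for illustration; the slides do not spell out the normalization or how the two metrics are combined.

```python
def manhattan(a, b):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def normalize(v):
    """Scale a power vector so its entries sum to 1 (assumed normalization)."""
    total = sum(v)
    return [x / total for x in v] if total else list(v)


def similarity_matrix(vectors, normalized=False):
    """Pairwise distance matrix over all execution points."""
    vs = [normalize(v) for v in vectors] if normalized else vectors
    n = len(vs)
    return [[manhattan(vs[i], vs[j]) for j in range(n)] for i in range(n)]


# Hypothetical two-component power vectors (W) at three execution points.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 1.0]]
abs_m = similarity_matrix(samples)
norm_m = similarity_matrix(samples, normalized=True)
```

In this toy input, points 0 and 1 differ in total power but have the same power distribution, so their normalized distance is zero while their absolute distance is not. That is the kind of case where using both metrics matters.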
58 Similarity Based on Both Absolute and Normalized Power Vectors
59 Grouping Execution Points
- Thresholding algorithm:
- Define a threshold of similarity <% of max dissimilarity>
- Start from the first execution point (0,0) and identify the ones in the forward execution path that lie within the threshold for both normalized and absolute metrics
- Tag the corresponding execution points (j,j) as the same group
- Find the next untagged execution point (r,r) and do the same along the forward path
- Rule: a tagged execution point cannot add new elements to its group!
- We demonstrate the outcome of thresholding with grouping matrices
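The thresholding steps above can be sketched in Python as follows. This is one reading of the slide, under stated assumptions: the threshold is taken as a fraction of the maximum entry of each matrix, "within threshold" is implemented as `<=`, and the function name `group_by_threshold` is hypothetical.

```python
def group_by_threshold(abs_m, norm_m, fraction):
    """Assign each execution point a group id by forward thresholding.

    abs_m / norm_m: square distance matrices (absolute and normalized).
    fraction: threshold as a fraction of each matrix's max dissimilarity.
    """
    n = len(abs_m)
    abs_thr = fraction * max(max(row) for row in abs_m)
    norm_thr = fraction * max(max(row) for row in norm_m)
    groups = [None] * n
    next_group = 0
    for i in range(n):
        if groups[i] is not None:
            continue  # rule: an already tagged point cannot add members
        groups[i] = next_group
        for j in range(i + 1, n):  # only look along the forward path
            if (groups[j] is None
                    and abs_m[i][j] <= abs_thr
                    and norm_m[i][j] <= norm_thr):
                groups[j] = next_group
        next_group += 1
    return groups


# The 4x4 example matrix from the similarity-matrix slides, used for both
# metrics here purely for illustration.
m = [[0, 6, 3, 7], [6, 0, 3, 6], [3, 3, 0, 7], [7, 6, 7, 0]]
print(group_by_threshold(m, m, 0.5))  # [0, 1, 0, 2]
```

Raising the fraction merges more points into fewer groups, which matches the trend shown by the grouping matrices.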
60 Gzip Grouping Matrices
[Figure: original similarity matrix, followed by grouping matrices with # of groups = 909, 254, 33, 3 and 1]
- Gzip has 974 power vectors
- Cluster vectors based on similarity using thresholding
- Max Gzip power dissimilarity: 47.35 W
61 Representative Vectors & Execution Points
- We have each execution point assigned to a group
- We can represent the whole execution with as many power vectors as the number of generated groups
- Option 1, representative vectors:
- For each group: define a representative vector as the average of all instances of that group
- For each execution point: assign the corresponding group's representative vector as that point's power vector
- Option 2, selected execution points:
- For each group: select the execution point that started the group (the earliest point in each group)
- For each execution point: assign the power vector of the selected execution point for that group as that point's power vector
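Both reconstruction options on this slide (group average versus the earliest execution point of each group) can be sketched in Python. The function name `reconstruct`, the `use_average` flag, and the sample data are illustrative assumptions.

```python
def reconstruct(vectors, groups, use_average=True):
    """Rebuild a power trace from one representative vector per group.

    use_average=True  -> representative = mean of all vectors in the group
    use_average=False -> representative = vector of the group's earliest point
    """
    members = {}
    for idx, g in enumerate(groups):
        members.setdefault(g, []).append(idx)  # indices stay in time order

    reps = {}
    for g, idxs in members.items():
        if use_average:
            dim = len(vectors[idxs[0]])
            reps[g] = [sum(vectors[i][d] for i in idxs) / len(idxs)
                       for d in range(dim)]
        else:
            reps[g] = list(vectors[idxs[0]])
    return [reps[g] for g in groups]


# Hypothetical power vectors (W) and group tags for three execution points.
trace = [[1.0, 3.0], [10.0, 10.0], [3.0, 5.0]]
tags = [0, 1, 0]
print(reconstruct(trace, tags))         # [[2.0, 4.0], [10.0, 10.0], [2.0, 4.0]]
print(reconstruct(trace, tags, False))  # [[1.0, 3.0], [10.0, 10.0], [1.0, 3.0]]
```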
62 Reconstructing Power Trace with Representative Vectors
63 Reconstructing Power Trace with Selected Execution Points
64 Similarity Matrix Example
- Consider 4 vectors (0-3), each with 4 dimensions
- Log all distances in the similarity matrix
- Color-scale from black to white (only for the upper diagonal)
Similarity matrix (rows and columns are vectors 0-3):
     0  1  2  3
0    0  6  3  7
1    6  0  3  6
2    3  3  0  7
3    7  6  7  0
[Figure: Similarity Matrix Plot]
66 Interpreting Similarity Matrix Plot
[Figure: Similarity Matrix Plot for vectors 0-3]
67 Grouping Matrix Example
- Mark execution pairs whose distance is within the threshold
[Figure: Grouping Matrix Plots for the 4×4 example similarity matrix (vectors 0-3), at Threshold 10 and Threshold 50; thresholds are defined as a % of max dissimilarity]
68 Current & Future Research
- The following slides discuss ongoing research related to power estimation and phases. Plans for future research are also discussed.
69 0) THE REAL BIG PICTURE
Bottom line:
- To estimate component power and temperature breakdowns for P4 at runtime
- To analyze how power phase behavior relates to program structure
70 1) Phase Branch
- Power Phase Behavior
- Similarity Based on Power Vectors
- Identifying similar program regions
- Profiling Execution Flow
- Sampling process execution
- PCsampler LKM
- Program Structure
- Execution vs. Code space
- Power Phases ? Exec. Phases
- NOT YET?
71 POWER PHASE BEHAVIOR
72 Identifying Power Phases
- Most of the methodology is ready
- Complete Gzip case in [Isci & Martonosi, WWC-6]
- Extensibility to other benchmarks:
- Generated similarity metrics for several
- Performed phase identification with thresholding for all
- Repeatability of the experiment
- Several other possible ideas, such as:
- Thresholding k-means clustering
- Two-pass thresholding
- PCA for dimension reduction (or SVD?)
- Manhattan (L1 norm)
- Euclidean (L2): not interesting
- Chebyshev (L∞): ??
74 Program Execution Profile
- Sample program flow simultaneously with power
- Our LKM implementation: PCsampler
- Not finished
- Generate code space similarity in parallel with power space similarity
- Relative comparisons of methods for:
- Complexity
- Accuracy
- Applicability, etc.
75 CURRENT STATE
- Sample PC → Binding to functions
- Reacquire PID
- Those SPECs,
- Runspec always in fixed address at ELF_program interpreter
- Benches change PID between datasets
- Verify PC with objdump
- So we can make sure it is the PC we're sampling
76 Initial Data: PC Trace for gzip-source
Correspond to <send_bits>, <bit_reverse>, <lm_init>, <longest_match>, <fill_window> functions
77 2) Thermal Modeling
- Related Work
- Performance Monitoring
- P4 Performance Counters
- Performance Reader LKM
- Real Power Measurement
- P4 Power Measurement Setup
- Examples
- Power Modeling
- P4 Power Model
- Model & Measurement Sync Setup, Verification
- Thermal Modeling
- Refined Thermal Model
- Ex: PPro Thermal Model
78 THERMAL MODELING: A Basic Model
- Based on lumped R-C model from packaging
- Built upon power modeling:
- Sampled component powers
- Respective component areas
- Physical processor parameters:
- Packaging
- Heat transfer
Δt: sampling interval; T_i: the temperature difference between the block and the heatsink
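As a hedged illustration of a lumped R-C thermal update, the sketch below uses the textbook first-order form C·dT/dt = P − T/R, discretized with a forward-Euler step of Δt. The slide's own difference equation did not survive extraction, so this is not necessarily the authors' formulation; the resistance, capacitance, and power values are hypothetical.

```python
def thermal_step(t_diff, power_w, r_th, c_th, dt):
    """One forward-Euler step of C * dT/dt = P - T / R.

    t_diff: current temperature difference between block and heatsink (K)
    power_w: sampled component power for this interval (W)
    r_th, c_th: assumed thermal resistance (K/W) and capacitance (J/K)
    dt: sampling interval (s); must be well below r_th * c_th for stability
    """
    return t_diff + dt * (power_w / c_th - t_diff / (r_th * c_th))


def simulate(powers, r_th, c_th, dt, t0=0.0):
    """Apply thermal_step over a sequence of sampled component powers."""
    trace, t = [], t0
    for p in powers:
        t = thermal_step(t, p, r_th, c_th, dt)
        trace.append(t)
    return trace


# Hypothetical block: constant 10 W, R = 2 K/W, C = 0.5 J/K, dt = 20 ms.
trace = simulate([10.0] * 1000, r_th=2.0, c_th=0.5, dt=0.02)
```

With constant power the temperature difference settles toward P·R (20 K for these values), which gives a quick sanity check on the update.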
79 Refined Thermal Model
- Steady-state analysis reveals that the heatsink-die abstraction is not sufficient for real systems
- Proceeding to a multilayer thermal model:
- Active die thickness
- Metalization/insulation
- Chip-package interface
- Package
- Heatsink
- Requires searching for several materials/dimensions and thermal properties
- Multiple layers → Multiple T nodes → Multiple DEs
- Baseline heat removal structure
80 Physical Structure vs. Thermal Model
Ambient Temperature
Ambient Airflow
Heatsink
Thermal Grease
Heat Spreader
Package
Die
81 Analytical Derivation
82 Ex: PPro Thermal Model
- Use CASTLE [Joseph, 2001] computed component powers
- Determine component areas from die photo
- Determine processor/packaging physical parameters
- Generate numerical thermal model
- Apply component difference equations recursively along power flow
83 Simulation Outputs
- Thermal nodes updated every Δt = 20 ms
- Component temperatures build up to 350 K in 5 hrs
- T_heatsink moves very slowly, as expected
84 Verification of Results (1/3)
- Full validation requires comparing component power estimations to real measured component powers for all demonstrated full execution traces
- No such published data available
- Is it possible to acquire such data?
- How to probe intra-chip components?
- Closest would be HSPICE simulations
- Probably infeasible to acquire traces of minutes
- No P4 power simulator
- 1 benchmark would take 1 CPU-month with current power simulation speeds
Questions & Rebuttals
85 Verification of Results (2/3)
- We use proxies:
- 1) Verifying against total measured power at runtime
- Provides us with an immediate comparison
- Most tested benchmarks show close approximations
- 2) Behavior of component powers for simpler benchmarks
- We show that power trends follow expectations under different corners of execution
86 Verification of Results (3/3)
- Why is an RMS error measure between measurement and estimate for total power not included?
- As the two sources of data are completely independent, i.e. the measurement branch coming from the serial interface and the modeling part from Ethernet, they are not perfectly synchronized. There are spikes at power jumps
- Removing those by hand would call fidelity into question
Art RMS error (overall): 4.38 W; Art RMS error (100-1000 s): 4.21 W
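For reference, an RMS error between a measured and an estimated power trace can be computed as in the sketch below. It assumes the two traces are already aligned sample by sample, which is exactly the synchronization caveat raised on this slide; the function name and sample values are hypothetical.

```python
import math


def rms_error(measured, estimated):
    """Root-mean-square error between two equal-length power traces (W)."""
    if len(measured) != len(estimated) or not measured:
        raise ValueError("traces must be non-empty and of equal length")
    squared = [(m - e) ** 2 for m, e in zip(measured, estimated)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical aligned samples (W).
print(rms_error([40.0, 42.0, 55.0], [41.0, 42.0, 51.0]))
```

A single misaligned power jump contributes a large squared term, which is why unsynchronized spikes can dominate this measure.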
87 Possibility for Other Processors?
- Most recent processors are keen on power management
- There will be enough power variability to exploit for power modeling and phase analysis
- Porting the power estimation to other architectures:
- Requires significant effort to
- Define power-related metrics
- Implement counter reader and power estimation user and kernel SW
- Porting to the same architecture, different implementation:
- More straightforward
- Reevaluate max/idle/gated power estimates
- Experiences with other architectures:
- CASTLE project for Pentium Pro (P6)
- Few watts of variation
- Low dimensionality
- IBM Power3 II
- Very low measured power variation
88 Event Counter Wish List (1/2)
- Problems from experience:
- Memory-related metrics
- L2 metrics are complicated
- Do not correspond to L2 hits/misses (see optimization reference manual)
- Granularity issues
- MOB accesses metric?
- Memory Control?
- Integer and FP accesses
- FPE has 8 separate events (with 2 dedicated ESCRs)
- Need at least 4 rotations to collect
- INTE has no direct measure
- Cannot differentiate multiply, shifts, logic, arithmetic
- Out-of-order engine
- Cannot differentiate between Allocate, Rename, Instruction Queues, Schedule
89 Event Counter Wish List (2/2)
- Suggestions for the future:
- Ultimately:
- Specific counters that represent component-wise utilizations
- Switching bookkeepers for singly ended lines
- Specific to P4 Xeon generation:
- Metrics directly related to memory components
- Single aggregate metrics for FP and Int execution
- Metrics that explore out-of-order engine components
90 Applications of our Technique
- Already discussed:
- Power phase analysis
- Thermal modeling
- In addition:
- OS-based energy accounting
- e.g. ECOSystem [Duke, C. Ellis et al.]
- Fine-grained CPU power (rather than ON/OFF)
- Detailed SW energy mappings
- e.g. PowerScope [CMU, Flinn & Satyanarayanan]
- Dynamic power management
- e.g. Process Cruise Control [Weissel & Bellosa]
- Event-driven / power-model-driven clock scaling
91 Statistical Methods (1/2)
- Several optimization/curve-fitting tools:
- JMP, R, SPSS, S, Stata
- Regression
- Used by some previous examples
- 2^k factorial design
- k factors, each at 2 levels (Hi and Lo)
- Run the design at each of the 2^k corners
- Analyze the interactions and individual factors' effects
- Turn factors ON/OFF
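To make the 2^k factorial design concrete, the following Python sketch enumerates every Hi/Lo corner for a list of factors. The factor names are hypothetical placeholders, and the sketch only generates the design points; it does not analyze effects or interactions.

```python
from itertools import product


def factorial_corners(factors, levels=("Lo", "Hi")):
    """Return all 2^k level assignments for k two-level factors."""
    return [dict(zip(factors, combo))
            for combo in product(levels, repeat=len(factors))]


# Hypothetical factors; k = 3 gives 2^3 = 8 runs.
corners = factorial_corners(["l1_cache", "fp_exe", "bus"])
print(len(corners))  # 8
```

Each returned dictionary is one experimental run, so the number of runs doubles with every added factor.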
92 Statistical Methods (2/2)
- We believe we can produce a much closer match to measured total power with curve-fitting techniques and PCA (to define a new set of orthogonal activity factors)
- Topic for future research
- We avoid this to preserve the binding of activity factors to physical component powers, as required for thermal analysis
- Note on 2^k factorial design:
- Our factors are:
- Access rates OR component power estimations
- Our response variable:
- Measured total power
- However:
- We cannot independently turn ON/OFF the accesses to individual components