Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data

About This Presentation
Title:

Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data

Description:

Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data. Canturk Isci & Margaret Martonosi, Princeton University, MICRO-36. Motivation: Power is ...


Transcript and Presenter's Notes

Title: Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data


1
Runtime Power Monitoring in High-End Processors:
Methodology and Empirical Data
  • Canturk Isci & Margaret Martonosi
  • Princeton University

MICRO-36
2
Motivation
  • Power is important!
  • Measurement/Modeling techniques help guide design
    of power-aware and temperature-aware systems
  • Real-system measurements
  • Help observe long time periods
  • Help guide on-the-fly adaptive management
  • Our work: live, run-time, per-component
    power/thermal measures

3
Simulation vs. Multimeters
  • Multimeter: (+) Fast, (+) Fairly accurate, (+/-) Existing
    systems, (-) No on-chip detail
  • Simulation: (+) Arbitrary detail, (+) Common base, (-) Slow,
    (-) Possibly inaccurate
  • Counter-Based Power Estimation: (+) Fast (real-time),
    (+) Offers estimated view of on-chip detail
  • But: 1) Are the right counters available? 2) How accurate
    is it?
4
Questions We Answer
  • To what extent can CPU performance counters act
    as proxies to estimate CPU power?
  • Can counter-based power estimations offer useful
    accuracy compared to multimeter measurements?
  • What are some interesting uses of this
    measurement approach?

5
Power Simulation Overview
  • Idealized view: for all components in a processor
    chip

$\text{Power}_i = \text{MaxPower}_i \times \text{ArchScaling}_i \times \text{AccessRate}_i$
Die area and Capacitance estimate
Cycle-level simulator gathers event counts,
activity factors and adjusts scaling
6
Counter-Based Power Estimation: An Overview of
Our Approach
  • Idealized view: for all components in a processor
    chip

$\text{Power}_i = \text{MaxPower}_i \times \text{ArchScaling}_i \times \text{AccessRate}_i$
Die area Stressmarks
CPU Performance Counters!
From microarch. properties
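The idealized per-component model on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the component names, the coefficient values, and the helpers `component_power` and `total_power` are hypothetical, and access rates are assumed to be normalized to the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    max_power_w: float   # MaxPower_i in watts (hypothetical value)
    arch_scaling: float  # ArchScaling_i, dimensionless (hypothetical value)


def component_power(c: Component, access_rate: float) -> float:
    """Power_i = MaxPower_i * ArchScaling_i * AccessRate_i."""
    if not 0.0 <= access_rate <= 1.0:
        raise ValueError("access rate is assumed to be normalized to [0, 1]")
    return c.max_power_w * c.arch_scaling * access_rate


def total_power(components, access_rates) -> float:
    """Sum the per-component estimates for one sampling interval."""
    return sum(component_power(c, access_rates[c.name]) for c in components)


# Two made-up components and one made-up sample of access rates.
components = [Component("unit_a", 4.0, 1.0), Component("unit_b", 10.0, 0.5)]
rates = {"unit_a": 0.5, "unit_b": 0.2}
estimate = total_power(components, rates)  # 4*1*0.5 + 10*0.5*0.2
```

In the counter-based approach, the access rates would come from performance counter readings rather than from a simulator.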
7
Counter-Based Power Estimation: General
Implementation
PowerModel
Multimeter
8
Specific Example: Intel Pentium 4
Define P4 Components
Define P4 Events
P4 Perf. Cntr Reader
Power Modeling
Real Power Measurement
9
Complete Example: Retirement Logic
  • Initial MaxPower: area-based estimate
  • MaxPower = Area % × Max Processor Power = 4.7 W
  • Estimates after stressmark tuning
  • MaxPower = 1.5 W, ClkPower = 2 W

10
Questions We Answer
  • To what extent can CPU performance counters act
    as proxies to estimate CPU power?
  • Can counter-based power estimations offer useful
    accuracy compared to multimeter measurements?
  • Validation Setup
  • Stressmark measurements
  • SPEC measurements
  • Component-wise validation
  • What are some interesting uses of this
    measurement approach?

11
Validation Measurement Setup
POWER SERVER
Counter based access rates over ethernet
POWER CLIENT
Convert voltage to measured power Convert access
rates to modeled powers Sync together in time
window
12
Questions We Answer
  • To what extent can CPU performance counters act
    as proxies to estimate CPU power?
  • Can counter-based power estimations offer useful
    accuracy compared to multimeter measurements?
  • Validation Setup
  • Stressmark measurements
  • SPEC measurements
  • Component-wise validation
  • What are some interesting uses of this
    measurement approach?

13
Counter-based Power Estimation: Validation Step 1
Branch exercise (Taken rate 1)
High-Low
L1Dcache (Hit Rate 0.1)
Fast
Using die area proportions for max power
14
Counter-based Power Estimation: Validation Step 2
Adjusting max power coefficients on stressmarks
15
Questions We Answer
  • To what extent can CPU performance counters act
    as proxies to estimate CPU power?
  • Can counter-based power estimations offer useful
    accuracy compared to multimeter measurements?
  • Validation Setup
  • Stressmark measurements
  • SPEC measurements
  • Component-wise validation
  • What are some interesting uses of this
    measurement approach?

16
Validation for Accuracy: SPEC Results
17
Average SPEC Total Powers
  • 1st set: overall power; 2nd set: non-idle power
  • Average difference between measurement and
    estimation: 3 W
  • Worst case: Equake (5.8 W)

18
Questions We Answer
  • To what extent can CPU performance counters act
    as proxies to estimate CPU power?
  • Can counter-based power estimations offer useful
    accuracy compared to multimeter measurements?
  • Validation Setup
  • Stressmark measurements
  • SPEC measurements
  • Component-wise validation
  • What are some interesting uses of this
    measurement approach?

19
Per-Component Validation
  • What we would like to do
  • Compare against gold-standard simulator, or
  • Compare against detailed published per-component
    measured data
  • What we can do
  • Validate using per-component stressmarks
  • Sanity check on trends in real-benchmarks

20
Validation for Fidelity: Benchmark Power
Breakdowns
21
Questions We Answer
  • To what extent can CPU performance counters act
    as proxies to estimate CPU power?
  • Can counter-based power estimations offer useful
    accuracy compared to multimeter measurements?
  • What are some interesting uses of this
    measurement approach?

22
Counter-Based Power Estimation: Uses and Big
Picture
Real Power Measurement
Performance Counters
Per-Component Power Estimation
Thermal Modeling
Power Phases
23
Conclusions
  • Contributions
  • Performance counter based runtime power model and
    runtime verification with synchronous real power
    measurement for arbitrarily long timescales!
  • Physical component based power estimates for
    processor, which can be used in power phase
    analyses and thermal modeling
  • Outcome
  • We can perform reasonably accurate runtime power
    estimates without inducing any significant
    overhead to power profile

24
  • IF TIME PERMITS

25
Live CPU Performance Monitoring with Hardware
Counters
Sprunt, 2002
  • Most CPUs have hardware performance counters
  • Intel P4 Performance Monitoring HW
  • 18 Event Counters to count hundreds of possible
    events
  • How to Count? 18 Counter Config Control Registers
  • What to Count? 45 Event Selection Control
    Registers
  • Plus additional control registers

26
Our Event Counter Performance Reader
  • Performance Reader implemented as Linux Loadable
    Kernel Module
  • Implements 6 syscalls
  • select_events()
  • reset_event_counter()
  • start_event_counter()
  • stop_event_counter()
  • get_event_counts()
  • set_replay_MSRs()
  • User Level Interface
  • Defines the events → Starts counters
  • Stops counters → Reads counters & TSC

27
Multimeter Measurements Stressmarks
28
Component Breakdowns
Component Breakdowns for branch_exercise Colors
for 4 CPU subsystems
Execution
Issue - Retire
29
SPEC2000 Results Example
  • Equake
  • FP benchmark
  • Initialization and computation phases
  • FP intensive mesh computation phase
  • Initialization with high complex IA32 instructions

30
Desktop Applications
  • We aim to track low power utilizations as well.
  • Desktop applications are usually low power with
    intermittent power bursts
  • 3 applications, with common operations such as
    open/close application, web, streaming media,
    text editing, save to disk, statistical
    computations.

31
Related Work
  • Implementing counter readers
  • PCL [Berrendorf 1998], Intel VTune, Brink & Abyss
    [Sprunt 2002]
  • Using counters for power
  • CASTLE [Joseph 2001], power profilers
  • Event-driven OS/cruise control [Bellosa
    2000, 2002]
  • Real power measurement
  • Compiler optimizations [Seng 2003]
  • Cycle-accurate measurement with switch caps
    [Chang 2002]
  • Power management and modeling support
  • Instruction-level energy [Tiwari 1994]
  • PowerScope: procedure-level energy [Flinn 1999]
  • Event-counter-driven energy coprocessor [Haid
    2003]
  • Virtual energy counters for memory [Kadayif 2001]
  • ECOsystem: OS energy accounting [Ellis 2002]

32
Our Work in Comparison
  • Power estimation for a complex, aggressively
    clock-gated processor
  • Component power estimates with physical binding
    to die layout
  • Laying the groundwork for thermal modeling
  • Portable implementation with current probe and
    power server LKM
  • Power oriented phase analysis with acquired power
    vectors

33
  • EOP

34
EXTRA SLIDES
  • Counter Access Heuristics
  • The complete set of event metrics for access
    rates
  • Power Numbers
  • Area ratios, area-based and final max powers,
    clocks
  • Description of Phase Analysis
  • A run-through with similarity analysis metrics
  • Grouping matrices thresholding algorithm
  • Final reconstructed powers
  • Current & Future Research
  • The Real Big Picture
  • Phase Analysis
  • Thermal Modeling
  • Questions & Rebuttals
  • Verification?
  • Events wish list
  • Also provides some answers to reviewers'
    questions

35
Presentation Part 2
  • IF NOT IN ANY OF THESE SLIDES, IT IS PROBABLY IN
  • PRESENTATION PART II

36
Counter Access Heuristics
  • 1) BUS CONTROL
  • No 3rd-level cache → BSQ allocations correspond to
    IOQ allocations
  • Metric1 Bus accesses from all agents
  • Event IOQ_allocation
  • Counts various types of bus transactions
  • Should account for BSQ as well
  • access based rather than duration
  • MASK
  • Default req. type, all read (128B) and write
    (64B) types, include OWN,OTHER and PREFETCH
  • Metric2 Bus Utilization (the % of time the bus is
    utilized)
  • Event FSB_data_activity
  • Counts DataReaDY and DataBuSY events on Bus
  • Mask
  • Count when processor or other agents
    drive/read/reserve the bus
  • Expression FSB_data_activity x BusRatio
    / Clocks Elapsed
  • To account for clock ratios
  • Final access rate
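The bus utilization expression above (FSB_data_activity × BusRatio / Clocks Elapsed) can be written as a small helper. This is a sketch under assumptions: the function name and the sample numbers are hypothetical, and the event count is assumed to be in bus clocks, so that multiplying by the core-to-bus clock ratio converts it to core clocks.

```python
def bus_utilization(fsb_data_activity: int, bus_ratio: float,
                    clocks_elapsed: int) -> float:
    """FSB_data_activity x BusRatio / Clocks Elapsed.

    bus_ratio is the core-clock to bus-clock ratio; clocks_elapsed is
    the number of core clocks in the sampling interval (for example a
    TSC delta).
    """
    if clocks_elapsed <= 0:
        raise ValueError("clocks_elapsed must be positive")
    return fsb_data_activity * bus_ratio / clocks_elapsed


# Made-up sample: 1000 bus-clock events, ratio 4, 40000 core clocks.
utilization = bus_utilization(1000, 4, 40000)
```

With these made-up numbers the bus is busy for one tenth of the interval.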

37
Counter Access Heuristics
  • 2) L2 Cache
  • Metric 2nd Level cache references
  • Event BSQ_cache_reference
  • Counts cache ref-s as seen by bus unit
  • MASK
  • All MESI reads (LD, RFO) and 2nd level WR misses
  • Final Expression
  • 3) 2nd Level BPU
  • Metric 1 Instructions fetched from L2 (predict)
  • Event ITLB_Reference
  • Counts ITLB translations
  • Mask
  • All hits,
  • Expression 8 × ITLB_Reference
  • Min. 8 instructions per 128B L2 line
  • Metric 2 Branches retired (history update)
  • Event branch_retired

38
(Max IA32 Instruction Length for 2nd Level BPU)
  • What this boils down to is that the "official"
    maximum instruction length is
  • 4 prefix
  • 2 opcode
  • 1 modrm
  • 1 sib
  • 4 displacement
  • 4 immediate
  • ------------------------
  • 16 bytes
  • ...though one may want to assume a 24-byte
    instruction length to avoid possible buffer
    overruns when disassembling instructions with
    more than 4 prefix bytes.
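The arithmetic behind the 16-byte figure, and behind the factor of 8 used in the 2nd Level BPU expression (8 × ITLB_Reference, at least 8 instructions per 128B L2 line), can be checked directly. The variable names below are illustrative only.

```python
# Field sizes in bytes, as listed on the slide.
FIELD_BYTES = {
    "prefix": 4,
    "opcode": 2,
    "modrm": 1,
    "sib": 1,
    "displacement": 4,
    "immediate": 4,
}

max_instruction_bytes = sum(FIELD_BYTES.values())

# A 128-byte L2 line holds at least this many maximum-length instructions.
L2_LINE_BYTES = 128
min_instructions_per_line = L2_LINE_BYTES // max_instruction_bytes
```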

39
Counter Access Heuristics
  • 4) ITLB & I-Fetch
  • Metric 1 ITLB translations performed
  • Event ITLB_Reference
  • Counts ITLB translations
  • Mask
  • All hits, misses
  • Metric 2 Instruction fetch requests by the front
    end BPU
  • Event BPU_fetch_requests
  • Counts Ifetch requests from the BPU
  • Mask
  • TC lookup misses <all that's available for now>
  • Final expression

40
Counter Access Heuristics
  • 5) L1 Cache
  • Metric 1 Loads & Stores retired
  • Event Front End Event
  • Counts tagged uops that retired
  • Mask
  • Count also speculatives (BOGUS)
  • Supporting event Uop type
  • Tags Load and Store instructions
  • Mask
  • Tag both Loads and Stores
  • Metric 2 Replays (For Data Speculation)
  • Events
  • 1) LD port replay
  • Counts replayed events at load port
  • Mask
  • Split LD <all that is available for now>
  • 2) ST port replay (Same as Memory Complete, Mask
    SSC)
  • Counts replayed events at store port
  • Mask

41
Counter Access Heuristics
  • 6) MOB
  • Metric 1 LD Replays triggered by MOB
  • Event MOB load replay
  • Counts the load operations replayed by MOB
  • Mask
  • All replays due to unknown address/data, partial
    data match, misaligned addresses
  • No metric for MOB accesses!
  • Final expression

42
Counter Access Heuristics
  • 7) Memory Control
  • Metric 1 Non-idle cycle
  • Event Machine Clear
  • Counts cycles when the pipeline is flushed
  • Mask
  • Machine clears due to any cause
  • Expression TSC count − Machine Clear Cycles
  • Final Expression

43
Counter Access Heuristics
  • 8) DTLB
  • Metric 1 Accesses to either L1 or to MOB
  • Expression L1 Accesses + MOB Accesses
  • Final expression

44
Counter Access Heuristics
  • 10) FP Execution
  • Metric FP instructions executed
  • event1 packed_SP_uop
  • counts packed single precision uops
  • event2 packed_DP_uop
  • counts packed double precision uops
  • event3 scalar_SP_uop
  • counts scalar single precision uops
  • event4 scalar_DP_uop
  • counts scalar double precision uops
  • event5 64bit_MMX_uop
  • counts MMX uops with 64bit SIMD operands
  • event6 128bit_MMX_uop
  • counts integer SSE2 uops with 128bit SIMD
    operands
  • event7 x87_FP_UOP
  • counts x87 FP uops
  • Masks 1-7: Count ALL <only available option>
  • event8 x87_SIMD_moves_uop
  • counts x87, FP, MMX, SSE, SSE2 ld/st/mov uops

45
Counter Access Heuristics
  • 10) FP Execution
  • Final Expression

46
Counter Access Heuristics
  • 9) Integer Execution
  • Metric 1 Integer uops executed
  • Event <no associated event>
  • Substitute Metric Total speculative Uops
    executed
  • Event Uop queue writes
  • Number of uops written to the uop queue in front
    of TC
  • Mask
  • All uops from TC, Decoder and Microcode ROM
  • Expression Uop Rate − FP uop rate
  • Postfix for simple vs complex ALU operations
  • Final expression
  • (Actually I rescale FP uop rates, as packed, SIMD
    and MMX uops do multiple concurrent FP
    operations)

47
Counter Access Heuristics
  • 11) Integer Regfile
  • Metric Integer uops executed
  • No direct metric for total physical regfile
    accesses
  • Final expression
  • 12) FP Regfile
  • Metric FP uops executed
  • No direct metric for total physical regfile
    accesses
  • Final expression

48
Counter Access Heuristics
  • 13) Instruction Decode
  • Metric 1 Cycles spent in trace building
  • Event TC Deliver Mode
  • Counts the cycles processor spends in the
    specified mode
  • Mask
  • Logical processor 0 in build mode
  • Final expression

49
Counter Access Heuristics
  • 14,15) Trace Cache
  • Metric Uop queue writes from either modes
  • Event Uop queue writes
  • Counts Number of uops written to the uop queue in
    front of TC
  • Mask
  • All uops from TC and Decoder and ROM
  • Final expression

50
Counter Access Heuristics
  • 16) 1st Level BPU
  • Metric 1 Branches retired
  • Event branch_retired
  • Counts branches retired
  • Mask
  • Count all Taken/NT/Predicted/MissP
  • Final expression

51
Counter Access Heuristics
  • 17) Microcode ROM
  • Metric 1 Uops originating from ROM
  • Event Uop queue writes
  • Counts Number of uops written to the uop queue in
    front of TC
  • Mask
  • Uops only from microcode ROM
  • Final expression

52
Counter Access Heuristics
  • 18) Allocation 19) Rename 20)
    Instruction queue1 21) Schedule 22)
    Instruction queue2
  • Metric Uops that started their flight
  • Event Uop queue writes
  • Counts Number of uops written to the uop queue
  • Mask
  • All uops from TC and Decoder and ROM
  • Final expression

53
Counter Access Heuristics
  • 23) Retirement Logic
  • Metric Uops that arrive at retirement
  • Event Uops retired
  • Counts number of uops retired in a cycle
  • Mask
  • Consider also speculative
  • Final expression

54
Evolution of Power Numbers
Unit        Area    Area-Based Max Power Estimate    Max Power after Tuning    Conditional Clk Power
L2 BPU      3.4     2.5                              15.5                      -
L1 BPU      4.9     3.5                              10.5
Tr. Cache   8.6     6.2                              4.0                       2.0
L1 Cache    5.8     4.2                              12.4 (/2)
L2 Cache    14.7    10.6                             300.6 (/7)
Int EXE     2.0     1.4                              3.4
FP EXE      6.2     4.5                              4.5
Rename      2.3     1.7                              0.4                       1.5
Retire      6.5     4.7 (/3)                         0.5                       2.0
Power Numbers
55
Our Power Phase Analysis
  • Goal
  • Identify phases in program power behavior
  • Determine execution points that correspond to
    these phases
  • Define small set of power signatures that
    represent overall power behavior

Description of Phase Analysis
56
Our Approach
  • Our Approach Outline
  • Collect samples of estimated power values for
    processor sub-units <Power Vectors> at
    application runtime
  • Define a power vector similarity metric
  • Group sampled program execution into phases
  • Determine execution points and representative
    signature vectors for each phase group
  • Analyze the accuracy of our approximation

57
Power Vector Similarity Metric
  • How to quantify the power behavior
    dissimilarity between two execution points?
  • 1) Consider solely total power difference?
  • 2) Consider Manhattan distance between the
    corresponding 2 vectors?
  • 3) Consider Manhattan distance between the
    corresponding 2 normalized vectors?
  • 4) Consider a combination of (2) & (3)?
  • Construct a similarity matrix to represent
    similarity among all pairs of execution points
  • Each entry in the similarity matrix

58
Similarity Based on Both Absolute and Normalized
Power Vectors
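A minimal sketch of building the two similarity matrices, one from absolute power vectors and one from normalized power vectors, using Manhattan distance. The exact normalization and the way the two metrics are combined are not recoverable from the slides; dividing each vector by its total power is an assumption here, and the sample vectors are made up.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def normalize(v):
    """Scale a power vector by its total power (assumed normalization)."""
    total = sum(v)
    return [x / total for x in v] if total else list(v)


def similarity_matrices(vectors):
    """Return (absolute, normalized) pairwise distance matrices."""
    n = len(vectors)
    normed = [normalize(v) for v in vectors]
    absolute = [[manhattan(vectors[i], vectors[j]) for j in range(n)]
                for i in range(n)]
    normalized = [[manhattan(normed[i], normed[j]) for j in range(n)]
                  for i in range(n)]
    return absolute, normalized


# Made-up 2-component power vectors for three execution points.
power_vectors = [[2.0, 2.0], [4.0, 4.0], [1.0, 5.0]]
abs_matrix, norm_matrix = similarity_matrices(power_vectors)
```

Points 0 and 1 differ in total power but have the same power distribution, so they are far apart in the absolute matrix and identical in the normalized one, which is why the two views complement each other.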
59
Grouping Execution Points
  • Thresholding Algorithm
  • Define a threshold of similarity <% of max
    dissimilarity>
  • Start from first execution point (0,0) and
    identify ones in the fwd execution path that lie
    within threshold for both normalized and absolute
    metrics
  • Tag the corresponding execution points (j,j) as
    the same group
  • Find next untagged execution point (r,r) and do
    the same along forward path
  • Rule: A tagged execution point cannot add new
    elements to its group!
  • We demonstrate the outcome of thresholding with
    Grouping Matrices
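The thresholding steps above can be sketched as follows. This is an illustration, not the authors' code: the function name is hypothetical, "within threshold" is taken to mean less than or equal, and the threshold is applied as a percentage of each matrix's maximum entry. The 4×4 values from the Similarity Matrix Example slide are read here as pairwise distances, and the same matrix stands in for both the absolute and the normalized metric.

```python
def group_by_threshold(dist_abs, dist_norm, pct):
    """Tag each execution point with a group id using forward thresholding."""
    n = len(dist_abs)
    thr_abs = pct / 100.0 * max(max(row) for row in dist_abs)
    thr_norm = pct / 100.0 * max(max(row) for row in dist_norm)
    group = [None] * n
    next_id = 0
    for seed in range(n):
        if group[seed] is not None:
            continue  # rule: a tagged point cannot add elements to its group
        group[seed] = next_id
        # Look only along the forward execution path from the seed.
        for j in range(seed + 1, n):
            if (group[j] is None
                    and dist_abs[seed][j] <= thr_abs
                    and dist_norm[seed][j] <= thr_norm):
                group[j] = next_id
        next_id += 1
    return group


D = [[0, 6, 3, 7],
     [6, 0, 3, 6],
     [3, 3, 0, 7],
     [7, 6, 7, 0]]

groups_10 = group_by_threshold(D, D, 10)
groups_50 = group_by_threshold(D, D, 50)
groups_100 = group_by_threshold(D, D, 100)
```

Raising the threshold merges more execution points into fewer groups, which matches the shrinking group counts reported for gzip.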

Descsiption of Phase Analysis
60
Gzip Grouping Matrices
[Figure: Original Similarity Matrix, followed by grouping matrices with # of groups = 909, 254, 33, 3, 1]
  • Gzip has 974 power vectors
  • Cluster vectors based on similarity using
    thresholding
  • Max Gzip power dissimilarity: 47.35 W

61
Representative Vectors & Execution Points
  • We have each execution point assigned to a group
  • For each group, either
  • Define a representative vector as the average of
    all instances of that group, or
  • Select the execution point that started the group
    (the earliest point in each group)
  • For each execution point, respectively
  • Assign the corresponding group's representative
    vector as that point's power vector, or
  • Assign the power vector of the selected execution
    point for that group as that point's power vector
  • We can represent the whole execution with as many
    power vectors as the number of generated groups
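Both options on this slide (group average versus the execution point that started the group) can be sketched with one helper plus a reconstruction step. The function names, the `mode` parameter, and the sample data are hypothetical.

```python
def representatives(vectors, groups, mode="average"):
    """Map each group id to one representative power vector."""
    members = {}
    for idx, g in enumerate(groups):
        members.setdefault(g, []).append(idx)
    reps = {}
    for g, idxs in members.items():
        if mode == "average":
            dim = len(vectors[idxs[0]])
            reps[g] = [sum(vectors[i][d] for i in idxs) / len(idxs)
                       for d in range(dim)]
        else:
            # "first": the earliest execution point in the group
            reps[g] = list(vectors[idxs[0]])
    return reps


def reconstruct(groups, reps):
    """Replace every execution point by its group's representative."""
    return [reps[g] for g in groups]


# Made-up power vectors and group tags for three execution points.
vectors = [[1.0, 3.0], [10.0, 10.0], [3.0, 5.0]]
groups = [0, 1, 0]
trace_avg = reconstruct(groups, representatives(vectors, groups, "average"))
trace_first = reconstruct(groups, representatives(vectors, groups, "first"))
```

Comparing either reconstructed trace against the original vectors gives a direct measure of how much accuracy is lost by keeping only one vector per group.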
62
Reconstructing Power Trace with Representative
Vectors
63
Reconstructing Power Trace with Selected
Execution Points
64
Similarity Matrix Example
  • Consider 4 vectors (0-3), each with 4 dimensions
  • Log all distances in the similarity matrix
  • Color-scale from black to white (only for upper
    diagonal)

     0  1  2  3
  0  0  6  3  7
  1  6  0  3  6
  2  3  3  0  7
  3  7  6  7  0

[Figure: Similarity Matrix Plot]
66
Interpreting Similarity Matrix Plot
[Figure: Similarity Matrix Plot for vectors 0-3]
67
Grouping Matrix Example
  • Consider same 4 vectors
  • Mark execution pairs with distance ≤ Threshold

[Figure: Grouping Matrix Plots for Threshold 10 and Threshold 50]
68
Current & Future Research
  • FOLLOWING SLIDES DISCUSS ONGOING RESEARCH RELATED
    TO POWER ESTIMATION AND PHASES. PLANS FOR FUTURE
    RESEARCH ARE ALSO DISCUSSED

69
0) THE REAL BIG PICTURE
Bottom line
  • To estimate component power & temperature
    breakdowns for P4 at runtime
  • To analyze how power phase behavior relates to
    program structure

70
1) Phase Branch
  • Power Phase Behavior
  • Similarity Based on Power Vectors
  • Identifying similar program regions
  • Profiling Execution Flow
  • Sampling process execution
  • PCsampler LKM
  • Program Structure
  • Execution vs. Code space
  • Power Phases ? Exec. Phases
  • NOT YET?

71
POWER PHASE BEHAVIOR
  • Power Phase Behavior
  • Similarity Based on Power Vectors
  • Identifying similar program regions
  • Profiling Execution Flow
  • Sampling process execution
  • PCsampler LKM
  • Program Structure
  • Execution vs. Code space
  • Power Phases ? Exec. Phases
  • NOT YET

72
Identifying Power Phases
  • Most of the methodology is ready
  • Complete Gzip case in [Isci & Martonosi, WWC-6]
  • Extensibility to other benchmarks
  • Generated similarity metrics for several
  • Performed phase identification with thresholding
    for all
  • Repeatability of the experiment
  • Several other possible ideas such as
  • Thresholding k-means clustering
  • Two-pass thresholding
  • PCA for dimension reduction (or SVD?)
  • Manhattan: L1 norm
  • Euclidean (L2): not interesting
  • Chebyshev (Linf): ??

73
Program Execution Profile
  • Power Phase Behavior
  • Similarity Based on Power Vectors
  • Identifying similar program regions
  • Profiling Execution Flow
  • Sampling process execution
  • PCsampler LKM
  • Program Structure
  • Execution vs. Code space
  • Power Phases ? Exec. Phases
  • NOT YET

74
Program Execution Profile
  • Sample program flow simultaneously with power
  • Our LKM implementation PCsampler
  • Not Finished
  • Generate code space similarity in parallel with
    power space similarity
  • Relative comparisons of methods for
  • Complexity
  • Accuracy
  • Applicability, etc.

75
CURRENT STATE
  • Sample PC → Binding to functions
  • Reacquire PID
  • Those SPECs,
  • Runspec always in fixed address at ELF_program
    interpreter
  • Benches change pid between datasets
  • Verify PC with objdump
  • So we can make sure it is the PC we're sampling

76
Initial Data: PC Trace for gzip-source
Correspond to <send_bits>, <bit_reverse>, <lm_init>,
<longest_match>, <fill_window> functions
77
2) Thermal Modeling
  • Related Work
  • Performance Monitoring
  • P4 Performance Counters
  • Performance Reader LKM
  • Real Power Measurement
  • P4 Power Measurement Setup
  • Examples
  • Power Modeling
  • P4 Power Model
  • Model Measurement Sync Setup, Verification
  • Thermal Modeling
  • Refined Thermal Model
  • Ex Ppro Thermal Model

78
THERMAL MODELING: A Basic Model
  • Based on lumped R-C model from packaging
  • Built upon power modeling
  • Sampled Component Powers
  • Respective component areas
  • Physical processor Parameters
  • Packaging
  • Heat Transfer

Δt: sampling interval; T_i: the temperature
difference between block i and the heatsink
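The slide's own difference equation did not survive in this transcript. As an illustration only, the sketch below assumes the standard first-order lumped R-C form, C · dT/dt = P − T/R, stepped with forward Euler at the sampling interval. The function names and the R, C, and power values are hypothetical; only the 20 ms interval appears elsewhere in the slides.

```python
def step_temperature(t_diff, power_w, r_th, c_th, dt):
    """One forward-Euler step of C * dT/dt = P - T / R.

    t_diff is the block-to-heatsink temperature difference (K),
    r_th the thermal resistance (K/W), c_th the thermal capacitance (J/K).
    """
    return t_diff + dt / c_th * (power_w - t_diff / r_th)


def simulate(powers_w, r_th, c_th, dt, t0=0.0):
    """Apply the difference equation over a sampled power trace."""
    trace = [t0]
    for p in powers_w:
        trace.append(step_temperature(trace[-1], p, r_th, c_th, dt))
    return trace


# Made-up block: constant 2 W, R = 5 K/W, C = 0.1 J/K, dt = 20 ms.
trace = simulate([2.0] * 1000, r_th=5.0, c_th=0.1, dt=0.02)
```

Under constant power the trace settles at P × R above the heatsink, which gives a quick sanity check for any chosen parameters.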
79
Refined Thermal Model
  • Steady State Analysis reveals, Heatsink-Die
    abstraction is not sufficient for real systems
  • Proceeding to a multilayer thermal model
  • Active die thickness
  • metalization/insulation
  • chip-package interface
  • package
  • heatsink
  • Requires searching of several materials/
    dimensions and thermal properties
  • Multiple layers → Multiple T nodes → Multiple DEs
  • Baseline Heat removal Structure

80
Physical Structure vs. Thermal Model
Ambient Temperature
Ambient Airflow
Heatsink
Thermal Grease
Heat Spreader
Package
Die
81
Analytical Derivation
  • 4 Nodes → 4 DEs
  • 1) Tspr

82
Ex: PPro Thermal Model
  • Use CASTLE [Joseph, 2001] computed component
    powers
  • Determine component areas from Die photo
  • Determine processor/packaging physical parameters
  • Generate numerical thermal model
  • Apply component difference equations recursively
    along power flow

83
Simulation Outputs
  • Thermal nodes updated every Δt = 20 ms
  • Component temperatures build up to 350 K in 5 hrs
  • T_heatsink moves very slowly, as expected

84
Verification of Results (1/3)
  • Full validation requires comparing component
    power estimations to real measured component
    powers for all demonstrated full execution traces
  • No such published data available
  • Is it possible to acquire such data?
  • How to probe intra-chip components?
  • Closest would be Hspice simulations
  • Probably infeasible to acquire traces of minutes
  • No P4 power simulator
  • 1 benchmark would take 1 CPU-month with current
    power simulation speeds

Questions & Rebuttals
85
Verification of Results (2/3)
  • We use proxies
  • 1) Verifying against total measured power at
    runtime
  • Provides us with an immediate comparison
  • Most tested benchmarks show close approximations
  • 2) Behavior of component powers for simpler
    benchmarks
  • We show power trends follow expectations under
    different corners of execution

86
Verification of Results (3/3)
  • Why is an RMS error measure between measurement
    and estimate for total power not included?
  • As two sources of data are completely
    independent, i.e. measurement branch coming from
    serial interface and modeling part from ethernet,
    they are not perfectly synchronized. There are
    spikes at power jumps
  • Removing those by hand would question fidelity

Art RMS error (overall): 4.38 W
Art RMS error (100-1000 s): 4.21 W
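For reference, an RMS error between a measured and an estimated power trace, overall or restricted to a time window as in the Art figures above, can be computed as in this sketch. The function names and sample traces are made up, and the two traces are assumed to be already aligned sample by sample.

```python
import math


def rms_error(measured, estimated):
    """Root-mean-square difference between two aligned traces."""
    if len(measured) != len(estimated) or not measured:
        raise ValueError("need two equally long, non-empty traces")
    squared = [(m - e) ** 2 for m, e in zip(measured, estimated)]
    return math.sqrt(sum(squared) / len(squared))


def rms_error_in_window(times_s, measured, estimated, start_s, end_s):
    """RMS error over samples whose timestamp lies in [start_s, end_s]."""
    keep = [i for i, t in enumerate(times_s) if start_s <= t <= end_s]
    return rms_error([measured[i] for i in keep],
                     [estimated[i] for i in keep])


# Made-up traces in watts, one sample per second.
times_s = [0, 1, 2, 3]
measured_w = [40.0, 42.0, 44.0, 46.0]
estimated_w = [43.0, 42.0, 40.0, 46.0]
overall = rms_error(measured_w, estimated_w)
windowed = rms_error_in_window(times_s, measured_w, estimated_w, 1, 3)
```

Because the error is squared, a few misaligned samples at power jumps can dominate the result, which is the synchronization concern raised on this slide.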
87
Possibility for Other Processors?
  • Most recent processors are keen on power
    management
  • There will be enough power variability to exploit
    for power modeling and phase analysis
  • Porting the power estimation to other
    architectures
  • Requires significant effort to
  • Define power related metrics
  • Implement counter reader and power estimation
    user and kernel SW
  • Porting to same architecture, different
    implementation
  • More straightforward
  • Reevaluate max/idle/gated power estimates
  • Experiences with other architectures
  • Castle project for Pentium Pro (P6)
  • Few watts of variation
  • Low dimensionality
  • IBM Power3 II
  • Very low measured power variation

88
Event Counter Wish List (1/2)
  • Problems from experience
  • Memory Related Metrics
  • L2 metrics are complicated
  • Do not correspond to L2 hits/misses (see
    optimization reference manual)
  • Granularity issues
  • MOB accesses metric?
  • Memory Control?
  • Integer and FP accesses
  • FPE has 8 separate events (with 2 dedicated
    ESCRs)
  • Need at least 4 rotations to collect
  • INTE has no direct measure
  • Cannot differentiate multiply, shifts, logic,
    arithmetic
  • Out of order engine
  • Cannot differentiate between
  • Allocate, Rename, Instruction Queues, Schedule

89
Event Counter Wish List (2/2)
  • Suggestions for Future
  • Ultimately
  • Specific counters that represent component-wise
    utilizations
  • Switching bookkeepers for singly ended lines
  • Specific to P4 Xeon Generation
  • Metrics directly related to memory components
  • Single aggregate metrics for FP and Int execution
  • Metrics that explore out of order engine
    components

90
Applications of our Technique
  • Already discussed
  • Power phase analysis
  • Thermal modeling
  • In addition
  • OS based energy accounting
  • e.g. ECOSystem [Duke: C. Ellis et al.]
  • Fine grained CPU power (rather than ON/OFF)
  • Detailed SW energy mappings
  • e.g. PowerScope [CMU: Flinn & Satyanarayanan]
  • Dynamic Power Management
  • e.g. Process Cruise Control [Weissel & Bellosa]
  • Event driven / Power model driven clock scaling

91
Statistical Methods (1/2)
  • Several optimization/curve fitting tools
  • JMP, R, SPSS, S, Stata
  • Regression
  • Used by some previous examples
  • 2^k factorial design
  • k factors, each at 2 levels (Hi & Lo)
  • Run design at each of the 2^k corners
  • Analyze the interactions and individual factors
    effects
  • Turn factors ON/OFF
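A small sketch of a 2^k factorial design: enumerate every Hi/Lo corner and estimate each factor's main effect as the mean response at Hi minus the mean response at Lo. The factor names and the response values are made up for illustration.

```python
from itertools import product


def factorial_corners(factors):
    """All 2^k combinations of 'lo'/'hi' levels for the given factors."""
    return [dict(zip(factors, levels))
            for levels in product(("lo", "hi"), repeat=len(factors))]


def main_effect(runs, responses, factor):
    """Mean response with the factor at 'hi' minus mean at 'lo'."""
    hi = [r for run, r in zip(runs, responses) if run[factor] == "hi"]
    lo = [r for run, r in zip(runs, responses) if run[factor] == "lo"]
    return sum(hi) / len(hi) - sum(lo) / len(lo)


factors = ["A", "B"]
runs = factorial_corners(factors)
# Made-up response: 10 at baseline, +4 when A is hi, +1 when B is hi.
responses = [10 + 4 * (run["A"] == "hi") + 1 * (run["B"] == "hi")
             for run in runs]
effect_a = main_effect(runs, responses, "A")
effect_b = main_effect(runs, responses, "B")
```

The design relies on being able to set each factor independently at every corner.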

92
Statistical Methods (2/2)
  • We believe we can produce a much closer match
    to measured total power with curve-fitting
    techniques and PCA (to define a new set of
    orthogonal activity factors)
  • Topic for future research
  • We avoid this to preserve the binding of activity
    factors to physical component powers, as required
    for thermal analysis
  • Note on 2^k factorial design
  • Our factors are
  • Access rates OR Component power estimations
  • Our response variable
  • Measured total power
  • However
  • We cannot independently turn ON/OFF the accesses
    to individual components
