An Instruction-Level Power Model for the XScale Architecture

Slides: 43
Provided by: Seif7

1
An Instruction-Level Power Model for the Xscale
Architecture
  • By Hailemelekot Seifu and
  • Rishi Ranjan
  • Real-Time Systems

2
BACKGROUND I
  • Previous Work by Tiwari and others
  • Software is a major component of power
    consumption in real-time and non-real-time
    applications
  • Therefore, software must be analyzed in terms of
    power consumption to minimize energy needs of
    total systems

3
BACKGROUND II
  • The lowest level of hardware analysis is the
    logic gate
  • The instruction in software parallels the logic
    gate in hardware. Power analysis should be at
    this basic level.
  • Most work previous to Tiwari was hardware-based

4
BACKGROUND III
  • Instruction level power model
  • Used to quantify the fundamental information
    needed to evaluate power cost of a program.
  • Can be used by compilers, code generators and
    code schedulers targeted for low power.
  • Does not require knowledge of lower level details
    of processor.

5
IMPORTANCE
  • Low Power Consumption
  • Prolongs life of embedded systems that have low
    battery/power stores.
  • Requires less maintenance and upkeep
  • Saves money

6
Applications
  • Instruction-Level Power Models
  • can give total power consumption information on a
    software system before it is deployed on a power
    constrained system
  • allow creation of software optimization
    techniques at code compilation and generation
    stages

7
Project Goals
  • Perform an instruction-level power analysis of
    the Intel Xscale-based Architecture (only on
    processor core).
  • Create a power estimator that takes a program's
    instruction trace as input and determines its
    power consumption
  • Evaluate the model and estimator using power
    measurements

8
Hardware I
  • ADI BRH 80200
  • Intel XScale (400-733MHz) 80200 CPU
  • 128MB DRAM.
  • Dual Intel EEpro/100 (10/100) Ethernet ports.
  • Oscilloscope
  • A Tektronix oscilloscope was used for current
    readings. It allows a time granularity of 5 ms and
    a current granularity of 5 mA.

9
HARDWARE II
[Diagram: Power Source, Oscilloscope, CPU]

10
ARM ISA I
  • Six instruction classes were distinguished:
  • Branch: B, BL, BX, BLX
  • Data Processing: ADD, ADC, BIC, CMP
  • Multiply: MUL, MLA, SMLAL, UMLAL
  • Load/Store: LDR, LDMDA, STR, STMDA
  • Miscellaneous: MRS, MSR, MRC, MCR
  • No operation: NOP

11
ARM ISA II
  • Branch instruction readings were not performed
    due to difficulty in creating programs that would
    accurately discern current caused by branches
  • All readings were taken on top of Linux operating
    system

12
Measurement Methods I
  • Power = Current × Voltage
  • Energy = Power × Time
  • Time = Number of processor cycles × Clock Period
  • Energy = Current × Voltage × Time
  • Voltage is constant for a battery (XScale
    supports voltage scaling)
  • Measure Current
  • Measure Time
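
The relations on this slide can be sketched as a small Python helper. This is only an illustration: the function name, its parameters, and the 1.5 V supply voltage in the usage note are assumptions, not values from the slides.

```python
def energy_joules(current_a, voltage_v, cycles, clock_hz):
    """Energy = Current x Voltage x Time, with Time = cycles x clock period."""
    clock_period_s = 1.0 / clock_hz          # Clock Period
    time_s = cycles * clock_period_s         # Time = cycles x Clock Period
    power_w = current_a * voltage_v          # Power = Current x Voltage
    return power_w * time_s                  # Energy = Power x Time


if __name__ == "__main__":
    # Hypothetical inputs: 426 mA average current, an assumed 1.5 V supply,
    # one million cycles at 400 MHz.
    print(energy_joules(0.426, 1.5, 1_000_000, 400e6))
```

With the voltage held constant, only the average current and the execution time need to be measured, which is why the slide reduces the method to those two measurements.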

13
Measurement Methods II
  • Energy consumption has three factors
  • Base Costs
  • the energy consumed by the basic processing of
    each instruction
  • Inter-Instruction Effects
  • the energy costs due to the change in circuit
    state when two consecutive, different
    instructions are executed
  • Cache misses and stalls
  • the energy effect of instruction/data cache
    misses or pipeline stalls due to resource
    constraints
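
In the notation commonly used for Tiwari-style instruction-level models, the three factors combine into one sum. The symbols below are introduced here for illustration and do not appear in the slides: $B_i$ is the base cost of instruction $i$, $N_i$ the number of times it executes, $O_{i,j}$ the inter-instruction overhead for the consecutive pair $(i, j)$, $N_{i,j}$ the number of times that pair occurs, and $E_k$ the cost of each cache miss or stall event $k$.

```latex
E_{\text{program}} = \sum_{i} B_i \, N_i
                   + \sum_{i,j} O_{i,j} \, N_{i,j}
                   + \sum_{k} E_k
```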

14
Measurement Methods III
  • Base Cost Calculation
  • Base instruction cost is measured and the given
    current reading is taken as an average
  • The programs contain the instruction to be
    measured repeated in a loop.
  • Avoid stalls and cache misses (favors a short loop)
  • Overcome the effect of the jump instruction at the
    bottom of the loop (favors a long loop)
  • These are contradictory requirements

15
Base Cost Program Example
  • .global main
  • main:
  • ADC R0,R1,R3
  • ADC R1,R1,R3
  • ADC R2,R1,R3
  • ADC R3,R1,R3
  • ADC R4,R1,R3
  • ADC R0,R1,R3
  • ADC R1,R1,R3
  • (repeat for 200 lines)
  • B main
  • Cache size is 32 KB, so 200 lines avoid cache misses
  • With 200 lines, the branch instruction at the end has
    an insignificant effect on the measured current
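
A test program of this shape could be generated rather than typed by hand. The sketch below is an assumption about tooling, not the authors' script: `make_base_cost_program` and `code_size_bytes` are hypothetical helpers, and the only hardware fact relied on is that each ARM instruction occupies 4 bytes.

```python
from itertools import cycle, islice


def make_base_cost_program(body, repeats=200):
    """Build an assembly loop: `repeats` instructions cycled from `body`,
    followed by a single branch back to the top."""
    lines = [".global main", "main:"]
    lines.extend(f"    {ins}" for ins in islice(cycle(body), repeats))
    lines.append("    B main")
    return "\n".join(lines) + "\n"


def code_size_bytes(repeats=200):
    """Size of the loop body plus the branch, at 4 bytes per ARM instruction."""
    return (repeats + 1) * 4


if __name__ == "__main__":
    body = [f"ADC R{d},R1,R3" for d in range(5)]  # rotate destination R0..R4
    print(make_base_cost_program(body, repeats=200))
    print(code_size_bytes(200), "bytes, far below a 32 KB instruction cache")
```

With 200 repeats the branch is one instruction in 201, which is the sense in which its effect on the averaged current is small.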

16
Measurement Methods IV
  • Inter-Instruction Effect
  • These effects occur mostly during the fetch stage
    of the execution pipeline when the hardware
    circuit changes significantly due to setup of two
    different instructions
  • The programs are generally a loop of two
    alternating instructions

17
Inter-Instruction Program Example
  • ADD R0,R10,R11
  • MLA R1,R2,R10,R11
  • ADD R0,R10,R11
  • MLA R1,R2,R10,R11
  • Expected current: (405 + 508)/2 = 456.5 mA
  • Measured current: 567 mA
  • Difference: 567 - 456.5 = 110.5 mA
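
The comparison on this slide can be written as a short calculation. This is a sketch with a hypothetical function name. It assumes the expected current of an alternating two-instruction loop is the plain average of the two base readings, as the slide does.

```python
def inter_instruction_overhead_ma(base_a_ma, base_b_ma, measured_ma):
    """Measured current of the alternating loop minus the average of the
    two base-cost readings."""
    expected_ma = (base_a_ma + base_b_ma) / 2
    return measured_ma - expected_ma


if __name__ == "__main__":
    # Readings quoted on the slide: 405 and 508 mA base costs, 567 mA measured.
    print(inter_instruction_overhead_ma(405, 508, 567))
```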

18
Measurement Methods V
  • Resource Constraint Effects
  • Can occur for various reasons
  • Most likely is due to register dependencies among
    consecutive instructions
  • Run a program in loop having repeated occurrence
    of re-used destination registers

19
Resource Constraint Example
  • .global main
  • main
  • ADD R0,R6,R7
  • ADD R0,R6,R7
  • ADD R0,R6,R7
  • ADD R0,R6,R7
  • ADD R0,R6,R7
  • (repeat for 200 lines)
  • B main
  • Avoids cache misses
  • Same destination register causes pipeline stalls

20
Measurement Methods VI
  • To create cache miss effects
  • Run programs with initial instructions that
    invalidate all cache entries.

21
Cache Miss Program Example
  • .global main
  • main
  • MCR P15,0,R1,C7,C5,0
  • MRC P15,0,R0,C2,C0,0
  • MOV R0,R0
  • SUB PC,PC,#4
  • ADC R0,R1,R3
  • ADC R0,R1,R3
  • ADC R0,R1,R3
  • (8000 lines)
  • B main
  • 32 KB instruction cache
  • MCR invalidates the cache
  • 8000 lines cause all possible cache misses
  • 8000 lines also compensate for the time spent waiting
    for cache invalidation
  • MCR is a privileged instruction, so the code runs as
    kernel modules

22
Base Cost Results
Instruction   Base Reading (mA)
ADC           426
ADD           406
AND           405
BIC           401
CLZ           400
CMN           400
CMP           474
EOR           401
MOV           398
RSB           503
LDMDA         606
SWP           537
NOP           380

23
Inter-Instruction Results
Instruction Pair   Reading (mA)
ADC_LDMDA          471
CLZ_STR            530
BIC_MSR            460
STMDA_NOP          495

24
Cache Miss and Stall Results
Instruction   Cache Miss Reading (mA)
ADC           456
MLA           350
SWP           393

Instruction   Stall Reading (mA)
BIC           384
MVN           386
TEQ           383

25
Power Model Analysis I
Base Cost
26
Power Model Analysis II
Inter-Instruction
27
Power Model Analysis III
Cache Miss
28
Base Cost
29
Base Cost Observations
  • Instructions in the same class have similar base
    costs
  • This can be used to group instructions together and
    assign an average power cost per class.
  • Profiler optimization.
  • Assumes insignificant change due to variations in
    arguments.
  • Same base cost for every occurrence of an
    instruction
  • Inter-instruction cost is calculated between groups
    of instructions.
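
The grouping idea can be sketched as follows, using the base readings from the Base Cost Results slide. The class assignments in `CLASS_OF` are an illustrative assumption based on the standard ARM instruction groupings. CLZ and SWP are left unassigned because the slides do not say which class they were placed in.

```python
from statistics import mean

# Base readings in mA, copied from the Base Cost Results slide.
BASE_MA = {
    "ADC": 426, "ADD": 406, "AND": 405, "BIC": 401, "CLZ": 400,
    "CMN": 400, "CMP": 474, "EOR": 401, "MOV": 398, "RSB": 503,
    "LDMDA": 606, "SWP": 537, "NOP": 380,
}

# Assumed class membership (hypothetical mapping for illustration).
CLASS_OF = {
    "ADC": "data_processing", "ADD": "data_processing",
    "AND": "data_processing", "BIC": "data_processing",
    "CMN": "data_processing", "CMP": "data_processing",
    "EOR": "data_processing", "MOV": "data_processing",
    "RSB": "data_processing",
    "LDMDA": "load_store",
    "NOP": "nop",
}


def class_average_ma(base_ma, class_of):
    """Average the base readings of all instructions mapped to each class.
    Instructions without a class assignment are skipped."""
    groups = {}
    for op, reading in base_ma.items():
        cls = class_of.get(op)
        if cls is not None:
            groups.setdefault(cls, []).append(reading)
    return {cls: mean(vals) for cls, vals in groups.items()}


if __name__ == "__main__":
    for cls, avg in class_average_ma(BASE_MA, CLASS_OF).items():
        print(f"{cls}: {avg:.1f} mA")
```

A profiler built this way looks up one number per class instead of one per opcode, at the price of hiding outliers such as CMP and RSB inside the class average.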

30
Inter-Instruction
31
Inter-Instruction Observations
  • Circuit overhead was insignificant for x86 (Tiwari's
    work).
  • Our measurement shows that it IS significant
    compared to our base costs.
  • Power profiler should include this overhead for
    ARM architecture.

32
Resource Constraint Penalty
33
Observations
  • There is limited variation across different
    instructions
  • Profiler would assign average power to all stalls
  • Profiler would assign penalty due to single
    occurrence multiplied by total occurrence in a
    program

34
Current Estimator I
  • The software estimator will
  • take as input a program's instruction trace,
    cache miss, and stall information
  • use the above instruction power model and
  • output the corresponding average current for that
    program
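
A minimal sketch of such an estimator is shown below. It is not the authors' implementation. It assumes one cycle per instruction, a symmetric overhead per pair of differing consecutive opcodes, and no cache miss or stall terms. All names are hypothetical.

```python
def estimate_average_current_ma(trace, base_ma, overhead_ma,
                                default_overhead_ma=0.0):
    """Average current over a trace of opcodes.

    trace       -- list of opcode strings in execution order
    base_ma     -- dict: opcode -> base current reading in mA
    overhead_ma -- dict: frozenset({op_a, op_b}) -> overhead in mA that is
                   added each time the two opcodes execute back to back
    """
    if not trace:
        raise ValueError("trace must contain at least one instruction")
    total = sum(base_ma[op] for op in trace)
    # Add the inter-instruction overhead at every change of opcode.
    for prev, cur in zip(trace, trace[1:]):
        if prev != cur:
            total += overhead_ma.get(frozenset((prev, cur)),
                                     default_overhead_ma)
    return total / len(trace)


if __name__ == "__main__":
    base = {"ADD": 405, "MLA": 508}
    overhead = {frozenset(("ADD", "MLA")): 110.5}
    print(estimate_average_current_ma(["ADD", "ADD", "MLA", "MLA"],
                                      base, overhead))
```

Keying the overhead table on an unordered pair halves its size. If the overhead for ADD followed by MLA differs from that for MLA followed by ADD, an ordered tuple key would be needed instead.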

35
Current Estimator II
  • Estimator Design
  • SimpleScalarARM (Umich) was used as a starting
    point since it could give an instruction trace
    most similar to the ARM architecture
  • Cache miss and stall information would be
    obtained from a per-process Performance Counter
    facility for the ADI boards.

36
Current Estimator III
  • Several Challenges Emerged
  • Performance Counter facility is not working
  • SimpleScalar's cache miss functionality was also
    not working with the chosen benchmarks
  • So
  • cache miss and stall information could not be
    included in the evaluation phase.
  • Base cost and inter-instruction effects can still
    be evaluated with SimpleScalar's instruction trace

37
Benchmarks
  • MIBench (Umich)
  • is a benchmark suite for embedded systems
  • has several sub-categories depending on the
    system's use, such as telecommunications,
    automotive, and consumer
  • was chosen because it is an embedded suite and is
    freely available

38
Estimator Results I
39
Estimator II
  • Results are very skewed due to the lack of cache
    miss and stall information
  • Other benchmarks were challenging to compile for
    ARM/SimpleScalar; we are still working on them
  • Proper handling of floating point instructions
    needs to be added

40
Necessary Work - LOTS!
  • Include cache miss and stall information
  • without it, the power profiler can't be accurate
  • Inclusion of branch instructions in the model
  • branch instructions are used for functions, etc.
  • Inclusion of instruction operands in the power model
  • due to time constraints we only looked at opcodes
  • Improvement of estimator performance
  • using better data structures, opcode search

41
Necessary Work II
  • Proper analysis of Stall information
  • Modelling of Load/Store inter-instruction effects
  • Data cache accesses, etc

42
QUESTIONS