About This Presentation
Title:

An Energy Efficient Instruction Set Synthesis Framework for Low Power Embedded System Designs

Description:

An Energy Efficient Instruction Set Synthesis Framework for Low Power Embedded ... framework utilizes state-of-the-art instruction encoding techniques leverage ...

Number of Views:87
Avg rating:3.0/5.0
Slides: 47
Provided by: Jul797
Transcript and Presenter's Notes

Title: An Energy Efficient Instruction Set Synthesis Framework for Low Power Embedded System Designs


1
An Energy Efficient Instruction Set Synthesis
Framework for Low Power Embedded System Designs
Allen C. Cheng, Student Member, IEEE, and
Gary S. Tyson, Member, IEEE
  • Mayumi KATO

2
Outline
  • Motivation
  • Embedded workload analysis
  • Synthesis framework
  • Power modeling
  • Experiments
  • Results and analysis
  • Comparison
  • Possible Improvement
  • Question

3
Motivation
  • Rapid market growth
  • PDAs, digital cameras, MP3 players, and mobile
    personal communicators
  • Applications demand more instructions
  • A new system architecture must meet
    challenging cost, performance, and power demands

4
Framework-based Instruction-set Tuning Synthesis
(FITS) technique
  • A tunable, general-purpose processor solution
  • Meets the code size, energy, and time-to-market
    constraints with minimal impact on area
  • Applicable to a wide range of microprocessor
    designs

5
Features (FITS)
Delays instruction set synthesis until after
processor fabrication
The data path of a FITS processor is similar to
that of a single processor platform, but only a
subset of its numerous functions is mapped to the
synthesized instruction set
6
  • Replaces the fixed instruction and register
    access decoders with programmable decoders
  • The programmable decoder can optimize
    instruction encodings, addressing modes, and
    operand and immediate bit widths to match the
    requirements of the target application

7
Benefits of FITS
  • Reduces code size by synthesizing 16-bit
    instructions instead of a 32-bit ISA
  • Reduces energy consumption by deactivating
    unused parts of the data path
  • Reduces cost and time to market for new products
    by utilizing a single processor platform across a
    wide range of applications

8
Contributions - threefold
  • (1) Provides thorough analyses of the
    characteristics of embedded applications
  • Shows that a 16-bit instruction set is
    sufficient for these high-performance embedded
    applications

9
  • (2) Presents a novel, cost-effective 16-bit
  • instruction set synthesis framework
  • optimized for modern pipelined RISC
  • architectures

10
(3) Demonstrates the effectiveness of this
framework: it utilizes state-of-the-art instruction
encoding techniques to deliver compact, energy
efficient, and high performance designs
(comparable to an ASP with substantially less
engineering cost)
11
Embedded workload analysis
  • Representative MiBench suite
  • Compiled into ARM binaries using the GCC tool
    chain
  • Opcode Space Requirement
  • Operand Space Requirement
  • Immediate Space Requirement
  • Physical Register Space Requirement

12
Opcode space requirement
  • Number of different instructions:
  • 16 out of 23 programs (69.6 percent) use 27 or
    fewer opcodes
  • 7 out of 23 (30.4 percent) use 32 to 40 opcodes
  • Ignoring less than 5 percent of dynamic
    instructions:
  • 20 out of 23 programs need 13 to 20 opcodes
  • 55.6 percent → less than 1 percent of total
    dynamically executed instructions

13
Operand space requirement
  • Number of register operands
  • Instructions with three addresses prevail in
    popular 32-bit RISC machines
  • Two operands are enough
  • Analysis
  • → Static statistics are important from a
    code-size viewpoint
  • → Dynamic statistics help gauge power
    dissipation

14
Static profiling operand analysis
  • Fraction of all static three-address ARM integer
    instructions that need only two addresses: 19 to
    88 percent of the time
  • Intermixing two-address and three-address
    instructions (NOT load, store, swap)

15
Dynamic profiling operand analysis
  • Fraction of all dynamic three-address ARM integer
    instructions that need only two addresses:
  • 59 to 87 percent of the time
  • NOT add with carry, subtract with carry, reverse
    subtract with carry, swap, load, store

16
Immediate Space Requirement
  • Three categories
  • Branch immediate (PC offset)
  • ALU immediate (regular arithmetic, logic
    operations, shift operations)
  • Memory (load/store) immediate (base
    displacement)
  • Static profiling immediate analysis
  • To determine the size of immediate operands in
    instructions
  • Dynamic profiling immediate analysis
  • To identify the most frequently used immediate
    constants
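The static analysis above amounts to a small profiling pass over the immediate constants found in the binary. A minimal sketch (the helper names are illustrative, not the paper's tool):

```python
# Sketch of static immediate profiling: find the minimum field width
# each immediate needs and histogram the widths. Sample values are
# illustrative, not MiBench data.

def bits_needed(value: int) -> int:
    """Minimum width of an unsigned field that can hold value >= 0."""
    return max(1, value.bit_length())

def immediate_histogram(immediates):
    """Map bit-width -> count over a list of immediate constants."""
    hist = {}
    for imm in immediates:
        width = bits_needed(abs(imm))
        hist[width] = hist.get(width, 0) + 1
    return hist

# Small constants (especially zero) dominate in embedded code.
sample = [0, 1, 1, 4, 8, 255, 0, 2]
print(immediate_histogram(sample))
```

A synthesis pass would pick the immediate field width that covers most of this histogram and fall back to wider encodings for the rest.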

17
Static profiling immediate analysis
  • 71 percent of all instructions contain immediate
    values on average
  • Branch instructions use the largest number of
    unique immediate constants
  • Static utilization: ALU 30.7 percent, memory
    23.5 percent, branch 16.8 percent
18
Dynamic profiling immediate analysis
  • 97.7 percent of all executed instructions are
    immediate instructions
  • ALU 53.9 percent
  • Memory 32.2 percent
  • Branch 11.7 percent
  • Number of unique immediate constants

19
Physical Register Space Requirement
  • The smaller the number of physical registers, the
    better the performance of an instruction set
  • Analysis
  • The goal is to find the maximum number of
    physical registers that an application may need

20
Register space analysis
  • The MIRV compiler compiles MiBench and profiles
    it
  • Look at the register usage for each procedure
    call, and determine the maximum usage
  • 9 to 82 (ave. 26, median 20)
  • 5 to 26 (ave. , median 17)
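The per-call statistics above can be sketched as a simple summary over a usage profile; the numbers below are illustrative, not MiBench results:

```python
# Sketch of the register-space analysis, assuming a per-procedure-call
# profile: each entry is the number of registers one call used.

def summarize_register_usage(per_call_usage):
    """Return the max, average, and median register usage."""
    s = sorted(per_call_usage)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return {"max": s[-1], "avg": sum(s) / n, "median": median}

print(summarize_register_usage([9, 12, 20, 26, 82]))
# {'max': 82, 'avg': 29.8, 'median': 20}
```

The maximum tells the synthesizer how many physical registers an application may need; the median suggests how many register-specifier bits a 16-bit encoding must budget for.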

21
Synthesis framework
  • FITS methodology
  • FITS design flow
  • Synthesis heuristic
  • Synthesis results

High performance and code density by customizing
the instruction set to a target application
32-bit instructions mapped to 16 bits
50 percent code size reduction while achieving
32-bit-level performance
22
FITS methodology
  • An application-specific hardware/software
    codesign approach
  • Matches microarchitectural resources to
    application performance needs, while improving
    code density
  • Application-specific customization at the
    instruction set level, utilizing programmable
    decoders for instruction decode and register
    access

23
  • Use profile information
  • Explore new optimization heuristics using static
    dataflow information to perform the code
    transformation
  • Specify configuration information including the
    register organization and instruction decoding
    and download it to a nonvolatile state in the
    FITS processor

24
FITS design flow
  • Design flow: Application → Profile → Synthesize
    → Compile → Configure → FITS Binary → Execute
  • After Execute, check whether the result is
    acceptable; if not, the flow iterates
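The loop in the design flow can be sketched as a small driver. All stage functions here are stand-in placeholders, not the real FITS toolchain:

```python
# Toy driver for the profile -> synthesize -> compile -> configure ->
# execute loop. Each stage is a placeholder lambda so the loop can run
# end to end; real profiling and synthesis are far more involved.

def fits_design_flow(application, stages, acceptable):
    """Iterate the five stages until the run meets the constraints."""
    while True:
        profile = stages["profile"](application)
        isa = stages["synthesize"](profile)
        binary = stages["compile"](application, isa)
        config = stages["configure"](isa)
        result = stages["execute"](binary, config)
        if acceptable(result):
            return binary, config

# Trivial stand-in stages, tagging their inputs to make the flow visible.
stages = {
    "profile":    lambda app: {"opcodes": 20, "registers": 16},
    "synthesize": lambda prof: dict(prof, width=16),
    "compile":    lambda app, isa: app + ".fits",
    "configure":  lambda isa: {"decoder": isa},
    "execute":    lambda binary, cfg: {"energy": 1.0},
}
binary, config = fits_design_flow(
    "dijkstra", stages, acceptable=lambda r: r["energy"] <= 1.0)
print(binary)  # dijkstra.fits
```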
25
Profile
  • The target application is analyzed by the FITS
    profiler to extract its characteristics
  • The output is an extensive requirement analysis
  • (instruction set, opcode field, operand
    field, immediate field, and register pressure)

26
Synthesize/Compile/Configure/Execute
  • Code generation is completed
  • The programmable decoder is configured using the
    instruction decoding and register organization
  • Any unused datapaths are turned off at this stage
    to reduce power consumption
  • Execute the FITS binary on a FITS processor

27
Synthesis heuristic
  • Instruction mapping
  • Base Instruction Set (BIS)
  • Instructions found across all applications
  • Supplemental Instruction Set (SIS)
  • Instructions required to make the instruction set
    Turing-complete
  • Application-specific instructions (AIS)
  • The set of functional units in the
    microarchitecture
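The instruction mapping above splits an application's opcodes into those already covered by the base set and those needing supplemental encodings. A minimal sketch with illustrative mnemonic sets (not the paper's actual BIS):

```python
# Sketch of the BIS/SIS split: partition an application's opcodes into
# those covered by the Base Instruction Set and those that need
# Supplemental encodings. The mnemonics are illustrative only.

def partition_instructions(app_opcodes, base_set):
    """Return (covered by BIS, needing SIS) for one application."""
    return app_opcodes & base_set, app_opcodes - base_set

base = {"add", "sub", "ldr", "str", "b", "cmp", "mov"}
app = {"add", "ldr", "str", "mul", "cmp", "mov", "mla"}
covered, supplemental = partition_instructions(app, base)
print(sorted(supplemental))  # ['mla', 'mul']
```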

28
Synthesis results
  • MiBench
  • embedded applications
  • SimpleScalar toolset
  • Examine the quality of the synthesized
    instruction set

29
Synthesized immediates
  • Select 16 synthesized ALU immediates
  • 4-bit immediates: 96.9 percent (of references
    made to ALU immediates)
  • 51.8 percent of the contribution (number of
    references made)
  • The most frequently referenced immediate: zero
  • 87.4 percent of the total references
  • 16 synthesized memory immediates
  • 26.9 percent of the memory immediate references
  • Value zero

Problem: applications containing many
floating-point memory immediates are problematic
and are handled as exceptions
30
Final instruction formats
  • op | RC | RA | OPRD (RC: writing result;
    RA: source register; OPRD: source operand)
  • op | RA | RB | IMM (register and immediate
    operands)
  • memory: op | DISP (displacement)
  • branch: op | NUMBER
  • trap: op (interrupts, exceptions, task
    switching)
31
Power modeling
  • P = A · C · V² · f + V · I_leak

A: fraction of gates actively switching; C: total
capacitance; V: supply voltage; f: system operating
frequency; I_leak: leakage current
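The model splits total power into a dynamic (switching) term and a static (leakage) term. A numeric sketch with illustrative parameter values (not SA-1100 figures):

```python
# Numeric sketch of the slide's power model:
#   P = A * C * V^2 * f + V * I_leak

def total_power(activity, capacitance, voltage, frequency, i_leak):
    """Dynamic (switching) plus static (leakage) power, in watts."""
    dynamic = activity * capacitance * voltage ** 2 * frequency
    static = voltage * i_leak
    return dynamic + static

# Example: 30% switching activity, 1 nF switched capacitance, 1.5 V
# supply, 200 MHz clock, 5 mA leakage current.
p = total_power(0.3, 1e-9, 1.5, 200e6, 5e-3)
print(round(p, 4))  # watts; dynamic 0.135 W + static 0.0075 W
```

The dynamic term dominates here, which is why FITS's gating of unused datapaths (reducing A and the effective C) pays off.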
32
Power modeling tool
  • A natural solution
  • → build a power estimator

Sim-Panalyzer: an infrastructure for
microarchitectural power simulation at the
architectural level, built on top of the
SimpleScalar-ARM simulator
33
Experiments
  • To show the effectiveness of FITS in reducing
    power dissipation
  • Microprocessor
  • Intel's SA-1100 StrongARM embedded
    microprocessor

Benchmark: MiBench
34
Results and analysis
  • Metrics
  • Instruction mapping rate
  • Code size saving
  • Power reduction
  • Performance measurement
  • Instruction Mapping Coverage
  • Code Size Benefits
  • Power Dissipation Benefits
  • Performance Benefits

35
Instruction mapping coverage
  • Map ARM to FITS
  • 96 percent on average for static mapping
  • smaller code size
  • smaller cache requirements
  • 98 percent on average for dynamic mapping
  • greater power reduction
  • faster execution

FITS does not address code having a larger
fraction of floating point
36
Code size benefits
  • Compares the program code density
  • ARM (32-bit), THUMB (16-bit), FITS (mixed 16 and
    32 bits)
  • Code savings
  • THUMB: 33 percent of ARM
  • FITS: 47 percent of ARM
  • Benefits
  • Half-sized FITS instructions let a cache hold
    twice as many instructions as before
  • Higher cache hit rate

37
Power dissipation benefits
  • Instruction cache power consumption
  • Dynamic power
  • Switching
  • internal
  • Static power
  • Peak power

38
Switching power
  • Sensitive to the amount of output data driven
    during each cache access

Switching power is consumed by the output driver
and its output load capacitance in the instruction
cache microarchitecture
39
Internal power
40
Instruction cache power saving
41
Performance benefits
  • Cache miss rate (misses per one million
    cache accesses)
  • Instructions per cycle (IPC)
  • SA-1100: a dual-issue, in-order machine

42
Cache miss rate
  • FITS 8KB caches have lower miss rates than the
    normal full-sized ARM 16KB cache
  • → The cache lines can be viewed as being twice
    the size
  • → Twice the number of instructions are brought
    into the cache

43
Instructions per cycle (IPC) rate
  • IPC performance of FITS code is comparable to
    that of native ARM code
  • An 8KB FITS cache achieves roughly the same IPC
    as a 16KB ARM cache, with only a few minor
    variations
  • → This is a result of high instruction mapping
    rates and low cache miss rates

44
Comparison with the state of the art
  • Kadri et al. (code compression)
  • Xtensa (customizable embedded processor)
  • Thumb, Thumb-2, MIPS16, ST100, ARCtangent-A5
    (dual instruction-set processors, 16-bit and
    32-bit)

45
How FITS differs
  • A single 16-bit ISA scheme
  • Maps only the subset of instructions that a
    particular program needs to the 16-bit
    instruction format

46

Possible improvement
  • Each application needs its own selection of
    operations and storage components, which may
    differ from other applications'
  • → After a program is loaded, it is possible to
    design the microarchitecture aggressively

47
Possible improvement for Java memory compression
  • Low power consumption
  • Smaller cache size and lower cache miss rate
  • High hit rate with the available cache
  • Small memory demand for code
  • Half of the code storage is expected to be saved
    by compacted code, which will work as memory
    compression

48
  • Questions