1
General Overview of An Adaptive Dynamic
Extensible Processor
  • Hamid Noori, Kazuaki Murakami, Koji Inoue
    Victor Goulart

Kyushu University Department of Informatics
Workshop on Introspective Architecture (WISA06)
2
Agenda
  • Background
  • Research goal
  • General overview of the architecture
  • Modes of operation
  • Profiler
  • Accelerator
  • Sequencer
  • Generation of Custom Instructions
  • Configuration Data for the Accelerator
  • Experiments and Results
  • Conclusions and future work

3
Background
GPP ASIC ASIP Ext. Proc. Our Proc.
Power consumption ? ? ? ?
Performance (Specific) ? ? ? ?
Performance (General) ? ?
Flexibility ? ?
Design time ? ? ?
Design cost ? ? ? ?
Programmability ? ? ? ?
Productivity ? ? ? ?
4
Some definitions
  • Hot Basic Block (HBB)
  • A basic block whose execution frequency is
    greater than a given threshold specified in the
    profiler
  • Custom Instructions (CIs)
  • Extensions to the Instruction Set Architecture
    (ISA) that are executed on the ACC
  • Accelerator (ACC)
  • Custom hardware for executing CIs
  • Training mode
  • Operation mode for detecting HBBs and generating
    CIs
  • Normal mode
  • Normal operation mode where CIs are executed on
    the ACC

5
Research Goal
  • Proposal of an Adaptive Dynamic Extensible
    Processor for Embedded Systems
  • Custom instructions are adaptable to the
    applications
  • Custom instructions are detected and created
    during execution/training
  • Generation of custom instructions is done
    transparently and automatically
  • Advantages of the novel approach
  • Higher performance than GPPs
  • Higher flexibility compared to Extensible
    Processors
  • Shorter turnaround time (TAT) and lower design and
    verification cost compared to ASIPs and Extensible
    Processors

6
General overview of the architecture
[Figure: block diagram of the Adaptive Dynamic Extensible Processor.
Base Processor: N-way in-order general RISC with Fetch, Decode, Execute, Memory and Write stages and a Reg File.
Augmented Hardware: Profiler (detects start addresses of Hot Basic Blocks), Sequencer (switches between main processor and ACC), ACC (executes Custom Instructions).]
7
General overview of the architecture
  • Modes of operation
  • Training mode
  • Profiling
  • Detecting start address of Hot Basic Blocks
    (HBBs)
  • Generating Custom Instructions
  • Generating Configuration Data for the ACC
  • Binary rewriting
  • Initializing the Sequencer Table
  • Online training
  • Needs simple hardware for profiling
  • All tasks are run on the base processor
  • Offline training
  • Needs a PC trace after taken branches/jumps
  • Normal mode
  • Profiling (optional)
  • Executing Custom Instructions on the ACC and
    other parts of the code on the base processor

8
Components
[Figure: component diagram. Labels: DMA, Cache, Register File, Multi-Context Memory, ID/EXE Reg, Functional Unit, Online Training, Accelerator, Sequencer, Sequencer Table, Mux, Profiler, Profiler Table (HWT), Augmented HW, EXE/MEM Reg, GPP.]
9
Operation modes
[Figure: three panels (Training Mode, Training Mode, Normal Mode), each showing Applications running on the Processor with its Profiler, ACC and Sequencer.
Panel labels: Binary-Level Profiling; Detecting Start Address of HBBs; Running Tools for Generating Custom Instructions, Generating Configuration Data for ACC and Initializing Sequencer Table; Binary Rewriting; Monitors PC and Switches between main processor and ACC; Executing CIs.]
10
Profiler
[Flowchart: the Profiler Table has two columns, Basic Block Start Addr (BBSA) and Counter. The profiler compares the Current PC with the Previous PC. If the difference is not greater than the instruction length, nothing is done. If it is greater, the profiler asks "Is Current PC in the table?" Yes: increment the counter. No: add it as a new entry and set the counter to one.]
After a taken branch or jump we look at the BBSA column
to see if the target PC is in the table. If it is
a miss we add this address and initialize the
counter to 1; otherwise we increment its value.
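The table update described on this slide can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design: the class name `Profiler`, the method `observe`, and the 8-byte instruction length (taken from the address spacing in the listings) are assumptions. The sketch also treats any non-sequential PC change as a taken branch or jump, which is slightly broader than the slide's "greater than instruction length" comparison.

```python
class Profiler:
    """Toy model of the profiler table: basic block start address -> counter."""

    def __init__(self, instr_len=8):
        # instr_len=8 matches the address spacing in the listings (assumption).
        self.instr_len = instr_len
        self.prev_pc = None
        self.table = {}  # BBSA -> counter

    def observe(self, pc):
        """Feed the next program counter value to the profiler."""
        prev, self.prev_pc = self.prev_pc, pc
        if prev is None:
            return
        # Any non-sequential PC change is treated as a taken branch or jump
        # (assumption: this also covers backward branches).
        if pc - prev != self.instr_len:
            # Miss: new entry with counter 1. Hit: increment the counter.
            self.table[pc] = self.table.get(pc, 0) + 1


profiler = Profiler()
# Hypothetical trace: a two-instruction loop at 0x400d10 entered three times.
for pc in [0x400d10, 0x400d18, 0x400d10, 0x400d18, 0x400d10, 0x400d18]:
    profiler.observe(pc)
print({hex(addr): count for addr, count in profiler.table.items()})
```

The first pass through the loop is not counted because it is not preceded by a taken branch in this trace.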
11
Detecting Start Addr of HBBs
HBB
  • 400d10 addiu 29,29,-8
  • 400d18 addu 8,0,4
  • 400d20 sw 0,0(29)
  • 400d28 addu 4,0,0
  • 400d30 addu 7,0,0
  • 400d38 lui 9,49152
  • 400d40 sll 4,4,0x2
  • 400d48 and 2,8,9
  • 400d50 bne 2,0,400db8 <usqrt+0xa8>
  • 400d58 srl 2,2,0x1e
  • 400d60 lw 3,0(29)
  • 400d68 addu 4,4,2
  • 400d70 sll 8,8,0x2
  • 400d78 sll 6,3,0x1
  • 400d80 sll 3,3,0x2
  • 400d88 addiu 3,3,1
  • 400d90 sltu 2,4,3
  • 400d98 sw 6,0(29)

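[Figure: Profiler Table (columns BBSA, Counter) and HBB Table (columns HBBSA, Counter). An entry is hot when Counter > Threshold (Threshold = 100). HBB Table entries shown: 400d10 with counter 500 and 400db8 with counter 500. Other labels: BTA 400db8, 50, Taken Freq, Not taken part, sub, Exec Freq, Hot?, X.]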
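Selecting hot basic blocks from the profiler table amounts to a threshold filter. The sketch below assumes the table is a plain mapping from start address to counter; the function name `detect_hbbs` and the cold entry at 0x400e00 are made up for illustration, while the two hot addresses and the threshold of 100 come from the slide.

```python
def detect_hbbs(profiler_table, threshold):
    """Return the start addresses whose counter exceeds the threshold."""
    return sorted(addr for addr, count in profiler_table.items()
                  if count > threshold)


# 0x400e00 is a hypothetical cold block added for contrast.
table = {0x400d10: 500, 0x400db8: 500, 0x400e00: 50}
hot = detect_hbbs(table, threshold=100)
print([hex(addr) for addr in hot])
```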
12
Size of Profiler Table
Number of basic blocks with execution frequency above the threshold

Exec freq threshold    128    256    512   1024   2048
adpcm (enc)             28     28     28     28     28
basicmath              126    125    121    120    118
cjpeg                  290    216    192    127    114
djpeg                  163    154    108     48     35
lame                  1109    978    929    852    537
dijkstra               117    116    103    101    101
patricia               290    290    255    228    216
blowfish                87     87     84     23     17
rijndael (enc)         107    107    106     37     37
sha                     73     73     61     17     13
crc                     37     37     36     36     36
fft                     68     68     65     65     65
gsm                    364    362    329    328    319
13
Accelerator (ACC)
  • ACC is a matrix of Functional Units (FUs)
  • ACC has a two level configuration memory
  • A multi-context memory (keeps two or four configurations)
  • A cache
  • FUs support only logical operations,
    add/subtract, shifts and compare
  • ACC updates the PC
  • ACC has variable delay which depends on size of
    Custom Instruction

14
Connecting ACC to the Base Processor
[Figure: datapath connecting the ACC to the base processor. Labels: register file (Reg0 ... Reg31), Config Mem, Decoder, DEC/EXE Pipeline Registers, FU1, FU2, FU3, FU4, ACC, Sequencer, EXE/MEM Pipeline Registers.]
15
Connecting ACC to the Base Processor
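[Figure: the same datapath with the Sequencer shown at two points. Labels: register file (Reg0 ... Reg31), Config Mem, Decoder, Sequencer, DEC/EXE Pipeline Registers, FU1, FU2, FU3, FU4, ACC, Sequencer, EXE/MEM Pipeline Registers.]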
16
Sequencer
  • The sequencer mainly determines the microcode
    execution sequence.
  • Selects between decoder and config memory for
    reading RF
  • Selects between the output of Functional Unit and
    Accelerator
  • Decides when to switch between different
    contexts of multi-context memory
  • Determines when to load configuration data from
    cache to multi-context memory.
  • Checks the configuration data of custom
    instruction
  • If it is in multi-context memory, the custom
    instruction is executed on the accelerator
  • If it is not in multi-context memory
  • If there is enough time to load it from cache to
    multi-context memory, loads it and executes the CI on
    the ACC
  • If there is not enough time, the original code is
    executed.
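The configuration check in the last bullets can be written as a small decision function. This is a sketch only: the function name, the parameters, and the use of Python sets to stand in for the two configuration memories are assumptions, and eviction from the multi-context memory is not modeled.

```python
def choose_execution(ci_id, multi_context, cache, cycles_available, load_cycles):
    """Decide where a custom instruction (CI) runs: "acc" or "base".

    multi_context and cache are sets of CI identifiers whose configuration
    data is present in each memory (hypothetical representation).
    """
    if ci_id in multi_context:
        return "acc"
    if ci_id in cache and cycles_available >= load_cycles:
        # Enough time: load the configuration, then run on the ACC.
        multi_context.add(ci_id)
        return "acc"
    # Not enough time (or no configuration data): run the original code.
    return "base"


multi_context = {"ci1"}
cache = {"ci1", "ci2", "ci3"}
print(choose_execution("ci1", multi_context, cache, 0, 4))   # already loaded
print(choose_execution("ci2", multi_context, cache, 10, 4))  # loaded in time
print(choose_execution("ci3", multi_context, cache, 2, 4))   # falls back
```

Because the original code stays in the binary, the fallback path needs no special handling, which matches the "no penalty" claim made later in the deck.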

17
Generation of Custom Instructions
  • Custom instructions
  • Exclude floating point, multiply, divide and load
    instructions
  • Include at most one STORE, at most one
    BRANCH/JUMP and all other fixed point
    instructions
  • Simple algorithm for generating custom
    instructions
  • HBBs usually include 10 to 40 instructions for
    Mibench
  • The custom instruction generator will be
    executed on the base processor (in online
    training mode)

18
Generating Custom Instructions
  • 4052c0 addiu 29,29,-32
  • 4052c8 mov.d f0,f12
  • 4052d0 sw 18,24(29)
  • 4052d8 addu 18,0,6
  • 4052e0 sw 31,28(29)
  • 4052e8 sw 16,16(29)
  • 4052f0 mfc1 16,f0
  • 4052f8 mfc1 17,f1
  • 405300 srl 6,17,0x14
  • 405308 andi 6,6,2047
  • 405310 sltiu 2,6,2047
  • 405318 addu 6,6,18
  • 405320 sltiu 2,6,2047
  • 405328 lui 2,32783
  • 405330 and 17,17,2
  • 405338 andi 2,6,2047
  • 405340 sll 2,2,0x14
  • 405348 or 17,17,2
  • 405350 mtc1 16,f0
  • Finding the biggest sequence of instructions in
    the HBB that can be executed on the ACC
  • Moving the instructions and appending supportable
    instructions to the head of the detected
    instruction sequence after checking
    flow-dependency and anti-dependency
  • Moving the instructions and appending supportable
    instructions to the tail of the detected
    instruction sequence after checking
    flow-dependency and anti-dependency
  • Rewriting object code if instructions have been
    moved
  • Moving instructions should not modify the logic
    of the application
  • Custom instruction generation is done without
    considering any other constraints.
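The first step, finding the biggest supportable sequence in an HBB, can be sketched as a search for the longest contiguous run that obeys the stated limits (no loads, multiplies, divides or floating point; at most one store; at most one branch/jump). The mnemonic sets below are a small illustrative subset, and treating `mfc1`/`mtc1` as unsupported is an assumption; the function name is hypothetical. Instruction moving and object code rewriting are not modeled.

```python
UNSUPPORTED = {"lw", "lh", "lb", "lhu", "lbu",   # loads
               "mult", "multu", "div", "divu",  # multiply / divide
               "mov.d", "mfc1", "mtc1"}         # floating point related
STORES = {"sw", "sh", "sb"}
CONTROL = {"beq", "bne", "j", "jal", "jr"}


def longest_ci_candidate(mnemonics):
    """Return (start, end) of the longest contiguous supportable run.

    end is exclusive; ties keep the earliest run.
    """
    best = (0, 0)
    for start in range(len(mnemonics)):
        stores = controls = 0
        for end in range(start, len(mnemonics)):
            op = mnemonics[end]
            stores += op in STORES
            controls += op in CONTROL
            if op in UNSUPPORTED or stores > 1 or controls > 1:
                break
            if end + 1 - start > best[1] - best[0]:
                best = (start, end + 1)
    return best


# Mnemonics of the listing on this slide (4052c0 to 405350).
hbb = ["addiu", "mov.d", "sw", "addu", "sw", "sw", "mfc1", "mfc1", "srl",
       "andi", "sltiu", "addu", "sltiu", "lui", "and", "andi", "sll", "or",
       "mtc1"]
start, end = longest_ci_candidate(hbb)
print(start, end, hbb[start:end])
```

Under these assumptions the selected run is the ten instructions from `srl` (405300) to `or` (405348).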

19
Generating Custom Instructions
  • Block 3 (B3) is selected as the longest
    instruction sequence that can be executed on the
    ACC
  • Block 2 (B2) cannot be executed on the ACC
  • Block 1 (B1) can be executed on the ACC
  • If there is no flow dependency or anti-dependency
    between B1 and B2, exchange them.
  • The same is done for B4 and B5.
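The exchange test between two adjacent blocks can be expressed with register read and write sets. In this sketch each block is summarized as a pair of sets, which is a conservative simplification; the function name is hypothetical, and the output-dependency check is an extra safeguard that the slide does not mention.

```python
def can_exchange(first, second):
    """Check whether two adjacent blocks can swap places.

    Each block is a pair (reads, writes) of register-number sets;
    `first` currently executes before `second`.
    """
    reads1, writes1 = first
    reads2, writes2 = second
    flow = writes1 & reads2     # second reads a value produced by first
    anti = reads1 & writes2     # second overwrites a value first still reads
    output = writes1 & writes2  # both write the same register (extra check)
    return not (flow or anti or output)


# Hypothetical blocks: B1 reads and writes register 4, B2 reads 29, writes 3.
b1 = ({4}, {4})
b2 = ({29}, {3})
print(can_exchange(b1, b2))  # independent, so they may be exchanged

# B2 now reads register 4, which B1 writes: a flow dependency.
print(can_exchange(b1, ({4}, {3})))
```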

20
Example 1
  • 400d10 addiu 29,29,-8
  • 400d18 addu 8,0,4
  • 400d20 sw 0,0(29)
  • 400d28 addu 4,0,0
  • 400d30 addu 7,0,0
  • 400d38 lui 9,49152
  • 400d40 sll 4,4,0x2
  • 400d48 and 2,8,9
  • 400d50 srl 29,2,0x1e
  • 400d58 lw 3,0(29)
  • 400d60 addu 4,4,3
  • 400d68 sll 8,8,0x2
  • 400d70 sll 6,3,0x1
  • 400d78 sll 3,3,0x2
  • 400d80 addiu 3,3,1
  • 400d88 sltu 2,4,3
  • 400d90 sw 6,0(29)
  • 400d98 bne 2,0,400db8 <usqrt+0xa8>

Customized Instruction 1
Customized Instruction 2
21
Example 2 (rewriting obj code)
  • 400d10 addiu 29,29,-8
  • 400d18 addu 8,0,4
  • 400d20 addu 7,0,0
  • 400d28 lui 9,49152
  • 400d30 sll 4,4,0x2
  • 400d38 and 2,8,9
  • 400d40 srl 2,2,0x1e
  • 400d48 lw 22,0(29)
  • 400d50 addu 4,4,2
  • 400d58 sll 8,8,0x2
  • 400d60 sll 6,3,0x1
  • 400d68 sll 3,3,0x2
  • 400d70 sltu 2,4,3
  • 400d78 bne 2,0,400db8 <usqrt+0xa8>

22
ACC Config Data Generation Flow
[Figure: flow. Mibench Applications -> SimpleScalar (PISA configuration, base processor) -> Profiler -> Detecting start addresses of HBBs -> Reading HBBs from object code -> DFG.]
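The last stage of the flow produces a data-flow graph (DFG) for each HBB. A minimal way to derive one from a straight-line block is to link each source register to the most recent instruction that wrote it. The representation below (tuples of destination and source registers, edges as index pairs) and the function name are assumptions made for illustration, not the tool used by the authors.

```python
def build_dfg(instructions):
    """Build data-flow edges for a straight-line block.

    instructions: list of (dest, sources) with register numbers;
    dest may be None for instructions that write no register.
    Returns a list of (producer_index, consumer_index) edges.
    """
    last_writer = {}
    edges = []
    for idx, (dest, sources) in enumerate(instructions):
        for src in sources:
            if src in last_writer:
                edges.append((last_writer[src], idx))
        if dest is not None:
            last_writer[dest] = idx
    return edges


# Four instructions taken from the listings in this deck:
block = [
    (8, [0, 4]),  # addu 8,0,4
    (9, []),      # lui 9,49152
    (2, [8, 9]),  # and 2,8,9
    (2, [2]),     # srl 2,2,0x1e
]
edges = build_dfg(block)
print(edges)
```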
23
Preliminary Performance Evaluation
  • 400d10 addiu 29,29,-8
  • 400d18 addu 8,0,4
  • 400d20 sw 0,0(29)
  • 400d28 addu 4,0,0
  • 400d30 addu 7,0,0
  • 400d38 lui 9,49152
  • 400d40 sll 4,4,0x2
  • 400d48 and 2,8,9
  • 400d50 srl 2,2,0x1e

9 - 2 = 7 clock cycles
7 x freq = reduced clock cycles
7 x 50K = 350K clock cycles
Depth: 3
1st row: 1 clock
2nd row: 0.5 clock
3rd row: 0.5 clock
Total: 2 clocks
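The arithmetic on this slide can be reproduced with a short calculation. The sketch assumes one clock per instruction on the base processor and generalizes the slide's numbers to "1 clock for the first row, 0.5 clock for each further row"; that generalization and both function names are assumptions.

```python
def acc_delay(depth):
    """ACC latency in clocks: 1 for the first row, 0.5 for each further row."""
    return 1 + 0.5 * (depth - 1)


def saved_cycles(num_instructions, depth, exec_freq):
    """Clock cycles saved when a CI replaces num_instructions instructions.

    Assumes one clock per instruction on the base processor.
    """
    return (num_instructions - acc_delay(depth)) * exec_freq


print(acc_delay(3))                # delay of a depth-3 CI
print(saved_cycles(9, 3, 50_000))  # 9 instructions executed 50K times
```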
24
Results: Number of CIs by length
[Chart: x-axis is the length of CIs; one labeled value, 82, survives.]
25
Results: Percentage of CIs by length
[Chart: x-axis is the length of CIs.]
26
More info on Custom Instructions
App.             Exe Instr (M)  Threshold (K)  HBB   CI  Speedup  Code size  Exec time
basicmath_large            170             64   37   18     19.6        1.4       31.6
cjpeg                      101             32   42   52     27          1.5       44
djpeg                       25              8   22   32     31.5        0.8       48
lame                       260             32  142  104      8.6        1.1       16
dijkstra                   254             64   34   20     21.4        0.7       38.6
patricia                   217            128   51   17      7.8        0.6       14.6
blowfish                   260            128   18   28     33          2.7       59
rijndael (enc)             260            128   63   92     36          6.1       51.7
rijndael (dec)             259            128   63   78     36          4.5       51.7
sha                        154             64    9   13     52          1.1       73
adpcm (enc)                260           2000   14    8     21          0.32      42
adpcm (dec)                265           2000   12    5     24          0.24      41
crc                        265            512    4    2     20          0.1       44.9
fft                        189            128   43   19     18.6        0.93      30
fft (inv)                  190            128   43   19     18.6        0.93      30
gsm (cod)                  265            128   34   41     25.1        1.53      47.2
Average                      -              -   39   34     25          1.53      41.45
27
Conclusions
  • An Adaptive Dynamic Extensible Processor
  • Training mode and Normal mode
  • Advantages
  • It has a simple profiler
  • CIs are detected and added after production
  • There is no need for a new compiler
  • There is no need for new opcodes for CIs
  • There is no penalty for absence of CI config data
  • Lower design cost and shorter design time
  • By accelerating a small part of the code which has a
    high execution frequency, an average speedup of 25%
    can be obtained. Compared with a single-issue
    processor, the speedup ranges from
    7.8% to 52%.

28
Future Work
  • Linking HBBs
  • Providing more details on the architecture
    (Accelerator, sequencer, etc)
  • Designing an Accelerator to support conditional
    execution
  • Developing a complete framework
  • Extending ACC for floating point operations
  • Substituting the in-order base processor with an
    out-of-order one

29
  • Thank you for listening

30
Example
  • Application X
  • CIx1, 100, input 3
  • CIx2, 200, input 6
  • Total executed instruction 400,000
  • Application Y
  • CIy1, 50, input 4
  • CIy2, 400, input 6
  • Total executed instruction 800,000
  • Input < 5

31
Mapping Tool - Example
32
RFU Design: A Quantitative Approach
  • RFU or Accelerator is a matrix of ALUs
  • No of Inputs
  • No of Outputs
  • No of ALUs
  • Connections
  • Location of Inputs and Outputs
  • Some definitions
  • Considering frequency and weight in measurement
  • CI Execution Frequency
  • Weight (To equal number of executed instructions)
  • Average for all CIs (Σ Freq × Weight)
  • Rejection: percentage of CIs that could not be
    mapped on the RFU
  • Coverage: percentage of CIs that could be mapped
    on the RFU
  • Basic block: a sequence of instructions that
    terminates in a control instruction
  • Hot basic block: a basic block executed more
    often than a threshold
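One way to read the coverage and rejection definitions is as a frequency-and-weight weighted share of the CIs that fit a given RFU constraint. The sketch below follows that reading; the dictionary layout, the function name, the constraint "inputs < 5" and the weight of 1.0 are all illustrative assumptions, since the slide does not spell out how the weight is computed.

```python
def coverage_and_rejection(cis, fits):
    """Return (coverage, rejection) in percent, weighted by freq * weight.

    cis: list of dicts with 'freq' and 'weight' keys.
    fits: predicate telling whether a CI can be mapped on the RFU.
    """
    total = sum(ci["freq"] * ci["weight"] for ci in cis)
    mapped = sum(ci["freq"] * ci["weight"] for ci in cis if fits(ci))
    coverage = 100.0 * mapped / total
    return coverage, 100.0 - coverage


# Two hypothetical CIs with 3 and 6 inputs, tested against "inputs < 5".
cis = [
    {"freq": 100, "weight": 1.0, "inputs": 3},
    {"freq": 200, "weight": 1.0, "inputs": 6},
]
coverage, rejection = coverage_and_rejection(cis, lambda ci: ci["inputs"] < 5)
print(round(coverage, 2), round(rejection, 2))
```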

33
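RFU Inputs (no constraint)
[Chart: labeled values 96.37, 89.37 and 98.48; axis mark 8.]
34
RFU Outputs (no constraint)
[Chart: labeled value 96.58; axis mark 6.]
35
RFU Node No (Inputs: 8, Outputs: 8)
[Chart: labeled value 94.74; axis mark 16.]
36
RFU Width (Inputs: 8, Outputs: 8, Nodes: 16)
[Chart: labeled values 97.65 and 95.65; axis mark 6.]
37
RFU Depth (Inputs: 8, Outputs: 8, Nodes: 16)
[Chart: labeled value 93.41; axis mark 6.]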
38
RFU Configuration
  • Inputs: 8
  • Outputs: 8
  • Nodes: 16
  • Width: 6, 4, 3, 2, 1
  • Depth: 5

39
General overview of RFU (Architecture 1)
  • Inputs are applied to the first row
  • Outputs of each row are connected only to the
    inputs of the subsequent row
  • MOVE is used for transferring data
  • Rejection is 22.47%

40
General overview of RFU (Architecture 2)
  • Distributing Inputs in different rows
  • Row 1: 7
  • Row 2: 2
  • Row 3: 2
  • Row 4: 2
  • Row 5: 1
  • Connections with Variable Length
  • row 1 -> row 3: 1
  • row 1 -> row 4: 1
  • row 1 -> row 5: 1
  • row 2 -> row 4: 1
  • Rejection is 9.52%

41
Functional Units
  • Types for FUs
  • Type 1: logical (xor, nor, and, or)
  • Type 2: add, sub, compare
  • Type 3: shift (left/right)
  • Number of each type in the RFU
  • Type 1: 6
  • Type 2: 14
  • Type 3: 9

42
RFU with 8 outputs
[Figure: accelerator outputs. Labels: Accelerator; four Reg blocks; FU1-Output, FU2-Output, FU3-Output, FU4-Output; Sequencer/control bits (shown twice).]
43
Control Bits and Immediate Data
  • 287 bits are needed as Control Bits for
  • Multiplexers
  • Functional Units
  • 204 bits are needed for Immediates
  • Each CI configuration needs 287 + 204 = 491 bits

44
CI Configuration Memory
  • 2K x 1-bit multi-context memory -> 4 CI
    configurations
  • 8K x 1-bit cache -> 16 CI configurations
  • In total, 20 CI configurations can be kept in the
    configuration memories
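The capacities on this slide follow from the per-CI configuration size given on the previous slide (287 control bits plus 204 immediate bits). The short check below assumes K = 1024 and that configurations are stored whole; the constant and function names are illustrative.

```python
CONTROL_BITS = 287
IMMEDIATE_BITS = 204
BITS_PER_CI = CONTROL_BITS + IMMEDIATE_BITS  # 491 bits per CI configuration


def capacity(memory_bits):
    """Number of whole CI configurations that fit in a memory of this size."""
    return memory_bits // BITS_PER_CI


multi_context = capacity(2 * 1024)  # 2K x 1-bit multi-context memory
cache = capacity(8 * 1024)          # 8K x 1-bit cache
print(BITS_PER_CI, multi_context, cache, multi_context + cache)
```

The same counts result if K is taken as 1000.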

45
Extension of Custom Instructions over HBBs
Motivating Example
Block   No. of Exe. (M)   No. of Instr
B1                 11.6              5
B2                  5.8              1
B3                  5.8              4
B4                  8.6              3
B5                  5.2              3
B6                  5.6              1
B7                  5.8              2
B8                 11.6              2
B9                 11.6              6
B10                11.6              2
B11                11.6              4
B12                 5.8              3
46
Multi-Exit Custom Instructions
47
Conclusions
  • Adaptive Dynamic Extensible Processor
  • Binary Profiler
  • RFU (Inputs: 8, Outputs: 6, Nodes: 16,
    Width: 6, 4, 3, 2, 1, Depth: 5)
  • Sequencer
  • Adaptive Dynamic Extensible Processor
  • No design time
  • No extra read port and write port
  • No design and verification cost
  • No compiler
  • No new opcode
  • No penalty for absence of configuration data of
    custom instruction in multi-context memory.

48
Custom Instruction
  • Generated from HBBs
  • Using HBB table
  • Object code
  • Custom instruction can include
  • logical operations
  • add/sub
  • Shift
  • At most one store
  • At most one control instruction (jump/branch)
  • No load
  • No floating point instructions
  • New object code
  • Is logically equivalent to the original code

[Figure: Profiler Table with columns BBSA and Counter.]
49
Processor modes (1/2)
  • Training mode
  • Profiling applications
  • Detecting critical region of code
  • Generating DFG for critical regions
  • Generating custom instruction from DFGs
  • Generating new object code
  • Generating data for accelerator configuration
    memories and initializing sequencer table
  • Training can be done in the gap between two
    consecutive executions of the application if
    possible, otherwise just once before the processor
    starts its normal operation

50
Processor modes (2/2)
  • Normal mode
  • Profiling applications
  • Using the data generated in training mode to
    execute custom instructions on the accelerator.
  • Critical regions of the code are executed as
    custom instructions on the accelerator and the
    remaining code is executed on the processor's
    functional unit as usual.

51
Online Profiler-Components
  • Profiler
  • Hardware
  • Software
  • Hardware
  • Comparator: compares the current value and the
    previous value of the Program Counter (PC).
  • Profiler Table: for each taken branch/jump
    target address, this table holds a
    corresponding counter. The counter counts how
    many taken branches or jumps have targeted
    that address.
  • Software
  • Hot Basic Block (HBB) detector
  • A basic block is a sequence of instructions that
    ends in a branch or jump.

52
Architecture Advantages
  • No compiler
  • No new opcode
  • No penalty for absence of configuration data of
    custom instruction in multi-context memory.
  • The ability to use processor functional unit and
    accelerator in parallel.
  • Custom instruction detection and execution are
    done fully automatically and transparently.

53
General overview of the architecture
  • Base processor (1,2 or 4-way in-order general
    RISC)
  • Profiler
  • Detects start address of Hot Basic Blocks (HBBs)
  • Accelerator (ACC)
  • Executes Custom Instructions
  • Sequencer
  • Determines the microcode execution sequence using
    the sequencer table