Title: Effective Compilation Support for Variable Instruction Set Architecture
1. Effective Compilation Support for Variable Instruction Set Architecture
- Jack Liu
- Timothy Kong
- Fred Chow
- Cognigine Corp.
- www.cognigine.com
2. Outline
- VISC Architecture
- Compile-time Configurable Code Generation
- Managing the Dictionary
- Concluding Remarks
3. Configurable Computing
- Motivation
  - Higher performance
    - Processor and instruction set customized to type of application
  - Lower hardware cost
    - Non-essential features excluded
  - Shorter time-to-market
4. Variable Instruction Set Architecture (VISC Architecture™)
- A new approach to configurable computing
- Fixed processor hardware
- Many types of operations provided
- Numerous instruction variants (CISC-style)
- Per-program instruction set tailoring during compile time
5. Background of This Work
- Cognigine CGN16100 Network Processor
- Single-chip, fully programmable network processor
- Processing cores
  - 16 Re-configurable Communications Units (RCU) processor cores
  - VISC architecture
  - 4 64-bit parallel execution units
  - Multi-threaded
- 512 KB on-chip memory (text and data)
6. VISC Architecture™
- Dictionary (instruction set for current program)
  - 256 entries
  - Dictionary entry: 32-bit (2 operations), 64-bit (4 operations), 128-bit (8 operations)
- Instruction
  - Fields: opcode, opnd0, opnd1, opnd2, opnd3
  - Opcode: 8-bit
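The relationship between the 8-bit opcode and the 256-entry dictionary can be sketched as a table lookup. This is only an illustration under stated assumptions, not the actual CGN16100 encoding: the names `DictionaryEntry`, `Instruction`, and `decode` are hypothetical, and the only facts taken from the slide are the 8-bit opcode, the 256-entry limit, the four operand fields, and the maximum of 8 operations per entry.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_ENTRIES = 256        # dictionary size stated on the slide
MAX_OPS_PER_ENTRY = 8    # assumed: the largest entry holds 8 operations
MAX_OPERANDS = 4         # opnd0..opnd3


@dataclass(frozen=True)
class DictionaryEntry:
    ops: Tuple[str, ...]          # operation names, specified once per program


@dataclass(frozen=True)
class Instruction:
    opcode: int                   # 8-bit index into the dictionary
    operands: Tuple[str, ...]     # operand fields shared by the entry's operations


def decode(instr: Instruction, dictionary: List[DictionaryEntry]) -> Tuple[str, ...]:
    """Return the operations that a single instruction selects."""
    if not 0 <= instr.opcode < MAX_ENTRIES:
        raise ValueError("opcode must fit in 8 bits")
    if len(dictionary) > MAX_ENTRIES:
        raise ValueError("dictionary overflow")
    if len(instr.operands) > MAX_OPERANDS:
        raise ValueError("at most four operand fields")
    entry = dictionary[instr.opcode]
    if len(entry.ops) > MAX_OPS_PER_ENTRY:
        raise ValueError("too many operations in one entry")
    return entry.ops


# Hypothetical two-entry dictionary and one instruction that selects entry 1.
dictionary = [
    DictionaryEntry(("add",)),
    DictionaryEntry(("add", "xor", "sub", "nop")),
]
instr = Instruction(opcode=1, operands=("8(p1)", "w70", "0x55", "0xf8"))
selected = decode(instr, dictionary)
```

Because the operation details live in the dictionary, the instruction itself only needs the small opcode plus its operands, which is where the encoding density comes from.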
7. Motivation for VISC Architecture
- Efficient way to encode/decode the many operation variants with different addressing modes
  - Not all used in each program
- High instruction encoding density
  - Small opcode bit count
  - Operands shared among multiple operations
- Simplified control logic for VLIW-style ILP
  - Up to 8 operations per cycle
8. Operation Specification
- In Dictionary Entry (only specified once)
  - Operation name
  - Operation variants
    - Signed and unsigned
    - Operand and result sizes: 8-bit, 16-bit, 32-bit, 64-bit
    - Support different sizes among operand(s) or result
    - Vector: 64v8, 64v16, 64v32, 32v8, 32v16
  - Data path to each operand/result
- In Instruction
  - Operands encoding formats
  - Actual operands
9. RCU Architecture
- 5-stage pipeline
- 4-way multi-threaded
- Hardware RSF synchronization
- 128-bit reconfigurable address path
- 256-bit reconfigurable data path
10. Roles of Compiler for VISC Architecture
- Determine best instruction set stored in dictionary for best execution time performance
- Generate optimized code sequence based on best instruction set
- Cater to various hardware limitations
  - Dictionary limit
  - Data path constraints
  - Dictionary and instruction encoding constraints
11. New Compilation Approach: Configurable Code Generation
- Exact form of generated instructions decided in the last instruction scheduling phase
- Direct result of instruction compaction based on what is allowed by the hardware
12. Compiler Implementation Method
- Retarget SGI Pro64 (Open64) compiler to an Abstract Machine
  - Code generator operates on an Abstract Operation Representation
  - Code generation optimizations left intact
- Add new Instruction and Dictionary Finalization (IDF) phase as post-pass
  - IDF Phase 1
    - Instruction scheduling and folding
    - Abstract operations converted to target code sequence
  - IDF Phase 2
    - Output VISC instructions and dictionary entries
13. Compiler Phase Structure
- Input: C
- GNU / Pro64™ Front-end
- WHIRL Optimizer
- Pro64™ Back-end
- Code Generator
- IDF
- Output: Assembly Program (Instructions, Dictionary)
14. Abstract Operation Representation (AOR)
- Each operation corresponds to a micro-operation in the core execution units
- RISC-like formats
  - r1 = op r2, r3
  - r2 = load <offset>(<base>)
  - store r2 <offset>(<base>)
  - r1 = loadimm <imm>
- Optimizations in AOR reflected in final code
- No pre-disposition of compiler to any specific instruction format
15. Multiple AOR ops can be combined to single target operation
- Operations taking immediate operand
  - r2 = move <imm>; r3 = add r1, r2  =>  r3 = addi r1, <imm>
- Operations supporting memory operands
  - r2 = load 4(sp); r3 = add r1, r2  =>  r3 = add r1, 4(sp)
- Post incre/decre memory operations
  - r2 = load 0(r1); r1 = addi r1, 4  =>  r2 = load 0(r1) (post-increment form)
- Branches on condition codes
  - r1 = add r2, r3; . . . ; compare (r1 != 0); br.z label  =>  r1 = add r2, r3; br.z label (only if immediately after)
- Others
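The first two folding patterns above can be sketched as a small rewrite over adjacent operations. This is a simplified illustration, not the IDF implementation (which folds during scheduling): the `Op` tuple and `fold_pairs` are hypothetical names, only the immediate-operand and memory-operand cases are handled, and the temporary register is assumed dead at the end of the block.

```python
from typing import List, NamedTuple, Optional, Tuple


class Op(NamedTuple):
    dest: Optional[str]        # None for operations without a result
    name: str
    srcs: Tuple[str, ...]


def fold_pairs(ops: List[Op]) -> List[Op]:
    """Fold a move-immediate or load into the add that immediately consumes it."""
    out: List[Op] = []
    i = 0
    while i < len(ops):
        cur = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        foldable = (
            nxt is not None
            and cur.dest is not None
            and cur.name in ("move", "load")
            and nxt.name == "add"
            and len(nxt.srcs) == 2
            and nxt.srcs[1] == cur.dest
            and nxt.srcs[0] != cur.dest
            # Conservative check: the temporary must not appear in any later
            # source operand, including inside address strings such as 0(r2).
            and not any(cur.dest in s for later in ops[i + 2:] for s in later.srcs)
        )
        if foldable:
            # An immediate selects the addi variant; a memory operand keeps add.
            new_name = "addi" if cur.name == "move" else "add"
            out.append(Op(nxt.dest, new_name, (nxt.srcs[0], cur.srcs[0])))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


imm_case = fold_pairs([Op("r2", "move", ("0x10",)), Op("r3", "add", ("r1", "r2"))])
mem_case = fold_pairs([Op("r2", "load", ("4(sp)",)), Op("r3", "add", ("r1", "r2"))])
```

In both cases the two abstract operations collapse into one target operation, so the producer no longer occupies a slot of its own.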
16. IDF Approach
- Instruction scheduling combined with the following tasks
  - Instruction folding
  - Opcode selection
  - Modelling of irregular hardware constraints
  - Modelling of encoding constraints
  - Monitoring of states of condition codes and transient registers
  - Keeping track of dictionary contents
- Use enumeration (branch and bound) approach
17. Example of IDF Processing
- Input
  - w80 = move 0x55
  - w91 = move 0xf8
  - w70 = add w70, w80
  - w71 = xor w92, w80
  - w90 = sub w92, w91
  - store 8(p1) w90
- Dictionary
  - Entry 3: add xor sub nop
- Instruction
  - op3 8(p1) w70 0x55 0xf8
- move and store instructions subsumed
- w71, w92 mapped to transient registers
18. IDF Scheduling Algorithm
- Flow
  - Input: sequence of operations in a basic block (BB)
  - Estimate initial bound_sch
  - Search for schedule with length < bound_sch
  - Branch on the result (yes / no); bound_sch is stepped by 1 between searches
- To speed up the search
  - Shrink solution space by
    - Coming up with high initial bound_sch
    - Pruning useless search paths continuously
  - Tight hardware constraints help
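A bounded enumeration of this kind can be sketched as follows. This is a toy model under loud assumptions, not the IDF algorithm itself: it assumes unit latency, a single issue-width limit as the only hardware constraint, and a schedule-length bound that starts from a lower bound, is treated as inclusive, and is raised by one whenever no schedule fits (the slide does not spell out the update rule). It also ignores folding, encoding constraints, and the dictionary. All function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Set


def critical_path(deps: Dict[str, Set[str]]) -> int:
    """Length of the longest dependence chain (unit latency assumed)."""
    memo: Dict[str, int] = {}

    def depth(op: str) -> int:
        if op not in memo:
            memo[op] = 1 + max((depth(p) for p in deps[op]), default=0)
        return memo[op]

    return max((depth(op) for op in deps), default=0)


def topo_order(deps: Dict[str, Set[str]]) -> List[str]:
    order: List[str] = []
    seen: Set[str] = set()

    def visit(op: str) -> None:
        if op not in seen:
            seen.add(op)
            for p in sorted(deps[op]):
                visit(p)
            order.append(op)

    for op in deps:
        visit(op)
    return order


def search(deps: Dict[str, Set[str]], width: int, bound: int) -> Optional[List[List[str]]]:
    """Enumerate cycle assignments; return a schedule of at most `bound` cycles."""
    order = topo_order(deps)
    cycle_of: Dict[str, int] = {}
    load = [0] * bound

    def place(i: int) -> bool:
        if i == len(order):
            return True
        op = order[i]
        earliest = max((cycle_of[p] + 1 for p in deps[op]), default=0)
        for c in range(earliest, bound):      # the bound prunes deeper paths
            if load[c] < width:
                cycle_of[op] = c
                load[c] += 1
                if place(i + 1):
                    return True
                load[c] -= 1
                del cycle_of[op]
        return False

    if not place(0):
        return None
    schedule: List[List[str]] = [[] for _ in range(bound)]
    for op in order:
        schedule[cycle_of[op]].append(op)
    return schedule


def schedule_bb(deps: Dict[str, Set[str]], width: int) -> List[List[str]]:
    # A high (tight) initial bound avoids wasted searches at infeasible lengths.
    bound = max(math.ceil(len(deps) / width), critical_path(deps))
    while True:
        found = search(deps, width, bound)
        if found is not None:
            return found
        bound += 1


# Dependences of a six-operation block; operation names are illustrative.
deps = {
    "move80": set(), "move91": set(),
    "add": {"move80"}, "xor": {"move80"},
    "sub": {"move91"}, "store": {"sub"},
}
wide = schedule_bb(deps, width=4)
narrow = schedule_bb(deps, width=1)
```

The tighter the hardware constraints, the fewer placements survive each step of `place`, which is why the enumeration stays tractable for basic-block-sized inputs.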
19. Managing the Dictionary
- Dictionary usage increases due to
  - Program size: more variety of operations
  - High ILP: more combination of operations
  - Library code linked in
- Currently, dictionary contents fixed for each executable
- Role of linker
  - Merge dictionary entries with identical contents across files/libraries
  - Error message on dictionary overflow
- Role of compiler
  - Maximize dictionary entry re-use
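The linker's role can be sketched as a merge that assigns one slot to each distinct entry and reports overflow. This is a minimal sketch under assumptions: `merge_dictionaries`, `DictionaryOverflow`, and the representation of an entry as a tuple of operation names are hypothetical, and the 256-entry limit is taken from the architecture slide.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, ...]   # assumed representation: operation names in one entry
DICT_LIMIT = 256


class DictionaryOverflow(Exception):
    """Raised when the merged dictionary would exceed the hardware limit."""


def merge_dictionaries(
    files: Dict[str, List[Entry]], limit: int = DICT_LIMIT
) -> Tuple[List[Entry], Dict[str, List[int]]]:
    """Merge per-file dictionaries so identical entries share one slot.

    Returns the merged entries and, per file, a map from local entry index
    to merged index (needed to rewrite the opcodes of that file).
    """
    merged: List[Entry] = []
    slot: Dict[Entry, int] = {}
    remap: Dict[str, List[int]] = {}
    for name, entries in files.items():
        remap[name] = []
        for entry in entries:
            if entry not in slot:
                if len(merged) >= limit:
                    raise DictionaryOverflow(f"dictionary overflow while linking {name}")
                slot[entry] = len(merged)
                merged.append(entry)
            remap[name].append(slot[entry])
    return merged, remap


files = {
    "a.o": [("add",), ("add", "xor")],
    "b.o": [("add", "xor"), ("sub",)],
}
merged, remap = merge_dictionaries(files)
```

Here the two files contribute four local entries but only three merged ones, because `("add", "xor")` appears in both.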
20. Dictionary Compilation
- Strategy
  - Keep track of existing dictionary entries during compilation
- Extract dictionary entries from
  - Libraries and .s files being linked
  - .o files compiled before current file
  - Example: cc a.c b.o c.s
- Maintain table of existing dictionary entries
  - Add to table as new entries are generated
  - Re-use existing dictionary entries
- Bias scheduling towards dictionary conservation as dictionary fills up
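One way to picture the table and the scheduling bias is a lookup structure whose cost for a new entry grows as the dictionary fills. This is a sketch under assumptions, not the compiler's data structure: `DictionaryTable`, its methods, and in particular the cost formula are hypothetical and chosen only to show the direction of the bias.

```python
from typing import Dict, Iterable, Tuple

Entry = Tuple[str, ...]   # assumed representation: operation names in one entry


class DictionaryTable:
    def __init__(self, existing: Iterable[Entry] = (), limit: int = 256):
        self.limit = limit
        self.index: Dict[Entry, int] = {}
        # Seed with entries extracted from libraries, .s files, and earlier .o files.
        for entry in existing:
            self.intern(entry)

    def fill_ratio(self) -> float:
        return len(self.index) / self.limit

    def cost(self, entry: Entry) -> float:
        """Scheduling cost of using `entry` (hypothetical formula).

        Re-using an existing entry is free; a new entry becomes more
        expensive as the dictionary fills up.
        """
        if entry in self.index:
            return 0.0
        return 1.0 + 4.0 * self.fill_ratio()

    def intern(self, entry: Entry) -> int:
        """Return the slot of `entry`, adding it to the table if it is new."""
        if entry not in self.index:
            if len(self.index) >= self.limit:
                raise OverflowError("dictionary full")
            self.index[entry] = len(self.index)
        return self.index[entry]


# Small limit so the effect of filling up is visible.
table = DictionaryTable(existing=[("add",), ("add", "xor")], limit=4)
reuse_cost = table.cost(("add",))
new_cost_before = table.cost(("sub",))
table.intern(("sub",))
new_cost_after = table.cost(("mul",))
```

A scheduler that adds this cost to its objective will prefer combinations that already exist, and will do so more strongly late in the compilation.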
21. User Control of Dictionary Compilation
- Best program performance demands near-full dictionary
- When the dictionary overflows, the program needs to be re-compiled
- Provide user control mechanisms
  - Trade-off between dictionary consumption and program performance
  - Command line option: -CG:dict_usage=n (n from 0 to 10)
  - Embedded in code: pragma dict_usage n
- dict_usage is the dictionary budget guideline for IDF
- Low dict_usage
  - Fewer new dictionary entries created
  - Low ILP
- High dict_usage
  - Tighter instruction schedule
  - More dictionary entries created
22. IDF Support of dict_usage
- Additional search goal: bound_dict
  - Number of new dictionary entries allowed for current BB
  - Automatically adjusted lower with more pre-existing entries
- When bound_dict reached during enumeration, disallow creating new dictionary entry (unless single operation)
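The budget check can be sketched as two small functions. The slide gives only the behaviour (a per-block limit that shrinks as pre-existing entries grow, with single-operation entries always allowed); the scaling formula in `initial_bound_dict` and both function names are assumptions made for illustration.

```python
from typing import Tuple


def initial_bound_dict(dict_usage: int, preexisting: int, limit: int = 256) -> int:
    """Hypothetical budget of new entries for one basic block.

    Scales the user's dict_usage (0 to 10) by the share of dictionary slots
    still free, so the budget drops as pre-existing entries accumulate.
    """
    if not 0 <= dict_usage <= 10:
        raise ValueError("dict_usage must be between 0 and 10")
    free_share = max(limit - preexisting, 0) / limit
    return int(dict_usage * free_share)


def may_create_entry(entry_ops: Tuple[str, ...], new_entries_so_far: int, bound_dict: int) -> bool:
    """Decide whether the enumeration may create a new dictionary entry."""
    if len(entry_ops) == 1:
        # Single-operation entries are always allowed, so code can still be emitted.
        return True
    return new_entries_so_far < bound_dict


budget_empty = initial_bound_dict(10, preexisting=0)
budget_half = initial_bound_dict(10, preexisting=128)
```

With the budget exhausted, the search falls back to existing entries or single-operation entries, which lowers ILP but keeps the dictionary from overflowing.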
23. Experimental Results
- Summary (with dict_usage = 10)
  - ILP from IDF scheduling: 1.38 ops per instruction
  - ILP from relaxed scheduling: 1.51 ops per instruction
  - 23% of all subsumable operations subsumed
  - Each dictionary entry referred to by 2.63 instructions (statically)
  - Scheduling via enumeration 100 times slower than one-pass schedulers
  - Compilation time: 1 to 2 minutes per program
24. Concluding Remarks
- VISC approach most suitable for embedded processors
  - Limited program size
  - Dictionary space less of an issue
  - Slow compilation tolerable
- CISC-style instructions enable small code size
- Compilation support key to deploying applications on VISC
  - Very hard to write in assembly language
  - Advanced optimizations performed by compiler
  - Dictionary managed by compiler with user hints
- Compile-time configurable code generation enables RISC compilation techniques to generate CISC output