1
A Framework for Studying Effects of VLIW
Instruction Encoding and Decoding Schemes
  • Anup Gangwar

November 28, 2001
2
Overview
  • The VLIW code size expansion problem
  • What should such a framework support?
  • Trimaran compiler infrastructure
  • The HPL-PD architecture
  • Extensions to the various modules of Trimaran
  • Results
  • Future work
  • Acknowledgements

3
Choices for exploiting ILP
  • The architectural choices for utilizing ILP
  • Superscalar processors
      • Try to extract ILP at run time
      • Complex hardware
      • Limited clock speeds and high power dissipation
      • Not well suited to embedded applications
  • VLIW processors
      • Compiler has a lot of knowledge about the hardware
      • Compiler extracts ILP statically
      • Simplified hardware
      • Possible to attain higher clock speeds

4
Problems with VLIW processors
  • Complex compiler required to extract ILP from
    the application program
  • Requires adequate support in hardware for
    compiler-controlled execution
  • Code size expansion due to explicit NOPs if:
      • The application does not contain enough
        parallelism
      • The compiler is not able to extract parallelism
        from the application
  • Need for good instruction encoding and NOP
    compression schemes
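
The cost of explicit NOPs can be illustrated with a small calculation. The sketch below is only an assumption-laden example: the 4-slot schedule, the 32-bit operation width and the slot-mask compression scheme are all hypothetical and are not taken from the presentation.

```python
def uncompressed_bits(schedule, num_slots, op_bits=32):
    """Size when every issue slot is encoded each cycle, NOPs included."""
    return len(schedule) * num_slots * op_bits


def masked_bits(schedule, num_slots, op_bits=32):
    """Size when each instruction is a slot mask followed by its real ops only."""
    return sum(num_slots + len(used_slots) * op_bits for used_slots in schedule)


# Hypothetical 4-slot schedule: each entry is the set of slots doing real work.
schedule = [{0, 2}, {1}, set(), {0, 1, 2, 3}]
print(uncompressed_bits(schedule, 4))  # 512
print(masked_bits(schedule, 4))        # 240
```

With little parallelism in the schedule, most of the uncompressed bits are NOPs, which is the motivation for studying encoding schemes.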

5
What should such a framework support?
  • The framework should be quickly retargetable
  • Studying the effect of a particular instruction
    encoding and decoding scheme on processor
    performance
  • Studying the code size minimization achieved by a
    particular instruction encoding scheme
  • Studying the memory bandwidth requirements imposed
    by a particular instruction decoding scheme

6
Trimaran Compiler Infrastructure
Compilation flow: C Program → IMPACT → Bridge Code → ELCOR → ELCOR IR → SIMULATOR → STATISTICS
IMPACT
  • ANSI C parsing
  • Code profiling
  • Classical machine independent optimizations
  • Basic block formation

ELCOR
  • Machine dependent code optimizations
  • Code scheduling
  • Register allocation

SIMULATOR and STATISTICS
  • ELCOR IR to low level C files
  • HPL-PD virtual machine
  • Cache simulation
  • Performance statistics
  • Compute and stall cycles
  • Cache stats
  • Spill code info

HMDES Machine Description (parameterizes ELCOR)
7
Various modules of Trimaran - 1
  • IMPACT
      • Developed by UIUC's IMPACT group
      • Trimaran uses only the IMPACT front-end
      • Classical machine independent optimizations
      • Outputs a low level IR, the Trimaran bridge code
  • ELCOR
      • Developed by HPL's CAR group
      • It is the compiler backend
      • Performs register allocation and code
        scheduling
      • Parameterized by the HMDES machine description
      • Outputs ELCOR IR with annotated HPL-PD assembly

8
Various modules of Trimaran - 2
  • HMDES
      • Developed by UIUC's IMPACT group
      • Specifies resource usage and latency information
        for an architecture
      • Input is translated to a low level representation
      • Has efficient mechanisms for querying the
        database
      • Does not specify instruction format information
  • HPL-PD Simulator
      • Developed by NYU's REACT-ILP group
      • Converts ELCOR's annotated IR to a low level C
        representation
      • Processor performance and cache simulation
      • Generates statistics and execution trace

9
Various modules of Trimaran - 3
Example ELCOR Operation in IR
Op 7 ( ADD_W [br<11:I gpr 14>] [br<27:I gpr 14> I<1>] p<t>
       s_time( 3 ) s_opcode( ADD_W.0 )
       attr(lc ^52) flags( sched ) )
10
Various modules of Trimaran - 4
  • HMDES Sections
      • Field_Type, e.g. REG, Lit etc.
      • Resource, e.g. Slot0, Slot1 etc.
      • Resource_Usage, e.g. RU_slot0 time( 0 )
      • Reservation_Table, e.g. RT_slot0 use( Slot0 )
      • Operation_Latency, e.g. lat1 ( time( 1 ) )
      • Scheduling_Alternative, e.g. (format(std1)
        resv(RT1) latency(lat1) )
      • Operation, e.g. ADD_W.0 ( Alt_1 Alt_2 )
      • Elcor_Operation, e.g. ADD_W( op( ADD_W.0
        ADD_W.1 ) )
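
How these sections relate can be sketched outside HMDES. The Python data below only mirrors the section names listed above; it is not HMDES syntax, and the table contents and the `pick_alternative` helper are hypothetical.

```python
# Hypothetical Python mirror of the HMDES sections (not HMDES syntax).
RESERVATION_TABLES = {            # table name -> [(resource, time offset)]
    "RT_slot0": [("Slot0", 0)],
    "RT_slot1": [("Slot1", 0)],
}
OPERATION_LATENCIES = {"lat1": 1}
OPERATIONS = {                    # operation -> (reservation table, latency)
    "ADD_W.0": ("RT_slot0", "lat1"),
    "ADD_W.1": ("RT_slot1", "lat1"),
}
ELCOR_OPERATIONS = {"ADD_W": ["ADD_W.0", "ADD_W.1"]}


def pick_alternative(elcor_op, cycle, busy):
    """Reserve and return the first alternative whose resources are free."""
    for name in ELCOR_OPERATIONS[elcor_op]:
        table, _latency = OPERATIONS[name]
        uses = {(res, cycle + t) for res, t in RESERVATION_TABLES[table]}
        if not uses & busy:
            busy.update(uses)
            return name
    return None  # no alternative fits in this cycle


busy = set()
print(pick_alternative("ADD_W", 0, busy))  # ADD_W.0
print(pick_alternative("ADD_W", 0, busy))  # ADD_W.1
print(pick_alternative("ADD_W", 0, busy))  # None
```

The point of the sketch is that one compiler-level operation (ADD_W) maps to several machine alternatives, each tied to a reservation table that a scheduler can query.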

11
Various modules of Trimaran - 5
HPL-PD Simulator in detail
(Block diagram. Boxes: REBEL, HMDES, Code Processor, Low level C files, C libraries, Emulation Library, Native Compiler, Executable for the host platform)
12
Various modules of Trimaran - 6
HPL-PD Simulator in detail
HPL-PD Virtual Machine
  • Fetch Next Instruction
  • Fetch Data
  • Execute Instruction
Instruction Accesses and Data Accesses are passed to the cache simulator
Dinero IV Cache Simulator
  • Level I Instruction-Cache
  • Level I Data-Cache
  • Level II Unified Cache
13
The HPL-PD architecture
  • Parameterized ILP architecture from HP Labs
  • Possible to vary:
      • Number and types of FUs
      • Number and types of registers
      • Width of instruction words
      • Instruction latencies
  • Predicated instruction execution
  • Compiler visible cache hierarchy
  • Result multicast is supported for predicate
    registers
  • Run time memory disambiguation instructions

14
The HPL-PD memory hierarchy
Registers
L1 Cache
Data Prefetch Cache
  • Independent of L1 Cache
  • Used to store large amounts of cache polluting
    data
  • Does not require a sophisticated cache replacement
    mechanism

L2 Cache
Main Memory
15
The Framework
(Block diagram. Boxes: HMDES, Decoder Model, TRIMARAN, ASSEMBLER (using NJMC), Obj. File, DISASSEMBLER (using NJMC). Outputs: Perf. Stats, Cache Stats, Code Size. Signals: Instruction Address or Next Instr Request, Instruction Address, Bytes Fetched)
16
Studying impact on performance
  • Modeling the decompressor in HMDES:
      • Add a new resource with the latency of the decoder
      • Add a new resource usage section for this decoder
      • Add this resource usage to all the HPL-PD
        operations
  • The reported results use two decompressor units,
    each with latency 1
  • The latency of the decompressor should be estimated
    or obtained from actual simulation
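
One way to see the effect of this modeling is a toy scheduler in which every operation also reserves a decoder unit. This is a sketch of one possible reading of the slide, not the HMDES description itself: the slot names, the `Dec0`/`Dec1` resources and the greedy scheduler are assumptions, and the eight operations are taken to be independent.

```python
from itertools import product

SLOTS = ["Slot0", "Slot1", "Slot2", "Slot3"]  # hypothetical issue slots
DECODERS = ["Dec0", "Dec1"]                   # two decompressor units
DECODER_LATENCY = 1


def earliest_cycle(alternatives, busy):
    """Reserve the first alternative that fits and return its issue cycle."""
    cycle = 0
    while True:
        for table in alternatives:
            uses = {(res, cycle + t) for res, t in table}
            if not uses & busy:
                busy.update(uses)
                return cycle
        cycle += 1


def schedule(num_ops, alternatives):
    """Greedily place independent ops; returns the issue cycle of each op."""
    busy = set()
    return [earliest_cycle(alternatives, busy) for _ in range(num_ops)]


plain = [[(slot, 0)] for slot in SLOTS]
with_decoder = [
    [(slot, 0)] + [(dec, t) for t in range(DECODER_LATENCY)]
    for slot, dec in product(SLOTS, DECODERS)
]
print(schedule(8, plain))         # [0, 0, 0, 0, 1, 1, 1, 1]
print(schedule(8, with_decoder))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

In this toy setting the decoder resource, not the number of issue slots, becomes the limit on operations issued per cycle.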

17
Studying code size minimization - 1
A simple template-based instruction encoding scheme
Issue slots: IALU.0, IALU.1, FALU.0, MU.0, BU.0
MUL_OP format: MUL_OP | OPCODE OPERANDS | OPCODE OPERANDS | ..
Example, ADD_W and L_W_C1_C1: 00010 | IOP Sgpr1, Slit1, Dgpr2 | MemOP Sgpr1, Dgpr1
  • Multi-ops are decided after profiling the
    generated assembly code.
  • The multi-op field encodes:
      • Size and position of each uni-op
      • Number, size and position of the operands of
        each uni-op
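
Choosing multi-op templates from a profile can be sketched as a frequency count. The example below is hypothetical throughout: the profile, the 3-bit template field, the 32-bit uni-op width and the fallback to a full-width template are assumptions made for illustration.

```python
from collections import Counter

ISSUE_SLOTS = ("IALU.0", "IALU.1", "FALU.0", "MU.0", "BU.0")


def choose_templates(profile, k):
    """Pick the k slot combinations seen most often in the profiled code."""
    return [combo for combo, _count in Counter(profile).most_common(k)]


def encoded_bits(instr, templates, tmpl_bits=3, op_bits=32):
    """Template field plus one uni-op per slot named by the template."""
    if instr in templates:
        return tmpl_bits + len(instr) * op_bits
    # No matching template: fall back to encoding every issue slot.
    return tmpl_bits + len(ISSUE_SLOTS) * op_bits


# Hypothetical profile: slots used by each instruction of the assembly code.
profile = (
    [("IALU.0", "MU.0")] * 3
    + [("IALU.0",)] * 2
    + [("FALU.0", "BU.0")]
)
templates = choose_templates(profile, 2)
print(templates)  # [('IALU.0', 'MU.0'), ('IALU.0',)]
print(sum(encoded_bits(i, templates) for i in profile))  # 434
```

For comparison, encoding all five slots in every one of the six instructions would take 6 × (3 + 160) = 978 bits under the same assumptions.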

18
Studying code size minimization - 2
  • Instrumenting ELCOR to generate assembly code
      • 1. Arrange all the ops in the IR in forward
        control order
      • 2. Choose the next basic block and initialize
        cycle to 0
      • 3. Walk the ops of this BB and dump those whose
        s_time equals cycle; increment cycle and repeat
        until all ops of the BB are dumped
      • 4. If BBs are left, go to step 2
      • 5. Dump the global data
  • Actual instruction encoding is done using
    procedures created by NJMC
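
The dumping steps above can be sketched as follows. This is not ELCOR code: the dictionary-based op representation, the `||` separator between ops of one cycle and the explicit NOP line for empty cycles are assumptions for illustration.

```python
def dump_assembly(basic_blocks, global_data):
    """basic_blocks: list of (label, ops), already in forward control order.

    Each op is a dict with an 's_time' (scheduled cycle) and a 'text' field.
    """
    lines = []
    for label, ops in basic_blocks:              # steps 2 and 4
        lines.append(f"{label}:")
        last_cycle = max((op["s_time"] for op in ops), default=-1)
        for cycle in range(last_cycle + 1):      # step 3, one cycle at a time
            issued = [op["text"] for op in ops if op["s_time"] == cycle]
            lines.append("  " + (" || ".join(issued) if issued else "NOP"))
    lines.extend(global_data)                    # step 5
    return lines


blocks = [("BB1", [
    {"s_time": 0, "text": "ADD_W"},
    {"s_time": 2, "text": "BRU"},
    {"s_time": 0, "text": "L_W_C1_C1"},
])]
for line in dump_assembly(blocks, [".data x"]):
    print(line)
```

Ops that share an s_time end up on the same line, which corresponds to one VLIW instruction.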

19
Studying code size minimization - 3
The New Jersey Machine Code Toolkit
  • Deals with bits at a symbolic level
  • Can be used to write assemblers, disassemblers
    etc.
  • Supports concatenation to emit large binary data
  • Representation is specified in SLED
  • Has been used to write assemblers for SPARC, i486
    etc.
  • VLIW instructions need to be broken up into
    tokens of at most 32 bits
  • Emitted binary data must end on an 8-bit boundary
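
The two constraints in the last bullets (tokens of at most 32 bits, total size a multiple of 8 bits) can be made concrete with a small helper. This is a sketch only; the function name and the choice to fill tokens greedily from the front are assumptions, not toolkit behaviour.

```python
def split_into_tokens(total_bits, max_token_bits=32):
    """Pad an instruction to a byte boundary, then cut it into token sizes."""
    padded = -(-total_bits // 8) * 8  # round up to a multiple of 8
    tokens = []
    while padded > 0:
        size = min(max_token_bits, padded)
        tokens.append(size)
        padded -= size
    return tokens


print(split_into_tokens(47))  # [32, 16]
print(split_into_tokens(70))  # [32, 32, 8]
```

A 47-bit multi-op, for example, would be emitted as one 32-bit token and one 16-bit token.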

20
Studying code size minimization - 4
Machine specifications in SLED
bit 0 is least significant
fields of TOK32 (32)
  Dgpr_1 0:3  Slit_1_part1 4:31
fields of TOK8 (16)
  Slit_1_part2 0:3  Sgpr_1 4:7  IOP 8:11  tmpl 12:14
patterns
  IOP_pats is any of [ ADD MUL SUB ],
    which is tmpl = 1 & IOP = { 0 to 2 }
constructors
  IOP_pats Sgpr_1, Slit_1, Dgpr_1
    is IOP_pats & Sgpr_1 & Slit_1_part2 = Slit_1 @[28:31];
       Slit_1_part1 = Slit_1 @[0:27] & Dgpr_1
21
Studying code size minimization - 5
Toolkit encoder output
ADD( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 )
MUL( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 )
SUB( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 )
Specifying matcher for disassembler
match
  ADD( Sgpr_1, Slit_1, Dgpr_1 ) => //Do something
  MUL( Sgpr_1, Slit_1, Dgpr_1 ) => //Do something
  SUB( Sgpr_1, Slit_1, Dgpr_1 ) => //Do something
endmatch
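
What the generated encoder and matcher do with the field layout of the SLED example can be imitated by hand. The sketch below is not toolkit output: it assumes the reading of the specification in which the 16-bit token holds tmpl, IOP, Sgpr_1 and the top four literal bits, the 32-bit token holds the remaining 28 literal bits and Dgpr_1, and ADD, MUL, SUB are numbered 0 to 2.

```python
IOP_CODES = {"ADD": 0, "MUL": 1, "SUB": 2}  # IOP = { 0 to 2 }
IOP_NAMES = {code: name for name, code in IOP_CODES.items()}
TMPL_IOP = 1                                # tmpl = 1


def encode_iop(name, sgpr, slit, dgpr):
    """Return (tok16, tok32); bit 0 is the least significant bit."""
    assert 0 <= sgpr < 16 and 0 <= dgpr < 16 and 0 <= slit < 2**32
    slit_part2 = (slit >> 28) & 0xF          # Slit_1 @[28:31]
    slit_part1 = slit & 0x0FFFFFFF           # Slit_1 @[0:27]
    tok16 = slit_part2 | (sgpr << 4) | (IOP_CODES[name] << 8) | (TMPL_IOP << 12)
    tok32 = dgpr | (slit_part1 << 4)
    return tok16, tok32


def decode_iop(tok16, tok32):
    """Inverse of encode_iop; plays the role of one matcher arm."""
    if (tok16 >> 12) & 0x7 != TMPL_IOP:
        raise ValueError("not an integer-op template")
    name = IOP_NAMES[(tok16 >> 8) & 0xF]
    sgpr = (tok16 >> 4) & 0xF
    slit = ((tok16 & 0xF) << 28) | (tok32 >> 4)
    dgpr = tok32 & 0xF
    return name, sgpr, slit, dgpr


tok16, tok32 = encode_iop("ADD", 1, 0x12345678, 2)
print(hex(tok16), hex(tok32))    # 0x1011 0x23456782
print(decode_iop(tok16, tok32))  # ('ADD', 1, 305419896, 2)
```

The literal is the awkward part: because it does not fit in one token together with the other fields, it has to be split across the two tokens and rejoined when decoding.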
22
Studying code size minimization - 6
  • The matcher application needs functions for
    fetching data
  • Bit ordering is different on little-endian and
    big-endian machines
  • The matcher fails when a large number of complex
    templates is given
  • Breaking large multi-ops across 32-bit tokens
    makes the representation messy and error
    prone
  • Specifying addresses for forward branches
    requires two passes
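
The need for two passes follows from variable instruction sizes: the address of a forward label is unknown until everything before it has been sized. A minimal sketch, with a hypothetical program representation and made-up byte sizes:

```python
def assemble(program, sizes):
    """program: list of ("label", name) or ("op", mnemonic, target or None)."""
    # Pass 1: assign an address to every label.
    addresses, pc = {}, 0
    for item in program:
        if item[0] == "label":
            addresses[item[1]] = pc
        else:
            pc += sizes[item[1]]
    # Pass 2: emit ops, now that forward targets can be resolved.
    output, pc = [], 0
    for item in program:
        if item[0] == "op":
            _, mnemonic, target = item
            resolved = addresses[target] if target is not None else None
            output.append((pc, mnemonic, resolved))
            pc += sizes[mnemonic]
    return output


program = [
    ("op", "BRU", "exit"),   # forward branch
    ("op", "ADD_W", None),
    ("label", "exit"),
    ("op", "RTS", None),
]
print(assemble(program, {"BRU": 4, "ADD_W": 6, "RTS": 2}))
# [(0, 'BRU', 10), (4, 'ADD_W', None), (10, 'RTS', None)]
```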

23
Studying impact on memory bandwidth - 1
The Typical VLIW Pipeline
Instruction Fetch → Instruction Decode (Align, Decode, Decompress) → DF/AG → Execute → Store Results
24
Studying impact on memory bandwidth - 2
  • The cache simulation requires the generation of:
      • Instruction address
      • No. of bytes to fetch
  • Instruction address can be generated by
    disassembling the instructions at run time and
    keeping track of jumps
  • The matcher application returns the number of
    bytes required to disassemble an instruction
  • The disassembled instruction can be compared with
    the instruction issued to check correctness
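
The address generation described above can be sketched as a loop that asks a decoder for each instruction's size and follows taken branches. Everything here is a stand-in: `decode_size` replaces the real matcher and simply reads a length byte, and the branch outcomes are supplied as a list rather than by the simulator.

```python
def decode_size(memory, pc):
    """Stand-in for the matcher: the first byte of an instruction is its length."""
    return memory[pc]


def fetch_trace(memory, entry, branch_outcomes):
    """Yield the (address, bytes to fetch) pairs seen by the instruction cache.

    branch_outcomes holds, per executed instruction, the target address of a
    taken branch or None when execution falls through.
    """
    pc, trace = entry, []
    for target in branch_outcomes:
        size = decode_size(memory, pc)
        trace.append((pc, size))
        pc = target if target is not None else pc + size
    return trace


# Three instructions at addresses 0, 2 and 6 with sizes 2, 4 and 2 bytes.
memory = bytes([2, 0, 4, 0, 0, 0, 2, 0])
print(fetch_trace(memory, 0, [None, None, 0, None]))
# [(0, 2), (2, 4), (6, 2), (0, 2)]
```

Each pair in the trace is what a cache simulator would need as one instruction access.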

25
Studying impact on memory bandwidth - 3
  • Run time verification of disassembled
    instructions can be turned off for faster
    simulation
  • Because of the restricted size of the matcher,
    results could not be obtained for larger programs
  • Memory access addresses and bytes to fetch have
    been generated by hand for the SumToN application

26
Results - Impact on code size (Strcpy)
27
Results - Impact on code size (SumToN)
28
Results - Size of SLED specification for various
archs.
29
Results - Cache performance comparison (SumToN)
30
Future work
  • Need for automation in most parts of the
    framework
  • Better representation for VLIW instructions than
    SLED
      • Unlimited token size
      • Facility to bind one field with multiple patterns
  • Methodology for predicting the latency of the
    decompressor
  • Framework for finding the optimal instruction
    formats

31
Acknowledgements
  • Prof. M. Balakrishnan and Prof. Anshul Kumar
  • Rodric M. Rabbah, Georgia Institute of Technology
  • Shail Aditya, HP Labs
  • All the friends at Philips Lab. for stimulating
    discussions