Title: Automated Synthesis of Efficient Binary Decoders for Retargetable Software Toolkits
1 Automated Synthesis of Efficient Binary Decoders for Retargetable Software Toolkits
- Wei Qin, Sharad Malik
- Princeton University
2 Overview
- Increasing number of ASIPs
- Software tool chain to exploit programmability
[Figure: software tool chain driven by machine descriptions. Labels: synthesis tools, validation tools, machine code; compiler, assembler/linker, disassembler, debugger, instruction set simulator, cycle simulator, binary translator.]
3 Outline
- Motivation
- Background
- Related Work
- Problem Formulation
- Decoder Construction
- Experimental Results
- Conclusion
4 Motivation
- Software binary decoding
- Sequential vs. hardware parallel
- Control flow intensive
- Error prone for complex instruction sets
- Can be a performance bottleneck
- 2-4 times slower for instruction set simulation (ISS)
- Efficient decoder synthesis algorithm desirable
- Focus on opcode decoding
- Operand decoding is straightforward thereafter
5 Background
- Instruction pattern
- Common software decoding schemes
- Pattern testing
- (inst_word & a_mask) == a_signature → instruction a
- Table lookup
- inst_table[(inst_word >> shift) & mask] → get id
- or
- switch ((inst_word >> shift) & mask)
- case id1: → handle id1
- ...
----00101000-------------------- add pattern
00001111111100000000000000000000 add mask
00000010100000000000000000000000 add signature
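The relationship between a pattern, its mask, and its signature can be sketched in a few lines of Python. This is only an illustration: the helper names (`compile_pattern`, `matches`, `lookup`), the one-entry `inst_table`, and the sample instruction word are assumptions made for the example, not part of the slides.

```python
from typing import Optional, Tuple


def compile_pattern(pattern: str) -> Tuple[int, int]:
    """Turn a pattern over {'0', '1', '-'} into a (mask, signature) pair."""
    mask = signature = 0
    for ch in pattern:
        mask <<= 1
        signature <<= 1
        if ch in "01":
            mask |= 1              # significant bit
            signature |= int(ch)   # required value of that bit
        elif ch != "-":
            raise ValueError(f"unexpected pattern character: {ch!r}")
    return mask, signature


def matches(inst_word: int, mask: int, signature: int) -> bool:
    """Pattern testing: compare only the significant bits."""
    return (inst_word & mask) == signature


ADD_PATTERN = "----00101000--------------------"
add_mask, add_signature = compile_pattern(ADD_PATTERN)

# Table lookup on the same 8-bit field: index = (inst_word >> shift) & mask.
SHIFT, FIELD_MASK = 20, 0xFF
inst_table = {0b00101000: "add"}   # hypothetical one-entry table


def lookup(inst_word: int) -> Optional[str]:
    return inst_table.get((inst_word >> SHIFT) & FIELD_MASK)


# Sample word chosen so that bits 27..20 equal 00101000 (assumed input).
sample = int("11100010100000110011000000000100", 2)
print(hex(add_mask), hex(add_signature), matches(sample, add_mask, add_signature))
print(lookup(sample))
```

Pattern testing costs one AND and one compare per candidate instruction, while table lookup resolves a whole field in one indexed access at the price of table memory.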
6 Background (contd)
- Masks of the ARM instruction patterns
- Which bits to look into first?
00001110000100000000000000000000
00001111000000000000000000000000
00001111010100000000000000000000
00001111010100000000000000010000
00001111111100000000000000000000
00001110010100000000000011110000
00001111111100000000000000010000
00001111111100000000000010010000
00001111111100000000000011110000
00001111111100001111000000000000
00001110010100000000111111110000
00001111111100000000111111110000
00001111111100001111111111110000
00001111111111110000111111111111
7 Related Work
- Sequential decoder [Hadjiyiannis 99]
- List search of instruction patterns
- Foolproof, straightforward
- poor performance
8 Related Work (contd)
- Language guided decoder generation
- SLED in the New Jersey Machine-Code Toolkit [Ramsey 95]
- Group fields in hierarchical tables
- Good quality
- Language dependent
[Figure: hierarchical decoding tables. Field masks 11110000000000000000000000000000 and 00001111110000000000000010000011; table levels: alu, mem; ldw, ldh, ldb, stw; mode1, mode2; imm, reg.]
9 Related Work (contd)
- Caching decoding results in simulation [Nohl 02]
- Exploit locality to avoid repeated decoding
- Tolerant to slow decoder
- Large cache, worst case performance
[Figure: decode cache. Labels: PC, IW (instruction word), cache, Hit?, decode, decoding result.]
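A decode cache of this kind might look roughly like the sketch below. It is not taken from the cited work: the class name `DecodeCache`, the direct-mapped organisation, the 4-byte instruction size, and the choice to compare the instruction word on a hit are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple


class DecodeCache:
    """Direct-mapped cache of decoding results, indexed by program counter."""

    def __init__(self, decode: Callable[[int], str], size: int = 1024):
        self.decode = decode      # the (possibly slow) underlying decoder
        self.size = size
        self.lines: Dict[int, Tuple[int, int, str]] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, pc: int, inst_word: int) -> str:
        index = (pc >> 2) % self.size   # assumes 4-byte instructions
        line = self.lines.get(index)
        # Checking the word as well as the PC is an assumed safeguard
        # against code that changes after it was cached.
        if line is not None and line[0] == pc and line[1] == inst_word:
            self.hits += 1
            return line[2]
        self.misses += 1
        result = self.decode(inst_word)  # slow path: full decode
        self.lines[index] = (pc, inst_word, result)
        return result


def slow_decode(word: int) -> str:
    # Stand-in decoder that only recognises the add pattern.
    return "add" if (word & 0x0FF00000) == 0x02800000 else "other"


cache = DecodeCache(slow_decode, size=4)
for _ in range(3):                       # a loop body executed three times
    cache.lookup(0, 0xE2833004)
    cache.lookup(4, 0xE1A00000)
print(cache.hits, cache.misses)
```

The sketch shows why the scheme tolerates a slow decoder when locality is good, and also why its worst case (every access a miss) is no better than the decoder underneath it.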
10 Related Work (contd)
- Decision tree based decoding [Theiling 01]
- Decoding only common significant bits
- Relatively tall tree
- Deadlock on certain patterns
Example: 000--- a, 001--- b, 01---- c, 10---- d
Deadlock example (no bit is significant in every pattern): 0-1--- a, -10--- b, 10---- c
11 Problem Formulation
- Definitions
- Bit pattern p ∈ {0,1,-}^n ↔ cube
- Bit string s ∈ {0,1}^n ↔ minterm
- Pattern match ↔ minterm in the cube
- Decoding entry
- Triple of (pattern, label, probability).
- Well-formed entry set
- Entries with different labels do not overlap
- Binary decoder
- Mapping bit strings to matching entries
- ----00101000-------------------- add pattern
- 11100010100000110011000000000100 add r3, r3, 4
(----00101000--------------------, add, 0.15)
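The definitions above translate directly into string operations on patterns. The following is a minimal sketch, assuming all patterns and bit strings have the same length; the function names are invented for the example.

```python
from itertools import combinations


def contains(pattern: str, bits: str) -> bool:
    """A bit string matches a pattern when it is a minterm inside the cube."""
    return all(p == "-" or p == b for p, b in zip(pattern, bits))


def overlaps(p: str, q: str) -> bool:
    """Two cubes share a minterm unless some position fixes opposite values."""
    return all(a == b or a == "-" or b == "-" for a, b in zip(p, q))


def is_well_formed(entries) -> bool:
    """Entries are (pattern, label, probability) triples.

    The set is well-formed when entries with different labels never overlap.
    """
    return not any(
        l1 != l2 and overlaps(p1, p2)
        for (p1, l1, _), (p2, l2, _) in combinations(entries, 2)
    )


add_entry = ("----00101000--------------------", "add", 0.15)
word = "11100010100000110011000000000100"   # add r3, r3, 4
print(contains(add_entry[0], word))

good = [("0-1---", "a", 0.2), ("-10---", "b", 0.2), ("10----", "c", 0.6)]
bad = [("0--", "x", 0.5), ("00-", "y", 0.5)]
print(is_well_formed(good), is_well_formed(bad))
```

Well-formedness matters because it makes the decoder a function: every bit string can match entries of at most one label.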
12 Problem Formulation (contd)
- Decoding tree
- (N ∪ D, Edges)
[Figure: general decoding tree. Inner nodes are labeled with decoding functions f1(i), f2(i), f3(i), f4(i); leaves are entries Ei, Ej, Ek.]
13 Problem Formulation (contd)
- Decoding cost modeling
- Execution time
- Average decoding height
- Memory consumption
- Not 2^n
- Small enough to fit in a small part of the cache
- Problem Statement
- Input: well-formed decoding entry set and memory constraint
- Output: decoding tree with minimum Havg
$H_{avg} = \sum_i p_i D_i$, where $p_i$ is the probability of entry $e_i$ and $D_i$ is its decoding height
14 Decoder Construction
- Decision function candidates
- Pattern decoding → two children
- (iw & mask) == signature
- Total number: 3^n - 1
- Table decoding → 2^m children
- table[(iw >> shift) & bit_mask]
- Contiguous bits
- Total number: n(n+1)/2
- Simple, low execution time, effective
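The two candidate counts can be checked by direct enumeration. The sketch below is illustrative only; the function names are made up, and a table function is identified here by an assumed `(shift, width)` pair.

```python
from itertools import product


def num_pattern_functions(n: int) -> int:
    # Each bit is 0, 1, or don't-care; the all-don't-care pattern tests
    # nothing, so it is excluded.
    return 3 ** n - 1


def pattern_functions(n: int):
    """Enumerate every non-trivial pattern of width n (small n only)."""
    return ["".join(p) for p in product("01-", repeat=n) if set(p) != {"-"}]


def contiguous_fields(n: int):
    """Every run of adjacent bits, as (shift, width) pairs."""
    return [
        (shift, width)
        for width in range(1, n + 1)
        for shift in range(n - width + 1)
    ]


print(len(pattern_functions(3)), num_pattern_functions(3))
print(len(contiguous_fields(32)), 32 * 33 // 2)
```

For a 32-bit instruction word this gives 528 table functions against roughly 1.85e15 pattern functions, which is why the pattern candidates in particular need pruning.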
15 Decoder Construction (contd)
Entries: (000, l1, .25) (001, l2, .25) (01-, l3, .25) (1--, l4, .25)
Havg = 1.5
16 Decoder Construction (contd)
- Construction of decoding tree → brute force
- Problems
- Too many function candidates
- 3^n - 1 pattern functions, n(n+1)/2 table functions
- Prune search space
- Too deep recursion
- Estimate costs for subtrees
foreach decoding_function_candidate
    divide entry set
    recursively construct trees for subsets
    sum weighted costs of sub-trees and the function itself
select the function with the least overall cost
17 Field Growing Heuristics
- Prune function candidates
- Field growing heuristics to prune function space
------1-------------------------
18 Tree Cost Estimation
- Subtree cost estimation
- Use cost of binary decoding tree as a relative metric
- Tree height estimate
- Huffman tree as a lower bound for binary tree height
- Memory consumption estimation
- Internal tree nodes
- Decoding tables
- Binary tree
- E-1 nodes
- 0 tables
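The Huffman-based height estimate can be computed without building the tree: the probability-weighted leaf depth equals the sum of the weights created by each merge. The sketch below uses the standard Huffman construction from Python's `heapq`; the function name is an assumption, and the probabilities are expected to be already normalised within the sub-tree.

```python
import heapq


def huffman_avg_height(probabilities):
    """Probability-weighted leaf depth of a Huffman tree over the entries."""
    heap = list(probabilities)
    if len(heap) < 2:
        return 0.0                 # a single entry needs no decoding step
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        # Merge the two least likely sub-trees under a new binary node.
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged            # every leaf below gains one level
        heapq.heappush(heap, merged)
    return total


print(huffman_avg_height([0.25, 0.25, 0.25, 0.25]))
print(huffman_avg_height([0.2, 0.2, 0.6]))
```

Because a Huffman tree ignores whether a single decoding function can actually realise each split, the value is a lower bound on the height of a real binary decoding tree, not a prediction of it.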
19 Tree Cost Estimation (contd)
- Total memory of using a decoding function
- Memory efficiency ratio
- Overall cost function of a decoding function
Symbols: p_i = probability of sub-tree i; H_i = Huffman tree height of sub-tree i; plus a memory penalty factor
20 Decoder Construction (contd)
Entries: (000, l1, .25) (001, l2, .25) (01-, l3, .25) (1--, l4, .25)
Havg = 1.5
Memory: 2 nodes + 4 table entries = 6 units
21 Decoder Construction (contd)
- Decoding tree alternative
Entries: (000, l1, .25) (001, l2, .25) (01-, l3, .25) (1--, l4, .25)
Root: table decoding on (inst & 7)
000 → (000, l1, .25)
001 → (001, l2, .25)
010 → (010, l3, .125)
011 → (011, l3, .125)
100 → (100, l4, .0625)
101 → (101, l4, .0625)
110 → (110, l4, .0625)
111 → (111, l4, .0625)
Havg = 1
Memory: 1 node + 8 table entries = 9 units
22 Decoder Construction (contd)
[Figure: binary decoding tree over the same four entries, built from single-bit tests labeled (inst & 1), (inst & 2), (inst & 4) with branches 0 and 1; leaves (000, l1, .25), (001, l2, .25), (01-, l3, .25), (1--, l4, .25).]
Havg = 2
23 Decoder Construction (contd)
Entries: (0-1---, a, .2) (-10---, b, .2) (10----, c, .6)
Root: table decoding on (inst >> 4) & 3
00 → (001---, a, .1)
01 → node over (011---, a, .1) (010---, b, .1): (inst >> 3) & 1
    0 → (010---, b, .1)
    1 → (011---, a, .1)
10 → (10----, c, .6)
11 → (110---, b, .1)
Havg = 1.2
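One reading of this example is a two-level table decoder: the root indexes on the top two bits, and only the 01 slot needs a second lookup on the third bit, with the a and b patterns split across slots. The sketch below encodes that reading as nested dictionaries and recomputes the average height. The data layout and function names are assumptions, and the sketch presumes every input is a valid encoding (a word such as 000000 would still need a final pattern test at the leaf).

```python
# Leaves are (label, probability); inner nodes are table lookups.
TREE = {
    "shift": 4, "mask": 3,
    "table": [
        ("a", 0.1),                                   # 00: 001---
        {"shift": 3, "mask": 1,
         "table": [("b", 0.1), ("a", 0.1)]},          # 01: 010--- / 011---
        ("c", 0.6),                                   # 10: 10----
        ("b", 0.1),                                   # 11: 110---
    ],
}


def decode(node, inst: int):
    """Walk the tree; return the label and the number of lookups used."""
    height = 0
    while isinstance(node, dict):
        node = node["table"][(inst >> node["shift"]) & node["mask"]]
        height += 1
    return node[0], height


def average_height(node, depth: int = 0) -> float:
    """Sum of leaf probability times leaf depth."""
    if isinstance(node, dict):
        return sum(average_height(child, depth + 1) for child in node["table"])
    return node[1] * depth


print(decode(TREE, 0b011000), decode(TREE, 0b100000))
print(round(average_height(TREE), 6))
```

Splitting the a and b patterns across table slots is what removes the deadlock: no single bit is significant in all three original patterns, yet each table slot ends up holding entries that a further lookup can separate.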
24 Experimental Results
- Two ISAs
- ARM: 137 instructions, 50 unused patterns
- PowerPC: 148 instructions, 130 unused patterns
- Benchmarks
- Training set: go, li, compress, gcc, gzip
- Running set: mcf, parser, vortex, bzip2, twolf
- Instruction Set Simulators (100X)
- ARM: 8.88 MIPS
- PowerPC: 8.15 MIPS
IDEF(ld1_imm_p, 0x0f500000, 0x05100000, 1.923343e-01)
IDEF(mov_2, 0x0ff00010, 0x01a00000, 1.271039e-01)
IDEF(branch, 0x0f000000, 0x0a000000, 7.793878e-02)
On PIII 800MHz Linux
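As a baseline for comparison, the three sample entries above are enough to build a small sequential decoder that tries the most probable pattern first. Reading the IDEF fields as (name, mask, signature, probability) is an assumption based on the values shown; the function name and the test words are likewise invented for the example.

```python
ENTRIES = [
    # (name, mask, signature, probability): assumed field order
    ("ld1_imm_p", 0x0F500000, 0x05100000, 1.923343e-01),
    ("mov_2",     0x0FF00010, 0x01A00000, 1.271039e-01),
    ("branch",    0x0F000000, 0x0A000000, 7.793878e-02),
]

# Most frequent first, so common instructions are found after few tests.
ORDERED = sorted(ENTRIES, key=lambda e: e[3], reverse=True)


def decode_sequential(inst_word: int):
    """Return (name, number of pattern tests performed)."""
    for tests, (name, mask, signature, _) in enumerate(ORDERED, start=1):
        if (inst_word & mask) == signature:
            return name, tests
    return None, len(ORDERED)


for word in (0xE5100000, 0xE1A00000, 0xEA000000):
    print(hex(word), decode_sequential(word))
```

The number of tests grows with the position of the instruction in the list, so the average cost depends heavily on how well the training probabilities match the program being run.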
25 Experimental Results (contd)
[Figure: plot; axis label: memory penalty factor]
26 Experimental Results (contd)
27 Experimental Results (contd)
- Comparison of decoders
- Trained sequential
- Dependent on benchmarks, compiler
- For SPECfp (alvinn, art, equake): Havg = 29.9, 5.27 MIPS for PowerPC
28 Conclusions
- Decision tree based binary decoder
- Based on pattern decoding and table decoding primitives
- Patterns split to reduce tree height
- Field growing heuristics to prune search space
- Huffman tree height as execution time estimation
- Memory utilization ratio as memory estimation
- Advantages
- High quality with ensured correctness
- Speed comparable with hand-coded decoder
- No limitation on instruction set
- Safe to use on ASIPs with irregular encoding
- Simple input format
- Can be obtained from any machine description with encoding information