Title: Exploring VLIW ASIP Design Space using Trimaran Framework
1Exploring VLIW ASIP Design Space using Trimaran
Framework
- Under the guidance of
- Prof. Anshul Kumar
- V Srinivasa Reddy
- (2004MCS2453)
2Introduction
- Application Specific Instruction set Processors (ASIPs)
- Give better performance than General Purpose Processors (GPPs)
- More flexible than ASICs
- Customize the processor for a set of applications through
- Instruction Set Extension
- Processor Specialisation
3Coarse Grain AFUs
- Chaining of operations reduces the computation time.
- Wiring logic reduces register pressure by bypassing values from one FU to another FU within the AFU.
- Reduces spill code for VLIW processors
- If the operands of an operation have limited resolution, the FU can be made faster.
- Parallelism can be optimized within the AFU.
- A comparison operation can be performed simultaneously in the AFU, which eliminates the branching delay
4VLIW Processors
- VLIW processors
- Better ILP for numeric programs
- Static Scheduling
- ILP is limited by the branch statements
5Handling branch statements in VLIW processors
- Pipeline strategies
- branch always taken
- branch always not taken
- Predicated execution
- Needs the support of processor
- Control dependency is converted into data dependency
6Pipeline Strategy
- Consider a 5 stage pipeline processor
- Processor with branch always not taken strategy
- Condition true: cycles wasted = 1
- Condition false: cycles wasted = 3
7Predicated Execution
- Present VLIW architectures support predicated instructions
- HPL-PD architecture
- Each operation has one extra one-bit operand, called the predicate register
- e.g., r2 = ADD.W r1, r3 if p
- If the predicate register contains 0, the operation is not performed
- The values of predicate registers are typically set by compare-to-predicate operations
- p = CMPP.< r4, r5
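The semantics of the predicated add and the compare-to-predicate operation described above can be sketched in C. This is a minimal illustration, not HPL-PD code: the function names `cmpp_lt`, `pred_add_w`, and `if_converted_add` are hypothetical, and registers are modeled as plain 32-bit integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compare-to-predicate, modeled on the slide's CMPP example:
 * the predicate becomes true when a < b. */
bool cmpp_lt(int32_t a, int32_t b)
{
    return a < b;
}

/* Predicated add, modeled on the slide's ADD.W example.
 * Returns the new value of the destination register. When p is false the
 * operation is not performed, so the old destination value is kept. */
int32_t pred_add_w(int32_t old_dest, int32_t a, int32_t b, bool p)
{
    if (!p)
        return old_dest;
    /* Add in unsigned arithmetic so the sketch wraps instead of overflowing. */
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* If-converted form of "if (r4 < r5) r2 = r1 + r3;": no branch remains,
 * only a data dependence on the predicate p. */
int32_t if_converted_add(int32_t r2, int32_t r1, int32_t r3,
                         int32_t r4, int32_t r5)
{
    bool p = cmpp_lt(r4, r5);
    return pred_add_w(r2, r1, r3, p);
}
```

Calling `if_converted_add(7, 2, 3, 1, 2)` yields 5 because the predicate is true, while `if_converted_add(7, 2, 3, 2, 1)` leaves the old value 7 in place.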
8Example
- Advantage
- If-conversion
- Control dependency height reduction
- Disadvantage
- Slots get wasted due to predicated instructions
9ASIP methodology
if (a < b) c = a - b; else c = a + b;
- Advantage
- Comparison can be done simultaneously
- Stores the intermediate results in wires
10How to handle variable latency AFUs in VLIW?
- Schedule the instructions assuming the minimum latency path of the AFU.
- If the AFU takes the longest path, stall the pipeline until the results come out
- Similar to the pipeline branch handling strategy
11Pipeline flow
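[Figure: pipeline flow for an AFU with min latency = 2 and max latency = 3]
- Minimum latency path taken: issue the next instruction in the next cycle
- Maximum latency path taken: issue the next instruction in the next cycle, then stall in the 3rd cycle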
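The stall rule for variable latency AFUs can be made concrete with a small sketch. It assumes the compiler schedules for the minimum latency and the hardware stalls for the difference whenever the long path is taken; the function names and the `long_path_runs` parameter are hypothetical and only serve the illustration.

```c
#include <assert.h>

/* Extra cycles the pipeline must stall when the schedule assumed
 * min_latency but the AFU actually needed actual_latency cycles. */
int afu_stall_cycles(int min_latency, int actual_latency)
{
    assert(min_latency > 0 && actual_latency >= min_latency);
    return actual_latency - min_latency;
}

/* Total cycles for `executions` runs of a block whose static schedule is
 * schedule_length cycles, when long_path_runs of those runs take the
 * maximum latency path of the AFU. */
long cycles_with_stalls(long executions, int schedule_length,
                        int min_latency, int max_latency,
                        long long_path_runs)
{
    assert(long_path_runs >= 0 && long_path_runs <= executions);
    return executions * schedule_length
         + long_path_runs * afu_stall_cycles(min_latency, max_latency);
}
```

With a minimum latency of 2 and a maximum latency of 3, each execution that takes the long path costs one stall cycle and the others cost none.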
12How to handle Inputs and Outputs Of AFUs
- All inputs are read at the beginning and all outputs are written at the end
- Need to find the deterministic AFUs
- Compare instruction can be performed simultaneously
- Need data flow and control flow analysis
- Time shaping of inputs and outputs
- Compare instruction cannot be performed simultaneously
13Comparison
m and n are the schedule lengths of the two branch paths
- Predicated execution: max(n, m) + 1 < schedule length < n + m + 1
- ASIP methodology: min(n, m) < schedule length < max(n, m)
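The two ranges can be computed directly from n and m. The sketch below reads the limits as max(n, m) + 1 and n + m + 1 for predicated execution, and min(n, m) and max(n, m) for the ASIP methodology; the type `bounds_t` and the function names are hypothetical.

```c
#include <assert.h>

/* Lower and upper limit of a schedule length range. */
typedef struct {
    int lo;
    int hi;
} bounds_t;

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Predicated execution: between max(n, m) + 1 and n + m + 1. */
bounds_t predicated_bounds(int n, int m)
{
    bounds_t r;
    assert(n > 0 && m > 0);
    r.lo = imax(n, m) + 1;
    r.hi = n + m + 1;
    return r;
}

/* ASIP methodology: between min(n, m) and max(n, m). */
bounds_t asip_bounds(int n, int m)
{
    bounds_t r;
    assert(n > 0 && m > 0);
    r.lo = imin(n, m);
    r.hi = imax(n, m);
    return r;
}
```

For n = 3 and m = 5 this gives the limits 6 and 9 for predicated execution, against 3 and 5 for the ASIP methodology.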
14Comparison (cont ...)
[Figure: super block and hyper block]
- Super block
- schedule length = (1-p)(n+3) + p(m+1)
- let n = m and p = 1-p = 0.5 (i.e., equal probability)
- SL = 0.5(n+3) + 0.5(n+1) = n+2
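The super block estimate can be checked numerically. The sketch below evaluates the expected schedule length as (1-p)(n+3) + p(m+1), which is how the slide's formula reads once the missing operators are restored; the function name is hypothetical.

```c
#include <assert.h>

/* Expected super block schedule length: a path of length n pays 3 extra
 * cycles with probability (1 - p), a path of length m pays 1 extra cycle
 * with probability p. */
double superblock_schedule_length(double p, int n, int m)
{
    assert(p >= 0.0 && p <= 1.0);
    return (1.0 - p) * (n + 3) + p * (m + 1);
}
```

With n = m = 4 and p = 0.5 the function returns 6.0, which agrees with the n + 2 result of the equal probability case.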
15Comparison(cont ... )
16Identification of Special AFUs
Identification Algorithm
Selection Criteria
17Earlier work
- Used MachSuif
- identification
- Evaluation
- Selection
- Trimaran
- Finding Statistics
18Limitations of earlier work
- Identification is based on a different architecture
- Evaluation and selection do not take VLIW features into consideration
- If-conversion modified the program in MachSuif, which decreases the performance
19Trimaran Framework
source www.trimaran.org
20Estimation Approach
- Identification
- Finding deterministic computational blocks(CBs)
- Use def-use and use-def chains to find CBs
- Renaming the registers
- Evaluation of CBs
- Fully predicate the sub region
- Critical path reduction(CPR)
- Control CPR
- Data CPR
- Schedule the sub region for infinite resources
- Estimate the performance improvement, area, and inputs and outputs of the CB
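The step "schedule the sub region for infinite resources" amounts to finding the critical path of the dependence graph. The sketch below is one possible way to compute it and is not Elcor code. It assumes ops are numbered in topological order and that every dependence edge simply delays the consumer until the producer's latency has elapsed; `dep_t`, `MAX_OPS`, and `asap_schedule_length` are hypothetical names.

```c
#include <assert.h>

#define MAX_OPS 64  /* arbitrary capacity for this sketch */

/* Dependence edge: op `to` needs the result of op `from`. */
typedef struct {
    int from;
    int to;
} dep_t;

/* Schedule length with unlimited functional units: every op issues as soon
 * as all of its operands are ready. Ops must be numbered 0..n_ops-1 in
 * topological order; latency[i] is the latency of op i in cycles. */
int asap_schedule_length(int n_ops, const int latency[],
                         const dep_t deps[], int n_deps)
{
    int start[MAX_OPS] = {0};
    int length = 0;

    assert(n_ops > 0 && n_ops <= MAX_OPS);
    for (int v = 0; v < n_ops; v++) {
        for (int e = 0; e < n_deps; e++) {
            if (deps[e].to == v) {
                assert(deps[e].from >= 0 && deps[e].from < v);
                int ready = start[deps[e].from] + latency[deps[e].from];
                if (ready > start[v])
                    start[v] = ready;
            }
        }
        if (start[v] + latency[v] > length)
            length = start[v] + latency[v];
    }
    return length;
}
```

Comparing this length before and after a candidate CB is collapsed into a single AFU operation gives a rough estimate of the performance improvement.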
21Estimation (cont...)
- Selection
- Multi objective problem
- Select the CBs that satisfy the multiple objectives
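The slides do not state the selection rule, so the following is only one possible reading: treat each objective as a limit and keep the CBs that meet all of them. The struct fields mirror the quantities estimated during evaluation (performance improvement, area, inputs, outputs); the limit values and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Estimates produced by the evaluation step for one computational block. */
typedef struct {
    int cycles_saved;  /* estimated performance improvement */
    int area;          /* estimated area, in arbitrary units */
    int inputs;        /* number of input operands */
    int outputs;       /* number of results */
} cb_estimate_t;

/* Hypothetical limits; the slides give no concrete values. */
typedef struct {
    int min_cycles_saved;
    int max_area;
    int max_inputs;
    int max_outputs;
} cb_limits_t;

/* Writes the indices of acceptable CBs into selected[] (capacity n)
 * and returns how many were selected. */
size_t select_cbs(const cb_estimate_t cbs[], size_t n,
                  const cb_limits_t *lim, size_t selected[])
{
    size_t count = 0;

    assert(lim != NULL);
    for (size_t i = 0; i < n; i++) {
        if (cbs[i].cycles_saved >= lim->min_cycles_saved &&
            cbs[i].area <= lim->max_area &&
            cbs[i].inputs <= lim->max_inputs &&
            cbs[i].outputs <= lim->max_outputs)
            selected[count++] = i;
    }
    return count;
}
```

A threshold filter is the simplest choice; a weighted cost or a Pareto front would be alternative ways to handle the same multi objective problem.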
22Ifthen benchmark(source code)
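#include <stdio.h>
#include <stdlib.h>

int main()
{
    int i, a, b, c, d, e, f, g;
    a = b = c = d = e = f = g = 0;
    for (i = 0; i < 200; i++) {
        int x = i % 2;
        int y = i % 4;
        a++;
        if (x == 0) {
            b++;
            if (y == 0) c++; else d++;
        } else {
            e++;
        }
        if (x == 0) f++; else g++;
    }
    printf("a=%d b=%d c=%d d=%d e=%d f=%d g=%d\n",
           a, b, c, d, e, f, g);
    exit(0);
}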
23CDFG
block    schedule length (s l)    weight
bb 1     7                        1
bb 2     3                        1
bb 3     7                        200
bb 4     2                        100
bb 5     2                        50
bb 9     2                        200
bb 10    2                        100
bb 12    3                        200
bb 13    9                        1
bb 14    9                        1
bb 8     2                        100
bb 11    2                        100
bb 6     2                        50
24
op 1 (SHRA_W br<27i gpr 3> br<1i gpr 2> i<31> p<t> s_time(0) s_opcode(SHRA_W.0) flags(sched))
op 2 (ADD_W br<2i gpr 5> br<2i gpr 5> i<1> p<t> s_time(0) s_opcode(ADD_W.2) flags(sched))
op 3 (AND_W br<24i gpr 6> br<27i gpr 3> i<1> p<t> s_time(1) s_opcode(AND_W.0) flags(sched))
op 4 (AND_W br<28i gpr 7> br<27i gpr 3> i<3> p<t> s_time(1) s_opcode(AND_W.1) flags(sched))
op 5 (ADD_W br<25i gpr 8> br<1i gpr 2> br<24i gpr 6> p<t> s_time(2) s_opcode(ADD_W.0) flags(sched))
op 6 (ADD_W br<29i gpr 9> br<1i gpr 2> br<28i gpr 7> p<t> s_time(2) s_opcode(ADD_W.1) flags(sched))
op 7 (AND_W br<26i gpr 10> br<25i gpr 8> i<-2> p<t> s_time(3) s_opcode(AND_W.0) flags(sched))
op 8 (AND_W br<30i gpr 11> br<29i gpr 9> i<-4> p<t> s_time(3) s_opcode(AND_W.1) flags(sched))
op 9 (SUB_W br<9i gpr 4> br<1i gpr 2> br<26i gpr 10> p<t> s_time(4) s_opcode(SUB_W.0) flags(sched))
op 10 (SUB_W br<10i gpr 12> br<1i gpr 2> br<30i gpr 11> p<t> s_time(4) s_opcode(SUB_W.1) flags(sched))
op 11 (CMPP_W_NEQ_UN_UC br<1p pr 1> br<2p pr 2> br<9i gpr 4> i<0> p<t> s_time(5) s_opcode(CMPP_W_NEQ_UN_UN.0) flags(sched))
//bb8
op 12 (ADD_W br<6i gpr 14> br<6i gpr 14> i<1> p<1> s_time(0) s_opcode(ADD_W.1) flags(sched))
//bb4
op 13 (ADD_W br<3i gpr 13> br<3i gpr 13> i<1> p<2> s_time(0) s_opcode(ADD_W.2) flags(sched))
op 14 (CMPP_W_NEQ_UN_UC br<3p pr 3> br<4p pr 4> br<10i gpr 12> i<0> p<2> s_time(0) s_opcode(CMPP_W_NEQ_UN_UN.1) flags(sched))
//bb6
op 15 (ADD_W br<5i gpr 18> br<5i gpr 18> i<1> p<3> s_time(0) s_opcode(ADD_W.1) flags(sched))
//bb5
op 16 (ADD_W br<4i gpr 17> br<4i gpr 17> i<1> p<4> s_time(0) s_opcode(ADD_W.0) flags(sched))
op 17 (CMPP_W_NEQ_UN_UN br<5p pr 5> br<6p pr 6> br<9i gpr 4> i<0> p<t> s_time(0) s_opcode(CMPP_W_NEQ_UN_UN.1) flags(sched))
//bb10
op 18 (ADD_W br<7i gpr 15> br<7i gpr 15> i<1> p<6> s_time(0) s_opcode(ADD_W.0) flags(sched))
//bb11
op 19 (ADD_W br<8i gpr 16> br<8i gpr 16> i<1> p<5> s_time(0) s_opcode(ADD_W.1) flags(sched))
//bb12
op 20 (ADD_W br<1i gpr 2> br<1i gpr 2> i<1> p<t> s_time(0) s_opcode(ADD_W.0) flags(sched))
op 21 (PBRR br<38b btr 2> b<3> i<1> p<t> s_time(0) s_opcode(PBRR.1) attr(lc 183) flags(sched))
op 22 (CMPP_W_LT_UN_UN br<7p pr 7> u<> br<1i gpr 2> i<200> p<t> s_time(1) s_opcode(CMPP_W_LT_UN_UN.0) flags(sched))
op 23 (BRCT br<38b btr 2> br<39p pr 7> p<t> s_time(2) s_opcode(BRCT.0) attr(lc 185) out(op-93(199) op-101(1)) flags(sched))
25DDG
[Figure: data dependence graph (DDG) over ops 1 to 23]
cycle    ops (source columns: slot1 slot2 slot3 slot4)
0        1 2 20 21
1        3 4 22
2        5 6
3        7 8
4        9 10
5        17 11
6        18 19 12 14
7        13 15 16 23
Schedule length = 8
Total cycles = 200 * 8 + 7 + 3 + 9 + 9 = 1628 cycles
26AFU
cycle    ops (source columns: slot1 slot2 slot3 slot4)
0        1 2 20 21
1        3 4 22
2        5 6
3        7 8
4        9 10
5        17 11 18 19 12
6        14 15 16 23 13
Schedule length = 7
Total cycles = 200 * 7 + 7 + 3 + 9 + 9 = 1428 cycles
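The cycle totals on the DDG and AFU slides follow one pattern: the loop body's schedule length times the 200 iterations, plus four further terms. The sketch below reproduces that arithmetic. It assumes the terms 7, 3, 9, 9 are the schedule lengths of the blocks that execute once (they match bb 1, bb 2, bb 13 and bb 14 in the CDFG listing); the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Total cycles = iterations * loop schedule length
 *              + schedule lengths of the blocks executed once. */
long total_cycles(long iterations, int loop_schedule_length,
                  const int other_lengths[], size_t n_other)
{
    long total;

    assert(iterations >= 0 && loop_schedule_length > 0);
    total = iterations * loop_schedule_length;
    for (size_t i = 0; i < n_other; i++)
        total += other_lengths[i];
    return total;
}
```

With the once-executed lengths {7, 3, 9, 9}, a loop schedule length of 8 gives 1628 cycles and a length of 7 gives 1428 cycles, so the AFU version saves 200 cycles, roughly 12% of the original total.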
27Comparison
28Work Done
- Related work
- Diviya Jain's work
- Bhuvan Middha's work
- Various identification algorithms
- Elcor backend
- Understood all the passes of the Elcor backend
- Read related material to understand every pass of Elcor
- Wrote a program to identify simple if-else computation blocks
- Developed an evaluation methodology
29Future work
- Finding an algorithm that takes the VLIW architecture into consideration
- Implementing the evaluation and selection strategy in Elcor
- Availability of different cache levels and direct accessibility gives the opportunity to implement loops in AFUs