Title: Dynamic Reconfiguration with Binary Translation: Breaking the ILP Barrier with Software Compatibility
1Dynamic Reconfiguration with Binary Translation
Breaking the ILP Barrier with Software
Compatibility
- Antonio Carlos S. Beck Filho, caco_at_inf.ufrgs.br
- Luigi Carro, carro_at_inf.ufrgs.br
- Informatics Institute - LSE
- Federal University of Rio Grande do Sul
- Brazil
4Motivation
- How to increase performance with low energy consumption?
- Using a reconfigurable array!
Binary Translation (BT)
5Motivation
- How to increase performance with low energy consumption?
- Using a reconfigurable array!
- Special tools and/or compilers are needed
- No software compatibility!
- What happens to the design cycle?
6Outline
- Introduction
- The Java processors
- The reconfigurable array
- How the BT algorithm works
- Results
- Conclusions and Future Work
8Introduction
- Advantages of using a reconfigurable array
- Speeds up sequences of instructions, even when they are data dependent
9Introduction
[Figure: instruction execution over time (clock cycles); data-dependent instructions are shown in the same color]
10Introduction
- Dynamic Analysis (BT)
- Finds, at run time, sequences of instructions to be executed in the array
11Introduction
- Dynamic analysis in recent works
- Fine-grain arrays and FPGAs
- Increases the complexity of detection
- Increases the size of the cache that keeps the configurations
- Applied only to critical parts of the software
- In this work
- Coarse-grain array
- Detection of the instructions becomes simpler
- Small amount of memory required
- Optimizes any sequence in the software
- Technology independent
12Introduction
- Java processor as case study
- Object Oriented
- Modeling
- Programming
- Validation
- Multiplatform
- Widespread
- Moreover, it makes the detection algorithm and the routing simpler
13Introduction
[Diagram labels: Coarse-Grain Reconfigurable Array; Dynamic Detection (Binary Translation); Performance; Energy Consumption]
15Femtojava Pipelined
[Figure: pipeline with five stages: Instruction Fetch, Decoding, Operand Fetch, Execution, Write Back]
16Femtojava VLIW
- The VLIW version is basically an extension of the pipelined one
- 2, 4 or 8 instructions/packet
- VLIW packet has a variable size
- Functional units are replicated
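The effect of packet width on dependent code can be illustrated with a small sketch. It is not the Femtojava scheduler: the function name `pack_vliw`, the dependence map, and the greedy policy are all assumptions made for illustration. It packs instructions into variable-size packets of at most 2, 4 or 8 slots and shows that a chain of dependent instructions bounds the packet count no matter how wide the machine is.

```python
from typing import Dict, List, Set


def pack_vliw(deps: Dict[str, Set[str]], order: List[str], width: int) -> List[List[str]]:
    """Greedily pack instructions into packets of at most `width` slots.

    `deps` maps an instruction to the instructions whose results it needs
    (hypothetical representation). An instruction may enter a packet only
    if all of its producers were issued in an earlier packet.
    """
    done: Set[str] = set()
    pending = list(order)
    packets: List[List[str]] = []
    while pending:
        packet: List[str] = []
        for ins in pending:
            if len(packet) == width:
                break
            if deps.get(ins, set()) <= done:
                packet.append(ins)
        if not packet:
            raise ValueError("unsatisfiable dependence (cycle or unknown producer)")
        for ins in packet:
            pending.remove(ins)
        done.update(packet)
        packets.append(packet)  # packets may be shorter than `width`
    return packets


# A chain a -> b -> c -> d plus four independent instructions.
DEPS = {"b": {"a"}, "c": {"b"}, "d": {"c"}}
ORDER = ["a", "b", "c", "d", "e", "f", "g", "h"]

if __name__ == "__main__":
    for width in (2, 4, 8):
        print(width, pack_vliw(DEPS, ORDER, width))
```

With this input every width needs four packets, because the four-instruction dependence chain cannot be issued in parallel.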
18The Reconfigurable Array
- The coarse-grain array is tightly coupled to the processor
- Decreases the communication overhead
- No external access to the array
- It is formed by one or more basic cells
- With one multiplier
- A sequence of seven sets of basic functional units
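One way to picture the cell structure is as a row of levels, where each level hosts operations whose inputs come from earlier levels. The sketch below rests on that reading, which is an assumption: the slides only state that a cell has "a sequence of seven sets of basic functional units". The names `ARRAY_LEVELS`, `assign_levels` and `fits_in_cell` are hypothetical, and the multiplier and the number of units per set are not modeled.

```python
from typing import Dict, List, Set, Tuple

ARRAY_LEVELS = 7  # "seven sets of basic functional units", read here as seven chained levels


def assign_levels(deps: Dict[str, Set[str]], order: List[str]) -> Dict[str, int]:
    """Place each operation one level after its deepest producer.

    `order` must list producers before their consumers.
    """
    level: Dict[str, int] = {}
    for ins in order:
        producers = deps.get(ins, set())
        level[ins] = 1 + max((level[p] for p in producers), default=0)
    return level


def fits_in_cell(deps: Dict[str, Set[str]], order: List[str], levels: int = ARRAY_LEVELS) -> bool:
    """True if the deepest dependence chain does not exceed the available levels."""
    return max(assign_levels(deps, order).values(), default=0) <= levels


def dependent_chain(n: int) -> Tuple[Dict[str, Set[str]], List[str]]:
    """Build n operations where each one depends on the previous one."""
    order = [f"op{i}" for i in range(n)]
    deps = {order[i]: {order[i - 1]} for i in range(1, n)}
    return deps, order


if __name__ == "__main__":
    print(fits_in_cell(*dependent_chain(7)))
    print(fits_in_cell(*dependent_chain(8)))
```

Under this assumption a chain of seven dependent operations maps onto one cell, while a longer chain would need a second cell or a split.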
22How it works?
[Figure: instruction flow (5A 32 B7 4C 63) through the pipeline stages Instruction Fetch, Decoding, Operand Fetch, Execution, Write Back]
23How it works?
Sequence found!
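The run-time flow suggested by these slides can be sketched as a lookup in a reconfiguration cache. This is a minimal sketch under stated assumptions, not the authors' hardware: indexing the cache by the address of the first instruction, the FIFO replacement, the capacity, and the names `ReconfigCache` and `run` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ReconfigCache:
    """Hypothetical cache mapping a start address to a saved array configuration."""

    capacity: int = 4
    entries: Dict[int, Tuple[str, ...]] = field(default_factory=dict)

    def lookup(self, pc: int) -> Optional[Tuple[str, ...]]:
        return self.entries.get(pc)

    def store(self, pc: int, sequence: Tuple[str, ...]) -> None:
        if pc not in self.entries and len(self.entries) >= self.capacity:
            # Evict the oldest entry; FIFO is an arbitrary choice for this sketch.
            del self.entries[next(iter(self.entries))]
        self.entries[pc] = sequence


def run(trace: List[Tuple[int, Tuple[str, ...]]], cache: ReconfigCache) -> List[str]:
    """Report where each visited sequence executes: 'pipeline' or 'array'."""
    log: List[str] = []
    for pc, sequence in trace:
        if cache.lookup(pc) is not None:
            log.append("array")        # sequence found: use the saved configuration
        else:
            log.append("pipeline")     # normal execution while detection runs
            cache.store(pc, sequence)  # save the configuration for the next visit
    return log


if __name__ == "__main__":
    seq_a = ("bipush", "bipush", "imul")
    seq_b = ("bipush", "bipush", "iadd")
    print(run([(0x10, seq_a), (0x10, seq_a), (0x20, seq_b), (0x10, seq_a)], ReconfigCache()))
```

The first visit to an address runs on the pipeline while the sequence is detected; later visits hit the cache and run on the array.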
28How BT Works
Sequence of instructions
Bipush 10, Bipush 5, Imul, Bipush 3, Bipush 4, Ishl, Iadd, Istore, Bipush 6, Bipush 7, Imul, Istore
Stack after each instruction (top first)
- Bipush 10: 10
- Bipush 5: 5, 10
- Imul: 50
- Bipush 3: 3, 50
- Bipush 4: 4, 3, 50
- Ishl: 48, 50
- Iadd: 98
- Istore: (empty)
- These instructions depend on each other!
- Bipush 6: 6
- Bipush 7: 7, 6
- Imul: 42
- Istore: (empty)
Operand Block 1: First Sequence
Operand Block 2: Second Sequence
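The stack walkthrough above can be reproduced with a short simulation. It is a sketch, not the BT hardware algorithm: it closes a sequence whenever the operand stack becomes empty, which is an assumed criterion that happens to match the two sequences on the slides. `Istore` is modeled as a plain pop because the slides omit the local-variable index, and the name `trace_and_split` is hypothetical.

```python
from typing import List, Optional, Tuple

Instruction = Tuple[str, Optional[int]]

PROGRAM: List[Instruction] = [
    ("bipush", 10), ("bipush", 5), ("imul", None),
    ("bipush", 3), ("bipush", 4), ("ishl", None), ("iadd", None), ("istore", None),
    ("bipush", 6), ("bipush", 7), ("imul", None), ("istore", None),
]

BINARY_OPS = {
    "imul": lambda a, b: a * b,
    "iadd": lambda a, b: a + b,
    "ishl": lambda a, b: a << b,
}


def trace_and_split(program: List[Instruction]) -> Tuple[List[List[int]], List[List[str]]]:
    """Simulate the operand stack and group instructions into sequences.

    Returns the stack after each instruction (top first) and the list of
    sequences, where a sequence ends when the stack becomes empty.
    """
    stack: List[int] = []
    snapshots: List[List[int]] = []
    sequences: List[List[str]] = []
    current: List[str] = []
    for op, arg in program:
        if op == "bipush":
            stack.append(arg)
        elif op in BINARY_OPS:
            b = stack.pop()  # top of stack is the second operand
            a = stack.pop()
            stack.append(BINARY_OPS[op](a, b))
        elif op == "istore":
            stack.pop()      # value leaves the stack; destination not modeled
        else:
            raise ValueError(f"unsupported instruction: {op}")
        snapshots.append(stack[::-1])
        current.append(op if arg is None else f"{op} {arg}")
        if not stack:        # assumed boundary between independent sequences
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return snapshots, sequences


if __name__ == "__main__":
    snaps, seqs = trace_and_split(PROGRAM)
    for (op, arg), snap in zip(PROGRAM, snaps):
        print(op, "" if arg is None else arg, "->", snap)
    print(seqs)
```

The first eight instructions form one sequence and the last four form another. Since the second sequence uses no value produced by the first, the two can be assigned to separate operand blocks.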
52Results - Benchmarks
- A set of algorithms was executed on the architectures
- Sine Calculation
- Bubble Sort
- Select Sort
- Quick Sort (10 and 100 elements)
- Binary Search
- Sequential Search
- IMDCT (plus three unrolled versions)
- Floating-Point Sum emulation
- Full MP3 Player
53Results - Performance
- Bubble 10: 3.4x faster
- Bubble 100: 5.5x faster
65Energy Consumption - ROM
66Energy Consumption - RAM
67Energy Consumption - Core
68Energy Consumption - Total
69Results - Area
[Figure: area comparison of Pipelined, VLIW 2, VLIW 4, VLIW 8, and Femtojava with reconfigurable array (4 cells, 5 different reconfigurations)]
70Results - Area
[Figure labels: Low Power; Femtojava Pipelined; VLIW 2; VLIW 4; VLIW 8; Reconfigurable Array; BT Logic; Reconfiguration Cache; Data Cache]
72Conclusions
- With BT, a reconfigurable array and Java we achieve at the same time
- Software portability
- Performance
- Low energy consumption
73Future Work
- Use dynamic analysis with CMP
- At run time, detect which core is best suited to execute the software at a given time
- Implement the BT and the reconfigurable array in traditional RISC machines
- What are the implementation differences?
- Trade-off analysis
74The end
- Thank you!!!
- caco_at_inf.ufrgs.br
- carro_at_inf.ufrgs.br