Title: Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit
1. Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit
- Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Koji Inoue, Mehdi Sedighi
- Computer and IT Engineering Department, Amirkabir University of Technology - {mehdipur, szamani, msedighi}_at_aut.ac.ir
- Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University - noori_at_c.csce.kyushu-u.ac.jp, {murakami, inoue}_at_i.kyushu-u.ac.jp
2. Agenda
- Introduction
- Application-specific instruction set extension
- Temporal partitioning
- Some definitions
- General overview of the architecture
- RFU architecture: a quantitative approach
- Generating custom instructions
- Mapping custom instructions
- Integrating the RFU with the base processor
- Integrated framework for generating and mapping custom instructions
- Performance evaluation
- References
3. Introduction
- An extensible processor with a reconfigurable functional unit (RFU)
  - can be an alternative to General Purpose Processors (GPPs), Application-Specific Integrated Circuits (ASICs) and Application-Specific Instruction set Processors (ASIPs)
  - achieves enhanced performance in embedded systems
- ASICs
  - not flexible
  - expensive and time-consuming design process
- GPPs
  - very flexible
  - may not offer the necessary performance
4. Introduction
- ASIPs
  - more flexible than ASICs
  - more potential to meet the high-performance demands of embedded applications, compared to GPPs
  - require generating a complete instruction set architecture for the targeted application
  - a full-custom solution is too expensive and has long design turnaround times
5. Application-Specific Instruction Set Extension
- Another method for performance improvement
- An extensible processor with a reconfigurable functional unit
  - favorable tradeoff between efficiency and flexibility
  - keeps design turnaround time much shorter
- Critical portions of an application's dataflow graph (DFG) are accelerated using custom functional units
  - nodes of the DFG → instructions of the critical portions
  - edges of the DFG → dependencies between instructions
6. Temporal Partitioning
- Partitioning a data flow graph into a number of partitions such that
  - each partition can fit into the target hardware, and
  - dependencies among the graph nodes are not violated
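The two conditions above can be sketched as a small list-style partitioner. This is a minimal sketch assuming an acyclic DFG encoded as a mapping from each node to its predecessors; the encoding, function name and `capacity` parameter are illustrative, not the paper's data structures:

```python
# Minimal temporal-partitioning sketch: split a DFG into an ordered list
# of partitions so that no partition exceeds the hardware capacity and
# every node is placed no earlier than all of its predecessors.
# Assumes an acyclic DFG given as {node: [predecessor nodes]}.
def temporal_partition(deps, capacity):
    placed = set()        # nodes already assigned to earlier partitions
    partitions = []
    remaining = set(deps)
    while remaining:
        # A node is ready once all its predecessors sit in earlier partitions.
        ready = sorted(n for n in remaining
                       if all(p in placed for p in deps[n]))
        part = ready[:capacity]          # respect the hardware size limit
        partitions.append(part)
        placed.update(part)
        remaining.difference_update(part)
    return partitions

# Example: a diamond-shaped DFG on hardware that holds two nodes.
dfg = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(temporal_partition(dfg, 2))   # [['a'], ['b', 'c'], ['d']]
```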
7. Some Definitions
- Hot Basic Block (HBB)
  - a basic block whose execution frequency is greater than a threshold specified in the profiler
- Custom Instructions (CIs)
  - extensions of the Instruction Set Architecture (ISA) that are executed on the RFU
- Reconfigurable Functional Unit (RFU)
  - custom hardware for executing CIs
8. General Overview of the Architecture
[Block diagram: the Adaptive Dynamic Extensible Processor. The base processor is an N-way in-order general RISC with Fetch, Decode, Execute, Memory and Write stages and a register file. The augmented hardware comprises a Profiler, which detects the start addresses of Hot Basic Blocks (HBBs); an RFU, which executes Custom Instructions; and a Sequencer, which switches between the main processor and the RFU.]
9. Operation Modes
[Diagram: the processor alternates between two modes. In Training Mode, binary-level profiling detects the start addresses of HBBs; tools are then run to generate Custom Instructions, generate configuration data for the ACC, and initialize the sequencer table, followed by binary rewriting. In Normal Mode, the sequencer monitors the PC and switches between the main processor and the ACC, which executes the CIs.]
10. Tool Chain
11. Reconfigurable Functional Unit (RFU)
- The RFU is a matrix of Functional Units (FUs)
- The RFU has a two-level configuration memory
  - a multi-context memory (keeps two or four configurations)
  - a cache
- FUs support only logical operations, add/subtract, shifts and compare
- The RFU updates the PC
- The RFU has a variable delay, which depends on the size of the Custom Instruction
12. RFU Architecture: A Quantitative Approach
- 22 programs of MiBench were chosen
- The SimpleScalar toolset was used for simulation
- The RFU is a matrix of FUs; design parameters:
  - number of inputs
  - number of outputs
  - number of FUs
  - connections
  - location of inputs and outputs
- Some definitions
  - measurements consider both frequency and weight
  - CI execution frequency
  - weight (so that frequency × weight equals the number of executed instructions)
  - average over all CIs (Σ Freq × Weight)
- Rejection: percentage of CIs that could not be mapped on the RFU
- Coverage: percentage of CIs that could be mapped on the RFU
- Basic block: a sequence of instructions terminating in a control instruction
- Hot Basic Block: a basic block executed more than a threshold number of times
13. RFU Architecture
- Distribution of inputs across rows
  - Row 1: 7
  - Row 2: 2
  - Row 3: 2
  - Row 4: 2
  - Row 5: 1
- Connections with variable length
  - row 1 → row 3: 1
  - row 1 → row 4: 1
  - row 1 → row 5: 1
  - row 2 → row 4: 1
- Synthesis results using Hitachi 0.18 µm: Area = 1.1534 mm², Delay = 9.66 ns
14. Integrating the RFU with the Base Processor
[Datapath diagram: the register file (Reg0 ... Reg31) feeds the DEC/EXE pipeline registers; the RFU, a matrix of FUs (FU1 ... FU4 shown), sits between the DEC/EXE and EXE/MEM pipeline registers and is driven by a decoder, a configuration memory and the sequencer.]
15. Generation of Custom Instructions
- Custom instructions
  - exclude floating-point, multiply, divide and load instructions
  - include at most one STORE, at most one BRANCH/JUMP, and any other fixed-point instructions
- A simple algorithm is used for generating custom instructions
  - HBBs usually include 10 to 40 instructions for MiBench
- The custom instruction generator is executed on the base processor (in online training mode)
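The inclusion and exclusion rules above amount to a simple legality check over a candidate group of instructions. The sketch below uses illustrative MIPS-style mnemonics for the opcode sets; the actual generator's tables are not given in the slides:

```python
# Legality check for a candidate CI: no floating-point, multiply, divide
# or load instructions; at most one STORE and at most one BRANCH/JUMP;
# any number of other fixed-point instructions. Opcode names are
# illustrative MIPS-style mnemonics, not the tool's actual tables.
EXCLUDED = {"add.s", "mul.s", "mult", "div", "lw", "lb", "lh"}
STORE = {"sw", "sb", "sh"}
BRANCH = {"beq", "bne", "j", "jal", "jr"}

def is_valid_ci(opcodes):
    if any(op in EXCLUDED for op in opcodes):
        return False
    if sum(op in STORE for op in opcodes) > 1:
        return False
    if sum(op in BRANCH for op in opcodes) > 1:
        return False
    return True

print(is_valid_ci(["add", "sll", "xor", "sw", "beq"]))   # True
print(is_valid_ci(["add", "lw", "sub"]))                 # False (contains a load)
```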
16. Mapping Custom Instructions
- Mapping is essentially the well-known placement problem
  - determining appropriate positions for DFG nodes on the RFU
- CI instructions are assigned to FUs based on the priority of the nodes
17. Mapping Custom Instructions
- The slack of each node represents its criticality and hence its priority for partitioning
- A slack of 0 means the node is on the critical path of the DFG and should be scheduled with the highest priority
- For nodes with the same criticality, their ASAP level determines the mapping order
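As a sketch of how the ASAP levels and slack used for prioritization might be computed (the DFG encoding as node → predecessors and the function names are illustrative assumptions, not the paper's implementation):

```python
# ASAP level: the earliest level a node can occupy, given its predecessors.
# Assumes an acyclic DFG given as {node: [predecessor nodes]}.
def asap_levels(deps):
    levels = {}
    def level(n):
        if n not in levels:
            levels[n] = 0 if not deps[n] else 1 + max(level(p) for p in deps[n])
        return levels[n]
    for n in deps:
        level(n)
    return levels

# Slack = ALAP - ASAP; a slack of 0 marks nodes on the critical path.
def node_slack(deps):
    asap = asap_levels(deps)
    depth = max(asap.values())
    succs = {n: [m for m in deps if n in deps[m]] for n in deps}
    alap = {}
    def alap_level(n):
        if n not in alap:
            alap[n] = depth if not succs[n] else min(alap_level(s) for s in succs[n]) - 1
        return alap[n]
    return {n: alap_level(n) - asap[n] for n in deps}

dfg = {"a": [], "b": ["a"], "c": [], "d": ["b", "c"]}
print(node_slack(dfg))   # {'a': 0, 'b': 0, 'c': 1, 'd': 0}
```

Here "c" is the only node off the critical path a → b → d, so it is the only node with nonzero slack.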
18. Mapping Algorithm (1/2)
- First step: determine an appropriate row for the node
  - Row number = last row, if the selected node is on a critical path whose length is greater than or equal to the RFU depth
  - Row number = ALAP - slack - 1, to prevent FUs in the lower RFU rows from being occupied by nodes that do not belong to critical paths
19. Mapping Algorithm (2/2)
- Second step: determine an appropriate column
  - chosen according to the minimum-connection-length criterion
- For each row, a maximum capacity is imposed to prevent too many nodes from gathering in one row
  - the capacity of the rows is determined with respect to the longest critical path and the number of critical paths in the DFG
20. An Example: Mapping a CI on the RFU
21. Generating Custom Instructions for the Target RFU
- Our primary CI generator did not consider any constraints on the generated CIs and tried to generate CIs as large as possible
- Therefore, some of the generated CIs cannot be mapped on the proposed RFU due to its constraints
22. Customizing the CI Generator for the Target RFU: First Approach
- Some primary constraints of the RFU (number of inputs, number of outputs and number of nodes) were added to our CI generator tool so that it generates mappable CIs
- In this approach the CI generator is unaware of the mapping process results
  - some CIs may still fail to map to the RFU due to routing constraints
23. Customizing the CI Generator for the Target RFU: Second Approach
- Integrated framework
  - performs an integrated temporal partitioning and mapping process
  - takes rejected CIs as input
  - partitions them into appropriately sized, mappable CIs
  - adds nodes to the current partition while the architectural constraints are satisfied
  - the ASAP level of the nodes represents their execution order according to their dependencies
- Advantages
  - reduces the number of rejected CIs
  - uses a mapping-aware temporal partitioning process
24. Integrated Framework: Temporal Partitioning Algorithms
- HTTP (horizontal traversal)
  - traverses DFG nodes horizontally, according to their ASAP levels
  - usually yields more parallelism for instruction execution
  - may require large intermediate data; the size of the intermediate data affects the data transfer rate and the size of the configuration memory
- VTTP (vertical traversal)
  - traverses DFG nodes vertically
  - creates partitions with longer critical paths
  - reduces the size of the intermediate data
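The two traversal orders might be contrasted as below. This is a sketch assuming a DFG encoded as node → predecessors with precomputed ASAP levels; function names and tie-breaking rules are illustrative, not the paper's:

```python
# HTTP-style order: visit nodes level by level (by ASAP level), grouping
# independent nodes together and exposing parallelism.
def http_order(deps, asap):
    return sorted(deps, key=lambda n: (asap[n], n))

# VTTP-style order: follow dependence paths downward, which tends to give
# partitions with longer critical paths but less intermediate data.
def vttp_order(deps, asap):
    succs = {n: sorted(m for m in deps if n in deps[m]) for n in deps}
    order, seen = [], set()
    def walk(n):
        if n in seen:
            return
        seen.add(n)
        order.append(n)
        for s in succs[n]:
            # descend only once the successor's inputs are all available
            if all(p in seen for p in deps[s]):
                walk(s)
    for root in sorted(n for n in deps if not deps[n]):
        walk(root)
    for n in http_order(deps, asap):   # pick up any remaining join nodes
        walk(n)
    return order

dfg = {"a": [], "b": ["a"], "c": [], "d": ["b", "c"]}
asap = {"a": 0, "b": 1, "c": 0, "d": 2}
print(http_order(dfg, asap))   # ['a', 'c', 'b', 'd']
print(vttp_order(dfg, asap))   # ['a', 'b', 'c', 'd']
```

Note how the horizontal order interleaves the independent nodes "a" and "c", while the vertical order stays on the path a → b as long as it can.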
25. Integrated Framework: Incremental Temporal Partitioning Algorithm
- The incremental temporal partitioning process is performed iteratively
  - each partition that does not satisfy the RFU constraints is modified, and a new iteration starts
- Two different partition-modification strategies are used, one for HTTP and one for VTTP
  - the main difference is in how the nodes to be moved to the next partition are selected
26. Integrated Framework: Incremental Temporal Partitioning Algorithm
- Incremental HTTP
  - the node with the highest ASAP level is selected and moved to the subsequent partition
- Node selection and moving order: 15, 13, 11, 9, 14, 12, 10, 8, 3 and 7
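The incremental HTTP modification step can be sketched as follows. Here `fits` stands in for the real RFU constraint check (inputs, outputs, node count, routing), which the slides leave abstract, and ties between equal ASAP levels are broken by list order rather than by the slides' exact rule:

```python
# Incremental HTTP step: while the partition violates a constraint, move
# the node with the highest ASAP level into the next partition.
def incremental_http(partition, next_partition, asap, fits):
    part = list(partition)
    moved = []
    while part and not fits(part):
        victim = max(part, key=lambda n: asap[n])   # highest ASAP level wins
        part.remove(victim)
        moved.append(victim)
    return part, moved + list(next_partition)

# Example: shrink a four-node partition so at most two nodes remain.
asap = {"a": 0, "b": 1, "c": 1, "d": 2}
part, nxt = incremental_http(["a", "b", "c", "d"], [], asap,
                             fits=lambda p: len(p) <= 2)
print(part, nxt)   # ['a', 'c'] ['d', 'b']
```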
27. Integrated Framework: Incremental Temporal Partitioning Algorithm
- Incremental VTTP
  - the node with the highest ASAP level is selected and moved
  - the other nodes are selected, in ASAP-level order, from the path on which the previously moved node was located
- Node selection and moving order: 15, 14, 6, 13, 12, 5, 11, 10, 4 and 7
28. Customizing the Mapping Tool
- Spiral-shaped mapping is possible thanks to the horizontal connections in the third and fourth rows of the RFU
29. Performance Evaluation
- Simulator configuration:
  - Issue width: 4-way
  - L1 I-cache: 32K, 2-way, 1-cycle latency
  - L1 D-cache: 32K, 4-way, 1-cycle latency
  - Unified L2: 1M, 6-cycle latency
  - Execution units: 4 integer, 4 floating point
  - RUU size: 64
  - Fetch queue size: 64
- SimpleScalar was configured to behave as a 4-issue in-order RISC processor; the base processor supports the MIPS instruction set
- 22 applications of MiBench
30. Delay of the RFU According to CI Length
- CI length vs. RFU delay (ns):
  - 1: 1.38
  - 2: 2.28
  - 3: 3.12
  - 4: 4.89
  - 5: 6.47
  - 6: 7.57
  - 7: 8.65
  - 8: 9.66
- Synthesized with Synopsys tools, Hitachi 0.18 µm
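Since the RFU has a variable delay that depends on CI size (slide 11), the table lends itself to a simple lookup when modeling execution time. The 2.5 ns clock period used below to convert delay into whole cycles is an illustrative assumption, not a figure from the slides:

```python
import math

# Synthesized RFU delay (ns) per CI length, from the table above.
RFU_DELAY_NS = {1: 1.38, 2: 2.28, 3: 3.12, 4: 4.89,
                5: 6.47, 6: 7.57, 7: 8.65, 8: 9.66}

def rfu_cycles(ci_length, clock_ns=2.5):
    # Round the combinational RFU delay up to whole processor cycles.
    return math.ceil(RFU_DELAY_NS[ci_length] / clock_ns)

print(rfu_cycles(3))   # 2
print(rfu_cycles(8))   # 4
```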
31. CI Lengths for MiBench Applications
32. Intermediate Data Size
33. Maximum Critical Path Lengths for CIs
34. Speedup Comparison
35. References
- Arnold, M., Corporaal, H., "Designing domain-specific processors," In Proceedings of the Design, Automation and Test in Europe Conference, 2001, pp. 61-66.
- Atasu, K., Pozzi, L., Ienne, P., "Automatic application-specific instruction-set extensions under microarchitectural constraints," 40th Design Automation Conference, 2003.
- Bobda, C., "Synthesis of dataflow graphs for reconfigurable systems using temporal partitioning and temporal placement," Ph.D. thesis, Faculty of Computer Science, Electrical Engineering and Mathematics, University of Paderborn, 2003.
- Clark, N., Kudlur, M., Park, H., Mahlke, S., Flautner, K., "Application-specific processing on a general-purpose core via transparent instruction set customization," In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, 2004.
- Karthikeya, M., Gajjala, P., Dinesh, B., "Temporal partitioning and scheduling data flow graphs for reconfigurable computers," IEEE Transactions on Computers, vol. 48, no. 6, 1999, pp. 579-590.
36. References
- Kastner, R., Kaplan, A., Ogrenci Memik, S., Bozorgzadeh, E., "Instruction generation for hybrid reconfigurable systems," ACM TODAES, vol. 7, no. 4, 2002, pp. 605-627.
- Ouaiss, I., Govindarajan, S., Srinivasan, V., Kaul, M., Vemuri, R., "An integrated partitioning and synthesis system for dynamically reconfigurable multi-FPGA architectures," In Proceedings of the Reconfigurable Architectures Workshop, 1998, pp. 31-36.
- Spillane, J., Owen, H., "Temporal partitioning for partially reconfigurable field programmable gate arrays," IPPS/SPDP Workshops, 1998, pp. 37-42.
- Tanougast, C., Berviller, Y., Brunet, P., Weber, S., Rabah, H., "Temporal partitioning methodology optimizing FPGA resources for dynamically reconfigurable embedded real-time systems," International Journal of Microprocessors and Microsystems, vol. 27, 2003, pp. 115-130.
- Yu, P., Mitra, T., "Characterizing embedded applications for instruction-set extensible processors," In Proceedings of the Design Automation Conference, 2004, pp. 723-728.
37. Thank you for listening