Title: Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit
1. Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit
- Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Koji Inoue, Mehdi Sedighi
- Computer and IT Engineering Department, Amirkabir University of Technology - {mehdipur, szamani, msedighi}_at_aut.ac.ir
- Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University - noori_at_c.csce.kyushu-u.ac.jp, {murakami, inoue}_at_i.kyushu-u.ac.jp
2. Agenda
- Introduction
- Application-specific instruction set extension
- Temporal partitioning
- Some definitions
- General overview of the architecture
- RFU architecture: a quantitative approach
- Generating custom instructions
- Mapping custom instructions
- Integrating the RFU with the base processor
- Integrated framework for generating and mapping custom instructions
- Performance evaluation
- References
3. Introduction
- An extensible processor with a reconfigurable functional unit (RFU)
  - can be an alternative to General Purpose Processors (GPPs), Application-Specific Integrated Circuits (ASICs) and Application-Specific Instruction set Processors (ASIPs)
  - achieves enhanced performance in embedded systems
- ASICs
  - not flexible
  - expensive and time-consuming design process
- GPPs
  - very flexible
  - may not offer the necessary performance
4. Introduction
- ASIPs
  - more flexible than ASICs
  - more potential to meet the high-performance demands of embedded applications, compared to GPPs
  - require generating a complete instruction set architecture for the targeted application
  - a full-custom solution is too expensive and has long design turnaround times
5. Application-Specific Instruction Set Extension
- Another method for performance improvement
- An extensible processor with a reconfigurable functional unit
  - favorable tradeoff between efficiency and flexibility
  - keeps design turnaround time much shorter
- Critical portions of an application's dataflow graph (DFG) are accelerated using custom functional units
  - nodes of the DFG → instructions of the critical portions
  - edges of the DFG → dependencies between instructions
6. Temporal Partitioning
- Partitioning a data flow graph into a number of partitions such that
  - each partition can fit into the target hardware, and
  - dependencies among the graph nodes are not violated
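The two conditions above can be sketched as a small list-style partitioner. This is a minimal sketch assuming an acyclic DFG encoded as a mapping from each node to its predecessors; the encoding, function name and `capacity` parameter are illustrative, not the paper's data structures:

```python
# Minimal temporal-partitioning sketch: split a DFG into an ordered list
# of partitions so that no partition exceeds the hardware capacity and
# every node is placed no earlier than all of its predecessors.
# Assumes an acyclic DFG given as {node: [predecessor nodes]}.
def temporal_partition(deps, capacity):
    placed = set()        # nodes already assigned to earlier partitions
    partitions = []
    remaining = set(deps)
    while remaining:
        # A node is ready once all its predecessors sit in earlier partitions.
        ready = sorted(n for n in remaining
                       if all(p in placed for p in deps[n]))
        part = ready[:capacity]          # respect the hardware size limit
        partitions.append(part)
        placed.update(part)
        remaining.difference_update(part)
    return partitions

# Example: a diamond-shaped DFG on hardware that holds two nodes.
dfg = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(temporal_partition(dfg, 2))   # [['a'], ['b', 'c'], ['d']]
```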
7. Some Definitions
- Hot Basic Block (HBB)
  - a basic block whose execution frequency is greater than a threshold specified in the profiler
- Custom Instructions (CIs)
  - extensions of the Instruction Set Architecture (ISA) that are executed on the RFU
- Reconfigurable Functional Unit (RFU)
  - custom hardware for executing CIs
8. General Overview of the Architecture
[Block diagram: the Adaptive Dynamic Extensible Processor. The base processor is an N-way in-order general RISC with Fetch, Decode, Execute, Memory and Write stages and a register file. The augmented hardware comprises a Profiler, which detects the start addresses of Hot Basic Blocks (HBBs); an RFU, which executes Custom Instructions; and a Sequencer, which switches between the main processor and the RFU.]
9. Operation Modes
[Diagram: the processor alternates between two modes. In Training Mode, binary-level profiling detects the start addresses of HBBs; tools are then run to generate Custom Instructions, generate configuration data for the ACC, and initialize the sequencer table, followed by binary rewriting. In Normal Mode, the sequencer monitors the PC and switches between the main processor and the ACC, which executes the CIs.]
10. Tool Chain
11. Reconfigurable Functional Unit (RFU)
- The RFU is a matrix of Functional Units (FUs)
- The RFU has a two-level configuration memory
  - a multi-context memory (keeps two or four configurations)
  - a cache
- FUs support only logical operations, add/subtract, shifts and compare
- The RFU updates the PC
- The RFU has a variable delay, which depends on the size of the Custom Instruction
12. RFU Architecture: A Quantitative Approach
- 22 programs of MiBench were chosen
- The SimpleScalar toolset was used for simulation
- The RFU is a matrix of FUs; design parameters:
  - number of inputs
  - number of outputs
  - number of FUs
  - connections
  - location of inputs and outputs
- Some definitions
  - measurements consider both frequency and weight
  - CI execution frequency
  - weight (so that frequency × weight equals the number of executed instructions)
  - average over all CIs (Σ Freq × Weight)
- Rejection: percentage of CIs that could not be mapped on the RFU
- Coverage: percentage of CIs that could be mapped on the RFU
- Basic block: a sequence of instructions terminating in a control instruction
- Hot Basic Block: a basic block executed more than a threshold number of times
13. RFU Architecture
- Distribution of inputs across rows
  - Row 1: 7
  - Row 2: 2
  - Row 3: 2
  - Row 4: 2
  - Row 5: 1
- Connections with variable length
  - row 1 → row 3: 1
  - row 1 → row 4: 1
  - row 1 → row 5: 1
  - row 2 → row 4: 1
- Synthesis results using Hitachi 0.18 µm: Area = 1.1534 mm², Delay = 9.66 ns
14. Integrating the RFU with the Base Processor
[Datapath diagram: the register file (Reg0 ... Reg31) feeds the DEC/EXE pipeline registers; the RFU, a matrix of FUs (FU1 ... FU4 shown), sits between the DEC/EXE and EXE/MEM pipeline registers and is driven by a decoder, a configuration memory and the sequencer.]
15. Generation of Custom Instructions
- Custom instructions
  - exclude floating-point, multiply, divide and load instructions
  - include at most one STORE, at most one BRANCH/JUMP, and any other fixed-point instructions
- A simple algorithm is used for generating custom instructions
  - HBBs usually include 10 to 40 instructions for MiBench
- The custom instruction generator is executed on the base processor (in online training mode)
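The inclusion and exclusion rules above amount to a simple legality check over a candidate group of instructions. The sketch below uses illustrative MIPS-style mnemonics for the opcode sets; the actual generator's tables are not given in the slides:

```python
# Legality check for a candidate CI: no floating-point, multiply, divide
# or load instructions; at most one STORE and at most one BRANCH/JUMP;
# any number of other fixed-point instructions. Opcode names are
# illustrative MIPS-style mnemonics, not the tool's actual tables.
EXCLUDED = {"add.s", "mul.s", "mult", "div", "lw", "lb", "lh"}
STORE = {"sw", "sb", "sh"}
BRANCH = {"beq", "bne", "j", "jal", "jr"}

def is_valid_ci(opcodes):
    if any(op in EXCLUDED for op in opcodes):
        return False
    if sum(op in STORE for op in opcodes) > 1:
        return False
    if sum(op in BRANCH for op in opcodes) > 1:
        return False
    return True

print(is_valid_ci(["add", "sll", "xor", "sw", "beq"]))   # True
print(is_valid_ci(["add", "lw", "sub"]))                 # False (contains a load)
```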
16. Mapping Custom Instructions
- Mapping is essentially the well-known placement problem
  - determining appropriate positions for DFG nodes on the RFU
- CI instructions are assigned to FUs based on the priority of the nodes
17. Mapping Custom Instructions
- The slack of each node represents its criticality and hence its priority for partitioning
- A slack of 0 means the node is on the critical path of the DFG and should be scheduled with the highest priority
- For nodes with the same criticality, their ASAP level determines the mapping order
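As a sketch of how the ASAP levels and slack used for prioritization might be computed (the DFG encoding as node → predecessors and the function names are illustrative assumptions, not the paper's implementation):

```python
# ASAP level: the earliest level a node can occupy, given its predecessors.
# Assumes an acyclic DFG given as {node: [predecessor nodes]}.
def asap_levels(deps):
    levels = {}
    def level(n):
        if n not in levels:
            levels[n] = 0 if not deps[n] else 1 + max(level(p) for p in deps[n])
        return levels[n]
    for n in deps:
        level(n)
    return levels

# Slack = ALAP - ASAP; a slack of 0 marks nodes on the critical path.
def node_slack(deps):
    asap = asap_levels(deps)
    depth = max(asap.values())
    succs = {n: [m for m in deps if n in deps[m]] for n in deps}
    alap = {}
    def alap_level(n):
        if n not in alap:
            alap[n] = depth if not succs[n] else min(alap_level(s) for s in succs[n]) - 1
        return alap[n]
    return {n: alap_level(n) - asap[n] for n in deps}

dfg = {"a": [], "b": ["a"], "c": [], "d": ["b", "c"]}
print(node_slack(dfg))   # {'a': 0, 'b': 0, 'c': 1, 'd': 0}
```

Here "c" is the only node off the critical path a → b → d, so it is the only node with nonzero slack.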
18. Mapping Algorithm (1/2)
- First step: determine an appropriate row for the node
  - Row number = last row, if the selected node is on a critical path whose length is greater than or equal to the RFU depth
  - Row number = ALAP - slack - 1, to prevent FUs in the lower RFU rows from being occupied by nodes that do not belong to critical paths
19. Mapping Algorithm (2/2)
- Second step: determine an appropriate column
  - chosen according to the minimum-connection-length criterion
- For each row, a maximum capacity is imposed to prevent too many nodes from gathering in one row
  - the capacity of the rows is determined with respect to the longest critical path and the number of critical paths in the DFG
20. An Example: Mapping a CI on the RFU
21. Generating Custom Instructions for the Target RFU
- Our primary CI generator did not consider any constraints on the generated CIs and tried to generate CIs as large as possible
- Therefore, some of the generated CIs cannot be mapped on the proposed RFU due to its constraints
22. Customizing the CI Generator for the Target RFU: First Approach
- Some primary constraints of the RFU (number of inputs, number of outputs and number of nodes) were added to our CI generator tool so that it generates mappable CIs
- In this approach the CI generator is unaware of the mapping process results
  - some CIs may still fail to map to the RFU due to routing constraints
23. Customizing the CI Generator for the Target RFU: Second Approach
- Integrated framework
  - performs an integrated temporal partitioning and mapping process
  - takes rejected CIs as input
  - partitions them into appropriately sized, mappable CIs
  - adds nodes to the current partition while the architectural constraints are satisfied
  - the ASAP level of the nodes represents their execution order according to their dependencies
- Advantages
  - reduces the number of rejected CIs
  - uses a mapping-aware temporal partitioning process
24. Integrated Framework: Temporal Partitioning Algorithms
- HTTP (horizontal traversal)
  - traverses DFG nodes horizontally, according to their ASAP levels
  - usually yields more parallelism for instruction execution
  - may require large intermediate data; the size of the intermediate data affects the data transfer rate and the size of the configuration memory
- VTTP (vertical traversal)
  - traverses DFG nodes vertically
  - creates partitions with longer critical paths
  - reduces the size of the intermediate data
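The two traversal orders might be contrasted as below. This is a sketch assuming a DFG encoded as node → predecessors with precomputed ASAP levels; function names and tie-breaking rules are illustrative, not the paper's:

```python
# HTTP-style order: visit nodes level by level (by ASAP level), grouping
# independent nodes together and exposing parallelism.
def http_order(deps, asap):
    return sorted(deps, key=lambda n: (asap[n], n))

# VTTP-style order: follow dependence paths downward, which tends to give
# partitions with longer critical paths but less intermediate data.
def vttp_order(deps, asap):
    succs = {n: sorted(m for m in deps if n in deps[m]) for n in deps}
    order, seen = [], set()
    def walk(n):
        if n in seen:
            return
        seen.add(n)
        order.append(n)
        for s in succs[n]:
            # descend only once the successor's inputs are all available
            if all(p in seen for p in deps[s]):
                walk(s)
    for root in sorted(n for n in deps if not deps[n]):
        walk(root)
    for n in http_order(deps, asap):   # pick up any remaining join nodes
        walk(n)
    return order

dfg = {"a": [], "b": ["a"], "c": [], "d": ["b", "c"]}
asap = {"a": 0, "b": 1, "c": 0, "d": 2}
print(http_order(dfg, asap))   # ['a', 'c', 'b', 'd']
print(vttp_order(dfg, asap))   # ['a', 'b', 'c', 'd']
```

Note how the horizontal order interleaves the independent nodes "a" and "c", while the vertical order stays on the path a → b as long as it can.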
25. Integrated Framework: Incremental Temporal Partitioning Algorithm
- The incremental temporal partitioning process is performed iteratively
  - each partition that does not satisfy the RFU constraints is modified, and a new iteration starts
- Two different partition-modification strategies are used, one for HTTP and one for VTTP
  - the main difference is in how the nodes to be moved to the next partition are selected
26. Integrated Framework: Incremental Temporal Partitioning Algorithm
- Incremental HTTP
  - the node with the highest ASAP level is selected and moved to the subsequent partition
- Node selection and moving order: 15, 13, 11, 9, 14, 12, 10, 8, 3 and 7
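The incremental HTTP modification step can be sketched as follows. Here `fits` stands in for the real RFU constraint check (inputs, outputs, node count, routing), which the slides leave abstract, and ties between equal ASAP levels are broken by list order rather than by the slides' exact rule:

```python
# Incremental HTTP step: while the partition violates a constraint, move
# the node with the highest ASAP level into the next partition.
def incremental_http(partition, next_partition, asap, fits):
    part = list(partition)
    moved = []
    while part and not fits(part):
        victim = max(part, key=lambda n: asap[n])   # highest ASAP level wins
        part.remove(victim)
        moved.append(victim)
    return part, moved + list(next_partition)

# Example: shrink a four-node partition so at most two nodes remain.
asap = {"a": 0, "b": 1, "c": 1, "d": 2}
part, nxt = incremental_http(["a", "b", "c", "d"], [], asap,
                             fits=lambda p: len(p) <= 2)
print(part, nxt)   # ['a', 'c'] ['d', 'b']
```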
27. Integrated Framework: Incremental Temporal Partitioning Algorithm
- Incremental VTTP
  - the node with the highest ASAP level is selected and moved
  - the other nodes are selected, in ASAP-level order, from the path on which the previously moved node was located
- Node selection and moving order: 15, 14, 6, 13, 12, 5, 11, 10, 4 and 7
28. Customizing the Mapping Tool
- Spiral-shaped mapping is possible thanks to the horizontal connections in the third and fourth rows of the RFU
29. Performance Evaluation
- Simulator configuration:
  - Issue width: 4-way
  - L1 I-cache: 32K, 2-way, 1-cycle latency
  - L1 D-cache: 32K, 4-way, 1-cycle latency
  - Unified L2: 1M, 6-cycle latency
  - Execution units: 4 integer, 4 floating point
  - RUU size: 64
  - Fetch queue size: 64
- SimpleScalar was configured to behave as a 4-issue in-order RISC processor; the base processor supports the MIPS instruction set
- 22 applications of MiBench
30. Delay of the RFU According to CI Length
- CI length vs. RFU delay (ns):
  - 1: 1.38
  - 2: 2.28
  - 3: 3.12
  - 4: 4.89
  - 5: 6.47
  - 6: 7.57
  - 7: 8.65
  - 8: 9.66
- Synthesized with Synopsys tools, Hitachi 0.18 µm
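Since the RFU has a variable delay that depends on CI size (slide 11), the table lends itself to a simple lookup when modeling execution time. The 2.5 ns clock period used below to convert delay into whole cycles is an illustrative assumption, not a figure from the slides:

```python
import math

# Synthesized RFU delay (ns) per CI length, from the table above.
RFU_DELAY_NS = {1: 1.38, 2: 2.28, 3: 3.12, 4: 4.89,
                5: 6.47, 6: 7.57, 7: 8.65, 8: 9.66}

def rfu_cycles(ci_length, clock_ns=2.5):
    # Round the combinational RFU delay up to whole processor cycles.
    return math.ceil(RFU_DELAY_NS[ci_length] / clock_ns)

print(rfu_cycles(3))   # 2
print(rfu_cycles(8))   # 4
```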
31. CI Lengths for MiBench Applications
32. Intermediate Data Size
33. Maximum Critical Path Lengths for CIs
34. Speedup Comparison
35. References
- Arnold, M., Corporaal, H., "Designing domain-specific processors," In Proceedings of the Design, Automation and Test in Europe Conference, 2001, pp. 61-66.
- Atasu, K., Pozzi, L., Ienne, P., "Automatic application-specific instruction-set extensions under microarchitectural constraints," 40th Design Automation Conference, 2003.
- Bobda, C., "Synthesis of dataflow graphs for reconfigurable systems using temporal partitioning and temporal placement," Ph.D. thesis, Faculty of Computer Science, Electrical Engineering and Mathematics, University of Paderborn, 2003.
- Clark, N., Kudlur, M., Park, H., Mahlke, S., Flautner, K., "Application-specific processing on a general-purpose core via transparent instruction set customization," In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, 2004.
- Karthikeya, M., Gajjala, P., Dinesh, B., "Temporal partitioning and scheduling data flow graphs for reconfigurable computers," IEEE Transactions on Computers, vol. 48, no. 6, 1999, pp. 579-590.
36. References
- Kastner, R., Kaplan, A., Ogrenci Memik, S., Bozorgzadeh, E., "Instruction generation for hybrid reconfigurable systems," ACM TODAES, vol. 7, no. 4, 2002, pp. 605-627.
- Ouaiss, I., Govindarajan, S., Srinivasan, V., Kaul, M., Vemuri, R., "An integrated partitioning and synthesis system for dynamically reconfigurable multi-FPGA architectures," In Proceedings of the Reconfigurable Architectures Workshop, 1998, pp. 31-36.
- Spillane, J., Owen, H., "Temporal partitioning for partially reconfigurable field programmable gate arrays," IPPS/SPDP Workshops, 1998, pp. 37-42.
- Tanougast, C., Berviller, Y., Brunet, P., Weber, S., Rabah, H., "Temporal partitioning methodology optimizing FPGA resources for dynamically reconfigurable embedded real-time systems," International Journal of Microprocessors and Microsystems, vol. 27, 2003, pp. 115-130.
- Yu, P., Mitra, T., "Characterizing embedded applications for instruction-set extensible processors," In Proceedings of the Design Automation Conference, 2004, pp. 723-728.
37. Thank you for listening