Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit

1
Custom Instruction Generation Using Temporal
Partitioning Techniques for a Reconfigurable
Functional Unit
  • Farhad Mehdipour, Hamid Noori, Morteza Saheb
    Zamani, Kazuaki Murakami, Koji Inoue, Mehdi
    Sedighi
  • Computer and IT Engineering Department,
    Amirkabir University of Technology
  • {mehdipur, szamani, msedighi}@aut.ac.ir
  • Department of Informatics, Graduate School of
    Information Science and Electrical Engineering,
    Kyushu University
  • noori@c.csce.kyushu-u.ac.jp, {murakami, inoue}@i.kyushu-u.ac.jp

2
Agenda
  • Introduction
  • Application-specific instruction set extension
  • Temporal Partitioning
  • Some Definitions
  • General overview of the architecture
  • RFU Architecture: A Quantitative Approach
  • Generating Custom Instructions
  • Mapping Custom Instructions
  • Integrating RFU with base processor
  • Integrated framework for generating and mapping
    custom instructions
  • Performance Evaluation
  • References

3
Introduction
  • An extensible processor with a reconfigurable
    functional unit (RFU) can be an alternative to
    General Purpose Processors (GPPs),
    Application-Specific Integrated Circuits (ASICs)
    and Application-Specific Instruction set
    Processors (ASIPs) to achieve enhanced
    performance in embedded systems
  • ASICs: not flexible; expensive and
    time-consuming design process
  • GPPs: very flexible; may not offer the necessary
    performance
4
Introduction
  • ASIPs are more flexible than ASICs
  • ASIPs have more potential than GPPs to meet the
    high-performance demands of embedded
    applications
  • ASIPs require generating a complete instruction
    set architecture for the targeted application
  • A full-custom solution is too expensive and has
    long design turnaround times

5
Application-specific instruction set extension
  • Another method for performance improvement
  • An extensible processor with a reconfigurable
    functional unit offers a favorable tradeoff
    between efficiency and flexibility while keeping
    design turnaround time much shorter.
  • Critical portions of an application's dataflow
    graph (DFG) are accelerated by using custom
    functional units
  • Nodes of the DFG -> instructions of the critical
    portions
  • Edges of the DFG -> dependencies between
    instructions

6
Temporal Partitioning
  • Partitioning a data flow graph into a number of
    partitions such that
  • each partition can fit into the target hardware
    and
  • dependencies among the graph nodes are not
    violated.
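
The two conditions above can be illustrated with a minimal sketch. It assumes the only hardware constraint is a maximum node count per partition (the hypothetical `capacity` parameter), and it simply cuts a topological order into chunks. It is not the partitioning algorithm of the presentation, only an instance of the definition.

```python
from collections import deque

def topological_order(nodes, edges):
    """Kahn's algorithm: every edge (u, v) has u before v in the result."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for s in successors[n]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle, so it is not a valid DFG")
    return order

def temporal_partition(nodes, edges, capacity):
    """Cut the topological order into chunks of at most `capacity` nodes."""
    order = topological_order(nodes, edges)
    return [order[i:i + capacity] for i in range(0, len(order), capacity)]

def respects_dependencies(partitions, edges):
    """True if no node is placed in an earlier partition than its producer."""
    index = {n: i for i, part in enumerate(partitions) for n in part}
    return all(index[u] <= index[v] for u, v in edges)

# Hypothetical six-node DFG, used only for illustration.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 3), (2, 3), (3, 5), (4, 5), (5, 6)]
parts = temporal_partition(nodes, edges, capacity=4)
print(parts, respects_dependencies(parts, edges))
```

Cutting a topological order guarantees the dependency condition. A real RFU would add further checks per partition, such as input and output counts.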

7
Some definitions
  • Hot Basic Block (HBB): a basic block whose
    execution frequency is greater than a given
    threshold specified in the profiler
  • Custom Instructions (CIs): extensions to the
    Instruction Set Architecture (ISA) that are
    executed on the RFU
  • Reconfigurable Functional Unit (RFU): custom
    hardware for executing CIs

8
General overview of the architecture
  • Adaptive Dynamic Extensible Processor
  • Base Processor: N-way in-order general RISC
    (Fetch, Decode, Execute, Memory and Write
    stages, plus the Reg File)
  • Augmented Hardware: Profiler, RFU and Sequencer
  • Profiler: detects start addresses of Hot Basic
    Blocks (HBBs)
  • Sequencer: switches between main processor and
    RFU
  • RFU: executes Custom Instructions

9
Operation modes
  • Training Mode: binary-level profiling of the
    applications, detecting start addresses of HBBs
  • Training Mode (continued): running tools for
    generating custom instructions, generating
    configuration data for the ACC and initializing
    the sequencer table; binary rewriting
  • Normal Mode: monitoring the PC and switching
    between the main processor and the ACC;
    executing CIs
  • Blocks shown in each mode: Applications,
    Processor, Profiler, ACC, Sequencer

10
Tool Chain

11
Reconfigurable Functional Unit (RFU)
  • RFU is a matrix of Functional Units (FUs)
  • RFU has a two-level configuration memory: a
    multi-context memory (holds two or four
    configurations) and a cache
  • FUs support only logical operations,
    add/subtract, shifts and compare
  • RFU updates the PC
  • RFU has a variable delay which depends on the
    size of the Custom Instruction

12
RFU Architecture: A Quantitative Approach
  • 22 programs of MiBench were chosen
  • SimpleScalar toolset was utilized for simulation
  • RFU is a matrix of FUs; parameters studied:
    number of inputs, number of outputs, number of
    FUs, connections, and location of inputs and
    outputs
  • Some definitions
  • Frequency and weight are both considered in the
    measurements
  • CI execution frequency
  • Weight (equal to the number of executed
    instructions)
  • Average for all CIs (Σ Freq × Weight)
  • Rejection: percentage of CIs that could not be
    mapped on the RFU
  • Coverage: percentage of CIs that could be mapped
    on the RFU
  • Basic Block: a sequence of instructions that
    terminates in a control instruction
  • Hot Basic Block: a basic block executed more
    than a threshold number of times
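
As a rough illustration of the definitions above, the sketch below computes coverage and rejection with each CI weighted by frequency × weight. The slide does not give the exact formula, so the normalization by the total of all CIs is an assumption. The function name and the sample data are hypothetical.

```python
def coverage_and_rejection(cis):
    """cis: list of (frequency, weight, mapped) tuples.

    Assumption: each CI contributes frequency * weight, and coverage is the
    mapped share of that total, expressed as a percentage.
    """
    total = sum(freq * weight for freq, weight, _ in cis)
    if total == 0:
        return 0.0, 0.0
    covered = sum(freq * weight for freq, weight, mapped in cis if mapped)
    coverage = 100.0 * covered / total
    return coverage, 100.0 - coverage

# Hypothetical CIs: (execution frequency, number of instructions, mappable?)
cis = [(1000, 6, True), (200, 10, False), (500, 4, True)]
coverage, rejection = coverage_and_rejection(cis)
print(coverage, rejection)
```

Weighting in this way means that a rarely executed CI that fails to map costs far less coverage than a hot one.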

13
RFU Architecture
  • Distributing inputs in different rows
  • Row 1: 7
  • Row 2: 2
  • Row 3: 2
  • Row 4: 2
  • Row 5: 1
  • Connections with variable length
  • row 1 → row 3: 1
  • row 1 → row 4: 1
  • row 1 → row 5: 1
  • row 2 → row 4: 1

Synthesis results using Hitachi 0.18 µm: area
1.1534 mm², delay 9.66 ns

14
Integrating RFU with the Base Processor
  • Figure labels: register file (Reg0 ... Reg31),
    Config Mem, Decoder, Sequencer, DEC/EXE pipeline
    registers, FU1 to FU4 inside the RFU, EXE/MEM
    pipeline registers

15
Generation of Custom Instructions
  • Custom instructions exclude floating point,
    multiply, divide and load instructions
  • Custom instructions include at most one STORE,
    at most one BRANCH/JUMP and all other fixed
    point instructions
  • A simple algorithm is used for generating custom
    instructions
  • HBBs usually include 10-40 instructions for
    MiBench
  • The custom instruction generator is intended to
    run on the base processor (in online training
    mode)
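
The inclusion and exclusion rules above can be written as a small validity check. This is only a sketch: the operation class names (`"fp"`, `"mul"`, `"store"` and so on) are hypothetical labels chosen for illustration, and the generation algorithm itself is not shown because the slides do not describe it.

```python
# Hypothetical operation classes; a real tool would derive them from opcodes.
EXCLUDED = {"fp", "mul", "div", "load"}

def is_valid_ci(op_classes):
    """Check a candidate CI against the rules listed on the slide."""
    if any(op in EXCLUDED for op in op_classes):
        return False  # floating point, multiply, divide and load are excluded
    if op_classes.count("store") > 1:
        return False  # at most one STORE
    control = op_classes.count("branch") + op_classes.count("jump")
    if control > 1:
        return False  # at most one BRANCH/JUMP
    return True

print(is_valid_ci(["add", "shift", "store", "branch"]))
print(is_valid_ci(["add", "mul"]))
```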

16
Mapping Custom Instructions
  • Mapping is the same as the well-known placement
    problem
  • Determining the appropriate positions for DFG
    nodes on the RFU.
  • Assigning CI instructions to FUs is done based on
    the priority of the nodes.

17
Mapping Custom Instructions
  • The slack of each node represents its
    criticality and also its priority for
    partitioning.
  • A slack of 0 means that the node is on the
    critical path of the DFG and should be scheduled
    with the highest priority.
  • For nodes with the same criticality, their ASAP
    level determines their mapping order.
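
The priority rule above can be sketched as follows, assuming unit-latency nodes, levels numbered from 1, and slack defined as ALAP level minus ASAP level. The function names and the four-node DFG are hypothetical.

```python
def asap_levels(nodes, edges):
    """ASAP level: 1 for nodes without predecessors, else 1 + max over predecessors."""
    preds = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
    level = {}
    def visit(n):
        if n not in level:
            level[n] = 1 + max((visit(p) for p in preds[n]), default=0)
        return level[n]
    for n in nodes:
        visit(n)
    return level

def alap_levels(nodes, edges, depth):
    """ALAP level: `depth` for nodes without successors, else min over successors - 1."""
    succs = {n: [] for n in nodes}
    for u, v in edges:
        succs[u].append(v)
    level = {}
    def visit(n):
        if n not in level:
            level[n] = min((visit(s) for s in succs[n]), default=depth + 1) - 1
        return level[n]
    for n in nodes:
        visit(n)
    return level

def mapping_order(nodes, edges):
    """Sort by slack first (0 = critical path), then by ASAP level."""
    asap = asap_levels(nodes, edges)
    alap = alap_levels(nodes, edges, depth=max(asap.values()))
    slack = {n: alap[n] - asap[n] for n in nodes}
    return sorted(nodes, key=lambda n: (slack[n], asap[n])), slack

nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("d", "c")]
order, slack = mapping_order(nodes, edges)
print(order, slack)
```

Here `a`, `b` and `c` form the critical path and are mapped first. Node `d` can be placed one level later without lengthening the path, so its slack is 1.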

18
Mapping Algorithm (1/2)
  • First step: determining an appropriate row for
    the node
  • Row number = last row (if the selected node is
    on a critical path whose length is greater than
    or equal to the RFU depth)
  • Row number = ALAP - slack - 1 (to prevent FUs in
    the lower RFU rows from being occupied by nodes
    that do not belong to critical paths)

19
Mapping Algorithm (2/2)
  • Second step: determining an appropriate column
  • The column is determined according to the
    minimum connection length criterion.
  • For each row, a maximum capacity is considered
    to prevent too many nodes from gathering in one
    row.
  • The capacity of the rows is determined with
    respect to the longest critical path and the
    number of critical paths in the DFG.

20
An Example Mapping of a CI on the RFU

21
Generating Custom Instruction for the Target RFU
  • In our initial CI generator we did not consider
    any constraints for the generated CIs and tried
    to generate CIs that were as large as possible.
  • Therefore, some of the generated CIs cannot be
    mapped on the proposed RFU due to its
    constraints.

22
Customizing CI generator for the Target RFU:
First Approach
  • Some primary constraints of the RFU (number of
    inputs, number of outputs and number of nodes)
    were added to our CI generator tool to generate
    CIs that are mappable.
  • In this approach the CI generator is unaware of
    the results of the mapping process
  • Some CIs may ultimately fail to map to the RFU
    due to the routing constraints

23
Customizing CI generator for the Target RFU:
Second Approach
  • Integrated framework
  • Performs an integrated temporal partitioning and
    mapping process
  • Takes rejected CIs as input
  • Partitions them into appropriate mappable CIs
  • Adds nodes to the current partition while the
    architectural constraints are satisfied
  • The ASAP level of the nodes represents their
    execution order according to their dependencies
  • Advantages: reduces the number of rejected CIs;
    uses a mapping-aware temporal partitioning
    process

24
Integrated Framework: Temporal Partitioning
Algorithms
  • HTTP traverses the DFG nodes horizontally
    according to the ASAP level of the nodes
  • HTTP usually brings about more parallelism for
    instruction execution, but may require large
    intermediate data
  • The size of intermediate data affects the data
    transfer rate and the size of the configuration
    memory.
  • VTTP traverses the DFG nodes vertically
  • VTTP creates partitions with longer critical
    paths
  • VTTP reduces the size of intermediate data
25
Integrated Framework- Incremental Temporal
Partitioning Algorithm
  • The incremental temporal partitioning process is
    performed iteratively
  • Each partition that does not satisfy the RFU
    constraints is modified, and a new iteration
    starts.
  • Two different partition modification strategies
    are used for HTTP and VTTP
  • The main difference is in the way of selecting
    the nodes to be moved to the next partition.

26
Integrated Framework- Incremental Temporal
Partitioning Algorithm
  • Incremental HTTP
  • The node with the highest ASAP level is selected
    and moved to the subsequent partition.
  • Node selection and moving order: 15, 13, 11, 9,
    14, 12, 10, 8, 3 and 7.
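
A minimal sketch of the incremental HTTP idea follows. The RFU constraints are stood in for by a caller-supplied predicate `fits` (hypothetical), and the ASAP levels are passed in precomputed. The node numbers are not those of the slide's figure, which is not part of this transcript.

```python
def incremental_http(nodes, asap, fits):
    """Move the highest-ASAP-level node to the next partition until each one fits.

    nodes: node ids; asap: dict node -> ASAP level;
    fits: predicate on a list of nodes, standing in for the RFU constraints.
    """
    partitions = []
    current = list(nodes)
    while current:
        moved = []
        while not fits(current):
            if len(current) == 1:
                raise ValueError("a single node does not satisfy the constraints")
            victim = max(current, key=lambda n: asap[n])  # first of the highest level
            current.remove(victim)
            moved.append(victim)
        partitions.append(current)
        current = moved  # the moved nodes seed the subsequent partition
    return partitions

# Hypothetical example: six nodes, at most four nodes per partition.
asap = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}
parts = incremental_http([1, 2, 3, 4, 5, 6], asap, fits=lambda p: len(p) <= 4)
print(parts)
```

Because a node with the highest ASAP level in a partition has no consumers left in that partition, moving it forward never violates a dependency.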

27
Integrated Framework- Incremental Temporal
Partitioning Algorithm
  • Incremental VTTP
  • A node with the highest ASAP level is selected
    and moved.
  • The remaining nodes are selected, in ASAP level
    order, from the path on which the previously
    moved node was located.
  • Node selection and moving order: 15, 14, 6, 13,
    12, 5, 11, 10, 4 and 7.

28
Customizing Mapping Tool
  • Spiral shaped mapping is possible thanks to the
    horizontal connections in the third and fourth
    rows of RFU

29
Performance Evaluation
Issue: 4-way
L1 I-cache: 32K, 2-way, 1 cycle latency
L1 D-cache: 32K, 4-way, 1 cycle latency
Unified L2: 1M, 6 cycle latency
Execution units: 4 integer, 4 floating point
RUU size: 64
Fetch queue size: 64
  • SimpleScalar was configured to behave as a
    4-issue in-order RISC processor. The base
    processor supports the MIPS instruction set.
  • 22 applications from MiBench
30
Delay of RFU according to CI length
CI Length    RFU Delay (ns)
1            1.38
2            2.28
3            3.12
4            4.89
5            6.47
6            7.57
7            8.65
8            9.66
  • Synopsys tools, Hitachi 0.18 µm
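
Since the RFU delay varies with CI length, a simulator needs some way to turn the table above into a latency. The sketch below rounds the delay up to whole clock cycles. The clock period is a hypothetical parameter that the slides do not state, and the rounding rule is an assumption.

```python
import math

# Delay values copied from the table above (CI length -> delay in ns).
RFU_DELAY_NS = {1: 1.38, 2: 2.28, 3: 3.12, 4: 4.89,
                5: 6.47, 6: 7.57, 7: 8.65, 8: 9.66}

def rfu_cycles(ci_length, clock_period_ns):
    """Cycles needed for a CI of the given length, rounded up to whole cycles."""
    return math.ceil(RFU_DELAY_NS[ci_length] / clock_period_ns)

# Hypothetical 2.5 ns clock period (400 MHz), chosen only for illustration.
for length in (1, 4, 8):
    print(length, rfu_cycles(length, clock_period_ns=2.5))
```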

31
CI length for MiBench applications

32
Intermediate data size

33
Maximum critical path length for CIs

34
Speedup comparison

35
References
  • Arnold, M., Corporaal, H., Designing
    domain-specific processors. In Proceedings of the
    Design, Automation and Test in Europe Conf, 2001,
    pp. 61-66.
  • Atasu, K., Pozzi, L., Ienne, P., Automatic
    application-specific instruction-set extensions
    under microarchitectural constraints, 40th Design
    Automation Conference, 2003.
  • Bobda, C., Synthesis of dataflow graphs for
    reconfigurable systems using temporal
    partitioning and temporal placement, Ph.D thesis,
    Faculty of Computer Science, Electrical
    Engineering and Mathematics, University of
    Paderborn, 2003.
  • Clark, N., Kudlur, M., Park, H., Mahlke, S.,
    Flautner, K., Application-Specific Processing on
    a General-Purpose Core via Transparent
    Instruction Set Customization, In Proceedings of
    the 37th annual IEEE/ACM International Symposium
    on Microarchitecture, 2004.
  • Purna, K. M. G., Bhatia, D., Temporal
    partitioning and scheduling data flow graphs for
    reconfigurable computers, IEEE Transactions on
    Computers, vol. 48, no. 6, 1999, pp. 579-590.

36
References
  • Kastner, R., Kaplan, A., Ogrenci Memik, S.,
    Bozorgzadeh, E., Instruction generation for
    hybrid reconfigurable systems, ACM TODAES, vol.
    7, no. 4, 2002, pp. 605-627.
  • Ouaiss, I., Govindarajan, S., Srinivasan, V.,
    Kaul M., Vemuri R., An integrated partitioning
    and synthesis system for dynamically
    reconfigurable multi-FPGA architectures, In
    Proceedings of the Reconfigurable Architecture
    Workshop, 1998, pp. 31-36.
  • Spillane, J., Owen, H., Temporal partitioning
    for partially reconfigurable field programmable
    gate arrays, IPPS/SPDP Workshops, 1998, pp.
    37-42.
  • Tanougast, C., Berviller, Y., Brunet, P., Weber,
    S., Rabah, H., Temporal partitioning methodology
    optimizing FPGA resources for dynamically
    reconfigurable embedded real-time system,
    International Journal of Microprocessors and
    Microsystems, vol. 27, 2003, pp. 115-130.
  • Yu, P., Mitra, T., Characterizing embedded
    applications for instruction-set extensible
    processors, In Proceedings of the Design
    Automation Conference, 2004, pp. 723-728.

37
  • Thank you for listening