Configuration Code Generation and Optimizations for Heterogeneous Reconfigurable DSPs, Suet-Fei Li

1
Configuration Code Generation and Optimizations
for Heterogeneous Reconfigurable DSPs
Suet-Fei Li
  • Motivation
  • Our work is on code generation and optimization
    process for reconfigurable architectures
    targeting digital signal processing and wireless
    communication applications.
  • The ability to generate efficient and compact
    code is essential for the success of
    reconfigurable architectures. Otherwise, the
    overhead of reconfiguring could easily become the
    system bottleneck.
  • Our code generation process includes the
    evaluation of a set of trade-offs in system design
    and software engineering, as well as the use of a
    set of local and global optimization techniques.
    By doing so we achieve significantly lower
    configuration overhead.
  • In the future, we will also approach the problem
    from the hardware side and push the system
    performance to a level that software alone
    cannot achieve.

2
Pleiades Architecture
  • Hardware
  • Consists of an embedded microprocessor and
    reconfigurable coprocessors of different
    programming granularities.
  • Supports dynamic configuration of the
    coprocessors and the interconnect.
  • Software
  • DSP applications (a real-time speech processor,
    etc.) are described in a high-level language
    (C or C++).
  • Complete software tool support is provided to
    implement the high-level description on Pleiades
    hardware in an optimized fashion according to a
    user-provided set of parameters (power,
    performance, or any mixture of the two).

3
Overall Software Flow: From high-level
description to implementation
  • The input to the flow is an algorithm specified
    in C/C++.
  • Perform mapping and partitioning on the Pleiades
    architecture according to a set of optimization
    parameters.
  • Divide the application program into two parts:
    the control section, which is implemented on
    the embedded microprocessor, and the
    computationally intensive loops (kernels), which
    are implemented in hardware.
  • These kernels are encapsulated in procedure calls
    and described in a high-level intermediate form.
    The intermediate form (currently implemented in
    C) is based on the concept of processes
    (satellites) and queues (connection between
    satellites).
  • The intermediate form has all satellite
    functionality and connectivity information and is
    used to generate efficient and compact
    configuration and interface code. A code
    generation library (corresponding to each
    satellite as well as the whole kernel) is
    provided to serve the purpose of automatic code
    generation. The library currently contains all
    the basic building blocks (satellites) in
    Pleiades. A kernel specified in the
    intermediate form is like a procedure call: the
    structure is fixed for each kernel, but each
    invocation of the kernel can pass in different
    parameters.
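
The satellite-and-queue intermediate form described above can be pictured with a small C++ sketch. This is only an illustration under stated assumptions: the type names (`Satellite`, `Queue`, `Kernel`, `Invocation`), the example kernel, and the configuration word encoding are all hypothetical and do not come from the Pleiades code generation library. The point it shows is that the structural part of the generated code is fixed per kernel, while only the parameter words change between invocations.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: a kernel is a fixed graph of satellites (processes)
// connected by queues. Names and encodings are invented for illustration.
struct Satellite {
    std::string name;  // instance name, e.g. "agp0"
    std::string kind;  // library building block it instantiates
};

struct Queue {
    std::size_t producer;  // index into Kernel::satellites
    std::size_t consumer;  // index into Kernel::satellites
};

struct Kernel {
    std::string name;
    std::vector<Satellite> satellites;
    std::vector<Queue> queues;
};

// Per-invocation parameters; the kernel structure itself does not change.
struct Invocation {
    unsigned vector_length;
    unsigned base_address;
};

// Example kernel: two address generators feeding one multiply-accumulate unit.
inline Kernel make_dot_product_kernel() {
    Kernel k;
    k.name = "dot_product";
    k.satellites.push_back(Satellite{"agp0", "address_generator"});
    k.satellites.push_back(Satellite{"agp1", "address_generator"});
    k.satellites.push_back(Satellite{"mac0", "multiply_accumulate"});
    k.queues.push_back(Queue{0, 2});
    k.queues.push_back(Queue{1, 2});
    return k;
}

// Emits one word per satellite, one per queue, then the two parameter words.
inline std::vector<unsigned> generate_config(const Kernel& k, const Invocation& inv) {
    std::vector<unsigned> words;
    for (std::size_t i = 0; i < k.satellites.size(); ++i)
        words.push_back(0x1000u + static_cast<unsigned>(i));  // satellite setup
    for (const Queue& q : k.queues)
        words.push_back(0x2000u + static_cast<unsigned>(q.producer * 16 + q.consumer));
    words.push_back(inv.vector_length);  // parameters vary per call
    words.push_back(inv.base_address);
    return words;
}
```

Calling `generate_config` twice on the same kernel with different `Invocation` values yields word lists that differ only in the trailing parameter words, which mirrors the procedure-call analogy.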

4
Issues and trade-offs in code generation
  • Static vs. dynamic generation
  • Static means the configuration and interface code
    can be determined at compile time.
  • Dynamic means the code can only be determined at
    runtime.
  • Static code generation boosts performance, while
    dynamic generation can result in more compact
    code. Configurations for different elements in
    the architecture are distinct and should be
    treated differently. Static generation, dynamic
    generation, or a mixture of both is applied
    accordingly.
  • Trade-offs and local optimizations
  • Performance/power vs. code size trade-off:
    flat vs. modular
  • A flat program is the fastest, but its memory
    footprint is infeasible. We introduce some
    modularity to reduce the memory requirement.
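
The static versus dynamic distinction can be sketched as follows. This is a minimal illustration, not the Pleiades implementation: the table `kStaticMacConfig`, the functions `emit_static` and `emit_dynamic`, and the word layout are all assumptions made for the example. Static generation stores precomputed words per kernel variant, while dynamic generation shares one routine that builds the words from run-time parameters.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Static generation: the words are fixed at compile time and kept in a table.
// Issuing them is a plain copy, but each kernel variant needs its own table.
constexpr std::array<std::uint32_t, 4> kStaticMacConfig = {0x10u, 0x22u, 0x31u, 0x40u};

inline std::vector<std::uint32_t> emit_static() {
    return std::vector<std::uint32_t>(kStaticMacConfig.begin(), kStaticMacConfig.end());
}

// Dynamic generation: the words are computed at run time from parameters.
// One routine covers many variants (more compact code) at the cost of
// extra instructions on every call. The bit layout is hypothetical.
inline std::vector<std::uint32_t> emit_dynamic(std::uint32_t stride, std::uint32_t count) {
    std::vector<std::uint32_t> words;
    words.push_back(0x10u);                    // fixed header word
    words.push_back(0x20u | (stride & 0xFu));  // stride in the low 4 bits
    words.push_back(0x30u | (count & 0xFu));   // count in the low 4 bits
    words.push_back(0x40u);                    // fixed trailer word
    return words;
}
```

In this sketch `emit_dynamic(2, 1)` produces the same four words as the static table, so the choice between the two is purely one of speed against code size.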

5
Optimizations (cont.)
  • Generality vs. performance/power trade-off
  • AGP configuration is the bottleneck of the
    system, so customized AGP configuration routines
    for the different kernels are provided to trade
    generality for speed.
  • Due to the sequential nature of the application,
    we replace the sophisticated but expensive
    interrupt-handling communication primitive
    between the embedded processor and the
    co-processors with a simpler, cheaper one.
  • Global Optimizations
  • Program caching: cache instructions and
    procedures that will be reused often.
  • AGP instruction registers are deeper and can
    support up to 5 contexts.
  • Since kernels within DSP applications have only
    a limited set of instruction patterns, all AGP
    instructions can often stay stored in the
    reconfigurable satellites, with no
    reconfiguration from the processor.
  • Partial reconfiguration
  • During the execution of the application, not all
    co-processors have to be fully reconfigured.
  • For example, when two identical kernels are
    called sequentially, only part of the
    configuration data has to be loaded into the
    satellites. Reconfigure only when necessary.
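
The two global optimizations above can be sketched in a few lines of C++. Everything here is an assumption made for illustration: the class names `SatelliteRegs` and `ContextCache`, the register model, and the FIFO replacement policy are not taken from the slides. Only the figure of 5 contexts comes from the text. The first class writes just the configuration registers whose values change; the second tracks which programs are already resident so that a repeated kernel needs no load from the processor.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical satellite modelled as a bank of configuration registers.
class SatelliteRegs {
public:
    explicit SatelliteRegs(std::size_t num_regs) : regs_(num_regs, 0) {}

    // Partial reconfiguration: write only the registers whose value changes.
    // Returns the number of register writes actually issued.
    std::size_t reconfigure(const std::vector<std::uint32_t>& wanted) {
        std::size_t writes = 0;
        const std::size_t n = std::min(wanted.size(), regs_.size());
        for (std::size_t i = 0; i < n; ++i) {
            if (regs_[i] != wanted[i]) {
                regs_[i] = wanted[i];
                ++writes;
            }
        }
        return writes;
    }

private:
    std::vector<std::uint32_t> regs_;
};

// Program cache with a fixed number of contexts (5 by default, as on the AGP).
// FIFO replacement is an assumption chosen only to keep the sketch short.
class ContextCache {
public:
    explicit ContextCache(std::size_t contexts = 5) : capacity_(contexts) {}

    // Returns true on a hit, meaning no load from the processor is needed.
    bool request(int program_id) {
        if (std::find(resident_.begin(), resident_.end(), program_id) != resident_.end())
            return true;
        if (capacity_ == 0)
            return false;                        // nothing can be cached
        if (resident_.size() == capacity_)
            resident_.erase(resident_.begin());  // evict the oldest program
        resident_.push_back(program_id);
        return false;
    }

private:
    std::size_t capacity_;
    std::vector<int> resident_;
};
```

With this model, configuring the same kernel twice in a row issues zero register writes on the second call, and an application that cycles through at most five AGP programs misses only on first use.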

6
Results: evaluation of the code generation
process for a speech coding algorithm implemented
on the Maia chip (Pleiades architecture)
  • Case study: implementing a 16-bit VSELP encoder.
  • Before: VSELP requires 126M cycles in total
    when it runs entirely on the ARM8.
  • After: all kernels in the VSELP algorithm have
    been selected and mapped to the satellites 10,
    18.9M cycles remain on the microprocessor (not
    including configuration cycles).
  • The optimization process
  • Before any optimization: the total cycle count
    (48.6M) is significantly smaller than the
    original, but more time is spent on
    configuration than on computation, which is
    unsatisfactory.
  • Table: Total VSELP cycle breakdown per kernel
    before optimization.
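
As a rough check on the figures quoted above, the arithmetic can be worked through as follows. The speedup uses only the two totals given in the slides. The second line rests on an assumption that the slides do not state outright: that the 48.6M total is the 18.9M microprocessor cycles plus configuration overhead. The per-kernel table itself is not reproduced in this transcript, so that split cannot be confirmed here.

```latex
\[
\text{speedup before optimization} = \frac{126\,\mathrm{M}}{48.6\,\mathrm{M}} \approx 2.59
\]
\[
\text{configuration cycles (assumed)} = 48.6\,\mathrm{M} - 18.9\,\mathrm{M} = 29.7\,\mathrm{M},
\qquad
\frac{29.7\,\mathrm{M}}{48.6\,\mathrm{M}} \approx 0.61
\]
```

Under that assumption, roughly 61% of the unoptimized cycle count would be configuration, which is consistent with the remark that configuration outweighs computation.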

7
The optimization result
1) Kernel-specific AGP configuration

8
Optimization II: Kernel-specific post-processing
Table: Performance comparison between the interrupt
handling scheme and the kernel-specific
post-processing scheme.

9
Optimization III: AGP program caching and
partial reconfiguration (final result)
Performance saving from AGP program
allocation and partial reconfiguration.