Title: Configuration Code Generation and Optimizations for Heterogeneous Reconfigurable DSPs, Suet-Fei Li
1. Configuration Code Generation and Optimizations for Heterogeneous Reconfigurable DSPs
Suet-Fei Li
- Motivation
  - Our work addresses the code generation and optimization process for reconfigurable architectures targeting digital signal processing and wireless communication applications.
  - The ability to generate efficient and compact code is essential for the success of reconfigurable architectures. Otherwise, the overhead of reconfiguration could easily become the system bottleneck.
  - Our code generation process evaluates a set of trade-offs in system design and software engineering, and applies a set of local and global optimization techniques. By doing so, we achieve significantly lower overhead.
  - In the future, we will also approach the problem from the hardware side and push system performance to a level that software alone cannot achieve.
2. Pleiades Architecture
- Hardware
  - Consists of an embedded microprocessor and reconfigurable coprocessors of different programming granularities.
  - Supports dynamic configuration of the coprocessors and interconnect.
- Software
  - DSP applications (a real-time speech processor, etc.) are described in a high-level language (C or C++).
  - Complete software tool support is provided to implement the high-level description on Pleiades hardware in an optimized fashion, according to a user-provided set of parameters (power, performance, or any mixture of the two).
3. Overall Software Flow: From high-level description to implementation
- The input to the flow is an algorithm specified in C/C++.
- Mapping and partitioning onto the Pleiades architecture are performed according to a set of optimization parameters.
- The application program is divided into two parts: the control section, which is implemented on the embedded microprocessor, and the computationally intensive loops (kernels), which are implemented in hardware.
- These kernels are encapsulated in procedure calls and described in a high-level intermediate form. The intermediate form (currently implemented in C) is based on the concept of processes (satellites) and queues (connections between satellites).
- The intermediate form contains all satellite functionality and connectivity information and is used to generate efficient and compact configuration and interface code. A code generation library (with entries corresponding to each satellite as well as to the whole kernel) is provided for automatic code generation. The library currently contains all the basic building blocks (satellites) in Pleiades. A kernel specified in the intermediate form is like a procedure call: the structure is fixed for each kernel, but each invocation of the kernel can pass in different parameters.
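The satellite-and-queue view of a kernel can be sketched in C as below. This is only an illustration of the idea, not the actual Pleiades intermediate form or code generation library: the type names (`Satellite`, `Queue`, `Kernel`), the satellite kinds, the function `kernel_gen_config`, and the 32-bit word layout are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical satellite kinds; the real library defines its own set. */
typedef enum { SAT_AGP, SAT_MEM, SAT_MAC, SAT_ALU } SatKind;

/* A satellite (process): its kind plus a per-invocation parameter. */
typedef struct {
    SatKind  kind;
    uint16_t param;          /* e.g. a vector length (assumed meaning) */
} Satellite;

/* A queue: a directed connection between two satellites. */
typedef struct {
    uint8_t src;             /* index of the producing satellite */
    uint8_t dst;             /* index of the consuming satellite */
} Queue;

/* A kernel: fixed structure (satellites and queues), variable parameters. */
typedef struct {
    Satellite   *sats;
    size_t       n_sats;
    const Queue *queues;
    size_t       n_queues;
} Kernel;

/* Made-up word layout: a tag in the top byte, payload below it. */
static uint32_t sat_word(const Satellite *s)
{
    return ((uint32_t)0x01 << 24) | ((uint32_t)s->kind << 16) | s->param;
}

static uint32_t queue_word(const Queue *q)
{
    return ((uint32_t)0x02 << 24) | ((uint32_t)q->src << 8) | q->dst;
}

/* Emit one configuration word per satellite, then one per queue.
 * Returns the number of words written, or 0 if `out` is too small. */
size_t kernel_gen_config(const Kernel *k, uint32_t *out, size_t cap)
{
    assert(k != NULL && out != NULL);
    size_t need = k->n_sats + k->n_queues;
    if (need > cap)
        return 0;
    size_t n = 0;
    for (size_t i = 0; i < k->n_sats; i++)
        out[n++] = sat_word(&k->sats[i]);
    for (size_t i = 0; i < k->n_queues; i++)
        out[n++] = queue_word(&k->queues[i]);
    return n;
}
```

In this sketch, invoking the same kernel with different parameters means changing only the `param` fields before calling `kernel_gen_config` again; the satellites and queues themselves stay fixed, which mirrors the procedure-call analogy.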
4. Issues and trade-offs in code generation
- Static vs. dynamic generation
  - Static means the configuration and interface code can be determined at compile time.
  - Dynamic means the code can only be determined at run time.
  - Static code generation boosts performance, while dynamic generation can result in more compact code. Configurations for different elements in the architecture are distinct and should be treated differently. Static generation, dynamic generation, or a mixture of both is applied accordingly.
- Trade-offs and local optimizations
  - Performance/power vs. code size trade-off: flat vs. modular
    - A flat program is the fastest, but its memory requirement is infeasible. We introduce some modularity to reduce the memory requirement.
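The static/dynamic distinction can be illustrated with a small C sketch. The slides do not say which architectural elements are configured statically and which dynamically, so the split used here (a constant interconnect table, run-time AGP parameters) is an assumption, as are all names and the word encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Static part: words known at compile time, kept as a constant table.
 * The values are placeholders, not real Pleiades configuration words. */
static const uint32_t static_interconnect_cfg[] = {
    0x02000001u, 0x02000102u, 0x02000203u
};
#define STATIC_WORDS \
    (sizeof static_interconnect_cfg / sizeof static_interconnect_cfg[0])

/* Dynamic part: words computed from values only known at run time.
 * Returns the number of words written, or 0 if `out` is too small. */
size_t gen_agp_cfg(uint16_t base, uint16_t stride, uint16_t count,
                   uint32_t *out, size_t cap)
{
    assert(out != NULL);
    if (cap < 2)
        return 0;
    out[0] = ((uint32_t)base << 16) | stride;
    out[1] = count;
    return 2;
}

/* Mixed: copy the static table, then append the dynamic words. */
size_t build_kernel_cfg(uint16_t base, uint16_t stride, uint16_t count,
                        uint32_t *out, size_t cap)
{
    assert(out != NULL);
    if (cap < STATIC_WORDS + 2)
        return 0;
    memcpy(out, static_interconnect_cfg, sizeof static_interconnect_cfg);
    return STATIC_WORDS +
           gen_agp_cfg(base, stride, count,
                       out + STATIC_WORDS, cap - STATIC_WORDS);
}
```

The static table costs memory for every word but needs no computation at run time, while the generator function is a few instructions of code that can produce many different configurations, which is the performance versus code size tension described above.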
5. Optimizations (cont.)
- Generality vs. performance/power trade-off
  - AGP configuration is the bottleneck of the system, so customized AGP configuration routines are provided for the different kernels, trading generality for speed.
  - Because of the sequential nature of the application, we replace the sophisticated but expensive interrupt-handling communication primitive between the embedded processor and the coprocessors with a simpler, cheaper one.
- Global optimizations
  - Program caching: cache instructions and procedures that will be reused often.
    - AGP instruction registers are deeper and can support up to 5 contexts.
    - Since kernels within DSP applications have only limited instruction patterns, all AGP instructions can often stay stored in the reconfigurable satellites without reconfiguration from the processor.
  - Partial reconfiguration
    - During the execution of the application, not all coprocessors have to be fully reconfigured.
    - For example, when two identical kernels are called sequentially, only part of the configuration data has to be loaded into the satellites. Reconfigure only when necessary.
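The two global optimizations above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the Maia implementation: the shadow-copy approach, the eight-word configuration size, the round-robin replacement policy, and every identifier are hypothetical. Only the limit of 5 AGP contexts comes from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_WORDS    8   /* assumed number of words per satellite config */
#define AGP_CONTEXTS 5   /* the slides state up to 5 contexts */

/* Shadow copy of what is currently loaded in one satellite. */
typedef struct {
    uint32_t words[CFG_WORDS];
} SatState;

/* Partial reconfiguration: write only the words that differ.
 * Returns the number of words actually written. */
size_t reconfigure_partial(SatState *hw, const uint32_t *wanted)
{
    assert(hw != NULL && wanted != NULL);
    size_t writes = 0;
    for (size_t i = 0; i < CFG_WORDS; i++) {
        if (hw->words[i] != wanted[i]) {
            hw->words[i] = wanted[i];   /* stands in for a bus write */
            writes++;
        }
    }
    return writes;
}

/* Program caching: remember which program sits in each AGP context. */
typedef struct {
    int    program_id[AGP_CONTEXTS];    /* -1 marks an empty context */
    size_t next_victim;                 /* round-robin replacement (assumed) */
} AgpCache;

void agp_cache_init(AgpCache *c)
{
    assert(c != NULL);
    for (size_t i = 0; i < AGP_CONTEXTS; i++)
        c->program_id[i] = -1;
    c->next_victim = 0;
}

/* Returns the context that holds program `id`.
 * Sets *loaded to 1 if the program had to be loaded, 0 on a cache hit. */
size_t agp_cache_select(AgpCache *c, int id, int *loaded)
{
    assert(c != NULL && loaded != NULL && id >= 0);
    for (size_t i = 0; i < AGP_CONTEXTS; i++) {
        if (c->program_id[i] == id) {
            *loaded = 0;
            return i;
        }
    }
    size_t slot = c->next_victim;
    c->program_id[slot] = id;
    c->next_victim = (slot + 1) % AGP_CONTEXTS;
    *loaded = 1;
    return slot;
}
```

Calling `reconfigure_partial` twice in a row with the same target configuration writes nothing the second time, which corresponds to the case of two identical kernels called sequentially. Likewise, an application that uses five or fewer distinct AGP programs only ever takes the load path once per program in this sketch.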
6. Results: evaluation of the code generation process for a speech coding algorithm implemented on the Maia chip (Pleiades architecture)
- Case study: implementing a 16-bit VSELP encoder.
  - Before: VSELP requires a total of 126M cycles when it runs entirely on the ARM8.
  - After: all kernels in the VSELP algorithm have been selected and mapped to the satellites 10, 18.9M cycles remain on the microprocessor (not including configuration cycles).
- The optimization process
  - Before any optimization: the total cycle count (48.6M) is significantly smaller than the original one, but more time is spent on configuration than on computation, which is not satisfactory.
  - Table: Total VSELP cycle breakdown per kernel before optimization.
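As a quick check on the figures quoted above, the overall speedup before any optimization can be worked out directly. Only the values 126M, 48.6M, and 18.9M come from the slides; the symbols are introduced here for the calculation.

```latex
\[
S = \frac{N_{\text{ARM8}}}{N_{\text{total}}}
  = \frac{126\,\text{M}}{48.6\,\text{M}} \approx 2.59
\]
\[
N_{\text{total}} - N_{\text{micro}}
  = 48.6\,\text{M} - 18.9\,\text{M} = 29.7\,\text{M}
\]
```

If the 18.9M microprocessor cycles are part of the 48.6M total (an assumption, since the per-kernel table is not reproduced here), then 29.7M cycles fall outside microprocessor computation. That would be consistent with the remark that configuration takes more time than computation.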
7. The optimization result
1) Kernel-specific AGP configuration
8. Optimization II: Kernel-specific post-processing
Table: Performance comparison between the interrupt-handling scheme and the kernel-specific post-processing scheme.
9. Optimization III: AGP program caching and partial reconfiguration (final result)
Performance saving by using AGP program allocation and partial reconfiguration.