1
Wormhole RTR FPGAs with Distributed Configuration
Decompression
  • CSE-670 Final Project Presentation
  • A Joint Project Presentation by
  • Ali Mustafa Zaidi
  • Mustafa Imran Ali

2
Introduction
  • An FPGA configuration architecture supporting
    distributed control and fast context switching.
  • Aim of Joint Project
  • Explore the potential for dynamically
    reconfiguring FPGAs by adapting the WRTR Approach
  • Enable High Speed Reconfiguration using
  • Optimized Logic-Blocks and
  • Distributed Configuration Decompression
    techniques
  • Study focuses on Datapath-oriented FPGAs

3
Project Methodology
  • In-depth study of issues.
  • Definition of basic Architecture models.
  • Definition of Area models (for estimation of
    relative overhead w.r.t. other RTR schemes).
  • Design of Reconfigurable systems around designed
    Architecture models.
  • Identification of FPGA resource allocation
    methods
  • For distributing FPGA area between multiple
    applications/hosts (at runtime, compile-time, or
    system design-time).
  • Selection of Benchmarks for testing and
    simulation of various approaches.
  • Experimentation with the WRTR systems and a
    corresponding PRTR system without distributed
    configuration (i.e. a single host/application) for
    comparison of baseline performance.
  • Evaluate resource utilization, normalized
    reconfiguration overhead, etc.
  • Experimentation with all systems with distributed
    configurations (i.e. multiple hosts/applications).
    The PRTR system's configuration port will be
    time-multiplexed between the various
    applications.
  • Evaluate resource utilization, normalized
    reconfiguration overhead, etc.

4
Configuration Issues with Ultra-high Density FPGAs
  • FPGA densities rising dramatically with process
    technology improvements.
  • Configuration time for serial method becoming
    prohibitively large.
  • FPGAs are increasingly used as compute engines,
    implementing data-intensive portions of
    applications directly in Hardware.
  • Lack of efficient method for dynamic
    reconfiguration of large FPGAs will lead to
    inefficient utilization of the available
    resources.

5
Scalability Issues with Multi-context RTR
  • Concept: While one plane is operating, configure
    the other planes serially.
  • Latency hidden by overlapping configuration with
    computation
  • As FPGA size, and thus configuration time, grows,
    the multi-context approach becomes less effective
    at hiding latency
  • Configuring more bits in parallel for each
    context is only a stop-gap solution.
  • Only so many pins can be dedicated to
    configuration
  • Overheads in the Multi-context approach
  • The number of SRAM cells used for configuration
    grows linearly with the number of contexts
  • Multiplexing Circuitry associated with each
    configurable unit
  • Global, low-skew context select wires.

6
Scalability Issues with Partial RTR
  • Concept: The configuration memory is addressable
    like standard random-access memory.
  • Overheads in the PRTR Approach
  • Long global cell-select and data busses required
  • area overhead,
  • issues with single-cycle signal transmission as
    wires grow relative to logic.
  • Vertical and Horizontal Decoding circuitry
    represents a centralized control resource.
  • Can be accessed only sequentially by different
    user applications (one app at a time)
  • Potential for underutilization of hardware as
    FPGA density increases.
  • One solution could be to design the RAM as a
    multi-ported memory
  • But the RAM area increases quadratically with the
    number of ports.
  • Only so many dedicated configuration ports can be
    provided
  • Not a long-term scalable solution.

7
What is Wormhole RTR?
  • WRTR is a method for reconfiguring a configurable
    device in an entirely distributed fashion.
  • Routing and configuration are handled at the
    local rather than the global level.

8
Advertised Benefits of WRTR
  • WRTR is a distributed paradigm
  • Allows different parts of the same resource to be
    independently configured simultaneously.
  • Dramatically increases the configuration
    bandwidth of a device.
  • Lack of centralized controller means
  • Fewer single-point failures that can lead to
    total system failure (e.g. a broken configuration
    pin)
  • Increased resilience: routing around faults may
    improve chip yields
  • Distributed control provides scalability
  • Eliminates configuration bottleneck.

9
Origins of Wormhole RTR
  • Concept: Developed in the late 1990s at Virginia
    Tech.
  • Intended as a method of rapidly creating and
    modifying custom computational pathways using a
    distributed control scheme.
  • Essence of WRTR concept
  • Independent self-steering streams.
  • Streams carried both programming information and
    operand data
  • Streams interact with architecture to perform
    computation. (see DIAGRAM)

10
Origins of Wormhole RTR
11
Origins of Wormhole RTR
  • Programming information configures both the
    pathway of stream through the system, as well as
    the operations performed by computational
    elements along the path.
  • Heterogeneity of architectures is supported by
    these streams.
  • Runtime determination of path of stream is
    possible, allowing allocation of resources as
    they become available.

12
Adapting WRTR for conventional FPGAs
  • Our aims
  • To achieve Fast, Parallel Reconfiguration
  • With minimum area overhead
  • And minimum constraints imposed on the underlying
    FPGA Architecture.
  • Configuration Architecture is completely
    decoupled from the FPGA Architecture.
  • WRTR model is used as inspiration for developing
    a new paradigm for dynamic reconfiguration
  • It is not necessary to follow the WRTR method to
    the letter

13
Issues Associated with using WRTR for
Conventional FPGAs
  • Original WRTR was intended for Coarse-grained
    dataflow architectures with localized
    communications
  • Thus operand data was appended to the streams
    immediately after the programming header.
  • In conventional FPGAs, dataflow patterns are
    unrelated to configuration flow, and there is no
    restriction to localized communications.
  • Therefore Wormhole routing is used only for
    configuration (cannot be used for data).

14
Issues Associated with using WRTR for
Conventional FPGAs
  • The original model was intended to establish
    linear pipelines through the system.
  • This makes run-time direction determination
    feasible.
  • However, for conventional FPGAs, the functions
    implemented have arbitrary structures.
  • The configuration stream cannot change direction
    arbitrarily (i.e. its path is fixed at compile-time).

15
Issues Associated with using WRTR for
Conventional FPGAs
  • Due to the need for a large number of
    configuration ports, I/O ports must be
    shared/multiplexed
  • Thus active circuits may need to be stalled to
    load configurations.
  • This should not be a severe issue for
    high-performance computing-oriented tasks.
  • The scheme should impose minimum constraints on
    the underlying FPGA architecture
  • The constraints applicable to an FPGA with WRTR
    are the same as those for any PRTR architecture.

16
A Possible System Architecture for a WRTR FPGA
  • Many configuration/IO ports, divided between
    multiple host processors. (See Diagram)
  • Internally, the FPGA is divided into partitions,
    usable by each of the hosts.
  • Partition boundaries may be determined at system
    design time, or at runtime, based on requirements
    of each host at any given time

17
The various WRTR Models derived
  • Our aim was to devise a high-speed, distributed
    configuration model with all the benefits of the
    original WRTR concept, but with minimum overhead.
  • To this end, three models have been devised:
  • Basic: full internal routing
  • Second: perimeter-only routing
  • Third: packetized (parallel) configuration
    streams, with no internal routing

18
Basic WRTR Model with Internal Routing
  • Each configurable block or tile is accompanied
    by a simple configuration Stream Router. See
    Diagram
  • Overhead scales linearly with FPGA Size.
  • Expected Issues with this model
  • Complicated router, arbitration overhead and
    prioritization, potential for deadlock conditions
    etc.
  • May be restricted to coarser grained designs.
  • Without data routing, do we really need internal
    routing?

19
Second WRTR Model, with Perimeter-only Routing
  • Primary requirement for achieving parallel
    configuration is multiple input ports.
  • Internal Routing not a mandatory requirement.
  • So why not restrict routing to chip boundary?
    (See Diagram)
  • Overhead scaling improved (similar to PRTR Model)
  • Highlights
  • Finer granularity for configuration achievable
  • Significantly lower overheads as FPGA sizes grow
    (ratio of perimeter to area)
  • Issues
  • Longer time required to reach parts of FPGA as
    FPGA Size grows.
  • Reduced configuration parallelism because of
    potentially greater arbitration delays at
    boundary Routers.

20
Third Model: Packet-based Distribution of Configuration
  • One solution to the increased boundary
    arbitration issues: use packets instead of
    streams. (See Diagram)
  • A single configuration from each application is
    generated as a stream (a worm) similar to
    previous models.
  • Before entering the device, configuration packets
    from different streams are grouped according to
    their target rows (a minimal sketch follows).
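A minimal sketch of this pre-device grouping step, assuming each configuration packet is simply a (target_row, payload) pair emitted by an application's stream generator; the field names and data layout are illustrative, not the presentation's packet format.

```python
# Minimal sketch of grouping configuration packets by target row before
# they enter the device; packet fields are assumed, not from the slides.
from collections import defaultdict

def group_by_row(streams):
    """streams: {app_name: [(target_row, payload), ...]} -> {row: [(app, payload)]}"""
    rows = defaultdict(list)
    for app, packets in streams.items():
        for target_row, payload in packets:
            rows[target_row].append((app, payload))
    return dict(rows)

streams = {"appA": [(0, "cfg0"), (2, "cfg1")],
           "appB": [(0, "cfg2"), (1, "cfg3")]}
# Packets destined for row 0 from both applications now travel together.
print(group_by_row(streams))
```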

21
Third Model: Packet-based Distribution of Configuration
  • Benefit: No need at all for routers in the fabric
    itself.
  • Drawbacks
  • Increases overhead on the host system
  • Implies a centralized external controller
  • Or a limited crossbar interconnect within the
    FPGA
  • Parallel Reconfiguration still possible, but with
    limited multitasking.
  • This model may be considered for embedded
    systems, with low configuration parallelism, but
    high resource requirements.

22
Basic Area Model
  • A baseline model is required to identify the
    overheads associated with each RTR model.
  • Basic Building block (See Diagram)
  • A basic Array of SRAM Cells
  • Configured by arbitrary number of scan chains.
  • Assumptions for a fair comparison of overhead
  • Each RTR model studied has exactly the same
    amount of Logic resources to configure (rows and
    columns).
  • Each model can be configured at exactly the same
    granularity.
  • The given array of logic resources (see Diagram)
    has an area equivalent to a serially
    configurable FPGA.
  • AREA of Basic Model = (A × B) × (x × y)

23
The PRTR FPGA Area Model
  • Please See Diagram
  • Configuration Granularity decided by A and B
  • AREA = Area of Basic Model + Overheads
  • Overheads
  • Area of log2(x)-to-x Row-select decoder
  • Area of log2(y)-to-y n-bit Column
    De-multiplexer
  • Area of 1 n-bit bus × y

24
The basic WRTR FPGA Area Model
  • Please See Diagram
  • AREA = Area of Basic Model + Overheads
  • Overheads
  • Area of 1 n-bit bus × 2x
  • Area of 1 n-bit bus × 2y
  • Area of 1 4-D Router block × (x × y).

25
The Perimeter Routing WRTR FPGA Area Model
  • Please See Diagram
  • AREA = Area of Basic Model + Overheads
  • Overheads
  • Area of 1 n-bit bus × 2x
  • Area of 1 n-bit bus × 2y
  • Area of 1 3-D Router block × (2(x + y) - 1)
  • This model can also be made one dimensional to
    further reduce overheads. (other constraints will
    apply)

26
The Packet based WRTR FPGA Area Model
  • Please See Diagram
  • AREA = Area of Basic Model + Overheads
  • Overheads
  • Area of 1 n-bit bus × 2x
  • Area of 1 n-bit bus × 2y
  • Additional overheads may appear in host system.
  • This model can also be made one dimensional to
    further reduce overheads. (other constraints will
    apply)
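The following is a minimal sketch of how the area expressions above could be compared numerically. The unit-area constants (SRAM_CELL, BUS_WIRE, DECODER_PER_OUT, ROUTER_4D, ROUTER_3D) and the example dimensions are assumptions, since the slides leave them unspecified; only the scaling terms follow the models above.

```python
# Assumed relative unit areas; only the scaling terms follow the slides.
SRAM_CELL, BUS_WIRE = 1.0, 0.2
DECODER_PER_OUT, ROUTER_4D, ROUTER_3D = 0.5, 40.0, 30.0

def basic_area(A, B, x, y):
    """Baseline: (A x B) SRAM cells per block, (x x y) blocks."""
    return (A * B) * (x * y) * SRAM_CELL

def prtr_area(A, B, x, y, n):
    """Basic area + row-select decoder + column demux + one n-bit bus x y."""
    decoder = x * DECODER_PER_OUT               # log2(x)-to-x row select
    demux = y * n * DECODER_PER_OUT             # log2(y)-to-y, n bits wide
    bus = n * y * BUS_WIRE
    return basic_area(A, B, x, y) + decoder + demux + bus

def wrtr_internal_area(A, B, x, y, n):
    """Basic area + n-bit buses (2x and 2y) + one 4-D router per block."""
    buses = n * (2 * x + 2 * y) * BUS_WIRE
    return basic_area(A, B, x, y) + buses + ROUTER_4D * x * y

def wrtr_perimeter_area(A, B, x, y, n):
    """Basic area + n-bit buses (2x and 2y) + 3-D routers on the perimeter."""
    buses = n * (2 * x + 2 * y) * BUS_WIRE
    return basic_area(A, B, x, y) + buses + ROUTER_3D * (2 * (x + y) - 1)

def wrtr_packet_area(A, B, x, y, n):
    """Basic area + n-bit buses only; no routers in the fabric."""
    return basic_area(A, B, x, y) + n * (2 * x + 2 * y) * BUS_WIRE

if __name__ == "__main__":
    A, B, x, y, n = 64, 64, 32, 32, 8           # example dimensions (assumed)
    base = basic_area(A, B, x, y)
    models = {"PRTR": prtr_area, "WRTR internal": wrtr_internal_area,
              "WRTR perimeter": wrtr_perimeter_area,
              "WRTR packet": wrtr_packet_area}
    for name, fn in models.items():
        print(f"{name:15s} overhead: {100 * (fn(A, B, x, y, n) / base - 1):.3f}%")
```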

27
Parameters defined and their Impact
  • The number of Busses (x and y)
  • Number of Busses varies with reconfiguration
    granularity.
  • For fixed logic capacity, A and B increase with
    decreasing x and y, i.e. coarser granularity.
  • Impact of coarser granularity
  • Reduced overhead
  • Reduced reconfiguration flexibility
  • Increased reconfiguration time per block
  • Thus it is better to have finer granularity
  • The Width of the busses (n-bits)
  • The smaller the width, the smaller the overhead
    (for a fixed number of busses)
  • But the longer the reconfiguration times (see the
    sketch below).
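A rough illustration of this trade-off: the sketch below assumes each block holds A×B configuration bits delivered one n-bit bus word per cycle, and holds the total logic capacity fixed while the block size varies; the timing model and example dimensions are assumptions.

```python
# Granularity/bus-width trade-off sketch: total capacity (A*B*x*y bits) is
# held fixed; coarser granularity means fewer, larger blocks and more
# cycles per block (assuming one n-bit bus word delivered per cycle).
def per_block_reconfig_cycles(A, B, n):
    return -(-(A * B) // n)                    # ceil(A*B / n) cycles per block

for (A, B, x, y) in [(32, 32, 64, 64), (64, 64, 32, 32), (128, 128, 16, 16)]:
    print(f"total bits = {A * B * x * y}, blocks = {x * y:5d}, "
          f"cycles/block (n=8) = {per_block_reconfig_cycles(A, B, 8)}")
```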

28
Parameters defined and their Impact
  • It is possible to achieve finer granularity
    without increasing overhead at the cost of bus
    width (and hence reconfiguration time per block)
  • Impact of coarse-grained vs. fine-grained
    configurability: methods of handling hazards in
    the underlying FPGA fabric.
  • Coarse grained configuration places minimum
    constraints on FPGA architecture
  • Fine-grained reconfiguration is subject to all
    issues associated with Partial RTR Systems.

29
Approaches to Router Design
  • Active Routing Mechanism
  • Similar to conventional networks
  • Routing of streams depends on stream-specified
    destination, as well as network metrics (e.g.
    congestion, deadlock)
  • Hazards and conflicts may be dealt with at
    Run-time.
  • Significantly more complicated routing logic is
    required.
  • Most likely will be restricted to very coarse
    grained systems.
  • Passive Routing Mechanism
  • Routing of streams depends only on
    stream-specified direction
  • Hazards and conflicts avoided by compile time
    optimization.
  • We have selected the Passive Routing Mechanism
    for our WRTR Models

30
Passive Router Details
  • Must be able to handle streams from 4 different
    directions.
  • Streams from different directions are only
    stalled if there is a conflict in the outgoing
    direction.
  • Includes mechanisms for stalling streams in case
    of conflict, etc.
  • Detecting and Applying back-pressure
  • Routing Circuitry for one port is defined (see
    Diagram)
  • For a 4D router, this design is replicated 4
    times
  • For a 3D router, this design is replicated 3
    times
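The sketch below models one arbitration cycle of a passive router of the kind described above: each incoming stream's header names a fixed outgoing direction chosen at compile time, and a stream is stalled (back-pressured) only if its requested output is already taken this cycle. The port names and the fixed-priority tie-break are assumptions, not the presented circuit.

```python
# Minimal sketch of one cycle of passive routing; structure is illustrative.
def route_cycle(requests):
    """requests: {in_port: out_port}. Returns (granted, stalled)."""
    granted, stalled = {}, []
    for in_port, out_port in sorted(requests.items()):  # fixed priority order
        if out_port in granted.values():
            stalled.append(in_port)        # conflict on the outgoing direction:
                                           # apply back-pressure to this stream
        else:
            granted[in_port] = out_port    # forward the stream this cycle
    return granted, stalled

# Two streams want to leave eastward: one is granted, the other is stalled.
print(route_cycle({"north": "east", "west": "east", "south": "north"}))
```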

31
Utilizing Variable Length Configuration Mechanisms
  • Support Hardware and Logic Block Issues

32
Configuration Overhead
  • Configuration data is huge for large FPGAs
  • It has to be reduced in order to achieve fast
    context switching

33
Configuration Overhead Minimization
  • Initial pointers in this direction
  • Variable Length Configurations
  • Default Configuration on Power-up

34
Variable Length Configurations
  • Ideally
  • Change only minimum number of bits for a new
    configuration
  • Utilize the idea of short configurations for
    frequently used configurations
  • Start from a default configuration and change
    minimum bits to reconfigure to a new state

35
Hurdles
  • Logic blocks always require full configurations
    to be specified; configuration sizes cannot be
    varied
  • Knowing only what to change requires keeping
    track of what was configured before, a difficult
    issue when multiple dynamic applications are
    switching
  • A default power-up configuration can hardly be
    useful for all application cases

36
Configuration Overhead Minimization
  • How to do it?
  • Remove redundancy in configuration data or
    compact the contents of configuration stream
  • Result?
  • This will minimize the information required to
    be conveyed during configuration or
    reconfiguration
  • Configuration Compression
  • Applying some sort of compression to the
    configuration data stream

37
Configuration Decompression Approaches
  • Centralized Approach
  • Decompress the configuration stream at the
    boundary of the FPGA
  • Distributed Approach New Paradigm
  • Decompress the stream at the boundary of the
    Logic Blocks or Logic Cluster

38
Centralized Approach
  • Advantage
  • Requires hardware only at the boundary of the
    device from where the configuration data enters
    the device
  • Significant reduction in configuration size can
    be achieved
  • Run-length coding and Lempel-Ziv-based
    compression are used
  • Examples
  • Atmel 6000 Series
  • Zhiyuan Li, Scott Hauck, Configuration
    Compression for Virtex FPGAs, IEEE Symposium on
    FPGAs for Custom Computing Machines, 2001

39
Centralized Approach
  • Limitations
  • More efficient variable-length coding is not easy
    to use because of the large number of symbol
    possibilities
  • It is difficult to quantify symbols in the
    configuration stream of heterogeneous devices,
    which can have different types of blocks

40
Decentralized Approach
  • Advantages
  • Decompressing at the logic block boundary enables
    configurations to be easily symbolized and hence
    VLC to be used
  • In other words, we know exactly what we are
    coding, so Huffman-like codes can be used based on
    the frequency of configuration occurrences
  • Also has advantages specific to Wormhole RTR
    discussed next

41
Decentralized Approach
  • Limitations
  • The decompression hardware has to be replicated
  • Optimality issue: over how much programmable
    logic area should the decompression hardware be
    amortized?
  • In other words, the granularity of the logic area
    should be determined for an optimal cost/benefit
    ratio

42
Suitability of Decentralized Approach to WRTR
  • If worms are decompressed at the device boundary,
    large internal worm lengths will result
  • This leads to greater arbitration issues and worm
    blockages
  • The decentralized approach thus favors shorter
    worm lengths, allowing parallel worms to traverse
    with fewer blockages

43
Variable Length Configuration
  • Overall idea
  • Frequently used configurations of a logic block
    should have small codes
  • Variable-length coding such as Huffman coding can
    be adapted (a small sketch follows)
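As a concrete illustration of this idea, the sketch below builds a Huffman code over a set of logic-block configurations; the configuration names and occurrence counts are hypothetical.

```python
# Minimal sketch of Huffman-coding frequently used logic-block
# configurations; the frequency table below is hypothetical.
import heapq

def huffman_codes(freqs):
    """Map each configuration symbol to a variable-length bit string."""
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical occurrence counts of logic-block configurations
freqs = {"ADD": 120, "SUB": 40, "MUL": 25, "MUX2": 60, "RAW_BITS": 5}
codes = huffman_codes(freqs)
for sym in sorted(codes, key=lambda s: len(codes[s])):
    print(f"{sym:9s} -> {codes[sym]}")        # frequent configs get short codes
```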

44
Configuration Frequency Analysis
  • How to decide upon the frequency?
  • Hardwired? By the designer, through benchmark
    analysis
  • Generic? Done by the software generating the
    configuration stream

45
Continued
  • Hardwired determination will be inferior: no
    large benefit is gained, due to variations across
    applications
  • Software that generates the configuration can
    optimally identify a given number of frequently
    used configurations according to the set of
    applications to be executed
  • Code determination should therefore be done by
    the software generating the configurations, for
    optimal codes

46
Decoding Hardware Approaches
  • Huffman coding the configurations
  • A Hardwired Huffman Decoder
  • Adaptive Decoder (code table can be changed)
  • Using a Table of Frequently used configurations
    and address Decoder
  • Huffman Coding the Table Addresses
  • Static Coding
  • Adaptive Coding

47
Decoding Hardware Features
  • Static Huffman Decoders
  • Lower compression
  • Coding Configurations
  • Requires a very wide decoder
  • Using an Address Decoder only
  • Reduced hardware but less compression (fixed
    sized codes)
  • Coding the Decoder Inputs
  • Requires a relatively smaller Huffman decoder

48
Some points to Note
  • Decompression approach is decoupled from any
    specific logic block architecture
  • Though certain logic blocks will favor more
    compression (discussed later)
  • Not every possible configuration will be coded;
    random logic portions in particular will require
    all of their bits to be transmitted
  • A special code will prefix a random-logic
    configuration to identify it as one to be handled
    separately (see the sketch below)
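A minimal sketch of mixing coded and raw configurations in one stream, assuming a reserved escape prefix for uncoded (random-logic) configurations; the code table, the escape code value, and the 16-bit raw width are assumptions.

```python
# Minimal sketch of an escape-prefixed stream: frequent configurations use
# short codes, raw (random-logic) configurations follow a reserved ESC code.
# The code table together with ESC must form a prefix-free set.
def encode(config_seq, codes, esc_code, raw_width=16):
    bits = ""
    for cfg in config_seq:
        if cfg in codes:
            bits += codes[cfg]                          # frequent config: short code
        else:                                           # raw bits, prefixed by ESC
            bits += esc_code + format(cfg, f"0{raw_width}b")
    return bits

codes = {"ADD": "0", "MUX2": "10"}                      # hypothetical code table
ESC = "11"                                              # reserved escape prefix
print(encode(["ADD", 0x3A7F, "MUX2"], codes, ESC))
```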

49
Logic Block Selection
  • High Level Issues
  • Should be Datapath Oriented
  • Efficient support for random logic implementation
  • High functionality with minimum configuration
    bits, to support dense implementations and reduce
    reconfiguration overhead
  • Well-defined datapath functionality
    (configurations) to aid in quantifying the
    frequently-used-configurations idea

50
Chosen Hi-Functionality Logic Block
51
A Low Functionality Logic Block
52
Logic Block Considerations
  • High-functionality blocks: good for datapath
    implementations
  • Low-functionality blocks: less dense datapath
    implementations

53
Logic Block Considerations
  • Low-functionality blocks have fewer configuration
    bits per block, and vice versa
  • The cost of the frequently-used-configurations
    memory depends on the size of the configurations
    stored
  • So the decoder hardware overhead will be lower
    for low-functionality blocks

54
Logic Block Considerations
  • What about the configuration time overhead?
  • Less dense functionality means more blocks to
    configure
  • This leads to longer configuration streams

55
Logic Block Considerations
  • Assumption: Random logic does not benefit from
    one block type over the other
  • Consequence: With high-functionality blocks,
    datapath-oriented designs will require fewer
    blocks to configure and achieve even higher
    compression, but incur larger overhead for
    random-logic implementations than with
    low-functionality blocks

56
Logic Block Issue Conclusions
  • Proper logic block selection for a particular
    application affects
  • Decoder hardware size
  • Configuration compression ratio
  • Since random logic is not compressed, using
    high-functionality blocks for less
    datapath-oriented applications will result in
  • A high decoder overhead that goes unutilized
  • Less compression and longer configuration streams

57
Huffman decoder hardware
  • Basic Huffman decoding hardware is sequential in
    nature, with a variable input rate and variable
    output rate

58
Huffman Hardware
  • Sequential decoding is not suitable for WRTR
  • The worm would have to be stalled
  • Negates the benefit of fast reconfiguration
  • The hardware should be able to process N bits at
    a time, where N = bus width
  • This requires a constant-input-rate architecture
    with a variable number of codes processed per
    cycle

59
Constant Input Rate PLA based Architecture
  • Input rate: K bits/cycle
  • The PLA implements the table-lookup process
  • Input bits
  • Determine one unique path along the Huffman tree
  • Next state
  • Fed back to the input of the PLA to indicate the
    final residing state
  • Indicator
  • The number of symbols decoded in the cycle

(Figure: The constant-input-rate PLA-based
architecture for the VLC decoder)
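The following is a behavioural sketch of the constant-input-rate lookup described above: the PLA is modelled as a precomputed table indexed by (current Huffman-tree state, K new input bits) that returns the symbols decoded this cycle plus the next state. The example code table, K, and the software modelling style are assumptions; this is not the referenced paper's circuit.

```python
# Behavioural sketch of a constant-input-rate VLC decoder; the "PLA" is a
# table keyed by (tree state, K input bits). Assumes a complete prefix code.
def build_pla_table(codes, K):
    # Tree states are proper code prefixes; "" is the root.
    prefixes = {""} | {c[:i] for c in codes.values() for i in range(1, len(c))}
    decode = {v: k for k, v in codes.items()}
    table = {}
    for state in prefixes:
        for chunk in range(2 ** K):
            bits = format(chunk, f"0{K}b")
            cur, out = state, []
            for b in bits:
                cur += b
                if cur in decode:              # reached a leaf: emit a symbol
                    out.append(decode[cur])
                    cur = ""
            table[(state, bits)] = (out, cur)  # decoded symbols + next state
    return table

def decode_stream(bitstring, codes, K=4):
    table = build_pla_table(codes, K)
    state, symbols = "", []
    for i in range(0, len(bitstring), K):          # K bits accepted every cycle
        chunk = bitstring[i:i + K].ljust(K, "0")   # zero-pad the last chunk
        out, state = table[(state, chunk)]
        symbols.extend(out)
    return symbols

codes = {"A": "0", "B": "10", "C": "110", "D": "111"}    # example code table
print(decode_stream("01011000", codes, K=4))             # ['A', 'B', 'C', 'A', 'A']
```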
60
PLA Based Architecture
  • Ref: VLSI Designs for High-Speed Huffman
    Decoders, Shih-Fu Chang and David G.
    Messerschmitt
  • Decoder model: FSM
  • FSM Implementation
  • ROM
  • PLA
  • Lower complexity
  • High speed

61
Implementation Results
  • The area is a function of the number of inputs
    and outputs, along with the input rate

62
Hardware Area Estimates
  • The number of inputs and outputs depends upon the
    maximum code length and the symbol sizes
  • Typically, for a 16-entry table:
  • Code sizes range from 1 to 8 bits
  • Symbol size will equal the decoder input width,
    i.e. 4 bits

63
Handling Multiple Symbol Outputs
  • More than one code may be decoded per cycle, with
    a maximum of N codes per cycle
  • To take full advantage of parallel decoding,
    multiple configuration chains can be employed.
  • A counter can be used to cycle between the chains
    and output one configuration per chain with a
    maximum of N chains.
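A minimal sketch of the counter-based distribution onto parallel configuration chains described above; the chains are plain lists here, and the per-cycle symbol groups are hypothetical decoder output.

```python
# Minimal sketch of round-robin distribution of decoded configurations
# onto N parallel configuration chains; data values are illustrative.
def distribute(decoded_per_cycle, N):
    chains = [[] for _ in range(N)]
    counter = 0
    for cycle_symbols in decoded_per_cycle:     # at most N symbols per cycle
        for sym in cycle_symbols:
            chains[counter].append(sym)         # one configuration per chain
            counter = (counter + 1) % N         # cycle to the next chain
    return chains

print(distribute([["ADD", "MUX2"], ["SUB"], ["ADD", "ADD", "MUL"]], N=3))
```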

64
Decoder Hardware
  • For parallel decoding of N codes and M-bit
    decoded symbols
  • Huffman Decoder (discussed before)
  • N M-bit decoders
  • N port ROM table
  • N parallel configuration chains

65
Concluding Points
  • The hardware overhead of the Huffman decoding
    mechanism discussed has to be incorporated into
    the area model presented earlier.
  • Empirical determination of reconfiguration
    speed-up versus area overhead for a sampling of
    benchmarks