Coarse Grain Reconfigurable Architectures
Transcript and Presenter's Notes

Title: Coarse Grain Reconfigurable Architectures


1
Coarse Grain Reconfigurable Architectures
2
Announcements
  • Group meetings this week (sign up sheet)
  • Starting October 8th: PSYCH304, 3:30-6pm
  • Today: Coarse Grain Architectures

3
Motivation for coarse grained architectures
  • Definition
  • FPGA with granularity greater than 1 bit.
  • Architectures seen so far vary from 2 bits to 32
    bits.
  • Disadvantages of low granularity
  • Large routing area (80% of chip area!).
  • Large volume of configuration data.
  • Low area efficiency for arithmetic operations and
    RAM.
  • Reduced clock speed, bandwidth.
  • Advantages of coarse grained architectures
  • Fewer PEs, so the place-and-route (PAR) problem is less complex and much faster.
  • Less chip area devoted to routing.
  • Easier to correlate software functions and
    hardware.
  • Tradeoffs
  • Loss of the flexibility/generality of 1-bit configuration.
  • Possible underutilization.

4
Coarse grained architectures
  • DP-FPGA
  • LUT-based
  • LUTs share configuration bits
  • Chess
  • Hexagonal Mesh (Chess board layout) of ALUs
  • Matrix
  • 2-D array of ALUs
  • Rapid
  • Specialized ALUs, multipliers
  • 1D array
  • Raw
  • Full RISC core as basic block
  • 2D mesh
  • Paddi
  • Cluster of 8 arithmetic EXUs, 16 bits wide, 8-word SRAM
  • Central Crossbar switch

5
Different Coarse Grained Architectures
[Figure: comparison of coarse-grained architectures built from ALUs - DP-FPGA, CHESS, MATRIX, PADDI, RAPID (1D array, 16-bit), RAW (2D mesh, 8-bit, with instruction memory, data memory, and configurable logic per tile).]
6
RAW, Motivation
  • It takes on the order of two clock cycles for a
    signal to travel from edge-to-edge (roughly
    fifteen mm) of a 2-GHz processor
  • Compaq's Alpha 21264 was forced to split the integer unit into two physically dispersed clusters, with a one-cycle penalty for communication of results between clusters.
  • Intel Pentium 4 architects had to allocate two
    pipeline stages solely for the traversal of long
    wires.
  • Raw's approach
  • Use a scalable ISA that provides a parallel software interface to the gate, wire, and pin resources of a chip.
  • An architecture with direct, first-class analogs
    to all of these physical resources lets
    programmers extract the maximum amount of
    performance and energy efficiency in the face of
    wire delay.
  • Raw tries to minimize the ISA gap by exposing the underlying physical resources as architectural entities.

7
What is Raw?
  • The Raw microprocessor consumes 122 million
    transistors
  • executes 16 different load, store, integer, or
    floating-point instructions every cycle
  • controls 25 Gbytes/s of input/output (I/O)
    bandwidth and
  • has 2 Mbytes of on-chip distributed L1 static RAM
  • providing on-chip memory bandwidth of 57
    Gbytes/s.
  • It took only a handful of graduate students at the Laboratory for Computer Science at MIT to design and implement Raw.

8
RAW Reconfigurable Architecture Workstation
(@MIT)
  • Challenges
  • Keep internal chip wires short to allow a high clock speed
  • Quick verification of new designs
  • Workloads on processors (multimedia)
  • Multi-granular Processing Elements (Tiles)
  • a RISC processor,
  • Configurable Logic (CL),
  • instruction and data memories,
  • programmable switch
  • Parallelizing compiler (to distribute workload)
  • Supports static and dynamic routing

9
RAW Microprocessor
Each tile includes a RISC processor and
configurable logic
10
RAW - Comparison
11
RAW
12
RAW Datapath of individual tile
  • Intertile communication latency is similar to that of register accesses
  • Distributed registers: more ILP
  • Memory access time close to processor clock
  • No hardware based register-renaming logic or
    instruction issue (different from superscalar)
  • Focus is on computation (NOT control)
  • Switch integrated into processor pipeline
  • Static and dynamic schedule
  • Multigranular and configurability

13
RAW Pipeline
14
RAW Software Support
15
RAW Compiler
  • (a) shows the original code; (b) shows the memories of a 4-processor Raw machine and how array A is distributed across them; (c) shows the code after unrolling, where each access refers to locations on only one processor.
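
For concreteness, a minimal C sketch of the unrolling described above, assuming a 4-tile machine and a simple sum over array A (the array name A is from the slide; N, the reduction, and the variable names are illustrative assumptions):

  /* Illustrative sketch only: array A comes from the slide; the 4-tile
   * round-robin distribution, N, and the reduction are assumptions. */
  #include <stdio.h>

  #define N 16
  static int A[N];   /* assume A[i] lives in the memory of tile i % 4 */

  int main(void)
  {
      for (int i = 0; i < N; i++) A[i] = i;

      /* (a) Original loop: consecutive iterations touch all four tiles. */
      int sum = 0;
      for (int i = 0; i < N; i++)
          sum += A[i];

      /* (c) Unrolled by the tile count: each line of the body touches only
       * elements resident on one tile, so the compiler can place that
       * operation (and its partial sum) on that tile. */
      int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      for (int i = 0; i < N; i += 4) {
          s0 += A[i];        /* tile 0's elements */
          s1 += A[i + 1];    /* tile 1's elements */
          s2 += A[i + 2];    /* tile 2's elements */
          s3 += A[i + 3];    /* tile 3's elements */
      }
      printf("%d %d\n", sum, s0 + s1 + s2 + s3);   /* both print 120 */
      return 0;
  }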

16
RAW parallelizing compiler
Parallelizing applications into silicon. 1999
17
RAW Experimental Results
18
RAW - Performance
19
RAW fabricated
20
RAW vs. FPGA
  • FPGAs
  • exploit fine-grained parallelism and fast static
    communication.
  • Software has access to its low-level details,
    allowing the software to optimize mapping of the
    user application.
  • Users can also bind commonly used instruction
    sequences into configurable logic.
  • As a result, these special purpose instruction
    sequences can execute in a single cycle.
  • require loading an entire bitstream to reprogram
    the FPGAs for a new operation.
  • Compilation for FPGAs is also slow because of
    their fine granularity.
  • Raw
  • supports instruction sequencing.
  • Raw is more flexible: executing a new operation merely requires pointing to a new instruction.
  • Compilation in Raw is fast because the hardware
    contains commonly used compute mechanisms such as
    ALUs and memory paths.
  • This eliminates repeated, low-level compilations
    of these units.
  • Binding common mechanisms into hardware also yields faster execution speed, lower area, and better power efficiency than FPGA systems.

21
CHESS
  • A Reconfigurable Arithmetic Array for Multimedia
    Applications

22
The CHESS Architecture -Introduction
  • Developed by HP.
  • Aims at speeding up arithmetic operations for
    multimedia applications.
  • Also tries to improve memory density.
  • Salient features
  • 4-bit ALUs - why 4 bits?
  • 4-bit buses
  • Switchboxes with 2 modes (routing or memory)
  • Chessboard layout: strong local connectivity
  • Embedded block RAMs: 256 x 8 per 16 ALUs
  • Hierarchical line lengths with buffers for speed
  • Small configuration memories, so configuration is fast
  • No run-time reconfiguration - why?

23
Application Benefits of CHESS
  • High computational density on a single chip
  • Delayed design commitment
  • Wide on-chip interfaces
  • Does not do runtime reconfiguration
  • Memory bandwidth
  • Distributed on-chip memory
  • Flexibility
  • Switchboxes can be converted to RAM (RAM vs. routability tradeoff; embedded memory is preferred)
  • Different ALU instructions can be programmed
  • Rapid Reconfiguration

24
The CHESS Layout of ALUs
  • 4 bit Alus
  • 4 bit buses
  • Hexagonal Array
  • Local Connectivity good

25
Distributed RAM
  • Distributed memory
  • 1 RAM for 16 ALU blocks
  • Cascaded ALUs possible

26
ALU bit slice and sample instruction set
  • Performs addition, subtraction and logical
    operations
  • 4-bit inputs A and B
  • Single bit carry
  • Variety of carry conditioning options

Carry input as a data signal: arithmetic
Carry input as a control signal: testing equality of numbers wider than 4 bits
Carry input as a control signal: can drive local resets, enables, etc.
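
A minimal C model of a 4-bit ALU slice with carry in/out may help make the datapath width concrete. This is a sketch under assumptions: the opcode set and carry conventions below are illustrative, not the actual CHESS instruction encoding.

  #include <stdint.h>

  /* Illustrative 4-bit ALU slice with carry-in/carry-out.
   * The opcodes and encodings are assumptions, not the CHESS ISA. */
  typedef enum { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR } alu_op;

  static uint8_t alu4(alu_op op, uint8_t a, uint8_t b, int cin, int *cout)
  {
      a &= 0xF;                      /* 4-bit operands */
      b &= 0xF;
      unsigned r;
      switch (op) {
      case OP_ADD: r = a + b + (unsigned)cin; break;
      case OP_SUB: r = a + ((~b) & 0xFu) + (unsigned)cin; break;  /* cin = 1 means no borrow */
      case OP_AND: r = a & b; break;
      case OP_OR:  r = a | b; break;
      default:     r = a ^ b; break;
      }
      *cout = (int)((r >> 4) & 1u);  /* carry out of the 4-bit slice */
      return (uint8_t)(r & 0xFu);    /* 4-bit result */
  }

Chaining cout of one slice into cin of the next is how wider words are built on a 4-bit fabric, which matches the ALU cascading mentioned on the distributed-RAM slide; the control uses of the carry listed above are not modelled here.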
27
ALU and Switchbox
  • Switchbox memory can be used as storage as shown: A is the address, B is Wr_data, and F is Rd_data
  • Additional 2 bit registers for use as buffers for
    long wires
  • ALU core for computation
  • ALU instruction can be changed on a cycle-to-cycle basis using the I input

28
The CHESS Architecture Interconnection
First 2: local connections
Next 2: connected to long wires as feeders
L1 (2): diagonally adjacent switchboxes
L2 (4): connect to the ends of feeder buses
L4 (2), L8 (2), L16 (2)
  • Different type of wire segments depending on
    length of connection
  • Rent's rule

29
DP-FPGA
  • For regularly structured datapaths like those used in DSP and communication.

Control: regular FPGA
Memory: banks of SRAM
Datapath: 4-bit data, 1-bit control
Programming bit sharing
Carry bit chain
30
DPFPGA vs. CHESS
  • Both use shared configuration bits and dedicated carry chains
  • DPFPGA used LUTs vs. ALUs for Chess
  • CHESS used uniform/balanced wiring
  • DPFPGA used dedicated shifter

31
Performance of CHESS
  • Use metrics to evaluate computational power.
  • Efficient multiplies due to embedded ALU
  • Process independent estimate

32
Summary The CHESS Architecture
  • Achieves high computational density
  • Chess is a highly flexible and scalable design.
  • It can feed ALUs with instructions generated
    within the array
  • has embedded block RAM and can trade routing
    switches for memory.

33
General-purpose computing: two important questions
  • 1. How are general-purpose processing resources
    controlled?
  • 2. How much area is dedicated to holding the
    instructions which control these resources?
  • SIMD, MIMD, VLIW, FPGA, reconfigurable ALUs

MATRIX is a reconfigurable device architecture which allows these questions to be answered by the application rather than by the device architect.
34
Another way of criticizing FPGAs
  • FPGAs allow finer-granularity control over operations and dedicate minimal area to instruction distribution.
  • they can deliver more computations per unit
    silicon than processors on a wide range of
    regular operations.
  • However, the lack of resources for instruction distribution makes them efficient only when the functional diversity is low
  • i.e. when the same operation is required
    repeatedly
  • and that entire operation can be fit spatially
    onto the FPGA or FPGAs in the system.

35
MATRIX approach
  • Rather than separating the resources for instruction storage, data storage, and computation and dedicating silicon to each at fabrication time, the MATRIX architecture unifies these resources.

36
MATRIX approach
  • traditional instruction and control resources are
    decomposed along with computing resources and can
    be deployed in an application-specific manner.
  • They can support active computation or control the reuse of computational resources, depending on the needs of the application and the available hardware resources.

37
The MATRIX architecture
  • Developed at MIT
  • 2D array of ALUs
  • 8-bit Granularity
  • Each Basic Functional Unit contains ALU and
    Memory
  • Ideal for systolic and VLIW computation.
  • Unified Configurable Network

38
The MATRIX Basic Functional Unit
Basic Functional Unit (BFU)
  • Granularity (8 bit)
  • Contains an ALU, Memory and Control Logic
  • Memory for instructions (configurations), data.

39
The MATRIX Interconnection Network
Interconnection Network
  • Hierarchical Interconnection Network
  • Level-two interconnect: medium-distance interconnection
  • Global lines: spanning an entire row/column

40
The MATRIX port structure
  • Different sources of ALU inputs (a rough software model is sketched below)
  • Static Value Mode
  • Static Source Mode
  • Dynamic Source Mode

Basic Port Architecture
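
The three modes can be read as follows: a static-value port delivers a constant held in configuration memory, a static-source port reads the same network line every cycle, and a dynamic-source port selects its network line cycle by cycle under control of the network. The C sketch below is a rough software model with assumed names, not the MATRIX hardware.

  #include <stdint.h>

  /* Rough software model of a MATRIX BFU input port; names are assumptions. */
  typedef enum { STATIC_VALUE, STATIC_SOURCE, DYNAMIC_SOURCE } port_mode;

  typedef struct {
      port_mode mode;
      uint8_t   const_value;    /* used in STATIC_VALUE mode (from config memory) */
      uint8_t   fixed_source;   /* network line index, STATIC_SOURCE mode */
  } port_config;

  /* net[] holds the current 8-bit value on each network line;
   * dyn_select is the source index chosen this cycle by control logic. */
  static uint8_t read_port(const port_config *p, const uint8_t net[], uint8_t dyn_select)
  {
      switch (p->mode) {
      case STATIC_VALUE:  return p->const_value;        /* constant every cycle */
      case STATIC_SOURCE: return net[p->fixed_source];  /* same line every cycle */
      default:            return net[dyn_select];       /* line chosen at run time */
      }
  }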
41
Convolution Mapping in MATRIX -Systolic Array
Example of Convolution Mapping
  • The sample values are passed through the first row
  • Produces a result every two cycles
  • Needs 4k cells to implement (k = number of taps)
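
For reference, the computation being mapped on these three slides is a k-tap convolution. A plain C version is sketched below so the systolic, microcoded, and VLIW/MSIMD mappings can be compared against the same loop nest; the names, types, and bounds are assumptions, with 8-bit samples chosen to match MATRIX's 8-bit granularity.

  #include <stdint.h>

  /* Reference k-tap convolution: y[i] = sum_j w[j] * x[i + j].
   * Names, data widths, and bounds are illustrative assumptions. */
  void convolve(const int8_t *x, const int8_t *w, int32_t *y, int n, int k)
  {
      for (int i = 0; i + k <= n; i++) {       /* one output per window position */
          int32_t acc = 0;
          for (int j = 0; j < k; j++)
              acc += (int32_t)w[j] * x[i + j]; /* multiply-accumulate */
          y[i] = acc;
      }
  }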

42
Convolution Mapping in MATRIX - Microcoded Example
Functional diversity vs. performance: microprocessor vs. FPGA?
Example of Convolution Mapping (Microcoded)
  • Coefficients stored in BFU memory
  • Takes 8K 9 cycles to implement
  • Consumes only 8 cells

43
Convolution Mapping in MATRIX VLIW/MSIMD
Separate BFU for Xptr and Wptr
Example of Convolution Mapping (VLIW)
Perform 6 convolutions simultaneously (MSIMD)
44
Granularity Performance
  • 8-bit granularity is optimal here, but bit-level operations will result in underutilization
  • Extremely flexible

Area breakdown: network 50%, memory 30%, control 12%, ALU 8%
45
Summary The Matrix Architecture
  • Application tailored datapaths
  • Dynamic control
  • Regularity is exploitable
  • Deployable resources
  • High Flexibility in implementation
  • Instruction stream compression

46
Chess vs. Matrix
  • Both Use 2D array of ALUs
  • For both Instructions can be generated within the
    array
  • Both are flexible
  • Chess is 4-bit, Matrix is 8-bit. What is the tradeoff?
  • Chess does not support run-time reconfiguration but configures very quickly because few bits are required
  • Chess has high computational density
  • Chess is aimed at arithmetic; Matrix is more general purpose

47
RaPiD
  • Reconfigurable Pipelined Datapath
  • optimized for highly repetitive,
    computation-intensive tasks.
  • Very deep application-specific computation
    pipelines can be configured in RaPiD
  • pipelines make much more efficient use of silicon
    than traditional FPGAs
  • Yield much higher performance for a wide range of
    applications

48
RaPiD - Reconfigurable Pipelined Datapath: Motivation
  • University of Washington
  • Motivation
  • Configurable computing: performance close to ASIC
  • Flexibility close to a general-purpose processor
  • Yet FPGA-based systems have some problems.
  • Platform for implementing random functions
  • Automatic Compilation (High-level synthesis)
  • Suitable for DSP applications
  • Datapath with static and dynamic signals

49
RaPiD
  • Coarse-grained FPGA architecture that allows
    deeply pipelined computational datapaths to be
    constructed dynamically from a mix of ALUs,
    multipliers, registers and local memories.
  • The goal of RaPiD is to compile regular
    computations like those found in DSP applications
    into both an application-specific datapath and
    the program for controlling that datapath

50
RaPiD - Datapath
  • Uses static and dynamic control signals.
  • Static control determines the underlying
    structure of the datapath that remains constant
    for a particular application.
  • generated by static RAM cells that are changed
    only between applications
  • Dynamic control signals can change from cycle to
    cycle and specify the variable operations
    performed and the data to be used by those
    operations.
  • provided by a control program.
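
To make the split concrete, here is a toy C sketch: static bits fix the datapath wiring (which bus segment feeds each input), while dynamic bits choose the operation performed each cycle. All names, widths, and the opcode encoding are assumptions for illustration, not the actual RaPiD control format.

  #include <stdint.h>

  /* Illustrative only: static bits come from SRAM set once per application,
   * dynamic bits are streamed from the control program every cycle. */
  typedef struct {
      uint8_t in_a_bus;    /* static: which bus segment drives input A */
      uint8_t in_b_bus;    /* static: which bus segment drives input B */
  } static_cfg;

  static int16_t alu_cycle(const static_cfg *cfg,
                           const int16_t bus[],    /* values on the segments feeding this unit */
                           uint8_t dyn_op)         /* dynamic: operation this cycle */
  {
      int16_t a = bus[cfg->in_a_bus];
      int16_t b = bus[cfg->in_b_bus];
      switch (dyn_op) {
      case 0:  return (int16_t)(a + b);
      case 1:  return (int16_t)(a - b);
      default: return (int16_t)(a & b);
      }
  }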

51
RaPiD- Introduction
  • Computational bandwidth is extremely high and
    scales with array size
  • I/O operations limit the speedup an application
    can get
  • Suited for highly repetitive, computational
    intensive tasks
  • Control flow is to be taken care of by another processor (RISC)
  • Restricted to linear arrays, NOT a 2D
    architecture

52
RaPiD Basic Block (Cell)
  • Each cell contains 1 integer (16-bit) multiplier, 2 ALUs, 6 general-purpose registers, and 3 small local memories
  • Complete array (RaPiD-I) consists of 16 cells
  • Ten segmented busses run through the length of
    the datapath

53
RaPiD Interconnect
Bus connector
  • Input to any functional unit is driven through a
    multiplexer (8 lines)
  • The output of a functional unit can fan out to any number of busses (8)
  • Busses in different tracks are segmented to
    different lengths
  • Bus connector is used to connect to adjacent
    segments through register or buffer (either
    direction)
  • These registers can also be used for pipelining

54
RaPiD More about Cells
  • Functional unit outputs can be registered or unregistered
  • Granularity: 16 bits; different fixed-point representations
  • ALUs: logical and arithmetic operations
  • 2 ALUs and a multiplier can be pipelined for a MAC operation (32-bit)
  • Datapath registers: expensive (area and bus utilization)
  • Used to connect to different tracks
  • Local memory: can be used to store constant arrays
  • Includes address generators
  • I/O streams for data transfer (FIFOs), asynchronous

55
RaPiD Register, Memory
Datapath Register
Local Memory
56
RaPiD Control Path
  • Control pipeline
  • Static control
  • Dynamic control (example: initialization)
  • LUTs provide simple programmability.
  • Cells can be chained together to form a continuous pipe.
  • 230 control signals/cell; 80 are dynamic.

57
RaPiD Weighted Filter Example
  • FIR filter algorithm with 4 taps
  • Filter weights are stored in the W array
  • Input is stored in the X array
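
The computation, and one way it can be deeply pipelined, can be sketched in C. The transposed-form pipeline below is a generic illustration of mapping one tap per cell with partial sums flowing between stages; it is an assumption for illustration, not the exact RaPiD netlist, and only the names W and X come from the slide.

  #include <stdint.h>

  #define TAPS 4   /* one cell per tap in the pipelined arrangement */

  /* Transposed-form 4-tap FIR: Y[i] = sum_t W[t] * X[i - t] (zero history).
   * z[1..3] model the pipeline registers between cells; all names other
   * than W and X are illustrative assumptions. */
  void fir4_pipelined(const int16_t X[], const int16_t W[TAPS],
                      int32_t Y[], int n)
  {
      int32_t z[TAPS] = {0};                 /* z[0] is unused */
      for (int i = 0; i < n; i++) {
          int32_t x = X[i];                  /* broadcast sample to all cells */
          Y[i] = W[0] * x + z[1];            /* cell 0 finishes this output */
          for (int k = 1; k < TAPS - 1; k++)
              z[k] = W[k] * x + z[k + 1];    /* each cell adds its tap, passes on */
          z[TAPS - 1] = W[TAPS - 1] * x;     /* last cell starts a new partial sum */
      }
  }

One multiply-accumulate per active cell per cycle is what the t x min(T, S) throughput estimate on the performance slide counts.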

58
RaPiD Mapping to RaPiD (a)
Schematic
59
RaPiD Mapping to RaPiD (b)
  • Performance depends upon clock rate (t in MHz),
    Number of Cells (S) and Memory locations/cell (M)
  • Results are in MOPS/GOPS; a single MAC counts as one operation
  • FIR filter with T taps: t x min(T, S) MOPS
  • For t = 100, S = 16, M = 96, T > 16 -> sustained rate of 1.6 GOPS
  • Matrix multiplication of X x Y and Y x Z matrices
  • Sustained rate is t x min(Y, M/3, S) MOPS
  • Conclusions: no compiler yet, I/O challenges, integration with RISC
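
A quick sanity check of the FIR formula, using only the numbers quoted above (reading t as the clock rate in MHz):

  \[
    \text{rate} = t \cdot \min(T, S) = 100\,\text{MHz} \times 16 = 1600\ \text{MMAC/s} = 1.6\ \text{GOPS}
    \qquad (T > 16,\ S = 16).
  \]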

60
Issues
  • The domain of applicability must be explored by
    mapping more problems from different domains to
    RaPiD.
  • Thus far all RaPiD applications have been
    designed by hand. The next step will be to apply
    compiler technology, particularly
    loop-transformation theory and systolic array
    compiling methods to build a compiler for RaPiD.
  • A memory architecture must be designed which can support the I/O bandwidth required by RaPiD over a wide range of applications.
  • Although it is clear that RaPiD should be closely
    coupled to a generic RISC processor, it is not
    clear exactly how this should be done. This is a
    problem being faced by other reconfigurable
    computers.

61
RaPiD OFDM implementation
RaPiD-C (an assembly-like programming language) was used for mapping.
Implementing OFDM Receiver on the RaPiD
Reconfigurable Architecture. 2003
62
RaPiD Performance (OFDM)
Performance Results for OFDM implementation
63
RaPiD Performance
64
RaPiD Performance
  • A 16-cell RaPiD array computes an 8x8 2D-DCT as two matrix multiplies
  • For images > 256x256 pixels, the sustained rate is 1.6 billion MACs per second,
  • including reconfiguration overhead between images, with an average of 2 memory accesses per cycle.
  • Motion estimation (cf. matrix multiply and DCT):
  • sustained rate of 1.6 billion difference/absolute-value/accumulate operations per second,
  • but with an average of 0.1 memory accesses per cycle.
  • Motion picture compression (both motion estimation and DCT on each frame):
  • With a reconfiguration time of 2000 cycles (20 µs),
  • little performance is lost to reconfiguration and pipeline stalling.
  • For a standard 720x576 frame: 12 frames per second when executing both full motion estimation and DCT (including 4000 reconfiguration cycles per frame and pipelining).
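
These numbers are consistent with the 16-cell array and the 100 MHz clock quoted on the earlier performance slide (the clock rate is carried over from there as an assumption):

  \[
    16\ \text{multipliers} \times 100\ \text{MHz} = 1.6\times 10^{9}\ \text{MACs/s},
    \qquad
    \frac{2000\ \text{cycles}}{100\ \text{MHz}} = 20\ \mu\text{s}.
  \]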

65
Comparison with existing architectures
  • PADDI: 16-bit
  • Advanced fine granularity: VEGA, Dharma
  • DPFPGA

66
VEGA
  • Virtual Element Gate Array
  • Designed for circuit emulation
  • Multiplex a single LUT over time to simulate an
    array of LUTs

67
Dharma
68
PADDI
  • Each EXU contains dedicated hardware support for fast 16-bit arithmetic
  • Global broadcasting
  • Local memory in the form of register files within
    EXUs