Why Systolic Architecture - PowerPoint PPT Presentation

Slides: 19
Provided by: siddh1
1
Why Systolic Architecture?
2
Motivation and Introduction
  • We need high-performance, special-purpose
    computer systems to meet specific application
    needs.
  • The imbalance between I/O and computation is a
    notable problem.
  • Systolic architecture can map high-level
    computations onto hardware structures.
  • A systolic system works like an automobile
    assembly line.
  • A systolic system is easy to implement because of
    its regularity, and it is easy to reconfigure.
  • Systolic architecture can yield cost-effective,
    high-performance special-purpose systems for a
    wide range of problems.

3
Key architectural issues in designing
special-purpose systems
  • Simple and regular design
  • Simple, regular designs yield cost-effective
    special-purpose systems.
  • Concurrency and communication
  • Design algorithms that support high concurrency
    while employing only simple, regular
    communication and control.
  • Balancing computation with I/O
  • A special-purpose system should be able to match
    a variety of I/O bandwidths.

4
Basic principle of systolic architecture
  • A systolic system consists of a set of
    interconnected cells, each capable of performing
    some simple operation.
  • The systolic approach can speed up a
    compute-bound computation in a relatively simple
    and inexpensive manner.
  • A systolic array, in particular, is illustrated
    on the next page. (We achieve higher computation
    throughput without increasing memory bandwidth.)
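
The throughput claim in the last bullet can be made concrete with a little arithmetic. The sketch below is illustrative only: `max_mops` is a hypothetical helper, and the numbers used with it (10 MB/s of memory bandwidth, 2 bytes fetched per operation, 6 cells) are assumptions, not values from the slides. It also assumes that every operand fetched from memory is used once by each cell it passes through.

```c
#include <assert.h>

/* Upper bound on throughput, in millions of operations per second,
 * when each operand fetched from memory is reused by every cell.
 * All parameters are illustrative assumptions. */
double max_mops(double mem_mbytes_per_s, double bytes_per_op, int cells)
{
    assert(bytes_per_op > 0.0 && cells > 0);
    return cells * (mem_mbytes_per_s / bytes_per_op);
}
```

Under these assumptions a single cell is limited to `max_mops(10.0, 2.0, 1)`, which is 5 million operations per second, while an array of six cells reaches 30 million operations per second with the same memory bandwidth.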

5
Basic principle of a systolic system
6
CONVOLUTION
  • In mathematics and, in particular, functional
    analysis, convolution is a mathematical operator
    which takes two functions f and g and produces a
    third function that, in a sense, represents the
    amount of overlap between f and a reversed and
    translated version of g.
  • Typically, one of the functions is taken to be a
    fixed filter impulse response, and is known as a
    kernel. Such a convolution is a kind of
    generalized moving average, as one can see by
    taking the kernel to be an indicator function of
    an interval.
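
The moving-average remark can be checked with a small sketch. The function name `convolve` and the sample data are assumptions made for illustration. The kernel is a box (indicator) function scaled so that its entries sum to one, and samples outside the input range are treated as zero.

```c
#include <assert.h>

/* Full discrete convolution of x (n samples) with kernel h (k taps).
 * y must have room for n + k - 1 values. */
void convolve(const double *x, int n, const double *h, int k, double *y)
{
    assert(n > 0 && k > 0);
    for (int i = 0; i < n + k - 1; i++) {
        y[i] = 0.0;
        for (int j = 0; j < k; j++) {
            /* x is taken to be zero outside 0..n-1 */
            if (i - j >= 0 && i - j < n)
                y[i] += h[j] * x[i - j];
        }
    }
}
```

With x = {2, 4, 6, 8} and the two-tap box kernel h = {0.5, 0.5}, the interior outputs y[1], y[2], y[3] are 3, 5 and 7, the averages of adjacent input pairs. The edge outputs y[0] = 1 and y[4] = 4 are smaller because of the zero extension.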

7
CONVOLUTION
  • Visual explanation of convolution. Make each
    waveform a function of the dummy variable τ.
    Time-invert one of the waveforms and add t to
    allow it to slide back and forth on the τ-axis
    while remaining stationary with respect to t.
    Finally, start the function at negative infinity
    and slide it all the way to positive infinity.
    Wherever the two functions intersect, find the
    integral of their product. The resulting waveform
    (not shown here) is the convolution of the two
    functions. If the stationary waveform is a unit
    impulse, the end result would be the original
    version of the sliding waveform, as it is
    time-inverted back again because the right edge
    hits the unit impulse first and the left edge
    last. This is also the reason for the
    time-inversion in general, as complex signals can
    be thought to consist of unit impulses.

8
CONVOLUTION
  • Discrete convolution
  • For discrete functions, one can use a discrete
    version of the convolution operation. It is given
    by
      (f * g)[n] = \sum_{m} f[m] \, g[n - m]
  • When multiplying two polynomials, the
    coefficients of the product are given by the
    convolution of the original coefficient
    sequences, in this sense (using extension with
    zeros as mentioned above).
  • Generalizing the above cases, the convolution can
    be defined for any two integrable functions
    defined on a locally compact topological group.
  • A different generalization is the convolution of
    distributions.
  • Evaluating discrete convolutions using the above
    formula applied directly takes O(N^2) arithmetic
    operations for N points, but this can be reduced
    to O(N log N) using a variety of fast algorithms.
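
The polynomial-multiplication remark can be illustrated with a short sketch. The name `poly_mul` and the example polynomials are assumptions chosen for illustration. Coefficients are stored lowest degree first, and the double loop is the direct O(N^2) evaluation, not one of the fast algorithms.

```c
#include <assert.h>

/* Multiply polynomials a (na coefficients) and b (nb coefficients).
 * c receives the na + nb - 1 coefficients of the product, which is
 * the discrete convolution of the two coefficient sequences. */
void poly_mul(const int *a, int na, const int *b, int nb, int *c)
{
    assert(na > 0 && nb > 0);
    for (int i = 0; i < na + nb - 1; i++)
        c[i] = 0;
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            c[i + j] += a[i] * b[j];   /* x^i times x^j lands on x^(i+j) */
}
```

For example, (1 + 2x + 3x^2)(4 + 5x) gives the coefficients {4, 13, 22, 15}, that is, 4 + 13x + 22x^2 + 15x^3.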

9
CONVOLUTION
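  • Code

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int w[] = {1, 2, 2, 1};
        int x[] = {11, 2, 3, 4, 5, 6, 3, 2, 1};
        int y[20];
        int w_len = 4;
        int x_len = 9;
        int i, j, temp;

        /* Clear the w_len + x_len - 1 outputs of the full convolution. */
        for (i = 0; i < (w_len + x_len - 1); i++)
            y[i] = 0;

        for (i = 0; i < (w_len + x_len - 1); i++) {
            for (j = 0; j < w_len; j++) {
                /* Treat x as zero outside its valid index range. */
                if ((i - j < 0) || (i - j > (x_len - 1)))
                    temp = 0;
                else
                    temp = x[i - j];
                y[i] += w[j] * temp;
            }
            printf("y[%d] = %d\n", i, y[i]);
        }
        return 0;
    }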

10
Design B1
  • Previously proposed for circuits that implement
    a pattern-matching processor and for circuits
    that implement polynomial multiplication.

- Broadcast inputs, move results, weights stay -
(Semi-)systolic convolution arrays with global
data communication
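
A cycle-by-cycle software model can make the B1 data flow easier to follow. The sketch below is an assumption-laden illustration, not the original circuit: `b1_convolve` and `MAX_CELLS` are hypothetical names, indices are 0-based (the slides use 1-based names), and the weights are assumed to be stored in reverse order along the direction in which the results move so that the output matches y[i] = sum of w[j] * x[i - j].

```c
#include <assert.h>

#define MAX_CELLS 16   /* arbitrary limit for this sketch */

/* Design B1 model: each cell keeps one weight, the current input is
 * broadcast to all cells, and partial results move one cell per cycle.
 * y must have room for n + k - 1 values. */
void b1_convolve(const int *w, int k, const int *x, int n, int *y)
{
    int yreg[MAX_CELLS] = {0};   /* partial result latched in each cell */
    assert(k > 0 && k <= MAX_CELLS && n > 0);

    for (int t = 0; t < n + k - 1; t++) {
        int xin = (t < n) ? x[t] : 0;   /* broadcast; zeros flush the array */

        /* Update from the last cell backwards so that every cell reads
         * the value its left neighbour held in the previous cycle. */
        for (int c = k - 1; c >= 0; c--) {
            int yin = (c == 0) ? 0 : yreg[c - 1];   /* fresh y enters as 0 */
            yreg[c] = yin + w[k - 1 - c] * xin;     /* cell c stores w[k-1-c] */
        }
        y[t] = yreg[k - 1];   /* a finished result leaves the last cell */
    }
}
```

With w = {1, 2, 2, 1} and x = {11, 2, 3, 4, 5, 6, 3, 2, 1}, the model produces 11, 24, 29, 25, 21, 27, 29, 25, 17, 9, 4, 1, one result per cycle.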
11
CONVOLUTION
  • y1 = x1.w1
  • y2 = x2.w1 + x1.w2
  • y3 = x3.w1 + x2.w2 + x1.w3
  • y4 = x4.w1 + x3.w2 + x2.w3 + x1.w4
  • y5 = x5.w1 + x4.w2 + x3.w3 + x2.w4
  • y6 = x6.w1 + x5.w2 + x4.w3 + x3.w4
  • y7 = x7.w1 + x6.w2 + x5.w3 + x4.w4
  • y8 = x8.w1 + x7.w2 + x6.w3 + x5.w4
  • y9 = x9.w1 + x8.w2 + x7.w3 + x6.w4
  • y10 = x9.w2 + x8.w3 + x7.w4
  • y11 = x9.w3 + x8.w4
  • y12 = x9.w4

12
Design B2
  • The path for moving the yi's is wider than the
    path for the wi's, because the yi's carry more
    bits than the wi's for numerical accuracy.
  • The use of multiplier-accumulators may also help
    increase the precision of the results, since
    extra bits can be kept in these accumulators at
    modest cost.

- Broadcast inputs, move weights, results stay -
(Semi-)systolic convolution arrays with
global data communication
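
The B2 data flow can be modelled in the same cycle-by-cycle style. This is a sketch under stated assumptions: `b2_convolve` and `MAX_CELLS` are hypothetical names, indices are 0-based, the weights are assumed to circulate around a ring of k cells, and the cell that has just used w[0] is assumed to output its accumulator and reset.

```c
#include <assert.h>

#define MAX_CELLS 16   /* arbitrary limit for this sketch */

/* Design B2 model: each cell accumulates one result in place, the
 * current input is broadcast, and the weights rotate one cell per
 * cycle. y must have room for n + k - 1 values. */
void b2_convolve(const int *w, int k, const int *x, int n, int *y)
{
    int acc[MAX_CELLS] = {0};   /* multiplier-accumulator in each cell */
    int wreg[MAX_CELLS];        /* weight currently visiting each cell */
    assert(k > 0 && k <= MAX_CELLS && n > 0);

    for (int c = 0; c < k; c++)
        wreg[c] = w[c];         /* assumed initial placement */

    for (int t = 0; t < n + k - 1; t++) {
        int xin = (t < n) ? x[t] : 0;   /* broadcast; zeros flush the array */
        for (int c = 0; c < k; c++)
            acc[c] += wreg[c] * xin;

        /* The cell holding w[0] has just added the last term of y[t]. */
        int done = t % k;
        y[t] = acc[done];
        acc[done] = 0;

        /* Rotate the weights: cell c receives the weight of cell c - 1. */
        int wrap = wreg[k - 1];
        for (int c = k - 1; c > 0; c--)
            wreg[c] = wreg[c - 1];
        wreg[0] = wrap;
    }
}
```

The accumulators here are plain `int` values. In hardware, as the bullets above note, the accumulators can be made wider than the weight path to keep extra bits of precision.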
13
CONVOLUTION
  • Weight stream labels from the figure: w1, w2, w3
  • y1 = x1.w1
  • y2 = x2.w1 + x1.w2
  • y3 = x3.w1 + x2.w2 + x1.w3
  • y4 = x4.w1 + x3.w2 + x2.w3 + x1.w4
  • y5 = x5.w1 + x4.w2 + x3.w3 + x2.w4
  • y6 = x6.w1 + x5.w2 + x4.w3 + x3.w4
  • y7 = x7.w1 + x6.w2 + x5.w3 + x4.w4
  • y8 = x8.w1 + x7.w2 + x6.w3 + x5.w4
  • y9 = x9.w1 + x8.w2 + x7.w3 + x6.w4
  • y10 = x9.w2 + x8.w3 + x7.w4
  • y11 = x9.w3 + x8.w4
  • y12 = x9.w4

14
Design F
  • When the number of cells is large, the adder can
    be implemented as a pipelined adder tree to
    avoid large delays.
  • Designs of this type use unbounded fan-in.

- Fan-in results, move inputs, weights stay -
(Semi-)systolic convolution arrays with global data
communication
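
Design F behaves like a tapped delay line, which the following sketch models. The names `f_convolve` and `MAX_CELLS` are hypothetical, indices are 0-based, and the adder is modelled as a plain loop; the pipelined adder tree mentioned above would add latency but is not modelled here.

```c
#include <assert.h>

#define MAX_CELLS 16   /* arbitrary limit for this sketch */

/* Design F model: weights stay in the cells, inputs shift through the
 * cells one position per cycle, and a single adder sums the products
 * of all cells. y must have room for n + k - 1 values. */
void f_convolve(const int *w, int k, const int *x, int n, int *y)
{
    int xreg[MAX_CELLS] = {0};   /* input currently held by each cell */
    assert(k > 0 && k <= MAX_CELLS && n > 0);

    for (int t = 0; t < n + k - 1; t++) {
        /* Inputs move one cell to the right; zeros flush the array. */
        for (int c = k - 1; c > 0; c--)
            xreg[c] = xreg[c - 1];
        xreg[0] = (t < n) ? x[t] : 0;

        /* Fan-in: every cell contributes one product to the adder. */
        int sum = 0;
        for (int c = 0; c < k; c++)
            sum += w[c] * xreg[c];
        y[t] = sum;
    }
}
```

In cycle t, cell c holds x[t - c], so the adder output is the sum of w[c] * x[t - c], which is y[t].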
15
CONVOLUTION
  • y1 = x1.w1
  • y2 = x2.w1 + x1.w2
  • y3 = x3.w1 + x2.w2 + x1.w3
  • y4 = x4.w1 + x3.w2 + x2.w3 + x1.w4
  • y5 = x5.w1 + x4.w2 + x3.w3 + x2.w4
  • y6 = x6.w1 + x5.w2 + x4.w3 + x3.w4
  • y7 = x7.w1 + x6.w2 + x5.w3 + x4.w4
  • y8 = x8.w1 + x7.w2 + x6.w3 + x5.w4
  • y9 = x9.w1 + x8.w2 + x7.w3 + x6.w4
  • y10 = x9.w2 + x8.w3 + x7.w4
  • y11 = x9.w3 + x8.w4
  • y12 = x9.w4

16
Design R1
  • Design R1 has the advantage that it does not
    require a bus, or any other global network,
    for collecting outputs from cells.
  • The basic idea of this design has been used to
    implement a pattern-matching chip.

- Results stay, inputs and weights move in
opposite directions - (Pure-)systolic convolution
arrays without global data communication
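
The counter-flowing streams of R1 are harder to picture, so a cycle-by-cycle model may help. Everything below is a sketch under assumptions that are not stated on the slides: `r1_convolve` and `MAX_CELLS` are hypothetical names, indices are 0-based, items in both streams are spaced two cycles apart so that every input meets every weight inside a cell, the weight sequence is re-injected endlessly in the order w[k-1], ..., w[0], and a cell outputs and resets its accumulator when it has just used w[0]. The systolic output path is not modelled; results are written directly to y.

```c
#include <assert.h>

#define MAX_CELLS 16   /* arbitrary limit for this sketch */

/* Design R1 model: results stay in the cells, inputs move left to
 * right and weights move right to left, one cell per cycle.
 * y must have room for n + k - 1 values. */
void r1_convolve(const int *w, int k, const int *x, int n, int *y)
{
    int xi[MAX_CELLS];    /* index of the input in each cell, -1 if none  */
    int wi[MAX_CELLS];    /* index of the weight in each cell, -1 if none */
    int acc[MAX_CELLS];   /* result accumulating in each cell             */
    int m = n + k - 1;            /* number of outputs                    */
    int d = k - 1;                /* cycle on which x[0] enters cell 0    */
    int last = 2 * (m + k - 2);   /* cycle on which the last input leaves */

    assert(k > 0 && k <= MAX_CELLS && n > 0);
    for (int p = 0; p < k; p++) {
        xi[p] = -1;
        wi[p] = -1;
        acc[p] = 0;
    }

    for (int t = 0; t <= last; t++) {
        /* Input stream: shift right, inject a new item every 2nd cycle. */
        for (int p = k - 1; p > 0; p--)
            xi[p] = xi[p - 1];
        xi[0] = (t >= d && (t - d) % 2 == 0 && (t - d) / 2 < m)
                    ? (t - d) / 2 : -1;

        /* Weight stream: shift left, inject on every even cycle. */
        for (int p = 0; p < k - 1; p++)
            wi[p] = wi[p + 1];
        wi[k - 1] = (t % 2 == 0) ? k - 1 - (t / 2) % k : -1;

        /* A cell works only when an input and a weight meet in it. */
        for (int p = 0; p < k; p++) {
            if (xi[p] < 0 || wi[p] < 0)
                continue;
            int xv = (xi[p] < n) ? x[xi[p]] : 0;   /* zeros flush the array */
            acc[p] += w[wi[p]] * xv;
            if (wi[p] == 0) {          /* last term of y[xi[p]] was added */
                y[xi[p]] = acc[p];
                acc[p] = 0;
            }
        }
    }
}
```

Because of the two-cycle spacing, each cell is active only on every second cycle in this model, so the array needs roughly twice as many cycles as the broadcast designs to produce the same outputs.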
17
CONVOLUTION
  • Stream labels from the figure: inputs x1 to x6,
    weights w1 to w4
  • y1 = x1.w1
  • y2 = x2.w1 + x1.w2
  • y3 = x3.w1 + x2.w2 + x1.w3
  • y4 = x4.w1 + x3.w2 + x2.w3 + x1.w4
  • y5 = x5.w1 + x4.w2 + x3.w3 + x2.w4
  • y6 = x6.w1 + x5.w2 + x4.w3 + x3.w4
  • y7 = x7.w1 + x6.w2 + x5.w3 + x4.w4
  • y8 = x8.w1 + x7.w2 + x6.w3 + x5.w4
  • y9 = x9.w1 + x8.w2 + x7.w3 + x6.w4
  • y10 = x9.w2 + x8.w3 + x7.w4
  • y11 = x9.w3 + x8.w4
  • y12 = x9.w4

18
CONVOLUTION
  • Description
  • For the sequence of computations shown on the
    previous page, write structural VHDL code such
    that the computation is fully pipelined. xi
    represents a single datum input to the circuit
    from the testbench, while yi is the output back
    to the testbench. New values of xi move into the
    circuit every clock cycle, and new values of yi
    move out to the testbench every clock cycle.
    The clock is another input to the system.
  • The individual components of the system should be
    described behaviorally. (25)
  • Show a plan for architecting your design for a
    pipelined implementation. (25)
  • Write the top-level VHDL structural code for the
    design. (25)
  • Write a testbench for the system. (25)