Parallel Processing (CS 730) Lecture 9: Distributed Memory FFTs

1
Parallel Processing (CS 730) Lecture 9
Distributed Memory FFTs
  • Jeremy R. Johnson
  • Wed. Mar. 1, 2001
  • Parts of this lecture were derived from material
    by Johnson, Johnson, and Pryor.

2
Introduction
  • Objective: To derive and implement a
    distributed-memory parallel program for computing
    the fast Fourier transform (FFT).
  • Topics
    • Derivation of the FFT
    • Iterative version
    • Pease Algorithm Generalizations
    • Tensor permutations
    • Distributed implementation of tensor permutations
      • Stride permutation
      • Bit reversal
    • Distributed FFT

3
FFT as a Matrix Factorization
  • Compute y = F_n x, where F_n is the n-point
    Fourier matrix.

4
Matrix Factorizations and Algorithms
function y = fft(x)
  n = length(x)
  if n == 1
    y = x
  else
    [x0; x1] = L^n_2 x                  % x0 = x(1:2:n-1), x1 = x(2:2:n)
    [t0; t1] = (I_2 tensor F_m)[x0; x1] % t0 = fft(x0), t1 = fft(x1)
    w = W_m(omega_n)                    % w = exp((2*pi*i/n)*(0:n/2-1))
    [y0; y1] = (F_2 tensor I_m) T^n_m [t0; t1]
                                        % y0 = t0 + w.*t1, y1 = t0 - w.*t1
    y = [y0; y1]
  end
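As a sanity check, the radix-2 splitting above can be transcribed directly into Python/numpy. This is an illustrative sketch, not the course's TPL code; since the slide's twiddles use omega_n = exp(2*pi*i/n), the result equals the unnormalized inverse DFT as numpy defines it (n times numpy.fft.ifft).

import numpy as np

def fft(x):
    n = len(x)
    if n == 1:
        return x.copy()
    x0, x1 = x[0::2], x[1::2]          # [x0; x1] = L^n_2 x
    t0, t1 = fft(x0), fft(x1)          # [t0; t1] = (I_2 tensor F_m)[x0; x1]
    w = np.exp(2j * np.pi * np.arange(n // 2) / n)      # w = W_m(omega_n)
    return np.concatenate([t0 + w * t1, t0 - w * t1])   # (F_2 tensor I_m) T^n_m

x = np.random.rand(8) + 1j * np.random.rand(8)
# With omega_n = exp(2*pi*i/n), F_n x equals n times numpy's inverse FFT:
assert np.allclose(fft(x), len(x) * np.fft.ifft(x))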
5
Rewrite Rules
6
FFT Variants
  • Cooley-Tukey
  • Recursive FFT
  • Iterative FFT
  • Vector FFT (Stockham)
  • Vector FFT (Korn-Lambiotte)
  • Parallel FFT (Pease)

7
Example TPL Programs
  • Recursive 8-point FFT
    (compose (tensor (F 2) (I 4)) (T 8 4)
             (tensor (I 2)
                     (compose (tensor (F 2) (I 2)) (T 4 2)
                              (tensor (I 2) (F 2)) (L 4 2)))
             (L 8 2))
  • Iterative 8-point FFT
    (compose (tensor (F 2) (I 4)) (T 8 4)
             (tensor (I 2) (F 2) (I 2)) (tensor (I 2) (T 4 2))
             (tensor (I 4) (F 2))
             (tensor (I 2) (L 4 2))
             (L 8 2))
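To make the two programs above concrete, here is a hedged numpy sketch of the TPL operators they use: (F n), (I n), (L n s), (T n s), tensor, and compose. The definitions of L (stride-s gather) and T (identity plus diagonal twiddles with omega = exp(2*pi*i/n), the radix-2 case) are assumptions consistent with the pseudocode on slide 4; with them, both formulas multiply out to the 8-point Fourier matrix.

import numpy as np
from functools import reduce

def F(n):                     # n-point Fourier matrix
    k = np.arange(n)
    return np.exp(2j * np.pi * np.outer(k, k) / n)

def I(n):                     # identity matrix
    return np.eye(n)

def L(n, s):                  # stride permutation L^n_s (stride-s gather)
    idx = np.concatenate([np.arange(q, n, s) for q in range(s)])
    P = np.zeros((n, n))
    P[np.arange(n), idx] = 1
    return P

def T(n, m):                  # twiddle matrix T^n_m = I_m (+) diag(w^0..w^(m-1))
    w = np.exp(2j * np.pi * np.arange(m) / n)
    return np.diag(np.concatenate([np.ones(m), w]))

def tensor(*A):               # Kronecker (tensor) product
    return reduce(np.kron, A)

def compose(*A):              # matrix product, left to right
    return reduce(np.matmul, A)

recursive = compose(tensor(F(2), I(4)), T(8, 4),
                    tensor(I(2),
                           compose(tensor(F(2), I(2)), T(4, 2),
                                   tensor(I(2), F(2)), L(4, 2))),
                    L(8, 2))

iterative = compose(tensor(F(2), I(4)), T(8, 4),
                    tensor(I(2), F(2), I(2)), tensor(I(2), T(4, 2)),
                    tensor(I(4), F(2)),
                    tensor(I(2), L(4, 2)),
                    L(8, 2))

assert np.allclose(recursive, F(8)) and np.allclose(iterative, F(8))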

8
FFT Dataflow
  • Different formulas for the FFT have different
    dataflow (memory access patterns).
  • The dataflow in a class of FFT algorithms can be
    described by a sequence of permutations.
  • An FFT dataflow is a sequence of permutations
    that can be modified with the insertion of
    butterfly computations (with appropriate twiddle
    factors) to form a factorization of the Fourier
    matrix.
  • FFT dataflows can be classified with respect to
    cost, and used to find good FFT implementations.

9
Distributed FFT Algorithm
  • Experiment with different dataflow and locality
    properties by changing radix and permutations

10
Cooley-Tukey Dataflow
11
Pease Dataflow
12
Tensor Permutations
  • A natural class of permutations compatible with
    the FFT. Let σ be a permutation of {1, ..., t}.
  • Mixed-radix counting permutation of vector
    indices
  • Well-known examples are stride permutations and
    bit-reversal.

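In the binary case (all radices equal to 2) the mixed-radix counting permutation reduces to rearranging the bits of a t-bit index. The sketch below is an illustration under an assumed digit-ordering convention (bit 1 = most significant); the stride permutation and bit reversal on the next two slides are the special cases where σ is a cyclic shift or the full reversal.

def tensor_perm(index, sigma, t):
    # Map a t-bit index by rearranging its bits: output bit i is input bit
    # sigma(i), counting bits from the most significant (i = 1) downward.
    bits = [(index >> (t - i)) & 1 for i in range(1, t + 1)]     # b_1 ... b_t
    out = [bits[sigma[i - 1] - 1] for i in range(1, t + 1)]
    return sum(b << (t - i) for i, b in enumerate(out, start=1))

# sigma = (3, 1, 2) sends b1 b2 b3 to b3 b1 b2 (a cyclic shift) and
# reproduces the stride-permutation table on the next slide:
print([format(tensor_perm(i, (3, 1, 2), 3), "03b") for i in range(8)])
# ['000', '100', '001', '101', '010', '110', '011', '111']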
13
Example (Stride Permutation)
  • 000 → 000
  • 001 → 100
  • 010 → 001
  • 011 → 101
  • 100 → 010
  • 101 → 110
  • 110 → 011
  • 111 → 111

14
Example (Bit Reversal)
  • 000 → 000
  • 001 → 100
  • 010 → 010
  • 011 → 110
  • 100 → 001
  • 101 → 101
  • 110 → 011
  • 111 → 111
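Bit reversal is the tensor permutation with σ = (t, t-1, ..., 1) in the convention of the sketch on slide 12. A minimal sketch that regenerates the table above for t = 3:

def bit_reverse(index, t):
    out = 0
    for _ in range(t):                  # peel bits off the bottom,
        out = (out << 1) | (index & 1)  # push them onto the top
        index >>= 1
    return out

print([f"{i:03b} -> {bit_reverse(i, 3):03b}" for i in range(8)])
# ['000 -> 000', '001 -> 100', '010 -> 010', '011 -> 110',
#  '100 -> 001', '101 -> 101', '110 -> 011', '111 -> 111']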

15
Twiddle Factor Matrix
  • Diagonal matrix containing roots of unity
  • Generalized Twiddle (compatible with tensor
    permutations)

16
Distributed Computation
  • Allocate equal-sized segments of the vector to
    each processor, and index the distributed vector
    with a pid and a local offset.
  • Interpret tensor product operations with this
    addressing scheme.
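For example, with a block distribution (an assumption consistent with the rest of the talk), P = 2^k processors and M = 2^l elements per processor, a global index splits into its top k bits (the pid) and its low l bits (the local offset):

def to_pid_offset(index, l):
    return index >> l, index & ((1 << l) - 1)   # (pid, offset) = divmod(index, 2^l)

def to_global(pid, offset, l):
    return (pid << l) | offset

print(to_pid_offset(2**12 + 5, l=10))           # element 4101 -> (pid 4, offset 5)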

17
Distributed Tensor Product and Twiddle Factors
  • Assume P processors
  • I_n ⊗ A becomes a parallel do over all processors
    when n ≥ P.
  • Twiddle factors are determined independently from
    the pid and offset. The necessary bits are
    determined from I, J, and (n1, ..., nt) in the
    generalized twiddle notation.
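A hedged single-process numpy illustration of that reading: for a block-distributed vector, (I_P ⊗ A) x is computed with no communication, each processor applying A to its own M-element segment.

import numpy as np

P, M = 4, 8
A = np.random.rand(M, M)
x = np.random.rand(P * M)

local = [A @ x[p * M:(p + 1) * M] for p in range(P)]    # parallel do over pid
assert np.allclose(np.concatenate(local), np.kron(np.eye(P), A) @ x)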

18
Distributed Tensor Permutations
19
Classes of Distributed Tensor Permutations
  • Local (pid is fixed by σ)
    • Only permute elements locally within each
      processor
  • Global (offset is fixed by σ)
    • Permute the entire local arrays amongst the
      processors
  • Global-Local (bits in pid and bits in offset are
    moved by σ, but no bits cross the pid/offset
    boundary)
    • Permute elements locally, followed by a global
      permutation
  • Mixed (at least one offset bit and one pid bit
    are exchanged)
    • Elements from a processor are sent/received
      to/from more than one processor
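A small sketch of this classification for binary tensor permutations, with bit positions numbered so that the first k positions are the pid bits and σ given as the map from destination bit position to source bit position (assumed conventions):

def classify(sigma, k, t):
    crosses = any((i < k) != (sigma[i] < k) for i in range(t))  # pid/offset boundary
    if crosses:
        return "mixed"
    if all(sigma[i] == i for i in range(k)):        # pid bits fixed
        return "local"
    if all(sigma[i] == i for i in range(k, t)):     # offset bits fixed
        return "global"
    return "global-local"

# t = 4 index bits, k = 2 pid bits (4 PEs, 4 elements each):
print(classify((0, 1, 3, 2), k=2, t=4))   # local: only offset bits move
print(classify((1, 0, 2, 3), k=2, t=4))   # global: only pid bits move
print(classify((2, 3, 0, 1), k=2, t=4))   # mixed: pid and offset bits are exchanged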

20
Distributed Stride Permutation
  • 0000 → 0000
  • 0001 → 1000
  • 0010 → 0001
  • 0011 → 1001
  • 0100 → 0010
  • 0101 → 1010
  • 0110 → 0011
  • 0111 → 1011
  • 1000 → 0100
  • 1001 → 1100
  • 1010 → 0101
  • 1011 → 1101
  • 1100 → 0110
  • 1101 → 1110
  • 1110 → 0111
  • 1111 → 1111

21
Communication Pattern
22
Communication Pattern
Each PE sends 1/2 of its data to 2 different PEs
23
Communication Pattern
Each PE sends 1/4 of its data to 4 different PEs
24
Communication Pattern
Each PE sends 1/8 of its data to 8 different PEs
25
Implementation of Distributed Stride Permutation
D_Stride(Y, N, t, P, k, M, l, S, j, X)
// Compute Y = L^N_S X
// Inputs:
//   Y, X : distributed vectors of size N = 2^t,
//          with M = 2^l elements per processor
//   P = 2^k : number of processors
//   S = 2^j, 0 < j < k, is the stride
// Output: Y = L^N_S X
p = pid
for i = 0, ..., 2^j - 1 do
  put x(i : S : i + S*(M/S - 1)) in
  y((M/S)*(p mod S) : (M/S)*(p mod S) + M/S - 1) on
  PE p/2^j + i*2^(k-j)
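A single-process Python sketch of this data movement (no SHMEM/MPI; the L^N_S convention is the stride-S gather assumed throughout these notes). Each PE owns a block of M elements and, for every residue class i, sends the stride-S subvector of its block to one destination PE, which stores it contiguously; this also reproduces the counts on the previous slides (each PE sends 1/2^j of its data to 2^j PEs).

import numpy as np

def l_perm(x, S):
    # Direct stride permutation L^N_S: gather every S-th element, class by class.
    return np.concatenate([x[q::S] for q in range(S)])

def d_stride(x_local, k, j):
    # x_local[p] is PE p's block of M elements; returns the blocks of Y.
    P, S = 2 ** k, 2 ** j
    M = len(x_local[0])
    y_local = [np.empty(M, dtype=x_local[0].dtype) for _ in range(P)]
    for p in range(P):                       # source PE
        for i in range(S):                   # residue class
            dest = p // S + i * (P // S)     # PE p/2^j + i*2^(k-j)
            lo = (M // S) * (p % S)          # contiguous slot on the destination
            y_local[dest][lo:lo + M // S] = x_local[p][i::S]   # strided "put"
    return y_local

t, k, j = 10, 3, 2                           # N = 2^10, P = 8 PEs, S = 4
N, P, M = 2 ** t, 2 ** k, 2 ** (t - k)
x = np.arange(N)
x_local = [x[p * M:(p + 1) * M] for p in range(P)]
assert np.array_equal(np.concatenate(d_stride(x_local, k, j)), l_perm(x, 2 ** j))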
26
Cyclic Scheduling
Each PE sends 1/4 of its data to 4 different PEs
27
Distributed Bit Reversal Permutation
  • Mixed tensor permutation
  • Implement using factorization

b7 b6 b5 | b4 b3 b2 b1 b0  →  b0 b1 b2 | b3 b4 b5 b6 b7
b7 b6 b5 | b4 b3 b2 b1 b0  →  b5 b6 b7 | b0 b1 b2 b3 b4
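One way to read the factorization sketched above (t = 8 index bits, k = 3 pid bits, l = 5 offset bits): first reverse the pid bits and the offset bits separately (a global permutation combined with a local one), then apply a mixed permutation that rotates the two fields so the reversed offset bits move to the top. The check below is an illustration of that reading, not the exact schedule used in the implementation.

def reverse_bits(x, width):
    out = 0
    for _ in range(width):
        out = (out << 1) | (x & 1)
        x >>= 1
    return out

k, l = 3, 5                              # pid bits, offset bits
for idx in range(1 << (k + l)):
    pid, off = idx >> l, idx & ((1 << l) - 1)
    # stage 1: b7 b6 b5 | b4..b0  ->  b5 b6 b7 | b0..b4  (field-wise reversal)
    stage1 = (reverse_bits(pid, k) << l) | reverse_bits(off, l)
    # stage 2: rotate the fields -- the low l bits move up and the reversed
    # pid field drops to the bottom k bits (a mixed tensor permutation)
    stage2 = ((stage1 & ((1 << l) - 1)) << k) | (stage1 >> l)
    assert stage2 == reverse_bits(idx, k + l)       # equals full bit reversal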
28
Experiments on the CRAY T3E
  • All experiments were performed on a 240-node
    (8x4x8 with partial plane) T3E using 128
    processors (300 MHz) with 128 MB memory
  • Task 1 (pairwise communication)
    • Implemented with shmem_get, shmem_put, and
      mpi_sendrecv
  • Task 2 (all 7! = 5040 global tensor permutations)
    • Implemented with shmem_get, shmem_put, and
      mpi_sendrecv
  • Task 3 (local tensor permutations of the form
    I ⊗ L ⊗ I on vectors of size 2^22 words - only
    run on a single node)
    • Implemented using streams on/off, cache bypass
  • Task 4 (distributed stride permutations)
    • Implemented using shmem_iput, shmem_iget, and
      mpi_sendrecv

29
Task 1 Performance Data
30
Task 2 Performance Data
31
Task 3 Performance Data
32
Task 4 Performance Data
33
Network Simulator
  • An idealized simulator for the T3E was developed
    (with C. Grassl from Cray Research) in order to
    study contention
  • Specify the processor layout, the route table, and
    the number of virtual processors with a given
    start node
  • Each processor can simultaneously issue a single
    send
  • Contention is measured as the maximum number of
    messages across any edge/node
  • The simulator was used to study global and mixed
    tensor permutations.

34
Task 2 Grid Simulation Analysis
35
Task 2 Grid Simulation Analysis
36
Task 2 Torus Simulation Analysis
37
Task 2 Torus Simulation Analysis