Title: Parallel Processing (CS 730) Lecture 9: Distributed Memory FFTs
1 Parallel Processing (CS 730) Lecture 9
Distributed Memory FFTs
- Jeremy R. Johnson
- Wed. Mar. 1, 2001
- Parts of this lecture were derived from material by Johnson, Johnson, and Pryor.
2 Introduction
- Objective: To derive and implement a distributed-memory parallel program for computing the fast Fourier transform (FFT).
- Topics
  - Derivation of the FFT
  - Iterative version
  - Pease Algorithm Generalizations
  - Tensor permutations
  - Distributed implementation of tensor permutations
    - stride permutation
    - bit reversal
  - Distributed FFT
3 FFT as a Matrix Factorization
- Compute y = F_n x, where F_n is the n-point Fourier matrix.
4 Matrix Factorizations and Algorithms

function y = fft(x)
   n = length(x)
   if n = 1
      y = x
   else
      m = n/2
      [x0; x1] = L^n_2 x               % x0 = x(1:2:n-1), x1 = x(2:2:n)
      [t0; t1] = (I_2 ⊗ F_m) [x0; x1]  % t0 = fft(x0), t1 = fft(x1)
      w = W_m(ω_n)                     % w = exp(2πi/n).^(0:n/2-1)
      [y0; y1] = (F_2 ⊗ I_m) T^n_m [t0; t1]
                                       % y0 = t0 + w.*t1, y1 = t0 - w.*t1
      y = [y0; y1]
   end
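The recursive factorization above can be turned into runnable code. A minimal Python sketch (the helper names and the ω_n = e^{-2πi/n} sign convention are my choices, not taken from the slides; flip the exponent sign for the opposite convention), checked against a naive DFT:

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT following the factorization
    F_n = (F_2 (x) I_m) T^n_m (I_2 (x) F_m) L^n_2, with n = 2m."""
    n = len(x)
    if n == 1:
        return list(x)
    m = n // 2
    # L^n_2: split into even- and odd-indexed elements, then recurse
    t0 = fft(x[0::2])
    t1 = fft(x[1::2])
    # T^n_m: twiddle factors w_k = omega_n^k, omega_n = exp(-2*pi*i/n)
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(m)]
    # F_2 (x) I_m: the butterflies
    y0 = [t0[k] + w[k] * t1[k] for k in range(m)]
    y1 = [t0[k] - w[k] * t1[k] for k in range(m)]
    return y0 + y1

def dft(x):
    """Naive O(n^2) DFT, used only to check the FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]
```
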
5 Rewrite Rules
6 FFT Variants
- Cooley-Tukey
- Recursive FFT
- Iterative FFT
- Vector FFT (Stockham)
- Vector FFT (Korn-Lambiotte)
- Parallel FFT (Pease)
7 Example TPL Programs
- Recursive 8-point FFT

(compose (tensor (F 2) (I 4)) (T 8 4)
         (tensor (I 2)
                 (compose (tensor (F 2) (I 2)) (T 4 2)
                          (tensor (I 2) (F 2)) (L 4 2)))
         (L 8 2))

- Iterative 8-point FFT

(compose (tensor (F 2) (I 4)) (T 8 4)
         (tensor (I 2) (F 2) (I 2)) (tensor (I 2) (T 4 2))
         (tensor (I 4) (F 2))
         (tensor (I 2) (L 4 2))
         (L 8 2))
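The iterative TPL program can be multiplied out and compared against the Fourier matrix directly. This pure-Python sketch assumes the standard interpretations of the TPL operators (F n the Fourier matrix with ω_n = e^{-2πi/n}, I n the identity, L n s the stride permutation, T n m the twiddle diagonal) and takes the innermost butterfly stage as (tensor (I 4) (F 2)), as the standard Cooley–Tukey derivation requires:

```python
import cmath
from functools import reduce

def F(n):
    """n-point Fourier matrix, omega_n = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def I(n):
    return [[1.0 + 0j if j == k else 0j for k in range(n)] for j in range(n)]

def L(n, s):
    """Stride permutation: L^n_s x = (x[0::s], x[1::s], ..., x[s-1::s])."""
    src = [j for r in range(s) for j in range(r, n, s)]
    M = [[0j] * n for _ in range(n)]
    for i, j in enumerate(src):
        M[i][j] = 1.0 + 0j
    return M

def T(n, m):
    """Twiddle matrix T^n_m in F_n = (F_r (x) I_m) T^n_m (I_r (x) F_m) L^n_r."""
    r, w = n // m, cmath.exp(-2j * cmath.pi / n)
    d = [w ** (i * j) for i in range(r) for j in range(m)]
    return [[d[j] if j == k else 0j for k in range(n)] for j in range(n)]

def matmul(A, B):
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def kron(A, B):
    return [[A[i][j] * B[k][l]
             for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

compose = lambda *Ms: reduce(matmul, Ms)   # matrix product, rightmost applied first
tensor = lambda *Ms: reduce(kron, Ms)      # Kronecker (tensor) product

# The iterative 8-point FFT program, multiplied out
F8_iter = compose(tensor(F(2), I(4)), T(8, 4),
                  tensor(I(2), F(2), I(2)), tensor(I(2), T(4, 2)),
                  tensor(I(4), F(2)),
                  tensor(I(2), L(4, 2)),
                  L(8, 2))
```
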
8 FFT Dataflow
- Different formulas for the FFT have different dataflow (memory access patterns).
- The dataflow in a class of FFT algorithms can be described by a sequence of permutations.
- An FFT dataflow is a sequence of permutations that can be modified with the insertion of butterfly computations (with appropriate twiddle factors) to form a factorization of the Fourier matrix.
- FFT dataflows can be classified with respect to cost, and used to find good FFT implementations.
9 Distributed FFT Algorithm
- Experiment with different dataflow and locality properties by changing the radix and the permutations.
10 Cooley-Tukey Dataflow
11 Pease Dataflow
12 Tensor Permutations
- A natural class of permutations compatible with the FFT. Let σ be a permutation of {1, ..., t}.
- Mixed-radix counting permutation of vector indices
- Well-known examples are stride permutations and bit-reversal.
13 Example (Stride Permutation)
- 000 → 000
- 001 → 100
- 010 → 001
- 011 → 101
- 100 → 010
- 101 → 110
- 110 → 011
- 111 → 111
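In bit terms, the stride permutation L^8_2 sends the element at index b2 b1 b0 to position b0 b2 b1, i.e. it rotates the index bits right by one. A small Python sketch of this mapping (function names mine):

```python
def stride2_dest(k, t):
    """Destination of index k under the stride permutation L^(2^t)_2:
    rotate the t index bits right by one (the low bit becomes the high bit)."""
    return (k >> 1) | ((k & 1) << (t - 1))

def stride2_perm(x):
    """Apply L^n_2 to a vector: even-indexed elements first, then odd-indexed."""
    y = [None] * len(x)
    t = len(x).bit_length() - 1
    for k, v in enumerate(x):
        y[stride2_dest(k, t)] = v
    return y
```
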
14 Example (Bit Reversal)
- 000 → 000
- 001 → 100
- 010 → 010
- 011 → 110
- 100 → 001
- 101 → 101
- 110 → 011
- 111 → 111
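The bit-reversal permutation sends index b2 b1 b0 to position b0 b1 b2; a minimal Python sketch (function name mine):

```python
def bit_reverse(k, t):
    """Reverse the t low-order bits of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)  # shift the low bit of k onto r
        k >>= 1
    return r
```

Bit reversal is its own inverse, which the table above also shows.
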
15 Twiddle Factor Matrix
- Diagonal matrix containing roots of unity
- Generalized twiddle (compatible with tensor permutations)
16 Distributed Computation
- Allocate equal-sized segments of the vector to each processor, and index the distributed vector with a pid and local offset.
- Interpret tensor product operations with this addressing scheme.
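With P = 2^k processors and M = 2^l elements per processor, a global index splits into high (pid) and low (offset) bits; a minimal sketch of this addressing scheme (function names mine):

```python
def split(i, l):
    """Global index -> (pid, offset): pid = high bits, offset = low l bits."""
    return i >> l, i & ((1 << l) - 1)

def join(pid, offset, l):
    """(pid, offset) -> global index."""
    return (pid << l) | offset
```

For example, with 32 elements per processor (l = 5), global index 77 lives on processor 2 at offset 13.
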
17 Distributed Tensor Product and Twiddle Factors
- Assume P processors.
- I_n ⊗ A becomes a parallel do over all processors when n ≥ P.
- Twiddle factors are determined independently from the pid and offset. The necessary bits are determined from I, J, and (n_1, ..., n_t) in the generalized twiddle notation.
18 Distributed Tensor Permutations
19 Classes of Distributed Tensor Permutations
- Local (pid is fixed by σ)
  - Only permute elements locally within each processor
- Global (offset is fixed by σ)
  - Permute the entire local arrays amongst the processors
- Global-Local (bits in the pid and bits in the offset are moved by σ, but no bits cross the pid/offset boundary)
  - Permute elements locally, followed by a global permutation
- Mixed (at least one offset bit and one pid bit are exchanged)
  - Elements from a processor are sent/received to/from more than one processor
20 Distributed Stride Permutation
- 0000 → 0000
- 0001 → 1000
- 0010 → 0001
- 0011 → 1001
- 0100 → 0010
- 0101 → 1010
- 0110 → 0011
- 0111 → 1011
- 1000 → 0100
- 1001 → 1100
- 1010 → 0101
- 1011 → 1101
- 1100 → 0110
- 1101 → 1110
- 1110 → 0111
- 1111 → 1111
21 Communication Pattern
22 Communication Pattern
Each PE sends 1/2 data to 2 different PEs
23 Communication Pattern
Each PE sends 1/4 data to 4 different PEs
24 Communication Pattern
Each PE sends 1/8 data to 8 different PEs
25 Implementation of Distributed Stride Permutation

D_Stride(Y, N, t, P, k, M, l, S, j, X)
// Compute Y = L^N_S X
// Inputs
//   Y, X distributed vectors of size N = 2^t,
//     with M = 2^l elements per processor
//   P = 2^k number of processors
//   S = 2^j, 0 < j < k, is the stride
// Output
//   Y = L^N_S X

p = pid
for i = 0, ..., 2^j - 1 do
   put x(i : S : i + S(M/S - 1)) in
       y((M/S)(p mod S) : (M/S)(p mod S) + M/S - 1)
       on PE p/2^j + i·2^(k-j)
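The effect of these puts can be modeled at the index level. This Python sketch (my own, element-wise rather than block puts) simulates P virtual processors and checks that the flattened result equals L^N_S applied to the flattened input:

```python
def d_stride_sim(X, N, P, S):
    """Simulate Y = L^N_S X on P virtual processors.
    X is a list of P local arrays, each holding M = N // P elements.
    For S = 2^j and N = 2^t, the destination of global index k is
    its t-bit representation rotated right by j."""
    M = N // P
    t = N.bit_length() - 1
    j = S.bit_length() - 1
    Y = [[None] * M for _ in range(P)]
    for p in range(P):
        for off in range(M):
            k = p * M + off                              # global source index
            d = (k >> j) | ((k & (S - 1)) << (t - j))    # rotate bits right by j
            Y[d // M][d % M] = X[p][off]                 # "put" on PE d // M
    return Y
```
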
26 Cyclic Scheduling
Each PE sends 1/4 data to 4 different PEs
27 Distributed Bit Reversal Permutation
- Mixed tensor permutation
- Implement using the factorization

b7 b6 b5 | b4 b3 b2 b1 b0 → b0 b1 b2 | b3 b4 b5 b6 b7
b7 b6 b5 | b4 b3 b2 b1 b0 → b5 b6 b7 | b0 b1 b2 b3 b4
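For 8 index bits split into 3 pid bits and 5 offset bits, this factorization can be checked directly: first reverse the pid bits and the offset bits separately (a Global-Local step), then rotate the whole 8-bit word, which is a stride permutation. A Python sketch (helper names mine):

```python
def bit_reverse(k, t):
    """Reverse the t low-order bits of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def local_global_reverse(k):
    """b7 b6 b5 | b4 b3 b2 b1 b0 -> b5 b6 b7 | b0 b1 b2 b3 b4:
    reverse the pid bits (high 3) and offset bits (low 5) separately."""
    p, o = k >> 5, k & 31
    return (bit_reverse(p, 3) << 5) | bit_reverse(o, 5)

def rotate_left3(k):
    """Rotate the 8-bit word left by 3 -- a stride permutation
    that moves 3 offset bits across the pid/offset boundary."""
    return ((k << 3) | (k >> 5)) & 0xFF
```

Composing the two steps gives the full 8-bit reversal b7 ... b0 → b0 ... b7.
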
28 Experiments on the CRAY T3E
- All experiments were performed on a 240-node (8x4x8 with partial plane) T3E, using 128 processors (300 MHz) with 128MB memory each.
- Task 1 (pairwise communication)
  - Implemented with shmem_get, shmem_put, and mpi_sendrecv
- Task 2 (all 7! = 5040 global tensor permutations)
  - Implemented with shmem_get, shmem_put, and mpi_sendrecv
- Task 3 (local tensor permutations of the form I ⊗ L ⊗ I on vectors of size 2^22 words - only run on a single node)
  - Implemented using streams on/off and cache bypass
- Task 4 (distributed stride permutations)
  - Implemented using shmem_iput, shmem_iget, and mpi_sendrecv
29 Task 1 Performance Data
30 Task 2 Performance Data
31 Task 3 Performance Data
32 Task 4 Performance Data
33 Network Simulator
- An idealized simulator for the T3E was developed (with C. Grassl from Cray Research) in order to study contention.
- Specify the processor layout, route table, and number of virtual processors with a given start node.
- Each processor can simultaneously issue a single send.
- Contention is measured as the maximum number of messages across any edge/node.
- The simulator was used to study global and mixed tensor permutations.
34 Task 2 Grid Simulation Analysis
35 Task 2 Grid Simulation Analysis
36 Task 2 Torus Simulation Analysis
37 Task 2 Torus Simulation Analysis