Title: Parallel Processing (CS 730) Lecture 9: Distributed Memory FFTs
1 Parallel Processing (CS 730) Lecture 9
Distributed Memory FFTs
- Jeremy R. Johnson
- Wed. Mar. 1, 2001
- Parts of this lecture were derived from material by Johnson, Johnson, and Pryor.
2 Introduction
- Objective: To derive and implement a distributed-memory parallel program for computing the fast Fourier transform (FFT).
- Topics
  - Derivation of the FFT
  - Iterative version
  - Pease Algorithm Generalizations
  - Tensor permutations
  - Distributed implementation of tensor permutations
    - stride permutation
    - bit reversal
  - Distributed FFT
3 FFT as a Matrix Factorization
- Compute y = F_n x, where F_n is the n-point Fourier matrix.
4 Matrix Factorizations and Algorithms

function y = fft(x)
   n = length(x)
   if n = 1
      y = x
   else
      m = n/2
      [x0; x1] = L^n_2 x               % x0 = x(1:2:n-1), x1 = x(2:2:n)
      [t0; t1] = (I_2 ⊗ F_m) [x0; x1]  % t0 = fft(x0), t1 = fft(x1)
      w = W_m(ω_n)                     % w = exp(2πi/n).^(0:n/2-1)
      [y0; y1] = (F_2 ⊗ I_m) T^n_m [t0; t1]
                                       % y0 = t0 + w.*t1, y1 = t0 - w.*t1
      y = [y0; y1]
   end
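The recursive factorization above can be turned into runnable code. A minimal Python sketch (the helper names and the ω_n = e^{-2πi/n} sign convention are my choices, not taken from the slides; flip the exponent sign for the opposite convention), checked against a naive DFT:

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT following the factorization
    F_n = (F_2 (x) I_m) T^n_m (I_2 (x) F_m) L^n_2, with n = 2m."""
    n = len(x)
    if n == 1:
        return list(x)
    m = n // 2
    # L^n_2: split into even- and odd-indexed elements, then recurse
    t0 = fft(x[0::2])
    t1 = fft(x[1::2])
    # T^n_m: twiddle factors w_k = omega_n^k, omega_n = exp(-2*pi*i/n)
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(m)]
    # F_2 (x) I_m: the butterflies
    y0 = [t0[k] + w[k] * t1[k] for k in range(m)]
    y1 = [t0[k] - w[k] * t1[k] for k in range(m)]
    return y0 + y1

def dft(x):
    """Naive O(n^2) DFT, used only to check the FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]
```
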
5 Rewrite Rules
6 FFT Variants
- Cooley-Tukey
- Recursive FFT
- Iterative FFT
- Vector FFT (Stockham)
- Vector FFT (Korn-Lambiotte)
- Parallel FFT (Pease)
7 Example TPL Programs
- Recursive 8-point FFT

(compose (tensor (F 2) (I 4)) (T 8 4)
         (tensor (I 2)
                 (compose (tensor (F 2) (I 2)) (T 4 2)
                          (tensor (I 2) (F 2)) (L 4 2)))
         (L 8 2))

- Iterative 8-point FFT

(compose (tensor (F 2) (I 4)) (T 8 4)
         (tensor (I 2) (F 2) (I 2)) (tensor (I 2) (T 4 2))
         (tensor (I 4) (F 2))
         (tensor (I 2) (L 4 2))
         (L 8 2))
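The iterative TPL program can be multiplied out and compared against the Fourier matrix directly. This pure-Python sketch assumes the standard interpretations of the TPL operators (F n the Fourier matrix with ω_n = e^{-2πi/n}, I n the identity, L n s the stride permutation, T n m the twiddle diagonal) and takes the innermost butterfly stage as (tensor (I 4) (F 2)), as the standard Cooley–Tukey derivation requires:

```python
import cmath
from functools import reduce

def F(n):
    """n-point Fourier matrix, omega_n = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def I(n):
    return [[1.0 + 0j if j == k else 0j for k in range(n)] for j in range(n)]

def L(n, s):
    """Stride permutation: L^n_s x = (x[0::s], x[1::s], ..., x[s-1::s])."""
    src = [j for r in range(s) for j in range(r, n, s)]
    M = [[0j] * n for _ in range(n)]
    for i, j in enumerate(src):
        M[i][j] = 1.0 + 0j
    return M

def T(n, m):
    """Twiddle matrix T^n_m in F_n = (F_r (x) I_m) T^n_m (I_r (x) F_m) L^n_r."""
    r, w = n // m, cmath.exp(-2j * cmath.pi / n)
    d = [w ** (i * j) for i in range(r) for j in range(m)]
    return [[d[j] if j == k else 0j for k in range(n)] for j in range(n)]

def matmul(A, B):
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def kron(A, B):
    return [[A[i][j] * B[k][l]
             for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

compose = lambda *Ms: reduce(matmul, Ms)   # matrix product, rightmost applied first
tensor = lambda *Ms: reduce(kron, Ms)      # Kronecker (tensor) product

# The iterative 8-point FFT program, multiplied out
F8_iter = compose(tensor(F(2), I(4)), T(8, 4),
                  tensor(I(2), F(2), I(2)), tensor(I(2), T(4, 2)),
                  tensor(I(4), F(2)),
                  tensor(I(2), L(4, 2)),
                  L(8, 2))
```
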
8 FFT Dataflow
- Different formulas for the FFT have different dataflow (memory access patterns).
- The dataflow in a class of FFT algorithms can be described by a sequence of permutations.
- An FFT dataflow is a sequence of permutations that can be modified with the insertion of butterfly computations (with appropriate twiddle factors) to form a factorization of the Fourier matrix.
- FFT dataflows can be classified with respect to cost, and used to find good FFT implementations.
9 Distributed FFT Algorithm
- Experiment with different dataflow and locality properties by changing the radix and the permutations.
10 Cooley-Tukey Dataflow
11 Pease Dataflow
12 Tensor Permutations
- A natural class of permutations compatible with the FFT. Let σ be a permutation of {1, ..., t}.
- Mixed-radix counting permutation of vector indices
- Well-known examples are stride permutations and bit-reversal.
13 Example (Stride Permutation)
- 000 → 000
- 001 → 100
- 010 → 001
- 011 → 101
- 100 → 010
- 101 → 110
- 110 → 011
- 111 → 111
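In bit terms, the stride permutation L^8_2 sends the element at index b2 b1 b0 to position b0 b2 b1, i.e. it rotates the index bits right by one. A small Python sketch of this mapping (function names mine):

```python
def stride2_dest(k, t):
    """Destination of index k under the stride permutation L^(2^t)_2:
    rotate the t index bits right by one (the low bit becomes the high bit)."""
    return (k >> 1) | ((k & 1) << (t - 1))

def stride2_perm(x):
    """Apply L^n_2 to a vector: even-indexed elements first, then odd-indexed."""
    y = [None] * len(x)
    t = len(x).bit_length() - 1
    for k, v in enumerate(x):
        y[stride2_dest(k, t)] = v
    return y
```
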
14 Example (Bit Reversal)
- 000 → 000
- 001 → 100
- 010 → 010
- 011 → 110
- 100 → 001
- 101 → 101
- 110 → 011
- 111 → 111
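The bit-reversal permutation sends index b2 b1 b0 to position b0 b1 b2; a minimal Python sketch (function name mine):

```python
def bit_reverse(k, t):
    """Reverse the t low-order bits of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)  # shift the low bit of k onto r
        k >>= 1
    return r
```

Bit reversal is its own inverse, which the table above also shows.
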
15 Twiddle Factor Matrix
- Diagonal matrix containing roots of unity
- Generalized twiddle (compatible with tensor permutations)
16 Distributed Computation
- Allocate equal-sized segments of the vector to each processor, and index the distributed vector with a pid and local offset.
- Interpret tensor product operations with this addressing scheme.
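With P = 2^k processors and M = 2^l elements per processor, a global index splits into high (pid) and low (offset) bits; a minimal sketch of this addressing scheme (function names mine):

```python
def split(i, l):
    """Global index -> (pid, offset): pid = high bits, offset = low l bits."""
    return i >> l, i & ((1 << l) - 1)

def join(pid, offset, l):
    """(pid, offset) -> global index."""
    return (pid << l) | offset
```

For example, with 32 elements per processor (l = 5), global index 77 lives on processor 2 at offset 13.
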
17 Distributed Tensor Product and Twiddle Factors
- Assume P processors.
- I_n ⊗ A becomes a parallel do over all processors when n ≥ P.
- Twiddle factors are determined independently from the pid and offset. The necessary bits are determined from I, J, and (n_1, ..., n_t) in the generalized twiddle notation.
18 Distributed Tensor Permutations
19 Classes of Distributed Tensor Permutations
- Local (pid is fixed by σ)
  - Only permute elements locally within each processor
- Global (offset is fixed by σ)
  - Permute the entire local arrays amongst the processors
- Global-Local (bits in the pid and bits in the offset are moved by σ, but no bits cross the pid/offset boundary)
  - Permute elements locally, followed by a global permutation
- Mixed (at least one offset bit and one pid bit are exchanged)
  - Elements from a processor are sent/received to/from more than one processor
20 Distributed Stride Permutation
- 0000 → 0000
- 0001 → 1000
- 0010 → 0001
- 0011 → 1001
- 0100 → 0010
- 0101 → 1010
- 0110 → 0011
- 0111 → 1011
- 1000 → 0100
- 1001 → 1100
- 1010 → 0101
- 1011 → 1101
- 1100 → 0110
- 1101 → 1110
- 1110 → 0111
- 1111 → 1111
21 Communication Pattern
22 Communication Pattern
Each PE sends 1/2 data to 2 different PEs
23 Communication Pattern
Each PE sends 1/4 data to 4 different PEs
24 Communication Pattern
Each PE sends 1/8 data to 8 different PEs
25 Implementation of Distributed Stride Permutation

D_Stride(Y, N, t, P, k, M, l, S, j, X)
// Compute Y = L^N_S X
// Inputs
//   Y, X distributed vectors of size N = 2^t,
//     with M = 2^l elements per processor
//   P = 2^k number of processors
//   S = 2^j, 0 < j < k, is the stride
// Output
//   Y = L^N_S X

p = pid
for i = 0, ..., 2^j - 1 do
   put x(i : S : i + S(M/S - 1)) in
       y((M/S)(p mod S) : (M/S)(p mod S) + M/S - 1)
       on PE p/2^j + i·2^(k-j)
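The effect of these puts can be modeled at the index level. This Python sketch (my own, element-wise rather than block puts) simulates P virtual processors and checks that the flattened result equals L^N_S applied to the flattened input:

```python
def d_stride_sim(X, N, P, S):
    """Simulate Y = L^N_S X on P virtual processors.
    X is a list of P local arrays, each holding M = N // P elements.
    For S = 2^j and N = 2^t, the destination of global index k is
    its t-bit representation rotated right by j."""
    M = N // P
    t = N.bit_length() - 1
    j = S.bit_length() - 1
    Y = [[None] * M for _ in range(P)]
    for p in range(P):
        for off in range(M):
            k = p * M + off                              # global source index
            d = (k >> j) | ((k & (S - 1)) << (t - j))    # rotate bits right by j
            Y[d // M][d % M] = X[p][off]                 # "put" on PE d // M
    return Y
```
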
26 Cyclic Scheduling
Each PE sends 1/4 data to 4 different PEs
27 Distributed Bit Reversal Permutation
- Mixed tensor permutation
- Implement using the factorization

b7 b6 b5 | b4 b3 b2 b1 b0 → b0 b1 b2 | b3 b4 b5 b6 b7
b7 b6 b5 | b4 b3 b2 b1 b0 → b5 b6 b7 | b0 b1 b2 b3 b4
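For 8 index bits split into 3 pid bits and 5 offset bits, this factorization can be checked directly: first reverse the pid bits and the offset bits separately (a Global-Local step), then rotate the whole 8-bit word, which is a stride permutation. A Python sketch (helper names mine):

```python
def bit_reverse(k, t):
    """Reverse the t low-order bits of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def local_global_reverse(k):
    """b7 b6 b5 | b4 b3 b2 b1 b0 -> b5 b6 b7 | b0 b1 b2 b3 b4:
    reverse the pid bits (high 3) and offset bits (low 5) separately."""
    p, o = k >> 5, k & 31
    return (bit_reverse(p, 3) << 5) | bit_reverse(o, 5)

def rotate_left3(k):
    """Rotate the 8-bit word left by 3 -- a stride permutation
    that moves 3 offset bits across the pid/offset boundary."""
    return ((k << 3) | (k >> 5)) & 0xFF
```

Composing the two steps gives the full 8-bit reversal b7 ... b0 → b0 ... b7.
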
28 Experiments on the CRAY T3E
- All experiments were performed on a 240-node (8x4x8 with partial plane) T3E, using 128 processors (300 MHz) with 128MB memory each.
- Task 1 (pairwise communication)
  - Implemented with shmem_get, shmem_put, and mpi_sendrecv
- Task 2 (all 7! = 5040 global tensor permutations)
  - Implemented with shmem_get, shmem_put, and mpi_sendrecv
- Task 3 (local tensor permutations of the form I ⊗ L ⊗ I on vectors of size 2^22 words - only run on a single node)
  - Implemented using streams on/off and cache bypass
- Task 4 (distributed stride permutations)
  - Implemented using shmem_iput, shmem_iget, and mpi_sendrecv
29 Task 1 Performance Data
30 Task 2 Performance Data
31 Task 3 Performance Data
32 Task 4 Performance Data
33 Network Simulator
- An idealized simulator for the T3E was developed (with C. Grassl from Cray Research) in order to study contention.
- Specify the processor layout, route table, and number of virtual processors with a given start node.
- Each processor can simultaneously issue a single send.
- Contention is measured as the maximum number of messages across any edge/node.
- The simulator was used to study global and mixed tensor permutations.
34 Task 2 Grid Simulation Analysis
35 Task 2 Grid Simulation Analysis
36 Task 2 Torus Simulation Analysis
37 Task 2 Torus Simulation Analysis