Parallel Spectral Methods: Solving Elliptic Problems with FFTs - PowerPoint PPT Presentation

About This Presentation
Title:

Parallel Spectral Methods: Solving Elliptic Problems with FFTs

Description:

Solving Elliptic Problems with FFTs. Kathy Yelick. www.cs.berkeley.edu/~yelick/cs267_s07 ... 1D FFT is due to Edelman (see http://www-math.mit.edu/~edelman) ... – PowerPoint PPT presentation

Transcript and Presenter's Notes

Title: Parallel Spectral Methods: Solving Elliptic Problems with FFTs


1
Parallel Spectral Methods: Solving Elliptic Problems with FFTs
  • Kathy Yelick
  • www.cs.berkeley.edu/~yelick/cs267_s07

2
References
  • Previous CS267 lectures
  • Lecture by Geoffrey Fox: http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/cps615fft00.ppt
  • FFTW project: http://www.fftw.org
  • Spiral project: http://www.spiral.net

3
Poisson's equation arises in many models
3D: ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = f(x,y,z)
f represents the sources; boundary conditions are also needed
2D: ∂²u/∂x² + ∂²u/∂y² = f(x,y)
1D: d²u/dx² = f(x)
  • Electrostatic or gravitational potential: Potential(position)
  • Heat flow: Temperature(position, time)
  • Diffusion: Concentration(position, time)
  • Fluid flow: Velocity, Pressure, Density(position, time)
  • Elasticity: Stress, Strain(position, time)
  • Variations of Poisson have variable coefficients

4
Algorithms for 2D (3D) Poisson Equation (N = n² (n³) vars)

  Algorithm         Serial              PRAM                      Memory              #Procs
  Dense LU          N³                  N                         N²                  N²
  Band LU           N² (N^(7/3))        N                         N^(3/2) (N^(5/3))   N (N^(4/3))
  Jacobi            N² (N^(5/3))        N (N^(2/3))               N                   N
  Explicit Inv.     N²                  log N                     N²                  N²
  Conj. Gradients   N^(3/2) (N^(4/3))   N^(1/2) (N^(1/3)) log N   N                   N
  Red/Black SOR     N^(3/2) (N^(4/3))   N^(1/2) (N^(1/3))         N                   N
  Sparse LU         N^(3/2) (N²)        N^(1/2)                   N log N (N^(4/3))   N
  FFT               N log N             log N                     N                   N
  Multigrid         N                   log² N                    N                   N
  Lower bound       N                   log N                     N

  • PRAM is an idealized parallel model with zero-cost communication
  • Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997.

5
Solving Poisson's Equation with the FFT
  • Express any 2D function defined on 0 ≤ x,y ≤ 1 as a series: φ(x,y) = Σj Σk φjk sin(πjx) sin(πky)
  • Here the φjk are called the Fourier coefficients of φ(x,y)
  • The inverse of this is: φjk = 4 ∫∫ φ(x,y) sin(πjx) sin(πky) dx dy
  • Poisson's equation ∂²φ/∂x² + ∂²φ/∂y² = f(x,y) becomes
    Σj Σk (-π²j² - π²k²) φjk sin(πjx) sin(πky) = Σj Σk fjk sin(πjx) sin(πky)
  • where the fjk are the Fourier coefficients of f(x,y), i.e. f(x,y) = Σj Σk fjk sin(πjx) sin(πky)
  • Matching coefficients, the PDE can be solved exactly algebraically: φjk = fjk / (-π²j² - π²k²)

6
Solving Poisson's Equation with the FFT
  • So the solution of Poisson's equation involves the following steps:
  • 1) Find the Fourier coefficients fjk of f(x,y) by performing the integral
  • 2) Form the Fourier coefficients of φ by φjk = fjk / (-π²j² - π²k²)
  • 3) Construct the solution by performing the sum φ(x,y) = Σj Σk φjk sin(πjx) sin(πky)
  • There is another version of this (the Discrete Fourier Transform) which deals with functions defined at grid points rather than the continuous integral
  • Also, the (mathematically) simplest transform uses exp(-2πijx) rather than sin(πjx)
  • Let us first consider the 1D discrete version of this case
  • The PDE case normally deals with discretized functions, since these are needed for other parts of the problem
  • (A concrete sketch of the three steps follows below)
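As a concrete illustration of the three steps, here is a minimal serial sketch using FFTW's real-to-real interface. The function name, the use of DST-I (FFTW_RODFT00) on an n-by-n interior grid with zero boundary values, and the use of the continuous eigenvalues -π²(j²+k²) are my illustrative assumptions, not part of the original slides.

    /* Sketch: solve  d2u/dx2 + d2u/dy2 = f  on the unit square with
       u = 0 on the boundary, via 2D discrete sine transforms (FFTW 3). */
    #include <fftw3.h>
    #include <math.h>

    void poisson_fft(int n, double *f, double *u)  /* n*n interior grid */
    {
        double *fhat = fftw_malloc(sizeof(double) * n * n);

        /* Step 1: forward sine transform gives the coefficients f_jk. */
        fftw_plan fwd = fftw_plan_r2r_2d(n, n, f, fhat,
                                         FFTW_RODFT00, FFTW_RODFT00,
                                         FFTW_ESTIMATE);
        fftw_execute(fwd);

        /* Step 2: phi_jk = f_jk / (-pi^2 j^2 - pi^2 k^2). */
        for (int j = 1; j <= n; j++)
            for (int k = 1; k <= n; k++)
                fhat[(j-1)*n + (k-1)] /= -M_PI * M_PI * (j*j + k*k);

        /* Step 3: the sum is another DST-I; two DST-I passes scale the
           data by 2(n+1) per dimension, so divide that back out. */
        fftw_plan bwd = fftw_plan_r2r_2d(n, n, fhat, u,
                                         FFTW_RODFT00, FFTW_RODFT00,
                                         FFTW_ESTIMATE);
        fftw_execute(bwd);
        for (int i = 0; i < n * n; i++)
            u[i] /= 4.0 * (n + 1) * (n + 1);

        fftw_destroy_plan(fwd);
        fftw_destroy_plan(bwd);
        fftw_free(fhat);
    }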

7
Serial FFT
  • Let i = sqrt(-1) and index matrices and vectors from 0.
  • The Discrete Fourier Transform of an m-element vector v is F·v,
  • where F is the m×m matrix defined as F[j,k] = ω^(jk),
  • and where ω = e^(2πi/m) = cos(2π/m) + i·sin(2π/m)
  • ω is a complex number whose mth power ω^m = 1, and it is therefore called an mth root of unity
  • E.g., for m = 4: ω = i, ω² = -1, ω³ = -i, ω⁴ = 1
  • (A direct O(m²) implementation of this definition is sketched below)
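Before turning to the fast algorithm, a direct O(m²) evaluation of F·v from this definition may make it concrete (a sketch; the function name is mine):

    /* Direct DFT from the definition F[j,k] = omega^(jk),
       omega = e^(2 pi i / m); C99 complex arithmetic. */
    #include <complex.h>
    #include <math.h>

    void dft(const double complex *v, double complex *Fv, int m)
    {
        for (int j = 0; j < m; j++) {
            Fv[j] = 0;
            for (int k = 0; k < m; k++)   /* Fv[j] += omega^(jk) * v[k] */
                Fv[j] += cexp(2.0 * M_PI * I * ((double)j * k) / m) * v[k];
        }
    }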

8
Using the 1D FFT for filtering
  • Signal = sin(7t) + 0.5 sin(5t) at 128 points
  • Noise = random number bounded by 0.75
  • Filter by zeroing out FFT components < 0.25 (see the sketch below)
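A sketch of this experiment, reusing dft() from the previous slide (the inverse uses F⁻¹·v = conj(F·conj(v))/m; the noise model and the normalization applied before thresholding are my assumptions):

    #include <complex.h>
    #include <math.h>
    #include <stdlib.h>

    enum { M = 128 };

    /* dft() is the direct-DFT sketch from the previous slide. */
    static void idft(const double complex *v, double complex *out, int m)
    {
        double complex tmp[m];
        for (int i = 0; i < m; i++) tmp[i] = conj(v[i]);
        dft(tmp, out, m);
        for (int i = 0; i < m; i++) out[i] = conj(out[i]) / m;
    }

    int main(void)
    {
        double complex s[M], S[M];
        for (int i = 0; i < M; i++) {             /* signal + noise */
            double t = 2.0 * M_PI * i / M;
            double noise = 0.75 * (2.0 * rand() / RAND_MAX - 1.0);
            s[i] = sin(7 * t) + 0.5 * sin(5 * t) + noise;
        }
        dft(s, S, M);
        for (int i = 0; i < M; i++)               /* zero small components */
            if (cabs(S[i]) / M < 0.25) S[i] = 0;
        idft(S, s, M);                            /* s = filtered signal */
        return 0;
    }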

9
Using the 2D FFT for image compression
  • Image = 200×320 matrix of values
  • Compress by keeping the largest 2.5% of FFT components
  • A similar idea is used by JPEG

10
Related Transforms
  • Most applications require multiplication by both
    F and inverse(F).
  • Multiplying by F and inverse(F) are essentially
    the same. (inverse(F) is the complex conjugate
    of F divided by n.)
  • For solving the Poisson equation and various
    other applications, we use variations on the FFT
  • The sin transform -- imaginary part of F
  • The cos transform -- real part of F
  • Algorithms are similar, so we will focus on the
    forward FFT.

11
Serial Algorithm for the FFT
  • Compute the FFT of an m-element vector v, F·v:
  • (F·v)[j] = Σ_{k=0..m-1} F[j,k] · v[k]
  •          = Σ_{k=0..m-1} ω^(jk) · v[k]
  •          = Σ_{k=0..m-1} (ω^j)^k · v[k]
  •          = V(ω^j)
  • where V is defined as the polynomial V(x) = Σ_{k=0..m-1} x^k · v[k]
12
Divide and Conquer FFT
  • V can be evaluated using divide-and-conquer:
  • V(x) = Σ_{k=0..m-1} x^k · v[k]
  •      = (v[0] + x²·v[2] + x⁴·v[4] + …) + x·(v[1] + x²·v[3] + x⁴·v[5] + …)
  •      = V_even(x²) + x·V_odd(x²)
  • V has degree m-1, so V_even and V_odd are polynomials of degree m/2 - 1
  • We evaluate these at the points (ω^j)² for 0 ≤ j ≤ m-1
  • But these are really just m/2 different points, since
  • (ω^(j+m/2))² = (ω^j · ω^(m/2))² = ω^(2j) · ω^m = (ω^j)²
  • So an FFT on m points is reduced to 2 FFTs on m/2 points
  • Divide and conquer!
13
Divide-and-Conquer FFT
  • FFT(v, ω, m)
  •   if m = 1 return v[0]
  •   else
  •     v_even = FFT(v[0:2:m-2], ω², m/2)
  •     v_odd  = FFT(v[1:2:m-1], ω², m/2)
  •     ω-vec  = [ω⁰, ω¹, …, ω^(m/2-1)]   (precomputed)
  •     return [v_even + (ω-vec .* v_odd),
  •             v_even - (ω-vec .* v_odd)]
  • The ".*" above is component-wise multiplication.
  • The "[ , ]" constructs an m-element vector from two m/2-element vectors.
  • This results in an O(m log m) algorithm (a runnable C version is sketched below).
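A runnable C99 rendering of this pseudocode, as a sketch: m must be a power of two, and the twiddle factors are recomputed rather than precomputed.

    /* Recursive radix-2 FFT following the divide-and-conquer above,
       with omega = e^(2 pi i / m) as on slide 7. */
    #include <complex.h>
    #include <math.h>

    void fft(double complex *v, int m)
    {
        if (m == 1) return;
        double complex even[m/2], odd[m/2];
        for (int k = 0; k < m/2; k++) {       /* split by index parity */
            even[k] = v[2*k];
            odd[k]  = v[2*k + 1];
        }
        fft(even, m/2);                        /* two FFTs of size m/2 */
        fft(odd,  m/2);
        for (int j = 0; j < m/2; j++) {        /* combine with twiddles */
            double complex w = cexp(2.0 * M_PI * I * j / m);  /* omega^j */
            v[j]       = even[j] + w * odd[j];
            v[j + m/2] = even[j] - w * odd[j];
        }
    }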

14
An Iterative Algorithm
  • The call tree of the divide-and-conquer FFT algorithm is a complete binary tree of log m levels
  • An iterative algorithm that uses loops rather than recursion visits each level in the tree, starting at the bottom
  • The algorithm overwrites v[i] with (F·v)[bitreverse(i)] (bitreverse is sketched after the call tree below)
  • Practical algorithms combine recursion (for the memory hierarchy) and iteration (to avoid function call overhead)

[Call tree for m = 16: FFT(0,1,2,…,15) = FFT(xxxx) splits into the even indices FFT(0,2,…,14) = FFT(xxx0) and the odd indices FFT(1,3,…,15) = FFT(xxx1); these split into FFT(xx00), FFT(xx10), FFT(xx01), FFT(xx11), then FFT(x000) … FFT(x111), and finally the single-element leaves appear in bit-reversed order: FFT(0), FFT(8), FFT(4), FFT(12), FFT(2), FFT(10), FFT(6), FFT(14), FFT(1), FFT(9), FFT(5), FFT(13), FFT(3), FFT(11), FFT(7), FFT(15).]
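The bit-reversed leaf order above can be computed directly; a small sketch:

    /* Reverse the low log2(m) bits of index i; e.g. for m = 16,
       bitreverse(1, 4) = 8, matching FFT(8) in leaf position 1. */
    unsigned bitreverse(unsigned i, int logm)
    {
        unsigned r = 0;
        for (int b = 0; b < logm; b++) {
            r = (r << 1) | (i & 1);   /* emit bits in reverse order */
            i >>= 1;
        }
        return r;
    }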
15
Parallel 1D FFT
  • Data dependencies in 1D FFT
  • Butterfly pattern
  • A PRAM algorithm takes O(log m) time
  • each step to right is parallel
  • there are log m steps
  • What about communication cost?
  • See LogP paper for details

16
Block Layout of 1D FFT
  • Using a block layout (m/p contiguous elts per
    processor)
  • No communication in last log m/p steps
  • Each step requires fine-grained communication in
    first log p steps

17
Cyclic Layout of 1D FFT
  • Cyclic layout (only 1 element per processor,
    wrapped)
  • No communication in first log(m/p) steps
  • Communication in the last log(p) steps (owner computations for both layouts are sketched below)
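In code, the difference between the two layouts is just the owner computation; a trivial sketch, assuming p divides m:

    /* Which processor owns element i of an m-vector on p processors? */
    int owner_block(long i, long m, int p) { return (int)(i / (m / p)); }
    int owner_cyclic(long i, int p)        { return (int)(i % p); }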

18
Parallel Complexity
  • m = vector size, p = number of processors
  • f = time per flop = 1
  • α = startup for a message (in units of f)
  • β = time per word in a message (in units of f)
  • Time(blockFFT) = Time(cyclicFFT)
  •   = 2·m·log(m)/p
  •   + log(p) · α
  •   + m·log(p)/p · β

19
FFT With Transpose
  • If we start with a cyclic layout, there is no communication in the first log(p) steps
  • Then transpose the vector for the last log(m/p) steps
  • All communication is in the transpose
  • Note: this example has log(m/p) = log(p)
  • If log(m/p) > log(p), more phases/layouts will be needed
  • We will work with this assumption for simplicity

20
Why is the Communication Step Called a Transpose?
  • Analogous to transposing an array
  • View as a 2D array of n/p by p
  • Note: the same idea is useful for uniprocessor caches (a message-passing sketch of the transpose follows)
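In message-passing terms the transpose is an all-to-all. A minimal sketch (assuming p² divides m, complex data, and send buffers already packed by destination):

    #include <mpi.h>
    #include <complex.h>

    /* Each of p processors exchanges an (m/p^2)-element block with
       every other processor; local repacking is omitted. */
    void transpose_step(double complex *sendbuf, double complex *recvbuf,
                        long m, int p)
    {
        int block = (int)(m / ((long)p * p));
        MPI_Alltoall(sendbuf, block, MPI_C_DOUBLE_COMPLEX,
                     recvbuf, block, MPI_C_DOUBLE_COMPLEX,
                     MPI_COMM_WORLD);
    }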

21
Complexity of the FFT with Transpose
  • If no communication is pipelined (an overestimate!):
  • Time(transposeFFT)
  •   = 2·m·log(m)/p        (same as before)
  •   + (p-1) · α           (was log(p) · α)
  •   + m·(p-1)/p² · β      (was m·log(p)/p · β)
  • If communication is pipelined, so we do not pay for p-1 messages, the second term becomes simply α rather than (p-1)·α.
  • This is close to optimal. See the LogP paper for details.
  • See also the following papers on the class resource page:
  • A. Sahai, "Hiding Communication Costs in Bandwidth-Limited FFT"
  • R. Nishtala et al., "Optimizing bandwidth limited problems using one-sided communication"

22
Comment on the 1D Parallel FFT
  • The above algorithm leaves data in bit-reversed
    order
  • Some applications can use it this way, like
    Poisson
  • Others require another transpose-like operation
  • Other parallel algorithms also exist
  • A very different 1D FFT is due to Edelman (see http://www-math.mit.edu/~edelman)
  • Based on the Fast Multipole algorithm
  • Less communication for non-bit-reversed algorithm

23
Higher Dimension FFTs
  • FFTs in 2 or 3 dimensions are defined as 1D FFTs on vectors in all dimensions (a serial sketch follows this list)
  • E.g., a 2D FFT does 1D FFTs on all rows and then
    all columns
  • There are 3 obvious possibilities for the 2D FFT
  • (1) 2D blocked layout for matrix, using 1D
    algorithms for each row and column
  • (2) Block row layout for matrix, using serial 1D
    FFTs on rows, followed by a transpose, then more
    serial 1D FFTs
  • (3) Block row layout for matrix, using serial 1D
    FFTs on rows, followed by parallel 1D FFTs on
    columns
  • Option 2 is best, if we overlap communication and
    computation
  • For a 3D FFT the options are similar
  • 2 phases done with serial FFTs, followed by a
    transpose for 3rd
  • can overlap communication with 2nd phase in
    practice
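Ignoring layout for a moment, the row-then-column structure looks like this serially (a sketch reusing the recursive fft() from slide 13; n must be a power of two):

    /* 2D FFT = 1D FFTs on all rows, then 1D FFTs on all columns. */
    #include <complex.h>

    void fft2d(double complex *a, int n)   /* a is n-by-n, row-major */
    {
        double complex col[n];
        for (int i = 0; i < n; i++)        /* FFT each row in place */
            fft(&a[i*n], n);
        for (int j = 0; j < n; j++) {      /* FFT each column */
            for (int i = 0; i < n; i++) col[i] = a[i*n + j];
            fft(col, n);
            for (int i = 0; i < n; i++) a[i*n + j] = col[i];
        }
    }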

24
FFTW Fastest Fourier Transform in the West
  • www.fftw.org
  • Produces FFT implementations optimized for:
  • Your version of the FFT (complex, real, …)
  • Your value of n (arbitrary, possibly prime)
  • Your architecture
  • Close to optimal for serial; can be improved for parallel
  • Similar in spirit to PHiPAC/ATLAS/Sparsity
  • Won the 1999 Wilkinson Prize for Numerical Software
  • Widely used for serial FFTs
  • Had parallel FFTs in version 2, but these are no longer supported
  • Layout constraints from users/apps and network differences are hard to support

25
Bisection Bandwidth
  • FFT requires one (or more) transpose operations
  • Every processor sends 1/P of its data to each of the others
  • Bisection Bandwidth limits this performance
  • Bisection bandwidth is the bandwidth across the
    narrowest part of the network
  • Important in global transpose operations,
    all-to-all, etc.
  • Full bisection bandwidth is expensive
  • Fraction of machine cost in the network is
    increasing
  • Fat-tree and full crossbar topologies may be too
    expensive
  • Especially on machines with 100K or more processors
  • SMP clusters often limit bandwidth at the node
    level

26
Modified LogGP Model
  • LogGP model (no overlap)
[Timing diagram: a message sent from P0 to P1, showing the gap g between successive injections]
EEL = end-to-end latency (1/2 round trip)
g = minimum time between small message sends
G = additional gap per byte for larger messages
27
Historical Perspective
[Plot: ½ round-trip latency across machines over time]
  • Potential performance advantage for fine-grained,
    one-sided programs
  • Potential productivity advantage for irregular
    applications

28
General Observations
  • The overlap potential is the difference between the gap and the overhead
  • No potential if the CPU is tied up throughout the message send
  • E.g., no send-side DMA
  • Grows with message size for machines with DMA, because the per-byte cost is handled by the NIC
  • Grows with the amount of network congestion, because the gap grows as the network becomes saturated
  • Remote overhead is 0 for machines with RDMA

29
GASNet Communications System
  • GASNet offers put/get communication
  • One-sided: no remote CPU involvement required in the API (a key difference from MPI)
  • The message contains the remote address
  • No need to match with a receive
  • No implicit ordering required (a split-phase put is sketched below)
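A split-phase put in the GASNet-1 style API looks roughly as follows; this is a sketch from my reading of the GASNet 1.x spec, with init/attach boilerplate and the exchange of remote addresses omitted.

    #include <gasnet.h>

    /* One-sided non-blocking put: the message carries the remote
       address, and no receive is posted on the remote side. */
    void put_example(gasnet_node_t peer, void *remote_addr,
                     void *local_buf, size_t nbytes)
    {
        gasnet_handle_t h = gasnet_put_nb(peer, remote_addr,
                                          local_buf, nbytes);
        /* ... split-phase: overlap independent computation here ... */
        gasnet_wait_syncnb(h);    /* complete the transfer */
    }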

  • Used in language runtimes (UPC, etc.)
  • Fine-grained and bulk transfers
  • Split-phase communication

[Software stack: compiler-generated code → language-specific runtime → GASNet → network hardware]
30
Performance of One-Sided vs. Two-Sided Communication: GASNet vs. MPI
  • Comparison on Opteron/InfiniBand: GASNet's vapi-conduit vs. OSU MPI 0.9.5
  • Up to large message sizes (> 256 kB), GASNet provides up to a 2.2X improvement in streaming bandwidth
  • The half-power point (N/2) differs by one order of magnitude

31
GASNet Performance for Mid-Range Message Sizes
GASNet usually reaches saturation bandwidth before MPI (fewer costs to amortize), and it usually outperforms MPI at medium message sizes, often by a large margin.
32
NAS FT Case Study
  • Performance of the Exchange (all-to-all) is critical
  • The communication-to-computation ratio increases with faster, more optimized 1D FFTs
  • Determined by the available bisection bandwidth
  • Between 30-40% of the application's total runtime
  • Two ways to reduce the Exchange cost:
  • 1. Use a better network (higher bisection bandwidth)
  • 2. Overlap the all-to-all with computation (where possible): break up the exchange
  • The default NAS FT Fortran/MPI implementation relies on #1
  • Our approach uses UPC/GASNet and builds on #2 (started as a CS267 project)
  • The 1D partition of the 3D grid is a limitation: at most N processors for an N³ grid
  • The HPC Challenge benchmark has a large 1D FFT (can be viewed as 3D or more with the proper roots of unity)

33
3D FFT Operation with Global Exchange
[Diagram: 1D FFT of rows (along cachelines), a local transpose, then 1D FFT of columns with the rows divided among threads; the exchange (all-to-all) sends one large message to each thread (Thread 0, Thread 1, Thread 2, …), followed by a transpose and the last 1D FFT, shown from Thread 0's view.]
  • A single communication operation (global exchange) sends THREADS large messages
  • Separate computation and communication phases

34
Communication Strategies for 3D FFT
  • Three approaches:
  • Chunk (all rows with the same destination):
  •   Wait for the 2nd-dimension FFTs to finish
  •   Minimize the number of messages
  • Slab (all rows in a single plane with the same destination):
  •   Wait for the chunk of rows destined for 1 processor to finish
  •   Overlap with computation
  • Pencil (1 row):
  •   Send each row as it completes
  •   Maximize overlap and match the natural layout

Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
35
Decomposing the NAS FT Exchange into Smaller Messages
  • Three approaches:
  • Chunk: wait for the 2nd-dimension FFTs to finish
  • Slab: wait for the chunk of rows destined for 1 processor to finish
  • Pencil: send each row as it completes
  • Example: message size breakdown for Class D (2048 × 1024 × 1024) at 256 processors

36
Overlapping Communication
  • Goal: make use of all the wires
  • Distributed memory machines allow for asynchronous communication
  • The Berkeley non-blocking extensions expose GASNet's non-blocking operations
  • Approach: break up the all-to-all communication
  • Interleave row computations and row communications, since the 1D FFT is independent across rows
  • The decomposition can be into slabs (contiguous sets of rows) or pencils (individual rows)
  • Pencils allow an earlier start for the communication phase and improved local cache use
  • But more, smaller messages (same total volume); a pencil-style sketch follows below
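The actual implementation used UPC/GASNet one-sided puts; an analogous pencil-style overlap can be sketched with MPI non-blocking sends (illustrative only; fft() is the recursive sketch from slide 13, and dest_of_row and the tagging scheme are invented for the example):

    #include <mpi.h>
    #include <complex.h>

    /* Ship each row as soon as its 1D FFT completes, rather than one
       big exchange after the whole plane is done. */
    void fft_rows_overlapped(double complex *plane, int nrows, int rowlen,
                             const int dest_of_row[], MPI_Request reqs[])
    {
        for (int r = 0; r < nrows; r++) {
            fft(&plane[r * rowlen], rowlen);     /* compute one pencil */
            MPI_Isend(&plane[r * rowlen], rowlen, MPI_C_DOUBLE_COMPLEX,
                      dest_of_row[r], /* tag = */ r,
                      MPI_COMM_WORLD, &reqs[r]);
        }
        MPI_Waitall(nrows, reqs, MPI_STATUSES_IGNORE);
    }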

37
NAS FT UPC Non-blocking MFlops
  • The Berkeley UPC compiler supports the non-blocking UPC extensions
  • They produce a 15-45% speedup over the best blocking UPC version
  • The non-blocking version requires about 30 extra lines of UPC code

38
NAS FT Variants Performance Summary
  • Shown are the largest classes/configurations
    possible on each test machine
  • MPI not particularly tuned for many small/medium
    size messages in flight (long message matching
    queue depths)

39
Pencil/Slab Optimizations: UPC vs. MPI
  • The same data, viewed in the context of what MPI is able to overlap
  • For the amount of time that MPI spends in communication, how much of that time can UPC effectively overlap with computation?
  • On InfiniBand, UPC overlaps almost all of the time that MPI spends in communication
  • On Elan3, UPC obtains more overlap than MPI as the problem scales up

40
Summary of Overlap in FFTs
  • One-sided communication has performance
    advantages
  • Better match for most networking hardware
  • Most cluster networks have RDMA support
  • Machines with global address space support (X1,
    Altix) shown elsewhere
  • Smaller messages may make better use of network
  • Spread communication over longer period of time
  • Postpone bisection bandwidth pain
  • Smaller messages can also prevent cache thrashing
    for packing
  • Avoid packing overheads if natural message size
    is reasonable

41
FFTW
the Fastest Fourier Transform in the West
C library for real & complex FFTs (arbitrary size/dimensionality)
(+ parallel versions for threads & MPI)
Computational kernels (80% of the code) are automatically generated
Self-optimizes for your hardware (picks the best composition of steps): portability + performance
42
FFTW performance: power-of-two sizes, double precision
[Plots for 833 MHz Alpha EV6, 2 GHz PowerPC G5, 500 MHz UltraSPARC IIe, 2 GHz AMD Opteron]
43
FFTW performance: non-power-of-two sizes, double precision
[Plots for 833 MHz Alpha EV6, 2 GHz AMD Opteron]
Unusual non-power-of-two sizes receive as much optimization as powers of two, because we let the code do the optimizing.
44
FFTW performance: double precision, 2.8 GHz Pentium IV, 2-way SIMD (SSE2)
[Plots: powers of two and non-powers-of-two]
Exploiting CPU-specific SIMD instructions (rewriting the code) is easy, because we let the code write itself.
45
Why is FFTW fast? Three unusual features:
FFTW implements many FFT algorithms: a planner picks the best composition by measuring the speed of different combinations.
The resulting plan is executed with explicit recursion: this enhances locality.
The base cases of the recursion are codelets: highly-optimized dense code automatically generated by a special-purpose compiler.
46
FFTW is easy to use
    #include <fftw3.h>

    fftw_complex x[n];
    fftw_plan p;
    p = fftw_plan_dft_1d(n, x, x, FFTW_FORWARD, FFTW_MEASURE);
    ...
    fftw_execute(p);   /* repeat as needed */
    ...
    fftw_destroy_plan(p);
47
Why is FFTW fast? Three unusual features (numbered in the order discussed next):
3. FFTW implements many FFT algorithms: a planner picks the best composition by measuring the speed of different combinations.
1. The resulting plan is executed with explicit recursion: this enhances locality.
2. The base cases of the recursion are codelets: highly-optimized dense code automatically generated by a special-purpose compiler.
48
FFTW Uses Natural Recursion
[Diagram: a size-8 DFT with radix p = 2 splits into two size-4 DFTs, each of which splits into two size-2 DFTs.]
49
Traditional cache solution: Blocking
[Same size-8 DFT tree, traversed breadth-first with blocks sized to the cache.]
Breadth-first, but with blocks of size = cache: this requires a program specialized for the cache size.
50
Recursive Divide & Conquer is Good
[Singleton, 1967]
[Same size-8 DFT tree (radix 2), traversed depth-first.]
51
Cache Obliviousness
A cache-oblivious algorithm does not know the cache size: it can be optimal for any machine and for all levels of cache simultaneously.
Cache-oblivious algorithms exist for many other problems, too [Frigo et al. 1999], all via the recursive divide & conquer approach.
52
Why is FFTW fast? Three unusual features (numbered in the order discussed):
3. FFTW implements many FFT algorithms: a planner picks the best composition by measuring the speed of different combinations.
1. The resulting plan is executed with explicit recursion: this enhances locality.
2. The base cases of the recursion are codelets: highly-optimized dense code automatically generated by a special-purpose compiler.