Parallel FFTs in 3D: Testing Different Implementation Schemes
Luis Acevedo-Arreguin, Benjamin Byington, Erinna Chen, Adrienne Traxler
Department of Applied Mathematics and Statistics, UC Santa Cruz
1. Introduction
- Motivation
- Many problems in geophysics and astrophysics can be modeled using the Navier-Stokes equations, e.g. the Earth's magnetic field, mantle convection, the solar convection zone, and ocean circulation.
- Various numerical methods can be employed to solve these problems. Because of geometry and symmetry, many numerical codes employ a spectral method to solve the fluid dynamics.
- Solutions can be written as sums of periodic functions and their corresponding weights (spectral coefficients).
- This provides solutions that are continuous (values can be obtained at positions that do not lie on the grid).
- Particularly computationally challenging is the calculation of the non-linear terms in the equations.
- Using spectral coefficients, the non-linear terms are convolutions, costing O(N^2) for N total modes.
- If the non-linear terms are calculated in grid (real) space, they become pointwise products, reducing the cost to O(N).
- Fast Fourier transforms (FFTs), at O(N log N), make it numerically tractable to perform frequent transforms from real to spectral space (forward) and from spectral to real space (backward).
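The cost argument above can be checked with a quick 1-D sketch (a minimal numpy illustration, not part of the poster; the array size is arbitrary): the O(N) pointwise product in real space, followed by one FFT, reproduces the O(N^2) circular convolution of the spectral coefficients.

```python
import numpy as np

n = 32
rng = np.random.default_rng(0)
u = rng.standard_normal(n)
v = rng.standard_normal(n)

# Spectral route: circular convolution of the coefficient arrays, O(N^2).
U, V = np.fft.fft(u), np.fft.fft(v)
conv = np.array(
    [sum(U[m] * V[(k - m) % n] for m in range(n)) for k in range(n)]
) / n

# Pseudospectral route: multiply pointwise in real space, O(N), then one FFT.
W = np.fft.fft(u * v)

assert np.allclose(conv, W)
```

The same identity is what makes the real-space evaluation of the non-linear terms attractive: the expensive convolution is replaced by transforms plus a cheap product.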
Figure 3: The starting setup. Data is initially decomposed into a stack of x-y planes, so each processor can do a local FFT in two dimensions. The data is then transposed so that each processor now holds a set of y-z slices, allowing the final transform (in the z direction) to be completed. For the reverse procedure, a similar process (transform, transpose, complete transform) is required.
Figure 1: (left) Magnetic field polarity reversal caused by fluid motion in the Earth's core (G. Glatzmaier, UCSC). (right) Layering phenomenon caused by double-diffusion, similar to thermohaline staircases in the ocean (S. Stellmach, UCSC).
2. Transpose-based parallel FFTs

- The transpose-based parallel FFT utilizes FFTW to perform a 2-D FFT (an optimized in-processor implementation) and then redistributes all data for the FFT in the third dimension.
- The optimal communication scheme for transposing the data to perform the final FFT is not clear:
- a) Complete all 2-D planes in-processor, call MPI_ALLTOALL to transpose the data, block until the entire transposed data set is received, and perform the final FFT.
- b) Complete a single 2-D plane in-processor, call MPI_ALLTOALL to transpose each plane (overlapping communication with computation?), block until the entire transposed data set is received, and perform the final FFT.
- c) Complete a single 2-D plane in-processor, utilize an explicit asynchronous communication scheme, and perform the final FFT as data becomes available.
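The data movement in scheme (a) can be emulated serially (a numpy sketch, not the poster's MPI code; the processor count is illustrative, and the concatenation stands in for MPI_ALLTOALL): local 2-D FFTs on each processor's stack of x-y planes, a global transpose, then the final 1-D FFT in z.

```python
import numpy as np

n, nprocs = 8, 4  # grid size and illustrative processor count
rng = np.random.default_rng(1)
data = rng.standard_normal((n, n, n))

# Step 1: each "processor" owns n/nprocs x-y planes and FFTs them locally.
slabs = np.split(data, nprocs, axis=2)            # each slab: n x n x (n/nprocs)
slabs = [np.fft.fft2(s, axes=(0, 1)) for s in slabs]

# Step 2: the all-to-all transpose -- regroup the data so z becomes local
# (here simply reassembled in memory; MPI_ALLTOALL would do this redistribution).
planes = np.concatenate(slabs, axis=2)

# Step 3: final 1-D FFT along z, now local to each y-z slice owner.
result = np.fft.fft(planes, axis=2)

assert np.allclose(result, np.fft.fftn(data))
```

Schemes (b) and (c) reorder the same three steps per plane so that communication can overlap with computation.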
Schemes

- A naïve parallel FFT

Figure 2: Schematic representation of a naïve implementation of a 3-D FFT. Rows, columns, and finally stacks are sent to slave processors, which perform 1-D FFTs on the received vectors.
- The Parallel Problem
- If the code is serial, FFTW takes all the work out of optimizing the FFT.
- In general, we want large domains to resolve the fluid dynamics for realistic geophysical/astrophysical constants.
- Large domains call for large distributed clusters.
- We would like to use FFTW because it is the fastest algorithm (in the West), and the autotuning is done for us.
- Multiple schemes can be employed:
- Transpose-based: data in a single direction is in-processor; once that FFT is completed, re-decompose the domain in another direction.
- Parallel: leave the data in place and send pieces to other processors as necessary.
- Which is faster? It is not obvious at the outset.
- The data are decomposed into vectors along the three directions. Each vector, whether a row, column, or stack, is sent to a slave processor. A master processor distributes tasks, receives the 1-D FFT performed on each vector by the slave processors, and resends more work to those processors that have finished their previous assignments.
- This scheme aims to compute 3-D FFTs on datasets that exhibit clustering, i.e. no homogeneity in terms of density, which consequently requires more intense computational work in some areas than others.
- This scheme is tested on NERSC Franklin using the fftw 3.1.1 module.
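The master-worker pattern above can be sketched with a thread pool standing in for the slave processors (a minimal numpy illustration, not the poster's MPI implementation; the pool size and grid size are arbitrary): every 1-D vector along each axis is handed out as an independent task.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

n = 8
rng = np.random.default_rng(2)
orig = rng.standard_normal((n, n, n))
data = orig.astype(complex)

# The "master" hands out one vector at a time; each "slave" (thread)
# returns the 1-D FFT of its vector. Rows, then columns, then stacks.
with ThreadPoolExecutor(max_workers=4) as pool:
    for axis in range(3):
        data = np.moveaxis(data, axis, -1)          # make this direction last
        shape = data.shape
        vectors = data.reshape(-1, n)               # every 1-D line along this axis
        ffts = list(pool.map(np.fft.fft, vectors))  # one task per vector
        data = np.asarray(ffts).reshape(shape)
        data = np.moveaxis(data, -1, axis)          # restore the original layout

assert np.allclose(data, np.fft.fftn(orig))
```

A real master would interleave sends and receives to keep idle processors fed, which is the point of the scheme for inhomogeneous workloads.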
3. Parallel version of FFTW 3.2, alpha
- The most recent version of FFTW includes an implementation of a 3-D parallel FFT.
- It is unclear how this is implemented: whether it completes a fully parallel FFT (3-D domain decomposition) or a transpose-based FFT.
- However, if it autotunes the implementation of the 3-D parallel FFT, it allows for greater portability and hides the cluster topology.
- It potentially provides a parallel implementation where the end-user needs no knowledge of parallelism in order to calculate the FFT, just as in the serial case: use FFTW on any computer.
Acknowledgments: Thanks to Professor Nicholas Brummell from UC Santa Cruz for his help on FFTs after class, to Professor James Demmel from UC Berkeley for his teaching on parallel computing, and to Vasily Volkov for his feedback on our homework.