OPTIMIZATION OF THE POISSON OPERATOR IN CHOMBO
Razvan Carbunescu
Meriem Ben Salah
Andrew Gearhart
Motivation
Theoretical Background
Poisson Operator in CHOMBO
  • CHOMBO is a framework for implementing finite difference methods for the solution of partial differential equations on block-structured, adaptively refined rectangular grids.
  • CHOMBO provides elliptic and time-dependent modules, as well as support for standardized self-describing file formats.
  • CHOMBO is architecture and operating system independent.
  • For use on parallel platforms, CHOMBO provides solely a distributed memory implementation in MPI (Message Passing Interface).
  • Is the use of a distributed memory implementation always beneficial?
  • The Poisson operator, also known as the Laplacian, is a second-order elliptic differential operator, defined in an n-dimensional Cartesian space as shown in the equations below.
  • The Poisson operator appears in the definition of the Helmholtz differential equation.
  • The Helmholtz differential equation reduces to the Poisson equation.
  • The Poisson equation is used in the modeling of various boundary value physical problems, e.g. the electric potential in electrostatics, potential flow in fluid dynamics, etc.
  • The definition of appropriate boundary conditions, Dirichlet or Neumann, allows for the solution of the Poisson problem.
  • A numerical solution requires the discretization of the continuous Poisson equation, e.g. by the standard centered-difference approximation, as well as a discrete handling of the Dirichlet and Neumann boundary conditions.
  • For the sake of a demonstrative exposition, we use the Poisson potential-flow solve performed within the incompressible Navier-Stokes equations.
  • A Poisson solve is conducted at the beginning of an incompressible flow simulation to obtain initial conditions for the evolution of the velocity and pressure.
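
For reference, the standard definitions behind the bullets above, written in common notation (h denotes the grid spacing in the discrete case; the 2-D centered-difference form is the 5-point stencil discussed throughout the poster):

\Delta u \;=\; \sum_{i=1}^{n} \frac{\partial^2 u}{\partial x_i^2}
\qquad \text{(Poisson operator, i.e. the Laplacian)}

\Delta u + k^2 u \;=\; f
\qquad \text{(Helmholtz equation; } k = 0 \text{ gives the Poisson equation } \Delta u = f)

(\Delta_h u)_{i,j} \;=\; \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4\,u_{i,j}}{h^2}
\qquad \text{(standard centered-difference approximation in 2-D)}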

Project Strategy and Goals
  • Focus on the stencil kernel that applies the Poisson operator to two-dimensional cell-centered data and is embedded in AMRElliptic, CHOMBO's multigrid-based elliptic and parabolic equation solver library for adaptive mesh hierarchies (a sketch of such a kernel is given below).
  • Start with the serial and distributed memory implementations supplied in CHOMBO for reference.
  • Implement the kernel for parallel shared memory architectures.
  • Conduct a parameter study to analyze the benefits and drawbacks of each implementation in terms of computational time.
  • Draw global conclusions on the consequences of the limitations of the parallel implementations on CHOMBO.
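
As a point of reference for the later discussion, here is a minimal sketch of what such a 2-D Poisson stencil kernel computes. This is not CHOMBO's actual interface: the function name, the single box with one ghost layer, and the column-major (Fortran-ordered, i-fastest) storage are illustrative assumptions.

#include <stddef.h>

/* Column-major indexing for an (ni+2) x (nj+2) box that includes one
   ghost layer on every side; i is the fastest-varying index. */
#define IDX(i, j, ni) ((size_t)(i) + (size_t)((ni) + 2) * (size_t)(j))

/* Apply the 5-point centered-difference Laplacian to the ni x nj
   interior cells of phi, writing the result into lap.  h is the grid
   spacing of the cell-centered data. */
void laplacian_2d(const double *phi, double *lap, int ni, int nj, double h)
{
    const double invh2 = 1.0 / (h * h);
    for (int j = 1; j <= nj; j++) {          /* interior rows            */
        for (int i = 1; i <= ni; i++) {      /* i innermost: unit stride */
            lap[IDX(i, j, ni)] = invh2 *
                ( phi[IDX(i + 1, j, ni)] + phi[IDX(i - 1, j, ni)]
                + phi[IDX(i, j + 1, ni)] + phi[IDX(i, j - 1, ni)]
                - 4.0 * phi[IDX(i, j, ni)] );
        }
    }
}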

Fig. 1 Flow boundary condition initialization
Fig. 2 Flow evolution start-up after a Poisson solve
Targeted Architectures
Investigations and Results
Existing Implementations
  • The serial, distributed memory, and shared memory implementations of the Poisson solve have been run on Franklin, the NERSC Cray XT4 system, a massively parallel processing system with 9,532 compute nodes (quad-core processors) and 38,128 processor cores. The GPU implementation is being conducted on personal Linux boxes as well as the r56 and r57 nodes of the Millennium cluster.
  • The number of grid cells in the spatial partition as well as the number of grid refinements affect the computation of the discrete Poisson stencil and are the focus of these test runs.
  • To allow a fair comparison, the number of grid cells and the number of grid refinements are kept equal across the test runs. To account for different loading of the machine, each test was repeated 5 times and the corresponding results averaged.
  • CHOMBO's implementation is currently tuned for distributed memory: the domain is split into small boxes, on the order of 32 squared cells in 2D and 32 cubed in 3D, and each box is assigned to a processor which runs serial Fortran 77 code.
  • Because of the small size of each box, there is not enough computational intensity to exploit a threaded shared memory implementation or to hide the cost of the transfer to the GPU.

Fig. 4 Speedup (vs. ncell) relative to the C serial implementation
Figure 3 compares the runtimes of the C serial code and the Fortran 77 serial code. Besides the growth in runtime with problem size, it is interesting to note that despite the similar (column-major) matrix storage in the Fortran 77 and C versions, the Fortran 77 code is accessed fastest when its loops are ordered n, j, i, whereas the C code is indexed i, j, n; a loop-order sketch follows below. Figure 4 depicts the speedup of the various parallel implementations with respect to the associated serial C version; C_MPI44 refers to running the MPI version with all 4 cores on 1 node, and C_MPI41 refers to running the MPI version with 4 cores but 1 core per node. Figure 5 presents the relative speedup of all the codes with respect to the fastest (Fortran) serial version.
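
A small illustration of how loop order interacts with a column-major (i-fastest) layout holding ncomp components; the function names and the simple scaling operation are illustrative only, not the CHOMBO kernels, and i, j, n are taken here to refer to the loop nesting.

#include <stddef.h>

/* Column-major offset with components: i fastest, then j, then n. */
static inline size_t idx(int i, int j, int n, int ni, int nj)
{
    return (size_t)i + (size_t)ni * (size_t)j + (size_t)ni * (size_t)nj * (size_t)n;
}

/* Loop order n, j, i (i innermost): consecutive inner-loop accesses are
   one element apart, i.e. unit stride through memory. */
void traverse_nji(double *a, int ni, int nj, int ncomp, double s)
{
    for (int n = 0; n < ncomp; n++)
        for (int j = 0; j < nj; j++)
            for (int i = 0; i < ni; i++)
                a[idx(i, j, n, ni, nj)] *= s;
}

/* Loop order i, j, n (n innermost): consecutive inner-loop accesses are
   ni*nj elements apart, so every access strides far through memory. */
void traverse_ijn(double *a, int ni, int nj, int ncomp, double s)
{
    for (int i = 0; i < ni; i++)
        for (int j = 0; j < nj; j++)
            for (int n = 0; n < ncomp; n++)
                a[idx(i, j, n, ni, nj)] *= s;
}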
  • While CHOMBO's code obtains good performance for MPI on a distributed memory system, our study aims to determine whether we can improve this speedup through locality and the faster access times of on-chip cores.
  • Our current shared memory implementation uses only a limited, basic strip-mining technique, since it maintains the abstraction of a data iterator going through all the boxes at each step; a more relaxed model, in which time could also be blocked, could help (a pthreads sketch of the strip-mining idea is given below).
  • Another interesting opportunity for speedup is running the operator on the GPU, but the benefits must outweigh the cost of moving the data onto and off the GPU.
  • The points above raise the issue of threads: for small boxes, lightweight threads are important to allow fast context switching and thereby, hopefully, increase performance. To allow the use of pthreads and CUDA, we implemented a C version of the operator.
  • The particular reason for this choice of study is to allow the creation of heterogeneous systems that would automatically adjust their code to either MPI, Pthreads, or CUDA underneath to achieve the best result.
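
A minimal sketch, not CHOMBO's code, of the strip-mining idea on the shared memory side: the interior rows of one box are split into strips and each strip is handed to a pthread running the same 5-point kernel as in the earlier sketch. NTHREADS and all other names are illustrative assumptions.

#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4
#define IDX(i, j, ni) ((size_t)(i) + (size_t)((ni) + 2) * (size_t)(j))

typedef struct {
    const double *phi;   /* input box with one ghost layer       */
    double       *lap;   /* output box, same layout              */
    int           ni;    /* interior cells in i                  */
    int           j_lo;  /* first interior row of this strip     */
    int           j_hi;  /* one past the last row of this strip  */
    double        invh2; /* 1 / h^2                              */
} strip_t;

/* Apply the stencil to the rows [j_lo, j_hi) assigned to this thread. */
static void *apply_strip(void *p)
{
    strip_t *s = (strip_t *)p;
    for (int j = s->j_lo; j < s->j_hi; j++)
        for (int i = 1; i <= s->ni; i++)
            s->lap[IDX(i, j, s->ni)] = s->invh2 *
                ( s->phi[IDX(i + 1, j, s->ni)] + s->phi[IDX(i - 1, j, s->ni)]
                + s->phi[IDX(i, j + 1, s->ni)] + s->phi[IDX(i, j - 1, s->ni)]
                - 4.0 * s->phi[IDX(i, j, s->ni)] );
    return NULL;
}

/* Strip-mine one ni x nj box across NTHREADS threads. */
void laplacian_2d_threaded(const double *phi, double *lap,
                           int ni, int nj, double h)
{
    pthread_t tid[NTHREADS];
    strip_t   arg[NTHREADS];
    int strip = (nj + NTHREADS - 1) / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        int lo = 1 + t * strip;
        int hi = lo + strip;
        if (lo > nj + 1) lo = nj + 1;   /* empty strip when the box is small */
        if (hi > nj + 1) hi = nj + 1;
        arg[t] = (strip_t){ phi, lap, ni, lo, hi, 1.0 / (h * h) };
        pthread_create(&tid[t], NULL, apply_strip, &arg[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}

For the small 32 x 32 boxes mentioned above, the cost of creating and joining the threads for each box can easily dominate the stencil work itself, which is exactly the lightweight-thread concern raised in the bullets.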

Fig. 3 Runtime (time vs. ncell) of the Fortran vs. C serial implementations
Fig. 5 Speedup (vs. ncell) relative to the Fortran serial implementation
Conclusions and Future Work
  • The shared memory implementation shows beneficial results vs. the C serial code, but more analysis is required to correctly compare it against the Fortran MPI code.
  • At the time of this presentation no GPU results have been completed. It will be interesting to find a correct methodology for comparing the GPU results from Millennium with the results of the MPI and shared memory implementations; since Franklin does not provide GPUs, the GPU test runs have to be verified independently. This work will be reported later in the project document.
  • It is obvious that the choice of the C implementation led to a loss of computational performance, and therefore a Fortran 77 shared memory implementation could be attractive. However, a POSIX threads interface to Fortran 77 is currently not available. An improvement on the shared memory implementation might come from creating an OpenMP Fortran 77 solver code (an illustrative directive is sketched below).
  • Currently our simulations are performed only on the Cray XT4 and the GPU. It would be interesting to conduct these studies on different machine architectures.
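
The OpenMP route suggested above might look like the following; this sketch is written against the illustrative C kernel from the earlier examples (the poster's suggestion targets the Fortran 77 solver, where an analogous !$OMP PARALLEL DO directive would play the same role).

#include <stddef.h>

#define IDX(i, j, ni) ((size_t)(i) + (size_t)((ni) + 2) * (size_t)(j))

/* Same 5-point kernel as above, with the row loop shared among threads
   by an OpenMP worksharing directive instead of explicit pthreads. */
void laplacian_2d_omp(const double *phi, double *lap, int ni, int nj, double h)
{
    const double invh2 = 1.0 / (h * h);
    #pragma omp parallel for
    for (int j = 1; j <= nj; j++)
        for (int i = 1; i <= ni; i++)
            lap[IDX(i, j, ni)] = invh2 *
                ( phi[IDX(i + 1, j, ni)] + phi[IDX(i - 1, j, ni)]
                + phi[IDX(i, j + 1, ni)] + phi[IDX(i, j - 1, ni)]
                - 4.0 * phi[IDX(i, j, ni)] );
}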

Contact information
Meriem Ben Salah, UC Berkeley ME Graduate Student, ParLab, meriem.ben.salah@berkeley.edu
Razvan Corneliu Carbunescu, UC Berkeley CS Graduate Student, ParLab, carazvan@eecs.berkeley.edu
Andrew Gearhart, UC Berkeley CS Graduate Student, ParLab, agearhart@eecs.berkeley.edu
James Demmel, UC Berkeley Math & CS Faculty, demmel@eecs.berkeley.edu
Phillip Colella, LBNL ANAG, pcolella@lbl.gov
Brian Van Straalen, LBNL ANAG, bvs@hprcd.lbl.gov