Parallelizing Spacetime Discontinuous Galerkin Methods
1
Parallelizing Spacetime Discontinuous Galerkin Methods
  • Jonathan Booth
  • University of Illinois at Urbana-Champaign
  • In conjunction with L. Kale, R. Haber, S. Thite,
    J. Palaniappan
  • This research was made possible by NSF grant DMR 01-21695

http://charm.cs.uiuc.edu
2
Parallel Programming Lab
  • Led by Professor Laxmikant Kale
  • Application-oriented
  • Research is driven by real applications and their
    needs
  • NAMD
  • CSAR Rocket Simulation (Roc)
  • Spacetime Discontinuous Galerkin
  • Petaflops Performance Prediction (Blue Gene)
  • Focus on scalable performance for real
    applications

3
Charm++ Overview
  • In development for roughly ten years
  • Based on C++
  • Runs on many platforms
  • Desktops
  • Clusters
  • Supercomputers
  • Overlays a C layer called Converse
  • Allows multiple languages to work together

4
Charm++ Programmer View
  • System of objects
  • Asynchronous communication via method invocation
    (a minimal sketch follows below)
  • Use an object identifier to refer to an object
  • User sees each object execute its methods
    atomically
  • As if on its own processor

(diagram: objects/tasks executing on a processor)
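The slides show no code, but a minimal Charm++ sketch of this model might look like the following; the module, chare, and method names are invented for illustration.

// greeter.ci, the Charm++ interface file (all names invented)
mainmodule greeter {
  readonly CProxy_Main mainProxy;
  mainchare Main {
    entry Main(CkArgMsg *m);
    entry void done();
  };
  chare Greeter {
    entry Greeter();
    entry void greet(int x);    // asynchronous entry method
  };
};

// greeter.C
#include "greeter.decl.h"

/* readonly */ CProxy_Main mainProxy;

class Main : public CBase_Main {
 public:
  Main(CkArgMsg *m) {
    delete m;
    mainProxy = thisProxy;
    // ckNew returns immediately; the runtime places the new object
    // on whichever processor it chooses.
    CProxy_Greeter g = CProxy_Greeter::ckNew();
    g.greet(42);   // invocation through the proxy; no reply is awaited
  }
  void done() { CkExit(); }
};

class Greeter : public CBase_Greeter {
 public:
  Greeter() {}
  void greet(int x) {
    CkPrintf("greet(%d) ran on processor %d\n", x, CkMyPe());
    mainProxy.done();   // asynchronous call back to the main chare
  }
};

#include "greeter.def.h"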
5
Charm++ System View
  • Set of objects invoked by messages
  • Set of processors of the physical machine
  • Runtime keeps track of the object-to-processor
    mapping
  • Runtime routes messages between objects

(diagram: objects/tasks mapped onto the machine's processors)
6
Charm++ Benefits
  • Program is not tied to a fixed number of
    processors
  • No problem if the program needs 128 processors and
    only 45 are available
  • Called processor virtualization
  • Load balancing is accomplished automatically
  • User writes a short routine to transfer an object
    between processors (see the sketch below)

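In Charm++, that routine is typically a pup() method using the PUP (pack/unpack) framework; here is a minimal sketch with field names invented for illustration.

#include <vector>
#include "pup.h"
#include "pup_stl.h"   // operator| overloads for STL containers

// Hypothetical migratable object; the real patch data differs.
class Patch {
  int id;
  std::vector<double> solution;

 public:
  // One routine serves sizing, packing, and unpacking; the Charm++
  // runtime calls it when it migrates this object to another
  // processor (or, with the same code, to disk for checkpointing).
  void pup(PUP::er &p) {
    p | id;
    p | solution;
  }
};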
7
Load Balancing - Green Process Starts Heavy
Computation
(diagram: processors A, B, and C; the green object begins heavy computation)
8
Yellow Processes Migrate Away; System Handles Message Routing
(diagram: before/after views of processors A, B, and C as the yellow objects migrate away from the loaded processor)
9
Load Balancing
  • Load balancing isn't solely dependent on CPU
    usage
  • Balancers consider network usage as well
  • Can move objects to lessen network bandwidth
    usage
  • Migrating an object to disk instead of to another
    processor gives checkpoint/restart and out-of-core
    execution

10
Parallel Spacetime Discontinuous Galerkin
  • Mesh generation is an advancing-front algorithm
  • Adds an independent set of elements called
    patches to the mesh
  • Spacetime methods are set up in such a way that
    they are easy to parallelize
  • Each patch depends only on inflow elements
  • Cone constraint ensures no other dependencies
  • Amount of data per patch is small
  • Inexpensive to send a patch and its inflow
    elements to another processor

11
Mesh Generation
(figure: advancing front with unsolved patches)
12
Mesh Generation
(figure: advancing front with unsolved and solved patches)
13
Mesh Generation
(figure: refinement of the advancing front, with unsolved and solved patches)
14
Parallelization Method (1D)
  • Master-Slave method
  • Centralized mesh generation
  • Distributed physics solver code
  • Simplistic implementation
  • But fast to get running
  • Provides a sanity check for object migration
  • No global time-step: as soon as a solved patch
    returns, the master generates any new patches it
    can and sends them off to be solved (see the
    sketch below)

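A rough sketch of what that master loop amounts to; every type and helper here is a hypothetical stand-in, not the project's actual code.

#include <queue>
#include <vector>

// Hypothetical stand-ins for the real mesh/solver data structures.
struct Patch { int id; std::vector<double> inflowData; };

class Master {
  std::queue<Patch> ready;   // patches whose inflow elements are all solved
  int nSlaves, next = 0;

 public:
  explicit Master(int slaves) : nSlaves(slaves) {}

  // Invoked whenever a slave returns a solved patch. There is no
  // global time-step: each newly exposed patch is dispatched as soon
  // as its inflow elements are available.
  void patchSolved(const Patch &solved) {
    for (const Patch &p : advanceFront(solved))
      ready.push(p);
    while (!ready.empty()) {
      sendToSlave(next, ready.front());  // ships the patch plus inflow data
      next = (next + 1) % nSlaves;       // simple round-robin placement
      ready.pop();
    }
  }

 private:
  // Stubs: centralized mesh generation and the asynchronous send.
  std::vector<Patch> advanceFront(const Patch &);
  void sendToSlave(int slave, const Patch &);
};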
15
Results - Patches / Second
16
Scaling Problems
  • Speedup is ideal at 4 slave processors
  • After 4 slaves, diminishing speedup occurs
  • Possible sources
  • Network bandwidth overload
  • Charm++ system overhead (grainsize control)
  • Mesh generator overload
  • Problem doesn't scale down
  • More processors don't slow the computation down

17
Network Bandwidth
  • Size of a patch to send both ways is 2048 bytes
    (very conservative estimate)
  • Can compute 36 patches/(second·CPU)
  • Each CPU needs 72 kbytes/second
  • 100 Mbit Ethernet provides 10 Mbytes/sec
  • Network can support roughly 130 CPUs
  • So the bottleneck must not be a lack of network
    bandwidth (checked in the snippet below)

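This arithmetic can be checked with a tiny standalone program; the constants are the slide's own figures.

#include <cstdio>

int main() {
  const double bytesPerPatch    = 2048.0;  // round trip, conservative
  const double patchesPerSecCpu = 36.0;
  const double netBytesPerSec   = 10.0e6;  // ~10 Mbytes/sec usable Ethernet

  // 2048 * 36 = 73728 bytes/sec per CPU, i.e. about 72 kbytes/sec.
  const double perCpu = bytesPerPatch * patchesPerSecCpu;
  std::printf("per-CPU demand: %.0f bytes/sec\n", perCpu);
  // 10e6 / 73728 is about 135 CPUs, consistent with the slide's ~130.
  std::printf("CPUs supported: %.0f\n", netBytesPerSec / perCpu);
  return 0;
}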
18
Charm++ System Overhead (Grainsize Control)
  • Grainsize is a measure of the smallest unit of
    work
  • Too small and overhead dominates
  • Network latency overhead
  • Object creation overhead
  • Each patch takes 1.7 ms to set up the connection
    to send (both ways)
  • Can send 550 patches/sec to remote processors
  • Again, higher than observed patch/second rate
  • Grainsize can be increased (and per-patch overhead
    reduced) by sending multiple patches at once
    (sketched below)
  • Speeds up the computation, but speedup still
    flattens out after 8 processors

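A sketch of that batching idea, again with hypothetical names: several patches share one message, amortizing the 1.7 ms per-message setup cost.

#include <cstddef>
#include <vector>

struct Patch { int id; std::vector<double> inflowData; };

class PatchBatcher {
  std::vector<Patch> batch;
  std::size_t batchSize;   // illustrative grainsize knob

 public:
  explicit PatchBatcher(std::size_t n) : batchSize(n) {}

  void enqueue(const Patch &p, int slave) {
    batch.push_back(p);
    if (batch.size() >= batchSize) flush(slave);
  }

  void flush(int slave) {
    if (!batch.empty()) {
      sendBatch(slave, batch);  // one message carrying many patches
      batch.clear();
    }
  }

 private:
  void sendBatch(int slave, const std::vector<Patch> &);  // async send (stub)
};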
19
Mesh Generation
  • With 0 slave processors, 31 ms/patch
  • With 1 slave processor, 27 ms/patch
  • Geometry code takes 4 ms to generate a patch
  • Mesh generator needs a bit more time due to
    Charm++ message-sending overhead
  • Leads to fewer than 250 patches/second
  • Can't trivially speed this up
  • Would have to parallelize mesh generation
  • Parallel mesh generation would also lighten the
    network load if the mesh were fully distributed
    to slave nodes

20
Testing the Mesh Generator Bottleneck
  • Does speeding up the mesh generator give better
    results?
  • Leaves the question of how to speed up the mesh
    generator
  • The cluster used has 500 MHz P3 Xeon processors
  • So run the mesh generator on something faster (a
    2.8 GHz P4)
  • Everything is still on the 100 Mbit network

21
Fast Mesh Generator Results
22
Future Directions
  • Parallelize geometry/mesh generation
  • Easy to do in theory
  • More complex in practice with refinement,
    coarsening
  • Lessens network bandwidth consumption
  • Only have to send the border elements of each
    submesh
  • Compared to sending all elements, as is done now
  • Better cache performance

23
More Future Directions
  • Send only necessary data
  • Currently send everything, needed or not
  • Use object migration to balance load rather than
    the master-slave scheme
  • Means we'll also get checkpoint/restart and
    out-of-core execution for free
  • Also means we can load-balance away some of the
    network communication
  • Integrate 2D mesh generation/physics code
  • Nothing in the parallel code knows the
    dimensionality
