Title: The Zoltan Toolkit - Partitioning a Linear Accelerator for Tau3P
1. The Zoltan Toolkit - Partitioning a Linear Accelerator for Tau3P
- E. Boman, K. Devine, R. Heaphy, B. Hendrickson, Sandia National Labs, NM
- N. Folwell, K. Ko, M. Wolf, SLAC
- A. Pinar, LBL
- Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
2. The Zoltan Toolkit
- Parallel, dynamic, adaptive computations need many services to obtain peak performance.
- Processor workloads change during computation.
- Communication patterns are complicated.
- Memory usage is dynamic.
- Application developers wrote their own solutions.
- Little expertise in such parallel algorithms.
- No capability to compare approaches.
- No code reuse.
Zoltan: a toolkit of data services for dynamic, unstructured, adaptive computations.
3. Zoltan Data Services
4. Support for Many Applications
- Different applications, requirements, data
structures.
5. Zoltan Interface
- Simple, easy-to-use interface.
- Small number of callable Zoltan functions.
- Callable from C, C++, Fortran.
- Data-structure neutral design.
- Supports wide range of applications and data structures.
- Imposes no restrictions on the application's data structures.
- Application does not have to build Zoltan's data structures.
- Only requirement: unique global IDs for objects.
- Application interface
- Zoltan queries the application for needed info.
- IDs of objects, coordinates, relationships to other objects.
- Application provides simple functions to answer queries.
- Small extra costs in memory and function-call overhead.
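The query-based, data-structure-neutral design on this slide can be illustrated with a small sketch. This is not Zoltan's API: the class `Partitioner`, its `set_query` method, the query names `object_ids` and `coordinate`, and the simple sort-and-cut method inside `partition` are all hypothetical, chosen only to show how a toolkit can work from global IDs and callbacks without ever seeing the application's data structures.

```python
from typing import Callable, Dict, List, Tuple


class Partitioner:
    """Hypothetical toolkit-side object; NOT Zoltan's actual API."""

    def __init__(self, num_parts: int) -> None:
        self.num_parts = num_parts
        self._queries: Dict[str, Callable] = {}

    def set_query(self, name: str, fn: Callable) -> None:
        """Register an application callback that answers one kind of query."""
        self._queries[name] = fn

    def partition(self) -> Dict[int, int]:
        """Return {global ID: part}, using only the registered callbacks."""
        ids: List[int] = self._queries["object_ids"]()
        coordinate = self._queries["coordinate"]
        # Stand-in method (an assumption for illustration): order objects by
        # their first coordinate, then cut the ordering into num_parts
        # contiguous chunks of near-equal size.
        ordered = sorted(ids, key=lambda gid: coordinate(gid)[0])
        n = len(ordered)
        return {gid: rank * self.num_parts // n for rank, gid in enumerate(ordered)}


# Application side: the mesh stays in the application's own data structure.
mesh: Dict[int, Tuple[float, float]] = {
    101: (0.0, 1.0), 205: (3.0, 0.5), 17: (1.0, 2.0), 42: (2.0, 0.0),
}
toolkit = Partitioner(num_parts=2)
toolkit.set_query("object_ids", lambda: list(mesh))
toolkit.set_query("coordinate", lambda gid: mesh[gid])
assignment = toolkit.partition()
print(assignment)
```

The only thing the toolkit side ever stores is the global ID of each object; everything else is fetched through the two callbacks, which is where the small function-call overhead comes from.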
6. Partitioning and Dynamic Load Balancing
- Goals for static partitioning
- Distribute work evenly among processors.
- Minimize interprocessor communication.
- Desirable characteristics for dynamic load balancing
- Keep data movement costs low.
- Incremental partitioning: small changes in workloads produce only small changes in decomposition.
- Parallel, scalable implementation.
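The goals listed on this slide can each be turned into a number. The sketch below is one plausible way to do so, not a definition taken from the slides: `load_imbalance` (maximum part load over average part load), `edge_cut` (dependencies that cross parts, a common proxy for communication), and `migration_count` (objects that change parts between two decompositions) are illustrative names and formulas.

```python
from typing import Dict, Iterable, Tuple


def load_imbalance(part: Dict[int, int], weight: Dict[int, float], num_parts: int) -> float:
    """Maximum part load divided by average part load (1.0 is perfect balance)."""
    loads = [0.0] * num_parts
    for obj, p in part.items():
        loads[p] += weight[obj]
    return max(loads) / (sum(loads) / num_parts)


def edge_cut(part: Dict[int, int], edges: Iterable[Tuple[int, int]]) -> int:
    """Number of dependencies whose two endpoints lie in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


def migration_count(old: Dict[int, int], new: Dict[int, int]) -> int:
    """Number of objects that must move when switching from old to new."""
    return sum(1 for obj in old if old[obj] != new[obj])


# Six objects in a chain 0-1-2-3-4-5, initially split evenly over two parts.
edges = [(i, i + 1) for i in range(5)]
old = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# The workload changes: object 5 becomes three times as expensive.
weight = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 3.0}
before = load_imbalance(old, weight, 2)

# An incremental fix: shift the cut by one object instead of repartitioning.
new = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
after = load_imbalance(new, weight, 2)
moved = migration_count(old, new)
cut = edge_cut(new, edges)
print(before, after, moved, cut)
```

In this toy case a small workload change is absorbed by moving a single object, which is the behavior the "incremental partitioning" bullet asks for.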
7. No One-Size-Fits-All Solutions
- No single partitioner works best for all applications.
- Trade-offs
- Quality vs. speed.
- Geometric locality vs. data dependencies.
- Low data-movement costs vs. tolerance for remapping.
- Application developers may not know which partitioner is best for their application.
- Zoltan contains a suite of partitioning methods.
- Application changes only one parameter to switch methods.
- Allows experimentation/comparisons to find the most effective partitioner for the application.
- Advantage of toolkit approach.
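The "change one parameter to switch methods" idea can be sketched as a lookup table of partitioners keyed by a string parameter. Everything here is an illustrative assumption: the `METHODS` table, the `"method"` parameter name, and the two trivial stand-in partitioners (`block_partition`, `cyclic_partition`) are not taken from Zoltan.

```python
from typing import Callable, Dict, List

Assignment = Dict[int, int]


def block_partition(ids: List[int], num_parts: int) -> Assignment:
    """Contiguous chunks of near-equal size, in the order the IDs are given."""
    n = len(ids)
    return {gid: i * num_parts // n for i, gid in enumerate(ids)}


def cyclic_partition(ids: List[int], num_parts: int) -> Assignment:
    """Deal the IDs out to parts round-robin."""
    return {gid: i % num_parts for i, gid in enumerate(ids)}


# Hypothetical registry: one string selects the whole algorithm.
METHODS: Dict[str, Callable[[List[int], int], Assignment]] = {
    "BLOCK": block_partition,
    "CYCLIC": cyclic_partition,
}


def partition(ids: List[int], num_parts: int, params: Dict[str, str]) -> Assignment:
    """Dispatch on the 'method' parameter; the caller's code is otherwise unchanged."""
    method = params.get("method", "BLOCK")
    if method not in METHODS:
        raise ValueError(f"unknown partitioning method: {method}")
    return METHODS[method](ids, num_parts)


ids = [10, 11, 12, 13]
# Comparing methods is a loop over parameter values, not a rewrite.
results = {name: partition(ids, 2, {"method": name}) for name in METHODS}
print(results)
```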
8. Zoltan Suite of Partitioning Algorithms
- Recursive Coordinate Bisection (Berger, Bokhari)
- Recursive Inertial Bisection
- ParMETIS (Karypis, Schloegel, Kumar)
- Jostle (Walshaw)
- Space Filling Curves (Peano, Hilbert)
- Refinement-tree Partitioning (Mitchell)
- Octree Partitioning (Loy, Flaherty)
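As a concrete picture of the first method in the list, here is a minimal serial sketch of recursive coordinate bisection. It makes simplifying assumptions that Zoltan's parallel implementation does not: unit weights for every object, a cut along the dimension of largest extent, and a cut position chosen by sorting. The function name `rcb` and its arguments are illustrative only.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def rcb(ids: List[int], coords: Dict[int, Point], num_parts: int,
        first_part: int = 0) -> Dict[int, int]:
    """Assign each ID a part in [first_part, first_part + num_parts)."""
    if num_parts == 1 or not ids:
        return {gid: first_part for gid in ids}

    # Pick the coordinate direction in which this subset is most spread out.
    dims = len(coords[ids[0]])

    def extent(d: int) -> float:
        values = [coords[gid][d] for gid in ids]
        return max(values) - min(values)

    cut_dim = max(range(dims), key=extent)

    # Cut the sorted objects so each side gets a share of the objects
    # proportional to the number of parts it will be divided into.
    ordered = sorted(ids, key=lambda gid: coords[gid][cut_dim])
    left_parts = num_parts // 2
    split = len(ordered) * left_parts // num_parts

    result = rcb(ordered[:split], coords, left_parts, first_part)
    result.update(rcb(ordered[split:], coords, num_parts - left_parts,
                      first_part + left_parts))
    return result


# A 4 x 2 grid of points, numbered column by column, split into 4 parts.
coords = {x * 2 + y: (float(x), float(y)) for x in range(4) for y in range(2)}
parts = rcb(list(coords), coords, 4)
print(parts)
```

Because each cut is a plane perpendicular to one coordinate axis, the method needs only coordinates, not connectivity, which is what makes it cheap and incremental.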
9. SLAC SciDAC project
55-cell linear accelerator with couplers, 1,122,445 elements (H60VG3). Courtesy of Michael Wolf, SLAC.
- Tau3P: electromagnetic field solver (SLAC)
- Kwok Ko, N. Folwell, M. Wolf (SLAC); K. Devine (SNL); A. Pinar (LBL).
- Long simulation times
- Tens of thousands of CPU hours
- Communication cost dominates
- Need high-quality static partitioning
10. Several Partitioning Methods
11. RCB-1D Partitioning
12. 5-Cell RDDS (32 processors) Partitioning
Method        Tau3P Runtime   Max Adj. Procs   Sum Adj. Procs   Max Bound. Objs   Sum Bound. Objs
ParMETIS      165.5 s         8                134              731               16405
RCB-1D (z)    67.7 s          3                66               2683              63510
RCB-3D        373.2 s         10               208              1404              24321
RIB-3D        266.8 s         8                162              808               20156
HSFC-3D       272.2 s         10               202              1279              26684
2.0 ns runtime; IBM SP3 (NERSC)
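The table's columns can be reproduced on a toy problem. The sketch below assumes the following readings of the column headers, which the slides do not spell out: a part's "adjacent processors" are the other parts it shares at least one dependency with, and its "boundary objects" are its objects with at least one neighbor in another part. The function `partition_metrics` and the 16 x 4 test grid are illustrative, not Tau3P data.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def partition_metrics(part: Dict[Hashable, int],
                      edges: Iterable[Tuple[Hashable, Hashable]],
                      num_parts: int) -> Dict[str, int]:
    """Max/sum of adjacent parts and of boundary objects over all parts."""
    adjacent: List[Set[int]] = [set() for _ in range(num_parts)]
    boundary: List[Set[Hashable]] = [set() for _ in range(num_parts)]
    for u, v in edges:
        pu, pv = part[u], part[v]
        if pu != pv:  # this dependency crosses a part boundary
            adjacent[pu].add(pv)
            adjacent[pv].add(pu)
            boundary[pu].add(u)
            boundary[pv].add(v)
    adj_counts = [len(s) for s in adjacent]
    bnd_counts = [len(s) for s in boundary]
    return {
        "max_adj_procs": max(adj_counts), "sum_adj_procs": sum(adj_counts),
        "max_bound_objs": max(bnd_counts), "sum_bound_objs": sum(bnd_counts),
    }


# A long, thin 16 x 4 grid with nearest-neighbor dependencies, 8 parts.
NX, NY = 16, 4
cells = [(x, y) for x in range(NX) for y in range(NY)]
edges = [((x, y), (x + 1, y)) for x in range(NX - 1) for y in range(NY)]
edges += [((x, y), (x, y + 1)) for x in range(NX) for y in range(NY - 1)]

# 1D partition: slabs along the long axis (two columns per part).
slabs = {(x, y): x // 2 for x, y in cells}
# 2D partition: 4 x 2 arrangement of compact blocks.
blocks = {(x, y): (x // 4) * 2 + (y // 2) for x, y in cells}

slab_metrics = partition_metrics(slabs, edges, 8)
block_metrics = partition_metrics(blocks, edges, 8)
print(slab_metrics)
print(block_metrics)
```

On this toy grid the slab partition has fewer adjacent parts but more boundary objects than the block partition, which is the same qualitative pattern the RCB-1D row shows against the other methods in the table.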
13. Coupler Port Grouping Complication
14. U Partitioning of 5-cell (32 processors)
15. Tau3P Speedup
16. Summary
55-cell linear accelerator with couplers, 1,122,445 elements. Courtesy of Michael Wolf, SLAC.
- Don't blindly use a graph partitioner
- In this case, 1-D RCB is much better
- Performance is sensitive to the number of adjacent processors (not edge cut in graph)
- Zoltan toolkit
- Provides easy access to several algorithms
- Zoltan's 1D geometric partitioner reduced runtime by up to 68% on a 512-processor IBM SP3.
17. For More Zoltan Information...
- Zoltan Home Page
- http://www.cs.sandia.gov/Zoltan
- User's and Developer's Guides
- Download Zoltan software under GNU LGPL.
- Email
- zoltan_at_cs.sandia.gov
19. Applications: Adaptive Mesh Refinement
- Dynamic load balancing.
- Redistribute elements after mesh refinement.
- Keep data movement costs low.
- Recursive Coordinate Bisection
- Parent and child elements assigned to same processor.
- Inexpensive.
- Incremental.
Using RCB with AMR in SIERRA (Edwards, Rath,
Lober, et al., Sandia)
20. U Partitioning vs. Z Partitioning
[Figure: run time (s) and max. adjacent processors vs. number of processors, for RCB-1D-Z and RCB-1D-U partitioning.]