Title: Combining Shared and Distributed Memory Models: Approach and Evolution of the Global Arrays Toolkit
1. Combining Shared and Distributed Memory Models: Approach and Evolution of the Global Arrays Toolkit
- Jarek Nieplocha
- Robert Harrison, Manoj Kumar Krishnan
- Bruce Palmer, Vinod Tipparaju, Harold Trease
- Pacific Northwest National Laboratory
2. Overview
- Background
- Programming Model
- Core Capabilities
- Recent Work
- Conclusions
3. Global Address Space and One-Sided Communication
- The global address space is the collection of address spaces of all processes in a parallel job; a datum is identified by the pair (address, pid)
4. Global Arrays Data Model
Physically distributed data:
- shared memory model in the context of distributed dense arrays
- complete environment for parallel code development
- compatible with MPI
- data locality control similar to the distributed memory/message-passing model
- extensible
- single, shared data structure with global indexing
  - e.g., A(4,3) rather than buf(7) on task 2 (see the sketch below)
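A minimal sketch of what global indexing buys, in the C API (my example; assumes the standard GA calls GA_Initialize, NGA_Create, and NGA_Get, with a hypothetical 100x100 double array): any task can read a patch by its global coordinates without knowing which task owns it.

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();

        int dims[2] = {100, 100};
        /* NULL chunk lets GA choose the data distribution */
        int g_a = NGA_Create(C_DBL, 2, dims, "A", NULL);
        GA_Zero(g_a);

        /* global zero-based corners of a 2x2 patch - not local
           buffer offsets on whichever task holds the data */
        int lo[2] = {3, 2}, hi[2] = {4, 3};
        double buf[4];
        int ld[1] = {2};
        NGA_Get(g_a, lo, hi, buf, ld);   /* one-sided read */

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }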
5. Global Array Model of Computations
6. Example: Matrix Multiply
(Diagram: patches of the global arrays representing the matrices are fetched with ga_get into local buffers on each processor, multiplied locally with dgemm, and the results added back into the product array with ga_acc. A code sketch of this pattern follows.)
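A simplified C sketch of that get/dgemm/acc pattern (my example, not GA's actual distributed multiply: it assumes square N x N arrays, a distribution in which every process owns one patch of C, and a Fortran BLAS dgemm_ for the local multiply; g_c must be zeroed beforehand):

    #include <mpi.h>
    #include <stdlib.h>
    #include "ga.h"
    #include "macdecls.h"

    #define N 256

    /* Fortran BLAS routine used for the local multiply */
    extern void dgemm_(const char*, const char*, const int*, const int*,
                       const int*, const double*, const double*, const int*,
                       const double*, const int*, const double*, double*,
                       const int*);

    void matmul(int g_a, int g_b, int g_c) {
        int lo[2], hi[2];
        NGA_Distribution(g_c, GA_Nodeid(), lo, hi);  /* my patch of C */
        int m = hi[0] - lo[0] + 1, n = hi[1] - lo[1] + 1, k = N;

        double *a = malloc(m * k * sizeof(double));
        double *b = malloc(k * n * sizeof(double));
        double *c = malloc(m * n * sizeof(double));

        /* ga_get: fetch the row panel of A and the column panel of B */
        int alo[2] = {lo[0], 0}, ahi[2] = {hi[0], N - 1}, ald[1] = {k};
        NGA_Get(g_a, alo, ahi, a, ald);
        int blo[2] = {0, lo[1]}, bhi[2] = {N - 1, hi[1]}, bld[1] = {n};
        NGA_Get(g_b, blo, bhi, b, bld);

        /* dgemm on the local buffers; the row-major patches are passed
           as their column-major transposes, so compute C^T = B^T A^T */
        double one = 1.0, zero = 0.0;
        dgemm_("N", "N", &n, &m, &k, &one, b, bld, a, ald, &zero, c, &n);

        /* ga_acc: add the local result into the global product */
        double alpha = 1.0;
        int cld[1] = {n};
        NGA_Acc(g_c, lo, hi, c, cld, &alpha);
        GA_Sync();

        free(a); free(b); free(c);
    }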
7. Comparison to Other Models
8. Structure of GA
- application interfaces: Fortran 77, C, C++, Python
- distributed arrays layer: memory management, index translation
- message passing: process creation, run-time environment
- ARMCI: portable one-sided communication (put, get, locks, etc.)
- system-specific interfaces: LAPI, GM/Myrinet, threads, VIA, ...
9. Core Capabilities
- Distributed array library
  - dense arrays, 1-7 dimensions
  - four data types: integer, real, double precision, double complex
  - global rather than per-task view of data structures
  - user control over data distribution: regular and irregular
- Collective and shared-memory style operations
- Interfaces to third-party parallel numerical libraries
  - PeIGS, ScaLAPACK, SUMMA, TAO (under development)
  - example: solving a linear system using LU factorization
    - call ga_lu_solve(g_a, g_b)
  - instead of
    - call pdgetrf(n, m, locA, p, q, dA, ind, info)
    - call pdgetrs(trans, n, mb, locA, p, q, dA, dB, info)
10. Performance
- Performance model for "shared memory" data access
  - array index translation: e.g., 1.2 µs on Linux/PIII
  - overhead of one or more ARMCI put/get/... calls, which is either
    - a direct mapping to native RMA calls (e.g., 3 µs on the Cray T3E), or
    - a simple shared memory access (e.g., 0.3 µs on Linux/PIII), or
    - more complex, due to Active Message style implementations, e.g., 12 µs (put) and 37 µs (get) on Linux/PIII with Myrinet
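Putting the two components together (notation mine): the cost of one "shared memory" access is roughly t_access ≈ t_index + t_ARMCI. With the numbers above, a shared-memory get on Linux/PIII costs about 1.2 + 0.3 ≈ 1.5 µs, while a get over Myrinet costs about 1.2 + 37 ≈ 38 µs.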
11. Application Areas
- electronic structure
- biology
- glass flow simulation
- visualization and image analysis
- thermal flow simulation
- material sciences
- molecular dynamics
- others: financial security forecasting, astrophysics, geosciences
12. Major Milestones
- 1994 - 1st public release of GA
- 1995 - Metacomputing (grid) extensions of GA
- 1996 - DRA (parallel I/O for GA programs) developed
- 1997 - development of ARMCI started
- 1998 - GA rewritten to use ARMCI
- 1999 - GA 3.0 released, n-dimensional arrays
- 2000 - periodic one-sided operations
- 2001 - support for sparse data management
- 2002 - ghost cell operations, n-dim DRA
13. Ghost Cells
(Diagram: a normal global array next to a global array padded with ghost cells.)
- Operations (see the sketch below)
  - NGA_Create_ghosts - creates an array with ghost cells
  - GA_Update_ghosts - updates ghost cells with data from adjacent processors
  - NGA_Access_ghosts - provides access to the local ghost cell elements
- Embedded synchronization - controlled by the user
- Multi-protocol implementation to match platform characteristics
  - e.g., MPI + shared memory on the IBM SP, SHMEM on the Cray T3E
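A minimal usage sketch of those three operations (C API; the signatures reflect my reading of the GA documentation, so treat them as assumptions): create a 2-D array with a one-element ghost rim, refresh the rim from the neighbors, then work on the local patch.

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();

        int dims[2]  = {512, 512};
        int width[2] = {1, 1};      /* ghost-cell width per dimension */
        int g_a = NGA_Create_ghosts(C_DBL, 2, dims, width, "u", NULL);
        GA_Zero(g_a);

        GA_Update_ghosts(g_a);      /* fill the rim from adjacent processes */

        int ldims[2], ld[1];
        double *u;
        NGA_Access_ghosts(g_a, ldims, &u, ld);  /* local patch incl. ghosts */
        /* ... stencil sweep over the interior of u ... */

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }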
14. Update Algorithms
- Standard algorithm: 3^D - 1 messages (D = number of array dimensions)
- Shift algorithm: 2D messages, performed in phases
(Diagram: 1st and 2nd phases of the shift exchange.)
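Worked example (mine) for a 3-D array: the standard algorithm sends 3^3 - 1 = 26 messages per process, one for each neighboring patch including edge and corner neighbors, while the shift algorithm needs only 2*3 = 6, because exchanging whole faces one dimension at a time lets corner data ride along in later phases.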
15. Disk Resident Arrays
- Extend the GA model to disk
  - a system similar to Panda (UIUC), but with higher-level APIs
- Provide easy transfer of data between N-dim arrays stored on disk and stored in memory (see the sketch below)
- Use when
  - arrays are too big to store in core
  - checkpoint/restart
  - out-of-core solvers
(Diagram: an image processing application moving data between a disk resident array and a global array.)
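A checkpoint-style sketch of the disk transfer (C API; the DRA function names and signatures here are my reading of the DRA documentation and should be treated as assumptions; DRA_Init is presumed to have been called after GA_Initialize):

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"
    #include "dra.h"

    void checkpoint(int g_a) {
        int type, ndim, dims[2], d_a, req;
        NGA_Inquire(g_a, &type, &ndim, dims);

        /* disk resident array shaped like g_a; the write is asynchronous */
        NDRA_Create(type, ndim, dims, "ckpt", "ckpt.dra",
                    DRA_RW, dims, &d_a);
        NDRA_Write(g_a, d_a, &req);
        DRA_Wait(req);              /* block until the transfer completes */
        DRA_Close(d_a);
    }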
16. Scalable Performance of DRA
(Diagram: processors on each SMP node streaming data through I/O buffers to multiple file systems.)
17. Common Component Architecture
- A component model specifically designed for HPC
- Three parts: components, ports, and frameworks
- Components
  - are peers
  - interact through well-defined interfaces (ports)
    - in an OO language, a port is a class
    - in Fortran, a port is a bunch of subroutines
  - a component may provide a port - implement the class/subroutines
  - another component may use that port - call methods in the port (see the sketch after this list)
- A framework holds the components and composes them into applications
- Advantages: reusable functionality, well-defined interfaces, etc.
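A deliberately tiny illustration of the provides/uses idea in plain C (hypothetical, not the real CCA API; GlobalArrayPort is just a name borrowed from the next slide): a port is a table of functions that one component implements and another calls through, with the framework doing the wiring.

    #include <stdio.h>

    /* the port: a well-defined interface, here a struct of function pointers */
    typedef struct {
        int  (*create)(int ndim, const int *dims, const char *name);
        void (*destroy)(int handle);
    } GlobalArrayPort;

    /* providing component: implements the port */
    static int  ga_create(int ndim, const int *dims, const char *name) {
        printf("creating %d-d array %s\n", ndim, name);
        return 42;                        /* dummy handle */
    }
    static void ga_destroy(int handle) { printf("destroying %d\n", handle); }
    static const GlobalArrayPort ga_port = { ga_create, ga_destroy };

    /* using component: calls through the port, never the provider's internals */
    static void application(const GlobalArrayPort *ga) {
        int dims[2] = {100, 100};
        int h = ga->create(2, dims, "A");
        ga->destroy(h);
    }

    int main(void) {
        application(&ga_port);  /* the framework's wiring job, done by hand */
        return 0;
    }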
18. Global Array CCA Component
(Diagram: the GA component registers its ports with the CCA Services via addProvidesPort(ga) and addProvidesPort(dadf), exposing port instances ga and dadf of port classes GlobalArrayPort and DADFPort; the application component declares registerUsesPort(ga) and registerUsesPort(dadf), then connects to them with getPort(ga) and getPort(dadf).)
19. CCA Elements
(Diagram: an application component and the GA component, connected through the GlobalArrayPort, DADFPort, GoPort, and the well-known CCA Services ports, all hosted in CCAFFEINE, the CCA framework.)
20. GA++
- GA++ is a C++ class library for Global Arrays
- GA++ classes: GAservices and GlobalArray
  - GAservices: initialization, termination, inter-process synchronization, etc.
  - GlobalArray: one-sided (get/put), collective array, and utility operations
- Typical usage:

    GAservices gs;
    gs.initialize();
    GlobalArray *ga = gs.createGA();
    // ... do work ...
    ga->destroy();
    gs.terminate();
21. Sparse Data Management
- Sparse arrays can be implemented with
  - 1-dimensional global arrays
  - nonzero elements, row and/or column index arrays
- Set of new operations that follow the Thinking Machines CMSSL (see the sketch below)
  - Enumerate
  - Pack/unpack
  - Binning (NxM mapping)
  - 2-key binning/sorting functions
  - Scatter_with_OP, where OP = +, min, max
  - Segmented_scan_with_OP, where OP = +, min, max, copy
- Adopted in the NWPhys/NWGrid AMR package
- Next step - explicit sparse format
  - needs more application experience - too many degrees of freedom
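A sketch of the 1-D representation plus pack (C API; GA_Pack's signature here is my reading of the GA documentation, so treat it as an assumption): a 0/1 mask array g_sbit marks the nonzeros of g_v, and GA_Pack compresses the flagged elements into a dense 1-D array.

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    /* returns a 1-D global array whose first *icount entries are the
       elements of g_v flagged in the integer 0/1 mask array g_sbit */
    int pack_nonzeros(int g_v, int g_sbit, int n, int *icount) {
        int g_nz = NGA_Create(C_DBL, 1, &n, "nonzeros", NULL);
        GA_Pack(g_v, g_nz, g_sbit, 0, n - 1, icount);
        return g_nz;
    }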
22. Summary and Future
- The basic idea proven successful
  - efficient on a wide range of architectures
  - core operations tuned for high performance
  - library substantially extended, but all original (1994) APIs preserved
  - increasing number of application areas
- Ongoing and future work
  - latency hiding on low-end cluster networks via relaxed memory consistency and replication
  - advanced data structures: sparse arrays and hash tables
  - increased support for HPC community standards: ESI, CCA