Title: Combining Shared and Distributed Memory Models: Approach and Evolution of the Global Arrays Toolkit
1. Combining Shared and Distributed Memory Models: Approach and Evolution of the Global Arrays Toolkit
- Jarek Nieplocha
- Robert Harrison, Manoj Kumar Krishnan
- Bruce Palmer, Vinod Tipparaju, Harold Trease
- Pacific Northwest National Laboratory
2. Overview
- Background
- Programming Model
- Core Capabilities
- Recent Work
- Conclusions
3. Global Address Space and One-Sided Communication
- The global address space is the collection of address spaces of all processes in a parallel job; a datum is identified by the pair (address, pid)
4. Global Arrays Data Model
Physically distributed data:
- shared memory model in the context of distributed dense arrays
- complete environment for parallel code development
- compatible with MPI
- data locality control similar to the distributed memory/message-passing model
- extensible
- single, shared data structure with global indexing
  - e.g., A(4,3) rather than buf(7) on task 2 (see the sketch below)
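A minimal sketch of what global indexing buys, in the C API (my example; assumes the standard GA calls GA_Initialize, NGA_Create, and NGA_Get, with a hypothetical 100x100 double array): any task can read a patch by its global coordinates without knowing which task owns it.

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();

        int dims[2] = {100, 100};
        /* NULL chunk lets GA choose the data distribution */
        int g_a = NGA_Create(C_DBL, 2, dims, "A", NULL);
        GA_Zero(g_a);

        /* global zero-based corners of a 2x2 patch - not local
           buffer offsets on whichever task holds the data */
        int lo[2] = {3, 2}, hi[2] = {4, 3};
        double buf[4];
        int ld[1] = {2};
        NGA_Get(g_a, lo, hi, buf, ld);   /* one-sided read */

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }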
5. Global Array Model of Computations
6. Example: Matrix Multiply
(Diagram: patches of the global arrays representing the matrices are fetched with ga_get into local buffers on each processor, multiplied locally with dgemm, and the results added back into the product array with ga_acc. A code sketch of this pattern follows.)
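A simplified C sketch of that get/dgemm/acc pattern (my example, not GA's actual distributed multiply: it assumes square N x N arrays, a distribution in which every process owns one patch of C, and a Fortran BLAS dgemm_ for the local multiply; g_c must be zeroed beforehand):

    #include <mpi.h>
    #include <stdlib.h>
    #include "ga.h"
    #include "macdecls.h"

    #define N 256

    /* Fortran BLAS routine used for the local multiply */
    extern void dgemm_(const char*, const char*, const int*, const int*,
                       const int*, const double*, const double*, const int*,
                       const double*, const int*, const double*, double*,
                       const int*);

    void matmul(int g_a, int g_b, int g_c) {
        int lo[2], hi[2];
        NGA_Distribution(g_c, GA_Nodeid(), lo, hi);  /* my patch of C */
        int m = hi[0] - lo[0] + 1, n = hi[1] - lo[1] + 1, k = N;

        double *a = malloc(m * k * sizeof(double));
        double *b = malloc(k * n * sizeof(double));
        double *c = malloc(m * n * sizeof(double));

        /* ga_get: fetch the row panel of A and the column panel of B */
        int alo[2] = {lo[0], 0}, ahi[2] = {hi[0], N - 1}, ald[1] = {k};
        NGA_Get(g_a, alo, ahi, a, ald);
        int blo[2] = {0, lo[1]}, bhi[2] = {N - 1, hi[1]}, bld[1] = {n};
        NGA_Get(g_b, blo, bhi, b, bld);

        /* dgemm on the local buffers; the row-major patches are passed
           as their column-major transposes, so compute C^T = B^T A^T */
        double one = 1.0, zero = 0.0;
        dgemm_("N", "N", &n, &m, &k, &one, b, bld, a, ald, &zero, c, &n);

        /* ga_acc: add the local result into the global product */
        double alpha = 1.0;
        int cld[1] = {n};
        NGA_Acc(g_c, lo, hi, c, cld, &alpha);
        GA_Sync();

        free(a); free(b); free(c);
    }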
7. Comparison to Other Models
8. Structure of GA
- application interfaces: Fortran 77, C, C++, Python
- distributed arrays layer: memory management, index translation
- message passing: process creation, run-time environment
- ARMCI: portable one-sided communication (put, get, locks, etc.)
- system-specific interfaces: LAPI, GM/Myrinet, threads, VIA, ...
9. Core Capabilities
- Distributed array library
  - dense arrays, 1-7 dimensions
  - four data types: integer, real, double precision, double complex
  - global rather than per-task view of data structures
  - user control over data distribution: regular and irregular
- Collective and shared-memory style operations
- Interfaces to third-party parallel numerical libraries
  - PeIGS, ScaLAPACK, SUMMA, TAO (under development)
  - example: solving a linear system using LU factorization
    - call ga_lu_solve(g_a, g_b)
  - instead of
    - call pdgetrf(n, m, locA, p, q, dA, ind, info)
    - call pdgetrs(trans, n, mb, locA, p, q, dA, dB, info)
10. Performance
- Performance model for "shared memory" data access
  - array index translation: e.g., 1.2 µs on Linux/PIII
  - overhead of one or more ARMCI put/get/... calls, which is either
    - a direct mapping to native RMA calls (e.g., 3 µs on the Cray T3E), or
    - a simple shared memory access (e.g., 0.3 µs on Linux/PIII), or
    - more complex, due to Active Message style implementations, e.g., 12 µs (put) and 37 µs (get) on Linux/PIII with Myrinet
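Putting the two components together (notation mine): the cost of one "shared memory" access is roughly t_access ≈ t_index + t_ARMCI. With the numbers above, a shared-memory get on Linux/PIII costs about 1.2 + 0.3 ≈ 1.5 µs, while a get over Myrinet costs about 1.2 + 37 ≈ 38 µs.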
11. Application Areas
- electronic structure
- biology
- glass flow simulation
- visualization and image analysis
- thermal flow simulation
- material sciences
- molecular dynamics
- others: financial security forecasting, astrophysics, geosciences
12. Major Milestones
- 1994 - 1st public release of GA
- 1995 - Metacomputing (grid) extensions of GA
- 1996 - DRA (parallel I/O for GA programs) developed
- 1997 - development of ARMCI started
- 1998 - GA rewritten to use ARMCI
- 1999 - GA 3.0 released, n-dimensional arrays
- 2000 - periodic one-sided operations
- 2001 - support for sparse data management
- 2002 - ghost cell operations, n-dim DRA
13. Ghost Cells
(Diagram: a normal global array next to a global array padded with ghost cells.)
- Operations (see the sketch below)
  - NGA_Create_ghosts - creates an array with ghost cells
  - GA_Update_ghosts - updates ghost cells with data from adjacent processors
  - NGA_Access_ghosts - provides access to the local ghost cell elements
- Embedded synchronization - controlled by the user
- Multi-protocol implementation to match platform characteristics
  - e.g., MPI + shared memory on the IBM SP, SHMEM on the Cray T3E
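A minimal usage sketch of those three operations (C API; the signatures reflect my reading of the GA documentation, so treat them as assumptions): create a 2-D array with a one-element ghost rim, refresh the rim from the neighbors, then work on the local patch.

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();

        int dims[2]  = {512, 512};
        int width[2] = {1, 1};      /* ghost-cell width per dimension */
        int g_a = NGA_Create_ghosts(C_DBL, 2, dims, width, "u", NULL);
        GA_Zero(g_a);

        GA_Update_ghosts(g_a);      /* fill the rim from adjacent processes */

        int ldims[2], ld[1];
        double *u;
        NGA_Access_ghosts(g_a, ldims, &u, ld);  /* local patch incl. ghosts */
        /* ... stencil sweep over the interior of u ... */

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }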
14. Update Algorithms
- Standard algorithm: 3^D - 1 messages (D = number of array dimensions)
- Shift algorithm: 2D messages, performed in phases
(Diagram: 1st and 2nd phases of the shift exchange.)
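Worked example (mine) for a 3-D array: the standard algorithm sends 3^3 - 1 = 26 messages per process, one for each neighboring patch including edge and corner neighbors, while the shift algorithm needs only 2*3 = 6, because exchanging whole faces one dimension at a time lets corner data ride along in later phases.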
15. Disk Resident Arrays
- Extend the GA model to disk
  - a system similar to Panda (UIUC), but with higher-level APIs
- Provide easy transfer of data between N-dim arrays stored on disk and stored in memory (see the sketch below)
- Use when
  - arrays are too big to store in core
  - checkpoint/restart
  - out-of-core solvers
(Diagram: an image processing application moving data between a disk resident array and a global array.)
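A checkpoint-style sketch of the disk transfer (C API; the DRA function names and signatures here are my reading of the DRA documentation and should be treated as assumptions; DRA_Init is presumed to have been called after GA_Initialize):

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"
    #include "dra.h"

    void checkpoint(int g_a) {
        int type, ndim, dims[2], d_a, req;
        NGA_Inquire(g_a, &type, &ndim, dims);

        /* disk resident array shaped like g_a; the write is asynchronous */
        NDRA_Create(type, ndim, dims, "ckpt", "ckpt.dra",
                    DRA_RW, dims, &d_a);
        NDRA_Write(g_a, d_a, &req);
        DRA_Wait(req);              /* block until the transfer completes */
        DRA_Close(d_a);
    }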
16. Scalable Performance of DRA
(Diagram: processors on each SMP node streaming data through I/O buffers to multiple file systems.)
17. Common Component Architecture
- A component model specifically designed for HPC
- Three parts: components, ports, and frameworks
- Components
  - are peers
  - interact through well-defined interfaces (ports)
    - in an OO language, a port is a class
    - in Fortran, a port is a bunch of subroutines
  - a component may provide a port - implement the class/subroutines
  - another component may use that port - call methods in the port (see the sketch after this list)
- A framework holds the components and composes them into applications
- Advantages: reusable functionality, well-defined interfaces, etc.
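A deliberately tiny illustration of the provides/uses idea in plain C (hypothetical, not the real CCA API; GlobalArrayPort is just a name borrowed from the next slide): a port is a table of functions that one component implements and another calls through, with the framework doing the wiring.

    #include <stdio.h>

    /* the port: a well-defined interface, here a struct of function pointers */
    typedef struct {
        int  (*create)(int ndim, const int *dims, const char *name);
        void (*destroy)(int handle);
    } GlobalArrayPort;

    /* providing component: implements the port */
    static int  ga_create(int ndim, const int *dims, const char *name) {
        printf("creating %d-d array %s\n", ndim, name);
        return 42;                        /* dummy handle */
    }
    static void ga_destroy(int handle) { printf("destroying %d\n", handle); }
    static const GlobalArrayPort ga_port = { ga_create, ga_destroy };

    /* using component: calls through the port, never the provider's internals */
    static void application(const GlobalArrayPort *ga) {
        int dims[2] = {100, 100};
        int h = ga->create(2, dims, "A");
        ga->destroy(h);
    }

    int main(void) {
        application(&ga_port);  /* the framework's wiring job, done by hand */
        return 0;
    }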
18. Global Array CCA Component
(Diagram: the GA component registers its ports with the CCA Services via addProvidesPort(ga) and addProvidesPort(dadf), exposing port instances ga and dadf of port classes GlobalArrayPort and DADFPort; the application component declares registerUsesPort(ga) and registerUsesPort(dadf), then connects to them with getPort(ga) and getPort(dadf).)
19. CCA Elements
(Diagram: an application component and the GA component, connected through the GlobalArrayPort, DADFPort, GoPort, and the well-known CCA Services ports, all hosted in CCAFFEINE, the CCA framework.)
20. GA++
- GA++ is a C++ class library for Global Arrays
- GA++ classes: GAservices and GlobalArray
  - GAservices: initialization, termination, inter-process synchronization, etc.
  - GlobalArray: one-sided (get/put), collective array, and utility operations
- Typical usage:

    GAservices gs;
    gs.initialize();
    GlobalArray *ga = gs.createGA();
    // ... do work ...
    ga->destroy();
    gs.terminate();
21. Sparse Data Management
- Sparse arrays can be implemented with
  - 1-dimensional global arrays
  - nonzero elements, row and/or column index arrays
- Set of new operations that follow the Thinking Machines CMSSL (see the sketch below)
  - Enumerate
  - Pack/unpack
  - Binning (NxM mapping)
  - 2-key binning/sorting functions
  - Scatter_with_OP, where OP = +, min, max
  - Segmented_scan_with_OP, where OP = +, min, max, copy
- Adopted in the NWPhys/NWGrid AMR package
- Next step - explicit sparse format
  - needs more application experience - too many degrees of freedom
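A sketch of the 1-D representation plus pack (C API; GA_Pack's signature here is my reading of the GA documentation, so treat it as an assumption): a 0/1 mask array g_sbit marks the nonzeros of g_v, and GA_Pack compresses the flagged elements into a dense 1-D array.

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    /* returns a 1-D global array whose first *icount entries are the
       elements of g_v flagged in the integer 0/1 mask array g_sbit */
    int pack_nonzeros(int g_v, int g_sbit, int n, int *icount) {
        int g_nz = NGA_Create(C_DBL, 1, &n, "nonzeros", NULL);
        GA_Pack(g_v, g_nz, g_sbit, 0, n - 1, icount);
        return g_nz;
    }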
22. Summary and Future
- The basic idea proven successful
  - efficient on a wide range of architectures
  - core operations tuned for high performance
  - library substantially extended, but all original (1994) APIs preserved
  - increasing number of application areas
- Ongoing and future work
  - latency hiding on low-end cluster networks via relaxed memory consistency and replication
  - advanced data structures: sparse arrays and hash tables
  - increased support for HPC community standards: ESI, CCA