UPC and Titanium - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: UPC and Titanium


1
UPC and Titanium
  • Open-source compilers and tools for
  • scalable global address space computing
  • Kathy Yelick
  • University of California, Berkeley and
  • Lawrence Berkeley National Laboratory

2
Outline
  • Global Address Languages in General
  • Distinction between languages and libraries
  • UPC
  • Language overview
  • Berkeley UPC compiler status and microbenchmarks
  • Application benchmarks and plans
  • Titanium
  • Language overview
  • Berkeley Titanium compiler status
  • Application benchmarks and plans

3
Global Address Space Languages
  • Explicitly-parallel programming model with SPMD
    parallelism
  • Fixed at program start-up, typically 1 thread per
    processor
  • Global address space model of memory
  • Allows programmer to directly represent
    distributed data structures
  • Address space is logically partitioned
  • Local vs. remote memory (two-level hierarchy)
  • Programmer control over performance critical
    decisions
  • Data layout and communication
  • Performance transparency and tunability are goals
  • Initial implementation can use fine-grained
    shared memory
  • Suitable for current and future architectures
  • Either shared memory or lightweight messaging is
    key
  • Base languages differ: UPC (C), CAF (Fortran),
    Titanium (Java)

4
Global Address Space
[Diagram: global address space with shared segments X0, X1, ..., XP partitioned across threads, a private memory region per thread, and pointers that may refer to either]
  • The languages share the global address space
    abstraction
  • Shared memory is partitioned by processors
  • Remote memory may stay remote: no automatic
    caching implied
  • One-sided communication through reads/writes of
    shared variables (see the sketch below)
  • Both individual and bulk memory copies
  • The languages differ on details
  • Some models have a separate private memory area
  • Distributed arrays: generality and how they are
    constructed
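
A minimal UPC sketch (added for illustration; not from the slides) of
one-sided communication: each thread writes its own element of a shared
array, then reads a neighbor's element directly, with no matching
receive on the owner. The array name is an assumption.

    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int counter[THREADS];    /* one element with affinity to each thread */

    int main(void) {
        counter[MYTHREAD] = MYTHREAD;   /* local write to my own element */
        upc_barrier;                    /* make all writes visible       */
        /* one-sided remote read: the owning thread takes no action */
        int next = counter[(MYTHREAD + 1) % THREADS];
        printf("Thread %d read %d\n", MYTHREAD, next);
        return 0;
    }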

5
UPC Programming Model Features
  • SPMD parallelism
  • fixed number of images during execution
  • images operate asynchronously
  • Several kinds of array distributions (collected in
    the sketch below)
  • double a[n]: a private n-element array on each
    processor
  • shared double a[n]: an n-element shared array,
    with cyclic mapping
  • shared [4] double a[n]: a block-cyclic array with
    4-element blocks
  • shared [0] double *a = (shared [0] double *)
    upc_alloc(n): a shared array with all elements
    local
  • Pointers for irregular data structures
  • shared double *sp: a pointer to shared data
  • double *lp: a pointer to private data
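
A minimal, compilable sketch (added for illustration) collecting the
declarations above; the array sizes use a multiple of THREADS so they
are also legal when the thread count is not fixed at compile time, and
the names are assumptions.

    #include <upc.h>
    #include <upc_relaxed.h>

    double a_priv[100];                    /* private: one copy per thread            */
    shared double a_cyc[100*THREADS];      /* cyclic: element i on thread i % THREADS */
    shared [4] double a_blk[100*THREADS];  /* block-cyclic with 4-element blocks      */

    int main(void) {
        /* indefinite block size: every allocated element is local to this thread */
        shared [0] double *a_loc =
            (shared [0] double *) upc_alloc(100 * sizeof(double));

        shared double *sp = &a_cyc[0];     /* pointer to shared data  */
        double *lp = a_priv;               /* pointer to private data */

        lp[0] = 2.0;                       /* private access                 */
        sp[MYTHREAD] = lp[0];              /* shared access, possibly remote */
        a_loc[0] = 1.0;                    /* always a local access          */
        upc_free(a_loc);
        return 0;
    }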

6
UPC Programming Model Features
  • Global synchronization
  • upc_barrier: traditional barrier
  • upc_notify/upc_wait: split-phase global
    synchronization
  • Pair-wise synchronization
  • upc_lock/upc_unlock: traditional locks
  • Memory consistency has two types of accesses (both
    appear in the sketch below)
  • Strict: must be performed immediately and
    atomically; typically a blocking round-trip
    message if remote
  • Relaxed: still must preserve dependencies, but
    other processors may view these as happening out
    of order
  • Parallel I/O
  • Based on ideas in MPI I/O
  • Specification for UPC by Thakur, El-Ghazawi, et al.
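
A minimal sketch (added for illustration; the shared counter and flag
are assumptions, not code from the talk) that exercises the
synchronization and consistency features above together.

    #include <upc.h>
    #include <upc_relaxed.h>     /* unqualified shared accesses default to relaxed */

    shared int total;            /* relaxed shared scalar, affinity to thread 0        */
    strict shared int done;      /* strict: accesses are ordered, immediately visible  */
    upc_lock_t *lock;            /* each thread's private pointer to one shared lock   */

    int main(void) {
        if (MYTHREAD == 0) total = 0;
        lock = upc_all_lock_alloc();   /* collective: all threads get the same lock */
        upc_barrier;                   /* traditional global barrier */

        upc_lock(lock);                /* pair-wise synchronization around the update */
        total += MYTHREAD;
        upc_unlock(lock);

        upc_notify;                    /* split-phase barrier: signal arrival ...     */
        /* ... independent local work could be overlapped here ... */
        upc_wait;                      /* ... then wait for all other threads         */

        if (MYTHREAD == 0) done = 1;   /* strict write: a blocking round trip if remote */
        return 0;
    }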

7
Berkeley UPC Compiler
  • Compiler based on Open64
  • Recently merged Rice sources
  • Multiple front-ends, including gcc
  • Intermediate form called WHIRL
  • Current focus on C backend
  • IA64 possible in future
  • UPC Runtime
  • Pointer representation
  • Shared/distributed memory
  • Communication in GASNet
  • Portable
  • Language-independent

[Diagram: compiler flow from UPC source through Higher WHIRL (optimizing transformations) and Lower WHIRL, emitting either C plus the runtime or assembly (IA64, MIPS, ...) plus the runtime]
8
Design for Portability & Performance
  • UPC-to-C translator
  • Translates UPC to C; inserts runtime calls for
    parallel features
  • UPC runtime
  • Allocates shared data; implements
    pointers-to-shared
  • GASNet
  • A uniform interface for low-level communication
    primitives
  • Portability
  • C is our intermediate language
  • GASNet is itself layered with a small core as the
    essential part
  • High-Performance
  • Native C compiler optimizes serial code
  • Translator can perform communication
    optimizations
  • GASNet can access network directly

9
Berkeley UPC Compiler Status
  • UPC Extensions added to front-end
  • Code-generation complete
  • Some issues related to code quality (hints to
    backend compilers)
  • GASNet communication layer
  • Running on Quadrics/Elan, IBM/LAPI, Myrinet/GM,
    and MPI
  • Optimized for small non-blocking messages and
    compiled code
  • Next step: strided and indexed put/get, leveraging
    ARMCI work
  • UPC Runtime layer
  • Developed and tested on all GASNet
    implementations
  • Supports multiple pointer representations
  • Next step: direct shared memory support
  • Release scheduled for later this month
  • A glitch related to include files and usability
    remains to be ironed out

10
Pointer-to-Shared Representation
  • UPC has three different kinds of pointers
  • Block-cyclic, cyclic, and indefinite (always
    local)
  • A pointer needs a phase to keep track of where
    it is within a block
  • Source of overhead for updating and
    de-referencing
  • Consumes space in the pointer
  • Our runtime has special cases for
  • Phaseless (cyclic and indefinite): skip the phase
    update
  • Indefinite: also skip the thread-id update
  • Pointer size/representation is easily reconfigured
  • 64 bits on small machines, 128 on large; word or
    struct (a sketch of such a representation follows)
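
A hypothetical sketch of a struct-based pointer-to-shared (the Berkeley
runtime's actual, configurable layout differs) showing why phaseless
pointers are cheaper to advance; all names are illustrative.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {        /* hypothetical 128-bit representation */
        uint64_t addr;      /* offset within the owning thread's shared segment */
        uint32_t thread;    /* thread the referenced element has affinity to    */
        uint32_t phase;     /* position within the current block (block-cyclic) */
    } pts_t;

    /* Cyclic (block size 1) is phaseless: only thread and address change. */
    pts_t pts_add_cyclic(pts_t p, uint64_t i, size_t elemsize, unsigned nthreads) {
        p.addr  += ((p.thread + i) / nthreads) * elemsize;
        p.thread = (uint32_t)((p.thread + i) % nthreads);
        return p;
    }

    /* Indefinite (always local) also skips the thread update. */
    pts_t pts_add_indefinite(pts_t p, uint64_t i, size_t elemsize) {
        p.addr += i * elemsize;
        return p;
    }

    /* A general block-cyclic add must additionally carry the phase across
       block and thread boundaries, which is the extra per-access cost. */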

11
Preliminary Performance
  • Testbed
  • Compaq AlphaServer, with Quadrics GASNet conduit
  • Compaq C compiler for the translated C code
  • Microbenchmarks
  • Measure the cost of UPC language features and
    constructs
  • Shared pointer arithmetic, barrier, allocation,
    etc.
  • Vector addition: no remote communication
  • NAS Parallel Benchmarks
  • EP: no communication
  • IS: large bulk memory operations
  • MG: bulk memput
  • CG: fine-grained vs. bulk memput

12
Performance of Shared Pointer Arithmetic
  • Phaseless pointers are an important optimization
  • Indefinite pointers almost as fast as regular C
    pointers
  • General block-cyclic pointer: 7x slower for
    addition
  • Competitive with the HP compiler, which generates
    native code
  • Both compilers have known opportunities for
    improvement

13
Cost of Shared Memory Access
  • Local shared accesses somewhat slower than
    private ones
  • HP has improved local performance in newer
    version
  • Remote accesses worse than local, as expected
  • Runtime/GASNet layering for portability is not a
    problem

14
NAS PB EP
  • EP (Embarrassingly Parallel) has no communication
  • Serial performance via C code generation is not a
    problem

15
NAS PB IS
  • IS (Integer Sort) is dominated by bulk
    communication
  • GASNet bulk communication adds no measurable
    overhead

16
NAS PB MG
  • MG (Multigrid) involves medium-sized bulk copies
  • Berkeley reveals a slight serial performance
    degradation due to casts
  • Berkeley-C uses the original C code for the inner
    loops

17
Scaling MG on the T3E
  • Scalability of the language shown here for the
    T3E compiler
  • Direct shared memory support is probably needed
    to be competitive on most current machines

18
Mesh Generation in UPC
  • Parallel Mesh Generation in UPC
  • 2D Delaunay triangulation
  • Based on Triangle software by Shewchuk (UCB)
  • Parallel version from NERSC uses dynamic load
    balancing, software caching, and parallel sorting

19
Research in Optimizations
  • Privatizing accesses to local memory (illustrated
    in the sketch after this list)
  • In conjunction with elimination of forall-loop
    affinity tests
  • Communication optimizations
  • Separate get/put from sync; exploit the
    split-phase barrier
  • Message aggregation (fine-grained to bulk)
  • Software caching
  • Research problems
  • Optimization selection based on a performance
    model
  • Language research on the UPC memory consistency
    model
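
A hand-written illustration (added here; not compiler output) of
privatizing local accesses and eliminating the upc_forall affinity test
for a trivial blocked loop; array names and sizes are assumptions.

    #include <upc_relaxed.h>

    #define B 100
    shared [B] double a[B*THREADS], b[B*THREADS];

    /* As written: the affinity test &a[i] is evaluated every iteration and
       every element is accessed through a pointer-to-shared. */
    void scale_naive(double s) {
        int i;
        upc_forall (i = 0; i < B*THREADS; i++; &a[i])
            a[i] = s * b[i];
    }

    /* After privatization: iterate only over the block this thread owns and
       cast it to ordinary C pointers (legal because the block is local). */
    void scale_privatized(double s) {
        double *la = (double *) &a[MYTHREAD * B];
        double *lb = (double *) &b[MYTHREAD * B];
        for (int i = 0; i < B; i++)
            la[i] = s * lb[i];
    }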

20
Preliminary Performance Results
  • UPC communication optimizations
  • Performed by hand
  • Remote fetch-and-increment (not random data)

21
UPC Interactions
  • UPC consortium
  • Tarek El-Ghazawi is coordinator; semi-annual
    meetings, daily e-mail
  • Revised UPC Language Specification (IDA, GWU, ...)
  • UPC Collectives (MTU)
  • UPC I/O Specifications (GWU, ANL-PModels)
  • Other Implementations
  • HP (Alpha cluster and C+MPI compiler (with MTU))
  • MTU (C+MPI compiler based on the HP compiler,
    memory model)
  • Cray (X1 implementation)
  • Intrepid (SGI implementation based on gcc)
  • Etnus (debugging)
  • UPC book: T. El-Ghazawi, B. Carlson, T. Sterling,
    K. Yelick
  • Goal is proofs by SC03
  • HPC HPCS Effort
  • Recent interest from Sandia

22
Titanium
  • Based on Java, a cleaner C++
  • classes, automatic memory management, etc.
  • compiled to C and then to a native binary (no JVM)
  • Same parallelism model as UPC and CAF
  • SPMD with a global address space
  • Dynamic Java threads are not supported
  • Optimizing compiler
  • static (compile-time) optimizer, not a JIT
  • communication and memory optimizations
  • synchronization analysis (e.g. static barrier
    analysis)
  • cache and other uniprocessor optimizations

23
Summary of Features Added to Java
  1. Scalable parallelism (Java threads replaced)
  2. Immutable (value) classes
  3. Multidimensional arrays with unordered iteration
  4. Checked Synchronization
  5. Operator overloading
  6. Templates
  7. Zone-based memory management (regions)
  8. Libraries for collective communication,
    distributed arrays, bulk I/O

24
Immutable Classes in Titanium
  • For small objects, would sometimes prefer
  • to avoid the level of indirection
  • to pass by value (copy the entire object)
  • especially when immutable -- fields never
    modified
  • Example
  • immutable class Complex {
  •   Complex() { real = 0; imag = 0; }
  •   Complex operator+(Complex c) { ... }
  • }
  • Complex c1 = new Complex(7.1, 4.3);
  • c1 = c1 + c1;
  • Addresses performance and programmability
  • Similar to structs in C (not C++ classes) in
    terms of performance
  • Adds support for complex types

25
Multidimensional Arrays
  • Arrays in Java are objects
  • Array bounds are checked
  • Multidimensional arrays are arrays-of-arrays
  • Safe and general, but potentially slow
  • New kind of multidimensional array added to
    Titanium
  • Sub-arrays are supported (interior, boundary,
    etc.)
  • Indexed by Points (tuple of ints)
  • Combined with unordered iteration to enable
    optimizations
  • foreach (p in A.domain())
  •   A[p] = ...
  • A could be multidimensional, an interior
    region, etc.

26
Communication
  • Titanium has explicit global communication
  • Broadcast, reduction, etc.
  • Primarily used to set up distributed data
    structures
  • Most communication is implicit through the shared
    address space
  • Dereferencing a global reference, g.x, can
    generate communication
  • Arrays have copy operations, which generate bulk
    communication: A1.copy(A2)
  • Automatically computes the intersection of A1's
    and A2's index sets (domains)

27
Distributed Data Structures
  • Building distributed arrays
  • Particle [1d] single [1d] allParticle =
  •   new Particle [0:Ti.numProcs()-1][1d];
  • Particle [1d] myParticle =
  •   new Particle [0:myParticleCount-1];
  • allParticle.exchange(myParticle);
  • Now each processor has an array of pointers, one
    to each processor's chunk of particles

[Diagram: all-to-all broadcast of the per-processor particle arrays among P0, P1, P2]
28
Titanium Compiler Status
  • Titanium compiler runs on almost any machine
  • Requires a C compiler (and a decent C++ compiler
    to build the translator)
  • Pthreads for shared memory
  • Communication layer for distributed memory (or
    hybrid)
  • Recently moved to live on GASNet: obtained GM,
    Elan, and improved LAPI implementations
  • Leverages other PModels work for maintenance
  • Recent language extensions
  • Indexed array copy (scatter/gather style)
  • Non-blocking array copy under development
  • Compiler optimizations
  • Cache optimizations, for loop optimizations
  • Communication optimizations for overlap,
    pipelining, and scatter/gather under development

29
Applications in Titanium
  • Several benchmarks
  • Fluid solvers with Adaptive Mesh Refinement (AMR)
  • Conjugate Gradient
  • 3D Multigrid
  • Unstructured mesh kernel: EM3D
  • Dense linear algebra: LU, MatMul
  • Tree-structured n-body code
  • Finite element benchmark
  • Genetics micro-array selection
  • SciMark serial benchmarks
  • Larger applications
  • Heart simulation
  • Ocean modeling with AMR (in progress)

30
Serial Performance (Pure Java)
  • Several optimizations in Titanium compiler (tc)
    over the past year
  • These codes are all written in pure Java without
    performance extensions

31
AMR for Ocean Modeling
  • Ocean modeling [Wen, Colella]
  • Requires embedded boundaries to model the ocean
    floor/coastline
  • Line vs. point relaxation to handle the 1000 km x
    10 km aspect ratio
  • Results in irregular data structures and array
    accesses
  • Goal for this year
  • Basin scale AMR circulation model
  • Currently a non-adaptive implementation
  • Compiler and language support design

Graphics from Titanium AMR gas dynamics [McCorquodale, Colella]
32
Heart Simulation
  • Immersed Boundary Method [Peskin, MacQueen]
  • Fibers (e.g., heart muscles) modeled by list of
    fiber points
  • Fluid space modeled by a regular lattice
  • Irregular fiber lists need to interact with
    regular fluid lattice
  • Trade-off between load balancing of fibers and
    minimizing communication
  • Memory and communication intensive
  • Random array access is the key performance
    problem
  • Developed compiler optimizations to improve its
    performance
  • Application effort funded by NSF/NPACI

33
Parallel Performance and Scalability
  • Poisson solver using the Method of Local
    Corrections [Balls, Colella]
  • Communication < 5%; scaled speedup is nearly ideal
    (flat)
  • [Plots: IBM SP and Cray T3E]

34
Titanium Interactions
  • GASNet interactions
  • In addition to the application collaborators:
  • Charles Peskin and Dave McQueen, Courant
    Institute
  • Phil Colella and Tong Wen, LBNL
  • Scott Baden and Greg Balls, UCSD
  • Involved in Sun HPCS Effort
  • The GASNet work is common to UPC and Titanium
  • Joint effort between U.C. Berkeley and LBNL
  • (The UPC project is primarily at LBNL; Titanium is
    at U.C. Berkeley)
  • Collaboration with Nieplocha on communication
    runtime
  • Participation in Global Address Space tutorials

35
  • The End
  • http://upc.nersc.gov
  • http://titanium.cs.berkeley.edu/

36
NAS PB CG
  • CG (Conjugate Gradient) can be written naturally
    with fine-grained communication in the sparse
    matrix-vector product
  • Worked well on the T3E (and hopefully will on the
    X1)
  • For other machines, a bulk version is required
    (both styles are sketched below)
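
A minimal sketch (added for illustration; not the NAS CG source)
contrasting the two styles for one sparse row whose nonzero columns all
fall in a single remote thread's block; the block layout and names are
assumptions.

    #include <upc.h>
    #include <upc_relaxed.h>

    #define B 1000                      /* illustrative block size per thread */
    shared [B] double x[B*THREADS];     /* thread t owns x[t*B .. (t+1)*B-1]  */

    /* Fine-grained: natural to write, but each remote element costs a
       small message on most networks. */
    double row_fine(const int *col, const double *val, int nnz) {
        double sum = 0.0;
        for (int k = 0; k < nnz; k++)
            sum += val[k] * x[col[k]];
        return sum;
    }

    /* Bulk: fetch the owning thread's whole block with one upc_memget,
       then index the private copy. */
    double row_bulk(const int *col, const double *val, int nnz, int owner) {
        static double xbuf[B];          /* one private copy per thread */
        upc_memget(xbuf, &x[owner * B], B * sizeof(double));
        double sum = 0.0;
        for (int k = 0; k < nnz; k++)
            sum += val[k] * xbuf[col[k] - owner * B];  /* columns lie in owner's block */
        return sum;
    }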

37
NAS MG in Titanium
  • Preliminary Performance for MG code on IBM SP