UPC and Titanium - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: UPC and Titanium


1
UPC and Titanium
  • Open-source compilers and tools for
  • scalable global address space computing
  • Kathy Yelick
  • University of California, Berkeley and
  • Lawrence Berkeley National Laboratory

2
Outline
  • Global Address Languages in General
  • Distinction between languages and libraries
  • UPC
  • Language overview
  • Berkeley UPC compiler status and microbenchmarks
  • Application benchmarks and plans
  • Titanium
  • Language overview
  • Berkeley Titanium compiler status
  • Application benchmarks and plans

3
Global Address Space Languages
  • Explicitly-parallel programming model with SPMD
    parallelism
  • Fixed at program start-up, typically 1 thread per
    processor
  • Global address space model of memory
  • Allows programmer to directly represent
    distributed data structures
  • Address space is logically partitioned
  • Local vs. remote memory (two-level hierarchy)
  • Programmer control over performance critical
    decisions
  • Data layout and communication
  • Performance transparency and tunability are goals
  • Initial implementation can use fine-grained
    shared memory
  • Suitable for current and future architectures
  • Either shared memory or lightweight messaging is
    key
  • Base languages differ: UPC (C), CAF (Fortran),
    Titanium (Java)

4
Global Address Space
[Diagram: global address space with shared segments X0, X1, ..., XP partitioned across threads, a private memory region per thread, and pointers that may refer to either]
  • The languages share the global address space
    abstraction
  • Shared memory is partitioned by processors
  • Remote memory may stay remote: no automatic
    caching implied
  • One-sided communication through reads/writes of
    shared variables (see the sketch below)
  • Both individual and bulk memory copies
  • The languages differ on details
  • Some models have a separate private memory area
  • Distributed arrays: generality and how they are
    constructed
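
A minimal UPC sketch (added for illustration; not from the slides) of
one-sided communication: each thread writes its own element of a shared
array, then reads a neighbor's element directly, with no matching
receive on the owner. The array name is an assumption.

    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int counter[THREADS];    /* one element with affinity to each thread */

    int main(void) {
        counter[MYTHREAD] = MYTHREAD;   /* local write to my own element */
        upc_barrier;                    /* make all writes visible       */
        /* one-sided remote read: the owning thread takes no action */
        int next = counter[(MYTHREAD + 1) % THREADS];
        printf("Thread %d read %d\n", MYTHREAD, next);
        return 0;
    }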

5
UPC Programming Model Features
  • SPMD parallelism
  • fixed number of images during execution
  • images operate asynchronously
  • Several kinds of array distributions (collected in
    the sketch below)
  • double a[n]: a private n-element array on each
    processor
  • shared double a[n]: an n-element shared array,
    with cyclic mapping
  • shared [4] double a[n]: a block-cyclic array with
    4-element blocks
  • shared [0] double *a = (shared [0] double *)
    upc_alloc(n): a shared array with all elements
    local
  • Pointers for irregular data structures
  • shared double *sp: a pointer to shared data
  • double *lp: a pointer to private data
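
A minimal, compilable sketch (added for illustration) collecting the
declarations above; the array sizes use a multiple of THREADS so they
are also legal when the thread count is not fixed at compile time, and
the names are assumptions.

    #include <upc.h>
    #include <upc_relaxed.h>

    double a_priv[100];                    /* private: one copy per thread            */
    shared double a_cyc[100*THREADS];      /* cyclic: element i on thread i % THREADS */
    shared [4] double a_blk[100*THREADS];  /* block-cyclic with 4-element blocks      */

    int main(void) {
        /* indefinite block size: every allocated element is local to this thread */
        shared [0] double *a_loc =
            (shared [0] double *) upc_alloc(100 * sizeof(double));

        shared double *sp = &a_cyc[0];     /* pointer to shared data  */
        double *lp = a_priv;               /* pointer to private data */

        lp[0] = 2.0;                       /* private access                 */
        sp[MYTHREAD] = lp[0];              /* shared access, possibly remote */
        a_loc[0] = 1.0;                    /* always a local access          */
        upc_free(a_loc);
        return 0;
    }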

6
UPC Programming Model Features
  • Global synchronization
  • upc_barrier: traditional barrier
  • upc_notify/upc_wait: split-phase global
    synchronization
  • Pair-wise synchronization
  • upc_lock/upc_unlock: traditional locks
  • Memory consistency has two types of accesses (both
    appear in the sketch below)
  • Strict: must be performed immediately and
    atomically; typically a blocking round-trip
    message if remote
  • Relaxed: still must preserve dependencies, but
    other processors may view these as happening out
    of order
  • Parallel I/O
  • Based on ideas in MPI I/O
  • Specification for UPC by Thakur, El-Ghazawi, et al.
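
A minimal sketch (added for illustration; the shared counter and flag
are assumptions, not code from the talk) that exercises the
synchronization and consistency features above together.

    #include <upc.h>
    #include <upc_relaxed.h>     /* unqualified shared accesses default to relaxed */

    shared int total;            /* relaxed shared scalar, affinity to thread 0        */
    strict shared int done;      /* strict: accesses are ordered, immediately visible  */
    upc_lock_t *lock;            /* each thread's private pointer to one shared lock   */

    int main(void) {
        if (MYTHREAD == 0) total = 0;
        lock = upc_all_lock_alloc();   /* collective: all threads get the same lock */
        upc_barrier;                   /* traditional global barrier */

        upc_lock(lock);                /* pair-wise synchronization around the update */
        total += MYTHREAD;
        upc_unlock(lock);

        upc_notify;                    /* split-phase barrier: signal arrival ...     */
        /* ... independent local work could be overlapped here ... */
        upc_wait;                      /* ... then wait for all other threads         */

        if (MYTHREAD == 0) done = 1;   /* strict write: a blocking round trip if remote */
        return 0;
    }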

7
Berkeley UPC Compiler
  • Compiler based on Open64
  • Recently merged Rice sources
  • Multiple front-ends, including gcc
  • Intermediate form called WHIRL
  • Current focus on C backend
  • IA64 possible in future
  • UPC Runtime
  • Pointer representation
  • Shared/distributed memory
  • Communication in GASNet
  • Portable
  • Language-independent

[Diagram: compiler flow from UPC source through Higher WHIRL (optimizing transformations) and Lower WHIRL, emitting either C plus the runtime or assembly (IA64, MIPS, ...) plus the runtime]
8
Design for Portability & Performance
  • UPC-to-C translator
  • Translates UPC to C; inserts runtime calls for
    parallel features
  • UPC runtime
  • Allocates shared data; implements
    pointers-to-shared
  • GASNet
  • A uniform interface for low-level communication
    primitives
  • Portability
  • C is our intermediate language
  • GASNet is itself layered with a small core as the
    essential part
  • High-Performance
  • Native C compiler optimizes serial code
  • Translator can perform communication
    optimizations
  • GASNet can access network directly

9
Berkeley UPC Compiler Status
  • UPC Extensions added to front-end
  • Code-generation complete
  • Some issues related to code quality (hints to
    backend compilers)
  • GASNet communication layer
  • Running on Quadrics/Elan, IBM/LAPI, Myrinet/GM,
    and MPI
  • Optimized for small non-blocking messages and
    compiled code
  • Next step: strided and indexed put/get, leveraging
    ARMCI work
  • UPC Runtime layer
  • Developed and tested on all GASNet
    implementations
  • Supports multiple pointer representations
  • Next step: direct shared memory support
  • Release scheduled for later this month
  • A glitch related to include files and usability
    remains to be ironed out

10
Pointer-to-Shared Representation
  • UPC has three different kinds of pointers
  • Block-cyclic, cyclic, and indefinite (always
    local)
  • A pointer needs a phase to keep track of where
    it is within a block
  • Source of overhead for updating and
    de-referencing
  • Consumes space in the pointer
  • Our runtime has special cases for
  • Phaseless (cyclic and indefinite): skip the phase
    update
  • Indefinite: also skip the thread-id update
  • Pointer size/representation is easily reconfigured
  • 64 bits on small machines, 128 on large; word or
    struct (a sketch of such a representation follows)
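
A hypothetical sketch of a struct-based pointer-to-shared (the Berkeley
runtime's actual, configurable layout differs) showing why phaseless
pointers are cheaper to advance; all names are illustrative.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {        /* hypothetical 128-bit representation */
        uint64_t addr;      /* offset within the owning thread's shared segment */
        uint32_t thread;    /* thread the referenced element has affinity to    */
        uint32_t phase;     /* position within the current block (block-cyclic) */
    } pts_t;

    /* Cyclic (block size 1) is phaseless: only thread and address change. */
    pts_t pts_add_cyclic(pts_t p, uint64_t i, size_t elemsize, unsigned nthreads) {
        p.addr  += ((p.thread + i) / nthreads) * elemsize;
        p.thread = (uint32_t)((p.thread + i) % nthreads);
        return p;
    }

    /* Indefinite (always local) also skips the thread update. */
    pts_t pts_add_indefinite(pts_t p, uint64_t i, size_t elemsize) {
        p.addr += i * elemsize;
        return p;
    }

    /* A general block-cyclic add must additionally carry the phase across
       block and thread boundaries, which is the extra per-access cost. */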

11
Preliminary Performance
  • Testbed
  • Compaq AlphaServer, with Quadrics GASNet conduit
  • Compaq C compiler for the translated C code
  • Microbenchmarks
  • Measure the cost of UPC language features and
    constructs
  • Shared pointer arithmetic, barrier, allocation,
    etc.
  • Vector addition: no remote communication
  • NAS Parallel Benchmarks
  • EP: no communication
  • IS: large bulk memory operations
  • MG: bulk memput
  • CG: fine-grained vs. bulk memput

12
Performance of Shared Pointer Arithmetic
  • Phaseless pointers are an important optimization
  • Indefinite pointers almost as fast as regular C
    pointers
  • General block-cyclic pointer: 7x slower for
    addition
  • Competitive with the HP compiler, which generates
    native code
  • Both compilers have known opportunities for
    improvement

13
Cost of Shared Memory Access
  • Local shared accesses somewhat slower than
    private ones
  • HP has improved local performance in newer
    version
  • Remote accesses worse than local, as expected
  • Runtime/GASNet layering for portability is not a
    problem

14
NAS PB EP
  • EP (Embarrassingly Parallel) has no communication
  • Serial performance via C code generation is not a
    problem

15
NAS PB IS
  • IS (Integer Sort) is dominated by bulk
    communication
  • GASNet bulk communication adds no measurable
    overhead

16
NAS PB MG
  • MG (Multigrid) involves medium-sized bulk copies
  • Berkeley reveals a slight serial performance
    degradation due to casts
  • Berkeley-C uses the original C code for the inner
    loops

17
Scaling MG on the T3E
  • Scalability of the language shown here for the
    T3E compiler
  • Direct shared memory support is probably needed
    to be competitive on most current machines

18
Mesh Generation in UPC
  • Parallel Mesh Generation in UPC
  • 2D Delaunay triangulation
  • Based on Triangle software by Shewchuk (UCB)
  • Parallel version from NERSC uses dynamic load
    balancing, software caching, and parallel sorting

19
Research in Optimizations
  • Privatizing accesses to local memory (illustrated
    in the sketch after this list)
  • In conjunction with elimination of forall-loop
    affinity tests
  • Communication optimizations
  • Separate get/put from sync; exploit the
    split-phase barrier
  • Message aggregation (fine-grained to bulk)
  • Software caching
  • Research problems
  • Optimization selection based on a performance
    model
  • Language research on the UPC memory consistency
    model
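
A hand-written illustration (added here; not compiler output) of
privatizing local accesses and eliminating the upc_forall affinity test
for a trivial blocked loop; array names and sizes are assumptions.

    #include <upc_relaxed.h>

    #define B 100
    shared [B] double a[B*THREADS], b[B*THREADS];

    /* As written: the affinity test &a[i] is evaluated every iteration and
       every element is accessed through a pointer-to-shared. */
    void scale_naive(double s) {
        int i;
        upc_forall (i = 0; i < B*THREADS; i++; &a[i])
            a[i] = s * b[i];
    }

    /* After privatization: iterate only over the block this thread owns and
       cast it to ordinary C pointers (legal because the block is local). */
    void scale_privatized(double s) {
        double *la = (double *) &a[MYTHREAD * B];
        double *lb = (double *) &b[MYTHREAD * B];
        for (int i = 0; i < B; i++)
            la[i] = s * lb[i];
    }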

20
Preliminary Performance Results
  • UPC communication optimizations
  • Performed by hand
  • Remote fetch-and-increment (not random data)

21
UPC Interactions
  • UPC consortium
  • Tarek El-Ghazawi is coordinator; semi-annual
    meetings, daily e-mail
  • Revised UPC Language Specification (IDA, GWU, ...)
  • UPC Collectives (MTU)
  • UPC I/O Specifications (GWU, ANL-PModels)
  • Other Implementations
  • HP (Alpha cluster and C+MPI compiler (with MTU))
  • MTU (C+MPI compiler based on the HP compiler,
    memory model)
  • Cray (X1 implementation)
  • Intrepid (SGI implementation based on gcc)
  • Etnus (debugging)
  • UPC book: T. El-Ghazawi, B. Carlson, T. Sterling,
    K. Yelick
  • Goal is proofs by SC03
  • HPC HPCS Effort
  • Recent interest from Sandia

22
Titanium
  • Based on Java, a cleaner C++
  • classes, automatic memory management, etc.
  • compiled to C and then to a native binary (no JVM)
  • Same parallelism model as UPC and CAF
  • SPMD with a global address space
  • Dynamic Java threads are not supported
  • Optimizing compiler
  • static (compile-time) optimizer, not a JIT
  • communication and memory optimizations
  • synchronization analysis (e.g. static barrier
    analysis)
  • cache and other uniprocessor optimizations

23
Summary of Features Added to Java
  1. Scalable parallelism (Java threads replaced)
  2. Immutable (value) classes
  3. Multidimensional arrays with unordered iteration
  4. Checked Synchronization
  5. Operator overloading
  6. Templates
  7. Zone-based memory management (regions)
  8. Libraries for collective communication,
    distributed arrays, bulk I/O

24
Immutable Classes in Titanium
  • For small objects, would sometimes prefer
  • to avoid the level of indirection
  • to pass by value (copy the entire object)
  • especially when immutable -- fields never
    modified
  • Example
  • immutable class Complex {
  •   Complex() { real = 0; imag = 0; }
  •   Complex operator+(Complex c) { ... }
  • }
  • Complex c1 = new Complex(7.1, 4.3);
  • c1 = c1 + c1;
  • Addresses performance and programmability
  • Similar to structs in C (not C++ classes) in
    terms of performance
  • Adds support for complex types

25
Multidimensional Arrays
  • Arrays in Java are objects
  • Array bounds are checked
  • Multidimensional arrays are arrays-of-arrays
  • Safe and general, but potentially slow
  • New kind of multidimensional array added to
    Titanium
  • Sub-arrays are supported (interior, boundary,
    etc.)
  • Indexed by Points (tuple of ints)
  • Combined with unordered iteration to enable
    optimizations
  • foreach (p in A.domain())
  •   A[p] = ...
  • A could be multidimensional, an interior
    region, etc.

26
Communication
  • Titanium has explicit global communication
  • Broadcast, reduction, etc.
  • Primarily used to set up distributed data
    structures
  • Most communication is implicit through the shared
    address space
  • Dereferencing a global reference, g.x, can
    generate communication
  • Arrays have copy operations, which generate bulk
    communication: A1.copy(A2)
  • Automatically computes the intersection of A1's
    and A2's index sets (domains)

27
Distributed Data Structures
  • Building distributed arrays
  • Particle [1d] single [1d] allParticle =
  •   new Particle [0:Ti.numProcs()-1][1d];
  • Particle [1d] myParticle =
  •   new Particle [0:myParticleCount-1];
  • allParticle.exchange(myParticle);
  • Now each processor has an array of pointers, one
    to each processor's chunk of particles

[Diagram: all-to-all broadcast of the per-processor particle arrays among P0, P1, P2]
28
Titanium Compiler Status
  • Titanium compiler runs on almost any machine
  • Requires a C compiler (and a decent C++ compiler
    to build the translator)
  • Pthreads for shared memory
  • Communication layer for distributed memory (or
    hybrid)
  • Recently moved to live on GASNet: obtained GM,
    Elan, and improved LAPI implementations
  • Leverages other PModels work for maintenance
  • Recent language extensions
  • Indexed array copy (scatter/gather style)
  • Non-blocking array copy under development
  • Compiler optimizations
  • Cache optimizations, for loop optimizations
  • Communication optimizations for overlap,
    pipelining, and scatter/gather under development

29
Applications in Titanium
  • Several benchmarks
  • Fluid solvers with Adaptive Mesh Refinement (AMR)
  • Conjugate Gradient
  • 3D Multigrid
  • Unstructured mesh kernel: EM3D
  • Dense linear algebra: LU, MatMul
  • Tree-structured n-body code
  • Finite element benchmark
  • Genetics micro-array selection
  • SciMark serial benchmarks
  • Larger applications
  • Heart simulation
  • Ocean modeling with AMR (in progress)

30
Serial Performance (Pure Java)
  • Several optimizations in Titanium compiler (tc)
    over the past year
  • These codes are all written in pure Java without
    performance extensions

31
AMR for Ocean Modeling
  • Ocean modeling [Wen, Colella]
  • Requires embedded boundaries to model the ocean
    floor/coastline
  • Line vs. point relaxation to handle the 1000 km x
    10 km aspect ratio
  • Results in irregular data structures and array
    accesses
  • Goal for this year
  • Basin scale AMR circulation model
  • Currently a non-adaptive implementation
  • Compiler and language support design

Graphics from Titanium AMR gas dynamics [McCorquodale, Colella]
32
Heart Simulation
  • Immersed Boundary Method [Peskin, MacQueen]
  • Fibers (e.g., heart muscles) modeled by list of
    fiber points
  • Fluid space modeled by a regular lattice
  • Irregular fiber lists need to interact with
    regular fluid lattice
  • Trade-off between load balancing of fibers and
    minimizing communication
  • Memory and communication intensive
  • Random array access is the key performance
    problem
  • Developed compiler optimizations to improve its
    performance
  • Application effort funded by NSF/NPACI

33
Parallel Performance and Scalability
  • Poisson solver using the Method of Local
    Corrections [Balls, Colella]
  • Communication < 5%; scaled speedup is nearly ideal
    (flat)
  • [Plots: IBM SP and Cray T3E]

34
Titanium Interactions
  • GASNet interactions
  • In addition to the application collaborators:
  • Charles Peskin and Dave McQueen, Courant
    Institute
  • Phil Colella and Tong Wen, LBNL
  • Scott Baden and Greg Balls, UCSD
  • Involved in Sun HPCS Effort
  • The GASNet work is common to UPC and Titanium
  • Joint effort between U.C. Berkeley and LBNL
  • (The UPC project is primarily at LBNL; Titanium is
    at U.C. Berkeley)
  • Collaboration with Nieplocha on communication
    runtime
  • Participation in Global Address Space tutorials

35
  • The End
  • http://upc.nersc.gov
  • http://titanium.cs.berkeley.edu/

36
NAS PB CG
  • CG (Conjugate Gradient) can be written naturally
    with fine-grained communication in the sparse
    matrix-vector product
  • Worked well on the T3E (and hopefully will on the
    X1)
  • For other machines, a bulk version is required
    (both styles are sketched below)
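
A minimal sketch (added for illustration; not the NAS CG source)
contrasting the two styles for one sparse row whose nonzero columns all
fall in a single remote thread's block; the block layout and names are
assumptions.

    #include <upc.h>
    #include <upc_relaxed.h>

    #define B 1000                      /* illustrative block size per thread */
    shared [B] double x[B*THREADS];     /* thread t owns x[t*B .. (t+1)*B-1]  */

    /* Fine-grained: natural to write, but each remote element costs a
       small message on most networks. */
    double row_fine(const int *col, const double *val, int nnz) {
        double sum = 0.0;
        for (int k = 0; k < nnz; k++)
            sum += val[k] * x[col[k]];
        return sum;
    }

    /* Bulk: fetch the owning thread's whole block with one upc_memget,
       then index the private copy. */
    double row_bulk(const int *col, const double *val, int nnz, int owner) {
        static double xbuf[B];          /* one private copy per thread */
        upc_memget(xbuf, &x[owner * B], B * sizeof(double));
        double sum = 0.0;
        for (int k = 0; k < nnz; k++)
            sum += val[k] * xbuf[col[k] - owner * B];  /* columns lie in owner's block */
        return sum;
    }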

37
NAS MG in Titanium
  • Preliminary Performance for MG code on IBM SP