Unified Parallel C (UPC) and the Berkeley UPC Compiler - PowerPoint PPT Presentation

About This Presentation
Slides: 12
Provided by: danb104
Transcript and Presenter's Notes


1
Unified Parallel C (UPC) and the Berkeley UPC Compiler
Wei Chen, the Berkeley UPC Group, 3/11/07
2
Parallel Programming
  • Most parallel programs are written using either:
  • Message passing with a SPMD model
      • Usually for scientific applications with C/Fortran
      • Scales easily: user-controlled data layout
      • Hard to use: send/receive matching, message
        packing/unpacking
  • Shared memory with OpenMP/pthreads/Java
      • Usually for non-scientific applications
      • Easier to program: direct reads and writes to
        shared data
      • Hard to scale: (mostly) limited to SMPs, no
        concept of locality
  • PGAS: an alternative hybrid model

3
Partitioned Global Address Space
  • PGAS model uses a global address space abstraction
      • Shared memory is partitioned by processors
      • User-controlled data layout (global pointers and
        distributed arrays)
  • One-sided communication
      • Uses RDMA support for reads/writes of shared
        variables
      • Much faster than message passing for small/medium
        size messages
  • Hybrid model works for both SMPs and clusters
  • Languages: Titanium, Co-Array Fortran, UPC

[Figure: global address space. Labels: X0, X1, ..., XP (Shared); ptr, ptr, ptr (Private)]
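
The global pointer and one-sided access ideas on this slide can be sketched in plain C. This is only a single-process illustration, not how any UPC runtime is implemented: the names global_ptr, gp_put, gp_get, gp_is_local, NTHREADS and SEG_WORDS are all hypothetical, and the "remote" segments are just rows of one array.

```c
#include <assert.h>
#include <stddef.h>

#define NTHREADS  4   /* hypothetical thread count */
#define SEG_WORDS 8   /* hypothetical words of shared memory per thread */

/* One shared segment per thread; together they stand in for the
   partitioned global address space. */
static int shared_seg[NTHREADS][SEG_WORDS];

/* A global pointer names a location by (owning thread, offset). */
typedef struct {
    int    thread;
    size_t offset;
} global_ptr;

/* One-sided write: the caller stores straight into the owner's segment. */
static void gp_put(global_ptr p, int value) {
    assert(p.thread >= 0 && p.thread < NTHREADS && p.offset < SEG_WORDS);
    shared_seg[p.thread][p.offset] = value;
}

/* One-sided read: the owner does not post a matching send. */
static int gp_get(global_ptr p) {
    assert(p.thread >= 0 && p.thread < NTHREADS && p.offset < SEG_WORDS);
    return shared_seg[p.thread][p.offset];
}

/* Locality query: is the target owned by thread `me`? */
static int gp_is_local(global_ptr p, int me) {
    return p.thread == me;
}
```

For example, gp_put((global_ptr){2, 3}, 42) followed by gp_get on the same pointer returns 42, and gp_is_local reports whether a given thread owns that word, which is the locality information a pure shared-memory model does not expose.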
4
Unified Parallel C
  • A SPMD parallel extension of C
  • PGAS: adds a shared qualifier to the type system
      • Several kinds of shared array distributions
      • Fine-grained and bulk communication
  • Commercial compilers: Cray/HP/IBM
  • Open source compilers: Berkeley UPC

Vector Addition in UPC
    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];  //cyclic layout
    int main(void) {
        for (int i = 0; i < N; i++)
            if (MYTHREAD == i % THREADS)  //SPMD
                sum[i] = v1[i] + v2[i];
        return 0;
    }
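
To make the cyclic layout concrete, here is a minimal sketch in plain C (not UPC) of which thread owns which element, and of the SPMD ownership test from the vector addition. The helper names cyclic_owner, block_cyclic_owner and vector_add_as_thread are hypothetical; the arrays are ordinary local arrays, so this only emulates the affinity rule, assuming elements are dealt round-robin and, for a block size b, in blocks of b.

```c
#include <assert.h>

/* Owner of element i under a cyclic (round-robin) layout. */
static int cyclic_owner(int i, int nthreads) {
    assert(i >= 0 && nthreads > 0);
    return i % nthreads;
}

/* Owner of element i under a block-cyclic layout with block size b. */
static int block_cyclic_owner(int i, int b, int nthreads) {
    assert(i >= 0 && b > 0 && nthreads > 0);
    return (i / b) % nthreads;
}

/* Emulate the SPMD loop for one thread `me`: every thread scans the
   whole index range but only adds the elements it owns.
   Returns how many elements this thread processed. */
static int vector_add_as_thread(int me, int nthreads, int n,
                                const int *v1, const int *v2, int *sum) {
    int done = 0;
    for (int i = 0; i < n; i++) {
        if (cyclic_owner(i, nthreads) == me) {
            sum[i] = v1[i] + v2[i];
            done++;
        }
    }
    return done;
}
```

Calling vector_add_as_thread once for each thread id covers every index exactly once, which is why the ownership test alone is enough to divide the work without any explicit messages.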
5
Overview of the Berkeley UPC Compiler
Two Goals: Portability and High-Performance
[Figure: software stack, top to bottom]
  • UPC Code
  • Translator: lowers UPC code into ISO C code
  • Translator Generated C Code
  • Berkeley UPC Runtime System: shared memory management and
    pointer operations
  • GASNet Communication System: uniform get/put interface for
    underlying networks
  • Network Hardware
Side labels on the stack: Platform-independent, Network-independent,
Compiler-independent, Language-independent
6
UPC to C Translator
  • Based on Open64
      • Extend with shared type
      • Reuse analysis framework
      • Add UPC-specific optimizations
  • Portable translation
      • High-level IR
      • Config file for platform-dependent information
      • Reinclude library headers
      • Convert shared memory operations into runtime
        calls

[Figure: translator pipeline]
Preprocessed UPC Source -> Parsing -> WHIRL with shared types
-> Optimizer -> Optimized WHIRL -> Lowering -> WHIRL with runtime calls
-> WHIRL2C -> ISO C code -> Backend C compiler
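
As a rough picture of what "convert shared memory operations into runtime calls" means, the sketch below shows how the statement sum[i] = v1[i] + v2[i] might look once shared accesses are replaced by calls. Everything here is an assumption made for illustration: rt_get_int, rt_put_int, shared_ref and the storage layout are hypothetical and are not the Berkeley UPC runtime interface or the translator's real output.

```c
#include <assert.h>

#define NTHREADS   4   /* hypothetical thread count */
#define PER_THREAD 4   /* hypothetical elements per thread */

enum { V1, V2, SUM, NARRAYS };

/* Stand-in for shared storage: one slice per thread for each array. */
static int store[NARRAYS][NTHREADS][PER_THREAD];

/* Hypothetical handle for one element of a shared array. */
typedef struct {
    int array_id;
    int index;
} shared_ref;

/* Hypothetical runtime read; cyclic layout, so
   owner = index % NTHREADS and local slot = index / NTHREADS. */
static int rt_get_int(shared_ref r) {
    assert(r.index >= 0 && r.index < NTHREADS * PER_THREAD);
    return store[r.array_id][r.index % NTHREADS][r.index / NTHREADS];
}

/* Hypothetical runtime write with the same address arithmetic. */
static void rt_put_int(shared_ref r, int value) {
    assert(r.index >= 0 && r.index < NTHREADS * PER_THREAD);
    store[r.array_id][r.index % NTHREADS][r.index / NTHREADS] = value;
}

/* Lowered form of the UPC statement: sum[i] = v1[i] + v2[i]; */
static void lowered_add(int i) {
    shared_ref a = { V1, i }, b = { V2, i }, c = { SUM, i };
    int t1 = rt_get_int(a);   /* shared read becomes a runtime call */
    int t2 = rt_get_int(b);
    rt_put_int(c, t1 + t2);   /* shared write becomes a runtime call */
}
```

The point of the sketch is that the generated code is ordinary C: the shared type disappears, and every access goes through a function that knows the data layout, which is what lets any backend C compiler finish the job.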
7
Optimization framework
  • Combination of language/compiler/runtime support
  • Transparent to the user
  • Performance portable
      • Short-term goal: effective on different cluster
        networks
      • Long-term goal: code designed for SMPs gets good
        performance on clusters

  • Optimize regular array accesses, e.g. A[i][j][k]:
    loop framework for message vectorization, strip mining
  • Optimize irregular pointer accesses, e.g. p->x->y:
    PRE framework with split-phase access and coalescing
  • Nonblocking bulk communication, e.g. upc_memget(dst, src, size):
    runtime framework for communication overlap
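
Message vectorization with strip mining can be illustrated with a small message-counting sketch in plain C. It is a single-process emulation under stated assumptions: remote_mem stands in for another thread's shared data, messages_sent counts simulated network round trips, and remote_get, remote_memget, sum_fine_grained and sum_strip_mined are hypothetical names. Only the call shape of remote_memget is modeled on upc_memget(dst, src, size) from the slide.

```c
#include <assert.h>
#include <string.h>

#define REMOTE_N 64

static int remote_mem[REMOTE_N];  /* stands in for remote shared data */
static int messages_sent = 0;     /* simulated network round trips */

/* Fine-grained read: one message per element. */
static int remote_get(int idx) {
    assert(idx >= 0 && idx < REMOTE_N);
    messages_sent++;
    return remote_mem[idx];
}

/* Bulk read: one message for a contiguous range. */
static void remote_memget(int *dst, int start, int count) {
    assert(start >= 0 && count >= 0 && start + count <= REMOTE_N);
    messages_sent++;
    memcpy(dst, &remote_mem[start], (size_t)count * sizeof(int));
}

/* Unoptimized loop: every iteration pays for a message. */
static long sum_fine_grained(int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += remote_get(i);
    return s;
}

/* Vectorized and strip-mined loop: fetch `strip` elements per message
   into a local buffer, then compute on local data. */
static long sum_strip_mined(int n, int strip) {
    int buf[REMOTE_N];
    long s = 0;
    assert(strip > 0 && strip <= REMOTE_N);
    for (int start = 0; start < n; start += strip) {
        int count = (n - start < strip) ? n - start : strip;
        remote_memget(buf, start, count);
        for (int j = 0; j < count; j++)
            s += buf[j];
    }
    return s;
}
```

With 64 elements and a strip of 16, the fine-grained loop issues 64 simulated messages and the strip-mined loop issues 4, for the same result. The strip size bounds the local buffer, which is the usual reason to strip-mine rather than fetch the whole range at once.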
8
Application Performance LU Decomposition
  • UPC performance comparable to MPI/HPL (Linpack)
    with < ½ the code size
  • Uses light-weight multi-threading atop SPMD, which
    makes it latency tolerant
  • Highly adaptable to different problem and machine
    sizes

9
Application Performance 3D FFT
[Chart: MFLOPS / Proc (up is good)]
  • One-sided UPC approach sends more, smaller
    messages
  • Same total volume of data, but send earlier and
    more often
  • Aggressively overlaps the transpose with the 2nd
    1-D FFT
  • Same approach is less effective in MPI due to
    higher per-message cost
  • Consistently outperforms MPI-based
    implementations by as much as 2X

10
Current Status
  • Public release v2.4 in November 2006
      • Fully compliant with UPC 1.2 specification
      • Communication optimizations
      • Extensions for performance and programmability
  • Support from laptops to supercomputers
      • OS: UNIX (Linux, BSD, AIX, Solaris, etc.), Mac,
        Cygwin
      • Arch: x86, Itanium, Opteron, Alpha, PPC, SPARC,
        Cray X1, NEC SX-6, Blue Gene, etc.
      • Network: SMP, Myrinet, Quadrics, Infiniband, IBM
        LAPI, MPI, Ethernet, SHMEM, etc.
  • Give us a try at http://upc.lbl.gov

11
Summary
  • UPC designed to be consistent with C
  • Expose memory layout
  • Flexible communication with pointers and arrays
  • Give users more control to achieve high
    performance
  • Berkeley UPC compiler provides an open-source and
    portable implementation
  • Hand-optimized UPC programs match and often beat
    MPI's performance
  • Research goal: productive user, efficient
    compiler