1
Compilation Technology for Computational Science
Kathy Yelick
Lawrence Berkeley National Laboratory and UC Berkeley
Joint work with
The Titanium Group: S. Graham, P. Hilfinger, P. Colella, D. Bonachea, K. Datta,
E. Givelberg, A. Kamil, N. Mai, A. Solar, J. Su, T. Wen
The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove,
P. Husbands, C. Iancu, R. Nishtala, M. Welcome
2
Outline
  • Computer architecture trends
  • Software trends
  • Scientific computing expertise in parallelism
  • Performance is as important as parallelism
  • Resource management is key to performance
  • Open question: how much of the machine to virtualize?
  • Parallel language problems and PGAS solutions
  • Virtualize global address space
  • Not shared virtual memory, not virtual processor
    space
  • Parallel compiler problems/solutions

3
Parallelism Everywhere
  • Single-processor Moore's Law effect is ending
  • Power density limitations; device physics below
    90 nm
  • Multicore is becoming the norm
  • AMD, IBM, Intel, Sun all offering multicore
  • Number of cores per chip likely to increase with
    density
  • Fundamental software change
  • Parallelism is exposed to software
  • Performance is no longer solely a hardware
    problem
  • What has the HPC community learned?
  • Caveat: scale and applications differ

4
High-end simulation in the physical sciences: 7
methods
Phillip Colella's "Seven Dwarfs"
  1. Structured Grids (including Adaptive Mesh
    Refinement)
  2. Unstructured Grids
  3. Spectral Methods (FFTs, etc.)
  4. Dense Linear Algebra
  5. Sparse Linear Algebra
  6. Particles
  7. Monte Carlo Simulation
  • Add 4 for embedded; covers all 41 EEMBC
    benchmarks
  • 8. Search/Sort
  • 9. Filter
  • 10. Comb. logic
  • 11. Finite State Machine

Note: data sizes (8-bit to 32-bit) and types
(integer, character) differ, but the algorithms are the
same. Games/entertainment are close to scientific
computing.
Slide source: Phillip Colella, 2004, and Dave
Patterson, 2006
5
Parallel Programming Models
  • Parallel software is still an unsolved problem!
  • Most parallel programs are written using either:
  • Message passing with an SPMD model
  • for scientific applications; scales easily
  • Shared memory with threads in OpenMP, Threads, or
    Java
  • for non-scientific applications; easier to program
  • Partitioned Global Address Space (PGAS) languages
    offer 3 features
  • Productivity: easy to understand and use
  • Performance: primary requirement in HPC
  • Portability: must run everywhere

6
Partitioned Global Address Space
  • Global address space: any thread/process may
    directly read/write data allocated by another
  • Partitioned: data is designated as local (near)
    or global (possibly far); the programmer controls
    layout
  • By default:
  • Object heaps are shared
  • Program stacks are private

[Figure: the global address space spans threads p0 ... pn;
each thread's partition holds shared data (x = 1, 5, 7 and
pointers y) reachable through local (l) and global (g)
references]
  • 3 current languages: UPC, CAF, and Titanium
  • Emphasis in this talk is on UPC and Titanium (based
    on Java)

7
PGAS Language Overview
  • Many common concepts, although specifics differ
  • Consistent with base language
  • Both private and shared data
  • int x[10] and shared int y[10]
  • Support for distributed data structures
  • Distributed arrays; local and global
    pointers/references
  • One-sided shared-memory communication
  • Simple assignment statements: x[i] = y[i]
    or t = *p
  • Bulk operations: memcpy in UPC, array ops in
    Titanium and CAF (see the sketch below)
  • Synchronization
  • Global barriers, locks, memory fences
  • Collective Communication, IO libraries, etc.
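A minimal UPC-style sketch of these concepts (illustrative only; the array
sizes and the neighbor exchange are made up for this example):

  #include <upc.h>
  #define N 10

  shared [N] int y[N*THREADS];  /* shared array, blocked: thread t owns y[t*N .. t*N+N-1] */
  int x[N];                     /* private array: every thread has its own copy           */

  int main(void) {
    for (int i = 0; i < N; i++) x[i] = MYTHREAD;       /* fill private data               */
    upc_memput(&y[MYTHREAD * N], x, N * sizeof(int));  /* bulk one-sided put into my block */
    upc_barrier;                                       /* global barrier                  */

    int neighbor = (MYTHREAD + 1) % THREADS;
    int v = y[neighbor * N];   /* simple assignment: a one-sided read that may be remote  */
    upc_barrier;
    return (v == neighbor) ? 0 : 1;
  }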

8
Example Titanium Arrays
  • Ti arrays are created using Domains and indexed using
    Points
  • double [3d] gridA = new double[[0,0,0]:[10,10,10]];
  • Eliminates some loop-bound errors using foreach
  • foreach (p in gridA.domain())
  • gridA[p] = gridA[p]*c + gridB[p];
  • A rich domain calculus allows slicing, subarrays,
    transposes, and other operations without data
    copies
  • Array copy operations automatically work on the
    intersection
  • data[neighborPos].copy(mydata);

[Figure: mydata and data[neighborPos] overlap; the copy touches
only the intersection (copied area) covering the ghost cells
adjacent to the restricted (non-ghost) cells]
9
Productivity Line Count Comparison
  • Comparison of NAS Parallel Benchmarks
  • UPC version has modest programming effort
    relative to C
  • Titanium even more compact, especially for MG,
    which uses multi-d arrays
  • Caveat: Titanium FT has a user-defined Complex type
    and uses cross-language support to call FFTW for
    serial 1D FFTs

UPC results from Tarek El-Ghazawi et al.; CAF from
Chamberlain et al.; Titanium joint with Kaushik
Datta and Dan Bonachea
10
Case Study 1 Block-Structured AMR
  • Adaptive Mesh Refinement (AMR) is challenging
  • Irregular data accesses and control from
    boundaries
  • Mixed global/local view is useful

Titanium AMR benchmarks available
AMR Titanium work by Tong Wen and Philip Colella
11
AMR in Titanium
  • C/Fortran/MPI AMR:
  • Chombo package from LBNL
  • Bulk-synchronous communication
  • Pack boundary data between procs
  • Titanium AMR:
  • Entirely in Titanium
  • Finer-grained communication
  • No explicit pack/unpack code
  • Automated in the runtime system

Code Size in Lines       C/Fortran/MPI   Titanium
AMR data structures      35000           2000
AMR operations           6500            1200
Elliptic PDE solver      4200            1500

10X reduction in lines of code!
(Somewhat more functionality in the PDE part of the
Chombo code)
Work by Tong Wen and Philip Colella
Communication optimizations joint with Jimmy Su
12
Performance of Titanium AMR
Comparable performance
  • Serial: Titanium is within a few percent of C/Fortran,
    sometimes faster!
  • Parallel: Titanium scaling is comparable with
    generic optimizations
  • Additional optimizations (namely overlap) are
    not yet implemented

13
Immersed Boundary Simulation in Titanium
  • Modeling elastic structures in an incompressible
    fluid.
  • Blood flow in the heart, blood clotting, inner
    ear, embryo growth, and many more
  • Complicated parallelization
  • Particle/Mesh method
  • Particles connected into materials

Code Size in Lines
Fortran   Titanium
8000      4000
Joint work with Ed Givelberg, Armando Solar-Lezama
14
High Performance
  • Strategy for acceptance of a new language
  • Within HPC: make it run faster than anything else
  • Approaches to high performance
  • Language support for performance:
  • Allow programmers sufficient control over
    resources for tuning
  • Non-blocking data transfers, cross-language
    calls, etc.
  • Control over layout, load balancing, and
    synchronization (see the layout sketch below)
  • Compiler optimizations reduce the need for hand
    tuning
  • Automate non-blocking memory operations, relaxed
    memory, etc.
  • Productivity gains through parallel analysis and
    optimizations
  • Runtime support exposes best possible performance
  • Berkeley UPC and Titanium use GASNet
    communication layer
  • Dynamic optimizations based on runtime information
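As one example of layout control, UPC lets the programmer pick the block size
of a shared array; a small illustrative sketch (names and sizes are made up):

  #include <upc.h>
  #define N 1024
  #define B 64

  /* block-cyclic layout: blocks of 64 elements are dealt round-robin to threads */
  shared [B] double a[N];

  void scale(double c) {
    int i;
    /* the affinity expression &a[i] makes each thread execute only the
       iterations whose element it owns, so every access below is local */
    upc_forall (i = 0; i < N; i++; &a[i])
      a[i] *= c;
  }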

15
One-Sided vs Two-Sided
[Diagram: a one-sided put message carries the destination address
plus the data payload and can be deposited directly into memory by
the network interface; a two-sided message carries a message id plus
the data payload, and the host CPU must match it against a posted
receive to find the destination address]
  • A one-sided put/get message can be handled
    directly by a network interface with RDMA support
  • Avoids interrupting the CPU or storing data from the
    CPU (preposts)
  • A two-sided message needs to be matched with a
    receive to identify the memory address for the data
  • Matching can be offloaded to the network interface in
    networks like Quadrics
  • But match tables must be downloaded to the interface
    (from the host); see the sketch below

16
Performance Advantage of One-Sided Communication
GASNet vs MPI
  • Opteron/InfiniBand (Jacquard at NERSC)
  • GASNet's vapi-conduit and OSU MPI 0.9.5 (MVAPICH)
  • Half-power point (N½) differs by one order of
    magnitude

Joint work with Paul Hargrove and Dan Bonachea
17
GASNet Portability and High-Performance
GASNet is better for latency across machines
Joint work with UPC Group GASNet design by Dan
Bonachea
18
GASNet Portability and High-Performance
GASNet bandwidth is at least as high (comparable) for large
messages
Joint work with UPC Group GASNet design by Dan
Bonachea
19
GASNet Portability and High-Performance
GASNet excels at mid-range sizes, which are important for
overlap
Joint work with UPC Group GASNet design by Dan
Bonachea
20
Case Study 2 NAS FT
  • Performance of the Exchange (Alltoall) is critical
  • 1D FFTs in each dimension, 3 phases
  • Transpose after the first 2 for locality
  • Bisection bandwidth-limited
  • Problem worsens as # procs grows
  • Three approaches:
  • Exchange:
  • wait for 2nd-dim FFTs to finish, send 1 message
    per processor pair
  • Slab:
  • wait for a chunk of rows destined for 1 proc, send
    when ready
  • Pencil:
  • send each row as it completes

Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
21
Overlapping Communication
  • Goal: make use of all the wires, all the time
  • Schedule communication to avoid network backup
  • Trade-off: overhead vs. overlap
  • Exchange has the fewest messages and the least message
    overhead
  • Slabs and pencils have more overlap; pencils the
    most (see the sketch after the table below)
  • Example: Class D problem on 256 processors

Exchange (all data at once) 512 Kbytes
Slabs (contiguous rows that go to 1 processor) 64 Kbytes
Pencils (single row) 16 Kbytes
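A sketch of the pencil strategy with overlap (put_nb/wait_sync are hypothetical
names standing in for a handle-based non-blocking put such as the Berkeley UPC
extensions; fft_1d and row_remote_addr are assumed application helpers):

  #include <upc.h>
  #include <stddef.h>
  #define ROWS 64
  #define ROW_DOUBLES 512

  typedef int put_handle_t;                                    /* hypothetical handle type */
  extern put_handle_t put_nb(shared void *dst, const void *src, size_t n); /* hypothetical */
  extern void wait_sync(put_handle_t h);                                   /* hypothetical */
  extern void fft_1d(double *row);               /* assumed serial 1D FFT (e.g., via FFTW) */
  extern shared void *row_remote_addr(int r);    /* assumed: where row r lands remotely    */

  double row[ROWS][ROW_DOUBLES];
  put_handle_t handle[ROWS];

  void pencil_exchange(void) {
    for (int r = 0; r < ROWS; r++) {
      fft_1d(row[r]);                                          /* finish one pencil        */
      handle[r] = put_nb(row_remote_addr(r), row[r],
                         ROW_DOUBLES * sizeof(double));        /* overlap with later rows  */
    }
    for (int r = 0; r < ROWS; r++) wait_sync(handle[r]);       /* drain outstanding puts   */
    upc_barrier;                                               /* data ready for next phase */
  }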
Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
22
NAS FT Variants Performance Summary
0.5 Tflops
  • Slab is always best for MPI: the small-message cost is
    too high
  • Pencil is always best for UPC: more overlap

Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
23
Case Study 3 LU Factorization
  • Direct methods have complicated dependencies
  • Especially with pivoting (unpredictable
    communication)
  • Especially for sparse matrices (dependence graph
    with holes)
  • LU Factorization in UPC
  • Use overlap ideas and multithreading to mask
    latency (see the lookahead sketch below)
  • Multithreaded: UPC threads + user threads +
    threaded BLAS
  • Panel factorization (including pivoting)
  • Update to a block of U
  • Trailing submatrix updates
  • Status:
  • Dense LU done (HPL-compliant)
  • Sparse version underway
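A sketch of the lookahead idea (a generic right-looking LU in which the next
panel is factored early so its latency hides behind the trailing updates; the
helper names are assumed, not the actual UPC code):

  #define NB 16                           /* number of block columns (illustrative)                  */
  extern void factor_panel_nb(int k);     /* assumed: start pivoted panel factorization + broadcast  */
  extern void wait_panel(int k);          /* assumed: block until panel k is available               */
  extern void update_block(int k, int j); /* assumed: dgemm-style trailing update of block column j  */

  void lu_lookahead(void) {
    factor_panel_nb(0);
    for (int k = 0; k < NB; k++) {
      wait_panel(k);                      /* panel k factored and broadcast                          */
      if (k + 1 < NB) {
        update_block(k, k + 1);           /* update the column that becomes the next panel           */
        factor_panel_nb(k + 1);           /* start it early: overlaps the remaining updates below    */
      }
      for (int j = k + 2; j < NB; j++)
        update_block(k, j);               /* bulk of the work, overlapped with panel k+1             */
    }
  }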

Joint work with Parry Husbands
24
UPC HPL Performance
  • MPI HPL numbers from HPCC database
  • Large-scale runs:
  • 2.2 TFlops on 512p,
  • 4.4 TFlops on 1024p (Thunder)
  • Comparison to ScaLAPACK on an Altix, a 2 x 4
    process grid
  • ScaLAPACK (block size 64): 25.25 GFlop/s (tried
    several block sizes)
  • UPC LU (block size 256): 33.60 GFlop/s; (block
    size 64): 26.47 GFlop/s
  • n = 32000 on a 4x4 process grid
  • ScaLAPACK - 43.34 GFlop/s (block size 64)
  • UPC - 70.26 Gflop/s (block size 200)

Joint work with Parry Husbands
25
Automating Support for Optimizations
  • The previous examples are hand-optimized
  • Non-blocking put/get on distributed memory
  • Relaxed memory consistency on shared memory
  • What analyses are needed to optimize parallel
    codes?
  • Concurrency analysis: determine which blocks of
    code could run in parallel
  • Alias analysis: determine which variables could
    access the same location
  • Synchronization analysis: align matching
    barriers, locks
  • Locality analysis: when is a general (global)
    pointer used only locally (it can be converted to a
    cheaper local pointer)

Joint work with Amir Kamil and Jimmy Su
26
Reordering in Parallel Programs
In parallel programs, a reordering can change the
semantics even if no local dependencies exist.
Initially, flag = data = 0

T1 (original):    T1 (reordered):    T2:
data = 1;         flag = 1;          f = flag;
flag = 1;         data = 1;          d = data;

f == 1, d == 0 is possible after the reordering,
but not in the original program
Compiler, runtime, and hardware can produce such
reorderings
Joint work with Amir Kamil and Jimmy Su
27
Memory Models
  • Sequential consistency: a reordering is illegal
    if it can be observed by another thread
  • Relaxed consistency: reordering may be observed,
    but local dependencies and synchronization are
    preserved (roughly)
  • Titanium, Java, UPC are not sequentially
    consistent
  • Perceived cost of enforcing it is too high
  • For Titanium and UPC, network latency is the cost
  • For Java, shared-memory fences and code
    transformations are the cost

Joint work with Amir Kamil and Jimmy Su
28
Software and Hardware Reordering
  • The compiler can reorder accesses as part of an
    optimization
  • Example: copy propagation (see the sketch below)
  • Logical fences are inserted where reordering is
    illegal; optimizations respect these fences
  • Hardware can reorder accesses
  • Examples: out-of-order execution, remote accesses
  • Fence instructions are inserted into the generated
    code; each waits until all prior memory operations
    have completed
  • This can cost a complete round-trip time due to remote
    accesses
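A small sketch of how copy propagation can reorder shared accesses (UPC-style
C; the variable names are illustrative):

  #include <upc.h>

  shared int data, flag;      /* visible to all threads */

  void producer(void) {
    int t, x;
    t    = data;   /* (1) read data                  */
    flag = 1;      /* (2) publish                    */
    x    = t;      /* (3) uses the value read at (1) */
    /* Copy propagation may rewrite (3) as "x = data", re-reading data AFTER
       the write to flag. Sequentially that is harmless, but a thread that
       waits for flag == 1 and then writes data could now be observed at (3),
       which sequential consistency forbids; an SC compiler must fence here. */
    (void)x;
  }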

Joint work with Amir Kamil and Jimmy Su
29
Conflicts
  • Reordering of an access is observable only if it
    conflicts with some other access:
  • The accesses are to the same memory location
  • At least one access is a write
  • The accesses can run concurrently
  • Fences (compiler and hardware) need to be
    inserted around accesses that conflict

T1:           T2:
data = 1;     f = flag;
flag = 1;     d = data;

Conflicts: the two accesses to data and the two accesses to flag
Joint work with Amir Kamil and Jimmy Su
30
Sequential Consistency in Titanium
  • Minimize the number of fences to allow the same
    optimizations as the relaxed model
  • Concurrency analysis: identifies concurrent
    accesses
  • Relies on Titanium's textual barriers and
    single-valued expressions
  • Alias analysis: identifies accesses to the same
    location
  • Relies on the SPMD nature of Titanium

Joint work with Amir Kamil and Jimmy Su
31
Barrier Alignment
  • Many parallel languages make no attempt to ensure
    that barriers line up
  • Example: code that is legal but will deadlock
  • if (Ti.thisProc() % 2 == 0)
  •   Ti.barrier(); // even ID threads
  • else
  •   ; // odd ID threads: no barrier

Joint work with Amir Kamil and Jimmy Su
32
Structural Correctness
  • Aiken and Gay introduced structural correctness
    (POPL '98)
  • Ensures that every thread executes the same
    number of barriers
  • Example of structurally correct code:
  • if (Ti.thisProc() % 2 == 0)
  •   Ti.barrier(); // even ID threads
  • else
  •   Ti.barrier(); // odd ID threads

Joint work with Amir Kamil and Jimmy Su
33
Textual Barrier Alignment
  • Titanium has textual barriers: all threads must
    execute the same textual sequence of barriers
  • This is a stronger guarantee than structural
    correctness; the following example is illegal
  • if (Ti.thisProc() % 2 == 0)
  •   Ti.barrier(); // even ID threads
  • else
  •   Ti.barrier(); // odd ID threads
  • Single-valued expressions are used to enforce textual
    barriers

Joint work with Amir Kamil and Jimmy Su
34
Single-Valued Expressions
  • A single-valued expression has the same value on
    all threads when evaluated
  • Example: Ti.numProcs() > 1
  • All threads guaranteed to take the same branch of
    a conditional guarded by a single-valued
    expression
  • Only single-valued conditionals may have barriers
  • Example of legal barrier use:
  • if (Ti.numProcs() > 1)
  •   Ti.barrier(); // multiple threads
  • else
  •   ; // only one thread total

Joint work with Amir Kamil and Jimmy Su
35
Concurrency Analysis
  • A graph is generated from the program as follows:
  • A node is added for each code segment between barriers
    and single-valued conditionals
  • Edges are added to represent control flow between
    segments

// code segment 1
if (single) {
  // code segment 2
} else {
  // code segment 3
}
// code segment 4
Ti.barrier();
// code segment 5

[Graph: node 1 -> nodes 2 and 3 -> node 4 -> barrier -> node 5]
Joint work with Amir Kamil and Jimmy Su
36
Concurrency Analysis (II)
  • Two accesses can run concurrently if:
  • They are in the same node, or
  • One access's node is reachable from the other
    access's node without hitting a barrier
  • Algorithm: remove barrier edges, then do a DFS
    (see the sketch after the table below)

Concurrent Segments
      1   2   3   4   5
  1   X   X   X   X
  2   X   X       X
  3   X       X   X
  4   X   X   X   X
  5                   X

[Graph: node 1 -> nodes 2 and 3 -> node 4 -> barrier -> node 5]
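A compact sketch of that check in plain C (the graph representation is
illustrative, not the compiler's actual data structure):

  #define MAXSEG 64

  int nseg;                          /* number of code segments (nodes)           */
  int edge[MAXSEG][MAXSEG];          /* edge[u][v] = 1: control flows from u to v */
  int barrier_edge[MAXSEG][MAXSEG];  /* 1 if that flow passes through a barrier   */

  static void dfs(int u, int reach[]) {
    if (reach[u]) return;
    reach[u] = 1;
    for (int v = 0; v < nseg; v++)
      if (edge[u][v] && !barrier_edge[u][v])   /* barrier edges are removed */
        dfs(v, reach);
  }

  int may_run_concurrently(int a, int b) {
    int ra[MAXSEG] = {0}, rb[MAXSEG] = {0};
    if (a == b) return 1;                      /* same node                           */
    dfs(a, ra);
    dfs(b, rb);
    return ra[b] || rb[a];                     /* reachable without hitting a barrier */
  }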
Joint work with Amir Kamil and Jimmy Su
37
Alias Analysis
  • Allocation sites correspond to abstract locations
    (a-locs)
  • All explicit and implicit program variables have
    points-to sets
  • A-locs are typed and have points-to sets for each
    field of the corresponding type
  • Arrays have a single points-to set for all
    indices
  • The analysis is flow- and context-insensitive
  • An experimental call-site-sensitive version
    doesn't seem to help much

Joint work with Amir Kamil and Jimmy Su
38
Thread-Aware Alias Analysis
  • Two types of abstract locations local and remote
  • Local locations reside in the local thread's memory
  • Remote locations reside on another thread
  • Exploits SPMD property
  • Results are a summary over all threads
  • Independent of the number of threads at runtime

Joint work with Amir Kamil and Jimmy Su
39
Alias Analysis Allocation
  • Creates new local abstract location
  • Result of allocation must reside in local memory

class Foo {
  Object z;
  static void bar() {
    L1: Foo a = new Foo();
        Foo b = broadcast a from 0;
        Foo c = a;
    L2: a.z = new Object();
  }
}

A-locs: 1, 2

Points-to Sets:
a → { }
b → { }
c → { }
Joint work with Amir Kamil and Jimmy Su
40
Alias Analysis Assignment
  • Copies source abstract locations into points-to
    set of target

class Foo {
  Object z;
  static void bar() {
    L1: Foo a = new Foo();
        Foo b = broadcast a from 0;
        Foo c = a;
    L2: a.z = new Object();
  }
}

A-locs: 1, 2

Points-to Sets:
a → { 1 }
b → { }
c → { 1 }
1.z → { 2 }
Joint work with Amir Kamil and Jimmy Su
41
Alias Analysis Broadcast
  • Produces both local and remote versions of source
    abstract location
  • Remote a-loc points to remote analog of what
    local a-loc points to

class Foo {
  Object z;
  static void bar() {
    L1: Foo a = new Foo();
        Foo b = broadcast a from 0;
        Foo c = a;
    L2: a.z = new Object();
  }
}

A-locs: 1, 2, 1r

Points-to Sets:
a → { 1 }
b → { 1, 1r }
c → { 1 }
1.z → { 2 }
1r.z → { 2r }
Joint work with Amir Kamil and Jimmy Su
42
Aliasing Results
  • Two variables A and B may alias if:
    ∃x such that x ∈ pointsTo(A) and x ∈ pointsTo(B)
  • Two variables A and B may alias across threads if:
    ∃x such that x ∈ pointsTo(A) and R(x) ∈ pointsTo(B)
    (where R(x) is the remote counterpart of x)

Points-to Sets:
a → { 1 }
b → { 1, 1r }
c → { 1 }

      Alias       Alias Across Threads
a     b, c        b
b     a, c        a, c
c     a, b        b
Joint work with Amir Kamil and Jimmy Su
43
Benchmarks
Benchmark      Lines¹   Description
pi             56       Monte Carlo integration
demv           122      Dense matrix-vector multiply
sample-sort    321      Parallel sort
lu-fact        420      Dense linear algebra
3d-fft         614      Fourier transform
gsrb           1090     Computational fluid dynamics kernel
gsrb*          1099     Slightly modified version of gsrb
spmv           1493     Sparse matrix-vector multiply
gas            8841     Hyperbolic solver for gas dynamics

¹ Line counts do not include the reachable portion of the
37,000-line Titanium/Java 1.0 libraries
Joint work with Amir Kamil and Jimmy Su
44
Analysis Levels
  • We tested analyses of varying levels of precision

Analysis           Description
naïve              All heap accesses
sharing            All shared accesses
concur             Concurrency analysis + type-based AA
concur/saa         Concurrency analysis + sequential AA
concur/taa         Concurrency analysis + thread-aware AA
concur/taa/cycle   Concurrency analysis + thread-aware AA + cycle detection
Joint work with Amir Kamil and Jimmy Su
45
Static (Logical) Fences
[Chart: static fences after each analysis; fewer is better]
Percentages are the number of static fences reduced
relative to naïve
Joint work with Amir Kamil and Jimmy Su
46
Dynamic (Executed) Fences
[Chart: dynamic fences executed under each analysis; fewer is better]
Percentages are the number of dynamic fences reduced
relative to naïve
Joint work with Amir Kamil and Jimmy Su
47
Dynamic Fences: gsrb
  • gsrb relies on dynamic locality checks
  • A slight modification to remove the checks (gsrb*)
    greatly increases the precision of the analysis

[Chart: dynamic fences for gsrb vs. gsrb*; fewer is better]
Joint work with Amir Kamil and Jimmy Su
48
Two Example Optimizations
  • Consider two optimizations for GAS languages
  • Overlap bulk memory copies
  • Communication aggregation for irregular array
    accesses (i.e., a[b[i]])
  • Both optimizations reorder accesses, so
    sequential consistency can inhibit them
  • Both address network performance, so the
    potential payoff is high

Joint work with Amir Kamil and Jimmy Su
49
Array Copies in Titanium
  • Array copy operations are commonly used
  • dst.copy(src)
  • Content in the domain intersection of the two
    arrays is copied from src to dst
  • Communication (possibly with packing) required if
    arrays reside on different threads
  • Processor blocks until the operation is complete.




[Figure: src and dst arrays on different threads; only the
overlapping region is copied]
Joint work with Amir Kamil and Jimmy Su
50
Non-Blocking Array Copy Optimization
  • Automatically convert blocking array copies into
    non-blocking array copies
  • Push the sync as far down the instruction stream as
    possible to allow overlap with computation
    (see the sketch below)
  • Interprocedural: syncs can be moved across method
    boundaries
  • The optimization reorders memory accesses, so it may be
    illegal under sequential consistency
Joint work with Amir Kamil and Jimmy Su
51
Communication Aggregation on Irregular Array
Accesses (Inspector/Executor)
  • A loop containing indirect array accesses is
    split into phases
  • Inspector: examines the loop and computes reference
    targets
  • Required remote data is gathered in a bulk operation
  • Executor: uses the data to perform the actual
    computation
  • This can be illegal under sequential consistency

// Original loop:
for (...) { a[i] = remote[b[i]]; /* other accesses */ }

// After the inspector/executor transformation:
schd = inspect(remote, b);              // inspector computes reference targets
tmp  = get(remote, schd);               // bulk gather of the remote data
for (...) { a[i] = tmp[i]; /* other accesses */ }   // executor uses the local copy
Joint work with Amir Kamil and Jimmy Su
52
Relaxed SC with 3 Analyses
  • We tested performance using analyses of varying
    levels of precision

Name Description
relaxed Uses Titanium's relaxed memory model
naïve Uses sequential consistency, puts fences around every heap access
sharing Uses sequential consistency, puts fences around every shared heap access
concur/taa/cycle Uses sequential consistency, uses our most aggressive analysis
Joint work with Amir Kamil and Jimmy Su
53
Dense Matrix Vector Multiply
  • Non-blocking array copy optimization applied
  • The strongest analysis is necessary; other SC
    implementations suffer relative to relaxed

Joint work with Amir Kamil and Jimmy Su
54
Sparse Matrix Vector Multiply
  • Inspector/executor optimization applied
  • The strongest analysis is again necessary and
    sufficient

Joint work with Amir Kamil and Jimmy Su
55
Portability of Titanium and UPC
  • Titanium and the Berkeley UPC translator use a
    similar model
  • Source-to-source translator (generate ISO C)
  • Runtime layer implements global pointers, etc
  • Common communication layer (GASNet)
  • Both run on most PCs, SMPs, clusters, and
    supercomputers
  • Supported operating systems:
  • Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris,
    Cygwin, MacOSX, Unicos, SuperUX
  • The UPC translator is somewhat less portable; we provide
    an HTTP-based compile server
  • Supported CPUs
  • x86, Itanium, Alpha, Sparc, PowerPC, PA-RISC,
    Opteron
  • GASNet communication
  • Myrinet GM, Quadrics Elan, Mellanox Infiniband
    VAPI, IBM LAPI, Cray X1, SGI Altix, Cray/SGI
    SHMEM, and (for portability) MPI and UDP
  • Specific supercomputer platforms
  • HP AlphaServer, Cray X1, IBM SP, NEC SX-6,
    Cluster X (Big Mac), SGI Altix 3000
  • Underway: Cray XT3, BG/L (both run over MPI)
  • Can be mixed with MPI, C/C++, Fortran

Also used by gcc/upc
Joint work with Titanium and UPC groups
56
Portability of PGAS Languages
  • Other compilers also exist for PGAS Languages
  • UPC
  • Gcc/UPC by Intrepid runs on GASNet
  • HP UPC for AlphaServers, clusters,
  • MTU UPC uses HP compiler on MPI (source to
    source)
  • Cray UPC
  • Co-Array Fortran
  • Cray CAF Compiler: X1, X1E
  • Rice CAF Compiler (on ARMCI or GASNet), John
    Mellor-Crummey
  • Source to source
  • Processors: Pentium, Itanium2, Alpha, MIPS
  • Networks: Myrinet, Quadrics, Altix, Origin,
    Ethernet
  • OS: Linux32 RedHat, IRIX, Tru64
  • N.B.: source-to-source requires cooperation from
    back-end compilers

57
Summary
  • PGAS languages offer a productivity advantage
  • Order-of-magnitude reduction in line counts for
    grid-based code in Titanium
  • Push decisions about whether to pack into the runtime
    for portability (an advantage of a language with a
    translator vs. a library approach)
  • Significant work in compiler can make programming
    easier
  • PGAS languages offer performance advantages
  • Good match to RDMA support in networks
  • Smaller messages may be faster:
  • they make better use of the network and postpone the
    bisection-bandwidth pain
  • they can also prevent cache thrashing from packing
  • Have locality advantages that may help even SMPs
  • Source-to-source translation:
  • The way to ubiquity
  • Complements highly tuned machine-specific compilers

58
End of Slides
59
Productizing BUPC
  • Recent Berkeley UPC release
  • Supports the full 1.2 language spec
  • Supports collectives (tuning ongoing) and memory-model
    compliance
  • Supports UPC I/O (naïve reference implementation)
  • Large effort in quality assurance and robustness
  • Test suite: 600 tests run nightly on 20
    platform configs
  • Tests correct compilation and execution of UPC and
    GASNet
  • >30,000 UPC compilations and >20,000 UPC test
    runs per night
  • Online reporting of results; hookup with the bug
    database
  • Test suite infrastructure extended to support any
    UPC compiler
  • now running nightly with GCC/UPC + UPCR
  • also supports HP-UPC, Cray UPC, ...
  • Online bug-reporting database
  • Over 1100 reports since Jan '03
  • >90% fixed (excluding enhancement requests)

60
NAS FT UPC Non-blocking MFlops
  • The Berkeley UPC compiler supports non-blocking UPC
    extensions
  • These produce a 15-45% speedup over the best blocking
    UPC version
  • The non-blocking version requires about 30 extra
    lines of UPC code

61
Benchmarking
  • Next few UPC and MPI application benchmarks use
    the following systems
  • Myrinet: Myrinet 2000 PCI64B, P4-Xeon 2.2 GHz
  • InfiniBand: IB Mellanox Cougar 4X HCA, Opteron
    2.2 GHz
  • Elan3: Quadrics QsNet1, Alpha 1 GHz
  • Elan4: Quadrics QsNet2, Itanium2 1.4 GHz

62
PGAS Languages Key to High Performance
  • One way to gain acceptance of a new language
  • Make it run faster than anything else
  • Keys to high performance
  • Parallelism
  • Scaling the number of processors
  • Maximize single node performance
  • Generate friendly code or use tuned libraries
    (BLAS, FFTW, etc.)
  • Avoid (unnecessary) communication cost
  • Latency, bandwidth, overhead
  • Avoid unnecessary delays due to dependencies
  • Load balance
  • Pipeline algorithmic dependencies

63
Hardware Latency
  • Network latency is not expected to improve
    significantly
  • Overlapping communication automatically (Chen)
  • Overlapping manually in the UPC applications
    (Husbands, Welcome, Bell, Nishtala)
  • Language support for overlap (Bonachea)

64
Effective Latency
  • Communication wait time also comes from other factors:
  • Algorithmic dependencies
  • Use finer-grained parallelism, pipeline tasks
    (Husbands)
  • Communication bandwidth bottleneck
  • Message time is Latency + Size × (1/Bandwidth)
  • Too much aggregation hurts: you end up waiting on the
    bandwidth term (see the example below)
  • De-aggregation optimization: automatic (Iancu)
  • Bisection bandwidth bottlenecks
  • Spread communication throughout the computation
    (Bell)
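An illustrative calculation with assumed numbers (not measurements): with 10 µs
latency and 1 GB/s bandwidth, an 8 KB message costs about 10 µs + 8 µs = 18 µs,
while a 1 MB aggregate costs about 10 µs + 1000 µs ≈ 1 ms. Beyond a few tens of
KB the bandwidth term dominates, so further aggregation amortizes little
additional latency but delays when the first bytes arrive.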

65
Fine-grained UPC vs. Bulk-Synch MPI
  • How to waste money on supercomputers:
  • Pack all communication into single message (spend
    memory bandwidth)
  • Save all communication until the last one is
    ready (add effective latency)
  • Send all at once (spend bisection bandwidth)
  • Or, to use what you have efficiently:
  • Avoid long wait times: send early and often
  • Use all the wires, all the time
  • This requires having low overhead!

66
What You Won't Hear Much About
  • Compiler/runtime/GASNet bug fixes, performance
    tuning, testing, ...
  • >13,000 e-mail messages regarding CVS check-ins
  • Nightly regression testing
  • 25 platforms, 3 compilers (head, opt-branch,
    gcc-upc),
  • Bug reporting
  • 1177 bug reports, 1027 fixed
  • Release scheduled for later this summer
  • Beta is available
  • Process significantly streamlined

67
Take-Home Messages
  • Titanium offers tremendous gains in productivity
  • High-level, domain-specific array abstractions
  • Titanium is being used for real applications
  • Not just toy problems
  • Titanium and UPC are both highly portable
  • Run on essentially any machine
  • Rigorously tested and supported
  • PGAS Languages are Faster than two-sided MPI
  • Better match to most HPC networks
  • Berkeley UPC and Titanium benchmarks
  • Designed from scratch with a one-sided PGAS model
  • Focus on 2 scalability challenges: AMR and sparse
    LU

68
Titanium Background
  • Based on Java, a cleaner C++
  • Classes, automatic memory management, etc.
  • Compiled to C and then machine code, no JVM
  • Same parallelism model as UPC and CAF
  • SPMD parallelism
  • Dynamic Java threads are not supported
  • Optimizing compiler
  • Analyzes global synchronization
  • Optimizes pointers, communication, memory

69
Do these Features Yield Productivity?
Joint work with Kaushik Datta, Dan Bonachea
70
GASNet/X1 Performance
[Charts: single-word get and single-word put latency]
  • GASNet/X1 improves small message performance over
    shmem and MPI
  • Leverages global pointers on X1
  • Highlights advantage of languages vs. library
    approach

Joint work with Christian Bell, Wei Chen and Dan
Bonachea
71
High Level Optimizations in Titanium
  • Irregular communication can be expensive
  • The best strategy differs by data size/distribution
    and machine parameters
  • E.g., packing, sending bounding boxes, or
    fine-grained accesses
  • Use of runtime optimizations
  • Inspector-executor
  • Performance on Sparse MatVec Mult
  • Results: the best strategy differs within the machine
    on a single matrix (~50% better)

Speedup relative to MPI code (Aztec library)
Average and maximum speedup of the Titanium
version relative to the Aztec version on 1 to 16
processors
Joint work with Jimmy Su
72
Source to Source Strategy
  • Source-to-source translation strategy
  • Tremendous portability advantage
  • Still can perform significant optimizations
  • Relies on high-quality back-end compilers and
    some coaxing in code generation (see the sketch below)

[Chart annotation: 48x]
  • Use of restrict pointers in C
  • Understanding of multi-D array indexing (Intel/Itanium
    issue)
  • Support for pragmas like IVDEP
  • Robust vectorizers (X1, SSE, NEC, ...)
  • On machines with integrated shared-memory
    hardware, need access to shared-memory operations
Joint work with Jimmy Su