1
UPC Tutorial Part 2
  • Hung-Hsun Su
  • UPC Group
  • 1/30/2006
  • HCS Research Laboratory
  • University of Florida

2
Outline
  • Quick Review
  • Program Examples
  • Extensions
  • Implementations
  • Related Projects
  • Homework

3
Quick Review
  • UPC - Unified Parallel C
  • An explicitly-parallel extension of ANSI C
  • A partitioned global address space (PGAS)
    parallel programming language
  • Provides constructs for (a minimal sketch
    follows this list)
  • Declaration of global shared variables
  • Reading/writing of global shared variables
  • Synchronization (barrier, fence, lock)
  • Work sharing (upc_forall)
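
A minimal sketch pulling these constructs together (the array name and the printed summary are illustrative, not from the slides):

    #include <stdio.h>
    #include <upc_relaxed.h>

    shared int hist[THREADS];        /* declaration of a global shared array */

    int main(void)
    {
      int i;

      /* Work sharing: iteration i runs on the thread i % THREADS */
      upc_forall(i = 0; i < THREADS; i++; i)
        hist[i] = MYTHREAD;          /* write to shared memory */

      upc_barrier;                   /* synchronization: wait for all writes */

      if (MYTHREAD == 0)             /* read shared data, possibly remote */
        for (i = 0; i < THREADS; i++)
          printf("hist[%d] = %d\n", i, hist[i]);

      return 0;
    }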

4
Quick Review
  • UPC is easier to program in than MPI
  • Scales better
  • No need to match sends and receives
  • UPC program performance is comparable to MPI
    (UPC tends to do better for irregular communication)
  • UPC consortium of government, academia, and HPC
    vendors, including
  • ARSC, Compaq, CSC, Cray Inc., Etnus, GWU, HP,
    IBM, IDA CSC, Intrepid Technologies, LBNL, LLNL,
    MTU, NSA, UCB, UMCP, UF, US DoD, US DoE, OSU
  • http://upc.gwu.edu

5
Outline
  • Quick Review
  • Program Examples
  • Extensions
  • Implementations
  • Related Projects
  • Homework

6
UPC Program Examples
  • Shared data allocation
  • Common usage: static allocation of shared
    variables

    #include <upc_relaxed.h>

    shared int a[THREADS][THREADS];
    shared int b[THREADS], c[THREADS];

    void main (void)
    {
      int i, j;
      upc_forall( i = 0; i < THREADS; i++; i)
      {
        c[i] = 0;
        for ( j = 0; j < THREADS; j++)
          c[i] += a[i][j] * b[j];
      }
    }

7
UPC Program Examples
  • Work sharing: upc_forall
  • Common usage: direct each thread to operate on
    the local portion of shared arrays (read/write)

    #include <upc_relaxed.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], v1plusv2[N];

    void main()
    {
      int i;
      upc_forall(i = 0; i < N; i++; i)
        v1plusv2[i] = v1[i] + v2[i];
    }

Integer affinity expression i: a thread executes an iteration only if (MYTHREAD == i % THREADS)
8
UPC Program Examples
  • Shared pointers
  • Common usage: provide an alternative way to access
    shared data

    #include <upc_relaxed.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], v1plusv2[N];

    void main()
    {
      int i;
      shared int *p1, *p2;
      p1 = v1; p2 = v2;
      upc_forall(i = 0; i < N; i++, p1++, p2++; i)
        v1plusv2[i] = *p1 + *p2;
    }

The four pointer combinations (where the pointer lives / what it points to): private-to-private: local access; private-to-shared: used most often; shared-to-private: no real use; shared-to-shared: useful only in rare cases (declarations sketched below).
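
A minimal sketch of the four declarations (variable names are illustrative):

    int *pp;                /* private pointer to private data: local access   */
    shared int *ps;         /* private pointer to shared data: used most often */
    int *shared sp;         /* shared pointer to private data: no real use     */
    shared int *shared ss;  /* shared pointer to shared data: rare cases only  */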
9
UPC Program Examples
  • Dynamic memory management
  • Common usage: reduces the overall system memory
    requirement; tradeoff: increased scalability vs.
    reduced performance
  • Synchronization
  • Common usage: coordinate activities across
    threads (make sure that two threads see the same
    set of data)
  • Usage greatly affects the overall performance of
    the program (a minimal sketch combining both
    follows)
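
A minimal sketch combining collective dynamic allocation with barrier and lock synchronization (the block size, array contents, and the summation are illustrative assumptions, not from the slides):

    #include <upc_relaxed.h>

    #define BLOCK 100

    shared [BLOCK] int *data;   /* private pointer to dynamically allocated shared data */
    upc_lock_t *sum_lock;       /* lock protecting the shared total */
    shared int total;

    int main(void)
    {
      int i, local_sum = 0;

      /* Collective allocation: BLOCK ints with affinity to each thread */
      data = (shared [BLOCK] int *) upc_all_alloc(THREADS, BLOCK * sizeof(int));
      sum_lock = upc_all_lock_alloc();

      /* Each thread initializes the elements it has affinity to */
      upc_forall(i = 0; i < BLOCK * THREADS; i++; &data[i])
        data[i] = i;

      upc_barrier;              /* make sure all initializations are visible */

      upc_forall(i = 0; i < BLOCK * THREADS; i++; &data[i])
        local_sum += data[i];

      upc_lock(sum_lock);       /* serialize updates to the shared total */
      total += local_sum;
      upc_unlock(sum_lock);

      upc_barrier;
      if (MYTHREAD == 0)
        upc_free(data);
      return 0;
    }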

10
Outline
  • Quick Review
  • Program Examples
  • Extensions
  • Implementations
  • Related Projects
  • Homework

11
UPC Extensions
  • Not part of the UPC specification
  • Optimized library containing useful functions
  • Collective
  • upc_all_broadcast
  • upc_all_scatter, upc_all_gather,
    upc_all_gather_all
  • upc_all_exchange, upc_all_permute
  • upc_all_reduce, upc_all_prefix_reduce
  • Different subsets are implemented by different
    compilers (a broadcast sketch follows this list)
  • Parallel I/O
  • upc_all_fopen, upc_all_fclose, upc_all_fread,
    upc_all_fwrite, etc.
  • Still at the evaluation stage; not fully supported
    by any UPC compiler
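
A minimal sketch of upc_all_broadcast (the array names, sizes, and the choice of synchronization flags are illustrative):

    #include <upc_relaxed.h>
    #include <upc_collective.h>

    #define NELEMS 4

    shared [] int src[NELEMS];                  /* source data, all on thread 0   */
    shared [NELEMS] int dst[NELEMS * THREADS];  /* one block of NELEMS per thread */

    int main(void)
    {
      int i;

      if (MYTHREAD == 0)
        for (i = 0; i < NELEMS; i++)
          src[i] = i + 1;

      /* Copy thread 0's NELEMS ints into the block of dst owned by every thread; */
      /* the ALLSYNC flags make the collective synchronize on entry and exit.     */
      upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                        UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

      /* Each thread now holds its own copy in dst[MYTHREAD*NELEMS .. +NELEMS-1] */
      return 0;
    }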

12
Outline
  • Quick Review
  • Program Examples
  • Extensions
  • Implementations
  • Related Projects
  • Homework

13
UPC Implementations
  • Many UPC implementations are available
  • Cray X1, X1E
  • HP AlphaServer SC and Linux Itanium (Superdome)
    systems
  • IBM BlueGene and AIX
  • Intrepid GCC: SGI IRIX, Cray T3D/E, Linux Itanium
    and x86/x86-64 SMPs
  • Michigan MuPC: reference implementation
  • Berkeley UPC Compiler: just about everything else

14
UPC Implementations
  • Two approaches
  • True compiler (UPC code -> UPC object code): SGI
  • Runtime library (UPC code -> C code with calls to a
    runtime library -> C object code): the rest
  • SMP machines (HP, Cray, SGI)
  • Take advantage of the HW support for fast global
    memory reads/writes
  • Most shared variable accesses become local accesses
  • Michigan UPC
  • Reference implementation that uses MPI for data
    exchange and synchronization on clusters

15
UPC Implementations
  • Berkeley UPC
  • Data exchange and synchronization through the
    Global-Address Space Networking (GASNet)
    communication library
  • Each supported high-speed interconnect (Quadrics,
    InfiniBand, Myrinet, SCI, etc.) has its own GASNet
    implementation
  • Two layers
  • Core: based on Active Messages (AM); takes
    advantage of the interconnect HW to provide fast
    message exchange
  • Extended: provides optimized put/get, either
    through the Core layer or by direct use of the
    interconnect HW

16
Outline
  • Quick Review
  • Program Examples
  • Extensions
  • Implementations
  • Related Projects
  • Homework

17
UPC Related Projects (other than compiler)
  • GWU
  • Test suite
  • Benchmarking
  • MTU
  • UPC runtime performance analytical modeling
  • Army High Performance Computing Research Center
  • Applications - CFD
  • Etnus
  • UPC debugger
  • UF
  • UPC/SHMEM performance analysis tool (PAT)

18
Motivation for PAT development
  • Recap: UPC optimization techniques
  • Space privatization: use pointers-to-local
    instead of pointers-to-shared when dealing with
    local shared data (removes an extra level of
    pointer translation)
  • Block moves: use block copies (upc_memcpy, string
    operations, or structures) instead of copying
    elements one by one in a loop (minimizes remote
    access overhead); both techniques are sketched
    after this list
  • Latency hiding: overlap remote accesses with
    local processing using split-phase barriers (good
    in theory but tough to implement efficiently, so
    a PAT is needed)
  • Data layout: strive to minimize remote data
    accesses by keeping data close to the computation
    (layout affects performance greatly; a PAT can
    help determine the best layout)
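
A minimal sketch of the first two techniques, space privatization and a upc_memcpy block move (the array sizes and the pull from thread 1 are illustrative assumptions):

    #include <upc_relaxed.h>

    #define N 100

    shared [N] int a[N * THREADS];   /* N elements with affinity to each thread */
    shared [N] int b[N * THREADS];

    int main(void)
    {
      int i;

      /* Space privatization: access this thread's own block through a     */
      /* plain C pointer, avoiding pointer-to-shared translation overhead. */
      int *local_a = (int *) &a[MYTHREAD * N];
      for (i = 0; i < N; i++)
        local_a[i] = i;

      upc_barrier;

      /* Block move: one upc_memcpy instead of N single-element remote     */
      /* accesses; thread 0 pulls thread 1's block of a into its own block */
      /* of b.                                                             */
      if (MYTHREAD == 0 && THREADS > 1)
        upc_memcpy(&b[0], &a[N], N * sizeof(int));

      return 0;
    }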

19
Example: data placement

    shared int a[THREADS][THREADS];
    shared int b[THREADS], c[THREADS];

Original distribution

    shared [THREADS] int a[THREADS][THREADS];
    shared int b[THREADS], c[THREADS];

Better distribution: with a block size of THREADS, row i
of a has affinity to thread i, so the a[i][j] accesses in
the matrix-vector multiply of slide 6 require no remote
access.
20
Need for Performance Analysis
  • Performance analysis of sequential applications
    can be challenging
  • Performance analysis of explicitly communicating
    parallel applications is significantly more
    difficult
  • Mainly due to the increase in the number of
    processing nodes
  • Performance analysis of implicitly communicating
    parallel applications is even more difficult
  • Non-blocking, one-sided communication is tricky
    to track and analyze accurately

21
Performance analysis
  • Instrumentation: user-assisted or automatic
    insertion of instrumentation code
  • Measurement: the actual measuring stage
  • Analysis: filtering, aggregation, and analysis of
    the gathered data
  • Presentation: display of the analyzed data to the
    user; deals directly with the user
  • Optimization: the process of finding and resolving
    bottlenecks

22
UF UPC/SHMEM PAT Framework
23
Outline
  • Quick Review
  • Program Examples
  • Extensions
  • Implementations
  • Related Projects
  • Homework

24
Homework
  • Running a UPC program: http://docs.hcs.ufl.edu/faq/PCA
  • Implement the bubble sort algorithm in UPC for an
    arbitrarily sized data set (for simplicity and
    efficiency, use N*THREADS elements). Optimize the
    code as much as possible (hint: maximize work
    distribution at the various stages) and discuss
    these optimization strategies.
  • Compare and discuss the execution time (for 1, 2, 4,
    and 8 nodes/threads) and the programmability of the
    MPI and UPC versions

25
References
  • UPC specification v1.2:
    http://www.gwu.edu/upc/docs/upc_specs_1.2.pdf
  • UPC tutorial:
    http://upc.gwu.edu/downloads/upc_tut04.pdf