Learn more at: https://upc.lbl.gov
1
NERSC/LBNL UPC CompilerStatus Report
  • Costin Iancu
  • and
  • the UCB/LBL UPC group

2
UPC Compiler Status Report
  • Current Status
  • UPC-to-C translator implemented in Open64.
    Compliant with rev 1.0 of the UPC spec.
  • Translates the GWU test suite and test
    programs from Intrepid.

3
UPC Compiler Future Work
(Diagram: Open64 phase pipeline — VHO, inline, IPA, PREOPT, LNO, WOPT, RVI1)
4
UPC Compiler Future Work
  • Integrate with GasNet and the UPC runtime
  • Test runtime and translator (32/64 bit)
  • Investigate interaction between translator and
    optimization packages (legal C code)
  • UPC specific optimizations
  • Open64 code generator

(Diagram: Open64 phase pipeline — VHO, inline, IPA, PREOPT, LNO, WOPT, RVI1)
5
UPC Optimizations - Problems
  • Shared pointer is a logical tuple (addr, thread,
    phase):
  • struct { void *addr; int thread; int phase; }
  • Expensive pointer arithmetic and address
    generation:
  • p+i -> p.phase = (p.phase + i) % B;
  •        p.thread = (p.thread + (p.phase + i) / B) % T;
  • Parallelism expressed by forall and the affinity test
  • Overhead of fine-grained communication can become
    prohibitive
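A minimal runnable sketch of this pointer-increment arithmetic, assuming block size B and thread count T as compile-time constants (the struct layout and the function name `sptr_add` are illustrative, not the actual runtime's):

```c
#include <assert.h>

#define B 4   /* block size (assumed for illustration) */
#define T 3   /* number of threads (assumed) */

/* Logical view of a UPC shared pointer: (addr, thread, phase). */
struct shared_ptr {
    void *addr;
    int  thread;
    int  phase;
};

/* p + i: the per-increment div/mod arithmetic the slide calls expensive.
   phase'  = (phase + i) % B
   thread' = (thread + (phase + i) / B) % T  */
struct shared_ptr sptr_add(struct shared_ptr p, int i)
{
    int sum  = p.phase + i;
    p.thread = (p.thread + sum / B) % T;
    p.phase  = sum % B;
    return p;
}
```

For example, with B=4 and T=3, advancing a pointer at phase 2 by 5 elements crosses one block boundary: the phase becomes 3 and the thread advances by 1.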

6
Translated UPC Code
  • #include <upc.h>
  • shared float *a, *b;
  • int main() {
  •   int i, k;
  •   upc_forall(k=7; k<234; k++; &a[k])
  •     upc_forall(i=0; i<1000; i++; 333)
  •       a[k] = b[k+1];
  • }

7
UPC Optimizations
  • Generic scalar and loop optimizations
    (unrolling, pipelining)
  • Address generation optimizations:
    • Eliminate run-time tests
    • Table lookup / basis vectors
    • Simplify pointer/address arithmetic
    • Address component reuse
    • Localization
  • Communication optimizations:
    • Vectorization
    • Message combination
    • Message pipelining
    • Prefetching for irregular data accesses

8
Run-Time Test Elimination
  • Problem: find the sequence of local memory locations
    that processor P accesses during the computation
  • Well explored in the context of HPF
  • Several techniques proposed for block-cyclic
    distributions:
  • table lookup (Chatterjee, Kennedy)
  • basis vectors (Ramanujam, Thirumalai)
  • UPC layouts (cyclic, pure block, indefinite
    block size) are particular cases of block-cyclic

9
Table-Based Array Address Lookup
  • upc_forall(i=l; i<u; i+=s; &a[i])
  •   a[i] = EXP();
  • Table-based lookup (Kennedy):
    compute T, next, start;
    base = start_mem; i = start_offset;
    while (base < end_mem) {
      *base = EXP();
      base += T[i];
      i = next[i];
    }
10
Array Address Lookup
  • Encouraging results: speedups between 50-200%
    versus run-time resolution
  • Lookup has a time vs. space tradeoff; Kennedy
    introduces a demand-driven technique
  • UPC arrays are simpler than HPF arrays
  • UPC language restriction: no aliasing between
    pointers with different block sizes
  • Existing HPF techniques also applicable to UPC
    pointer-based programs

11
Address Arithmetic Simplification
  • Address Components Reuse
  • Idea: view shared pointers as three separate
    components (A, T, P) = (addr, thread, phase)
  • Exploit the implicit reuse of the thread and
    phase fields
  • Pointer Localization
  • Determine which accesses can be performed using
    local pointers
  • Optimize for indefinite block size
  • Requires heap analysis/LQI and a dependency
    analysis similar to the lookup techniques

12
Communication Optimizations
  • Message Vectorization: hoist and prefetch an
    array slice
  • Message Combination: combine messages with the
    same target processor into a larger message
  • Communication Pipelining: separate the
    initiation of a communication operation from its
    completion and overlap communication with
    computation
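Message combination is the easiest of the three to show in isolation. The sketch below packs fine-grained (target, value) puts into one buffer per target thread, so THREADS combined messages replace N small ones; there is no real transport here, and all names (`combined_msg`, `combine`) are illustrative:

```c
#include <assert.h>

enum { THREADS = 4, N = 16 };

/* One combined message: everything headed for a single target thread. */
typedef struct {
    int target;
    int nvals;
    int vals[N];
} combined_msg;

/* Pack each fine-grained (target, value) pair into its target's buffer. */
static void combine(const int targets[], const int vals[], int n,
                    combined_msg out[THREADS])
{
    for (int t = 0; t < THREADS; t++) {
        out[t].target = t;
        out[t].nvals  = 0;
    }
    for (int i = 0; i < n; i++) {
        combined_msg *m = &out[targets[i]];
        m->vals[m->nvals++] = vals[i];
    }
}
```

After combining, one message per target carries all of its values, which is where the latency savings on high-latency transports come from.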

13
Communication Optimizations
  • Some optimizations are complementary
  • Choi & Snyder (Paragon/T3D - PVM/shmem),
    Krishnamurthy (CM-5), Chakrabarti (SP2/NOW)
  • Speedups in the range 10-40%
  • Optimizations are more effective for high-latency
    transport layers: 25% speedup (PVM/NOW) vs. 10%
    speedup (shmem/SP2)

14
Prefetching of Irregular Data Accesses
  • For serial programs: hide cache latency
  • Simpler for parallel programs: hide
    communication latency
  • Irregular data accesses:
  • Array-based programs: a[b[i]]
  • Irregular data structures (pointer based)

15
Prefetching of Irregular Data Accesses
  • Array-based programs:
  • Well-explored topic (inspector-executor,
    Saltz)
  • Irregular data structures:
  • Not very well explored in the context of SPMD
    programs
  • Serial techniques: jump pointers, linearization
    (Mowry)
  • Is there a good case for it?
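A minimal sketch of the jump-pointer technique mentioned above, for a linked list: each node stores a pointer JUMP nodes ahead, so a traversal can issue a prefetch for node i+JUMP while working on node i. The node layout, names, and JUMP distance are illustrative; `__builtin_prefetch` is the GCC/Clang builtin:

```c
#include <assert.h>
#include <stddef.h>

enum { NNODES = 10, JUMP = 4 };

struct node {
    int value;
    struct node *next;
    struct node *jump;   /* points JUMP nodes ahead, or NULL near the tail */
};

/* Link nodes[0..n-1] into a list and install the jump pointers. */
static void build_jump_pointers(struct node nodes[], int n)
{
    for (int i = 0; i < n; i++) {
        nodes[i].value = i;
        nodes[i].next  = (i + 1 < n)    ? &nodes[i + 1]    : NULL;
        nodes[i].jump  = (i + JUMP < n) ? &nodes[i + JUMP] : NULL;
    }
}

/* Traverse, prefetching ahead through the jump pointer. */
static int sum_with_prefetch(struct node *head)
{
    int sum = 0;
    for (struct node *p = head; p; p = p->next) {
        if (p->jump)
            __builtin_prefetch(p->jump);  /* GCC/Clang builtin, hint only */
        sum += p->value;
    }
    return sum;
}
```

In the SPMD setting of the slide the prefetch would be a remote get rather than a cache hint, but the jump-pointer structure is the same.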

16
Conclusions
  • We start with a clean slate
  • Infrastructure for pointer analysis and array
    dependency analysis is already in Open64
  • Communication optimizations and address
    calculation optimizations share common analyses
  • Address calculation optimizations are likely to
    offer better performance improvements at this
    stage

17
The End
18
Address Arithmetic Simplification
  • Address Components Reuse
  • Idea: view shared pointers as three separate
    components (A, T, P) = (addr, thread, phase)
  • Exploit the implicit reuse of the thread and
    phase fields
  • shared [B] float a[N], b[N];
  • upc_forall(i=l; i<u; i+=s; &a[i])
  •   a[i] = b[i+k];

19
Address Component Reuse
(Diagram: blocks of b with boundaries b_i..e_i and offset B-k)
  • a[i] = b[i+k]; a -> (Aa, Ta, Pa), b -> (Ab, Tb, Pb)
  • Within a block: Ta = Tb and Pb = Pa + k
20
Address Component Reuse
  • Ta = 0;
  • for (i = first_block; i < last_block; i += next_block)
  •   for (j = b_i, Pa = 0; j < e_i - k; j++, Pa++)
  •     put(Aa, Ta, Pa, get(Ab, Ta, Pa + k));
  •   for (; j < e_i; j++)
  •     put(Aa, Ta, Pa, get(Ab, Ta + 1, Pa - j));
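The reuse claim behind this loop (for pointers with the same block size, a[i] and b[i+k] keep the same thread and get phase Pa+k as long as Pa+k < B) can be checked directly. B, THREADS, K, and N below are illustrative values, not from the slides:

```c
#include <assert.h>

enum { B = 4, THREADS = 3, K = 2, N = 60 };

/* thread and phase of element i under a block-cyclic [B] layout */
static int thread_of(int i) { return (i / B) % THREADS; }
static int phase_of(int i)  { return i % B; }

/* For every i where b[i+K] stays inside b's current block, the thread
   component is reused unchanged and the phase is just Pa + K, so
   a[i] = b[i+K] needs no div/mod to re-derive (Tb, Pb). */
static int check_reuse(void)
{
    for (int i = 0; i + K < N; i++) {
        int Pa = phase_of(i);
        if (Pa + K < B) {
            if (thread_of(i + K) != thread_of(i)) return 0;
            if (phase_of(i + K)  != Pa + K)       return 0;
        }
    }
    return 1;
}
```

Only the elements that spill past a block boundary (the second inner loop on the slide) need the full recomputation.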