OpenMP for Networks of SMPs Y. Charlie Hu, Honghui Lu, Alan L. Cox, Willy Zwaenepoel


1
OpenMP for Networks of SMPs
Y. Charlie Hu, Honghui Lu, Alan L. Cox, Willy Zwaenepoel
  • ECE1747 Parallel Programming
  • Vicky Tsang

2
Background
  • Published in the Journal of Parallel and
    Distributed Computing, vol. 60 (12), pp.
    1512-1530, December 2000
  • Work to further improve TreadMarks
  • Presents an alternative solution to MPI

3
Roadmap
  • Motivation
  • Solution
  • OpenMP API
  • TreadMarks
  • OpenMP Translator
  • Performance Measurement
  • Results
  • Conclusion

4
Motivation
  • To enable the programmer to rely on a single,
    standard, shared-memory API for parallelization
    within and between multiprocessors.
  • To provide another standard other than MPI?

5
Solution
  • Presents the first system that implements OpenMP
    on a network of shared-memory multiprocessors
  • Implemented via a translator converting OpenMP
    directives to calls in modified TreadMarks
  • Modified TreadMarks uses POSIX threads for
    parallelism within an SMP node

6
Solution
  • Original version of TreadMarks
  • A Unix process was executed on each processor of
    the multiprocessor node and communication between
    processes was achieved through message passing
  • Fails to take advantage of hardware shared memory

7
Solution
  • Modified version of TreadMarks
  • POSIX threads used to implement parallelism
  • OpenMP threads within a multiprocessor share a
    single address space
  • Positive
  • Reduces the number of changes to TreadMarks to
    support multithreading on a multiprocessor
  • OS maintains the coherence of page mappings
    automatically
  • Negative
  • More difficult to provide uniform sharing of
    memory between threads on the same node and
    threads on different nodes

8
OpenMP API
  • Three kinds of directives
  • Parallelism/work sharing
  • Data environment
  • Synchronization
  • Based on a fork-join model
  • Sequential code sections executed by master
    thread
  • Parallel code sections are executed by all
    threads, including the master thread

9
OpenMP API
  • Parallel directive: all threads perform the same
    computation
  • Work-sharing directive: computation is divided
    among the threads
  • Data environment directives: control the sharing
    of program variables
  • Synchronization directives: control the
    synchronization between threads
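
The fork-join model and the work-sharing idea above can be sketched in a few lines. OpenMP itself targets C, C++ and Fortran, so the Python below is only an analogy: `block_range` and `parallel_sum` are hypothetical names, and the static block partition is one possible schedule, not necessarily the one used in the paper.

```python
import threading


def block_range(n, num_threads, tid):
    """Contiguous slice of range(n) given to thread `tid` (static block schedule)."""
    base, extra = divmod(n, num_threads)
    start = tid * base + min(tid, extra)
    stop = start + base + (1 if tid < extra else 0)
    return range(start, stop)


def parallel_sum(data, num_threads=4):
    """Sum `data` with a fork-join region; thread 0 plays the master."""
    # One slot per thread, so each partial sum is effectively private.
    partial = [0] * num_threads

    def body(tid):
        # Parallel region: every thread runs this same function.
        # Work sharing: each thread only visits its own block of iterations.
        for i in block_range(len(data), num_threads, tid):
            partial[tid] += data[i]

    workers = [threading.Thread(target=body, args=(tid,))
               for tid in range(1, num_threads)]
    for w in workers:          # fork
        w.start()
    body(0)                    # the master thread takes part in the region
    for w in workers:          # join: acts as the barrier ending the region
        w.join()

    # Sequential section again: only the master combines the partial sums.
    return sum(partial)
```

Calling `parallel_sum(list(range(10)), 4)` splits the ten iterations into blocks of 3, 3, 2 and 2 before the master combines the per-thread results.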

10
TreadMarks
  • User-level SDSM system
  • Provides a global shared address space on top of
    physically distributed memories
  • Key functions performed are memory coherence and
    synchronization

11
TreadMarks Memory Coherence
  • Minimize the amount of communication performed to
    maintain memory consistency by
  • a lazy implementation of release consistency
  • reducing the impact of false sharing by allowing
    multiple concurrent writers to modify a page
  • Propagation of consistency information is
    postponed until the time of an acquire
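
A common way to let several writers modify one page concurrently is to keep an unmodified copy (a "twin") of the page and later ship only the changed bytes (a "diff"). The slide does not spell out this mechanism, so the sketch below is an assumption about how such a multiple-writer protocol can work. The tiny page size and all function names are illustrative only, not TreadMarks code.

```python
PAGE_SIZE = 8  # illustrative; real pages are typically several kilobytes


def make_twin(page):
    """Snapshot a page before its first write."""
    return bytes(page)


def make_diff(twin, page):
    """Record only the offsets whose contents changed since the twin was made."""
    return {i: page[i] for i in range(len(page)) if page[i] != twin[i]}


def apply_diff(page, diff):
    """Patch a page copy with another writer's modifications."""
    for offset, value in diff.items():
        page[offset] = value


def merge_concurrent_writes():
    """Two writers touch different bytes of the same page (false sharing)."""
    home = bytearray(PAGE_SIZE)

    copy_a = bytearray(home)
    twin_a = make_twin(copy_a)
    copy_b = bytearray(home)
    twin_b = make_twin(copy_b)

    copy_a[0] = 1   # writer A modifies the first byte
    copy_b[7] = 2   # writer B modifies the last byte, concurrently

    diff_a = make_diff(twin_a, copy_a)
    diff_b = make_diff(twin_b, copy_b)

    # In a lazy scheme the diffs would only be fetched and applied when
    # another processor performs an acquire; here they are applied directly.
    apply_diff(home, diff_a)
    apply_diff(home, diff_b)
    return home, diff_a, diff_b
```

Because each diff carries only the bytes its writer changed, the two updates merge without either writer overwriting the other, and far less than a full page needs to be sent.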

12
TreadMarks - Synchronization
  • Barrier implemented as acquire and release
    messages
  • Governed by a centralized manager
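
As a rough illustration of a centrally managed barrier, the sketch below models each arrival as a message to a manager and each departure as a reply from it. Threads and queues stand in for processes and network messages, and `run_barrier` and its helpers are hypothetical names, not TreadMarks functions.

```python
import queue
import threading


def run_barrier(num_clients):
    """Run one barrier episode and return the order of logged events."""
    arrivals = queue.Queue()                                   # client -> manager
    departures = [queue.Queue() for _ in range(num_clients)]   # manager -> client
    log = []
    log_lock = threading.Lock()

    def manager():
        # Collect one arrival message from every client ...
        waiting = [arrivals.get() for _ in range(num_clients)]
        # ... and only then send each client its departure message.
        for cid in waiting:
            departures[cid].put("go")

    def client(cid):
        with log_lock:
            log.append(("before", cid))
        arrivals.put(cid)        # arrival: plays the role of the release message
        departures[cid].get()    # departure: plays the role of the acquire
        with log_lock:
            log.append(("after", cid))

    threads = [threading.Thread(target=manager)]
    threads += [threading.Thread(target=client, args=(c,))
                for c in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Every "before" entry in the returned log precedes every "after" entry, since no client can leave until the manager has heard from all of them.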

13
TreadMarks Modifications for OpenMP
  • Inclusion of two primitives
  • Tmk_fork
  • Tmk_join
  • All threads are created at the start of a program's
    execution to minimize overhead.
  • Slave threads are blocked during sequential
    execution until the next Tmk_fork is issued by
    the master thread.
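
The behaviour described above (threads created once, slaves blocked until the master forks) can be sketched with a condition variable. This is a minimal illustration under stated assumptions: `ForkJoinPool`, `fork`, `join` and `shutdown` are hypothetical names that only mimic the roles of Tmk_fork and Tmk_join, and their real signatures are not given in the slides.

```python
import threading


class ForkJoinPool:
    """Slave threads are created once and sleep between parallel regions."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self._cond = threading.Condition()
        self._generation = 0      # incremented on every fork
        self._task = None
        self._pending = 0         # slaves still running the current region
        self._stop = False
        self._slaves = [threading.Thread(target=self._slave, args=(tid,))
                        for tid in range(1, num_threads)]
        for s in self._slaves:    # all threads are created up front
            s.start()

    def _slave(self, tid):
        seen = 0
        while True:
            with self._cond:
                # Blocked during sequential execution until the next fork.
                self._cond.wait_for(
                    lambda: self._generation != seen or self._stop)
                if self._stop:
                    return
                seen = self._generation
                task = self._task
            task(tid)
            with self._cond:
                self._pending -= 1
                self._cond.notify_all()

    def fork(self, task):
        """Master starts a parallel region (role of Tmk_fork)."""
        with self._cond:
            self._task = task
            self._pending = self.num_threads - 1
            self._generation += 1
            self._cond.notify_all()

    def join(self):
        """Master waits for all slaves to finish (role of Tmk_join)."""
        with self._cond:
            self._cond.wait_for(lambda: self._pending == 0)

    def shutdown(self):
        """Call only after join; lets the slave threads exit."""
        with self._cond:
            self._stop = True
            self._cond.notify_all()
        for s in self._slaves:
            s.join()
```

The master calls `fork(region)`, runs `region(0)` itself, then calls `join()`. The same threads are reused for each later region, so thread creation is paid for only once.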

14
TreadMarks Modifications for Networks of
Multiprocessors
  • POSIX threads share data between processors by
    default, so some data structures, such as message
    buffers, were placed in thread-private memory to
    keep them private to a thread.
  • A per-page mutex was added to allow greater
    concurrency in the page fault handler.
  • Synchronization functions in TreadMarks were
    modified to use POSIX thread-based
    synchronization between processors within a node
    and existing TreadMarks synchronization functions
    between nodes.
  • A second mapping was added for the memory that is
    shared between nodes so shared-memory pages can
    be updated while the first mapping remains
    invalid until the update is complete. This
    reduces the number of page protection operations
    performed by TreadMarks.
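
The two-level synchronization described above can be sketched as a hierarchical barrier: threads on one node meet at a local barrier, and a single representative per node then takes part in the inter-node barrier. In this sketch Python barriers stand in for both the POSIX thread primitives and the TreadMarks messages. `HierarchicalBarrier` and `demo` are hypothetical names.

```python
import threading


class HierarchicalBarrier:
    """Local barrier per node plus one barrier across node representatives."""

    def __init__(self, nodes, threads_per_node):
        self._local = [threading.Barrier(threads_per_node)
                       for _ in range(nodes)]
        self._between_nodes = threading.Barrier(nodes)

    def wait(self, node):
        # Phase 1: all threads of this node arrive (intra-node, shared memory).
        index = self._local[node].wait()
        # Phase 2: exactly one thread per node synchronizes across nodes.
        if index == 0:
            self._between_nodes.wait()
        # Phase 3: the node's threads leave only after its representative
        # has returned from the inter-node barrier.
        self._local[node].wait()


def demo(nodes=2, threads_per_node=2):
    """Each thread reports how many arrivals it can see after the barrier."""
    barrier = HierarchicalBarrier(nodes, threads_per_node)
    arrived, seen = [], []
    lock = threading.Lock()

    def worker(node):
        with lock:
            arrived.append(node)
        barrier.wait(node)
        with lock:
            seen.append(len(arrived))

    threads = [threading.Thread(target=worker, args=(n,))
               for n in range(nodes) for _ in range(threads_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

With this structure only one message exchange per node is needed for each barrier, regardless of how many threads run on the node.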

15
OpenMP Translator
  • Synchronization directives translate directly to
    TreadMarks synchronization operations.
  • The compiler translates the code sections marked
    with parallel directives to fork-join code.
  • Data environment directives implemented to work
    with both TreadMarks and POSIX threads, hiding
    the interface issues from the programmer.

16
Performance Measurement
  • Platform
  • IBM SP2 consisting of four SMP nodes
  • Per node
  • Four IBM PowerPC 604 processors
  • 1 GB memory
  • Running AIX 4.2

17
Performance Measurement
  • Applications
  • SPLASH-2 Barnes-Hut
  • NAS 3D-FFT
  • SPLASH-2 CLU
  • SPLASH-2 Water
  • Red-Black SOR
  • TSP
  • Modified Gram-Schmidt (MGS)

18
Results

19
Results

20
Results

21
Results

22
Conclusion
  • Enables the programmer to rely on a single,
    standard, shared-memory API for parallelization
    within and between multiprocessors.
  • Using hardware shared memory within a node reduced
    the amount of data and the number of messages
    transmitted.
  • The speedups of multithreaded TreadMarks codes on
    four four-way SMP SP2 nodes are within 7-30% of
    the MPI versions.

23
Critique
  • Solution allows easier implementation of program
    parallelization across multiprocessors if speedup
    is not crucial
  • OpenMP is easier on the programmer but speedup
    still not as good as MPI

24
Critique
  • Issues
  • AIX has inefficient implementation of page
    protection
  • Paper claims that every other brand of Unix,
    including Linux, uses data structures that handle
    mprotect operations more efficiently
  • Why wasn't the solution implemented on another
    platform?
  • Paper failed to present a big motivation for
    using this solution over MPI.

25
Thank You