Transcript and Presenter's Notes

Title: Computational Methods in Physics PHYS 3437


1
Computational Methods in Physics PHYS 3437
  • Dr. Rob Thacker
  • Dept. of Astronomy & Physics (MM-301C)
  • thacker@ap.smu.ca

2
Today's Lecture
  • Introduction to parallel programming
  • Concepts: what are parallel computers, what is
    parallel programming?
  • Why do you need to use parallel programming?
  • When parallelism will be beneficial
  • Amdahl's Law
  • Very brief introduction to OpenMP

3
Why bother to teach this in an undergrad Physics
course?
  • Because parallel computing is now ubiquitous
  • Most laptops are parallel computers, for example
  • Dual/Quad core chips are already standard, in the
    future we can look forward to 8/16/32/64 cores
    per chip!
  • Actually Sun Microsystems already sells a chip
    with 8 cores
  • I predict that by 2012 you will be buying chips
    with 16 cores
  • If we want to use all this capacity, then we will
    need to run codes that can use more than one CPU
    core at a time
  • Such codes are said to be parallel
  • Exposure to these concepts will help you
    significantly if you want to go to grad school in
    an area that uses computational methods
    extensively
  • Because not many people have these skills!
  • If you are interested, an excellent essay on how
    computing is changing can be found here
  • http://view.eecs.berkeley.edu/wiki/Main_Page

4
Some caveats
  • In two lectures we cannot cover very much on
    parallel computing
  • We will concentrate on the simplest kind of
    parallel programming
  • Exposes some of the inherent problems
  • Still gives you useful increased performance
  • Remember, making a code run 10 times faster turns
    a week into a day!
  • The type of programming we'll be looking at is
    often limited in terms of the maximum speed-up
    possible, but factors of 10 are pretty common

5
Why can't the compiler just make my code parallel
for me?
  • In some situations it can, but most of the time
    it can't
  • You really are smarter than a compiler is!
  • There are many situations where a compiler will
    not be able to make something parallel but you
    can
  • Compilers that can attempt to parallelize code
    are called auto-parallelizing
  • Some people have suggested writing parallel
    languages that only allow the types of code that
    can be easily parallelized
  • These have proven not to be very popular and are
    too restrictive
  • At present, the most popular way of parallel
    programming is to add additional commands to your
    original code
  • These commands are sometimes called pragmas or
    directives

6
Recap: von Neumann architecture
Machine instructions are encoded in binary and
stored - the key insight!
  • First practical stored-program architecture
  • Still in use today
  • Speed is limited by the bandwidth of data between
    memory and processing unit
  • von Neumann bottleneck

[Diagram: MEMORY (program memory and data memory) connected to the CPU (control unit and process unit), with INPUT and OUTPUT devices]
Developed while working on the EDVAC design
7
Shared memory computers
Program these computers using OpenMP extensions
to C, FORTRAN
[Diagram: four CPUs sharing a single MEMORY over a common bus]
Traditional shared memory design: all processors
share a memory bus. All of the processors see the
same memory locations. This means that
programming these computers is reasonably
straightforward. Sometimes called SMPs, for
symmetric multi-processor.
8
Distributed memory computers
Program these computers using MPI or PVM
extensions to C, FORTRAN
[Diagram: four CPUs, each with its own MEMORY, connected to one another via a NETWORK]
Really a collection of computers linked together
via a network. Each processor has its own memory
and must communicate with other processors over
the network to get information from other memory
locations. This is really quite difficult at
times. This is the architecture of computer
clusters (you could actually have each CPU
here be a shared memory computer).
9
Parallel execution
  • What do we mean by being able to do things in
    parallel?
  • Suppose the input data of an operation is divided
    into a series of independent parts
  • Processing of the parts is carried out
    independently
  • A simple example is operations on vectors/arrays
    where we loop over array indices (see the sketch
    below)

[Diagram: an array A(i) split into chunks, with each chunk handled by a different processor]
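As an illustrative sketch (not from the original slides), a loop of this
form has completely independent iterations, so different index ranges
could be handled by different processors:

      do i = 1, n
        a(i) = 2.0*a(i) + 1.0
      end do

Each iteration touches only a(i), so one processor could handle
i = 1 to n/2 while another handles i = n/2+1 to n.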
10
Some subtleties
  • However, you can't always do this
  • Consider
  • do i=2,n
  •   a(i) = a(i-1)
  • end do
  • This kind of loop has what we call a dependence
  • If you update a value of a(i) before a(i-1) has
    been updated then you will get the wrong answer
    compared to running on a single processor
  • We'll talk a little more about this later, but it
    does mean that not every loop can be
    parallelized

11
Issues to be aware of
  • Parallel computing is not about being cool and
    doing lots and lots of flops
  • Flops = floating point operations per second
  • We want solutions to problems in a reasonable
    amount of time
  • Sometimes that means doing a lot of calculations,
    e.g. consider what we found about the number of
    collisions for molecules in air
  • Gains from algorithmic improvements will often
    swamp hardware improvements
  • Don't be brain-limited: if there is a better
    algorithm, use it

12
Algorithmic Improvements in n-body simulations
Improvements in the speed of algorithms are
proportionally better than the speed increase of
computers over the same time interval.
13
Identifying Performance Desires
  • Frequency of use: ranges from Daily (positive
    precondition) through Monthly to Yearly (negative
    precondition)
  • Code evolution timescale: ranges from Hundreds of
    executions between changes (positive precondition)
    to Changes each run (negative precondition)
14
Performance Characteristics
  • Execution time: ranges from Days (positive
    precondition) through Hours to Minutes (negative
    precondition)
  • Level of synchronization: ranges from None (positive
    precondition) through Infrequent (every minute) to
    Frequent (many per second) (negative precondition)
15
Data and Algorithm
  • Algorithmic complexity (roughly the number of
    stages): ranges from Simple (positive precondition)
    to Complex (negative precondition)
  • Data structures: range from Regular, static
    (positive precondition) to Irregular, dynamic
    (negative precondition)
16
Requirements
  • Must significantly increase resolution/length of
    integration (positive precondition)
  • Need a factor of 2 increase
  • Current resolution meets needs (negative
    precondition)
17
How much speed-up can we achieve?
  • Some parts of a code cannot be run in parallel
  • For example the loop over a(i) = a(i-1) from
    earlier
  • Any code that cannot be executed in parallel is
    said to be serial or sequential
  • Let's suppose that, in terms of the total execution
    time of a program, a fraction fs has to be run in
    serial, while fp can be run in parallel on n CPUs
  • Equivalently the time spent in each fraction will
    be ts and tp, so the total time on 1 CPU is
    t1cpu = ts + tp
  • If we can run the parallel fraction on n CPUs
    then it will take a time tp/n
  • The total time will then be tncpu = ts + tp/n

18
Amdahl's Law
  • How much speed-up (Sn = t1cpu/tncpu) is feasible?
  • Amdahl's Law is the most significant limit. Given
    our previous results and n processors, the
    maximum speed-up is given by
    Sn = (ts + tp)/(ts + tp/n) = 1/(fs + fp/n)
  • Only if the serial fraction fs = ts/(ts + tp) is
    zero is perfect speed-up possible (at least in
    theory)
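  • As a worked illustration (numbers chosen here for
    the example, not from the original slides): if
    fs = 0.1 and fp = 0.9, then on n = 16 CPUs the
    speed-up is at most S16 = 1/(0.1 + 0.9/16) = 6.4,
    and no matter how many CPUs are added the speed-up
    can never exceed 1/fs = 10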

19
Amdahl's Law
[Figure: speed-up Sn plotted against the number of CPUs, Ncpu]
20
What is OpenMP?
  • OpenMP is a pragma-based application
    programmer interface (API) that provides a
    simple extension to C/C++ and FORTRAN
  • Pragma is just a fancy word for instructions
  • It is exclusively designed for shared memory
    programming
  • Ultimately, OpenMP is a very simple interface to
    something called threads based programming
  • What actually happens when you break up a loop
    into pieces is that a number of threads of
    execution are created that can run the loop
    pieces in parallel

21
Threads based execution
  • Serial execution, interspersed with parallel

In practice many compilers block execution of the
extra threads during serial sections; this saves
the overhead of the fork-join operation
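As a minimal sketch of this fork-join behaviour (illustrative, not from
the original slides), in fixed-form FORTRAN:

      program forkjoin
      implicit none
      print *, 'serial section: one thread'
C$OMP PARALLEL
      print *, 'hello from one of the parallel threads'
C$OMP END PARALLEL
      print *, 'serial section again: the threads have joined'
      end

Compiled with OpenMP enabled and run with 4 threads, the middle line is
printed 4 times (once per thread), while the serial lines are printed
once each.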
22
Some background to threads programming
  • There is actually an entire set of commands in C
    to allow you to create threads
  • You could, if you wanted, program with these
    commands
  • The most common thread standard is called POSIX
  • However, OpenMP provides a simple interface to a
    lot of the functionality provided by threads
  • If it is simple and does what you need, why
    bother going to the effort of using threads
    programming?

23
Components of OpenMP
  • Directives (pragmas placed in your code)
  • Runtime library routines (supplied with the compiler)
  • Environment variables (set at the Unix prompt)
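For example (an illustrative sketch, not from the original slides): the
directives are lines such as the C$OMP PARALLEL DO shown on later
slides; a runtime library routine such as omp_get_max_threads() lets
the program query how many threads are available; and setting the
environment variable OMP_NUM_THREADS at the Unix prompt (e.g.
setenv OMP_NUM_THREADS 4 in csh) controls how many threads are used.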
24
OpenMP - Where did it come from?
  • Prior to 1997, vendors all had their own
    proprietary shared memory programming commands
  • Programs were not portable from one SMP to
    another
  • Researchers were calling for some kind of
    portability
  • ANSI X3H5 (1994) proposal tried to formalize a
    shared memory standard but ultimately failed
  • OpenMP (1997) worked because the vendors got
    behind it and there was new growth in the shared
    memory market place
  • Very hard for researchers to get new languages
    supported now, must have backing from computer
    vendors!

25
Bottomline
  • For OpenMP, and shared memory programming in
    general, one only has to worry about parallelism
    of the work
  • This is because all the processors in a
    shared-memory computer can see all the same
    memory locations
  • On distributed-memory computers one has to worry
    both about parallelism of the work and also the
    placement of data
  • Is the value I need in the memory of another
    processor?
  • Data movement is what makes distributed-memory
    codes (usually written in something called MPI)
    so much longer - it can be highly non-trivial
  • Although it can be easy - it depends on the
    algorithm

26
First Steps
  • Loop level parallelism is the simplest and
    easiest way to use OpenMP
  • Take each do loop and make it parallel (if
    possible)
  • It allows you to slowly build up parallelism
    within your application
  • However, not all loops are immediately
    parallelizable due to dependencies

27
Loop Level Parallelism
  • Consider the single precision vector add-multiply
    operation Y = aX + Y (SAXPY)

C/C++
for (i=1; i<n; i++) Y[i] = a*X[i] + Y[i];

#pragma omp parallel for \
  private(i) shared(X,Y,n,a)
for (i=1; i<n; i++)
  Y[i] = a*X[i] + Y[i];
28
In more detail
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i) = a*X(i) + Y(i)
      end do
29
A quick note
  • To be fully lexically correct you may want to
    include a C$OMP END PARALLEL DO
  • In f90 programs use !$OMP as a sentinel
  • Notice that the sentinels mean that the OpenMP
    commands look like comments
  • A compiler that has OpenMP compatibility turned
    on will see the comments after the sentinel
  • This means you can still compile the code on
    computers that don't have OpenMP
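For example (a sketch, not from the original slides), the SAXPY loop in
free-form f90 using the !$OMP sentinel and an explicit end directive:

!$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(i) SHARED(X,Y,n,a)
do i = 1, n
   Y(i) = a*X(i) + Y(i)
end do
!$OMP END PARALLEL DO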

30
How the compiler handles OpenMP
  • When you compile an OpenMP code you need to add
    flags to the compile line, e.g.
  • f77 -openmp -o myprogram myprogram.f
  • Unfortunately different compilers have different
    commands for turning on OpenMP support, the above
    will work on Sun machines
  • When the compiler flag is turned on, you now
    force the compiler to link in all of the
    additional libraries (and so on) necessary to run
    the threads
  • This is all transparent to you though
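  • As an aside (not from the original slides): with the
    GNU compilers the corresponding flag is -fopenmp,
    e.g. gfortran -fopenmp -o myprogram myprogram.f, so
    check the documentation for whichever compiler you
    are using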

31
Requirements for parallel loops
  • To divide up the work the compiler needs to know
    the number of iterations to be executed - the
    trip count must be computable
  • They must also not exhibit any of the
    dependencies we mentioned
  • We'll review this more in the next lecture
  • Actually a good test for dependencies is running
    the loop from n to 1, rather than 1 to n. If you
    get a different answer, that suggests there are
    dependencies (see the sketch at the end of this
    slide)
  • DO WHILE is not parallelizable using these
    directives
  • There is actually a way of parallelizing DO WHILE
    using a different set of OpenMP commands, but we
    don't have time to cover that
  • The loop can only have one exit point - therefore
    BREAK or GOTOs are not allowed
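As a sketch of the reversal test mentioned above (illustrative only,
reusing the dependent loop from earlier):

      do i = 2, n
        a(i) = a(i-1)
      end do

versus

      do i = n, 2, -1
        a(i) = a(i-1)
      end do

Running forwards, every element ends up equal to a(1); running
backwards, each element is just shifted down by one place. The two
answers differ, which flags the dependence.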

32
Performance limitations
  • Each time you start and end a parallel loop there
    is an overhead associated with the threads
  • These overheads must always be added to the time
    taken to calculate the loop itself
  • Therefore there is a limit on the smallest loop
    size that will achieve speed-up
  • In practice, we need roughly 5000 floating point
    operations in a loop for it to be worth
    parallelizing
  • A good rule of thumb is that any thread should
    have at least 1000 floating point operations
  • Thus small loops are simply not worth the bother!
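  • As a rough worked illustration (numbers here are
    only for the example): the SAXPY loop does about 2
    floating point operations per iteration (one
    multiply, one add), so by the 5000-flop rule of
    thumb it would need n of roughly 2500 or more
    before parallelizing it pays off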

33
Summary
  • Shared memory parallel computers can be
    programmed using the OpenMP extensions to
    C, FORTRAN
  • Distributed memory computers require a different
    parallel language
  • The easiest way to use OpenMP is to make loops
    parallel by dividing work up among threads
  • Compiler handles most of the difficult parts of
    coding
  • However, not all loops are immediately
    parallelizable
  • Dependencies may prevent parallelization
  • Loops are made to run in parallel by adding
    directives (pragmas) to your code
  • These directives appear to be comments to
    ordinary compilers

34
Next Lecture
  • More details on dependencies and how we can deal
    with them