Introducci - PowerPoint PPT Presentation

Loading...

PPT – Introducci PowerPoint presentation | free to download - id: 728bda-MGI1Z



Loading


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation
Title:

Introducci

Description:

Introducci n a la computaci n de altas prestaciones Francisco Almeida y Francisco de Sande Departamento de Estad stica, I.O. y Computaci n Universidad de La Laguna – PowerPoint PPT presentation

Number of Views:1
Avg rating:3.0/5.0
Date added: 26 October 2018
Slides: 146
Provided by: Casia6
Learn more at: http://www.iac.es
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Introducci


1
Introducción a la computación de altas
prestaciones
Francisco Almeida y Francisco de Sande
Departamento de Estadística, I.O. y
Computación Universidad de La Laguna
La Laguna, 12 de febrero de 2004
2
Questions
  • Why Parallel Computers?
  • How Can the Quality of the Algorithms be
    Analyzed?
  • How Should Parallel Computers Be Programmed?
  • Why the Message Passing Programming Paradigm?
  • Why de Shared Memory Programming Paradigm?

3
OUTLINE
  • Introduction to Parallel Computing
  • Performance Metrics
  • Models of Parallel Computers
  • The MPI Message Passing Library
  • Examples
  • The OpenMP Shared Memory Library
  • Examples
  • Improvements in black hole detection using
    parallelism

4
Why Parallel Computers ?
  • Applications Demanding more Computational Power
  • Artificial Intelligence
  • Weather Prediction
  • Biosphere Modeling
  • Processing of Large Amounts of Data (from sources
    such as satellites)
  • Combinatorial Optimization
  • Image Processing
  • Neural Network
  • Speech Recognition
  • Natural Language Understanding
  • etc..

1990 s
1980 s
1970 s
Performance
1960 s
Cost
SUPERCOMPUTERS
5
Top500
  • www.top500.org

6
Performace Metrics
7
Speed-up
  • Ts Sequential Run Time Time elapsed between
    the begining and the end of its execution on a
    sequential computer.
  • Tp Parallel Run Time Time that elapses from
    the moment that a parallel computation starts to
    the moment that the last processor finishes the
    execution.
  • Speed-up Ts / Tp ? p
  • Ts Time of the best sequential algorithm to
    solve the problem.

8
Speed-up
9
Speed-up
10
Speed-up
11
Speed-up
Optimal Number of Processors
12
Efficiency
  • In practice, ideal behavior of an speed-up equal
    to p is not achieved because while executing a
    parallel algorithm, the processing elements
    cannot devote 100 of their time to the
    computations of the algorithm.
  • Efficiency Measure of the fraction of time for
    which a processing element is usefully employed.
  • E (Speed-up / p) x 100

13
Efficiency
14
Amdahls Law
  • Amdahls law attempt to give a maximum bound for
    speed-up from the nature of the algorithm chosen
    for the parallel implementation.
  • Seq Proportion of time the algorithm needs to
    be spent in purely sequential parts.
  • Par Proportion of time that might be done in
    parallel
  • Seq Par 1 (where 1 is for algebraic
    simplicity)
  • Maximum Speed-up (Seq Par) / (Seq Par / p)
    1 / (Seq Par / p)

Seq Par Maximum Speed-up
0,1 0,001 0,999 500,25
0,5 0,005 0,995 166,81
1 0,01 0,99 90,99
10 0,1 0,9 9,91
p 1000
15
Example
  • A problem to be solved many times over several
    different inputs.
  • Evaluate F(x,y,z)
  • x in 1 , ..., 20 y in 1 , ..., 10 z in 1 ,
    ..., 3
  • The total number of evaluations is 20103 600.
  • The cost to evaluate F in one point (x, y, z) is
    t.
  • The total running time is t 600.
  • If t is equal to 3 hours.
  • The total running time for the 600 evaluations is
    1800 hours ? 75 days

16
Speed-up
17
Models
18
The Sequential Model
  • The RAM model express computations on von Neumann
    architectures.
  • The von Neumann architecture is universally
    accepted for sequential computations.

RAM
Von Neumann
19
The Parallel Model
PRAM
  • Computational Models
  • Programming Models
  • Architectural Models

BSP, LogP
PVM, MPI, HPF, Threads, OPenMP
Parallel Architectures
20
Address-Space Organization
21
Digital AlphaServer 8400
Hardware
  • Shared Memory
  • BusTopology
  • C4-CEPBA
  • 10 Alpha processors21164
  • 2 Gb Memory
  • 8,8 Gflop/s

22
SGI Origin 2000
Hardware
  • Shared Dsitributed Memory
  • Hypercubic Topology
  • C4-CEPBA
  • 64 R1000 processos
  • 8 Gb memory
  • 32 Gflop/s

23
The SGI Origin 3000 Architecture (1/2)
  • jen50.ciemat.es
  • 160 processors MIPS R14000 / 600MHz
  • On 40 nodes with 4 processors each
  • Data and instruction cache on-chip
  • Irix Operating System
  • Hypercubic Network

24
The SGI Origin 3000 Architecture (2/2)
  • cc-Numa memory Architecture
  • 1 Gflops Peak Speed
  • 8 MB external cache
  • 1 Gb main memory each proc.
  • 1 Tb Hard Disk

25
Beowulf Computers
  • COTS Commercial-Off-The-Shelf computers
  • Distributed Memory

26
Towards Grid Computing.
Source www.globus.org updated
27
The Parallel Model
PRAM
  • Computational Models
  • Programming Models
  • Architectural Models

BSP, LogP
PVM, MPI, HPF, Threads, OpenMP
Parallel Architectures
28
Drawbacks that arise when solving Problems using
Parallelism
  • Parallel Programming is more complex than
    sequential.
  • Results may vary as a consequence of the
    intrinsic non determinism.
  • New problems. Deadlocks, starvation...
  • Is more difficult to debug parallel programs.
  • Parallel programs are less portable.

29
The Message Passing Model
30
The Message Passing Model
Send(parameters) Recv(parameters)
31
(No Transcript)
32
MPI
CMMD
pvm
Express
Zipcode
p4
PARMACS
EUI
MPI
Parallel Libraries
Parallel Applications
Parallel Languages
33
MPI
  • What Is MPI?
  • Message Passing Interface standard
  • The first standard and portable message
    passing library with good performance
  • "Standard" by consensus of MPI Forum
    participants from over 40 organizations
  • Finished and published in May 1994, updated
    in June 1995
  • What does MPI offer?
  • Standardization - on many levels
  • Portability - to existing and new systems
  • Performance - comparable to vendors'
    proprietary libraries
  • Richness - extensive functionality, many
    quality implementations

34
A Simple MPI Program
MPI hello.c include ltstdio.hgt include
"mpi.h" main(int argc, charargv) int name,
p, source, dest, tag 0 MPI_Init(argc,argv)
MPI_Comm_rank(MPI_COMM_WORLD,name)
MPI_Comm_size(MPI_COMM_WORLD,p) printf(Hello
from processor d of d\n",name, p)
MPI_Finalize()

35
A Simple MPI Program
MPI helloms.c include ltstdio.hgt include
ltstring.hgt include "mpi.h" main(int argc,
charargv) int name, p, source, dest, tag
0 char message100 MPI_Status
status MPI_Init(argc,argv) MPI_Comm_rank(MPI_C
OMM_WORLD,name) MPI_Comm_size(MPI_COMM_WORLD,p)

if (name ! 0) printf("Processor d of
d\n",name, p) sprintf(message,"greetings
from process d!", name)
dest 0 MPI_Send(message,
strlen(message)1,
MPI_CHAR, dest, tag,
MPI_COMM_WORLD) else
printf("processor 0, p d ",p)
for(source1 source lt p source)
MPI_Recv(message,100, MPI_CHAR, source,
tag, MPI_COMM_WORLD, status)
printf("s\n",message)
MPI_Finalize()

36
Linear Model to Predict Communication Performance
  • Time to send N bytes ? n b

37
Performace Prediction Fast, Gigabit Ethernet,
Myrinet
38
Basic Communication Operations
39
One-to-all broadcast Single-node Accumulation
One-to-all broadcast
M
. . .
0
p
1
M
M
M
Single-node Accumulation
0
1
Step 1
2
Step 2
. . .
Step p
p
40
Broadcast on Hypercubes
41
Broadcast on Hypercubes
42
MPI Broadcast
  • int MPI_Bcast(
  • void buffer
  • int count
  • MPI_Datatype datatype
  • int root
  • MPI_Comm comm
  • )
  • Broadcasts a message from the
  • process with rank "root" to
  • all other processes of the group

43
Reduction on Hypercubes
A6 110
  • _at_ conmutative and associative operator
  • Ai in processor i
  • Every processor has to obtain A0_at_A1_at_..._at_AP-1

A7 101
A2 101
A3 101
A5 101
A0 000
A1 001
44
Reductions with MPI
  • int MPI_Reduce(
  • void sendbufvoid recvbufint
    countMPI_Datatype datatypeMPI_Op opint
    rootMPI_Comm comm)
  • Reduces values on all processes to a single value
    processes
  • int MPI_Allreduce(
  • void sendbufvoid recvbufint
    countMPI_Datatype datatypeMPI_Op opMPI_Comm
    comm)
  • Combines values form all processes and
    distributes the result back to all

45
All-To-All BroadcastMultinode Accumulation
All-to-all broadcast
M1
M2
Mp
M0
M0
M0
Single-node Accumulation
M1
M1
M1
Mp
Mp
Mp
Reductions, Prefixsums
46
MPI Collective Operations
  • MPI Operator Operation
  • ----------------------------------------
    -----------------------
  • MPI_MAX maximum
  • MPI_MIN minimum
  • MPI_SUM sum
  • MPI_PROD product
  • MPI_LAND logical and
  • MPI_BAND bitwise and
  • MPI_LOR logical or
  • MPI_BOR bitwise or
  • MPI_LXOR logical
    exclusive or
  • MPI_BXOR bitwise
    exclusive or
  • MPI_MAXLOC max value and
    location
  • MPI_MINLOC min value and
    location

47
Computing ? Sequential
double t, pi0.0, w long i, n ... double
local, pi 0.0 .... h 1.0 / (double) n
for (i 0 i lt n i) x (i 0.5)
h pi f(x) pi h
4
2
0.0
0.2
0.4
0.6
0.8
1.0
48
Computing ? Parallel
MPI_Bcast(n, 1, MPI_INT, 0,
MPI_COMM_WORLD) h 1.0 / (double) n
mypi 0.0 for (i name i lt n i
numprocs) x h (i 0.5) h
mypi f(x) mypi h sum
MPI_Reduce(mypi, pi, 1, MPI_DOUBLE, MPI_SUM,
0, MPI_COMM_WORLD)
4
2
0.0
0.2
0.4
0.6
0.8
1.0
49
The Master Slave Paradigm
Master
Slaves
50
Condor University Wisconsin-Madison.
www.cs.wisc.edu/condor
  • A problem to be solved many times over several
    different inputs.
  • The problem to be solved is computational
    expensive.

51
Condor
  • Condor is a specialized workload management
    system for compute-intensive jobs.
  • Like other full-featured batch systems, Condor
    provides a job queueing mechanism, scheduling
    policy, priority scheme, resource monitoring, and
    resource management.
  • Users submit their serial or parallel jobs to
    Condor, Condor places them into a queue, chooses
    when and where to run the jobs based upon a
    policy, carefully monitors their progress, and
    ultimately informs the user upon completion.
  • Condor can be used to manage a cluster of
    dedicated compute nodes (such as a "Beowulf"
    cluster). In addition, unique mechanisms enable
    Condor to effectively harness wasted CPU power
    from otherwise idle desktop workstations.
  • In many circumstances Condor is able to
    transparently produce a checkpoint and migrate a
    job to a different machine which would otherwise
    be idle.
  • As a result, Condor can be used to seamlessly
    combine all of an organization's computational
    power into one resource.

52
What is OpenMP?
  • Application Program Interface (API) for shared
    memory parallel programming
  • What the application programmer inserts into code
    to make it run in parallel
  • Addresses only shared memory multiprocessors
  • Directive based approach with library support
  • Concept of base language and extensions to base
    language
  • OpenMP is available for Fortran 90 / 77 and C /
    C

53
OpenMP is not...
  • Not Automatic parallelization
  • User explicitly specifies parallel execution
  • Compiler does not ignore user directives even if
    wrong
  • Not just loop level parallelism
  • Functionality to enable coarse grained
    parallelism
  • Not a research project
  • Only practical constructs that can be implemented
    with high performance in commercial compilers
  • Goal of parallel programming application speedup
  • Simple/Minimal with Opportunities for
    Extensibility

54
Why OpenMP?
  • Parallel programming landscape before OpenMP
  • Standard way to program distributed memory
    computers (MPI and PVM)
  • No standard API for shared memory programming
  • Several vendors had directive based APIs for
    shared memory programming
  • Silicon Graphics, Cray Research, Kuck
    Associates, Digital Equipment .
  • All different, vendor proprietary
  • Sometimes similar but with different spellings
  • Most were targeted at loop level parallelism
  • Limited functionality - mostly for parallelizing
    loops

55
Why OpenMP? (cont.)
  • Commercial users, high end software vendors have
    big investment in existing code
  • Not very eager to rewrite their code in new
    languages
  • Performance concerns of new languages
  • End result users who want portability forced to
    program shared memory machines using MPI
  • Library based, good performance and scalability
  • But sacrifice the built in shared memory
    advantages of the hardware
  • Both require major investment in time and money
  • Major effort entire program needs to be
    rewritten
  • New features needs to be curtailed during
    conversion

56
The OpenMP API
  • Multi-platform shared-memory parallel programming
  • OpenMP is portable supported by Compaq, HP, IBM,
    Intel, SGI, Sun and others on Unix and NT
  • Multi-Language C, C, F77, F90
  • Scalable
  • Loop Level parallel control
  • Coarse grained parallel control

57
The OpenMP API
  • Single source parallel/serial programming
  • OpenMP is not intrusive (to original serial
    code).
  • Instructions appear in comment statements for
    Fortran and through pragmas for C/C
  • Incremental implementation
  • OpenMP programs can be implemented incrementally
    one subroutine (function) or even one do (for)
    loop at a time

!omp parallel do do i 1, n ... enddo
pragma omp parallel for for (i 0 I lt n
i) ...
58
Threads
  • Multithreading
  • Sharing a single CPU between multiple tasks (or
    "threads") in a way designed to minimise the time
    required to switch threads
  • This is accomplished by sharing as much as
    possible of the program execution environment
    between the different threads so that very little
    state needs to be saved and restored when
    changing thread.
  • Threads share more of their environment with each
    other than do tasks under multitasking.
  • Threads may be distinguished only by the value of
    their program counters and stack pointers while
    sharing a single address space and set of global
    variables

59
OpenMP Overview How do threads interact?
  • OpenMP is a shared memory model
  • Threads communicate by sharing variables
  • Unintended sharing of data causes race
    conditions
  • race condition when the programs outcome
    changes as the threads are scheduled differently
  • To control race conditions
  • Use synchronizations to protect data conflicts
  • Synchronization is expensive so
  • Change how data is accessed to minimize the need
    for synchronization

60
OpenMP Parallel Computing Solution Stack
User layer
Environment Variables
Prog. Layer (OpenMP API)
Directives
OpenMP Library
Runtime Library
System layer
OS/system support for shared memory
61
Reasoning about programming
  • Programming is a process of successive refinement
    of a solution relative to a hierarchy of models
  • The models represent the problem at a different
    level of abstraction
  • The top levels express the problem in the
    original problem domain
  • The lower levels represent the problem in the
    computers domain
  • The models are informal, but detailed enough to
    support simulation
  • Source J.-M. Hoc, T.R.G. Green, R. Samurcay and
    D.J. Gilmore (eds.), Psychology of Programming,
    Academic Press Ltd., 1990

62
Layers of abstraction in Programming
Domain
Model Bridges between domains
Problem
Algorithm
OpenMP only defines these two!
Source Code
Computation
Cost
Hardware
63
The OpenMP Computational Model
  • OpenMP was created with a particular abstract
    machine or computational model in mind
  • Multiple processing elements
  • A shared address space with equal-time access
    for each processor
  • Multiple light weight processes (threads) managed
    outside of OpenMP (the OS or some other third
    party)

64
The OpenMP programming model
  • fork-join parallelism
  • Master thread spawns a team of threads as needed
  • Parallelism is added incrementally i.e. the
    sequential program evolves into a parallel program

Master Thread
A nested Parallel Region
Parallel Regions
65
So, How good is OpenMP?
  • A high quality programming environment supports
    transparent mapping between models
  • OpenMP does this quite well for the models it
    defines
  • Programming model
  • Threads forked by OpenMP map onto threads in
    modern OSs.
  • Computational model
  • Multiprocessor systems with cache coherency map
    onto OpenMP shared address space

66
What about the cost model?
  • OpenMP doesnt say much about the cost model
  • programmers are left to their own devices
  • Real systems have memory hierarchies, OpenMPs
    assumed machine model doesnt
  • Caches mean some data is closer to some
    processors
  • Scalable multiprocessor systems organize their
    RAM into modules - another source of NUMA
  • OpenMP programmers must deal with these issues as
    they
  • Optimize performance on each platform
  • Scale to run onto larger NUMA machines

67
What about the specification model?
  • Programmers reason in terms of a specification
    model as they design parallel algorithms
  • Some parallel algorithms are natural in OpenMP
  • Specification models implied by loop-splitting
    and SPMD algorithms map well onto OpenMPs
    programming model
  • Some parallel algorithms are hard for OpenMP
  • Recursive problems and list processing is
    challenging for OpenMPs models

68
Is OpenMP a good API?
Model Bridges between domains
Fair (5)
Good (8)
Good (9)
Poor (3)
Overall score OpenMP is OK (6), but it sure
could be better!
69
OpenMP today
  • Hardware Vendors
  • Compaq/Digital (DEC)
  • Hewlett-Packard (HP)
  • IBM
  • Intel
  • Silicon Graphics
  • Sun Microsystems

3rd Party Software Vendors Absoft Edinburgh
Portable Compilers (EPC) Kuck Associates
(KAI) Myrias Numerical Algorithms Group
(NAG) Portland Group (PGI)
70
The OpenMP components
  • Directives
  • Environment Variables
  • Shared / Private Variables
  • Runtime Library
  • OS Threads

71
The OpenMP directives
  • Directives are special comments in the language
  • Fortran fixed form !OMP, COMP, OMP
  • Fortran free form !OMP
  • Special comments are interpreted by OpenMP
    compilers

w 1.0/n sum 0.0 !OMP PARALLEL DO
PRIVATE(x) REDUCTION(sum) do I1,n
x w(I-0.5) sum sum f(x) end
do pi wsum print ,pi end
Comment in Fortran but interpreted by
OpenMP compilers
Dont worry about details now!
72
The OpenMP directives
  • Look like a comment (sentinel / pragma syntax)
  • F77/F90 !OMP directive_name clauses
  • C/C pragma omp pragmas_name clauses
  • Declare start and end of multithread execution
  • Control work distribution
  • Control how data is brought into and taken out of
    parallel sections
  • Control how data is written/read inside sections

73
The OpenMP environment variables
  • OMP_NUM_THREADS - number of threads to run in a
    parallel section
  • MPSTKZ size of stack to provide for each thread
  • OMP_SCHEDULE - Control state of scheduled
    executions.
  • setenv OMP_SCHEDULE "STATIC, 5
  • setenv OMP_SCHEDULE "GUIDED, 8
  • setenv OMP_SCHEDULE "DYNAMIC"
  • OMP_DYNAMIC
  • OMP_NESTED

74
Shared / Private variables
  • Shared variables can be accessed by all of the
    threads.
  • Private variables are local to each thread
  • In a typical parallel loop, the loop index is
    private, while the data being indexed is shared

!omp parallel !omp parallel do !omp
shared(X,Y,Z), private(I) do I1, 100
Z(I) X(I) Y(I) end do !omp end parallel
75
OpenMP Runtime routines
  • Writing a parallel section of code is matter of
    asking two questions
  • How many threads are working in this section?
  • Which thread am I?
  • Other things you may wish to know
  • How many processors are there?
  • Am I in a parallel section?
  • How do I control the number of threads?
  • Control the execution by setting directive state

76
Os threads
  • In the case of Linux, it needs to be installed
    with an SMP kernel.
  • Not a good idea to assign more threads than CPUs
    available
  • omp_set_num_threads(omp_get_num_procs())

77
A simple example computing ?
78
Computing ?
double t, pi0.0, w long i, n
100000000 double local, pi 0.0, w 1.0 /
n ... for(i 0 i lt n i) t (i 0.5)
w pi 4.0/(1.0 tt) pi w ...
79
Computing ?
  • pragma omp directives in C
  • Ignored by non-OpenMP compilers

double t, pi0.0, w long i, n
100000000 double local, pi 0.0, w 1.0 /
n ... pragma omp parallel for reduction(pi)
private(i,t) for(i 0 i lt n i) t (i
0.5) w pi 4.0/(1.0 tt) pi w ...
80
Computing ? on a SunFire 6800
81
Compiling OpenMP programs
  • OpenMP directives are ignored by default
  • Example SGI Irix platforms
  • f90 -O3 foo.f
  • cc -O3 foo.c
  • OpenMP directives are enabled with -mp
  • Example SGI Irix platforms
  • f90 -O3 -mp foo.f
  • cc -O3 -mp foo.c

82
Fortran example
  • OpenMP directives used
  • comp parallel clauses
  • comp end parallel
  • program f77_parallel
  • implicit none
  • integer n, m, i, j
  • parameter (n10, m20)
  • integer a(n,m)
  • comp parallel default(none)
  • comp private(i,j) shared(a)
  • do j1,m
  • do i1,n
  • a(i,j)i(j-1)n
  • enddo
  • enddo
  • comp end parallel
  • end
  • Parallel clauses include
  • default(noneprivateshared)
  • private(...)
  • shared(...)

83
Fortran example
  • program f77_parallel
  • implicit none
  • integer n, m, i, j
  • parameter (n10, m20)
  • integer a(n,m)
  • comp parallel default(none)
  • comp private(i,j) shared(a)
  • do j1,m
  • do i1,n
  • a(i,j)i(j-1)n
  • enddo
  • enddo
  • comp end parallel
  • end
  • Each arrow denotes one thread
  • All threads perform identical task

84
Fortran example
  • With default scheduling,
  • Thread a works on j 15
  • Thread b on j 610
  • Thread c on j 1115
  • Thread d on j 1620
  • program f77_parallel
  • implicit none
  • integer n, m, i, j
  • parameter (n10, m20)
  • integer a(n,m)
  • comp parallel default(none)
  • comp private(i,j) shared(a)
  • do j1,m
  • do i1,n
  • a(i,j)i(j-1)n
  • enddo
  • enddo
  • comp end parallel
  • end

a
b
c
d
85
The Foundations of OpenMP
Layers of abstractions or models used to
understand and use OpenMP
86
Summary of OpenMP Basics
  • Parallel Region (to create threads)
  • Comp parallel pragma omp parallel
  • Worksharing (to split-up work between threads)
  • Comp do pragma omp for
  • Comp sections pragma omp sections
  • Comp single pragma omp single
  • Comp workshare
  • Data Environment (to manage data sharing)
  • directive threadprivate
  • clauses shared, private, lastprivate,
    reduction, copyin, copyprivate
  • Synchronization
  • directives critical, barrier, atomic, flush,
    ordered, master
  • Runtime functions/environment variables

87
Improvements in black hole detection using
parallelism
Francisco Almeida1, Evencio Mediavilla2, Álex
Oscoz2 and Francisco de Sande1
1 Departamento de Estadística, I.O. y
Computación
2 Instituto de Astrofísica de
Canarias
Universidad de La Laguna Tenerife, Canary
Islands, SPAIN Dresden, September 3
2003
88
Introduction (1/3)
  • Very frequently there is a divorce between
    computer scientists and researchers in other
    scientific disciplines
  • This work collects the experiences of a
    collaboration between researchers coming from two
    different fields astrophysics and parallel
    computing
  • We present different approaches to the
    parallelization of a scientific code that solves
    an important problem in astrophysics
  • the detection of supermassive black holes

89
Introduction (2/3)
  • The IAC co-authors initially developed a
    Fortran77 code solving the problem
  • The execution time for this original code was not
    acceptable and that motivated them to contact
    with researchers with expertise in the parallel
    computing field
  • We know in advance that these scientist
    programmers deal with intense time-consuming
    sequential codes that are not difficult to tackle
    using HPC techniques
  • Researchers with a purely scientific background
    are interested in these techniques, but they are
    not willing to spend time learning about them

90
Introduction (3/3)
  • One of our constraints was to introduce the
    minimum amount of changes in the original code
  • Even with the knowledge that some optimizations
    could be done in the sequential code
  • To preserve the use of the NAG functions was also
    another restriction in our development

91
Outline
  • The Problem
  • Black holes and quasars
  • The method gravitational lensing
  • Fluctuations in quasars light curves
  • Mathematical formulation of the problem
  • Sequential code
  • Parallelizations MPI, OpenMP, Mixed MPI/OpenMP
  • Computational results
  • Conclusions

92
Black holes
  • Supermassive black holes (SMBH) are supposed to
    exist in the nucleus of many if not all the
    galaxies
  • Some of these objects are surrounded by a disk of
    material continuously spiraling towards the deep
    gravitational potential pit of the SMBH and
    releasing huge quantities of energy giving rise
    to the phenomena known as quasars (QSO)

93
The accretion disk
94
Quasars (Quasi Stellar Objects, QSO)
  • QSOs are currently believed to be the most
    luminous and distant objects in the universe
  • QSOs are the cores of massive galaxies with super
    giant black holes that devour stars at a rapid
    rate, enough to produce the amount of energy
    observed by a telescope

95
The method
  • We are interested in objects of dimensions
    comparable to the Solar System in galaxies very
    far away from the Milky Way
  • Objects of this size can not be directly imaged
    and alternative observational methods are used to
    study their structure
  • The method we use is the observation of QSO
    images affected by a microlensing event to study
    the structure of the accretion disk

96
Gravitational Microlensing
  • Gravitational lensing (the attraction of light by
    matter) was predicted by General Relativity and
    observationally confirmed in 1919
  • If light from a QSO pass through a galaxy located
    between the QSO and the observer it is possible
    that a star in the intervening galaxy crosses the
    QSO light beam
  • The gravitational field of the star amplifies the
    light emission coming from the accretion disk
    (gravitational microlensing)

97
Microlensing Quasar-Star
98
Microlensing
  • The phenomenon is more complex because the
    magnification is not due to a single isolated
    microlens, but it rather is a collective effect
    of many stars
  • As the stars are moving with respect to the QSO
    light beam, the amplification varies during the
    crossing

99
Microlensing
100
Double Microlens
101
Multiple Microlens
  • Magnification pattern in the source plane,
    produced by a dense field of stars in the lensing
    galaxy.
  • The color reflects the magnification as a
    function of the quasar position the sequence
    blue-green-red-yellow indicates increasing
    magnification

102
Q22370305 (1/2)
  • So far the best example of a microlensed quasar
    is the quadruple quasar Q22370305

103
Q22370305 (2/2)
104
Fluctuations in Q22370305 light curves
  • Lightcurves of the four images of Q22370305 over
    a period of almost ten years
  • The changes in relative brightness are very
    obvious

105
Q22370305
  • In Q22370305, and thanks to the unusually small
    distance between the observer and the lensing
    galaxy, the microlensing events have a timescale
    of the order of months
  • We have observed Q22370305 from 1999 October
    during approximately 4 months at the Roque de los
    Muchachos Observatory

106
Fluctuations in light curves
  • The curve representing the change in luminosity
    of the QSO with time depends on the position of
    the star and also on the structure of the
    accretion disk
  • Microlens-induced fluctuations in the observed
    brightness of the quasar contain information
    about the light-emitting source (size of
    continuum region or broad line region of the
    quasar, its brightness profile, etc.)
  • Hence from a comparison between observed and
    simulated quasar microlensing we can draw
    conclusions about the accretion disk

107
Q22370305 light curves (2/2)
  • Our goal is to model light curves of QSO images
    affected by a microlensing event to study the
    unresolved structure of the accretion disk

108
Mathematical formulation (1/5)
  • Leaving aside the physical meaning of the
    different variables, the function modeling the
    dependence of the observed flux with time
    t, can be written as

Where is the ratio between the outer and
inner radii of the accretion disk (we will adopt
).
109
Mathematical formulation (2/5)
  • And G is the function
  • To speed up the computation, G has been
    approximated by using MATHEMATICA

110
Mathematical formulation (3/5)
  • Our goal is to estimate in the observed flux

the values of the parameters
by fitting to the observational data
111
Mathematical formulation (4/5)
  • Specifically, to find the values of the 5
    parameters that minimize the error between the
    theoretical model and the observational data
    according to a chi-square criterion

Where
  • N is the number of data points
    corresponding
  • to times ti (i 1, 2, ..., N)
  • is the theoretical function evaluated
    at time ti
  • is the observational error associated
    to each data value

112
Mathematical formulation (5/5)
  • The determination of the minimum in the
    5-parameters space depends on the initial
    conditions, so
  • we consider a 5-dimensional grid of starting
    points
  • and m sampling intervals in each variable
  • for each one of the points of the this grid we
    compute a local minimum
  • Finally, we select the absolute minimum among them
  • To minimize we use the e04ccf NAG routine,
    that only requires evaluation of the function and
    not of the derivatives

113
The sequential code
program seq_black_hole double
precision t2(100), s(100), ne(100), fa(100),
efa(100) common/data/t2, fa, efa, length
double precision t, fit(5) common/var/t,
fit c Data input c Initialize best
solution do k11, m do k21, m
do k31, m do k41, m
do k51, m c Initialize starting
point x(1), ..., x(5) call
jic2(nfit,x,fx) call
e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,.
..) if (fx improves best fx) then
update(best (x, fx))
endif enddo enddo
enddo enddo enddo end
114
The sequential code
program seq_black_hole double
precision t2(100), s(100), ne(100), fa(100),
efa(100) common/data/t2, fa, efa, length
double precision t, fit(5) common/var/t,
fit c Data input c Initialize best
solution do k11, m do k21, m
do k31, m do k41, m
do k51, m c Initialize starting
point x(1), ..., x(5) call
jic2(nfit,x,fx) call
e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,.
..) if (fx improves best fx) then
update(best (x, fx))
endif enddo enddo
enddo enddo enddo end
115
Sequential Times
  • In a Sun Blade 100 Workstation running Solaris
    5.0 and using the native Fortran77 Sun compiler
    (v. 5.0) with full optimizations this code takes
  • 5.89 hours for sampling intervals of size m4
  • 12.45 hours for sampling intervals of size m5
  • In a SGI Origin 3000 using the native MIPSpro F77
    compiler (v. 7.4) with full optimizations
  • 0.91 hours for m4
  • 2.74 hours for m5

116
Loop transformation
program seq_black_hole2 implicit
none parameter(m 4) ... double
precision t2(100), s(100), ne(100), fa(100),
efa(100) common/data/t2, fa, efa, longitud
double precision t, fit(5) integer k1,
k2, k3, k4, k5 common/var/t, fit c
Data input c Initialize best solution
do k 1, m5 c Index transformation c
Initialize starting point x(1), ..., x(5)
call jic2(nfit,x,fx) call
e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,.
..) if (fx improves best fx) then
update(best (x, fx)) endif
enddo end seq_black_hole
117
MPI OpenMP (1/2)
  • In the last years OpenMP and MPI have been
    universally accepted as the standard tools to
    develop parallel applications
  • OpenMP
  • is a standard for shared memory programming
  • It uses a fork-join model and
  • is mainly based on compiler directives that are
    added to the code that indicate the compiler
    regions of code to be executed in parallel
  • MPI
  • Uses an SPMD model
  • Processes can read and write only to their
    respective local memory
  • Data are copied across local memories using
    subroutine calls
  • The MPI standard defines the set of functions and
    procedures available to the programmer

118
MPI OpenMP (2/2)
  • Each one of these two alternatives have both
    advantages and disadvantages
  • Very frequently it is not obvious which one
    should be selected for a specific code
  • MPI programs run on both distributed and shared
    memory architectures while OpenMP run only on
    shared memory
  • The abstraction level is higher in OpenMP
  • MPI is particularly adaptable to coarse grain
    parallelism. OpenMP is suitable for both
    coarse and fine grain parallelism
  • While it is easy to obtain a parallel version of
    a sequential code in OpenMP, usually it requires
    a higher level of expertise in the case of MPI. A
    parallel code can be written in OpenMP just by
    inserting proper directives in a sequential code,
    preserving its semantics, while in MPI mayor
    changes are usually needed
  • Portability is higher in MPI

119
MPI-OpenMP hybridization (1/2)
  • Hybrid codes match the architecture of SMP
    clusters
  • SMP clusters are an increasingly popular
    platforms
  • MPI may suffer from efficiency problems in shared
    memory architectures
  • MPI codes may need too much memory
  • Some vendors have attempted to exted MPI in
    shared memory, but the result is not as efficient
    as OpenMP
  • OpenMP is easy to use but it is limited to the
    shared memory architecture

120
MPI-OpenMP hybridization (2/2)
  • An hybrid code may provide better scalability
  • Or simply enable a problem to exploit more
    processors
  • Not necessarily faster than pure MPI/OpenMP
  • It depends on the code, architecture, and how the
    programming models interact

121
MPI code
program black_hole_mpi include
'mpif.h' implicit none parameter(m
4) ... double precision t2(100),
s(100), ne(100), fa(100), efa(100)
common/data/t2, fa, efa, longitud double
precision t, fit(5) common/var/t, fit
call MPI_INIT( ierr ) call MPI_COMM_RANK(
MPI_COMM_WORLD, myid, ierr ) call
MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr ) c
Data input c Initialize best solution
do k myid, m5 - 1, numprocs c
Initialize starting point x(1), ..., x(5)
call jic2(nfit,x,fx) call
e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,.
..) if (fx improves best fx) then
update(best (x, fx)) endif enddo
call MPI_FINALIZE( ierr ) end
122
MPI code
program black_hole_mpi include
'mpif.h' implicit none parameter(m
4) ... double precision t2(100),
s(100), ne(100), fa(100), efa(100)
common/data/t2, fa, efa, longitud double
precision t, fit(5) common/var/t, fit
call MPI_INIT( ierr ) call MPI_COMM_RANK(
MPI_COMM_WORLD, myid, ierr ) call
MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr ) c
Data input c Initialize best solution
do k myid, m5 - 1, numprocs c
Initialize starting point x(1), ..., x(5)
call jic2(nfit,x,fx) call
e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,.
..) if (fx improves best fx) then
update(best (x, fx)) endif enddo
call MPI_FINALIZE( ierr ) end
123
OpenMP code
program black_hole_omp implicit none
parameter(m 4) ... double
precision t2(100), s(100), ne(100), fa(100),
efa(100) common/data/t2, fa, efa, longitud
double precision t, fit(5)
common/var/t, fit !OMP THREADPRIVATE(/var/,
/data/) c Data input c Initialize best
solution !OMP PARALLEL DO DEFAULT(SHARED)
PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w
3,w4,w5,w6,x) COPYIN(/data/)
LASTPRIVATE(fx) do k 0, m5 - 1 c
Initialize starting point x(1), ..., x(5)
call jic2(nfit,x,fx) call
e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,.
..) if (fx improves best fx) then
update(best (x, fx)) endif
enddo !OMP END PARALLEL DO end
124
OpenMP code
program black_hole_omp implicit none
parameter(m 4) ... double
precision t2(100), s(100), ne(100), fa(100),
efa(100) common/data/t2, fa, efa, longitud
double precision t, fit(5)
common/var/t, fit !OMP THREADPRIVATE(/var/,
/data/) c Data input c Initialize best
solution !OMP PARALLEL DO DEFAULT(SHARED)
PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w
3,w4,w5,w6,x) COPYIN(/data/)
LASTPRIVATE(fx) do k 0, m5 - 1 c
Initialize starting point x(1), ..., x(5)
call jic2(nfit,x,fx) call
e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,.
..) if (fx improves best fx) then
update(best (x, fx)) endif
enddo !OMP END PARALLEL DO end
125
OpenMP code
program black_hole_omp implicit none
parameter(m 4) ... double
precision t2(100), s(100), ne(100), fa(100),
efa(100) common/data/t2, fa, efa, longitud
double precision t, fit(5)
common/var/t, fit !OMP THREADPRIVATE(/var/,
/data/) c Data input c Initialize best
solution !OMP PARALLEL DO DEFAULT(SHARED)
PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w
3,w4,w5,w6,x) COPYIN(/data/)
LASTPRIVATE(fx) do k 0, m5 - 1 c
Initialize starting point x(1), ..., x(5)
call jic2(nfit,x,fx) call
e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,.
..) if (fx improves best fx) then
update(best (x, fx)) endif
enddo !OMP END PARALLEL DO end
126
Hybrid MPI OpenMP code
program black_hole_mpi_omp double
precision t2(100), s(100), ne(100), fa(100),
efa(100) common/data/t2, fa, efa, length
double precision t, fit(5) common/var/t,
fit !OMP THREADPRIVATE(/var/, /data/)
call MPI_INIT(ierr) call MPI_COMM_RANK(MPI_C
OMM_WORLD, myid, ierr) call
MPI_COMM_SIZE(MPI_COMM_WORLD, mpi_numprocs,
ierr) c Data input c Initialize best
solution !OMP PARALLEL DO DEFAULT(SHARED)
PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w
3,w4,w5,w6,x)
COPYIN(/data/)LASTPRIVATE(fx) do k myid,
m5 - 1, mpi_numprocs c Initialize starting
point x(1), ..., x(5) call
jic2(nfit,x,fx) call e04ccf(nfit,x,fx,ftol
,niw,w1,w2,w3,w4,w5,w6,jic2,...) if (fx
improves best fx) then update(best (x,
fx)) endif enddo c Reduce the
OpenMP best solution !OMP END PARALLEL DO c
Reduce the MPI best solution call
MPI_FINALIZE(ierr) end
127
Hybrid MPI OpenMP code
program black_hole_mpi_omp double
precision t2(100), s(100), ne(100), fa(100),
efa(100) common/data/t2, fa, efa, length
double precision t, fit(5) common/var/t,
fit !OMP THREADPRIVATE(/var/, /data/)
call MPI_INIT(ierr) call MPI_COMM_RANK(MPI_C
OMM_WORLD, myid, ierr) call
MPI_COMM_SIZE(MPI_COMM_WORLD, mpi_numprocs,
ierr) c Data input c Initialize best
solution !OMP PARALLEL DO DEFAULT(SHARED)
PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w
3,w4,w5,w6,x)
COPYIN(/data/)LASTPRIVATE(fx) do k myid,
m5 - 1, mpi_numprocs c Initialize starting
point x(1), ..., x(5) call
jic2(nfit,x,fx) call e04ccf(nfit,x,fx,ftol
,niw,w1,w2,w3,w4,w5,w6,jic2,...) if (fx
improves best fx) then update(best (x,
fx)) endif enddo c Reduce the
OpenMP best solution !OMP END PARALLEL DO c
Reduce the MPI best solution call
MPI_FINALIZE(ierr) end
128
The SGI Origin 3000 Architecture (1/2)
  • jen50.ciemat.es
  • 160 processors MIPS R14000 / 600MHz
  • On 40 nodes with 4 processors each
  • Data and instruction cache on-chip
  • Irix Operating System
  • Hypercubic Network

129
The SGI Origin 3000 Architecture (2/2)
  • cc-Numa memory Architecture
  • 1 Gflops Peak Speed
  • 8 MB external cache
  • 1 Gb main memory each proc.
  • 1 Tb Hard Disk

130
Computational results (m4)
  • As we do not have exclusive mode access to the
    architecture, the times correspond to the minimum
    time from 5 different executions
  • The figures corresponding to the mixed mode code
    (label MPI-OpenMP) correspond to the minimum
    times obtained for different combinations of MPI
    processes/OpenMP threads

131
Parallel execution time (m4)
132
Speedup (m4)
133
Results for mixed MPI-OpenMP
134
Computational results (m5)
135
Parallel execution time (m5)
136
Speedup (m5)
137
Results for mixed MPI-OpenMP (m5)
138
Conclusions (1/3)
  • The computational results obtained from all the
    parallel versions confirm the robustness of the
    method
  • For the case of non-expert users and the kind of
    codes we have been dealing with, we believe that
    MPI parallel versions are easier and safer
  • In the case of OpenMP, the proper usage of data
    scope attributes for the variables involved may
    be a handicap for users with non-parallel
    programming expertise
  • The higher current portability of the MPI version
    is another factor to be considered

139
Conclusions (2/3)
  • The mixed MPI/OpenMP parallel version is the most
    expertise-demanding
  • Nevertheless, as it has been stated by several
    authors and even for the case of a hybrid
    architecture, this version does not offer any
    clear advantage and it has the disadvantage that
    the combination of processes/threads has to be
    tuned

140
Conclusions (3/3)
  • We conclude a first step of cooperation in the
    way of applying HPC techniques to improve
    performance in astrophysics codes
  • The scientific aim of applying HPC to
    computationally-intensive codes in astrophysics
    has been successfully achieved

141
Conclusions and Future Work
  • The relevance of our results do not come directly
    from the particular application chosen, but from
    stating that parallel computing techniques are
    the key to broach large scale real problems in
    the mentioned scientific field
  • From now on we plan to continue this fruitful
    collaboration by applying parallel computing
    techniques to some other astrophysical challenge
    problems

142
Supercomputing Centers
  • http//www.ciemat.es/
  • http//www.cepba.upc.es/
  • http//www.epcc.ed.ac.uk/

143
MPI links
  • Message Passing Interface Forum
  • http//www.mpi-forum.org/
  • MPI The Complete Reference
  • http//www.netlib.org/utk/papers/mpi-book/mpi-boo
    k.html
  • Parallel Programming with MPI By Peter Pacheco.
    http//www.cs.usfca.edu/mpi/

144
OpenMP links
  • http//www.openmp.org/
  • http//www.compunity.org/

145
Gracias!
Introducción a la computación de altas
prestaciones
Francisco Almeida y Francisco de Sande (falmeida,
fsande)_at_ull.es
Departamento de Estadística, I.O. y
Computación Universidad de La Laguna
La Laguna, 12 de febrero de 2004
About PowerShow.com