1
Introduction to High Performance Computing
(Introducción a la computación de altas prestaciones)
Francisco Almeida and Francisco de Sande
Departamento de Estadística, I.O. y Computación
Universidad de La Laguna
La Laguna, February 12, 2004
2
Questions
  • Why Parallel Computers?
  • How Can the Quality of the Algorithms be
    Analyzed?
  • How Should Parallel Computers Be Programmed?
  • Why the Message Passing Programming Paradigm?
  • Why the Shared Memory Programming Paradigm?

3
OUTLINE
  • Introduction to Parallel Computing
  • Performance Metrics
  • Models of Parallel Computers
  • The MPI Message Passing Library
  • Examples
  • The OpenMP Shared Memory Library
  • Examples
  • Improvements in black hole detection using
    parallelism

4
Why Parallel Computers ?
  • Applications Demanding more Computational Power
  • Artificial Intelligence
  • Weather Prediction
  • Biosphere Modeling
  • Processing of Large Amounts of Data (from sources
    such as satellites)
  • Combinatorial Optimization
  • Image Processing
  • Neural Networks
  • Speech Recognition
  • Natural Language Understanding
  • etc.

[Figure: supercomputer performance and cost trends from the
1960s through the 1990s]
5
Top500
  • www.top500.org

6
Performance Metrics
7
Speed-up
  • Ts = Sequential Run Time: time elapsed between
    the beginning and the end of its execution on a
    sequential computer.
  • Tp = Parallel Run Time: time that elapses from
    the moment a parallel computation starts to
    the moment the last processor finishes the
    execution.
  • Speed-up = Ts / Tp ≤ p
  • Ts = time of the best sequential algorithm to
    solve the problem.

8
Speed-up
9
Speed-up
10
Speed-up
11
Speed-up
Optimal Number of Processors
12
Efficiency
  • In practice, the ideal behavior of a speed-up equal
    to p is not achieved, because while executing a
    parallel algorithm the processing elements
    cannot devote 100% of their time to the
    computations of the algorithm.
  • Efficiency: a measure of the fraction of time for
    which a processing element is usefully employed.
  • E = (Speed-up / p) x 100 (see the sketch below)
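
As a quick numeric illustration of both metrics, here is a minimal C sketch (the timings are hypothetical, not taken from the presentation):

#include <stdio.h>

int main(void) {
    double Ts = 100.0;  /* hypothetical sequential run time (seconds) */
    double Tp = 30.0;   /* hypothetical parallel run time (seconds)   */
    int p = 4;          /* number of processors                       */

    double speedup = Ts / Tp;                   /* Speed-up = Ts / Tp, ideally <= p */
    double efficiency = (speedup / p) * 100.0;  /* E = (Speed-up / p) x 100         */

    printf("Speed-up   = %.2f\n", speedup);      /* 3.33  */
    printf("Efficiency = %.1f%%\n", efficiency); /* 83.3% */
    return 0;
}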

13
Efficiency
14
Amdahl's Law
  • Amdahl's law attempts to give an upper bound on the
    speed-up, based on the nature of the algorithm chosen
    for the parallel implementation.
  • Seq = proportion of time the algorithm needs to
    spend in purely sequential parts.
  • Par = proportion of time that might be done in
    parallel.
  • Seq + Par = 1 (where 1 is for algebraic
    simplicity)
  • Maximum Speed-up = (Seq + Par) / (Seq + Par / p)
    = 1 / (Seq + Par / p) (see the table and the sketch below)

Seq (%)   Seq      Par      Maximum Speed-up (p = 1000)
0.1       0.001    0.999    500.25
0.5       0.005    0.995    166.81
1         0.01     0.99      90.99
10        0.1      0.9        9.91
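
The table can be reproduced with a minimal C sketch of Amdahl's bound, Maximum Speed-up = 1 / (Seq + Par / p), with p = 1000:

#include <stdio.h>

int main(void) {
    double seq[] = {0.001, 0.005, 0.01, 0.1};  /* sequential fractions from the table */
    int p = 1000;

    for (int i = 0; i < 4; i++) {
        double par = 1.0 - seq[i];               /* Seq + Par = 1        */
        double smax = 1.0 / (seq[i] + par / p);  /* Amdahl's upper bound */
        printf("Seq = %5.3f  Par = %5.3f  Maximum Speed-up = %7.2f\n",
               seq[i], par, smax);
    }
    return 0;
}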
15
Example
  • A problem to be solved many times over several
    different inputs.
  • Evaluate F(x,y,z)
  • x in {1, ..., 20}, y in {1, ..., 10}, z in {1, ..., 3}
  • The total number of evaluations is 20 x 10 x 3 = 600.
  • The cost to evaluate F at one point (x, y, z) is t.
  • The total running time is t x 600.
  • If t is equal to 3 hours,
  • the total running time for the 600 evaluations is
    1800 hours ≈ 75 days (an idealized estimate of the
    parallel time is sketched below).
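
Assuming the 600 evaluations are independent and can be spread evenly over the processors with no overhead (an idealization), the running time can be estimated with this small sketch:

#include <stdio.h>
#include <math.h>

int main(void) {
    int evaluations = 600;  /* 20 x 10 x 3 points (x, y, z) */
    double t = 3.0;         /* hours per evaluation of F    */

    for (int p = 1; p <= 64; p *= 2) {
        /* each processor performs at most ceil(600 / p) evaluations */
        double hours = ceil((double) evaluations / p) * t;
        printf("p = %2d  ->  %6.1f hours  (%.1f days)\n", p, hours, hours / 24.0);
    }
    return 0;
}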

16
Speed-up
17
Models
18
The Sequential Model
  • The RAM model expresses computations on von Neumann
    architectures.
  • The von Neumann architecture is universally
    accepted for sequential computations.

RAM
Von Neumann
19
The Parallel Model
  • Computational Models: PRAM, BSP, LogP
  • Programming Models: PVM, MPI, HPF, Threads, OpenMP
  • Architectural Models: Parallel Architectures
20
Address-Space Organization
21
Digital AlphaServer 8400
Hardware
  • Shared Memory
  • Bus Topology
  • C4-CEPBA
  • 10 Alpha 21164 processors
  • 2 Gb Memory
  • 8.8 Gflop/s

22
SGI Origin 2000
Hardware
  • Shared Distributed Memory
  • Hypercubic Topology
  • C4-CEPBA
  • 64 R10000 processors
  • 8 Gb memory
  • 32 Gflop/s

23
The SGI Origin 3000 Architecture (1/2)
  • jen50.ciemat.es
  • 160 processors MIPS R14000 / 600MHz
  • On 40 nodes with 4 processors each
  • Data and instruction cache on-chip
  • Irix Operating System
  • Hypercubic Network

24
The SGI Origin 3000 Architecture (2/2)
  • cc-NUMA memory architecture
  • 1 Gflops Peak Speed
  • 8 MB external cache
  • 1 Gb main memory each proc.
  • 1 Tb Hard Disk

25
Beowulf Computers
  • COTS: Commercial Off-The-Shelf computers
  • Distributed Memory

26
Towards Grid Computing.
Source: www.globus.org
27
The Parallel Model
  • Computational Models: PRAM, BSP, LogP
  • Programming Models: PVM, MPI, HPF, Threads, OpenMP
  • Architectural Models: Parallel Architectures
28
Drawbacks that arise when solving Problems using
Parallelism
  • Parallel programming is more complex than
    sequential programming.
  • Results may vary as a consequence of the
    intrinsic non-determinism.
  • New problems: deadlocks, starvation...
  • It is more difficult to debug parallel programs.
  • Parallel programs are less portable.

29
The Message Passing Model
30
The Message Passing Model
Send(parameters) Recv(parameters)
31
(No Transcript)
32
MPI
[Figure: earlier message-passing systems (CMMD, PVM, Express,
Zipcode, p4, PARMACS, EUI) converging into MPI, which serves
parallel libraries, parallel applications and parallel
languages]
33
MPI
  • What Is MPI?
  • Message Passing Interface standard
  • The first standard and portable message
    passing library with good performance
  • "Standard" by consensus of MPI Forum
    participants from over 40 organizations
  • Finished and published in May 1994, updated
    in June 1995
  • What does MPI offer?
  • Standardization - on many levels
  • Portability - to existing and new systems
  • Performance - comparable to vendors'
    proprietary libraries
  • Richness - extensive functionality, many
    quality implementations

34
A Simple MPI Program
MPI hello.c

#include <stdio.h>
#include "mpi.h"

main(int argc, char *argv[]) {
  int name, p, source, dest, tag = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &name);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  printf("Hello from processor %d of %d\n", name, p);
  MPI_Finalize();
}
35
A Simple MPI Program
MPI helloms.c

#include <stdio.h>
#include <string.h>
#include "mpi.h"

main(int argc, char *argv[]) {
  int name, p, source, dest, tag = 0;
  char message[100];
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &name);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  if (name != 0) {
    printf("Processor %d of %d\n", name, p);
    sprintf(message, "greetings from process %d!", name);
    dest = 0;
    MPI_Send(message, strlen(message)+1, MPI_CHAR,
             dest, tag, MPI_COMM_WORLD);
  } else {
    printf("processor 0, p = %d\n", p);
    for (source = 1; source < p; source++) {
      MPI_Recv(message, 100, MPI_CHAR, source,
               tag, MPI_COMM_WORLD, &status);
      printf("%s\n", message);
    }
  }
  MPI_Finalize();
}
36
Linear Model to Predict Communication Performance
  • Time to send n bytes ≈ β + n·τ (a start-up latency
    plus a per-byte transfer cost; see the sketch below)
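
A small C sketch of such a linear model; the latency and per-byte values below are made-up placeholders, not the measured parameters of the networks on the next slide:

#include <stdio.h>

/* Linear model: time(n) = latency + n * per_byte */
static double comm_time(long n, double latency, double per_byte) {
    return latency + n * per_byte;
}

int main(void) {
    double latency  = 50e-6;  /* assumed start-up latency: 50 microseconds */
    double per_byte = 10e-9;  /* assumed per-byte cost: roughly 100 MB/s   */

    for (long n = 1; n <= 1000000; n *= 100)
        printf("n = %8ld bytes  ->  t = %g s\n", n, comm_time(n, latency, per_byte));
    return 0;
}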

37
Performance Prediction: Fast Ethernet, Gigabit Ethernet,
Myrinet
38
Basic Communication Operations
39
One-to-all broadcast Single-node Accumulation
[Diagram: in a one-to-all broadcast a message M is sent from
node 0 to every node 1, ..., p; in the dual single-node
accumulation, the messages of nodes 1, ..., p are combined
at node 0 in p steps]
40
Broadcast on Hypercubes
41
Broadcast on Hypercubes
42
MPI Broadcast
  • int MPI_Bcast(
  • void *buffer,
  • int count,
  • MPI_Datatype datatype,
  • int root,
  • MPI_Comm comm
  • )
  • Broadcasts a message from the
  • process with rank "root" to
  • all other processes of the group

43
Reduction on Hypercubes
  • @ is a commutative and associative operator
  • Ai is in processor i
  • Every processor has to obtain A0 @ A1 @ ... @ A(P-1)

[Diagram: the values Ai placed on the nodes of a hypercube,
labeled with their binary addresses]
44
Reductions with MPI
  • int MPI_Reduce(
  • void *sendbuf, void *recvbuf, int
    count, MPI_Datatype datatype, MPI_Op op, int
    root, MPI_Comm comm)
  • Reduces values on all processes to a single value
    on the root process
  • int MPI_Allreduce(
  • void *sendbuf, void *recvbuf, int
    count, MPI_Datatype datatype, MPI_Op op, MPI_Comm
    comm)
  • Combines values from all processes and
    distributes the result back to all (see the usage
    sketch below)
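
A minimal usage sketch (not taken from the slides): every process contributes its rank and all of them receive the global sum:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* combine the ranks of all processes and distribute the result to all */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("process %d sees sum of ranks = %d\n", rank, sum);
    MPI_Finalize();
    return 0;
}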

45
All-To-All Broadcast / Multinode Accumulation
[Diagram: after an all-to-all broadcast every node holds all
the messages M0, M1, ..., Mp; the dual multinode accumulation
combines them on every node, which is how reductions and
prefix sums are implemented]
46
MPI Collective Operations
  • MPI Operator     Operation
  • ---------------------------------------------
  • MPI_MAX          maximum
  • MPI_MIN          minimum
  • MPI_SUM          sum
  • MPI_PROD         product
  • MPI_LAND         logical and
  • MPI_BAND         bitwise and
  • MPI_LOR          logical or
  • MPI_BOR          bitwise or
  • MPI_LXOR         logical exclusive or
  • MPI_BXOR         bitwise exclusive or
  • MPI_MAXLOC       max value and location
  • MPI_MINLOC       min value and location

47
Computing π: Sequential

double x, pi = 0.0, h;
long i, n;
...
h = 1.0 / (double) n;
for (i = 0; i < n; i++) {
  x = (i + 0.5) * h;
  pi += f(x);
}
pi *= h;

[Plot: f(x) = 4 / (1 + x*x) on the interval [0.0, 1.0]; the
area under the curve is π]
48
Computing π: Parallel

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
mypi = 0.0;
for (i = name; i < n; i += numprocs) {
  x = h * (i + 0.5);
  mypi += f(x);
}
mypi *= h;
/* sum the partial results on process 0 */
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM,
           0, MPI_COMM_WORLD);

[Plot: f(x) = 4 / (1 + x*x) on [0.0, 1.0], with the
subintervals distributed cyclically among the processes]
49
The Master Slave Paradigm
Master
Slaves
50
Condor University Wisconsin-Madison.
www.cs.wisc.edu/condor
  • A problem to be solved many times over several
    different inputs.
  • The problem to be solved is computationally
    expensive.

51
Condor
  • Condor is a specialized workload management
    system for compute-intensive jobs.
  • Like other full-featured batch systems, Condor
    provides a job queueing mechanism, scheduling
    policy, priority scheme, resource monitoring, and
    resource management.
  • Users submit their serial or parallel jobs to
    Condor, Condor places them into a queue, chooses
    when and where to run the jobs based upon a
    policy, carefully monitors their progress, and
    ultimately informs the user upon completion.
  • Condor can be used to manage a cluster of
    dedicated compute nodes (such as a "Beowulf"
    cluster). In addition, unique mechanisms enable
    Condor to effectively harness wasted CPU power
    from otherwise idle desktop workstations.
  • In many circumstances Condor is able to
    transparently produce a checkpoint and migrate a
    job to a different machine which would otherwise
    be idle.
  • As a result, Condor can be used to seamlessly
    combine all of an organization's computational
    power into one resource.

52
What is OpenMP?
  • Application Program Interface (API) for shared
    memory parallel programming
  • What the application programmer inserts into code
    to make it run in parallel
  • Addresses only shared memory multiprocessors
  • Directive based approach with library support
  • Concept of base language and extensions to base
    language
  • OpenMP is available for Fortran 90 / 77 and C /
    C++

53
OpenMP is not...
  • Not Automatic parallelization
  • User explicitly specifies parallel execution
  • Compiler does not ignore user directives even if
    wrong
  • Not just loop level parallelism
  • Functionality to enable coarse grained
    parallelism
  • Not a research project
  • Only practical constructs that can be implemented
    with high performance in commercial compilers
  • Goal of parallel programming: application speedup
  • Simple/Minimal with Opportunities for
    Extensibility

54
Why OpenMP?
  • Parallel programming landscape before OpenMP
  • Standard way to program distributed memory
    computers (MPI and PVM)
  • No standard API for shared memory programming
  • Several vendors had directive based APIs for
    shared memory programming
  • Silicon Graphics, Cray Research, Kuck &
    Associates, Digital Equipment ...
  • All different, vendor proprietary
  • Sometimes similar but with different spellings
  • Most were targeted at loop level parallelism
  • Limited functionality - mostly for parallelizing
    loops

55
Why OpenMP? (cont.)
  • Commercial users, high end software vendors have
    big investment in existing code
  • Not very eager to rewrite their code in new
    languages
  • Performance concerns of new languages
  • End result: users who wanted portability were forced
    to program shared memory machines using MPI
  • Library based, good performance and scalability
  • But they sacrifice the built-in shared memory
    advantages of the hardware
  • Both require major investment in time and money
  • Major effort: the entire program needs to be
    rewritten
  • New features need to be curtailed during
    conversion

56
The OpenMP API
  • Multi-platform shared-memory parallel programming
  • OpenMP is portable: supported by Compaq, HP, IBM,
    Intel, SGI, Sun and others on Unix and NT
  • Multi-Language: C, C++, F77, F90
  • Scalable
  • Loop Level parallel control
  • Coarse grained parallel control

57
The OpenMP API
  • Single source parallel/serial programming
  • OpenMP is not intrusive (to original serial
    code).
  • Instructions appear in comment statements for
    Fortran and through pragmas for C/C++
  • Incremental implementation
  • OpenMP programs can be implemented incrementally
    one subroutine (function) or even one do (for)
    loop at a time

!$omp parallel do
do i = 1, n
  ...
enddo

#pragma omp parallel for
for (i = 0; i < n; i++) {
  ...
}
58
Threads
  • Multithreading
  • Sharing a single CPU between multiple tasks (or
    "threads") in a way designed to minimise the time
    required to switch threads
  • This is accomplished by sharing as much as
    possible of the program execution environment
    between the different threads so that very little
    state needs to be saved and restored when
    changing thread.
  • Threads share more of their environment with each
    other than do tasks under multitasking.
  • Threads may be distinguished only by the value of
    their program counters and stack pointers while
    sharing a single address space and set of global
    variables

59
OpenMP Overview How do threads interact?
  • OpenMP is a shared memory model
  • Threads communicate by sharing variables
  • Unintended sharing of data causes race
    conditions
  • race condition: when the program's outcome
    changes as the threads are scheduled differently
  • To control race conditions
  • Use synchronization to protect data conflicts
  • Synchronization is expensive, so
  • Change how data is accessed to minimize the need
    for synchronization (see the sketch below)
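
A small C sketch of the point above: the first loop updates a shared counter without synchronization and is a race condition; the second avoids it with a reduction clause (a critical section would also work, at a higher cost):

#include <stdio.h>

int main(void) {
    long n = 1000000, racy = 0, safe = 0;
    long i;

    /* Race condition: all threads update the shared variable 'racy' */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        racy++;               /* the outcome changes with thread scheduling */

    /* Synchronized: each thread accumulates privately and the partial sums
       are combined at the end of the loop */
    #pragma omp parallel for reduction(+:safe)
    for (i = 0; i < n; i++)
        safe++;

    printf("racy = %ld (may be less than %ld), safe = %ld\n", racy, n, safe);
    return 0;
}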

60
OpenMP Parallel Computing Solution Stack
User layer
Environment Variables
Prog. Layer (OpenMP API)
Directives
OpenMP Library
Runtime Library
System layer
OS/system support for shared memory
61
Reasoning about programming
  • Programming is a process of successive refinement
    of a solution relative to a hierarchy of models
  • The models represent the problem at a different
    level of abstraction
  • The top levels express the problem in the
    original problem domain
  • The lower levels represent the problem in the
    computer's domain
  • The models are informal, but detailed enough to
    support simulation
  • Source: J.-M. Hoc, T.R.G. Green, R. Samurcay and
    D.J. Gilmore (eds.), Psychology of Programming,
    Academic Press Ltd., 1990

62
Layers of abstraction in Programming
[Diagram: layers of abstraction bridging the problem domain
and the hardware: Problem, Algorithm, Source Code,
Computation, Cost, Hardware; models bridge between domains,
and OpenMP only defines two of them]
63
The OpenMP Computational Model
  • OpenMP was created with a particular abstract
    machine or computational model in mind
  • Multiple processing elements
  • A shared address space with equal-time access
    for each processor
  • Multiple light weight processes (threads) managed
    outside of OpenMP (the OS or some other third
    party)

64
The OpenMP programming model
  • fork-join parallelism
  • Master thread spawns a team of threads as needed
  • Parallelism is added incrementally i.e. the
    sequential program evolves into a parallel program

Master Thread
A nested Parallel Region
Parallel Regions
65
So, How good is OpenMP?
  • A high quality programming environment supports
    transparent mapping between models
  • OpenMP does this quite well for the models it
    defines
  • Programming model
  • Threads forked by OpenMP map onto threads in
    modern OSs.
  • Computational model
  • Multiprocessor systems with cache coherency map
    onto OpenMP shared address space

66
What about the cost model?
  • OpenMP doesn't say much about the cost model
  • programmers are left to their own devices
  • Real systems have memory hierarchies, OpenMP's
    assumed machine model doesn't
  • Caches mean some data is closer to some
    processors
  • Scalable multiprocessor systems organize their
    RAM into modules - another source of NUMA
  • OpenMP programmers must deal with these issues as
    they
  • Optimize performance on each platform
  • Scale to run onto larger NUMA machines

67
What about the specification model?
  • Programmers reason in terms of a specification
    model as they design parallel algorithms
  • Some parallel algorithms are natural in OpenMP
  • Specification models implied by loop-splitting
    and SPMD algorithms map well onto OpenMP's
    programming model
  • Some parallel algorithms are hard for OpenMP
  • Recursive problems and list processing are
    challenging for OpenMP's models

68
Is OpenMP a good API?
[Table: how well OpenMP bridges between the model domains,
rated Fair (5), Good (8), Good (9) and Poor (3)]
Overall score: OpenMP is OK (6), but it sure
could be better!
69
OpenMP today
  • Hardware Vendors
  • Compaq/Digital (DEC)
  • Hewlett-Packard (HP)
  • IBM
  • Intel
  • Silicon Graphics
  • Sun Microsystems

3rd Party Software Vendors
  • Absoft
  • Edinburgh Portable Compilers (EPC)
  • Kuck & Associates (KAI)
  • Myrias
  • Numerical Algorithms Group (NAG)
  • Portland Group (PGI)
70
The OpenMP components
  • Directives
  • Environment Variables
  • Shared / Private Variables
  • Runtime Library
  • OS Threads

71
The OpenMP directives
  • Directives are special comments in the language
  • Fortran fixed form: !$OMP, C$OMP, *$OMP
  • Fortran free form: !$OMP
  • Special comments are interpreted by OpenMP
    compilers

w = 1.0/n
sum = 0.0
!$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:sum)
do I = 1, n
  x = w*(I-0.5)
  sum = sum + f(x)
end do
pi = w*sum
print *, pi
end

A comment in Fortran, but interpreted by
OpenMP compilers
Don't worry about details now!
72
The OpenMP directives
  • Look like a comment (sentinel / pragma syntax)
  • F77/F90: !$OMP directive_name clauses
  • C/C++: #pragma omp directive_name clauses
  • Declare start and end of multithread execution
  • Control work distribution
  • Control how data is brought into and taken out of
    parallel sections
  • Control how data is written/read inside sections

73
The OpenMP environment variables
  • OMP_NUM_THREADS - number of threads to run in a
    parallel section
  • MPSTKZ - size of stack to provide for each thread
  • OMP_SCHEDULE - controls how loop iterations are
    scheduled, e.g.:
  • setenv OMP_SCHEDULE "STATIC, 5"
  • setenv OMP_SCHEDULE "GUIDED, 8"
  • setenv OMP_SCHEDULE "DYNAMIC"
  • OMP_DYNAMIC
  • OMP_NESTED

74
Shared / Private variables
  • Shared variables can be accessed by all of the
    threads.
  • Private variables are local to each thread.
  • In a typical parallel loop, the loop index is
    private, while the data being indexed is shared
    (a C version is sketched after the Fortran example
    below).

!$omp parallel
!$omp parallel do
!$omp& shared(X,Y,Z), private(I)
      do I = 1, 100
        Z(I) = X(I) + Y(I)
      end do
!$omp end parallel
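
The same loop in C/OpenMP (a sketch, not part of the original slides); the arrays are shared while the loop index is private:

#include <stdio.h>

int main(void) {
    double X[100], Y[100], Z[100];
    int i;

    for (i = 0; i < 100; i++) { X[i] = i; Y[i] = 2.0 * i; }

    /* X, Y and Z are shared among the threads; the index i is private */
    #pragma omp parallel for shared(X, Y, Z) private(i)
    for (i = 0; i < 100; i++)
        Z[i] = X[i] + Y[i];

    printf("Z[99] = %g\n", Z[99]);  /* 297 */
    return 0;
}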
75
OpenMP Runtime routines
  • Writing a parallel section of code is a matter of
    asking two questions
  • How many threads are working in this section?
  • Which thread am I?
  • Other things you may wish to know
  • How many processors are there?
  • Am I in a parallel section?
  • How do I control the number of threads?
  • Control the execution by setting directive state
    (see the sketch below)
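
A sketch that answers those questions with the standard OpenMP runtime library calls:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* How many processors are there? Use no more threads than that */
    omp_set_num_threads(omp_get_num_procs());

    printf("outside: in a parallel section? %d\n", omp_in_parallel());

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();        /* Which thread am I?            */
        int nthreads = omp_get_num_threads();  /* How many threads are working? */
        printf("thread %d of %d (in a parallel section? %d)\n",
               tid, nthreads, omp_in_parallel());
    }
    return 0;
}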

76
OS threads
  • In the case of Linux, it needs to be installed
    with an SMP kernel.
  • It is not a good idea to assign more threads than
    the CPUs available:
  • omp_set_num_threads(omp_get_num_procs())

77
A simple example: computing π
78
Computing π

double t, pi = 0.0, w;
long i, n = 100000000;
...
w = 1.0 / n;
for (i = 0; i < n; i++) {
  t = (i + 0.5) * w;
  pi += 4.0 / (1.0 + t*t);
}
pi *= w;
...
79
Computing π
  • #pragma omp directives in C
  • Ignored by non-OpenMP compilers

double t, pi = 0.0, w;
long i, n = 100000000;
...
w = 1.0 / n;
#pragma omp parallel for reduction(+:pi) private(i,t)
for (i = 0; i < n; i++) {
  t = (i + 0.5) * w;
  pi += 4.0 / (1.0 + t*t);
}
pi *= w;
...
80
Computing π on a SunFire 6800
81
Compiling OpenMP programs
  • OpenMP directives are ignored by default
  • Example SGI Irix platforms
  • f90 -O3 foo.f
  • cc -O3 foo.c
  • OpenMP directives are enabled with -mp
  • Example SGI Irix platforms
  • f90 -O3 -mp foo.f
  • cc -O3 -mp foo.c

82
Fortran example
  • OpenMP directives used:
  • c$omp parallel [clauses]
  • c$omp end parallel

      program f77_parallel
      implicit none
      integer n, m, i, j
      parameter (n=10, m=20)
      integer a(n,m)
c$omp parallel default(none)
c$omp& private(i,j) shared(a)
      do j=1,m
        do i=1,n
          a(i,j)=i+(j-1)*n
        enddo
      enddo
c$omp end parallel
      end

  • Parallel clauses include:
  • default(none|private|shared)
  • private(...)
  • shared(...)

83
Fortran example
      program f77_parallel
      implicit none
      integer n, m, i, j
      parameter (n=10, m=20)
      integer a(n,m)
c$omp parallel default(none)
c$omp& private(i,j) shared(a)
      do j=1,m
        do i=1,n
          a(i,j)=i+(j-1)*n
        enddo
      enddo
c$omp end parallel
      end

  • Each arrow denotes one thread
  • All threads perform identical tasks

84
Fortran example
  • With default scheduling,
  • Thread a works on j = 1-5
  • Thread b on j = 6-10
  • Thread c on j = 11-15
  • Thread d on j = 16-20

      program f77_parallel
      implicit none
      integer n, m, i, j
      parameter (n=10, m=20)
      integer a(n,m)
c$omp parallel default(none)
c$omp& private(i,j) shared(a)
      do j=1,m
        do i=1,n
          a(i,j)=i+(j-1)*n
        enddo
      enddo
c$omp end parallel
      end

a
b
c
d
85
The Foundations of OpenMP
Layers of abstractions or models used to
understand and use OpenMP
86
Summary of OpenMP Basics
  • Parallel Region (to create threads)
  • C$omp parallel        #pragma omp parallel
  • Worksharing (to split up work between threads)
  • C$omp do              #pragma omp for
  • C$omp sections        #pragma omp sections
  • C$omp single          #pragma omp single
  • C$omp workshare
  • Data Environment (to manage data sharing)
  • directive: threadprivate
  • clauses: shared, private, lastprivate,
    reduction, copyin, copyprivate
  • Synchronization
  • directives: critical, barrier, atomic, flush,
    ordered, master
  • Runtime functions / environment variables
    (a combined sketch follows below)
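
A compact, illustrative C sketch touching the pieces listed above: a parallel region, a worksharing for with a reduction clause, a critical section, and the implicit barrier at the end of the region:

#include <stdio.h>
#include <omp.h>

int main(void) {
    long n = 1000000, i;
    double sum = 0.0, best = 0.0;

    #pragma omp parallel                  /* parallel region: a team of threads    */
    {
        double local;

        #pragma omp for reduction(+:sum)  /* worksharing + data environment clause */
        for (i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        local = omp_get_thread_num() + 1.0;  /* some per-thread result             */

        #pragma omp critical              /* synchronization: one thread at a time */
        if (local > best) best = local;
    }                                     /* implicit barrier at the end of region */

    printf("sum = %f, best = %f\n", sum, best);
    return 0;
}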

87
Improvements in black hole detection using
parallelism
Francisco Almeida1, Evencio Mediavilla2, Álex
Oscoz2 and Francisco de Sande1
1 Departamento de Estadística, I.O. y
Computación
2 Instituto de Astrofísica de
Canarias
Universidad de La Laguna, Tenerife, Canary
Islands, SPAIN
Dresden, September 3, 2003
88
Introduction (1/3)
  • Very frequently there is a divorce between
    computer scientists and researchers in other
    scientific disciplines
  • This work collects the experiences of a
    collaboration between researchers coming from two
    different fields: astrophysics and parallel
    computing
  • We present different approaches to the
    parallelization of a scientific code that solves
    an important problem in astrophysics
  • the detection of supermassive black holes

89
Introduction (2/3)
  • The IAC co-authors initially developed a
    Fortran77 code solving the problem
  • The execution time of this original code was not
    acceptable, and that motivated them to contact
    researchers with expertise in the parallel
    computing field
  • We knew in advance that these scientific
    programmers deal with intensive, time-consuming
    sequential codes that are not difficult to tackle
    using HPC techniques
  • Researchers with a purely scientific background
    are interested in these techniques, but they are
    not willing to spend time learning about them

90
Introduction (3/3)
  • One of our constraints was to introduce the
    minimum amount of changes in the original code
  • Even with the knowledge that some optimizations
    could be done in the sequential code
  • Preserving the use of the NAG functions was
    another restriction on our development

91
Outline
  • The Problem
  • Black holes and quasars
  • The method gravitational lensing
  • Fluctuations in quasars light curves
  • Mathematical formulation of the problem
  • Sequential code
  • Parallelizations: MPI, OpenMP, mixed MPI/OpenMP
  • Computational results
  • Conclusions

92
Black holes
  • Supermassive black holes (SMBH) are supposed to
    exist in the nuclei of many, if not all,
    galaxies
  • Some of these objects are surrounded by a disk of
    material continuously spiraling towards the deep
    gravitational potential pit of the SMBH and
    releasing huge quantities of energy giving rise
    to the phenomena known as quasars (QSO)

93
The accretion disk
94
Quasars (Quasi Stellar Objects, QSO)
  • QSOs are currently believed to be the most
    luminous and distant objects in the universe
  • QSOs are the cores of massive galaxies with super
    giant black holes that devour stars at a rapid
    rate, enough to produce the amount of energy
    observed by a telescope

95
The method
  • We are interested in objects of dimensions
    comparable to the Solar System in galaxies very
    far away from the Milky Way
  • Objects of this size cannot be directly imaged,
    and alternative observational methods are used to
    study their structure
  • The method we use is the observation of QSO
    images affected by a microlensing event to study
    the structure of the accretion disk

96
Gravitational Microlensing
  • Gravitational lensing (the attraction of light by
    matter) was predicted by General Relativity and
    observationally confirmed in 1919
  • If light from a QSO passes through a galaxy located
    between the QSO and the observer, it is possible
    that a star in the intervening galaxy crosses the
    QSO light beam
  • The gravitational field of the star amplifies the
    light emission coming from the accretion disk
    (gravitational microlensing)

97
Microlensing Quasar-Star
98
Microlensing
  • The phenomenon is more complex because the
    magnification is not due to a single isolated
    microlens, but is rather a collective effect
    of many stars
  • As the stars are moving with respect to the QSO
    light beam, the amplification varies during the
    crossing

99
Microlensing
100
Double Microlens
101
Multiple Microlens
  • Magnification pattern in the source plane,
    produced by a dense field of stars in the lensing
    galaxy.
  • The color reflects the magnification as a
    function of the quasar position: the sequence
    blue-green-red-yellow indicates increasing
    magnification

102
Q2237+0305 (1/2)
  • So far the best example of a microlensed quasar
    is the quadruple quasar Q2237+0305

103
Q2237+0305 (2/2)
104
Fluctuations in Q2237+0305 light curves
  • Light curves of the four images of Q2237+0305 over
    a period of almost ten years
  • The changes in relative brightness are very
    obvious

105
Q2237+0305
  • In Q2237+0305, and thanks to the unusually small
    distance between the observer and the lensing
    galaxy, the microlensing events have a timescale
    of the order of months
  • We observed Q2237+0305 from October 1999
    for approximately 4 months at the Roque de los
    Muchachos Observatory

106
Fluctuations in light curves
  • The curve representing the change in luminosity
    of the QSO with time depends on the position of
    the star and also on the structure of the
    accretion disk
  • Microlens-induced fluctuations in the observed
    brightness of the quasar contain information
    about the light-emitting source (size of
    continuum region or broad line region of the
    quasar, its brightness profile, etc.)
  • Hence from a comparison between observed and
    simulated quasar microlensing we can draw
    conclusions about the accretion disk

107
Q2237+0305 light curves (2/2)
  • Our goal is to model light curves of QSO images
    affected by a microlensing event to study the
    unresolved structure of the accretion disk

108
Mathematical formulation (1/5)
  • Leaving aside the physical meaning of the
    different variables, the function modeling the
    dependence of the observed flux with time
    t, can be written as

Where is the ratio between the outer and
inner radii of the accretion disk (we will adopt
).
109
Mathematical formulation (2/5)
  • And G is the function
  • To speed up the computation, G has been
    approximated by using MATHEMATICA

110
Mathematical formulation (3/5)
  • Our goal is to estimate in the observed flux

the values of the parameters
by fitting to the observational data
111
Mathematical formulation (4/5)
  • Specifically, to find the values of the 5
    parameters that minimize the error between the
    theoretical model and the observational data
    according to a chi-square criterion

Where
  • N is the number of data points, corresponding
    to times ti (i = 1, 2, ..., N)
  • is the theoretical function evaluated
    at time ti
  • is the observational error associated
    with each data value

112
Mathematical formulation (5/5)
  • The determination of the minimum in the
    5-parameter space depends on the initial
    conditions, so
  • we consider a 5-dimensional grid of starting
    points
  • with m sampling intervals in each variable
  • for each one of the points of this grid we
    compute a local minimum
  • Finally, we select the absolute minimum among them
  • To minimize we use the e04ccf NAG routine,
    which only requires evaluation of the function and
    not of its derivatives

113
The sequential code
      program seq_black_hole
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, length
      double precision t, fit(5)
      common/var/t, fit
c     Data input
c     Initialize best solution
      do k1=1, m
        do k2=1, m
          do k3=1, m
            do k4=1, m
              do k5=1, m
c               Initialize starting point x(1), ..., x(5)
                call jic2(nfit,x,fx)
                call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
                if (fx improves best fx) then
                  update(best (x, fx))
                endif
              enddo
            enddo
          enddo
        enddo
      enddo
      end
115
Sequential Times
  • On a Sun Blade 100 workstation running Solaris
    5.0 and using the native Fortran77 Sun compiler
    (v. 5.0) with full optimizations, this code takes:
  • 5.89 hours for sampling intervals of size m=4
  • 12.45 hours for sampling intervals of size m=5
  • On a SGI Origin 3000 using the native MIPSpro F77
    compiler (v. 7.4) with full optimizations:
  • 0.91 hours for m=4
  • 2.74 hours for m=5

116
Loop transformation
      program seq_black_hole2
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      integer k1, k2, k3, k4, k5
      common/var/t, fit
c     Data input
c     Initialize best solution
      do k = 1, m**5
c       Index transformation (see the sketch below)
c       Initialize starting point x(1), ..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best (x, fx))
        endif
      enddo
      end
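
The slide only marks the index transformation with a comment. One possible way to recover the five indices k1, ..., k5 from the single collapsed index k is repeated division by m, sketched here in C for brevity (the Fortran version is analogous):

#include <stdio.h>

/* Recover (k1, ..., k5), each in 1..m, from a single index k in 0 .. m^5 - 1.
   This is one possible index transformation; the original code is not shown. */
static void split_index(int k, int m, int idx[5]) {
    for (int d = 0; d < 5; d++) {
        idx[d] = k % m + 1;  /* 1-based, matching the original nested do loops */
        k /= m;
    }
}

int main(void) {
    int m = 4, idx[5];
    split_index(1023, m, idx);  /* last iteration when m = 4: m^5 - 1 = 1023 */
    printf("k1..k5 = %d %d %d %d %d\n", idx[0], idx[1], idx[2], idx[3], idx[4]);
    return 0;
}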
117
MPI & OpenMP (1/2)
  • In recent years OpenMP and MPI have been
    universally accepted as the standard tools to
    develop parallel applications
  • OpenMP
  • is a standard for shared memory programming
  • It uses a fork-join model and
  • is mainly based on compiler directives, added to
    the code, that indicate to the compiler the regions
    of code to be executed in parallel
  • MPI
  • Uses an SPMD model
  • Processes can read and write only to their
    respective local memory
  • Data are copied across local memories using
    subroutine calls
  • The MPI standard defines the set of functions and
    procedures available to the programmer

118
MPI & OpenMP (2/2)
  • Each one of these two alternatives has both
    advantages and disadvantages
  • Very frequently it is not obvious which one
    should be selected for a specific code:
  • MPI programs run on both distributed and shared
    memory architectures, while OpenMP runs only on
    shared memory
  • The abstraction level is higher in OpenMP
  • MPI is particularly adaptable to coarse grain
    parallelism. OpenMP is suitable for both
    coarse and fine grain parallelism
  • While it is easy to obtain a parallel version of
    a sequential code in OpenMP, it usually requires
    a higher level of expertise in the case of MPI. A
    parallel code can be written in OpenMP just by
    inserting proper directives in a sequential code,
    preserving its semantics, while in MPI major
    changes are usually needed
  • Portability is higher in MPI

119
MPI-OpenMP hybridization (1/2)
  • Hybrid codes match the architecture of SMP
    clusters
  • SMP clusters are an increasingly popular
    platform
  • MPI may suffer from efficiency problems on shared
    memory architectures
  • MPI codes may need too much memory
  • Some vendors have attempted to extend MPI for
    shared memory, but the result is not as efficient
    as OpenMP
  • OpenMP is easy to use, but it is limited to the
    shared memory architecture

120
MPI-OpenMP hybridization (2/2)
  • A hybrid code may provide better scalability
  • Or simply enable a problem to exploit more
    processors
  • It is not necessarily faster than pure MPI/OpenMP
  • It depends on the code, the architecture, and how the
    programming models interact

121
MPI code
      program black_hole_mpi
      include 'mpif.h'
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      common/var/t, fit
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
c     Data input
c     Initialize best solution
      do k = myid, m**5 - 1, numprocs
c       Initialize starting point x(1), ..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best (x, fx))
        endif
      enddo
      call MPI_FINALIZE( ierr )
      end
123
OpenMP code
      program black_hole_omp
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      common/var/t, fit
!$OMP THREADPRIVATE(/var/, /data/)
c     Data input
c     Initialize best solution
!$OMP PARALLEL DO DEFAULT(SHARED)
!$OMP&  PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w3,w4,w5,w6,x)
!$OMP&  COPYIN(/data/) LASTPRIVATE(fx)
      do k = 0, m**5 - 1
c       Initialize starting point x(1), ..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best (x, fx))
        endif
      enddo
!$OMP END PARALLEL DO
      end
126
Hybrid MPI + OpenMP code
      program black_hole_mpi_omp
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, length
      double precision t, fit(5)
      common/var/t, fit
!$OMP THREADPRIVATE(/var/, /data/)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, mpi_numprocs, ierr)
c     Data input
c     Initialize best solution
!$OMP PARALLEL DO DEFAULT(SHARED)
!$OMP&  PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w3,w4,w5,w6,x)
!$OMP&  COPYIN(/data/) LASTPRIVATE(fx)
      do k = myid, m**5 - 1, mpi_numprocs
c       Initialize starting point x(1), ..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best (x, fx))
        endif
      enddo
c     Reduce the OpenMP best solution
!$OMP END PARALLEL DO
c     Reduce the MPI best solution
      call MPI_FINALIZE(ierr)
      end
128
The SGI Origin 3000 Architecture (1/2)
  • jen50.ciemat.es
  • 160 processors MIPS R14000 / 600MHz
  • On 40 nodes with 4 processors each
  • Data and instruction cache on-chip
  • Irix Operating System
  • Hypercubic Network

129
The SGI Origin 3000 Architecture (2/2)
  • cc-NUMA memory architecture
  • 1 Gflops Peak Speed
  • 8 MB external cache
  • 1 Gb main memory each proc.
  • 1 Tb Hard Disk

130
Computational results (m=4)
  • As we do not have exclusive-mode access to the
    architecture, the times correspond to the minimum
    time from 5 different executions
  • The figures corresponding to the mixed mode code
    (labeled MPI-OpenMP) correspond to the minimum
    times obtained for different combinations of MPI
    processes/OpenMP threads

131
Parallel execution time (m=4)
132
Speedup (m=4)
133
Results for mixed MPI-OpenMP
134
Computational results (m=5)
135
Parallel execution time (m=5)
136
Speedup (m=5)
137
Results for mixed MPI-OpenMP (m=5)
138
Conclusions (1/3)
  • The computational results obtained from all the
    parallel versions confirm the robustness of the
    method
  • For the case of non-expert users and the kind of
    codes we have been dealing with, we believe that
    the MPI parallel versions are easier and safer
  • In the case of OpenMP, the proper usage of data
    scope attributes for the variables involved may
    be a handicap for users without parallel
    programming expertise
  • The higher current portability of the MPI version
    is another factor to be considered

139
Conclusions (2/3)
  • The mixed MPI/OpenMP parallel version is the most
    expertise-demanding
  • Nevertheless, as has been stated by several
    authors, and even in the case of a hybrid
    architecture, this version does not offer any
    clear advantage, and it has the disadvantage that
    the combination of processes/threads has to be
    tuned

140
Conclusions (3/3)
  • This work concludes a first step of cooperation
    in applying HPC techniques to improve the
    performance of astrophysics codes
  • The scientific aim of applying HPC to
    computationally intensive codes in astrophysics
    has been successfully achieved

141
Conclusions and Future Work
  • The relevance of our results does not come directly
    from the particular application chosen, but from
    showing that parallel computing techniques are
    the key to tackling large-scale real problems in
    the mentioned scientific field
  • From now on we plan to continue this fruitful
    collaboration by applying parallel computing
    techniques to other challenging astrophysical
    problems

142
Supercomputing Centers
  • http://www.ciemat.es/
  • http://www.cepba.upc.es/
  • http://www.epcc.ed.ac.uk/

143
MPI links
  • Message Passing Interface Forum
  • http://www.mpi-forum.org/
  • MPI: The Complete Reference
  • http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
  • Parallel Programming with MPI, by Peter Pacheco.
    http://www.cs.usfca.edu/mpi/

144
OpenMP links
  • http://www.openmp.org/
  • http://www.compunity.org/

145
Thank you!
Introduction to High Performance Computing
(Introducción a la computación de altas prestaciones)
Francisco Almeida and Francisco de Sande
{falmeida, fsande}@ull.es
Departamento de Estadística, I.O. y Computación
Universidad de La Laguna
La Laguna, February 12, 2004