Optimising MPI Applications for Metacomputers by Using MetaMPICH and MetaComm
Transcript and Presenter's Notes
1
Optimising MPI Applications for Metacomputers by Using MetaMPICH and MetaComm
Carsten Clauss, Martin Pöppe, Thomas Bemmerl
carsten@lfbs.rwth-aachen.de
http://www.mp-mpich.de
2
Table of Contents
  • Part 1: Metacomputing / Coupled Clusters
  • What is a Metacomputer?
  • Our Solution: MetaMPICH
  • Advantages / Disadvantages
  • Part 2: Optimising MPI Applications
  • Boundary Value Problems on Structured Grids
  • Load Balancing on Metacomputers
  • Our Solutions: SmartPart / MetaComm

This presentation is about metacomputing and the linking of clusters, and we will show approaches to optimising MPI applications for such heterogeneous systems. The presentation is divided into two parts.
In the first part we cover the following topics: What is a metacomputer, and how do you build one? Then we will present our own software solution, called MetaMPICH. And what are the advantages and disadvantages of metacomputing?
3
Table of Contents

In the second part we will show approaches to optimising MPI applications for those metacomputers. We will demonstrate them for a simple class of applications, which solve boundary value problems on structured grids.
Then we will deal with load balancing and domain decomposition for metacomputers.
And finally, we will present our own software solutions for those problems, called SmartPart and MetaComm.
4
Table of Contents

For now, back to Part 1 and the question: What is a metacomputer?
5
Coupled Clusters
SCI
We want to explain this for the case of coupled clusters. A cluster is a compound of independent computational nodes which are connected via a very fast network, such as SCI, Myrinet or InfiniBand.
In this example the cluster nodes are internally connected by the Scalable Coherent Interface (SCI), which is a very fast network that enables remote shared memory access.
6
Coupled Clusters
Cluster A
Cluster B
ATM
SCI
SCI
A further architectural concept is to link distant clusters into a larger unit via a wide-area interconnect (such as ATM).
7
Coupled Clusters
Cluster A
Cluster B
ATM
SCI
SCI
SCI
SCI
SCI
Metacomputer
If this system, which can be understood as a cluster of clusters, is viewed by an application as one transparent parallel computer, then we call it a metacomputer.
8
Coupled Clusters
Cluster A
Cluster B
Metahost A
Metahost B
ATM
SCI
SCI
SCI
SCI
SCI
Metacomputer
And in this context each cluster of this
metacomputer is called a metahost.
9
Example of a Possible Use
Nice idea, but how?
Düsseldorf
Jülich
Köln
Aachen
Bonn
This is an example of a conceivable use case. Think of the various computing centres, for example in North Rhine-Westphalia, Germany.
Now we couple those centres into one big metacomputer. Nice idea, but how do we realise it?
10
How to build a Metacomputer?
  • MetaMPICH
  • MP-MPICH

Transparent MPI system on metacomputers
- Windows 2000/XP, Linux, Solaris
- SCI, TCP, SHMEM internal
- TCP and AAL5 external
One solution is to use our own software library called MetaMPICH. MetaMPICH, an extension to the well-known MPICH library, provides a fully transparent MPI system for applications on a metacomputer.
MetaMPICH is also part of our Multi-Platform MPICH project, where we aim to support different network technologies as well as different operating systems.
Currently we support Windows as well as Linux and Solaris. We provide SCI, TCP and shared memory for cluster-internal communication, while TCP and ATM are provided for the external coupling communication.
11
Architecture of MetaMPICH
Metahost B
Metahost A
SCI
SCI
SCI
Router
Router
To provide a fully transparent MPI system to the application, MetaMPICH uses additional communication devices and so-called router processes.
Messages to processes within a metahost are sent as usual via the native communication device.
But messages to processes in a remote metahost are first sent via a gateway to the local router process, as you can see in this example. This router process, which has access to the inter-metahost interconnection, forwards the message to its counterpart in the remote metahost. Finally, the remote router tunnels the message to the receiver.
12
Architecture of MetaMPICH
Router
Router
Router
Router
SCI
Router
Router
MetaMPICH also supports an arbitrary number of metahosts. On this slide you can see an example configuration with three metahosts.
But, as you can see, for each metahost you need a dedicated pair of router processes.
13
Metacomputing
  • Advantage
  • Disadvantage

Many more MPI processes by using the remote computational power
Inter-cluster communication is the system's bottleneck
Existing applications cannot benefit from the fast internal networks
What are the advantages and disadvantages of metacomputing?
The advantage of using a metacomputer seems to be obvious: the coupled system offers your application many more processes by exploiting the remote computational power.
But the main disadvantage is that the inter-cluster interconnection obviously constitutes the system's bottleneck. If you run existing MPI applications on this transparent system, you are not able to benefit from the fast internal cluster networks.
14
Process Grouping
Division of the algorithmic problem
So, if you want to benefit, you must try to divide your algorithmic problem into groups of processes which communicate a lot inside a group, but only little between the groups.
15
Process Grouping
MPI_COMM_WORLD
MetaMPICH: new MPI communicator MPI_COMM_LOCAL (one per metahost)
To support such a division into groups, MetaMPICH offers the new MPI communicator MPI_COMM_LOCAL.
This communicator reflects the group of application processes on each local metahost, while MPI_COMM_WORLD describes all application processes of the metacomputer. So, this communicator allows an application to distinguish explicitly between cluster-internal and cluster-external communication, as the sketch below illustrates.
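A minimal sketch, in plain C with MPI, of how an application might use this; MPI_COMM_LOCAL is the MetaMPICH extension named on the slide, while the reduction workload itself is only an illustrative assumption:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int world_rank, local_rank;
        double partial, local_sum = 0.0, global_sum = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  /* rank in the whole metacomputer */
        MPI_Comm_rank(MPI_COMM_LOCAL, &local_rank);  /* rank within the local metahost */

        partial = (double)world_rank;

        /* Frequent, fine-grained traffic stays on the fast internal network. */
        MPI_Allreduce(&partial, &local_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_LOCAL);

        /* Rare, coarse-grained traffic crosses the slow external coupling. */
        MPI_Allreduce(&partial, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        printf("world %d / local %d: local sum %.1f, global sum %.1f\n",
               world_rank, local_rank, local_sum, global_sum);
        MPI_Finalize();
        return 0;
    }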
16
Metacomputing
  • Pay Attention to Load Balance !!!

Clusters do NOT need to be identical !!!
- Different CPU Power
- Different Number of Nodes
- Different Internal Networks
But even if you can find a smart division of your problem into such groups of processes, you also have to pay proper attention to load balancing.
This is because the metahosts need not be identical. Different CPU power, different numbers of nodes and also different internal cluster networks can lead to varying performance between the metahosts.
17
Metacomputing
  • Challenges

Inter-Metahost Communication Bottleneck
Heterogeneous System Load Balance !!!
So, let us summarise the two challenges of metacomputing:
- If possible, you have to consider the inter-metahost communication bottleneck!
- You have to pay additional attention to a fair load distribution, because a metacomputer is always a heterogeneous system!
18
Table of Contents

In the second part of this presentation we will show approaches to handling those challenges. We will demonstrate this for a very simple class of applications, whose cores solve boundary value problems on structured grids.
19
Boundary Value Problems
Distribution of Temperature in a Plate
ΔT = 0
T(x,y)
A very simple boundary value problem is the steady-state distribution of temperature in a two-dimensional plate.
This problem is described by the well-known Laplace equation and the given temperatures on the plate's borders.
20
Boundary Value Problems
Discretised Problem
T(x_i, y_j)
To solve this problem numerically on a computer, we discretise the problem by means of a two-dimensional grid.
21
Boundary Value Problems
Memory Cells
The values on the grid points are now represented
by memory cells in the computer.
22
Boundary Value Problems
  • Discretising the Problem
  • Iterative Solver

large and sparsely populated linear equation systems
simple example: Jacobi or Gauss-Seidel method
Discretising the problem leads to large and sparsely populated linear equation systems.
The resulting linear equation system can easily be solved by an iterative solver, like the Jacobi or Gauss-Seidel method.
23
Boundary Value Problems
Iteration Rule (five-point stencil)

T(i,j) = 0.25 * ( T(i-1,j) + T(i+1,j) + T(i,j-1) + T(i,j+1) )

  • Large number of iterations over the grid: approximation of the solution

This leads to the iteration rule shown above, known as the five-point stencil. As you can see, each grid point is calculated from the values of its four direct neighbours; a serial sketch of this update follows below.
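A minimal serial sketch of this update in C, assuming a square grid whose outermost rows and columns hold the fixed boundary values; the grid size and iteration count are illustrative assumptions, not values from the talk:

    #include <string.h>

    #define N 64            /* grid points per dimension (assumed) */
    #define ITERATIONS 1000 /* illustrative; real codes iterate to convergence */

    void jacobi(double t[N][N])
    {
        static double t_new[N][N];

        for (int it = 0; it < ITERATIONS; ++it) {
            /* Five-point stencil over the interior points only;
             * rows/columns 0 and N-1 carry the fixed boundary values. */
            for (int i = 1; i < N - 1; ++i)
                for (int j = 1; j < N - 1; ++j)
                    t_new[i][j] = 0.25 * (t[i - 1][j] + t[i + 1][j] +
                                          t[i][j - 1] + t[i][j + 1]);

            /* Copy the updated interior back for the next iteration. */
            for (int i = 1; i < N - 1; ++i)
                memcpy(&t[i][1], &t_new[i][1], (N - 2) * sizeof(double));
        }
    }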
24
Boundary Value Problems
Discrete Solution
When using this rule and iterating several times over the grid, we approach the sought solution of the problem.
25
Parallelisation
Domain Decomposition
Proc 0 / Proc 1
Exchange into Ghost Lines
To parallelise this algorithm, for example for two processes, we divide the grid by a simple cut, so that each process works on its own subdomain.
A problem occurs when a process works at the border to the other process. As you can see, the process needs values which the other process holds, and vice versa. So these values must be exchanged into additional ghost lines, as the sketch below shows.
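A minimal sketch of such a ghost-line exchange with plain MPI, assuming a horizontal cut, row-major storage and one ghost row on each side; NX and the neighbour ranks are illustrative assumptions:

    #include <mpi.h>

    #define NX 64  /* points per grid row (assumed) */

    /* Rows 1..local_ny are owned; rows 0 and local_ny+1 are ghost lines.
     * Passing MPI_PROC_NULL as a neighbour turns a call into a no-op,
     * which handles the subdomains at the outer borders. */
    void exchange_ghost_lines(double *grid, int local_ny,
                              int up_rank, int down_rank, MPI_Comm comm)
    {
        /* Send our first owned row upwards, receive the lower
         * neighbour's border row into our bottom ghost line. */
        MPI_Sendrecv(&grid[1 * NX],              NX, MPI_DOUBLE, up_rank,   0,
                     &grid[(local_ny + 1) * NX], NX, MPI_DOUBLE, down_rank, 0,
                     comm, MPI_STATUS_IGNORE);

        /* Send our last owned row downwards, receive the upper
         * neighbour's border row into our top ghost line. */
        MPI_Sendrecv(&grid[local_ny * NX],       NX, MPI_DOUBLE, down_rank, 1,
                     &grid[0 * NX],              NX, MPI_DOUBLE, up_rank,   1,
                     comm, MPI_STATUS_IGNORE);
    }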
26
Using a Metacomputer
Grouping of Processes?
Let's transfer this parallel algorithm to a metacomputer and think about the problems we mentioned in part one of this presentation.
First, is there a possibility to group the processes in a way that reduces the inter-metahost communication?
27
Using a Metacomputer
Inner Boundaries
In most cases, the given boundary value problem specifies additional boundaries within the grid. These inner boundaries are fixed during the iterations or can be calculated by each process on its own.
The underlying idea is now to disregard these fixed values during communication, because they do not need to be exchanged. In this way the message length, and thus the communication effort, can be reduced; a sketch of the packing idea follows below.
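A minimal sketch of that reduction, assuming the fixed cells along a border are marked in a mask array; this representation is an assumption for illustration:

    /* Pack only the non-fixed border cells into the outgoing message,
     * skipping values that are fixed inner boundaries. Returns the
     * (shortened) message length. */
    int pack_border(const double *border, const int *fixed, int n, double *msg)
    {
        int len = 0;
        for (int i = 0; i < n; ++i)
            if (!fixed[i])  /* fixed values never change, so skip them */
                msg[len++] = border[i];
        return len;
    }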
28
Using a Metacomputer
Reduced Communication
Cluster A
Cluster B
Assume that the grey cells in this example represent these fixed values. When dividing the grid onto the metahosts by this cut, the inter-cluster communication is reduced to the lower part of the grid.
29
Using a Metacomputer
Load Balance?
But what about load balance?
30
Using a Metacomputer
Domain Distribution onto Processes
ProcB1
ProcA0
ProcB2
ProcA1
ProcB3
Assume the following domain distribution onto the processes. In this simple example, cluster A has got two processors, while cluster B has got three processors.
If all processors have the same power, this is NOT a fair distribution of the computational load.
31
Using a Metacomputer
Load Balance vs. Communication
ProcA0
ProcB1
ProcB2
ProcB0
ProcA1
This decomposition scheme might be fairer, but here the inter-cluster communication is increased again.
32
Smart Partitioner
  • A smart partition scheme provides

- Load Balance and
- Reduced Communication (if possible)

A smart partition scheme depends on:
- the given boundary value problem
- the structure of the metacomputer
Smart Partitioner (SmartPart)
Finding a smart partition scheme that provides load balance AND reduced inter-cluster communication is difficult. The solution is to do the domain decomposition in an algorithmic way.
So, for research purposes, we developed a smart partitioner called SmartPart, which can do this for this simple class of applications.
33
Smart Partitioner
Determination of Cut Metrics
By means of performance measurements, SmartPart determines a communication metric and a load balance metric for every possible cut of the decomposition.
34
Smart Partitioner
Communication Metric
For the presented example of two metahosts and a vertical cut, this is the communication metric. As you can see, the metric is small where the message length would be short.
And the metric is zero at the borders of the entire grid: if one cluster did the whole work, no messages would have to be exchanged between them at all.
35
Smart Partitioner
Communication Metric
Load Balance Metric
And on this slide you can see the load balance metric for a power ratio of three to two. Here, too, the lowest metric represents the best cut.
36
Smart Partitioner
Superposition
best cut
The superposition of the communication and load balance metrics finally determines the best cut, as the sketch below illustrates.
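A minimal sketch of that superposition, assuming both metric arrays have already been filled from performance measurements; the plain unweighted sum is an assumption, as the talk does not specify the weighting:

    #include <float.h>

    /* comm_metric[c]: predicted inter-metahost communication cost of cut c
     * load_metric[c]: predicted load imbalance of cut c
     * Returns the index of the cut with the smallest combined metric. */
    int best_cut(const double *comm_metric, const double *load_metric, int num_cuts)
    {
        int best = 0;
        double best_score = DBL_MAX;

        for (int c = 0; c < num_cuts; ++c) {
            double score = comm_metric[c] + load_metric[c];  /* superposition */
            if (score < best_score) {
                best_score = score;
                best = c;
            }
        }
        return best;
    }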
37
Smart Partitioner
Decomposition Patterns
three metahosts / first cut horizontal
Since SmartPart iterates over all possible decomposition patterns, an optimal partition scheme will be found in most cases.
On this slide you can see an example of possible decomposition patterns for three metahosts when beginning with a horizontal cut.
38
Smart Partitioner
Decomposition Patterns
three metahosts / first cut vertical
And on this slide you can see an example of possible decomposition patterns for three metahosts when beginning with a vertical cut.
As you can see, SmartPart only works with horizontal and vertical cuts; this is because we are only dealing with structured grids.
39
Example CFD Simulation
This is a screenshot from a simple CFD application: the simulation of a flow channel, where the water or fluid flows from the left to the right side.
The colour describes the pressure in the channel: red means high pressure, blue means low pressure. The white dashes represent the velocity, and the black areas are barriers and thus the fixed values within the grid.
40
Example CFD Simulation
Cluster B
Cluster A
This is an example of a possible decomposition delivered by SmartPart for two metahosts. Here cluster A has got four nodes and cluster B has got eight nodes.
As you can see, the communication between the clusters is reduced to a very small area, denoted by the white arrows.
41
Example CFD Simulation
Cluster B
Cluster A
Cluster C
This is a possible domain decomposition for three metahosts. And here, too, the communication between the clusters is reduced.
42
Adaptation Layer
  • Smart Partitioner

Optimal decomposition scheme for your problem on your metacomputer
How to use it in your application?
Adaptation layer on top of MetaMPICH that your applications can easily attach to
Communication Library (MetaComm)
Now assume that SmartPart has found an optimal partition scheme for the problem. How do you use it in your application?
For that purpose we developed a small adaptation layer on top of MetaMPICH, which those applications can easily attach to.
43
Adaptation Layer
  • MetaComm
  • Optimised Communication

- additional communication functions
- can replace all explicit MPI functions
- based on the smart partition scheme
- performs every possible reduction
- metacomputer is still transparent
This layer, called MetaComm, is a small software library which provides simple communication functions that can replace all explicit MPI function calls within an application.
MetaComm is based on the predetermined partition scheme of SmartPart and performs every possible reduction of inter-metahost communication.
MetaComm keeps the metacomputer transparent for the applications, but at the same time it helps to take the heterogeneous structure of a metacomputer into account; a hypothetical sketch of the idea follows below.
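To illustrate the idea only: a hypothetical drop-in exchange routine in the spirit of MetaComm. The type mc_partition_t, the function name and its signature are invented for this sketch; the real MetaComm API is not shown in the talk:

    #include <mpi.h>

    /* Hypothetical descriptor derived from a SmartPart decomposition:
     * only the border segments that are not fixed inner boundaries are
     * listed, so nothing superfluous ever crosses a metahost border. */
    typedef struct {
        int  neighbour_rank;  /* rank of the process across the cut */
        int  num_segments;    /* number of non-fixed border segments */
        int *send_offsets;    /* where each outgoing segment starts */
        int *recv_offsets;    /* where each incoming segment lands (ghost cells) */
        int *lengths;         /* segment lengths in grid points */
    } mc_partition_t;

    /* Hypothetical replacement for an application's explicit halo
     * exchange: the caller no longer needs to know whether the
     * neighbour sits on the same metahost, so the metacomputer
     * stays transparent. */
    void mc_exchange_ghost_lines(double *grid, const mc_partition_t *p,
                                 MPI_Comm comm)
    {
        for (int s = 0; s < p->num_segments; ++s)
            MPI_Sendrecv(grid + p->send_offsets[s], p->lengths[s], MPI_DOUBLE,
                         p->neighbour_rank, s,
                         grid + p->recv_offsets[s], p->lengths[s], MPI_DOUBLE,
                         p->neighbour_rank, s,
                         comm, MPI_STATUS_IGNORE);
    }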
44
Adaptation Layer
[Slide diagram: an application (iterative simulation) is parallelised with MPI and runs on coupled clusters via MetaMPICH. Three paths are shown: transparent use (no good performance), reworking the parallel part with MPI_COMM_LOCAL, and attaching to MetaComm, which is backed by SmartPart.]
This is the normal case: you have got your simulation application, and it is parallelised using the Message Passing Interface.
Now you want to run your application on a metacomputer. As mentioned, you can do this by using our software library MetaMPICH. But using the metacomputer in a purely transparent way will lead to poor performance.
You can think about reworking the parallel part of your application, for example by using the new MPI communicator MPI_COMM_LOCAL. But the best way is to use the communication library MetaComm, which is supported by the decomposition schemes of SmartPart.
45
Conclusion
Metacomputing can be a powerful way to increase performance.
A metacomputer is a heterogeneous system !!!
You should always search for ways to optimise your applications !!!

Metacomputing can be a powerful way to increase performance. But since a metacomputer is a heterogeneous system, you should always search for ways to optimise your applications.
46
The End
Carsten Clauss, Martin Pöppe, Thomas Bemmerl
carsten@lfbs.rwth-aachen.de
http://www.mp-mpich.de