Title: Kickstart Tutorial/Seminar on using the 64-nodes P4-Xeon Cluster in Science Faculty
1 Kickstart Tutorial/Seminar on using the 64-nodes P4-Xeon Cluster in Science Faculty
2 Aims and target audience
- Aims
  - Provide a kickstart tutorial to potential cluster users in Science Faculty, HKBU
  - Promote the usage of the PC cluster in Science Faculty
- Target audience
  - Science Faculty students referred by their project/thesis supervisors
  - Staff who are interested in High Performance Computing
3 Outline
- Brief introduction
- Hardware, software, login and policy
- How to write and run programs on multiple CPUs
- Simple MPI programming
- Resources on MPI documentation
- Demonstration of software installed
- SPRNG, BLAS, NAMD2, GAMESS, PGI
4 Bought by Teaching Development Grant
5 Hardware Configuration
- 1 master node + 64 compute nodes + Gigabit interconnection
- Master node
- Dell PE2650, P4-Xeon 2.8GHz x 2
- 4GB RAM, 36GB x 2 U160 SCSI (mirror)
- Gigabit ethernet ports x 2
- SCSI attached storage
- Dell PV220S
- 73GB x 10 (RAID5)
6 Hardware Configuration (cont.)
- Compute nodes
- Dell PE2650, P4-Xeon 2.8GHz x 2
- 2GB RAM, 36GB U160 SCSI HD
- Gigabit ethernet ports x 2
- Gigabit Interconnect
- Extreme BlackDiamond 6816 Gigabit ethernet switch
- 256 Gb backplane
- 72 Gigabit ports (8-port card x 9)
7 Software installed
- Cluster operating system
- ROCKS 2.3.2 from www.rocksclusters.org
- MPI and PVM libraries
- LAM/MPI 6.5.9, MPICH 1.2.5, PVM 3.4.3-6beolin
- Compilers
- GCC 2.96, GCC 3.2.3
- PGI C/C++/F77/F90/HPF version 4.0
- MATH libraries
- ATLAS 3.4.1, ScaLAPACK, SPRNG 2.0a
- Application software
- MATLAB 6.1 with MPITB
- Gromacs 3.1.4, NAMD 2.5b1, GAMESS
- Editors
- vi, pico, emacs, joe
- Queuing system
- OpenPBS 2.3.16, Maui scheduler
8 Cluster O.S.: ROCKS 2.3.2
- Developed by NPACI and SDSC
- Based on RedHat 7.3
- Allows setup of 64 nodes in 1 hour
- Useful commands for users to monitor jobs on all nodes, e.g.
  - cluster-fork date
  - cluster-ps morris
  - cluster-kill morris
- Web-based management and monitoring
  - http://tdgrocks.sci.hkbu.edu.hk
9 Ganglia
10 PBS Job queue
11 Hostnames
- Master node
  - External: tdgrocks.sci.hkbu.edu.hk
  - Internal: frontend-0
- Compute nodes
  - comp-pvfs-0-1, ..., comp-pvfs-0-48
  - Short names: cp0-1, cp0-2, ..., cp0-48
12 Network diagram
[Diagram: the master node tdgrocks.sci.hkbu.edu.hk / frontend-0 (192.168.8.1) connects through a Gigabit ethernet switch to the compute nodes comp-pvfs-0-1 (192.168.8.254), comp-pvfs-0-2 (192.168.8.253), ..., comp-pvfs-0-48 (192.168.8.207)]
13 Login to the master node
- Login is allowed remotely from all HKBU networked PCs by ssh or vncviewer
- SSH login (terminal login)
  - Use your favourite ssh client software, e.g. PuTTY or SSH Secure Shell on Windows and OpenSSH on Linux/UNIX
  - E.g. on all SCI workstations (sc11 to sc30), type
    - ssh tdgrocks.sci.hkbu.edu.hk
14 Login to the master node
- VNC login (graphical login)
  - Use vncviewer, downloadable from http://www.uk.research.att.com/vnc/
  - E.g. on sc11 to sc30.sci.hkbu.edu.hk,
    - vncviewer vnc.sci.hkbu.edu.hk:51
  - E.g. on Windows, run vncviewer and, when asked for the server address, type
    - vnc.sci.hkbu.edu.hk:51
15 Username and password
- Unified password authentication has been implemented
- Same as that of your Netware account
- Password authentication using NDS-AS
- Setup similar to net1 and net4 in ITSC
16 ssh key generation
- To make use of multiple nodes in the PC cluster, users are restricted to using ssh.
- Key generation is done once, automatically, during the first login.
- You may input a passphrase to protect the key pair.
- The key pair is stored in your $HOME/.ssh/
17 User Policy
- Users are allowed to remote login from other networked PCs in HKBU.
- All users must use their own user account to login.
- The master node (frontend) is used only for login, simple editing of program source code, preparing the job dispatching script, and dispatching of jobs to compute nodes. No foreground or background jobs may be run on it.
- Dispatching of jobs must be done via the OpenPBS system.
18 OpenPBS system
- Provides a fair and efficient job dispatching and queuing system for the cluster
- A PBS script shall be written for running a job
- Either sequential or parallel jobs can be handled by PBS
- Job error and output are stored in different filenames according to job IDs.
19 PBS script example (sequential)
#!/bin/bash
#PBS -l nodes=1
#PBS -N prime
#PBS -m ae
#PBS -q default
# the above are the PBS directives used in the batch queue
# Assume that you placed the executable in /u1/local/share/pbsexamples
echo Running on host `hostname`
/u1/local/share/pbsexamples/prime 216091
- PBS scripts are shell scripts with directives preceded by #PBS
- The above example requests only 1 node and delivers the job named prime to the default queue.
- The PBS system will mail a message after the job has executed.
20 Delivering a PBS job
- Prepare and compile the executable
  - cp /u1/local/share/pbsexamples/prime.c .
  - cc -o prime prime.c -lm
- Prepare and edit the PBS script as above
  - cp /u1/local/share/pbsexamples/prime.bat .
- Submit the job
  - qsub prime.bat
21 PBS script example (parallel)
#!/bin/sh
#PBS -N cpi
#PBS -r n
#PBS -e cpi.err
#PBS -o cpi.log
#PBS -m ae
#PBS -l nodes=5:ppn=2
#PBS -l walltime=01:00:00
# This job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
echo Running on host `hostname`
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
# Define number of processors
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS nodes
# Run the parallel MPI executable cpi
/u1/local/mpich-1.2.5/bin/mpirun -v -machinefile $PBS_NODEFILE -np $NPROCS /u1/local/share/pbsexamples/cpi
22 Delivering parallel jobs
- Copy the PBS script example
  - cp /u1/local/share/pbsexamples/runcpi .
- Submit the PBS job
  - qsub runcpi
- Note the error and output files named
  - cpi.e??? and cpi.o???
23 End of Part 1
24 Demonstration of software installed
- SPRNG
- BLAS, ScaLAPACK
- MPITB for MATLAB
- NAMD2 and VMD
- GAMESS, GROMACS
- PGI Compilers for parallel programming
- More
25 SPRNG 2.0a
- Scalable Parallel Pseudo Random Number Generators Library
- A set of libraries for scalable and portable pseudorandom number generation.
- Most suitable for parallel Monte Carlo simulation
- The current version is installed in /u1/local/sprng2.0a
- For serial source code (e.g. mcErf.c), compile with
  - gcc -c -I /u1/local/sprng2.0/include mcErf.c
  - gcc -o mcErf -L /u1/local/sprng2.0/lib mcErf.o -lsprng -lm
- For parallel source code (e.g. mcErf-mpi.c), compile with
  - mpicc -c -I /u1/local/sprng2.0/include mcErf-mpi.c
  - mpicc -o mcErf-mpi -L /u1/local/sprng2.0/lib mcErf-mpi.o \
    -lsprng -lpmpich -lmpich -lm
- Or use a Makefile to automate the above process
- Sample files mcPi.tar.gz and mcErf.tar.gz can be found in /u1/local/share/example/sprng/ on the cluster
- Thanks to Mr. K.I. Liu for providing documentation and samples for SPRNG at http://www.math.hkbu.edu.hk/kiliu.
- More information can be found at http://sprng.cs.fsu.edu/.
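- To give a feel for the serial case, here is a minimal Monte Carlo sketch estimating pi; it assumes SPRNG 2.0's simple interface (SIMPLE_SPRNG, init_sprng, sprng) as described in the SPRNG documentation, and the file name and seed are illustrative only:

/* mcPi-sketch.c - a minimal sketch, assuming SPRNG 2.0's simple interface.
 * Compile as for the serial example above:
 *   gcc -c -I /u1/local/sprng2.0/include mcPi-sketch.c
 *   gcc -o mcPi-sketch -L /u1/local/sprng2.0/lib mcPi-sketch.o -lsprng -lm
 */
#include <stdio.h>
#define SIMPLE_SPRNG            /* single-stream "simple" interface */
#include "sprng.h"

int main(void)
{
    int i, hits = 0, n = 1000000;

    /* seed the default lagged-Fibonacci generator */
    init_sprng(SPRNG_LFG, 42, SPRNG_DEFAULT);

    for (i = 0; i < n; i++) {
        double x = sprng();     /* uniform in [0,1) */
        double y = sprng();
        if (x * x + y * y < 1.0)
            hits++;
    }
    printf("pi is approximately %f\n", 4.0 * (double)hits / n);
    return 0;
}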
26 BLAS
- Basic Linear Algebra Subprograms
- Basic vector and matrix operations
- Sample code showing the speed of BLAS matrix-matrix multiplication against a self-written for loop, in /u1/local/share/example/blas
  - dgemm.c, makefile.dgemm
  - dgemm-mpi.c, makefile.dgemm-mpi
- Thanks to Mr. C.W. Yeung, MATH for providing the above example
- Further information can be found at http://www.netlib.org/blas/
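- The core of such a comparison is a single dgemm call. Below is a minimal sketch using the C interface that ATLAS provides (cblas.h); the matrix size, test values and link line are illustrative, not taken from the deck:

/* dgemm-sketch.c - computes C = alpha*A*B + beta*C with one BLAS call.
 * Link against ATLAS, e.g.: gcc dgemm-sketch.c -lcblas -latlas -lm
 */
#include <stdio.h>
#include <cblas.h>

#define N 500

static double A[N * N], B[N * N], C[N * N];

int main(void)
{
    int i;
    for (i = 0; i < N * N; i++) {   /* fill A and B with test data */
        A[i] = 1.0;
        B[i] = 2.0;
    }

    /* one optimized call replaces the hand-written triple loop */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,            /* M, N, K       */
                1.0, A, N,          /* alpha, A, lda */
                B, N,               /* B, ldb        */
                0.0, C, N);         /* beta, C, ldc  */

    printf("C[0] = %f (expect %f)\n", C[0], 2.0 * N);
    return 0;
}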
27 ScaLAPACK
- Scalable LAPACK
- PBLAS + BLACS
  - PBLAS: Parallel Basic Linear Algebra Subprograms
  - BLACS: Basic Linear Algebra Communication Subprograms
- Supports MPI and PVM
  - Only the MPI version can be found on our cluster
- Directories for the BLAS, BLACS and ScaLAPACK libraries
  - /u1/local/ATLAS/lib/Linux_P4SSE2_2/
  - /u1/local/BLACS/LIBS
  - /u1/local/SCALAPACK/libscalapack.a
- PBLAS and ScaLAPACK examples (pblas.tgz, scaex.tgz) are stored in /u1/local/share/example/scalapack
- Further information on ScaLAPACK can be found at http://www.netlib.org/scalapack/scalapack_home.html
- Please ask Morris for further information.
28 MPITB for MATLAB
- MPITB example
  - MC.tar.gz in /u1/local/share/example/mpitb
- Untar the example in your home directory
  - tar xzvf /u1/local/share/example/mpitb/MC.tar.gz
- lamboot first, then start MATLAB and run
  - qsub runMCpbs.bat
- Further information can be found at
  - http://www.sci.hkbu.edu.hk/smlam/tdgc/MPITB
- Thanks to Tammy Lam, MATH for providing the above homepage and examples
29 NAMD2
- A parallel, object-oriented molecular dynamics code
- High-performance simulation of large biomolecular systems
- Binary downloaded and installed in /u1/local/namd2/
- Works with the VMD frontend (GUI)
- Demonstration of VMD and NAMD using alanin.zip in /u1/local/example/namd2
- Further information can be found at http://www.ks.uiuc.edu/Research/namd
- Ask Morris Law for further information
31 GAMESS
- The General Atomic and Molecular Electronic Structure System
- A general ab initio quantum chemistry package
- Thanks to Justin Lau, CHEM for providing sample scripts and an explanation of the chemistry behind them.
32 PGI compilers support 3 types of parallel programming
- Automatic shared-memory parallel
  - Used in SMP within the same node (max NCPUS=2)
  - Using the option -Mconcur in pgcc, pgCC, pgf77, pgf90
    - pgcc -o pgprime -Mconcur prime.c
    - export NCPUS=2
    - ./pgprime
- User-directed shared-memory parallel
  - Used in SMP within the same node (max NCPUS=2)
  - Using the option -mp in pgcc, pgCC, pgf77, pgf90
    - pgf90 -o f90prime -mp prime.f90
    - export NCPUS=2
    - ./f90prime
  - Users should understand OpenMP parallelization directives for Fortran and pragmas for C and C++ (a minimal C sketch follows this slide)
- Consult the PGI Workstation user guide for details
  - http://www.pgroup.com/ppro_docs/pgiws_ug/pgiug_.htm
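- Below is a minimal sketch of the kind of loop the -mp option parallelizes; the OpenMP pragma is standard, while the file name and array size are illustrative:

/* omp-sketch.c - user-directed shared-memory parallelism with OpenMP.
 * Build and run on one node: pgcc -o omp-sketch -mp omp-sketch.c
 *                            export NCPUS=2 ; ./omp-sketch
 */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;
    int i;

    /* the iteration space is split among threads; the reduction
       clause safely combines each thread's private partial sum */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++) {
        a[i] = (double)i;
        sum += a[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}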
33 PGI compilers support 3 types of parallel programming (cont.)
- Data-parallel shared- or distributed-memory parallel
  - Only HPF is supported
  - Suitable in SMP and cluster environments
    - pghpf -o hello hello.hpf
    - ./hello -pghpf -np 8 -stat alls
- PGHPF environment variables
  - PGHPF_RSH=ssh
  - PGHPF_HOST=cp0-1,cp0-2,...
  - PGHPF_STAT=alls (can be cpu, mem, all, etc.)
  - PGHPF_NP (max no. = 16, license limit)
- Example files in /u1/local/share/example/hpf
  - hello.tar.gz, pde1.tar.gz
- Consult the PGHPF user guide at http://www.pgroup.com/ppro_docs/pghpf_ug/hpfug.htm
34 Other software under consideration
- PGAPACK
  - Parallel Genetic Algorithm Package
  - /u1/local/pga
- PETSc
  - The Portable, Extensible Toolkit for Scientific Computation
- Any suggestions?
35 End of Part 2
36 What is Message Passing Interface (MPI)?
- Portable standard for communication
- Processes can communicate through messages.
- Each process is a separate program
- All data is private
37 What is Message Passing Interface (MPI)?
- This is a library, not a language!!
- Different compilers can be used, but all must use the same library, e.g. MPICH, LAM, etc.
- There are two versions now, MPI-1 and MPI-2
- Use a standard sequential language: Fortran, C, etc.
38 Basic Idea of Message Passing Interface (MPI)
- MPI environment - initialize, manage, and terminate communication among processes
- Communication between processes
  - Global communication, e.g. broadcast, gather, etc. (see the broadcast sketch after this list)
  - Point-to-point communication, e.g. send, receive, etc.
- Complicated data structures, e.g. matrices and memory
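- To make global communication concrete, here is a minimal broadcast sketch using only standard MPI-1 calls; the array contents and root rank are arbitrary illustrative choices:

/* bcast-sketch.c - the root fills in data, one MPI_Bcast shares it.
 * Compile and run: mpicc -o bcast-sketch bcast-sketch.c
 *                  mpirun -np 4 bcast-sketch
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double params[3];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* only the root sets the values */
        params[0] = 1.0; params[1] = 2.0; params[2] = 3.0;
    }

    /* every process makes the same call; afterwards
       all of them hold the root's values */
    MPI_Bcast(params, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d got: %.1f %.1f %.1f\n",
           rank, params[0], params[1], params[2]);

    MPI_Finalize();
    return 0;
}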
39 Is MPI Large or Small?
- MPI is large
  - More than one hundred functions
  - But that is not necessarily a measure of complexity
- MPI is small
  - Many parallel programs can be written with just 6 basic functions (see the sketch after this list)
- MPI is just right
  - One can access flexibility when it is required
  - One need not master all MPI functions
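- The 6 basic functions are MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv and MPI_Finalize. A minimal sketch that uses all six and nothing else (the message value is arbitrary):

/* six-sketch.c - a complete MPI program using only the 6 basic calls. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, token, i, tag = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);                 /* 1: start MPI           */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* 2: how many processes  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* 3: which one am I      */

    if (rank == 0) {                        /* root sends to everyone */
        token = 42;
        for (i = 1; i < size; i++)
            MPI_Send(&token, 1, MPI_INT, i, tag, MPI_COMM_WORLD);  /* 4 */
    } else {
        MPI_Recv(&token, 1, MPI_INT, 0, tag,                       /* 5 */
                 MPI_COMM_WORLD, &status);
        printf("rank %d of %d received %d\n", rank, size, token);
    }

    MPI_Finalize();                         /* 6: shut down MPI       */
    return 0;
}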
40 When to Use MPI?
- You need a portable parallel program
- You are writing a parallel library
- You care about performance
- You have a problem that can be solved in parallel
41 F77/F90 and C/C++ MPI library calls
- Fortran 77/90 uses subroutines
  - CALL is used to invoke the library call
  - Nothing is returned; the error code variable is the last argument
  - All variables are passed by reference
- C/C++ uses functions
  - Just the name is used to invoke the library call
  - The function returns an integer value (an error code)
  - Variables are passed by value, unless otherwise specified
42 Getting started with LAM
- Create a file called lamhosts
- The content of lamhosts (8 nodes):
  - cp0-1
  - cp0-2
  - cp0-3
  - ...
  - cp0-8
  - frontend-0
43 Getting started with LAM
- lamboot starts LAM on the specified cluster
  - LAMRSH=ssh
  - export LAMRSH
  - lamboot -v lamhosts
- lamhalt removes all traces of the LAM session on the network
  - lamhalt
- In the case of a catastrophic failure (e.g. one or more LAM nodes crash), the lamhalt utility will hang; use wipe instead
  - LAMRSH=ssh
  - export LAMRSH
  - wipe -v lamhosts
44 Getting started with MPICH
- Open the .bashrc under your home directory
- Add the paths at the end of the file
  - PATH=/u1/local/mpich-1.2.5/bin:/u1/local/pgi/linux86/bin:$PATH
- Save and exit
- Restart the terminal
45 MPI Commands
- mpicc - compiles an MPI program
  - mpicc -o foo foo.c
  - mpif77 -o foo foo.f
  - mpif90 -o foo foo.f90
- mpirun - starts the execution of MPI programs
  - mpirun -v -np 2 foo
46 MPI Environment
- Initialize - initialize environment
- Finalize - terminate environment
- Communicator - create default communication group for all processes
- Version - establish version of MPI
- Total processes - spawn total processes
- Rank/Process ID - assign identifier to each process
- Timing functions - MPI_Wtime, MPI_Wtick
47 MPI_INIT
- Initializes the MPI environment
- Assigns all spawned processes to MPI_COMM_WORLD, the default communicator
- C:
  - int MPI_Init(int *argc, char ***argv)
  - INPUT PARAMETERS
    - argc - pointer to the number of arguments
    - argv - pointer to the argument vector
- Fortran:
  - CALL MPI_INIT(error_code)
  - INTEGER error_code - variable that gets set to an error code
48 MPI_FINALIZE
- Terminates the MPI environment
- C:
  - int MPI_Finalize()
- Fortran:
  - CALL MPI_FINALIZE(error_code)
  - INTEGER error_code - variable that gets set to an error code
49 Hello World 1 (C)
- #include <stdio.h>
- #include <mpi.h>
- int main(int argc, char **argv)
- {
-   MPI_Init(&argc, &argv);
-   printf("Hello world!\n");
-   MPI_Finalize();
-   return(0);
- }
50 Hello World 1 (Fortran)
- program main
- include 'mpif.h'
- integer ierr
- call MPI_INIT(ierr)
- print *, 'Hello world!'
- call MPI_FINALIZE(ierr)
- end
51 MPI_COMM_SIZE
- Finds the number of processes in a communication group
- C:
  - int MPI_Comm_size(MPI_Comm comm, int *size)
  - INPUT PARAMETER
    - comm - communicator (handle)
  - OUTPUT PARAMETER
    - size - number of processes in the group of comm (integer)
- Fortran:
  - CALL MPI_COMM_SIZE(comm, size, error_code)
  - INTEGER error_code - variable that gets set to an error code
- Using MPI_COMM_WORLD as comm will return the total number of processes started
52 MPI_COMM_RANK
- Gives the rank/identification number of a process in a communication group
- C:
  - int MPI_Comm_rank(MPI_Comm comm, int *rank)
  - INPUT PARAMETER
    - comm - communicator (handle)
  - OUTPUT PARAMETER
    - rank - rank/id number of the process that made the call (integer)
- Fortran:
  - CALL MPI_COMM_RANK(comm, rank, error_code)
  - INTEGER error_code - variable that gets set to an error code
- Using MPI_COMM_WORLD as comm will return the rank of the process in relation to all processes that were started
53 Hello World 2 (C)
- #include <stdio.h>
- #include <mpi.h>
- int main(int argc, char **argv)
- {
-   int rank, size;
-   MPI_Init(&argc, &argv);
-   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
-   MPI_Comm_size(MPI_COMM_WORLD, &size);
-   printf("Hello world! I am %d of %d\n", rank, size);
-   MPI_Finalize();
-   return(0);
- }
54 Hello World 2 (Fortran)
- program main
- include 'mpif.h'
- integer rank, size, ierr
- call MPI_INIT(ierr)
- call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
- call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
- print *, 'Hello world! I am ', rank, ' of ', size
- call MPI_FINALIZE(ierr)
- end
55 MPI_Send
- Standard send
- C:
  - int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
  - INPUT PARAMETERS
    - buf - initial address of send buffer (choice)
    - count - number of elements in send buffer (nonnegative integer)
    - datatype - datatype of each send buffer element (handle)
    - dest - rank of destination (integer)
    - tag - message tag (integer)
    - comm - communicator (handle)
  - NOTES
    - This routine may block until the message is received.
56 MPI_Send
- Fortran:
  - MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERR)
57 MPI_Recv
- Basic receive
- C:
  - int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
  - OUTPUT PARAMETERS
    - buf - initial address of receive buffer (choice)
    - status - status object (Status)
  - INPUT PARAMETERS
    - count - maximum number of elements in receive buffer (integer)
    - datatype - datatype of each receive buffer element (handle)
    - source - rank of source (integer)
    - tag - message tag (integer)
    - comm - communicator (handle)
58 MPI_Recv
- Fortran:
  - MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERR)
59 Timing Functions
- MPI_Wtime() - returns a floating point number of seconds, representing elapsed wall-clock time.
- MPI_Wtick() - returns a double precision number of seconds between successive clock ticks.
- The times returned are local to the node/processor that made the call.
60 Hello World 3 (C)
- /* The root node sends out a message to the next node in the ring and each
-    node then passes the message along to the next node. The root node times
-    how long it takes for the message to get back to it. */
- #include <stdio.h>            /* for input/output */
- #include <mpi.h>              /* for mpi routines */
- #define BUFSIZE 64            /* the size of the message being passed */
- main(int argc, char **argv)
- {
-   double start, finish;
-   int my_rank;                /* the rank of this process */
-   int n_processes;            /* the total number of processes */
-   char buf[BUFSIZE];          /* a buffer for the message */
-   int tag = 0;                /* not important here */
-   float totalTime = 0;        /* for timing information */
-   MPI_Status status;          /* not important here */
-   MPI_Init(&argc, &argv);
-   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
-   MPI_Comm_size(MPI_COMM_WORLD, &n_processes);
61 Hello World 3 (C)
- /* If this process is the root process, send a message to the next node
-    and wait to receive one from the last node. Time how long it takes
-    for the message to get around the ring. If this process is not the
-    root node, wait to receive a message from the previous node and
-    then send it to the next node. */
-   start = MPI_Wtime();
-   printf("Hello world! I am %d of %d\n", my_rank, n_processes);
-   if (my_rank == 0)
-   {
-     /* send to the next node */
-     MPI_Send(buf, BUFSIZE, MPI_CHAR, my_rank+1, tag, MPI_COMM_WORLD);
-     /* receive from the last node */
-     MPI_Recv(buf, BUFSIZE, MPI_CHAR, n_processes-1, tag, MPI_COMM_WORLD,
-              &status);
-   }
62 Hello World 3 (C)
-   if (my_rank != 0)
-   {
-     /* receive from the previous node */
-     MPI_Recv(buf, BUFSIZE, MPI_CHAR, my_rank-1, tag, MPI_COMM_WORLD,
-              &status);
-     /* send to the next node */
-     MPI_Send(buf, BUFSIZE, MPI_CHAR, (my_rank+1)%n_processes, tag,
-              MPI_COMM_WORLD);
-   }
-   finish = MPI_Wtime();
-   MPI_Finalize();             /* I'm done with mpi stuff */
-   /* Print out the results. */
-   if (my_rank == 0)
-     printf("Total time used was %f seconds\n", finish-start);
-   return 0;
- }
63 Hello World 3 (Fortran)
- C The root node sends out a message to the next node in the ring and each
- C node then passes the message along to the next node. The root node times
- C how long it takes for the message to get back to it.
- program main
- include 'mpif.h'
- double precision start, finish
- double precision buf(64)
- integer my_rank
- integer n_processes
- integer ierr
- integer status(MPI_STATUS_SIZE)
- integer BUFSIZE, tag
- parameter (BUFSIZE = 64, tag = 2001)
- call MPI_INIT(ierr)
- call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
- call MPI_COMM_SIZE(MPI_COMM_WORLD, n_processes, ierr)
64 Hello World 3 (Fortran)
- C If this process is the root process, send a message to the next node
- C and wait to receive one from the last node. Time how long it takes
- C for the message to get around the ring. If this process is not the
- C root node, wait to receive a message from the previous node and
- C then send it to the next node.
- start = MPI_Wtime()
- print *, 'Hello world! I am ', my_rank, ' of ', n_processes
- if (my_rank .eq. 0) then
- C send to the next node
- call MPI_SEND(buf, BUFSIZE, MPI_DOUBLE_PRECISION, my_rank+1,
- 1 tag, MPI_COMM_WORLD, ierr)
- C receive from the last node
- call MPI_RECV(buf, BUFSIZE, MPI_DOUBLE_PRECISION, n_processes-1,
- 1 tag, MPI_COMM_WORLD, status, ierr)
- else
65 Hello World 3 (Fortran)
- C receive from the previous node
- call MPI_RECV(buf, BUFSIZE, MPI_DOUBLE_PRECISION, my_rank-1, tag,
- 1 MPI_COMM_WORLD, status, ierr)
- C send to the next node
- call MPI_SEND(buf, BUFSIZE, MPI_DOUBLE_PRECISION,
- 1 modulo(my_rank+1, n_processes), tag, MPI_COMM_WORLD, ierr)
- endif
- finish = MPI_Wtime()
- C Print out the results.
- if (my_rank .eq. 0) then
- print *, 'Total time used was ', finish-start, ' seconds'
- endif
- call MPI_FINALIZE(ierr)
- end
66 MPI C Datatypes
MPI Datatype C Datatype
MPI_CHAR signed char
MPI_SHORT signed short int
MPI_INT signed int
MPI_LONG signed long int
MPI_UNSIGNED_CHAR unsigned char
MPI_UNSIGNED_SHORT unsigned short int
67 MPI C Datatypes
MPI Datatype C Datatype
MPI_UNSIGNED unsigned int
MPI_UNSIGNED_LONG unsigned long int
MPI_FLOAT float
MPI_DOUBLE double
MPI_LONG_DOUBLE long double
MPI_BYTE
MPI_PACKED
68 MPI Fortran Datatypes
MPI Datatype Fortran Datatype
MPI_INTEGER INTEGER
MPI_REAL REAL
MPI_DOUBLE_PRECISION DOUBLE PRECISION
MPI_COMPLEX COMPLEX
MPI_LOGICAL LOGICAL
MPI_CHARACTER CHARACTER
MPI_BYTE
MPI_PACKED
69 End of Part 3
70 Thanks very much!