Transcript and Presenter's Notes

Title: Cluster Workshop


1
Cluster Workshop
For COMP RPG students, 17 May 2010
High Performance Cluster Computing Centre (HPCCC)
Faculty of Science, Hong Kong Baptist University
2
Outline
  • Overview of Cluster Hardware and Software
  • Basic Login and Running Programs in a Job
    Queuing System
  • Introduction to Parallelism
  • Why Parallelism
  • Cluster Parallelism
  • OpenMP
  • Message Passing Interface (MPI)
  • Parallel Program Examples
  • Policy for using sciblade.sci.hkbu.edu.hk
  • http://www.sci.hkbu.edu.hk/hpccc/sciblade

3
Overview of Cluster Hardware and Software
4
Cluster Hardware
  • This 256-node PC cluster (sciblade) consists of:
  • Master node x 2
  • IO nodes x 3 (storage)
  • Compute nodes x 256
  • Blade chassis x 16
  • Management network
  • Interconnect fabric
  • 1U console and KVM switch
  • Emerson Liebert Nxa 120kVA UPS

5
Sciblade Cluster
A 256-node cluster supported by funding from the RGC
6
Hardware Configuration
  • Master Node
  • Dell PE1950, 2x Xeon E5450 3.0GHz (Quad Core)
  • 16GB RAM, 73GB x 2 SAS drives
  • IO nodes (Storage)
  • Dell PE2950, 2x Xeon E5450 3.0GHz (Quad Core)
  • 16GB RAM, 73GB x 2 SAS drives
  • 3TB storage (Dell PE MD3000)
  • Compute nodes x 256, each with:
  • Dell PE M600 blade server with InfiniBand network
  • 2x Xeon E5430 2.66GHz (Quad Core)
  • 16GB RAM, 73GB SAS drive

7
Hardware Configuration
  • Blade Chassis x 16
  • Dell PE M1000e
  • Each hosts 16 blade servers
  • Management Network
  • Dell PowerConnect 6248 (Gigabit Ethernet) x 6
  • Interconnect fabric
  • Qlogic SilverStorm 9120 switch
  • Console and KVM switch
  • Dell AS-180 KVM
  • Dell 17FP Rack console
  • Emerson Liebert Nxa 120kVA UPS

8
Software List
  • Operating System
  • ROCKS 5.1 Cluster OS
  • CentOS 5.3 kernel 2.6.18
  • Job Management System
  • Portable Batch System
  • MAUI scheduler
  • Compilers, Languages
  • Intel Fortran/C/C++ Compiler for Linux V11
  • GNU 4.1.2/4.4.0 Fortran/C/C++ Compilers

9
Software List
  • Message Passing Interface (MPI) Libraries
  • MVAPICH 1.1
  • MVAPICH2 1.2
  • OPEN MPI 1.3.2
  • Mathematical libraries
  • ATLAS 3.8.3
  • FFTW 2.1.5/3.2.1
  • SPRNG 2.0a (C/Fortran) / 4.0 (C++/Fortran)

10
Software List
  • Molecular Dynamics and Quantum Chemistry
  • Gromacs 4.0.7
  • Gamess 2009R1
  • Gaussian 03
  • Namd 2.7b1
  • Third-party Applications
  • FDTD simulation
  • MATLAB 2008b
  • TAU 2.18.2, VisIt 1.11.2
  • Xmgrace 5.1.22
  • etc.

11
Software List
  • Queuing system
  • Torque/PBS
  • Maui scheduler
  • Editors
  • vi
  • emacs

12
Hostnames
  • Master node
  • External: sciblade.sci.hkbu.edu.hk
  • Internal: frontend-0
  • IO nodes (storage)
  • pvfs2-io-0-0, pvfs2-io-0-1, pvfs2-io-0-2
  • Compute nodes
  • compute-0-0.local, ..., compute-0-255.local

13
Basic Login and Running Programs in a Job Queuing
System
14
Basic login
  • Remote login to the master node
  • Terminal login
  • using secure shell
  • ssh -l username sciblade.sci.hkbu.edu.hk
  • Graphical login
  • PuTTY + vncviewer, e.g.
  • [username@sciblade]$ vncserver
  • New 'sciblade.sci.hkbu.edu.hk:3 (username)'
    desktop is sciblade.sci.hkbu.edu.hk:3
  • It means that your session will run on display 3.

15
Graphical login
  • Using PuTTY to set up a secured connection: Host
    Name: sciblade.sci.hkbu.edu.hk

16
Graphical login (cont)
  • ssh protocol version

17
Graphical login (cont)
  • Port = 5900 + display number (i.e. 5903 for
    display 3 in this case)


18
Graphical login (cont)
  • Next, click Open, and login to sciblade
  • Finally, run VNC Viewer on your PC, and enter
    "localhost:3" (3 is the display number)
  • You should terminate your VNC session after you
    have finished your work. To terminate the VNC
    session running on sciblade, run the command
  • [username@sciblade]$ vncserver -kill :3

19
Linux commands
  • Both master and compute nodes are installed with
    Linux
  • Frequently used Linux commands in the PC cluster:
    http://www.sci.hkbu.edu.hk/hpccc/sciblade/faq_sciblade.php

Command   Example                   Description
cp        cp f1 f2 dir1             copy files f1 and f2 into directory dir1
mv        mv f1 dir1                move/rename file f1 into dir1
tar       tar xzvf abc.tar.gz       uncompress and untar a tar.gz format file
tar       tar czvf abc.tar.gz abc   create an archive file with gzip compression
cat       cat f1 f2                 print the contents of files f1 and f2
diff      diff f1 f2                compare the text of two files
grep      grep student *            search all files for the word "student"
history   history 50                list the last 50 commands stored in the shell
kill      kill -9 2036              terminate the process with pid 2036
man       man tar                   display the on-line manual page for tar
nohup     nohup runmatlab a         run matlab (a.m) without hangup after logout
ps        ps -ef                    list all processes running on the system
sort      sort -r -n studno         sort studno in reverse numerical order
20
ROCKS specific commands
  • ROCKS provides the following commands for users
    to run programs on all compute nodes, e.g.
  • cluster-fork
  • run a program on all compute nodes
  • cluster-fork ps
  • check user processes on each compute node
  • cluster-kill
  • kill user processes on all nodes at once
  • tentakel
  • similar to cluster-fork but runs faster

21
Ganglia
  • Web-based management and monitoring
  • http://sciblade.sci.hkbu.edu.hk/ganglia

22
Why Parallelism
23
Why Parallelism: Passively
  • Suppose you are using the most efficient
    algorithm with an optimal implementation, but the
    program still takes too long or does not even fit
    into your machine's memory.
  • Parallelization is then the last resort.

24
Why Parallelism: Actively
  • Faster
  • Finish the work earlier
  • Same work in shorter time
  • Do more work
  • More work in the same time
  • Most importantly, you may want to predict the
    result before the event occurs (e.g. a weather
    forecast is only useful ahead of the weather)

25
Examples
  • Many scientific and engineering problems require
    enormous computational power.
  • The following are a few such fields:
  • Quantum chemistry, statistical mechanics, and
    relativistic physics
  • Cosmology and astrophysics
  • Computational fluid dynamics and turbulence
  • Material design and superconductivity
  • Biology, pharmacology, genome sequencing, genetic
    engineering, protein folding, enzyme activity,
    and cell modeling
  • Medicine, and modeling of human organs and bones
  • Global weather and environmental modeling
  • Machine Vision

26
Parallelism
  • The upper bound for the computing power that can
    be obtained from a single processor is limited by
    the fastest processor available at any given
    time.
  • The upper bound for the computing power available
    can be dramatically increased by integrating a
    set of processors together.
  • Synchronization and exchange of partial results
    among processors are therefore unavoidable.

27
Multiprocessing and Clustering
Parallel Computer Architecture:
Distributed memory: cluster
Shared memory: symmetric multiprocessors (SMP)
28
Clustering Pros and Cons
  • Advantages
  • Memory is scalable with the number of processors:
    increasing the number of processors increases the
    total memory size and bandwidth as well.
  • Each processor can rapidly access its own memory
    without interference
  • Disadvantages
  • Difficult to map existing data structures to this
    memory organization
  • The user is responsible for sending and receiving
    data among processors

29
TOP500 Supercomputer Sites (www.top500.org)
30
Cluster Parallelism
31
Parallel Programming Paradigm
  • Multithreading
  • OpenMP (shared memory only)
  • Message Passing
  • MPI (Message Passing Interface) (shared or
    distributed memory)
  • PVM (Parallel Virtual Machine) (shared or
    distributed memory)
32
Distributed Memory
  • Programmer's view
  • Several CPUs
  • Several blocks of memory
  • Several threads of action
  • Parallelization
  • Done by hand
  • Example
  • MPI

33
Message Passing Model
Message Passing: the method by which data from one
processor's memory is copied to the memory of
another processor.
Process: a set of executable instructions (a
program) which runs on a processor. Message
passing systems generally associate only one
process per processor, and the terms "process"
and "processor" are then used interchangeably.
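
As a concrete illustration, here is a minimal sketch of this model (it works with any of the MPI implementations listed earlier; the value 42 and the message tag 0 are arbitrary). It copies one integer from the memory of process 0 to the memory of process 1; run it with at least two processes, e.g. mpirun -np 2 ./a.out

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
  int rank, value = 0;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    value = 42;                /* data in process 0's memory */
    MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    /* the data is copied into process 1's memory */
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    printf("Process 1 received %d\n", value);
  }
  MPI_Finalize();
  return 0;
}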
34
OpenMP
35
OpenMP Mission
  • The OpenMP Application Program Interface (API)
    supports multi-platform shared-memory parallel
    programming in C/C++ and Fortran on all
    architectures, including Unix platforms and
    Windows NT platforms.
  • Jointly defined by a group of major computer
    hardware and software vendors.
  • OpenMP is a portable, scalable model that gives
    shared-memory parallel programmers a simple and
    flexible interface for developing parallel
    applications for platforms ranging from the
    desktop to the supercomputer.

36
OpenMP compiler choice
  • gcc 4.4.0 or above
  • compile with -fopenmp
  • Intel 10.1 or above
  • compile with -Qopenmp on Windows
  • compile with -openmp on Linux
  • PGI compiler
  • compile with -mp
  • Absoft Pro Fortran
  • compile with -openmp

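For example, compiling and running an OpenMP program (such as the one on the next slide) with gcc looks like the following; the file name omp-hello.c is only illustrative:

gcc -fopenmp -o omp-hello omp-hello.c
export OMP_NUM_THREADS=8
./omp-hello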
37
Sample OpenMP example

#include <omp.h>
#include <stdio.h>
int main()
{
  #pragma omp parallel
  printf("Hello from thread %d, nthreads %d\n",
         omp_get_thread_num(), omp_get_num_threads());
  return 0;
}

38
serial-pi.c
#include <stdio.h>
static long num_steps = 10000000;
double step;
int main()
{
  int i; double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  for (i = 0; i < num_steps; i++) {
    x = (i + 0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Est Pi = %f\n", pi);
  return 0;
}

39
OpenMP version of spmd-pi.c

#include <omp.h>
#include <stdio.h>
static long num_steps = 10000000;
double step;
#define NUM_THREADS 8
int main()
{
  int i, nthreads; double pi, sum[NUM_THREADS];
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  {
    int i, id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    if (id == 0) nthreads = nthrds;
    for (i = id, sum[id] = 0.0; i < num_steps; i += nthrds) {
      x = (i + 0.5)*step;
      sum[id] += 4.0/(1.0 + x*x);
    }
  }
  /* the slide is truncated here; the standard final reduction is: */
  for (i = 0, pi = 0.0; i < nthreads; i++) pi += sum[i] * step;
  printf("Est Pi = %f\n", pi);
  return 0;
}

40
Message Passing Interface (MPI)
41
MPI
  • MPI is a library, not a language, for parallel
    programming
  • An MPI implementation consists of
  • a subroutine library with all MPI functions
  • include files for the calling application program
  • some startup script (usually called mpirun, but
    not standardized)
  • Include the header file mpi.h (or its equivalent)
    in the source code
  • Libraries are available for all major imperative
    languages (C, C++, Fortran ...)

42
General MPI Program Structure
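(The structure diagram on this slide is not reproduced in this transcript. In outline, every MPI program follows the same pattern; a minimal skeleton in C, for orientation only:)

#include "mpi.h"

int main(int argc, char **argv)
{
  /* serial code */
  MPI_Init(&argc, &argv);    /* begin the MPI section */
  /* ... parallel work and message passing ... */
  MPI_Finalize();            /* end the MPI section */
  /* serial code */
  return 0;
}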
43
Sample Program Hello World!
  • In this modified version of the "Hello World"
    program, each process prints its rank as well
    as the total number of processes in the
    communicator MPI_COMM_WORLD.
  • Notes
  • Makes use of the pre-defined communicator
    MPI_COMM_WORLD.
  • Not testing for error status of routines!

44
Sample Program Hello World!
#include <stdio.h>
#include "mpi.h"                      // MPI compiler header file

int main(int argc, char **argv)
{
  int nproc, myrank, ierr;
  ierr = MPI_Init(&argc, &argv);      // MPI initialization
  // Get number of MPI processes
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  // Get process id for this processor
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  printf("Hello World!! I'm process %d of %d\n", myrank, nproc);
  ierr = MPI_Finalize();              // Terminate all MPI processes
  return 0;
}
45
Performance
  • When we write a parallel program, it is important
    to identify the fraction of the program that can
    be parallelized and to maximize it.
  • The goals are
  • load balance
  • memory usage balance
  • minimize communication overhead
  • reduce sequential bottlenecks
  • scalability

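A standard way to quantify this parallel fraction (not on the original slide) is Amdahl's law: if a fraction p of the run time can be parallelized, the speedup on n processors is at most 1 / ((1 - p) + p/n). For example, with p = 0.9 on n = 8 processors the speedup is bounded by about 4.7, no matter how fast the communication is.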
46
Compiling Running MPI Programs
  • Using mvapich 1.1
  • Set the path; at the command prompt, type:
  • export PATH=/u1/local/mvapich1/bin:$PATH
  • (uncomment this line in .bashrc)
  • Compile using mpicc, mpiCC, mpif77 or mpif90,
    e.g.
  • mpicc -o cpi cpi.c
  • Prepare a hostfile (e.g. machines) listing the
    compute nodes:
  • compute-0-0
  • compute-0-1
  • compute-0-2
  • compute-0-3
  • Run the program on a number of processor nodes:
  • mpirun -np 4 -machinefile machines ./cpi

47
Compiling Running MPI Programs
  • Using mvapich2 1.2
  • Prepare .mpd.conf and .mpd.passwd and save them
    in your home directory:
  • MPD_SECRETWORD=gde1234-3
  • (you may set your own secret word)
  • Set the environment for mvapich2 1.2:
  • export MPD_BIN=/u1/local/mvapich2
  • export PATH=$MPD_BIN:$PATH
  • (uncomment these lines in .bashrc)
  • Compile using mpicc, mpiCC, mpif77 or mpif90,
    e.g.
  • mpicc -o cpi cpi.c
  • Prepare a hostfile (e.g. machines) with one
    hostname per line, as in the previous section

48
Compiling Running MPI Programs
  • mpdboot with the hostfile:
  • mpdboot -n 4 -f machines
  • Run the program on a number of processor nodes:
  • mpiexec -np 4 ./cpi
  • Remember to clean up after running jobs with
    mpdallexit:
  • mpdallexit

49
Compiling Running MPI Programs
  • Using openmpi 1.2
  • Set the environment for openmpi:
  • export LD_LIBRARY_PATH=/u1/local/openmpi/lib:$LD_LIBRARY_PATH
  • export PATH=/u1/local/openmpi/bin:$PATH
  • (uncomment these lines in .bashrc)
  • Compile using mpicc, mpiCC, mpif77 or mpif90,
    e.g.
  • mpicc -o cpi cpi.c
  • Prepare a hostfile (e.g. machines) with one
    hostname per line, as in the previous section
  • Run the program on a number of processor nodes:
  • mpirun -np 4 -machinefile machines ./cpi

50
Submit parallel jobs into torque batch queue
  • Prepare a job script, say omp.pbs, like the
    following:

#!/bin/sh
# Job name
#PBS -N OMP-spmd
# Declare job non-rerunable
#PBS -r n
# Mail to user
#PBS -m ae
# Queue name (small, medium, long, verylong)
# Number of nodes
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:08:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
./omp-test
./serial-pi
./omp-spmd-pi

  • Submit it using qsub:
  • qsub omp.pbs

51
Another example of pbs scripts
  • Prepare a job script, say scripts.sh, like the
    following:

#!/bin/sh
# Job name
#PBS -N Sorting
# Declare job non-rerunable
#PBS -r n
# Number of nodes
#PBS -l nodes=4
#PBS -l walltime=08:00:00
# This job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
# Define number of processors
NPROCS=`wc -l < $PBS_NODEFILE`

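The script on this slide stops after computing NPROCS. A typical final step, not shown on the slide (the program name ./qsort here is hypothetical), would launch the MPI job with it:

mpirun -np $NPROCS -machinefile $PBS_NODEFILE ./qsort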
52
Parallel Program Examples
53
Example 1: Estimation of Pi (OpenMP)

#include <omp.h>
#include <stdio.h>
static long num_steps = 10000000;
double step;
#define NUM_THREADS 8
int main()
{
  int i, nthreads; double pi, sum[NUM_THREADS];
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  {
    int i, id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    if (id == 0) nthreads = nthrds;
    for (i = id, sum[id] = 0.0; i < num_steps; i += nthrds) {
      x = (i + 0.5)*step;
      sum[id] += 4.0/(1.0 + x*x);
    }
  }
  /* the slide is truncated here; the standard final reduction is: */
  for (i = 0, pi = 0.0; i < nthreads; i++) pi += sum[i] * step;
  printf("Est Pi = %f\n", pi);
  return 0;
}

54
Example 2a: Sorting (quick sort)
  • The quick sort is an in-place, divide-and-conquer,
    massively recursive sort.
  • The efficiency of the algorithm is mainly
    determined by which element is chosen as the
    pivot point.
  • The worst-case efficiency of the quick sort,
    O(n^2), occurs when the list is already sorted
    and the left-most element is chosen as the pivot.
  • If the data to be sorted isn't random, randomly
    choosing a pivot point is recommended. As long as
    the pivot point is chosen randomly, the quick
    sort has an average complexity of O(n log n).
  • Pros: extremely fast.
  • Cons: very complex algorithm, massively recursive
    (a minimal sketch follows below).
54
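A minimal serial sketch of the algorithm just described (the middle element is used as pivot for brevity; as the slide notes, a production version would choose the pivot randomly):

#include <stdio.h>

/* in-place quick sort of a[lo..hi] */
static void quick_sort(int a[], int lo, int hi)
{
  int i = lo, j = hi, tmp;
  int pivot = a[(lo + hi) / 2];        /* simple pivot choice */
  while (i <= j) {
    while (a[i] < pivot) i++;          /* scan for an out-of-place pair */
    while (a[j] > pivot) j--;
    if (i <= j) {                      /* swap the pair and advance */
      tmp = a[i]; a[i] = a[j]; a[j] = tmp;
      i++; j--;
    }
  }
  if (lo < j) quick_sort(a, lo, j);    /* recurse on both partitions */
  if (i < hi) quick_sort(a, i, hi);
}

int main(void)
{
  int i, a[] = {5, 2, 8, 1, 9, 3};
  quick_sort(a, 0, 5);
  for (i = 0; i < 6; i++) printf("%d ", a[i]);
  printf("\n");
  return 0;
}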
55
Quick Sort Performance
Processes Time
1 0.410000
2 0.300000
4 0.180000
8 0.180000
16 0.180000
32 0.220000
64 0.680000
128 1.300000
56
Example 2b: Sorting (bubble sort)
  • The bubble sort is the oldest and simplest sort
    in use. Unfortunately, it is also the slowest.
  • The bubble sort works by comparing each item in
    the list with the item next to it, and swapping
    them if required (a minimal sketch follows below).
  • The algorithm repeats this process until it makes
    a pass all the way through the list without
    swapping any items (in other words, all items are
    in the correct order).
  • This causes larger values to "bubble" to the end
    of the list while smaller values "sink" towards
    the beginning of the list.

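A minimal serial sketch of this procedure:

#include <stdio.h>

/* bubble sort: keep passing through the list until no swaps occur */
static void bubble_sort(int a[], int n)
{
  int i, tmp, swapped = 1;
  while (swapped) {
    swapped = 0;
    for (i = 0; i < n - 1; i++) {
      if (a[i] > a[i + 1]) {           /* neighbours out of order: swap */
        tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp;
        swapped = 1;
      }
    }
  }
}

int main(void)
{
  int i, a[] = {5, 2, 8, 1, 9, 3};
  bubble_sort(a, 6);
  for (i = 0; i < 6; i++) printf("%d ", a[i]);
  printf("\n");
  return 0;
}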
57
Bubble Sort Performance
Processes Time
1 3242.327
2 806.346
4 276.4646
8 78.45156
16 21.031
32 4.8478
64 2.03676
128 1.240197
58
Monte Carlo Integration
  • "Hit and miss" integration
  • The integration scheme is to take a large number
    of random points and count the number that are
    within f(x) to get the area

59
Monte Carlo Integration
  • Monte Carlo Integration to Estimate Pi

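A minimal serial sketch of the hit-and-miss estimate of Pi (sample random points in the unit square and count those inside the quarter circle; the number of trials is arbitrary):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  long i, hits = 0, trials = 10000000;
  double x, y;
  for (i = 0; i < trials; i++) {
    x = (double) rand() / RAND_MAX;    /* random point in the unit square */
    y = (double) rand() / RAND_MAX;
    if (x*x + y*y <= 1.0)              /* inside the quarter circle? */
      hits++;
  }
  /* area of the quarter circle is pi/4, so pi = 4 * hits/trials */
  printf("Est Pi = %f\n", 4.0 * (double) hits / trials);
  return 0;
}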
60
Example 1: omp
omp/test-omp.c, omp/serial-pi.c, omp/spmd-pi.c
Compile by the command: make
Run the program in parallel by: ./omp-spmd-pi
Submit the job to PBS by: qsub omp.pbs

Example 2: prime
prime/prime.c, prime/prime.f90,
prime/primeParallel.c, prime/Makefile,
prime/machines
Compile by the command: make
Run the serial program by: ./primeC or ./primeF
Run the parallel program by:
mpirun -np 4 -machinefile machines ./primeMPI

Example 3: sorting
sorting/qsort.c, sorting/bubblesort.c,
sorting/script.sh, sorting/qsort,
sorting/bubblesort
Submit to the PBS queuing system by: qsub script.sh

Example 4: pmatlab
pmatlab/startup.m, pmatlab/RUN.m,
pmatlab/sample-pi.m
Submit to PBS by: qsub Qpmatlab.pbs
61
Policy for using sciblade.sci.hkbu.edu.hk
62
Policy
  1. Every user shall apply for his/her own computer
    user account to login to the master node of the
    PC cluster, sciblade.sci.hkbu.edu.hk.
  2. Users must not share their account and password
    with any other user.
  3. Every user must deliver jobs to the PC cluster
    from the master node via the PBS job queuing
    system. Automatic dispatching of jobs using
    scripts or robots is not allowed.
  4. Users are not allowed to login to the compute
    nodes.
  5. Foreground jobs on the PC cluster are restricted
    to program testing, and their duration should
    not exceed 1 minute of CPU time per job.

63
Policy (continued)
  6. Any background jobs run on the master node or
    compute nodes are strictly prohibited and will
    be killed without prior notice.
  7. The current restrictions of the job queuing
    system are as follows:
  • The maximum number of running jobs in the job
    queue is 8.
  • The maximum total number of CPU cores in use at
    any one time cannot exceed 512.
  8. The restrictions in item 7 will be reviewed
    from time to time as the number of users and
    the computation needs grow.