Point Power CDROM . - PowerPoint PPT Presentation

1 / 101

About This Presentation

Title:

Point Power CDROM .

Description:

NASA Goddard Space Flight Center. NASA JPL, Caltech, academic and industrial collaborators ... Workstation clusters are a cheap and readily available alternative to ... – PowerPoint PPT presentation

Number of Views:150

Avg rating:3.0/5.0

Slides: 102

Provided by: guyte

Category:

more less

Transcript and Presenter's Notes

Title: Point Power CDROM .

1
???? ?????? ??????

????? ??' 10
24/12/2001

2
????? ??? ??' 3

???? ????? ?? ???? ?' ?- 27/12/2001

3
???????? ???

?????? 1-10 ??????? ????? ?? ?????? ???? ??????
???? ???????.
?? ?????? ?? ???? ?????? ?????? Point Power ????
?????? ?? ???? ?????? ?? CDROM ????.

4
??????

????? ?????? ?????? ?? ???? ?'.
??????? ??????? ?????? ???.

5
????? ??????

Todays topics
Shared Memory
Cilk, OpenMP
MPI Derived Data Types
How to Build a Beowulf

6
Shared Memory

Goto PDF presentation
Chapter 8 from Wilkinson Allans book.
Programming with Shared Memory

7
Summary

Process creation
The thread concept
Pthread routines
How data can be created as shared
Condition Variables
Dependency analysis Bernsteins conditions

8
Cilk

http//supertech.lcs.mit.edu/cilk

9
Cilk

A language for multithreaded parallel programming
based on ANSI C.
Cilk is designed for general-purpose parallel
programming language
Cilk is especially effective for exploiting
dynamic, highly asynchronous parallelism.

10
A serial C program to compute the nth Fibonacci
number.
11
A parallel Cilk program to compute the nth
Fibonacci number.
12
Cilk - continue

Compiling
cilk -O2 fib.cilk -o fib
Executing
fib --nproc 4 30

13
OpenMP
Next 5 slides taken from the SC99 tutorial Given
by Tim Mattson, Intel Corporation and Rudolf
Eigenmann, Purdue University
14
(No Transcript)
15
(No Transcript)
16
(No Transcript)
17
(No Transcript)
18
(No Transcript)
19
?????? ?????

High-Performance Computing
Part III
Shared Memory Parallel Processors

20
Back to MPI
21
Collective Communication
Broadcast
22
Collective Communication
Reduce
23
Collective Communication
Gather
24
Collective Communication
Allgather
25
Collective Communication
Scatter
26
Collective Communication
There are more collective communication commands
27
?????? ??????? ?- MPI

MPI Derived Data Types
MPI-2 Parallel I/O

28
User Defined Types

???? ?- types ???????? ????, ???? ?????? ?????
??????? ?????
Compact pack/unpack.

29
Predefined Types
MPI_DOUBLE double MPI_FLOAT float
MPI_INT signed int MPI_LONG signed long
int MPI_LONG_DOUBLE long double
MPI_LONG_LONG_INT signed long long int
MPI_SHORT signed short int MPI_UNSIGNED
unsigned int MPI_UNSIGNED_CHAR unsigned
char MPI_UNSIGNED_LONG unsigned long int
MPI_UNSIGNED_SHORT unsigned short int MPI_BYTE
30
Motivation

What if you want to specify
non-contiguous data of a single type?
contiguous data of mixed types?
non-contiguous data of mixed types?

Derived datatypes save memory, are faster, more
portable, and elegant.
31
3 Steps

Construct the new datatype using appropriate MPI
routinesMPI_Type_contiguous, MPI_Type_vector,
MPI_Type_struct, MPI_Type_indexed,
MPI_Type_hvector, MPI_Type_hindexed
Commit the new datatypeMPI_Type_commit
Use the new datatype in sends/receives, etc.Use

32
includeltmpi.hgt void main(int argc, char
argv) int rank MPI_status status
struct int x int y int z point
MPI_Datatype ptype MPI_Init(argc,argv)
MPI_Comm_rank(MPI_COMM_WORLD,rank)
MPI_Type_contiguous(3,MPI_INT,ptype)
MPI_Type_commit(ptype) if(rank3)
point.x15 point.y23 point.z6
MPI_Send(point,1,ptype,1,52,MPI_COMM_WORLD)
else if(rank1) MPI_Recv(point,1,ptype,3,5
2,MPI_COMM_WORLD,status) printf("Pd received
coords are (d,d,d) \n",rank,point.x,point.y,po
int.z) MPI_Finalize()
33
User Defined Types

MPI_TYPE_STRUCT
MPI_TYPE_CONTIGUOUS
MPI_TYPE_VECTOR
MPI_TYPE_HVECTOR
MPI_TYPE_INDEXED
MPI_TYPE_HINDEXED

34
MPI_TYPE_STRUCT
is the most general way to construct an MPI
derived type because it allows the length,
location, and type of each component to be
specified independently.
int MPI_Type_struct (int count, int
array_of_blocklengths, MPI_Aint
array_of_displacements, MPI_Datatype
array_of_types, MPI_Datatype newtype)
35
Struct Datatype Example
count 2 array_of_blocklengths0
1 array_of_types0 MPI_INT array_of_blocklength
s1 3 array_of_types1 MPI_DOUBLE
36
MPI_TYPE_CONTIGUOUS
is the simplest of these, describing a contiguous
sequence of values in memory. For
example, MPI_Type_contiguous(2,MPI_DOUBLE,MPI_2D_
POINT) MPI_Type_contiguous(3,MPI_DOUBLE,MPI_3D_P
OINT)
int MPI_Type_contiguous(int count, MPI_Datatype
oldtype, MPI_Datatype newtype)
37
MPI_TYPE_CONTIGUOUS
creates new type indicators MPI_2D_POINT and
MPI_3D_POINT. These type indicators allow you to
treat consecutive pairs of doubles as point
coordinates in a 2-dimensional space and
sequences of three doubles as point coordinates
in a 3-dimensional space.
38
MPI_TYPE_VECTOR
describes several such sequences evenly spaced
but not consecutive in memory.
MPI_TYPE_HVECTOR is similar to MPI_TYPE_VECTOR
except that the distance between successive
blocks is specified in bytes rather than elements.
MPI_TYPE_INDEXED describes sequences that may
vary both in length and in spacing.
39
MPI_TYPE_VECTOR
int MPI_Type_vector(int count, int blocklength,
int stride, MPI_Datatype oldtype, MPI_Datatype
newtype)
count 2, blocklength 3, stride 5
40
????? ??????
includeltmpi.hgt void main(int argc, char
argv) int rank,i,j MPI_status status
double x48 MPI_Datatype coltype
MPI_Init(argc, argv) MPI_Comm_rank(MPI_COMM_W
ORLD,rank) MPI_Type_vector(4,1,8,MPI_DOUBLE,col
type) MPI_Type_commit(coltype)
41
if(rank3) for(i0ilt4i)
for(j0jlt8j) xijpow(10.0,i1)j
MPI_Send(x07,1,coltype,1,52,MPI_COMM_WORLD)
else if(rank1) MPI_Recv(x02,1,colty
pe,3,52,MPI_COMM_WORLD,status)
for(i0ilt4i) printf("Pd my
xd21f\n",rank,i,xi2)
MPI_Finalize()
42
????
P1 my x0217.000000 P1 my
x12107.000000 P1 my x221007.000000
P1 my x3210007.000000
43
(No Transcript)
44
Committing a datatype
int MPI_Type_commit (MPI_Datatype datatype)
45
Obtaining Information About Derived Types

MPI_TYPE_LB and MPI_TYPE_UB can provide the lower
and upper bounds of the type.
MPI_TYPE_EXTENT can provide the extent of the
type. In most cases, this is the amount of memory
a value of the type will occupy.
MPI_TYPE_SIZE can provide the size of the type in
a message. If the type is scattered in memory,
this may be significantly smaller than the extent
of the type.

46
MPI_TYPE_EXTENT
MPI_Type_extent (MPI_Datatype datatype, MPI_Aint
extent)
Correction
Deprecated. Use MPI_Type_get_extent instead!
47
Ref Ian Fosters book DBPP
48
MPI-2
MPI-2 is a set of extensions to the MPI standard.
It was finalized by the MPI Forum in June, 1997.

49
MPI-2

New Datatype Manipulation Functions
Info Object
New Error Handlers
Establishing/Releasing Communications
Extended Collective Operations
Thread Support
Fault Tolerant

50
MPI-2 Parallel I/O

Motivation
The ability to parallelize I/O can offer
significant performance improvements.
User-level checkpointing is contained within the
program itself.

51
Parallel I/O

MPI-2 supports both blocking and nonblocking I/O
MPI-2 supports both collective and non-collective
I/O

52
Complementary Filetypes
53
Simple File Scatter/Gather - Problem
54
MPI-2 Parallel I/O

?????? ??????? ????? ??? ????? ?????? ?????
??????
MPI-2 file structure
Initializing MPI-2 File I/O
Defining a View
Data Access - Reading Data
Data Access - Writing Data
Closing MPI-2 file I/O

55
How to Build a Beowulf
56
What is a Beowulf?

A new strategy in High-Performance Computing
(HPC) that exploits mass-market technology to
overcome the oppressive costs in time and money
of supercomputing.

57
What is a Beowulf?

A Collection of personal computers
interconnected by widely available networking
technology running one of several open-source
Unix-like operating systems.

COTS Commodity-off-the-shelf components
Interconnection networks LAN/SAN

Price/Performance
59
How to Run Application Faster

There are 3 ways to improve performance
1. Work Harder
2. Work Smarter
3. Get Help
Computer Analogy
1. Use faster hardware e.g. reduce the time per
instruction (clock cycle).
2. Optimized algorithms and techniques
3. Multiple computers to solve problem That is,
increase no. of instructions executed per clock
cycle.

60
Motivation for using Clusters

The communications bandwidth between workstations
is increasing as new networking technologies and
protocols are implemented in LANs and WANs.
Workstation clusters are easier to integrate into
existing networks than special parallel computers.

61
Beowulf-class SystemsA New Paradigm for the
Business of Computing

Brings high end computing to broad ranged
problems
new markets
Order of magnitude Price-Performance advantage
Commodity enabled
no long development lead times
Low vulnerability to vendor-specific decisions
companies are ephemeral Beowulfs are forever
Rapid response technology tracking
Just-in-place user-driven configuration
requirement responsive
Industry-wide, non-proprietary software
environment

62
Beowulf Project - A Brief History

Started in late 1993
NASA Goddard Space Flight Center
NASA JPL, Caltech, academic and industrial
collaborators
Sponsored by NASA HPCC Program
Applications single user science station
data intensive
low cost
General focus
single user (dedicated) science and engineering
applications
system scalability
Ethernet drivers for Linux

63
Beowulf System at JPL (Hyglac)

16 Pentium Pro PCs, each with 2.5 Gbyte disk, 128
Mbyte memory, Fast Ethernet card.
Connected using 100Base-T network,
through a 16-way crossbar switch.

Theoretical peak performance 3.2 GFlop/s.
Achieved sustained performance 1.26 GFlop/s.

64
Cluster Computing - Research Projects (partial
list)

Beowulf (CalTech and NASA) - USA
Condor - Wisconsin State University, USA
HPVM -(High Performance Virtual Machine),UIUCnow
UCSB,US
MOSIX - Hebrew University of Jerusalem, Israel
MPI (MPI Forum, MPICH is one of the popular
implementations)
NOW (Network of Workstations) - Berkeley, USA
NIMROD - Monash University, Australia
NetSolve - University of Tennessee, USA
PBS (Portable Batch System) - NASA Ames and LLNL,
USA
PVM - Oak Ridge National Lab./UTK/Emory, USA

65
Motivation for using Clusters

Surveys show utilisation of CPU cycles of desktop
workstations is typically lt10.
Performance of workstations and PCs is rapidly
improving
As performance grows, percent utilisation will
decrease even further!
Organisations are reluctant to buy large
supercomputers, due to the large expense and
short useful life span.

66
Motivation for using Clusters

The development tools for workstations are more
mature than the contrasting proprietary solutions
for parallel computers - mainly due to the
non-standard nature of many parallel systems.
Workstation clusters are a cheap and readily
available alternative to specialised High
Performance Computing (HPC) platforms.
Use of clusters of workstations as a distributed
compute resource is very cost effective -
incremental growth of system!!!

67
Original Food Chain Picture
68
1984 Computer Food Chain
Mainframe
PC
Workstation
Mini Computer
Vector Supercomputer
69
1994 Computer Food Chain
(hitting wall soon)
Mini Computer
PC
Workstation
Mainframe
(future is bleak)
Vector Supercomputer
MPP
70
Computer Food Chain (Now and Future)
71
Parallel Computing
Cluster Computing
MetaComputing
Tightly Coupled
Vector
Pile of PCs
NOW/COW
WS Farms/cycle harvesting
Beowulf
NT-PC Cluster
DASHMEM-NUMA
72
PC Clusters small, medium, large
73
(No Transcript)
74
Computing Elements
75
Networking

Topology
Hardware
Cost
Performance

76
Cluster Building Blocks
77
Channel Bonding
78
Myrinet
Myrinet 2000 switch
Myrinet 2000 NIC
79
Example 320-host Clos topology of 16-port
switches
64 hosts
64 hosts
64 hosts
64 hosts
64 hosts
(From Myricom)
80
Myrinet

Full-duplex 22 Gigabit/second data rate links,
switch ports, and interface ports.
Flow control, error control, and "heartbeat"
continuity monitoring on every link.
Low-latency, cut-through, crossbar switches, with
monitoring for high-availability applications.
Switch networks that can scale to tens of
thousands of hosts, and that can also provide
alternative communication paths between hosts.
Host interfaces that execute a control program to
interact directly with host processes ("OS
bypass") for low-latency communication, and
directly with the network to send, receive, and
buffer packets.

81
Myrinet

Sustained one-way data rate for large messages
1.92mbps
Latency for short messages 9msec.

82
Gigabit Ethernet
Cajun 550
Cajun M770
Cajun P882
Switches by 3COM and Avaya
83
(No Transcript)
84
Network Topology
85
Network Topology
86
Network Topology
87
Topology of the Velocity Cluster at CTC
88
Software all this list for free!