1
CS 267: Applications of Parallel Computers
Lecture 4: Introduction to Parallel Computers and Parallel Programming Methodologies
  • Horst D. Simon
  • http://www.cs.berkeley.edu/~strive/cs267

2
Outline
  • Overview of parallel machines and programming
    models
  • Shared memory
  • Shared address space
  • Message passing
  • Data parallel
  • Clusters of SMPs
  • Trends in real machines

3
A generic parallel architecture
(Diagram: nodes, each a processor P with a memory M, connected by an interconnection network; additional memory may also attach to the network.)
  • Where is the memory physically located?

4
Parallel Programming Models
  • Control
  • How is parallelism created?
  • What orderings exist between operations?
  • How do different threads of control synchronize?
  • Data
  • What data is private vs. shared?
  • How is logically shared data accessed or
    communicated?
  • Operations
  • What are the atomic operations?
  • Cost
  • How do we account for the cost of each of the
    above?

5
Simple Example
  • Consider computing the sum of a function f applied to each element of an array A, i.e., s = f(A[0]) + ... + f(A[n-1]); a sketch of the per-processor work appears after this list.
  • Parallel Decomposition
  • Each evaluation and each partial sum is a task.
  • Assign n/p numbers to each of p procs
  • Each computes independent private results and
    partial sum.
  • One (or all) collects the p partial sums and
    computes the global sum.
  • Two Classes of Data
  • Logically Shared
  • The original n numbers, the global sum.
  • Logically Private
  • The individual function evaluations.
  • What about the individual partial sums?
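
A minimal C sketch of this decomposition (not from the slides; f, A, and the block partitioning are illustrative): each of the p processors runs something like the function below on its own n/p block and produces a private partial sum, which is later combined into the global sum.

/* Illustrative element-wise function; stands in for the f on the slide. */
static double f(double x) { return x * x; }

/* Work assigned to processor 'me' (0 <= me < p): evaluate f on its block of
   n/p elements of A and return a private partial sum.  One processor (or all
   of them) then adds the p partial sums to obtain the global sum s. */
double partial_sum(const double *A, int n, int p, int me)
{
    int chunk = n / p;              /* assume p divides n for simplicity */
    int lo = me * chunk;
    int hi = lo + chunk;
    double local = 0.0;             /* logically private partial result */
    for (int i = lo; i < hi; i++)
        local += f(A[i]);           /* private function evaluations */
    return local;
}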

6
Programming Model 1 Shared Memory
  • Program is a collection of threads of control.
  • Many languages allow threads to be created dynamically, i.e., mid-execution.
  • Each thread has a set of private variables, e.g.
    local variables on the stack.
  • Also a set of shared variables, e.g., static variables, shared common blocks, global heap.
  • Threads communicate implicitly by writing and
    reading shared variables.
  • Threads coordinate using synchronization
    operations on shared variables

(Diagram: the address space is split into a shared region, visible to all threads, e.g., variables x and y, and per-thread private regions, e.g., local variables i and res.)
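
A minimal C/pthreads sketch of this model (not from the slides; the array size, data, and thread count are illustrative): A and s are shared because they are globals, while me, i, and local live on each thread's private stack, and the threads communicate only by reading and writing the shared s.

#include <pthread.h>
#include <stdio.h>

#define N 1000
double A[N];                      /* shared: global array */
double s = 0.0;                   /* shared: global sum */

void *worker(void *arg) {
    int me = *(int *)arg;         /* private: lives on this thread's stack */
    double local = 0.0;           /* private partial result */
    for (int i = me * (N / 2); i < (me + 1) * (N / 2); i++)
        local += A[i];
    s += local;                   /* implicit communication through shared s
                                     (unsynchronized: the race discussed on
                                     slides 11 and 12) */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int id[2] = {0, 1};
    for (int k = 0; k < 2; k++)   /* threads created dynamically, mid-execution */
        pthread_create(&t[k], NULL, worker, &id[k]);
    for (int k = 0; k < 2; k++)
        pthread_join(t[k], NULL); /* a simple form of synchronization */
    printf("s = %g\n", s);
    return 0;
}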
7
Machine Model 1a Shared Memory
  • Processors all connected to a large shared
    memory.
  • Typically called Symmetric Multiprocessors (SMPs)
  • Sun, HP, Intel, IBM SMPs (nodes of Millennium,
    SP)
  • Local memory is not (usually) part of the
    hardware.
  • Cost: much cheaper to access data in cache than in main memory.
  • Difficulty scaling to large numbers of processors
  • <32 processors typical
  • Advantage: uniform memory access (UMA)

8
Example Problem in Scaling Shared Memory
From Pat Worley, ORNL
9
Example Problem in Scaling Shared Memory
From Pat Worley, ORNL
10
Machine Model 1b Distributed Shared Memory
  • Memory is logically shared, but physically
    distributed
  • Any processor can access any address in memory
  • Cache lines (or pages) are passed around machine
  • SGI Origin is the canonical example (plus research machines)
  • Scales to 100s
  • Limitation is cache coherence: protocols need to keep cached copies of the same address consistent

(Diagram: processors P1 through Pn, each with a local memory, connected by a network; together the memories form one logically shared address space.)
11
Shared Memory Code for Computing a Sum
static int s 0
Thread 1 local_s1 0 for i 0, n/2-1
local_s1 local_s1 f(Ai) s s
local_s1
Thread 2 local_s2 0 for i n/2, n-1
local_s2 local_s2 f(Ai) s s
local_s2
What is the problem?
  • A race condition or data race occurs when
  • two processors (or two threads) access the same
    variable, and at least one does a write.
  • The accesses are concurrent (not synchronized)

12
Pitfalls and Solution via Synchronization
  • Pitfall in computing a global sum s = s + local_si

Thread 1 (initially s = 0)
  load s from mem to reg
  s = s + local_s1     (local_s1 in reg)
  store s from reg to mem

Thread 2 (initially s = 0)
  load s from mem to reg     (initially 0)
  s = s + local_s2     (local_s2 in reg)
  store s from reg to mem

(Time runs downward in the slide's diagram.)
  • Instructions from different threads can be
    interleaved arbitrarily.
  • One of the additions may be lost
  • Possible solution: mutual exclusion with locks

Thread 1:
  lock
  load s
  s = s + local_s1
  store s
  unlock

Thread 2:
  lock
  load s
  s = s + local_s2
  store s
  unlock
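
A hedged C/pthreads rendering of this lock-based fix (the mutex and the placeholder f and A are illustrative, not from the slides): each thread accumulates into a private variable, and only the final update of the shared s happens inside the critical section.

#include <pthread.h>

#define N 1000
static double A[N];                       /* shared input array (placeholder data) */
static double f(double x) { return x; }   /* stands in for the f on the slides */

static double s = 0.0;                    /* shared global sum */
static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread sums its half of A into a private variable, then updates the
   shared s inside the critical section, so its load/add/store of s can no
   longer interleave with the other thread's update. */
static void *sum_half(void *arg) {
    int me = *(int *)arg;                 /* 0 or 1 */
    int lo = me * (N / 2), hi = lo + N / 2;
    double local = 0.0;
    for (int i = lo; i < hi; i++)
        local += f(A[i]);
    pthread_mutex_lock(&s_lock);          /* "lock" on the slide */
    s = s + local;                        /* load s, add, store s */
    pthread_mutex_unlock(&s_lock);        /* "unlock" on the slide */
    return NULL;
}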
13
Programming Model 2 Message Passing
  • Program consists of a collection of named
    processes.
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO
    shared data.
  • Logically shared data is partitioned over local
    processes.
  • Processes communicate by explicit send/receive
    pairs
  • Coordination is implicit in every communication
    event.
  • MPI is the most common example

(Diagram: each process holds a private array A, indexed 0 to n, in its own address space; there is no shared data.)
14
Computing s = x(1) + x(2) on each processor
  • First possible solution

Processor 1:
  xlocal = x(1)
  send xlocal, proc2
  receive xremote, proc2
  s = xlocal + xremote

Processor 2:
  xlocal = x(2)
  receive xremote, proc1
  send xlocal, proc1
  s = xlocal + xremote
  • Second possible solution -- what could go wrong?

Processor 1:
  xlocal = x(1)
  send xlocal, proc2
  receive xremote, proc2
  s = xlocal + xremote

Processor 2:
  xlocal = x(2)
  send xlocal, proc1
  receive xremote, proc1
  s = xlocal + xremote
  • What if send/receive acts like the telephone
    system? The post office?
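
For comparison, a minimal MPI sketch of the first solution with two ranks (the literal values standing in for x(1) and x(2) are illustrative). Because rank 0 sends before receiving while rank 1 receives before sending, the exchange completes even if MPI_Send behaves like the telephone system (blocking until a matching receive); if both ranks sent first, as in the second solution, the program could deadlock unless the sends are buffered, post-office style.

#include <mpi.h>
#include <stdio.h>

/* Run with exactly two ranks, e.g., mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank;
    double xlocal, xremote = 0.0, s;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    xlocal = (rank == 0) ? 1.0 : 2.0;    /* stands in for x(1) and x(2) */
    if (rank == 0) {
        MPI_Send(&xlocal, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&xremote, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(&xremote, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&xlocal, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    s = xlocal + xremote;                /* both ranks end up with x(1) + x(2) */
    printf("rank %d: s = %g\n", rank, s);
    MPI_Finalize();
    return 0;
}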

15
MPI: the de facto standard
  • By 2002, MPI had become the de facto standard for parallel computing
  • The software challenge: overcoming the MPI barrier
  • MPI finally created a standard for application development in the HPC community
  • Standards are always a barrier to further development
  • The MPI standard is a least common denominator, building on mid-80s technology
  • Programming Model reflects hardware!

"I am not sure how I will program a Petaflops computer, but I am sure that I will need MPI somewhere." (HDS, 2001)
16
Machine Model 2 Distributed Memory
  • Cray T3E, IBM SP, Millennium.
  • Each processor is connected to its own memory and cache, but cannot directly access another processor's memory.
  • Each node has a network interface (NI) for all
    communication and synchronization.

17
PC Clusters Contributions of Beowulf
  • An experiment in parallel computing systems
  • Established vision of low cost, high end
    computing
  • Demonstrated effectiveness of PC clusters for
    some (not all) classes of applications
  • Provided networking software
  • Conveyed findings to broad community (great PR)
  • Tutorials and book
  • Design standard to rally community!
  • Standards beget books, trained people, and software: a virtuous cycle

Adapted from Gordon Bell, presentation at
Salishan 2000
18
Linus's Law: Linux Everywhere
  • Software is or should be free (Stallman)
  • All source code is 'open'
  • Everyone is a tester
  • Everything proceeds a lot faster when everyone works on one code (HPC: nothing gets done if resources are scattered)
  • Anyone can support and market the code for any
    price
  • Zero cost software attracts users!
  • All the developers write lots of code
  • Prevents community from losing HPC software (CM5,
    T3E)

19
Commercially Integrated Tflop/s Clusters Are
Happening today
  • Shell: largest engineering/scientific cluster
  • NCSA: 1024-processor cluster (IA64)
  • Univ. Heidelberg cluster
  • PNNL: announced an 8 Tflop/s (peak) IA64 cluster from HP with Quadrics interconnect
  • DTF in the US: announced 4 clusters for a total of 13 Tflop/s (peak)

But make no mistake: Itanium and McKinley are not commodity products
20
Internet Computing: SETI@home
  • Running on 500,000 PCs, 1000 CPU Years per Day
  • 485,821 CPU Years so far
  • Sophisticated Data Signal Processing Analysis
  • Distributes Datasets from Arecibo Radio Telescope

Next Step: Allen Telescope Array
21
Programming Model 2b Global Addr Space
  • Program consists of a collection of named
    processes.
  • Usually fixed at program startup time
  • Local and shared data, as in shared memory model
  • But, shared data is partitioned over local
    processes
  • Remote data stays remote on distributed memory
    machines
  • Processes communicate by writes to shared
    variables
  • Explicit synchronization needed to coordinate
  • UPC, Titanium, Split-C are some examples
  • Global Address Space programming is an
    intermediate point between message passing and
    shared memory
  • Most common on the Cray T3E, which had some hardware support for remote reads/writes
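
A sketch of this style in C, using hypothetical one-sided remote_read / remote_write / barrier operations (these names are placeholders, not the actual UPC, Titanium, or Split-C syntax): data is partitioned so element i of the shared array lives on process i, processes communicate by writing shared data directly, and an explicit barrier coordinates them before the data is read.

#define P 4                               /* number of processes (illustrative) */

/* Hypothetical one-sided operations; stand-ins for what a global address
   space language or the T3E's remote read/write hardware would provide. */
extern double remote_read(int proc, const double *addr);
extern void   remote_write(int proc, double *addr, double value);
extern void   barrier(void);

double s_partial[P];                      /* logically shared; element i owned by process i */

void contribute(int me, double local_sum) {
    remote_write(me, &s_partial[me], local_sum);  /* communicate by writing shared data */
    barrier();                                    /* explicit synchronization to coordinate */
}

double global_sum(void) {
    double s = 0.0;
    for (int i = 0; i < P; i++)
        s += remote_read(i, &s_partial[i]);       /* remote data stays remote until read */
    return s;
}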

22
Programming Model 3 Data Parallel
  • Single thread of control consisting of parallel
    operations.
  • Parallel operations applied to all (or a defined
    subset) of a data structure, usually an array
  • Communication is implicit in parallel operators
  • Elegant and easy to understand and reason about
  • Coordination is implicit: statements are executed synchronously
  • Drawbacks
  • Not all problems fit this model
  • Difficult to map onto coarse-grained machines

A = array of all data
fA = f(A)
s = sum(fA)
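
Read sequentially (an illustration only, not data-parallel code), the three statements above correspond to the C loops below; in the data-parallel model the loops are implicit, every element can be processed at once, and the sum is a built-in reduction.

extern double f(double);                 /* assumed element-wise function */

/* Sequential meaning of:  A = array of all data; fA = f(A); s = sum(fA). */
double data_parallel_sum(const double *A, double *fA, int n)
{
    for (int i = 0; i < n; i++)          /* fA = f(A): element-wise map */
        fA[i] = f(A[i]);
    double s = 0.0;
    for (int i = 0; i < n; i++)          /* s = sum(fA): reduction */
        s += fA[i];
    return s;
}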
23
Machine Model 3a SIMD System
  • A large number of (usually) small processors.
  • A single control processor issues each
    instruction.
  • Each processor executes the same instruction.
  • Some processors may be turned off on some
    instructions.
  • Machines are not popular (CM2), but programming
    model is.

(Diagram: a control processor issues each instruction to an array of processors connected by an interconnect.)
  • Implemented by mapping n-fold parallelism to p
    processors.
  • Mostly done in the compilers (e.g., HPF).

24
Model 3b Vector Machines
  • Vector architectures are based on a single
    processor
  • Multiple functional units
  • All performing the same operation
  • Instructions may specify large amounts of parallelism (e.g., 64-way), but the hardware executes only a subset in parallel
  • Historically important
  • Overtaken by MPPs in the 90s
  • Still visible as a processor architecture within
    an SMP

25
Earth Simulator Architecture Optimizing for the
full range of tasks
  • Parallel Vector Architecture
  • High speed (vector) processors
  • High memory bandwidth (vector architecture)
  • Fast network (new crossbar switch)

Rearranging commodity parts can't match this performance
26
Earth Simulator Configuration of a General
Purpose Supercomputer
  • 640 nodes
  • 8 vector processors of 8 Gflop/s each and 16 GB of shared memory per node.
  • Total of 5,120 processors
  • Total 40 Tflop/s peak performance
  • Main memory 10 TB
  • High bandwidth (32 GB/s), low latency network
    connecting nodes.
  • Disk
  • 450 TB for systems operations
  • 250 TB for users.
  • Mass storage system: 12 Automatic Cartridge Systems (U.S.-made STK PowderHorn 9310); total storage capacity is approximately 1.6 PB.

27
Cray SV2 Parallel Vector Architecture
  • 12.8 Gflop/s Vector processors
  • 4-processor nodes sharing up to 64 GB of memory
  • Single System Image to 4096 Processors
  • 64 CPUs/800 GFLOPS in LC cabinet

28
Machine Model 4 Clusters of SMPs
  • SMPs are the fastest commodity machine, so use
    them as a building block for a larger machine
    with a network
  • Common names
  • CLUMP = Cluster of SMPs
  • Hierarchical machines, constellations
  • Most modern machines look like this
  • Millennium, IBM SPs (not the T3E), ...
  • What is an appropriate programming model 4?
  • Treat machine as flat, always use message
    passing, even within SMP (simple, but ignores an
    important part of memory hierarchy).
  • Shared memory within one SMP, but message passing
    outside of an SMP.

29
Cluster of SMP Approach
  • A supercomputer is a stretched high-end server
  • A parallel system is built by assembling nodes that are modest-size, commercial SMP servers; just put more of them together

Image from LLNL
30
NERSC-3 Vital Statistics
  • 5 Teraflop/s peak performance; 3.05 Teraflop/s with Linpack
  • 208 nodes, 16 CPUs per node at 1.5 Gflop/s per
    CPU
  • Worst case: Sustained System Performance measure of 0.358 Tflop/s (7.2%)
  • Best case: Gordon Bell submission, 2.46 Tflop/s on 134 nodes (77%)
  • 4.5 TB of main memory
  • 140 nodes with 16 GB each, 64 nodes with 32 GB, and 4 nodes with 64 GB.
  • 40 TB total disk space
  • 20 TB formatted shared, global, parallel file space; 15 TB local disk for system usage
  • Unique 512-way Double/Single switch configuration

31
Outline
  • Overview of parallel machines and programming
    models
  • Shared memory
  • Shared address space
  • Message passing
  • Data parallel
  • Clusters of SMPs
  • Trends in real machines

32
TOP500
  • Listing of the 500 most powerful computers in the world
  • Yardstick: Rmax from LINPACK (Ax = b, dense problem)
  • Updated twice a year: ISC'xy in Germany in June, SC'xy in the USA in November
  • All data available from www.top500.org
33
TOP500 list - Data shown
  • Manufacturer: manufacturer or vendor
  • Computer: type indicated by manufacturer or vendor
  • Installation Site: customer
  • Location: location and country
  • Year: year of installation / last major update
  • Customer Segment: Academic, Research, Industry, Vendor, Classified
  • Processors: number of processors
  • Rmax: maximal LINPACK performance achieved
  • Rpeak: theoretical peak performance
  • Nmax: problem size for achieving Rmax
  • N1/2: problem size for achieving half of Rmax
  • Nworld: position within the TOP500 ranking

34
TOP10
35
TOP500 - Performance
36
Analysis of TOP500 Data
  • Annual performance growth is about a factor of 1.82
  • Two factors contribute almost equally to the annual total performance growth:
  • Processor count grows per year on average by a factor of 1.30, and
  • Processor performance grows by a factor of 1.40 (compared to 1.58 for Moore's Law); note that 1.30 x 1.40 = 1.82
  • Strohmaier, Dongarra, Meuer, and Simon, Parallel Computing 25, 1999, pp. 1517-1544.

37
Manufacturers
38
Processor Type
39
Chip Technology
40
Chip Technology
41
Architectures
42
NOW - Cluster
43
Summary
  • Historically, each parallel machine was unique,
    along with its programming model and programming
    language.
  • It was necessary to throw away software and start
    over with each new kind of machine.
  • Now we distinguish the programming model from the
    underlying machine, so we can write portably
    correct codes that run on many machines.
  • MPI now the most portable option, but can be
    tedious.
  • Writing portably fast code requires tuning for
    the architecture.
  • Algorithm design challenge is to make this
    process easy.
  • Example: picking a block size, not rewriting the whole algorithm.