# CS 267: Introduction to Parallel Machines and Programming Models


1
CS 267 Introduction to Parallel Machines and
Programming Models
• Katherine Yelick
• yelick@cs.berkeley.edu
• http://www.cs.berkeley.edu/yelick/cs267

2
Outline
• Overview of parallel machines and programming
models
• Shared memory
• Message passing
• Data parallel
• Clusters of SMPs
• Trends in real machines

3
A generic parallel architecture
[Diagram: processors (P), each paired with a memory (M), connected by an interconnection network, with a global memory attached to the network]
• Where is the memory physically located?

4
Parallel Programming Models
• Control
• How is parallelism created?
• What orderings exist between operations?
• How do different threads of control synchronize?
• Data
• What data is private vs. shared?
• How is logically shared data accessed or
communicated?
• Operations
• What are the atomic (indivisible) operations?
• Cost
• How do we account for the cost of each of the
above?

5
Simple Example
• Consider applying a function f to each element of an array and summing the results
• Parallel Decomposition
• Each evaluation and each partial sum is a task.
• Assign n/p numbers to each of p procs
• Each computes independent private results and
partial sum.
• One (or all) collects the p partial sums and
computes the global sum.
• Two Classes of Data
• Logically Shared
• The original n numbers, the global sum.
• Logically Private
• The individual function evaluations.
• What about the individual partial sums?

6
Programming Model 1: Shared Memory
• Program is a collection of threads of control.
• Can be created dynamically, mid-execution, in
some languages
• Each thread has a set of private variables, e.g.,
local stack variables
• Also a set of shared variables, e.g., static
variables, shared common blocks, or global heap.
• Threads communicate implicitly by writing and reading shared variables
• Threads coordinate by synchronizing on shared
variables

[Diagram: threads P0, P1, …, Pn, each with a private memory (e.g., y = ..s...), all reading and writing a shared memory that holds s]
7
Shared Memory Code for Computing a Sum
static int s = 0;

Thread 1:
  for i = 0, n/2-1
    s = s + f(A[i])

Thread 2:
  for i = n/2, n-1
    s = s + f(A[i])
• Problem is a race condition on variable s in the
program
• A race condition or data race occurs when
• two processors (or two threads) access the same
variable, and at least one does a write.
• The accesses are concurrent (not synchronized) so
they could happen simultaneously

8
Shared Memory Code for Computing a Sum
static int s = 0;

Thread 1:                             Thread 2:
  compute f(A[i]), put in reg0 (7)      compute f(A[i]), put in reg0 (9)
  reg1 = s                    (27)      reg1 = s                    (27)
  reg1 = reg1 + reg0          (34)      reg1 = reg1 + reg0          (36)
  s = reg1                    (34)      s = reg1                    (36)
• Assume s = 27, f(A[i]) = 7 on Thread 1 and 9 on Thread 2
• For this program to work, s should be 43 at the end
• but it may be 43, 34, or 36
• The atomic operations are reads and writes
• Never see ½ of one number
• All computations happen in (private) registers

9
Improved Code for Computing a Sum
static int s = 0;

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i])
  s = s + local_s1

Thread 2:
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i])
  s = s + local_s2
• Since addition is associative, it's OK to rearrange the order of operations
• Most computation is on private variables
• Sharing frequency is also reduced, which might
improve speed
• But there is still a race condition on the update
of shared s
• The race condition can be fixed by adding locks
(only one thread can hold a lock at a time
others wait for it)

10
Machine Model 1a: Shared Memory
• Processors all connected to a large shared
memory.
• Typically called Symmetric Multiprocessors (SMPs)
• Sun, HP, Intel, IBM SMPs (nodes of Millennium,
SP)
• Local memory is not (usually) part of the
hardware abstraction.
• Difficulty scaling to large numbers of processors
• < 32 processors typical
• Advantage: uniform memory access (UMA)
• Cost: much cheaper to access data in cache than main memory

[Diagram: processors P1, P2, …, Pn connected by a network/bus to a single shared memory]
11
Problems Scaling Shared Memory
• Why not just add more processors (with a larger memory)?
• The memory bus becomes a bottleneck
• Example from a Parallel Spectral Transform
Shallow Water Model (PSTSWM) demonstrates the
problem
• Experimental results (and slide) from Pat Worley
at ORNL
• This is an important kernel in atmospheric models
• 99% of the floating point operations are multiplies or adds, which generally run well on all processors
• But it does sweeps through memory with little
reuse of operands, which exercises the memory
system
• These experiments show serial performance, with
one copy of the code running independently on
varying numbers of procs
• The best case for shared memory: no sharing
• But the data doesn't all fit in the registers/cache

12
Example Problem in Scaling Shared Memory
• Performance degradation is a smooth function of
the number of processes.
• No shared data between them, so there should be
perfect parallelism.
• (Code was run for 18 vertical levels with a range of horizontal sizes.)

From Pat Worley, ORNL
13
Machine Model 1b: Distributed Shared Memory
• Memory is logically shared, but physically
distributed
• Any processor can access any address in memory
• Cache lines (or pages) are passed around machine
• SGI Origin is the canonical example (+ research machines)
• Scales to 100s
• Limitation is that cache coherence protocols need to keep cached copies of the same address consistent

[Diagram: processors P1, P2, …, Pn, each with its own memory, connected by a network; the memories are physically distributed but form one logically shared address space]
14
Programming Model 2: Message Passing
• Program consists of a collection of named
processes.
• Usually fixed at program startup time
• Thread of control plus local address space; no shared data
• Logically shared data is partitioned over local processes
• Processes communicate by explicit send/receive
pairs
• Coordination is implicit in every communication
event.
• MPI is the most common example

[Diagram: processes P0, P1, …, Pn, each with only private memory (e.g., y = ..s...), connected by a network]
15
Computing s = A[1] + A[2] on each processor
• First possible solution: what could go wrong?

Processor 1:
  xlocal = A[1]
  send xlocal to proc2
  receive xremote from proc2
  s = xlocal + xremote

Processor 2:
  xlocal = A[2]
  send xlocal to proc1
  receive xremote from proc1
  s = xlocal + xremote
• What if send/receive acts like the telephone system? Like the post office?

16
MPI: the de facto standard
• By 2002, MPI had become the de facto standard for parallel computing
• The software challenge: overcoming the MPI barrier
• MPI finally created a standard for applications development in the HPC community
• Standards are always a barrier to further
development
• The MPI standard is a least common denominator
building on mid-80s technology
• Programming Model reflects hardware!

"I am not sure how I will program a Petaflops computer, but I am sure that I will need MPI somewhere." HDS 2001
17
Machine Model 2a: Distributed Memory
• Cray T3E, NOW, IBM SP2
• IBM SP-3, Millennium, CITRIS are distributed
memory machines, but the nodes are SMPs.
• Each processor has its own memory and cache, but cannot directly access another processor's memory.
• Each node has a network interface (NI) for all
communication and synchronization.

18
PC Clusters Contributions of Beowulf
• An experiment in parallel computing systems
• Established vision of low cost, high end
computing
• Demonstrated effectiveness of PC clusters for
some (not all) classes of applications
• Provided networking software
• Conveyed findings to broad community (great PR)
• Tutorials and book
• Design standard to rally community!
• Standards beget books, trained people, software: a virtuous cycle

Adapted from Gordon Bell, presentation at
Salishan 2000
19
Open Source Software Model for HPC
• Linus's law, named after Linus Torvalds, the
creator of Linux, states that "given enough
eyeballs, all bugs are shallow".
• All source code is open
• Everyone is a tester
• Everything proceeds a lot faster when everyone works on one code (in HPC, nothing gets done if resources are scattered)
• Software is or should be free (Stallman)
• Anyone can support and market the code for any
price
• Zero cost software attracts users!
• Prevents community from losing HPC software (CM5,
T3E)

20
Tflop/s Clusters
• The following are examples of clusters configured
out of separate networks and processor components
• Shell: largest engineering/scientific cluster
• NCSA 1024 processor cluster (IA64)
• Univ. Heidelberg cluster
• PNNL announced an 8 Tflop/s (peak) IA64 cluster
• DTF in US announced 4 clusters for a total of 13
Teraflops (peak)

But make no mistake: Itanium and McKinley are not commodity products
21
Internet Computing: SETI@home
• Running on 500,000 PCs, 1000 CPU Years per Day
• 485,821 CPU Years so far
• Sophisticated Data Signal Processing Analysis
• Distributes Datasets from Arecibo Radio Telescope

Next Step: Allen Telescope Array
22
Programming Model 2b: Global Address Space
• Program consists of a collection of named threads
• Usually fixed at program startup time
• Local and shared data, as in shared memory model
• But, shared data is partitioned over local
processes
• Cost model says remote data is expensive
• Examples: UPC, Titanium, Co-Array Fortran
• Global Address Space programming is an
intermediate point between message passing and
shared memory

[Diagram: a logically shared memory partitioned among threads, holding s0 = 27, s1 = 27, …, sn = 27 (and y = ..si...), plus a private memory for each thread P0, P1, …, Pn]
23
Machine Model 2b: Global Address Space
• Cray T3D, T3E, X1, and HP Alphaserver cluster
• Clusters built with Quadrics, Myrinet, or
Infiniband
• The network interface supports RDMA (Remote
Direct Memory Access)
• NI can directly access memory without
interrupting the CPU
• One processor can read/write memory with
one-sided operations (put/get)
• Not just a load/store as on a shared memory
machine
• Remote data is typically not cached locally

Global address space may be supported in varying
degrees
24
Programming Model 3: Data Parallel
• Single thread of control consisting of parallel
operations.
• Parallel operations applied to all (or a defined
subset) of a data structure, usually an array
• Communication is implicit in parallel operators
• Elegant and easy to understand and reason about
• Coordination is implicit: statements are executed synchronously
• Similar to Matlab language for array operations
• Drawbacks
• Not all problems fit this model
• Difficult to map onto coarse-grained machines

A = array of all data
fA = f(A)
s = sum(fA)
25
Machine Model 3a: SIMD System
• A large number of (usually) small processors.
• A single control processor issues each
instruction.
• Each processor executes the same instruction.
• Some processors may be turned off on some
instructions.
• Machines are very specialized to scientific
computing, so they are not popular with vendors
(CM2, Maspar)
• Programming model can be implemented in the
compiler
• mapping n-fold parallelism to p processors, n >> p, but it's hard (e.g., HPF)

[Diagram: a control processor broadcasting instructions over an interconnect to many small processors]
26
Model 3b: Vector Machines
• Vector architectures are based on a single
processor
• Multiple functional units
• All performing the same operation
• Instructions may specify large amounts of parallelism (e.g., 64-way), but hardware executes only a subset in parallel
• Historically important
• Overtaken by MPPs in the 90s
• Re-emerging in recent years
• At a large scale in the Earth Simulator (NEC SX6)
and Cray X1
• At a small scale in SIMD media extensions to microprocessors
• SSE, SSE2 (Intel Pentium/IA64)
• Altivec (IBM/Motorola/Apple PowerPC)
• VIS (Sun Sparc)
• Key idea: the compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to

27
Vector Processors
• Vector instructions operate on a vector of
elements
• These are specified as operations on vector
registers
• A supercomputer vector register holds 32-64 elts
• The number of elements is larger than the amount
of parallel hardware, called vector pipes or
lanes, say 2-4
• The hardware performs a full vector operation in elements-per-vector-register / pipes steps

[Diagram: vector add r3 = r1 + r2; logically it performs elts additions in parallel, while the hardware actually performs pipes additions in parallel]
28
Cray X1 Node
• Cray X1 builds a larger virtual vector, called
an MSP
• 4 SSPs (each a 2-pipe vector processor) make up
an MSP
• Compiler will (try to) vectorize/parallelize
across the MSP

[Figure: Cray X1 node built from custom blocks: 12.8 Gflop/s (64 bit), 25.6 Gflop/s (32 bit), 25-41 GB/s, 2 MB Ecache, at a frequency of 400/800 MHz; to local memory and network: 25.6 GB/s and 12.8-20.5 GB/s. Figure source: J. Levesque, Cray]
29
Cray X1 Parallel Vector Architecture
• Cray combines several technologies in the X1
• 12.8 Gflop/s Vector processors (MSP)
• Shared caches (unusual on earlier vector
machines)
• 4 processor nodes sharing up to 64 GB of memory
• Single System Image to 4096 Processors
• Remote put/get between nodes (faster than MPI)

30
Earth Simulator Architecture
• Parallel Vector Architecture
• High speed (vector) processors
• High memory bandwidth (vector architecture)
• Fast network (new crossbar switch)

Rearranging commodity parts can't match this performance
31
Machine Model 4: Clusters of SMPs
• SMPs are the fastest commodity machine, so use
them as a building block for a larger machine
with a network
• Common names
• CLUMP = Cluster of SMPs
• Hierarchical machines, constellations
• Most modern machines look like this
• Millennium, IBM SPs (not the T3E)...
• What is an appropriate programming model 4?
• Treat machine as flat, always use message
passing, even within SMP (simple, but ignores an
important part of memory hierarchy).
• Shared memory within one SMP, but message passing
outside of an SMP.

32
Cluster of SMP Approach
• A supercomputer is a stretched high-end server
• Parallel system is built by assembling nodes that are modest-size commercial SMP servers; just put more of them together

Image from LLNL
33
Outline
• Overview of parallel machines and programming
models
• Shared memory
• Message passing
• Data parallel
• Clusters of SMPs
• Trends in real machines

34
TOP500
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK (Ax = b, dense problem)
- Updated twice a year: ISC'xy in Germany (June xy), SC'xy in the USA (November xy)
- All data available from www.top500.org
35
TOP500 list - Data shown
• Manufacturer: manufacturer or vendor
• Computer: type indicated by manufacturer or vendor
• Installation Site: customer
• Location: location and country
• Year: year of installation/last major update
• Processors: number of processors
• Rmax: maximal LINPACK performance achieved
• Rpeak: theoretical peak performance
• Nmax: problem size for achieving Rmax
• N1/2: problem size for achieving half of Rmax
• Nworld: position within the TOP500 ranking

36
22nd List: The TOP10
37
Continents Performance
38
Continents Performance
39
Customer Types
40
Manufacturers
41
Manufacturers Performance
42
Processor Types
43
Architectures
44
NOW Clusters
45
Analysis of TOP500 Data
• Annual performance growth is about a factor of 1.82
• Two factors contribute almost equally to the annual total performance growth:
• The number of processors grows per year on average by a factor of 1.30, and
• Processor performance grows by a factor of 1.40, compared to 1.58 for Moore's Law
• Strohmaier, Dongarra, Meuer, and Simon, Parallel
Computing 25, 1999, pp 1517-1544.

46
Summary
• Historically, each parallel machine was unique,
along with its programming model and programming
language.
• It was necessary to throw away software and start
over with each new kind of machine.
• Now we distinguish the programming model from the
underlying machine, so we can write portably
correct codes that run on many machines.
• MPI is now the most portable option, but it can be tedious.
• Writing portably fast code requires tuning for
the architecture.
• Algorithm design challenge is to make this
process easy.
• Example: picking a block size, not rewriting the whole algorithm.

47
• Cray X1
• http://www.sc-conference.org/sc2003/paperpdfs/pap183.pdf
• Clusters
• http://www.mirror.ac.uk/sites/www.beowulf.org/papers/ICPP95/
• "Parallel Computer Architecture: A Hardware/Software Approach" by Culler, Singh, and Gupta, Chapter 1.
• Next week: current high performance architectures
• Shared memory (for Monday)
• "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors", Gharachorloo et al., Proceedings of the International Symposium on Computer Architecture, 1990.