1
CS 267: Applications of Parallel Computers
Lecture 4: Introduction to Parallel Computers and Parallel Programming Methodologies
  • Horst D. Simon
  • http://www.cs.berkeley.edu/~strive/cs267

2
Outline
  • Overview of parallel machines and programming
    models
  • Shared memory
  • Shared address space
  • Message passing
  • Data parallel
  • Clusters of SMPs
  • Trends in real machines

3
A generic parallel architecture
(Diagram: nodes, each a processor P with a memory M, connected by an interconnection network; additional memory may also attach to the network.)
  • Where is the memory physically located?

4
Parallel Programming Models
  • Control
  • How is parallelism created?
  • What orderings exist between operations?
  • How do different threads of control synchronize?
  • Data
  • What data is private vs. shared?
  • How is logically shared data accessed or
    communicated?
  • Operations
  • What are the atomic operations?
  • Cost
  • How do we account for the cost of each of the
    above?

5
Simple Example
  • Consider computing the sum of a function f applied to each element of an array A, i.e., s = f(A[0]) + ... + f(A[n-1]); a sketch of the per-processor work appears after this list.
  • Parallel Decomposition
  • Each evaluation and each partial sum is a task.
  • Assign n/p numbers to each of p procs
  • Each computes independent private results and
    partial sum.
  • One (or all) collects the p partial sums and
    computes the global sum.
  • Two Classes of Data
  • Logically Shared
  • The original n numbers, the global sum.
  • Logically Private
  • The individual function evaluations.
  • What about the individual partial sums?
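
A minimal C sketch of this decomposition (not from the slides; f, A, and the block partitioning are illustrative): each of the p processors runs something like the function below on its own n/p block and produces a private partial sum, which is later combined into the global sum.

/* Illustrative element-wise function; stands in for the f on the slide. */
static double f(double x) { return x * x; }

/* Work assigned to processor 'me' (0 <= me < p): evaluate f on its block of
   n/p elements of A and return a private partial sum.  One processor (or all
   of them) then adds the p partial sums to obtain the global sum s. */
double partial_sum(const double *A, int n, int p, int me)
{
    int chunk = n / p;              /* assume p divides n for simplicity */
    int lo = me * chunk;
    int hi = lo + chunk;
    double local = 0.0;             /* logically private partial result */
    for (int i = lo; i < hi; i++)
        local += f(A[i]);           /* private function evaluations */
    return local;
}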

6
Programming Model 1 Shared Memory
  • Program is a collection of threads of control.
  • Many languages allow threads to be created dynamically, i.e., mid-execution.
  • Each thread has a set of private variables, e.g.
    local variables on the stack.
  • Also a set of shared variables, e.g., static variables, shared common blocks, global heap.
  • Threads communicate implicitly by writing and
    reading shared variables.
  • Threads coordinate using synchronization
    operations on shared variables

(Diagram: the address space is split into a shared region, visible to all threads, e.g., variables x and y, and per-thread private regions, e.g., local variables i and res.)
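
A minimal C/pthreads sketch of this model (not from the slides; the array size, data, and thread count are illustrative): A and s are shared because they are globals, while me, i, and local live on each thread's private stack, and the threads communicate only by reading and writing the shared s.

#include <pthread.h>
#include <stdio.h>

#define N 1000
double A[N];                      /* shared: global array */
double s = 0.0;                   /* shared: global sum */

void *worker(void *arg) {
    int me = *(int *)arg;         /* private: lives on this thread's stack */
    double local = 0.0;           /* private partial result */
    for (int i = me * (N / 2); i < (me + 1) * (N / 2); i++)
        local += A[i];
    s += local;                   /* implicit communication through shared s
                                     (unsynchronized: the race discussed on
                                     slides 11 and 12) */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int id[2] = {0, 1};
    for (int k = 0; k < 2; k++)   /* threads created dynamically, mid-execution */
        pthread_create(&t[k], NULL, worker, &id[k]);
    for (int k = 0; k < 2; k++)
        pthread_join(t[k], NULL); /* a simple form of synchronization */
    printf("s = %g\n", s);
    return 0;
}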
7
Machine Model 1a Shared Memory
  • Processors all connected to a large shared
    memory.
  • Typically called Symmetric Multiprocessors (SMPs)
  • Sun, HP, Intel, IBM SMPs (nodes of Millennium,
    SP)
  • Local memory is not (usually) part of the
    hardware.
  • Cost: much cheaper to access data in cache than in main memory.
  • Difficulty scaling to large numbers of processors
  • <32 processors typical
  • Advantage: uniform memory access (UMA)

8
Example Problem in Scaling Shared Memory
From Pat Worley, ORNL
9
Example Problem in Scaling Shared Memory
From Pat Worley, ORNL
10
Machine Model 1b Distributed Shared Memory
  • Memory is logically shared, but physically
    distributed
  • Any processor can access any address in memory
  • Cache lines (or pages) are passed around machine
  • SGI Origin is the canonical example (plus research machines)
  • Scales to 100s
  • Limitation is cache coherence: protocols need to keep cached copies of the same address consistent

(Diagram: processors P1 through Pn, each with a local memory, connected by a network; together the memories form one logically shared address space.)
11
Shared Memory Code for Computing a Sum
static int s 0
Thread 1 local_s1 0 for i 0, n/2-1
local_s1 local_s1 f(Ai) s s
local_s1
Thread 2 local_s2 0 for i n/2, n-1
local_s2 local_s2 f(Ai) s s
local_s2
What is the problem?
  • A race condition or data race occurs when
  • two processors (or two threads) access the same
    variable, and at least one does a write.
  • The accesses are concurrent (not synchronized)

12
Pitfalls and Solution via Synchronization
  • Pitfall in computing a global sum s = s + local_si

Thread 1 (initially s = 0)
  load s from mem to reg
  s = s + local_s1     (local_s1 in reg)
  store s from reg to mem

Thread 2 (initially s = 0)
  load s from mem to reg     (initially 0)
  s = s + local_s2     (local_s2 in reg)
  store s from reg to mem

(Time runs downward in the slide's diagram.)
  • Instructions from different threads can be
    interleaved arbitrarily.
  • One of the additions may be lost
  • Possible solution: mutual exclusion with locks

Thread 1:
  lock
  load s
  s = s + local_s1
  store s
  unlock

Thread 2:
  lock
  load s
  s = s + local_s2
  store s
  unlock
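
A hedged C/pthreads rendering of this lock-based fix (the mutex and the placeholder f and A are illustrative, not from the slides): each thread accumulates into a private variable, and only the final update of the shared s happens inside the critical section.

#include <pthread.h>

#define N 1000
static double A[N];                       /* shared input array (placeholder data) */
static double f(double x) { return x; }   /* stands in for the f on the slides */

static double s = 0.0;                    /* shared global sum */
static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread sums its half of A into a private variable, then updates the
   shared s inside the critical section, so its load/add/store of s can no
   longer interleave with the other thread's update. */
static void *sum_half(void *arg) {
    int me = *(int *)arg;                 /* 0 or 1 */
    int lo = me * (N / 2), hi = lo + N / 2;
    double local = 0.0;
    for (int i = lo; i < hi; i++)
        local += f(A[i]);
    pthread_mutex_lock(&s_lock);          /* "lock" on the slide */
    s = s + local;                        /* load s, add, store s */
    pthread_mutex_unlock(&s_lock);        /* "unlock" on the slide */
    return NULL;
}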
13
Programming Model 2 Message Passing
  • Program consists of a collection of named
    processes.
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO
    shared data.
  • Logically shared data is partitioned over local
    processes.
  • Processes communicate by explicit send/receive
    pairs
  • Coordination is implicit in every communication
    event.
  • MPI is the most common example

(Diagram: each process holds a private array A, indexed 0 to n, in its own address space; there is no shared data.)
14
Computing s = x(1) + x(2) on each processor
  • First possible solution

Processor 1:
  xlocal = x(1)
  send xlocal, proc2
  receive xremote, proc2
  s = xlocal + xremote

Processor 2:
  xlocal = x(2)
  receive xremote, proc1
  send xlocal, proc1
  s = xlocal + xremote
  • Second possible solution -- what could go wrong?

Processor 1:
  xlocal = x(1)
  send xlocal, proc2
  receive xremote, proc2
  s = xlocal + xremote

Processor 2:
  xlocal = x(2)
  send xlocal, proc1
  receive xremote, proc1
  s = xlocal + xremote
  • What if send/receive acts like the telephone
    system? The post office?
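
For comparison, a minimal MPI sketch of the first solution with two ranks (the literal values standing in for x(1) and x(2) are illustrative). Because rank 0 sends before receiving while rank 1 receives before sending, the exchange completes even if MPI_Send behaves like the telephone system (blocking until a matching receive); if both ranks sent first, as in the second solution, the program could deadlock unless the sends are buffered, post-office style.

#include <mpi.h>
#include <stdio.h>

/* Run with exactly two ranks, e.g., mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank;
    double xlocal, xremote = 0.0, s;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    xlocal = (rank == 0) ? 1.0 : 2.0;    /* stands in for x(1) and x(2) */
    if (rank == 0) {
        MPI_Send(&xlocal, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&xremote, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(&xremote, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&xlocal, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    s = xlocal + xremote;                /* both ranks end up with x(1) + x(2) */
    printf("rank %d: s = %g\n", rank, s);
    MPI_Finalize();
    return 0;
}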

15
MPI: the de facto standard
  • By 2002, MPI had become the de facto standard for parallel computing
  • The software challenge: overcoming the MPI barrier
  • MPI finally created a standard for application development in the HPC community
  • Standards are always a barrier to further development
  • The MPI standard is a least common denominator, building on mid-80s technology
  • Programming Model reflects hardware!

"I am not sure how I will program a Petaflops computer, but I am sure that I will need MPI somewhere." (HDS, 2001)
16
Machine Model 2 Distributed Memory
  • Cray T3E, IBM SP, Millennium.
  • Each processor is connected to its own memory and cache, but cannot directly access another processor's memory.
  • Each node has a network interface (NI) for all
    communication and synchronization.

17
PC Clusters Contributions of Beowulf
  • An experiment in parallel computing systems
  • Established vision of low cost, high end
    computing
  • Demonstrated effectiveness of PC clusters for
    some (not all) classes of applications
  • Provided networking software
  • Conveyed findings to broad community (great PR)
  • Tutorials and book
  • Design standard to rally community!
  • Standards beget books, trained people, and software: a virtuous cycle

Adapted from Gordon Bell, presentation at
Salishan 2000
18
Linus's Law: Linux Everywhere
  • Software is or should be free (Stallman)
  • All source code is 'open'
  • Everyone is a tester
  • Everything proceeds a lot faster when everyone works on one code (HPC: nothing gets done if resources are scattered)
  • Anyone can support and market the code for any
    price
  • Zero cost software attracts users!
  • All the developers write lots of code
  • Prevents community from losing HPC software (CM5,
    T3E)

19
Commercially Integrated Tflop/s Clusters Are
Happening today
  • Shell: largest engineering/scientific cluster
  • NCSA: 1024-processor cluster (IA64)
  • Univ. Heidelberg cluster
  • PNNL: announced an 8 Tflop/s (peak) IA64 cluster from HP with Quadrics interconnect
  • DTF in the US: announced 4 clusters for a total of 13 Tflop/s (peak)

But make no mistake: Itanium and McKinley are not commodity products
20
Internet Computing: SETI@home
  • Running on 500,000 PCs, 1000 CPU Years per Day
  • 485,821 CPU Years so far
  • Sophisticated Data Signal Processing Analysis
  • Distributes Datasets from Arecibo Radio Telescope

Next Step: Allen Telescope Array
21
Programming Model 2b Global Addr Space
  • Program consists of a collection of named
    processes.
  • Usually fixed at program startup time
  • Local and shared data, as in shared memory model
  • But, shared data is partitioned over local
    processes
  • Remote data stays remote on distributed memory
    machines
  • Processes communicate by writes to shared
    variables
  • Explicit synchronization needed to coordinate
  • UPC, Titanium, Split-C are some examples
  • Global Address Space programming is an
    intermediate point between message passing and
    shared memory
  • Most common on the Cray T3E, which had some hardware support for remote reads/writes
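
A sketch of this style in C, using hypothetical one-sided remote_read / remote_write / barrier operations (these names are placeholders, not the actual UPC, Titanium, or Split-C syntax): data is partitioned so element i of the shared array lives on process i, processes communicate by writing shared data directly, and an explicit barrier coordinates them before the data is read.

#define P 4                               /* number of processes (illustrative) */

/* Hypothetical one-sided operations; stand-ins for what a global address
   space language or the T3E's remote read/write hardware would provide. */
extern double remote_read(int proc, const double *addr);
extern void   remote_write(int proc, double *addr, double value);
extern void   barrier(void);

double s_partial[P];                      /* logically shared; element i owned by process i */

void contribute(int me, double local_sum) {
    remote_write(me, &s_partial[me], local_sum);  /* communicate by writing shared data */
    barrier();                                    /* explicit synchronization to coordinate */
}

double global_sum(void) {
    double s = 0.0;
    for (int i = 0; i < P; i++)
        s += remote_read(i, &s_partial[i]);       /* remote data stays remote until read */
    return s;
}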

22
Programming Model 3 Data Parallel
  • Single thread of control consisting of parallel
    operations.
  • Parallel operations applied to all (or a defined
    subset) of a data structure, usually an array
  • Communication is implicit in parallel operators
  • Elegant and easy to understand and reason about
  • Coordination is implicit: statements are executed synchronously
  • Drawbacks
  • Not all problems fit this model
  • Difficult to map onto coarse-grained machines

A = array of all data
fA = f(A)
s = sum(fA)
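
Read sequentially (an illustration only, not data-parallel code), the three statements above correspond to the C loops below; in the data-parallel model the loops are implicit, every element can be processed at once, and the sum is a built-in reduction.

extern double f(double);                 /* assumed element-wise function */

/* Sequential meaning of:  A = array of all data; fA = f(A); s = sum(fA). */
double data_parallel_sum(const double *A, double *fA, int n)
{
    for (int i = 0; i < n; i++)          /* fA = f(A): element-wise map */
        fA[i] = f(A[i]);
    double s = 0.0;
    for (int i = 0; i < n; i++)          /* s = sum(fA): reduction */
        s += fA[i];
    return s;
}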
23
Machine Model 3a SIMD System
  • A large number of (usually) small processors.
  • A single control processor issues each
    instruction.
  • Each processor executes the same instruction.
  • Some processors may be turned off on some
    instructions.
  • Machines are not popular (CM2), but programming
    model is.

(Diagram: a control processor issues each instruction to an array of processors connected by an interconnect.)
  • Implemented by mapping n-fold parallelism to p
    processors.
  • Mostly done in the compilers (e.g., HPF).

24
Model 3b Vector Machines
  • Vector architectures are based on a single
    processor
  • Multiple functional units
  • All performing the same operation
  • Instructions may specify large amounts of parallelism (e.g., 64-way), but the hardware executes only a subset in parallel
  • Historically important
  • Overtaken by MPPs in the 90s
  • Still visible as a processor architecture within
    an SMP

25
Earth Simulator Architecture Optimizing for the
full range of tasks
  • Parallel Vector Architecture
  • High speed (vector) processors
  • High memory bandwidth (vector architecture)
  • Fast network (new crossbar switch)

Rearranging commodity parts can't match this performance
26
Earth Simulator Configuration of a General
Purpose Supercomputer
  • 640 nodes
  • 8 vector processors of 8 Gflop/s each and 16 GB of shared memory per node.
  • Total of 5,120 processors
  • Total 40 Tflop/s peak performance
  • Main memory 10 TB
  • High bandwidth (32 GB/s), low latency network
    connecting nodes.
  • Disk
  • 450 TB for systems operations
  • 250 TB for users.
  • Mass storage system: 12 Automatic Cartridge Systems (U.S.-made STK PowderHorn 9310); total storage capacity is approximately 1.6 PB.

27
Cray SV2 Parallel Vector Architecture
  • 12.8 Gflop/s Vector processors
  • 4-processor nodes sharing up to 64 GB of memory
  • Single System Image to 4096 Processors
  • 64 CPUs/800 GFLOPS in LC cabinet

28
Machine Model 4 Clusters of SMPs
  • SMPs are the fastest commodity machine, so use
    them as a building block for a larger machine
    with a network
  • Common names
  • CLUMP = Cluster of SMPs
  • Hierarchical machines, constellations
  • Most modern machines look like this
  • Millennium, IBM SPs (not the T3E), ...
  • What is an appropriate programming model 4?
  • Treat machine as flat, always use message
    passing, even within SMP (simple, but ignores an
    important part of memory hierarchy).
  • Shared memory within one SMP, but message passing
    outside of an SMP.

29
Cluster of SMP Approach
  • A supercomputer is a stretched high-end server
  • A parallel system is built by assembling nodes that are modest-size, commercial SMP servers; just put more of them together

Image from LLNL
30
NERSC-3 Vital Statistics
  • 5 Teraflop/s peak performance; 3.05 Teraflop/s with Linpack
  • 208 nodes, 16 CPUs per node at 1.5 Gflop/s per
    CPU
  • Worst case: Sustained System Performance measure of 0.358 Tflop/s (7.2%)
  • Best case: Gordon Bell submission, 2.46 Tflop/s on 134 nodes (77%)
  • 4.5 TB of main memory
  • 140 nodes with 16 GB each, 64 nodes with 32 GB, and 4 nodes with 64 GB.
  • 40 TB total disk space
  • 20 TB formatted shared, global, parallel file space; 15 TB local disk for system usage
  • Unique 512-way Double/Single switch configuration

31
Outline
  • Overview of parallel machines and programming
    models
  • Shared memory
  • Shared address space
  • Message passing
  • Data parallel
  • Clusters of SMPs
  • Trends in real machines

32
TOP500
  • Listing of the 500 most powerful computers in the world
  • Yardstick: Rmax from LINPACK (Ax = b, dense problem)
  • Updated twice a year: ISC'xy in Germany in June, SC'xy in the USA in November
  • All data available from www.top500.org
33
TOP500 list - Data shown
  • Manufacturer: manufacturer or vendor
  • Computer: type indicated by manufacturer or vendor
  • Installation Site: customer
  • Location: location and country
  • Year: year of installation / last major update
  • Customer Segment: Academic, Research, Industry, Vendor, Classified
  • Processors: number of processors
  • Rmax: maximal LINPACK performance achieved
  • Rpeak: theoretical peak performance
  • Nmax: problem size for achieving Rmax
  • N1/2: problem size for achieving half of Rmax
  • Nworld: position within the TOP500 ranking

34
TOP10
35
TOP500 - Performance
36
Analysis of TOP500 Data
  • Annual performance growth is about a factor of 1.82
  • Two factors contribute almost equally to the annual total performance growth:
  • Processor count grows per year on average by a factor of 1.30, and
  • Processor performance grows by a factor of 1.40 (compared to 1.58 for Moore's Law); note that 1.30 x 1.40 = 1.82
  • Strohmaier, Dongarra, Meuer, and Simon, Parallel Computing 25, 1999, pp. 1517-1544.

37
Manufacturers
38
Processor Type
39
Chip Technology
40
Chip Technology
41
Architectures
42
NOW - Cluster
43
Summary
  • Historically, each parallel machine was unique,
    along with its programming model and programming
    language.
  • It was necessary to throw away software and start
    over with each new kind of machine.
  • Now we distinguish the programming model from the
    underlying machine, so we can write portably
    correct codes that run on many machines.
  • MPI now the most portable option, but can be
    tedious.
  • Writing portably fast code requires tuning for
    the architecture.
  • Algorithm design challenge is to make this
    process easy.
  • Example: picking a block size, not rewriting the whole algorithm.