Scalable Parallel Architectures - PowerPoint PPT Presentation

About This Presentation
Title: Scalable Parallel Architectures
Description: Scalable Parallel Architectures and their Software
Slides: 71
Provided by: MG96
Learn more at: http://rohan.sdsu.edu
Transcript and Presenter's Notes

Title: Scalable Parallel Architectures


1

Scalable Parallel Architectures and their
Software

2
Introduction
  • Overview of RISC CPUs, Memory Hierarchy
  • Parallel Systems - General Hardware Layout (SMP,
    Distributed, Hybrid)
  • Communications Networks for Parallel Systems
  • Parallel I/O
  • Operating Systems Concepts
  • Overview of Parallel Programming Methodologies
  • Distributed Memory
  • Shared-Memory
  • Hardware Specifics of NPACI Parallel Machines
  • IBM SP Blue Horizon
  • New CPU Architectures
  • IBM Power 4
  • Intel IA-64

3
What is Parallel Computing?
  • Parallel computing: the use of multiple computers,
    processors, or processes working together on a
    common task
  • Each processor works on its section of the
    problem
  • Processors are allowed to exchange information
    (data in local memory) with other processors

[Figure: a grid of the problem to be solved, divided into four regions; CPUs 1-4 each work on their own region and exchange boundary data with neighbors in the x and y directions.]
4
Why Parallel Computing?
  • Limits of single-CPU computing
  • Available memory
  • Performance - usually time to solution
  • Limits of vector computers (the main HPC alternative)
  • System cost, including maintenance
  • Cost/MFlop
  • Parallel computing allows
  • Solving problems that don't fit on a single CPU
  • Solving problems that can't be solved in a
    reasonable time on one CPU
  • We can run
  • Larger problems
  • Finer resolution
  • Faster
  • More cases


5
Scalable Parallel Computer Systems
  • (Scalable) (# CPUs) + (Memory) + (I/O) +
    (Interconnect) + (OS) = Scalable Parallel
    Computer System

6
Scalable Parallel Computer Systems
  • Scalability: A parallel system is scalable if it
    is capable of providing enhanced resources to
    accommodate increasing performance and/or
    functionality
  • Resource scalability: scalability achieved by
    increasing machine size (# CPUs, memory, I/O,
    network, etc.)
  • Application scalability: scalability with
  • machine size
  • problem size

7
Shared and Distributed Memory Systems
  • Multiprocessor (Shared memory)
  • Single address space. All processors
    have access to a pool of shared memory.
  • Examples: SUN HPC, CRAY T90, NEC SX-6
  • Methods of memory access:
  • - Bus
  • - Crossbar
  • Multicomputer (Distributed memory)
  • Each processor has its own local memory.
  • Examples: CRAY T3E, IBM SP2, PC Cluster

8
Hybrid (SMP Clusters) Systems
Hybrid Architecture: Processes share memory
on-node, may/must use message-passing off-node,
and may share off-node memory. Examples: IBM SP
Blue Horizon, SGI Origin, Compaq AlphaServer
9
RISC-Based Computer Hardware Concepts
  • RISC CPUs: the most common CPUs in HPC; many design
    concepts transferred from vector CPUs to RISC to
    CISC
  • Multiple Functional Units
  • Pipelined Instructions
  • Memory Hierarchy
  • Instructions typically take 1 to several CPU clock
    cycles
  • Clock cycles provide the time scale for measurement
  • Data transfers: memory-to-CPU, network, I/O, etc.

10
Processor Related Terms
Laura C. Nett: The instruction set is just how each
operation is processed; e.g., x = y + a: load y and a,
add y and a, and put the result in x
  • RISC: Reduced Instruction Set Computer
  • PIPELINE: Technique in which multiple instructions
    are overlapped in execution
  • SUPERSCALAR: Computer design feature - multiple
    instructions can be executed per clock period

11
Typical RISC CPU
[Figure: a typical RISC CPU chip: registers and memory/cache feed the functional units (FP Add, FP Multiply, FP Multiply-Add, FP Divide).]
12
Functional Unit
13
Dual Hardware Pipes
[Figure: dual hardware pipes computing A(I) from C(I) and D(I): odd elements C(I), D(I) go through one pipe while even elements C(I+1), D(I+1) go through the other, producing A(I) and A(I+1) together.]
14
RISC Memory/Cache Related Terms
  • ICACHE: Instruction cache
  • DCACHE (Level 1): Data cache closest to
    registers
  • SCACHE (Level 2): Secondary data cache
  • Data from SCACHE has to go through DCACHE to
    registers
  • SCACHE is larger than DCACHE
  • Not all processors have an SCACHE
  • CACHE LINE: Minimum transfer unit (usually in
    bytes) for moving data between different levels
    of the memory hierarchy
  • TLB: Translation look-aside buffer; keeps the
    addresses of pages (blocks of memory) in main
    memory that have been recently accessed
  • MEMORY BANDWIDTH: Transfer rate (in MBytes/sec)
    between different levels of memory
  • MEMORY ACCESS TIME: Time required (often measured
    in clock cycles) to bring data items from one
    level in memory to another
  • CACHE COHERENCY: Mechanism for ensuring data
    consistency of shared variables across the memory
    hierarchy

15
RISC CPU, CACHE, and MEMORY Basic Layout
[Figure: basic layout: CPU registers -> Level 1 cache -> Level 2 cache -> main memory.]
17
RISC Memory/Cache Related Terms (cont.)
  • Direct mapped cache: A block from main memory can
    go in exactly one place in the cache. This is
    called direct mapped because there is a direct
    mapping from any block address in memory to a
    single location in the cache.

[Figure: each main-memory block maps to exactly one cache location.]
18
RISC Memory/Cache Related Terms (cont.)
  • Fully associative cache: A block from main
    memory can be placed in any location in the
    cache. This is called fully associative because a
    block in main memory may be associated with any
    entry in the cache.

[Figure: a main-memory block may be placed in any cache location.]
19
RISC Memory/Cache Related Terms (cont.)
  • Set associative cache: The middle range of
    designs between direct mapped and fully
    associative caches is called set-associative. In
    an n-way set-associative cache, a block from main
    memory can go into n (n at least 2) locations in
    the cache.

[Figure: 2-way set-associative cache: each main-memory block maps to one of two cache locations.]
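
To make the mapping concrete, here is a minimal C sketch (the cache geometry is hypothetical, not taken from the slides) of how a byte address splits into tag, set index, and line offset for an n-way set-associative cache; WAYS = 1 gives direct mapped, NUM_SETS = 1 gives fully associative:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical cache geometry: 32 KB, 4-way set associative, 64-byte lines. */
#define CACHE_BYTES (32 * 1024)
#define WAYS        4
#define LINE_BYTES  64
#define NUM_SETS    (CACHE_BYTES / (WAYS * LINE_BYTES))

int main(void) {
    uint64_t addr   = 0x12345678;            /* example byte address   */
    uint64_t offset = addr % LINE_BYTES;     /* byte within cache line */
    uint64_t block  = addr / LINE_BYTES;     /* memory block number    */
    uint64_t set    = block % NUM_SETS;      /* which set it maps to   */
    uint64_t tag    = block / NUM_SETS;      /* identifies the block   */

    printf("addr 0x%llx -> set %llu, tag 0x%llx, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)set,
           (unsigned long long)tag, (unsigned long long)offset);
    return 0;
}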
20
RISC Memory/Cache Related Terms
  • The data cache was designed to allow programmers
    to take advantage of common data access patterns
    (a short loop sketch follows this list)
  • Spatial Locality
  • When an array element is referenced, its
    neighbors are likely to be referenced
  • Cache lines are fetched together
  • Work on consecutive data elements in the same
    cache line
  • Temporal Locality
  • When an array element is referenced, it is likely
    to be referenced again soon
  • Arrange code so that data in cache is reused as
    often as possible
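
A minimal C sketch of these patterns (array name and size are illustrative): C stores a 2-D array row by row, so traversing the last index innermost walks consecutive elements of the same cache line, while swapping the loops strides through memory and misses far more often.

#include <stdio.h>

#define N 1024
static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Cache-friendly: the inner loop touches a[i][0..N-1], which are
       contiguous in memory and share cache lines (spatial locality).  */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Cache-unfriendly: the inner loop strides by N doubles per step,
       so nearly every access starts a new cache line.                 */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("sum = %f\n", sum);
    return 0;
}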

21
Typical RISC Floating-Point Operation Times
  • IBM POWER3-II
  • CPU Clock Speed: 375 MHz (cycle time ~3 ns)

Instruction           32-Bit (cycles)   64-Bit (cycles)
FP Multiply or Add    3-4               3-4
FP Multiply-Add       3-4               3-4
FP Square Root        14-23             22-31
FP Divide             14-21             18-25
22
Typical RISC Memory Access Times
  • IBM POWER3-II

Access                       Bandwidth (GB/s)   Time (Cycles)
Load Register from L1        3.2                1
Store Register to L1         1.6                1
Load/Store L1 from/to L2     6.4                9
Load/Store L1 from/to RAM    1.6                35
23
Single CPU Optimization
  • Optimization of serial (single CPU) version is
    very important
  • Want to parallelize best serial version where
    appropriate

24
New CPUs in HPC
  • New CPU designs with new features
  • IBM POWER 4
  • U Texas Regatta nodes (covered on Wednesday)
  • Intel IA-64
  • SDSC DTF TeraGrid PC Linux Cluster

25
Parallel Networks
  • The network's function is to transfer data from
    source to destination in support of the network
    transactions used to realize the supported
    programming model(s).
  • Data transfers can be for message-passing and/or
    shared-memory operations.
  • Network Terminology
  • Common Parallel Networks

26
System Interconnect Topologies
Send information among CPUs through a network.
The best choice would be a fully connected network,
in which each processor has a direct link to every
other processor. This type of network would be very
expensive and difficult to scale (~N*N links).
Instead, processors are arranged in some variation
of a mesh, torus, hypercube, etc.:
2-D Mesh
2-D Torus
3-D Hypercube
27
Network Terminology
  • Network Latency: Time taken to begin sending a
    message. Units: microseconds, milliseconds, etc.
    Smaller is better.
  • Network Bandwidth: Rate at which data is
    transferred from one point to another. Units:
    bytes/sec, Mbytes/sec, etc. Larger is better.
  • May vary with data size

[Figure: latency and bandwidth figures for the IBM Blue Horizon.]
28
Network Terminology
  • Bus
  • Shared data path
  • Data requests require exclusive access
  • Complexity O(N)
  • Not scalable: Bandwidth O(1)
  • Crossbar Switch
  • Non-blocking switching grid among network
    elements
  • Bandwidth O(N)
  • Complexity O(N*N)
  • Multistage Interconnection Network (MIN)
  • Hierarchy of switching networks; e.g., Omega
    network for N CPUs, N memory banks; complexity
    O(ln(N))

29
Network Terminology (Continued)
  • Diameter: maximum distance (in nodes) between
    any two processors
  • Connectivity: number of distinct paths between
    any two processors
  • Channel width: maximum number of bits that can
    be simultaneously sent over the link connecting two
    processors = number of physical wires in each
    link
  • Channel rate: peak rate at which data can be
    sent over a single physical wire
  • Channel bandwidth: peak rate at which data can
    be sent over a link = (channel rate) x (channel
    width)
  • Bisection width: minimum number of links that
    have to be removed to partition the network into two
    equal halves
  • Bisection bandwidth: maximum amount of data
    between any two halves of the network connecting
    equal numbers of CPUs = (bisection width) x
    (channel bandwidth); a small worked example
    follows this list
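
As a worked example of the last two terms, here is a small C sketch; the channel rate, channel width, and the k x k 2-D torus are all hypothetical choices, using the fact that cutting such a torus in half severs 2k links:

#include <stdio.h>

int main(void) {
    /* Hypothetical link parameters. */
    double channel_rate_mbps = 500.0;  /* peak rate per wire, Mbit/s */
    int    channel_width     = 16;     /* wires (bits) per link      */

    /* Channel bandwidth = (channel rate) x (channel width). */
    double channel_bw_mbps = channel_rate_mbps * channel_width;

    /* Hypothetical 2-D torus of k x k nodes: bisecting it cuts 2 links
       per row (one direct, one wraparound), i.e., 2k links in total.  */
    int k = 8;
    int bisection_width = 2 * k;

    /* Bisection bandwidth = (bisection width) x (channel bandwidth). */
    double bisection_bw_mbps = bisection_width * channel_bw_mbps;

    printf("channel bandwidth   = %.0f Mbit/s\n", channel_bw_mbps);
    printf("bisection width     = %d links\n", bisection_width);
    printf("bisection bandwidth = %.0f Mbit/s\n", bisection_bw_mbps);
    return 0;
}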

30
Communication Overhead
  • Time to send a message of M bytes, simple form:
  • Tcomm = TL + M*Td + Tcontention
  • TL = message latency
  • Td = time per byte = 1 byte/bandwidth
  • Tcontention takes into account other network
    traffic (a numerical sketch follows this list)
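
A quick numerical sketch of this model in C; the latency and bandwidth values are hypothetical placeholders, not measured figures for any of the machines above:

#include <stdio.h>

/* Simple communication-cost model: Tcomm = TL + M*Td + Tcontention. */
int main(void) {
    double TL          = 20e-6;            /* hypothetical latency: 20 us      */
    double bandwidth   = 350e6;            /* hypothetical bandwidth: 350 MB/s */
    double Td          = 1.0 / bandwidth;  /* time to transfer one byte        */
    double Tcontention = 0.0;              /* ignore other traffic here        */

    double M     = 1.0e6;                  /* message size: 1 MB               */
    double Tcomm = TL + M * Td + Tcontention;

    printf("estimated time for a %.0f-byte message: %.6f s\n", M, Tcomm);
    return 0;
}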

32
Parallel I/O
  • I/O can be a limiting factor in a parallel
    application
  • I/O system properties: capacity, bandwidth,
    access time
  • Need support for parallel I/O in the programming
    system
  • Need underlying hardware and system support for
    parallel I/O
  • IBM GPFS: a low-level API for developing
    high-level parallel I/O functionality (MPI I/O,
    HDF5, etc.)

33
Unix OS Concepts for Parallel Programming
  • Most operating systems used by parallel computers
    are Unix-based
  • Unix Process (task)
  • Executable code
  • Instruction pointer
  • Stack
  • Logical registers
  • Heap
  • Private address space
  • Task forking to create dependent processes:
    thousands of clock cycles
  • Thread: lightweight process
  • Logical registers
  • Stack
  • Shared address space
  • Hundreds of clock cycles to
    create/destroy/synchronize threads (a small
    process-vs-thread sketch follows this list)
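
A minimal, hedged C sketch of the contrast (illustrative only; compile with -lpthread): a process created with fork() gets a private copy of the address space, while a thread created with pthread_create() shares it.

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <pthread.h>

static int shared_value = 0;      /* shared by threads, copied by fork() */

static void *thread_work(void *arg) {
    (void)arg;
    shared_value = 42;            /* threads share the address space     */
    return NULL;
}

int main(void) {
    /* Process: the child gets a private copy of the address space. */
    pid_t pid = fork();
    if (pid == 0) {
        shared_value = 99;        /* changes only the child's copy       */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("after fork, parent sees shared_value = %d\n", shared_value); /* 0 */

    /* Thread: shares the parent's address space. */
    pthread_t tid;
    pthread_create(&tid, NULL, thread_work, NULL);
    pthread_join(tid, NULL);
    printf("after thread, shared_value = %d\n", shared_value);           /* 42 */
    return 0;
}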

34
Parallel Computer Architectures (Flynn Taxonomy)
[Figure: Flynn taxonomy: control mechanism (SIMD, MIMD) crossed with memory model (shared-memory, distributed-memory, hybrid SMP cluster).]
35
Hardware Architecture Models for Design of
Parallel Programs
  • Sequential computers - the von Neumann model (RAM)
    is the universal computational model
  • Parallel computers - no single model exists
  • The model must be sufficiently general to encapsulate
    the hardware features of parallel systems
  • Programs designed from the model must execute
    efficiently on real parallel systems

36
Designing and Building Parallel Applications
Donald Frederick (frederik@sdsc.edu), San Diego
Supercomputer Center
37
What is Parallel Computing?
  • Parallel computing: the use of multiple computers,
    processors, or processes concurrently working
    together on a common task
  • Each processor/process works on its section of
    the problem
  • Processors/processes are allowed to exchange
    information (data in local memory) with other
    processors/processes

[Figure: a grid of the problem to be solved, divided into four regions; CPUs 1-4 each work on their own region and exchange boundary data with neighbors in the x and y directions.]
38
Shared and Distributed Memory Systems
Multiprocessor (Shared memory) - Single address
space. Processes have access to a pool of
shared memory. Single OS.
Multicomputer (Distributed memory) - Each
processor has its own local memory.
Processes usually do message passing to exchange
data among processors. Usually multiple copies
of the OS.
39
Hybrid (SMP Clusters) System
  • Must/may use message-passing
  • Single or multiple OS copies
  • Node-local operations are less costly
    than off-node operations

40
Unix OS Concepts for Parallel Programming
  • Most Operating Systems used are Unix-based
  • Unix Process (task)
  • Executable code
  • Instruction pointer
  • Stack
  • Logical registers
  • Heap
  • Private address space
  • Task forking to create dependent processes:
    thousands of clock cycles
  • Thread: lightweight process
  • Logical registers
  • Stack
  • Shared address space
  • Hundreds of clock cycles to
    create/destroy/synchronize threads

41
Generic Parallel Programming Models
  • Single Program Multiple Data Stream (SPMD)
  • Each CPU accesses the same object code
  • Same application run on different data
  • Data exchange may be handled explicitly/implicitly
  • Natural model for SIMD machines
  • Most commonly used generic parallel programming
    model (a minimal MPI-style sketch follows this list)
  • Message-passing
  • Shared-memory
  • Usually uses the process/task ID to differentiate
  • Focus of the remainder of this section
  • Multiple Program Multiple Data Stream (MPMD)
  • Each CPU accesses different object code
  • Each CPU has only the data/instructions it needs
  • Natural model for MIMD machines
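
A minimal SPMD sketch in C with MPI (the work-splitting scheme and variable names are illustrative, not from the slides): every rank runs the same executable and uses its rank ID to pick its share of the data.

#include <stdio.h>
#include <mpi.h>

/* Each rank sums its own slice of 1..N; rank 0 collects the total. */
int main(int argc, char **argv) {
    const long N = 1000000;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block decomposition by rank: same code, different data. */
    long chunk = N / size;
    long lo = rank * chunk + 1;
    long hi = (rank == size - 1) ? N : lo + chunk - 1;

    double local = 0.0, total = 0.0;
    for (long i = lo; i <= hi; i++)
        local += (double)i;

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum 1..%ld = %.0f\n", N, total);

    MPI_Finalize();
    return 0;
}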

42
Parallel Architectures: Mapping Hardware
Models to Programming Models

[Figure: control mechanism (SIMD, MIMD) and memory model (shared-memory, distributed-memory, hybrid SMP cluster) map onto the SPMD and MPMD programming models.]
43
Methods of Problem Decomposition for Parallel
Programming
  • Want to map (Problem + Algorithms + Data) onto the
    Architecture
  • Conceptualize the mapping via, e.g., pseudocode
  • Realize the mapping via a programming language
  • Data Decomposition - data-parallel program
  • Each processor performs the same task on
    different data
  • Example - grid problems
  • Task (Functional) Decomposition - task-parallel
    program
  • Each processor performs a different task
  • Example - signal processing: adding/subtracting
    frequencies from a spectrum
  • Other decomposition methods


44
Designing and Building Parallel Applications
  • Generic Problem Architectures
  • Design and Construction Principles
  • Incorporate Computer Science Algorithms
  • Use Parallel Numerical Libraries Where Possible

45
Designing and Building Parallel Applications
  • Knowing when (not) to parallelize is very important
  • Cherri Pancake's rules, summarized:
  • Frequency of Use
  • Execution Time
  • Resolution Needs
  • Problem Size

46
Categories of Parallel Problems
  • Generic Parallel Problem Architectures (after
    G. Fox)
  • Ideally Parallel (Embarrassingly Parallel,
    Job-Level Parallel)
  • Same application run on different data
  • Could be run on separate machines
  • Example: Parameter studies
  • Almost Ideally Parallel
  • Similar to the Ideal case, but with minimal
    coordination required
  • Example: Linear Monte Carlo calculations,
    integrals
  • Pipeline Parallelism
  • Problem divided into tasks that have to be
    completed sequentially
  • Can be transformed into partially sequential
    tasks
  • Example: DSP filtering
  • Synchronous Parallelism
  • Each operation performed on all/most of the data
  • Operations depend on the results of prior operations
  • All processes must be synchronized at regular
    points
  • Example: Modeling atmospheric dynamics
  • Loosely Synchronous Parallelism
  • Similar to the Synchronous case, but with minimal
    intermittent data sharing

47
Designing and Building Parallel Applications
  • Attributes of Parallel Algorithms
  • Concurrency - many actions performed
    simultaneously
  • Modularity - decomposition of complex entities
    into simpler components
  • Locality - want a high ratio of local memory
    access to remote memory access
  • Usually want to minimize the
    communication/computation ratio
  • Performance
  • Measures of algorithmic efficiency
  • Execution time
  • Complexity (usually ~ execution time)
  • Scalability

48
Designing and Building Parallel Applications
  • Partitioning - Break the main task down into smaller
    ones, either identical or disjoint.
  • Communication phase - Determine communication
    patterns for task coordination and communication
    algorithms.
  • Agglomeration - Evaluate task and/or
    communication structures with respect to performance
    and implementation costs. Tasks may be combined to
    improve performance or reduce communication
    costs.
  • Mapping - Assign tasks to processors so as to maximize
    processor utilization and minimize communication
    costs. Mapping may be either static or dynamic.
  • May have to iterate the whole process until satisfied
    with the expected performance
  • Consider writing the application in parallel, using
    either SPMD message-passing or shared-memory
  • Implementation (software + hardware) may require a
    revisit, additional refinement, or re-design

49
Designing and Building Parallel Applications
  • Partitioning
  • Geometric or physical decomposition (domain
    decomposition) - partition the data associated with
    the problem
  • Functional (task) decomposition - partition into
    disjoint tasks associated with the problem
  • Divide and conquer - partition the problem into two
    simpler problems of approximately equivalent
    size; iterate to produce a set of indivisible
    sub-problems

50
Generic Parallel Programming Software Systems
  • Message-Passing
  • Local tasks, each encapsulating local data
  • Explicit data exchange
  • Supports both SPMD and MPMD
  • Supports both task and data decomposition
  • Most commonly used
  • Process-based, but for performance, processes
    should be running on separate CPUs
  • Example APIs: MPI, PVM message-passing libraries
  • MP systems, in particular MPI, will be the focus of
    the remainder of the workshop
  • Data Parallel
  • Usually SPMD
  • Supports data decomposition
  • Data mapping to CPUs may be either
    implicit or explicit
  • Example: HPF compiler
  • Shared-Memory
  • Tasks share a common address space
  • No explicit transfer of data - supports both task
    and data decomposition
  • Can be SPMD or MPMD
  • Thread-based, but for performance, threads should
    be running on separate CPUs

51
Programming Methodologies - Practical Aspects
  • The bulk of parallel programs are written in Fortran,
    C, or C++
  • Generally the best compiler and tool support for
    parallel program development
  • The bulk of parallel programs use message-passing
    with MPI
  • Performance, portability, mature compilers,
    libraries for parallel program development
  • Data and/or tasks are split up onto different
    processors by:
  • Distributing the data/tasks onto different CPUs,
    each with local memory (MPPs, MPI)
  • Distributing the work of each loop to different CPUs
    (SMPs, OpenMP, Pthreads); a loop-level sketch
    follows this list
  • Hybrid - distribute data onto SMPs and then
    within each SMP distribute the work of each loop (or
    task) to different CPUs within the box
    (SMP cluster, MPI+OpenMP)
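
A minimal sketch of the loop-level (SMP/OpenMP) style in C; the arrays and the loop body are illustrative, not taken from the slides. OpenMP splits the loop iterations among the threads of one node:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    double sum = 0.0;

    /* Initialize the arrays serially. */
    for (int i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = (double)i;
    }

    /* The iterations are distributed across the node's threads; the
       reduction clause combines the per-thread partial sums.         */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %.0f (max threads: %d)\n",
           sum, omp_get_max_threads());
    return 0;
}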

52
Typical Data Decomposition for Parallelism
  • Example: Solve the 2-D wave equation

[Equations and figure: the original partial differential equation and its finite-difference approximation on an x-y grid.]
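
The slide's equations were images and did not survive conversion; as a hedged reconstruction, the standard 2-D wave equation and a centered finite-difference update (wave speed c, grid spacing h in x and y, time step \Delta t) are:

\frac{\partial^2 u}{\partial t^2} = c^2 \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right)

u^{n+1}_{i,j} = 2u^{n}_{i,j} - u^{n-1}_{i,j}
  + \left( \frac{c\,\Delta t}{h} \right)^2
    \left( u^{n}_{i+1,j} + u^{n}_{i-1,j} + u^{n}_{i,j+1} + u^{n}_{i,j-1} - 4u^{n}_{i,j} \right)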
53
Sending Data Between CPUs
Finite Difference Approximation
Sample Pseudo Code
if (taskid == 0) then
   li = 1; ui = 25; lj = 1; uj = 25
   send(1:25) = f(25, 1:25)
elseif (taskid == 1) then ...
elseif (taskid == 2) then ...
elseif (taskid == 3) then ...
end if
do j = lj, uj
   do i = li, ui
      work on f(i,j)
   end do
end do
[Figure: the grid split into four 25 x 25 blocks, one per PE (i = 1-25 or 26-50 crossed with j = 1-25 or 26-50); neighboring PEs exchange the boundary rows and columns i = 25/26 and j = 25/26 in the x and y directions.]
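
A hedged C/MPI sketch of the boundary exchange this figure implies; for brevity it uses a 1-D decomposition (one strip of rows per rank) rather than the 2 x 2 split above, and all sizes are illustrative:

#include <stdio.h>
#include <mpi.h>

#define NX 25
#define NY 50

/* Each rank owns rows 1..NX of f; rows 0 and NX+1 are ghost rows filled
   with copies of the neighbouring ranks' boundary rows.                */
int main(int argc, char **argv) {
    static double f[NX + 2][NY];
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int up   = (rank + 1 < size) ? rank + 1 : MPI_PROC_NULL;
    int down = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;

    /* Send my top interior row up and receive my lower ghost row from
       below, then the reverse; MPI_PROC_NULL makes the edges no-ops.  */
    MPI_Sendrecv(f[NX], NY, MPI_DOUBLE, up,   0,
                 f[0],  NY, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(f[1],      NY, MPI_DOUBLE, down, 1,
                 f[NX + 1], NY, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0) printf("halo exchange done on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}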
54
Typical Task Parallel Decomposition
[Figure: signal-processing pipeline: SPECTRUM IN -> Process 0 (subtract frequency f1) -> Process 1 (subtract frequency f2) -> Process 2 (subtract frequency f3) -> SPECTRUM OUT.]
  • Signal processing
  • Use one processor for each independent task
  • Can use more processors if one is overloaded

55
Basics of Task Parallel Decomposition - SPMD
  • Same program will run on 2 different CPUs
  • Task decomposition analysis has defined 2 tasks
    (a and b) to be done by 2 CPUs

program.f
   initialize
   ...
   if (TaskID == A) then
      do task a
   elseif (TaskID == B) then
      do task b
   end if
   ...
   end program

Task A Execution Stream: program.f - initialize; do task a; end program
Task B Execution Stream: program.f - initialize; do task b; end program
56
Multi-Level Task Parallelism
threads
  • Program tskpar
  • Implicit none
  • (declarations)
  • Do loop 1
  • par block
  • End task 1
  • (serial work)
  • Do loop 2
  • par block
  • End task 2
  • (serial work)

[Figure: several copies of the tskpar program run on different nodes; within each copy, threads execute the parallel blocks of loops 1 and 2, while MPI handles communication among the copies over the network. Implementation: MPI and OpenMP.]
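
A hedged sketch of this multi-level pattern in C, combining MPI between copies of the program and OpenMP threads inside each copy (the loop body is illustrative):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define N 100000

int main(int argc, char **argv) {
    int rank, size, provided;
    double local = 0.0, total = 0.0;

    /* Typically one MPI process per node; MPI_THREAD_FUNNELED means only
       the master thread makes MPI calls.                                */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* "Do loop, par block": OpenMP threads share this rank's iterations. */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < N; i += size)
        local += 1.0 / (1.0 + i);

    /* Message-passing between the program copies over the network. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f on %d MPI ranks x up to %d threads\n",
               total, size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}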
57
Parallel Application Performance Concepts
  • Parallel Speedup
  • Parallel Efficiency
  • Parallel Overhead
  • Limits on Parallel Performance

58
Parallel Application Performance Concepts
  • Parallel Speedup - ratio of the best sequential time
    to the parallel execution time
  • S(n) = ts/tp
  • Parallel Efficiency - fraction of the time the
    processors are in use
  • E(n) = ts/(tp*n) = S(n)/n
  • Parallel Overhead
  • Communication time (Message-Passing)
  • Process creation/synchronization (MP)
  • Extra code to support parallelism, such as load
    balancing
  • Thread creation/coordination time (SMP)
  • Limits on Parallel Performance

60
Limits of Parallel Computing
  • Theoretical upper limits
  • Amdahl's Law
  • Gustafson's Law
  • Practical limits
  • Communication overhead
  • Synchronization overhead
  • Extra operations necessary for parallel version
  • Other Considerations
  • Time used to re-write (existing) code

61
Parallel Computing - Theoretical Performance
Upper Limits
  • All parallel programs contain
  • Parallel sections
  • Serial sections
  • Serial sections limit the parallel performance
  • Amdahl's Law provides a theoretical upper limit
    on parallel performance for size-constant
    problems

62
Amdahl's Law
  • Amdahl's Law places a strict limit on the speedup
    that can be realized by using multiple processors
  • Effect of multiple processors on run time for
    size-constant problems
  • Effect of multiple processors on parallel
    speedup, S (the standard forms are reconstructed
    after this list)
  • Where:
  • fs = serial fraction of code
  • fp = parallel fraction of code
  • N = number of processors
  • t1 = sequential execution time
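
The slide's equations were images and did not survive conversion; a hedged reconstruction from the definitions above (the standard Amdahl forms) is:

t_N = t_1 \left( f_s + \frac{f_p}{N} \right), \qquad
S = \frac{t_1}{t_N} = \frac{1}{f_s + f_p/N} \;\to\; \frac{1}{f_s} \text{ as } N \to \infty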

63
Amdahl's Law
64
Amdahl's Law (Continued)
65
Gustafson's Law
  • Consider scaling the problem size as the processor
    count is increased
  • Ts = serial execution time
  • Tp(N,W) = parallel execution time for the same
    problem, size W, on N CPUs
  • S(N,W) = speedup on problem size W with N CPUs
  • S(N,W) = (Ts + Tp(1,W)) / (Ts + Tp(N,W))
  • Consider the case where Tp(N,W) = W*W/N:
  • S(N,W) = (N*Ts + N*W*W) / (N*Ts + W*W) -> N
  • Gustafson's Law provides some hope for parallel
    applications to deliver on the promise.

66
Parallel Programming Analysis - Example
  • Consider solving the 2-D Poisson equation by an
    iterative method on a regular grid with M points
    (a sketch of the usual analysis follows)
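
The slide's worked analysis was lost in conversion; as a hedged sketch of how such an analysis typically goes (assuming a 5-point stencil iteration and an even split of the M grid points into square subdomains across N CPUs), per iteration each CPU computes on its own points and communicates its subdomain boundary:

T_{\mathrm{comp}} \propto \frac{M}{N}, \qquad
T_{\mathrm{comm}} \propto \sqrt{\frac{M}{N}}, \qquad
\frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} \propto \sqrt{\frac{N}{M}}

so the communication-to-computation ratio improves as the problem size grows relative to the number of CPUs.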

67
Parallel Programming Concepts
  • The program must be correct and terminate for some
    input data set(s)
  • Race condition: result(s) depend upon the order in
    which processes/threads finish their calculation(s).
    May or may not be a problem, depending upon the
    results (a small sketch follows this list)
  • Deadlock: a process/thread requests a resource it
    will never get. To be avoided; a common problem in
    message-passing parallel programs
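
A minimal, hedged C sketch of a race condition (illustrative only; compile with -lpthread): two threads increment a shared counter without synchronization, so updates can be lost and the final value depends on thread timing.

#include <stdio.h>
#include <pthread.h>

#define ITERS 1000000

static long counter = 0;            /* shared and unprotected */

static void *bump(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        counter++;                  /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Expected 2*ITERS, but often less: a race condition. */
    printf("counter = %ld (expected %d)\n", counter, 2 * ITERS);
    return 0;
}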

68
Other Considerations
  • Writing efficient parallel applications is
    usually more difficult than writing serial
    applications
  • The serial version may (or may not) provide a good
    starting point for the parallel version
  • Communication, synchronization, etc., can limit
    parallel performance
  • Usually want to overlap communication and
    computation to minimize the ratio of communication
    to computation time
  • Serial time can dominate
  • CPU computational load balance is important
  • Is it worth your time to rewrite an existing
    application? Or create a new one? Recall Cherri
    Pancake's rules (simplified version):
  • Do the CPU and/or memory requirements justify
    parallelization?
  • Will the code be used enough times to justify
    parallelization?


69
Parallel Programming - Real Life
  • These are the main models in use today (circa
    2002)
  • New approaches (languages, hardware, etc.) are
    likely to arise as technology advances
  • Other combinations of these models are possible
  • Large applications will probably use more than
    one model
  • The shared-memory model is closest to the
    mathematical model of the application
  • Scaling to large numbers of CPUs is a major issue

70
Parallel Computing
  • References
  • NPACI PCOMP web page - www.npaci.edu/PCOMP
  • Selected HPC link collection - categorized,
    updated
  • Online Tutorials, Books
  • Designing and Building Parallel Programs, Ian
    Foster
  • http://www-unix.mcs.anl.gov/dbpp/
  • NCSA Intro to MPI Tutorial:
    http://pacont.ncsa.uiuc.edu:8900/public/MPI/index.html
  • HLRS Parallel Programming Workshop:
    http://www.hlrs.de/organization/par/par_prog_ws/
  • Books
  • Parallel Programming, B. Wilkinson, M. Allen
  • Computer Organization and Design, D. Patterson
    and J. L. Hennessy
  • Scalable Parallel Computing, K. Hwang, Z. Xu
