1
Lecture 29: Parallel Programming Overview
2
Parallel Programming Paradigms: Various Methods
  • There are many methods of programming parallel
    computers. Two of the most common are message
    passing and data parallel.
  • Message Passing - the user makes calls to
    libraries to explicitly share information between
    processors.
  • Data Parallel - data partitioning determines
    parallelism
  • Shared Memory - multiple processes sharing common
    memory space
  • Remote Memory Operation - set of processes in
    which a process can access the memory of another
    process without its participation
  • Threads - a single process having multiple
    (concurrent) execution paths
  • Combined Models - composed of two or more of the
    above.
  • Note: these models are machine/architecture
    independent; any of the models can be implemented
    on any hardware, given appropriate operating
    system support. An effective implementation is
    one which closely matches its target hardware and
    provides the user ease of programming.

3
Parallel Programming Paradigms: Message Passing
  • The message passing model is defined as
  • set of processes using only local memory
  • processes communicate by sending and receiving
    messages
  • data transfer requires cooperative operations to
    be performed by each process (a send operation
    must have a matching receive)
  • Programming with message passing is done by
    linking with and making calls to libraries which
    manage the data exchange between processors.
    Message passing libraries are available for most
    modern programming languages.
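
A minimal sketch of this model in C with MPI (the two-process setup
and the value sent are made up for illustration): process 0 sends an
integer, process 1 posts the matching receive, and each process
otherwise touches only its own local memory.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

        if (rank == 0) {
            value = 42;
            /* a send must be matched by a receive on process 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }

(Compile with mpicc and launch with mpirun -np 2; exact commands vary
by MPI implementation.)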

4
Parallel Programming Paradigms: Data Parallel
  • The data parallel model is defined as
  • Each process works on a different part of the
    same data structure
  • Commonly a Single Program Multiple Data (SPMD)
    approach
  • Data is distributed across processors
  • All message passing is done invisibly to the
    programmer
  • Commonly built "on top of" one of the common
    message passing libraries
  • Programming with data parallel model is
    accomplished by writing a program with data
    parallel constructs and compiling it with a data
    parallel compiler.
  • The compiler converts the program into standard
    code and calls to a message passing library to
    distribute the data to all the processes.

5
Implementation of Message Passing: MPI
  • Message Passing Interface, often called MPI.
  • A standard portable message-passing library
    definition developed in 1993 by a group of
    parallel computer vendors, software writers, and
    application scientists.
  • Available to both Fortran and C programs.
  • Available on a wide variety of parallel machines.
  • Target platform is a distributed memory system
  • All inter-task communication is by message
    passing.
  • All parallelism is explicit: the programmer is
    responsible for identifying the parallelism in
    the program and implementing it with the MPI
    constructs.
  • Programming model is SPMD (Single Program
    Multiple Data)
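
A minimal SPMD sketch (work() is a hypothetical stand-in for the real
computation): every process runs this same program, and the rank
returned by MPI_Comm_rank selects which block of iterations each
process executes.

    #include <mpi.h>

    static void work(int i) { (void)i; /* hypothetical computation */ }

    int main(int argc, char *argv[])
    {
        int rank, nprocs, i, n = 100000;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* block distribution: process p handles iterations
           [p*(n/nprocs), (p+1)*(n/nprocs)); assumes nprocs divides n */
        for (i = rank * (n / nprocs); i < (rank + 1) * (n / nprocs); i++)
            work(i);

        MPI_Finalize();
        return 0;
    }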

6
Implementations F90 / High Performance Fortran
(HPF)
  • Fortran 90 (F90) - (ISO / ANSI standard
    extensions to Fortran 77).
  • High Performance Fortran (HPF) - extensions to
    F90 to support data parallel programming.
  • Compiler directives allow programmer
    specification of data distribution and alignment.
  • New compiler constructs and intrinsics allow the
    programmer to do computations and manipulations
    on data with different distributions.

7
Steps for Creating a Parallel Program
  • If you are starting with an existing serial
    program, debug the serial code completely
  • Identify the parts of the program that can be
    executed concurrently
  • Requires a thorough understanding of the
    algorithm
  • Exploit any inherent parallelism which may exist.
  • May require restructuring of the program and/or
    algorithm. May require an entirely new algorithm.
  • Decompose the program
  • Functional Parallelism
  • Data Parallelism
  • Combination of both
  • Code development
  • Code may be influenced/determined by machine
    architecture
  • Choose a programming paradigm
  • Determine communication
  • Add code to accomplish task control and
    communications
  • Compile, Test, Debug
  • Optimization
  • Measure Performance
  • Locate Problem Areas
  • Improve them

8
Amdahl's Law
  • Speedup due to enhancement E is
  • Suppose that enhancement E accelerates a fraction
    F (F < 1) of the task by a factor S (S > 1) and
    the remainder of the task is unaffected

ExTime w/ E = ExTime w/o E × ((1-F) + F/S)
Speedup w/ E = 1 / ((1-F) + F/S)
9
Examples: Amdahl's Law
  • Amdahl's Law tells us that to achieve linear
    speedup with 100 processors (e.g., a speedup of
    100), none of the original computation can be
    scalar!
  • To get a speedup of 99 from 100 processors, the
    percentage of the original program that could be
    scalar would have to be 0.01% or less
  • What speedup could we achieve from 100 processors
    if 30% of the original program is scalar?

Speedup w/ E = 1 / ((1-F) + F/S)
             = 1 / (0.3 + 0.7/100)
             ≈ 3.3
  • Serial program/algorithm might need to be
    restructured to allow for efficient
    parallelization.
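
The numbers above follow directly from the formula; a small C check:

    #include <stdio.h>

    /* Amdahl's law: F = fraction enhanced, S = speedup of that part */
    static double speedup(double F, double S)
    {
        return 1.0 / ((1.0 - F) + F / S);
    }

    int main(void)
    {
        printf("30%% scalar:   %.2f\n", speedup(0.70, 100));    /* ~3.3  */
        printf("0.01%% scalar: %.2f\n", speedup(0.9999, 100));  /* ~99.0 */
        return 0;
    }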

10
Decomposing the Program
  • There are three methods for decomposing a problem
    into smaller tasks to be performed in parallel
    Functional Decomposition, Domain Decomposition,
    or a combination of both
  • Functional Decomposition (Functional Parallelism)
  • Decomposing the problem into different tasks
    which can be distributed to multiple processors
    for simultaneous execution
  • Good to use when there is no static structure or
    fixed determination of the number of calculations
    to be performed
  • Domain Decomposition (Data Parallelism)
  • Partitioning the problem's data domain and
    distributing portions to multiple processors for
    simultaneous execution
  • Good to use for problems where
  • data is static (factoring and solving large
    matrix or finite difference calculations)
  • dynamic data structure tied to a single entity,
    where the entity can be subsetted (large
    multi-body problems)
  • domain is fixed but computation within various
    regions of the domain is dynamic (fluid vortices
    models)
  • There are many ways to decompose data into
    partitions to be distributed
  • One Dimensional Data Distribution
  • Block Distribution
  • Cyclic Distribution
  • Two Dimensional Data Distribution
  • Block-Block Distribution
  • Block-Cyclic Distribution
  • Cyclic-Block Distribution
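
The one-dimensional distributions can be stated as ownership rules; a
minimal sketch in C, assuming in the block case that p divides n
evenly:

    #include <stdio.h>

    /* which process owns element i of an n-element array on p processes */
    int block_owner(int i, int n, int p)  { return i / (n / p); }
    int cyclic_owner(int i, int p)        { return i % p; }

    int main(void)
    {
        int i;
        /* 8 elements on 4 processes: block  -> 0 0 1 1 2 2 3 3
                                      cyclic -> 0 1 2 3 0 1 2 3 */
        for (i = 0; i < 8; i++)
            printf("%d: block=%d cyclic=%d\n",
                   i, block_owner(i, 8, 4), cyclic_owner(i, 4));
        return 0;
    }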

11
Functional Decomposition of a Program
  • Decomposing the problem into different tasks
    which can be distributed to multiple processors
    for simultaneous execution
  • Good to use when there is no static structure or
    fixed determination of the number of calculations
    to be performed

12
Functional Decomposition of a Program
13
Domain Decomposition (Data Parallelism)
  • Partitioning the problem's data domain and
    distributing portions to multiple processors for
    simultaneous execution
  • There are many ways to decompose data into
    partitions to be distributed

14
Summing 100,000 Numbers on 100 Processors
  • Start by distributing 1000 elements of vector A
    to each of the local memories and summing each
    subset in parallel
  • sum = 0;
  • for (i = 0; i < 1000; i = i + 1)
  •   sum = sum + Al[i];    /* sum local array subset */
  • The processors then coordinate in adding together
    the sub sums (Pn is the number of the processor,
    send(x,y) sends value y to processor x, and
    receive() receives a value)

half = 100; limit = 100;
repeat
    half = (half + 1) / 2;                 /* dividing line */
    if (Pn >= half && Pn < limit) send(Pn - half, sum);
    if (Pn < (limit / 2)) sum = sum + receive();
    limit = half;
until (half == 1);                         /* final sum in P0's sum */
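
In MPI this coordination step is a single collective call; a sketch
(the local values are made up) in which MPI_Reduce performs a
send/receive tree like the one above inside the library:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        double Al[1000], sum = 0.0, total;
        int i, rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < 1000; i++) Al[i] = 1.0;   /* made-up local subset */
        for (i = 0; i < 1000; i++) sum += Al[i];  /* sum local array subset */

        /* combine the partial sums; the result lands in process 0 */
        MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total = %.0f\n", total);
        MPI_Finalize();
        return 0;
    }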
15
An Example with 10 Processors
(Diagram: 10 processors, each holding a partial sum; half = 10)
16
An Example with 10 Processors
(Diagram: the reduction proceeds as half/limit shrink 10 → 5 → 3 → 2
→ 1; at each step the processors in the upper half send their partial
sums and those in the lower half receive and add them, leaving the
final sum in P0.)
17
Domain Decomposition (Data Parallelism)
  • Partitioning the problem's data domain and
    distributing portions to multiple processors for
    simultaneous execution
  • There are many ways to decompose data into
    partitions to be distributed

18
Cannon's Matrix Multiplication
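Only a figure appears on this slide; the schedule it depicts can be
sketched as follows. On a q × q process grid, block A(i,j) is
pre-skewed left by i and B(i,j) up by j; each of the q steps
multiplies the resident blocks, then shifts A left and B up. A serial
simulation with one element per "process" (purely illustrative):

    #include <stdio.h>
    #define Q 3   /* q x q process grid */

    int main(void)
    {
        double A[Q][Q], B[Q][Q], C[Q][Q] = {{0}};
        int i, j, s;
        for (i = 0; i < Q; i++)        /* made-up input blocks */
            for (j = 0; j < Q; j++) { A[i][j] = i + 1; B[i][j] = j + 1; }

        for (s = 0; s < Q; s++)        /* q multiply-and-shift steps */
            for (i = 0; i < Q; i++)
                for (j = 0; j < Q; j++)
                    /* after skewing plus s shifts, grid point (i,j)
                       holds A(i, (i+j+s) mod Q) and B((i+j+s) mod Q, j) */
                    C[i][j] += A[i][(i + j + s) % Q] * B[(i + j + s) % Q][j];

        printf("C[0][0] = %.0f\n", C[0][0]);  /* = sum over k of A[0][k]*B[k][0] */
        return 0;
    }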
19
Review: Multiprocessor Basics
  • Q1: How do they share data?
  • Q2: How do they coordinate?
  • Q3: How scalable is the architecture? How many
    processors?

                                             # of Proc
Communication model    Message passing       8 to 2048
                       Shared address NUMA   8 to 256
                       Shared address UMA    2 to 64
Physical connection    Network               8 to 256
                       Bus                   2 to 36
20
Review: Bus Connected SMPs (UMAs)
(Diagram: four processors, each with its own cache, connected by a
single bus to shared memory and I/O)
  • Caches are used to reduce latency and to lower
    bus traffic
  • Must provide hardware for cache coherence and
    process synchronization
  • Bus traffic and bandwidth limit scalability (< 36
    processors)

21
Network Connected Multiprocessors
  • Either a single address space (NUMA and ccNUMA)
    with implicit processor communication via loads
    and stores, or multiple private memories with
    message passing communication via sends and
    receives
  • Interconnection network supports interprocessor
    communication

22
Communication in Network Connected Multis
  • Implicit communication via loads and stores
  • hardware designers have to provide coherent
    caches and process synchronization primitives
  • lower communication overhead
  • harder to overlap computation with communication
  • more efficient to access remote data by address
    only when it is demanded rather than to send it
    in case it might be used (such a machine has
    distributed shared memory (DSM))
  • Explicit communication via sends and receives
  • simplest solution for hardware designers
  • higher communication overhead
  • easier to overlap computation with communication
  • easier for the programmer to optimize
    communication

23
Cache Coherency in NUMAs
  • For performance reasons we want to allow the
    shared data to be stored in caches
  • Once again have multiple copies of the same data
    with the same address in different processors
  • bus snooping won't work, since there is no single
    bus on which all memory references are broadcast
  • Directory-based protocols
  • keep a directory that is a repository for the
    state of every block in main memory (which caches
    have copies, whether it is dirty, etc.)
  • directory entries can be distributed (sharing
    status of a block always in a single known
    location) to reduce contention
  • directory controller sends explicit commands over
    the IN to each processor that has a copy of the
    data
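
A directory entry can be pictured as a block state plus a sharer set;
a sketch of one common full-bit-vector organization (the field and
state names are illustrative, not from any particular machine):

    #include <stdio.h>

    /* one directory entry per memory block */
    enum block_state { UNCACHED, SHARED, EXCLUSIVE /* dirty in one cache */ };

    struct dir_entry {
        enum block_state state;
        unsigned long    sharers;   /* bit i set => processor i has a copy */
    };

    int main(void)
    {
        struct dir_entry e = { SHARED, 0 };
        e.sharers |= 1ul << 3;      /* processor 3 caches the block */
        printf("state=%d sharers=%lx\n", e.state, e.sharers);
        return 0;
    }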

24
IN Performance Metrics
  • Network cost
  • number of switches
  • number of (bidirectional) links on a switch to
    connect to the network (plus one link to connect
    to the processor)
  • width in bits per link, length of link
  • Network bandwidth (NB) represents the best case
  • bandwidth of each link × number of links
  • Bisection bandwidth (BB) represents the worst
    case
  • divide the machine into two parts, each with half
    the nodes, and sum the bandwidth of the links
    that cross the dividing line
  • Other IN performance issues
  • latency on an unloaded network to send and
    receive messages
  • throughput - maximum number of messages
    transmitted per unit time
  • routing hops - worst case, congestion control and
    delay

25
Bus IN
(Diagram legend: bidirectional network switch; processor node)
  • N processors, 1 switch, 1 link (the bus)
  • Only 1 simultaneous transfer at a time
  • NB = link (bus) bandwidth × 1
  • BB = link (bus) bandwidth × 1
26
Ring IN
  • N processors, N switches, 2 links/switch, N links
  • N simultaneous transfers
  • NB = link bandwidth × N
  • BB = link bandwidth × 2
  • If a link is as fast as a bus, the ring is only
    twice as fast as a bus in the worst case, but is
    N times faster in the best case

27
Fully Connected IN
  • N processors, N switches, N-1 links/switch,
    (N(N-1))/2 links
  • N simultaneous transfers
  • NB = link bandwidth × (N(N-1))/2
  • BB = link bandwidth × (N/2)²

28
Crossbar (Xbar) Connected IN
  • N processors, N² switches (unidirectional), 2
    links/switch, N² links
  • N simultaneous transfers
  • NB = link bandwidth × N
  • BB = link bandwidth × N/2

29
Hypercube (Binary N-cube) Connected IN
(Diagram: a 2-cube)
  • N processors, N switches, log N links/switch,
    (N log N)/2 links
  • N simultaneous transfers
  • NB = link bandwidth × (N log N)/2
  • BB = link bandwidth × N/2

30
2D and 3D Mesh/Torus Connected IN
  • N processors, N switches, 2, 3, 4 (2D torus) or 6
    (3D torus) links/switch, 4N/2 links or 6N/2 links
  • N simultaneous transfers
  • NB = link bandwidth × 4N or link bandwidth × 6N
  • BB = link bandwidth × 2 N^(1/2) or link
    bandwidth × 2 N^(2/3)

31
Fat Tree
  • Trees are good structures. People in CS use them
    all the time. Suppose we wanted to make a tree
    network.

(Diagram: a binary tree network; processors A and B sit under one
lower switch, C and D under another)
  • Any time A wants to send to C, it ties up the
    upper links, so that B can't send to D.
  • The bisection bandwidth on a tree is horrible - 1
    link, at all times
  • The solution is to 'thicken' the upper links.
  • More links toward the root as the tree 'thickens'
    increase the bisection bandwidth
  • Rather than design a bunch of N-port switches,
    use pairs

32
Fat Tree
  • N processors, log(N-1) × log N switches, 2 up +
    4 down = 6 links/switch, N log N links
  • N simultaneous transfers
  • NB = link bandwidth × N log N
  • BB = link bandwidth × 4
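
The 64-processor comparison on the slides that follow drops straight
out of these NB/BB formulas; a quick check in C (log2 64 = 6, and
2 * sqrt(64) = 16 for the 2D torus):

    #include <stdio.h>

    int main(void)
    {
        int N = 64, logN = 6;
        printf("ring:        NB=%4d  BB=%4d\n", N, 2);
        printf("2D torus:    NB=%4d  BB=%4d\n", 4 * N, 16);
        printf("6-cube:      NB=%4d  BB=%4d\n", N * logN / 2, N / 2);
        printf("fully conn:  NB=%4d  BB=%4d\n", N * (N - 1) / 2,
               (N / 2) * (N / 2));
        return 0;
    }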

33
SGI NUMAlink Fat Tree
www.embedded-computing.com/articles/woodacre
34
IN Comparison
  • For a 64 processor system

                           Bus    Ring   Torus   6-cube   Fully connected
Network bandwidth            1
Bisection bandwidth          1
Total # of switches          1
Links per switch
Total # of links             1
35
IN Comparison
  • For a 64 processor system

                           Bus     Ring   2D Torus    6-cube   Fully connected
Network bandwidth            1       64        256       192              2016
Bisection bandwidth          1        2         16        32              1024
Total # of switches          1       64         64        64                64
Links per switch             -    2 + 1      4 + 1     6 + 1            63 + 1
Total # of links (bidi)      1  64 + 64   128 + 64  192 + 64         2016 + 64
36
Network Connected Multiprocessors
                  Proc             Proc Speed   # Proc       IN Topology                   BW/link (MB/sec)
SGI Origin        R16000                           128       fat tree                       800
Cray T3E          Alpha 21164      300MHz        2,048       3D torus                       600
Intel ASCI Red    Intel            333MHz        9,632       mesh                           800
IBM ASCI White    Power3           375MHz        8,192       multistage Omega               500
NEC ES            SX-5             500MHz      640 × 8       640-xbar                     16000
NASA Columbia     Intel Itanium2   1.5GHz      512 × 20      fat tree, Infiniband
IBM BG/L          Power PC 440     0.7GHz      65,536 × 2    3D torus, fat tree, barrier
37
IBM BlueGene
              512-node proto            BlueGene/L
Peak Perf     1.0 / 2.0 TFlops/s        180 / 360 TFlops/s
Memory Size   128 GByte                 16 / 32 TByte
Foot Print    9 sq feet                 2500 sq feet
Total Power   9 KW                      1.5 MW
Processors    512 dual proc             65,536 dual proc
Networks      3D Torus, Tree, Barrier   3D Torus, Tree, Barrier
Torus BW      3 B/cycle                 3 B/cycle
38
A BlueGene/L Chip
(Block diagram: two 700 MHz PowerPC 440 cores, each with a double FPU,
32K/32K L1 caches, and a 2KB L2; a shared 4MB ECC eDRAM L3 with 128B
lines, 8-way associative; a 16KB multiport SRAM buffer; 11 GB/s and
5.5 GB/s internal paths; interfaces for Gbit Ethernet, DDR control
(144b DDR, 256MB at 5.5 GB/s), the 3D torus (6 in, 6 out at 1.4 Gb/s
per link), the fat tree (3 in, 3 out at 2.8 Gb/s per link), and 4
global barriers)
39
Networks of Workstations (NOWs) Clusters
  • Clusters of off-the-shelf, whole computers with
    multiple private address spaces
  • Clusters are connected using the I/O bus of the
    computers
  • lower bandwidth than multiprocessors that use the
    memory bus
  • lower speed network links
  • more conflicts with I/O traffic
  • Clusters of N processors have N copies of the OS,
    limiting the memory available for applications
  • Improved system availability and expandability
  • easier to replace a machine without bringing down
    the whole system
  • allows rapid, incremental expandability
  • Economy-of-scale advantages with respect to costs

40
Commercial (NOW) Clusters
                 Proc             Proc Speed   # Proc      Network
Dell PowerEdge   P4 Xeon          3.06GHz      2,500       Myrinet
eServer IBM SP   Power4           1.7GHz       2,944
VPI BigMac       Apple G5         2.3GHz       2,200       Mellanox Infiniband
HP ASCI Q        Alpha 21264      1.25GHz      8,192       Quadrics
LLNL Thunder     Intel Itanium2   1.4GHz       1,024 × 4   Quadrics
Barcelona        PowerPC 970      2.2GHz       4,536       Myrinet
41
Summary
  • Flynn's classification of processors - SISD,
    SIMD, MIMD
  • Q1: How do processors share data?
  • Q2: How do processors coordinate their activity?
  • Q3: How scalable is the architecture (what is
    the maximum number of processors)?
  • Shared address multis - UMAs and NUMAs
  • Scalability of bus connected UMAs limited (< 36
    processors)
  • Network connected NUMAs more scalable
  • Interconnection Networks (INs)
  • fully connected, xbar
  • ring
  • mesh
  • n-cube, fat tree
  • Message passing multis
  • Cluster connected (NOWs) multis