
Parallel Computer Architecture
  • A parallel computer is a collection of processing
    elements that cooperate to solve large problems fast.
  • Broad issues involved:
  • Resource Allocation:
  • Number of processing elements (PEs).
  • Computing power of each element.
  • Amount of physical memory used.
  • Data access, Communication and Synchronization:
  • How the elements cooperate and communicate.
  • How data is transmitted between processors.
  • Abstractions and primitives for cooperation.
  • Performance and Scalability:
  • Performance enhancement of parallelism: Speedup.
  • Scalability of performance to larger
    systems/problems.

Exploiting Program Parallelism
The Need And Feasibility of Parallel Computing
  • Application demands: more computing cycles
  • Scientific computing: CFD, Biology, Chemistry,
    Physics, ...
  • General-purpose computing: Video, Graphics, CAD,
    Databases, Transaction Processing, Gaming
  • Mainstream multithreaded programs are similar to
    parallel programs
  • Technology Trends:
  • Number of transistors on chip growing rapidly
  • Clock rates expected to go up, but only slowly
  • Architecture Trends:
  • Instruction-level parallelism is valuable but
    limited
  • Coarser-level parallelism, as in multiprocessors
    (MPs), is the most viable approach
  • Economics:
  • Today's microprocessors have multiprocessor
    support, eliminating the need for designing
    expensive custom PEs
  • Lower parallel system cost.
  • Multiprocessor systems offer a cost-effective
    replacement of uniprocessor systems in mainstream
    computing.

Scientific Computing Demand
Scientific Supercomputing Trends
  • Proving ground and driver for innovative
    architecture and advanced techniques
  • Market is much smaller relative to commercial
    computing
  • Dominated by vector machines starting in the 70s
  • Meanwhile, microprocessors have made huge gains
    in floating-point performance:
  • High clock rates.
  • Pipelined floating point units.
  • Instruction-level parallelism.
  • Effective use of caches.
  • Large-scale multiprocessors are replacing vector
    supercomputers
  • Well under way already

Raw Uniprocessor Performance: LINPACK
Raw Parallel Performance: LINPACK
Parallelism in Microprocessor VLSI Generations
The Goal of Parallel Computing
  • Goal of applications in using parallel machines:
    Speedup
  • Speedup (p processors) =
    Performance (p processors) / Performance (1 processor)
  • For a fixed problem size (input data set),
    performance = 1/time
  • Speedup fixed problem (p processors) =
    Time (1 processor) / Time (p processors)

Elements of Modern Computers
(Figure: elements of a modern computer system;
Binding: Compile, Load.)
Elements of Modern Computers
  • Computing Problems:
  • Numerical Computing: Science and technology
    numerical problems demand intensive integer and
    floating point computations.
  • Logical Reasoning: Artificial intelligence (AI)
    demands logic inferences, symbolic
    manipulations and large space searches.
  • Algorithms and Data Structures:
  • Special algorithms and data structures are needed
    to specify the computations and communication
    present in computing problems.
  • Most numerical algorithms are deterministic, using
    regular data structures.
  • Symbolic processing may use heuristics or
    non-deterministic searches.
  • Parallel algorithm development requires
    interdisciplinary interaction.

Elements of Modern Computers
  • Hardware Resources:
  • Processors, memory, and peripheral devices form
    the hardware core of a computer system.
  • Processor instruction set, processor
    connectivity, and memory organization influence
    the system architecture.
  • Operating Systems:
  • Manages the allocation of resources to running
    processes.
  • Mapping to match algorithmic structures with
    hardware architecture and vice versa: processor
    scheduling, memory mapping, interprocessor
    communication.
  • Parallelism exploitation at algorithm design,
    program writing, compilation, and run time.

Elements of Modern Computers
  • System Software Support:
  • Needed for the development of efficient programs
    in high-level languages (HLLs):
  • Assemblers, loaders.
  • Portable parallel programming languages.
  • User interfaces and tools.
  • Compiler Support:
  • Preprocessor compiler: Sequential compiler plus a
    low-level library of the target parallel
    computer.
  • Precompiler: Some program flow analysis,
    dependence checking, limited optimizations for
    parallelism detection.
  • Parallelizing compiler: Can automatically detect
    parallelism in source code and transform
    sequential code into parallel constructs.

Approaches to Parallel Programming
(a) Implicit Parallelism
(b) Explicit Parallelism
Evolution of Computer Architecture
Legend: I/E = Instruction Fetch and Execute;
SIMD = Single Instruction stream over Multiple Data
streams; MIMD = Multiple Instruction streams over
Multiple Data streams; MPP = Massively Parallel
Processors.
Parallel Architectures History
  • Historically, parallel architectures tied to
    programming models
  • Divergent architectures, with no predictable
    pattern of growth.

Programming Models
  • Programming methodology used in coding
    applications
  • Specifies communication and synchronization
  • Examples:
  • Multiprogramming:
  • No communication or synchronization at the
    program level
  • Shared memory address space
  • Message passing:
  • Explicit point-to-point communication
  • Data parallel:
  • More regimented, global actions on data
  • Implemented with shared address space or message
    passing
Flynn's 1972 Classification of Computer Architecture
  • Single Instruction stream over a Single Data
    stream (SISD): Conventional sequential machines.
  • Single Instruction stream over Multiple Data
    streams (SIMD): Vector computers, arrays of
    synchronized processing elements.
  • Multiple Instruction streams and a Single Data
    stream (MISD): Systolic arrays for pipelined
    execution.
  • Multiple Instruction streams over Multiple Data
    streams (MIMD): Parallel computers:
  • Shared memory multiprocessors.
  • Multicomputers: Unshared distributed memory,
    message-passing used instead.

Flynn's Classification of Computer Architecture
  • Fig. 1.3, page 12,
  • Advanced Computer Architecture: Parallelism,
    Scalability, Programmability, Hwang, 1993.

Current Trends In Parallel Architectures
  • The extension of computer architecture to
    support communication and cooperation:
  • OLD: Instruction Set Architecture
  • NEW: Communication Architecture
  • Defines:
  • Critical abstractions, boundaries, and primitives
  • Organizational structures that implement
    interfaces (hardware or software)
  • Compilers, libraries and OS are important bridges

Modern Parallel Architecture: Layered Framework
Shared Address Space Parallel Architectures
  • Any processor can directly reference any memory
    location
  • Communication occurs implicitly as a result of
    loads and stores
  • Convenient:
  • Location transparency
  • Similar programming model to time-sharing on
    uniprocessors
  • Except processes run on different processors
  • Good throughput on multiprogrammed workloads
  • Naturally provided on a wide range of platforms
  • Wide range of scale: few to hundreds of
    processors
  • Popularly known as shared memory machines or
    model
  • Ambiguous: Memory may be physically distributed
    among processors

Shared Address Space (SAS) Model
  • Process: virtual address space plus one or more
    threads of control
  • Portions of address spaces of processes are shared
  • Writes to shared addresses are visible to other
    threads (in other processes too)
  • Natural extension of the uniprocessor model:
  • Conventional memory operations used for
    communication
  • Special atomic operations needed for
    synchronization
  • OS uses shared memory to coordinate processes
    (see the C sketch below)
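A minimal sketch of the SAS model in C with POSIX
threads (a hypothetical example, not from the slides):
two threads communicate through an ordinary shared
variable, and a mutex plays the role of the special
atomic operation needed for synchronization.

  #include <pthread.h>
  #include <stdio.h>

  /* Shared address space: both threads see the same variables. */
  static long counter = 0;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker(void *arg)
  {
      (void)arg;
      for (int i = 0; i < 100000; i++) {
          pthread_mutex_lock(&lock);   /* atomic operation for synchronization   */
          counter++;                   /* ordinary store communicates the update */
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker, NULL);
      pthread_create(&t2, NULL, worker, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("counter = %ld\n", counter);   /* 200000: each write visible to both */
      return 0;
  }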

Models of Shared-Memory Multiprocessors
  • The Uniform Memory Access (UMA) Model:
  • The physical memory is shared by all processors.
  • All processors have equal access time to all
    memory words.
  • Distributed memory or Nonuniform Memory Access
    (NUMA) Model:
  • Shared memory is physically distributed locally
    among processors.
  • The Cache-Only Memory Architecture (COMA) Model:
  • A special case of a NUMA machine where all
    distributed main memory is converted to caches.
  • No memory hierarchy at each processor node.

Models of Shared-Memory Multiprocessors
(Figures: UMA, NUMA, and COMA organizations.
Legend: Interconnect = Bus, Crossbar, or Multistage
network; P = Processor; M = Memory; C = Cache;
D = Directory.)
Uniform Memory Access Example Intel Pentium Pro
  • All coherence and multiprocessing glue in
    processor module
  • Highly integrated, targeted at high volume
  • Low latency and bandwidth

Uniform Memory Access Example SUN Enterprise
  • 16 cards of either type: processors + memory, or
    I/O.
  • All memory accessed over bus, so symmetric.
  • Higher bandwidth, higher latency bus.

Distributed Shared-Memory Multiprocessor System
Example Cray T3E
  • Scale up to 1024 processors, 480MB/s links
  • Memory controller generates communication
    requests for nonlocal references
  • No hardware mechanism for coherence (SGI Origin
    etc. provide this)

Message-Passing Multicomputers
  • Comprised of multiple autonomous computers
    (nodes).
  • Each node consists of a processor, local memory,
    attached storage and I/O peripherals.
  • Programming model more removed from basic
    hardware operations.
  • Local memory is only accessible by local
    processors.
  • A message-passing network provides point-to-point
    static connections among the nodes.
  • Inter-node communication is carried out by
    message passing through the static connection
    network.
  • Process communication achieved using a
    message-passing programming environment.

Message-Passing Abstraction
  • Send specifies buffer to be transmitted and
    receiving process
  • Recv specifies sending process and application
    storage to receive into
  • Memory-to-memory copy, but need to name processes
  • Optional tag on send and matching rule on receive
  • User process names local data and entities in
    process/tag space too
  • In simplest form, the send/recv match achieves a
    pairwise synchronization event
  • Many overheads: copying, buffer management,
    protection
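A hedged C/MPI illustration of this abstraction (an
assumed two-process run, not part of the original
slides): process 0 sends a buffer with a tag, process 1
names the sender and tag in its receive, and the
matching send/recv pair acts as the pairwise
synchronization event.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank;
      double buf[4] = {1.0, 2.0, 3.0, 4.0};

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          /* Send names the receiving process (1) and a tag (99). */
          MPI_Send(buf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
      } else if (rank == 1) {
          /* Recv names the sending process (0), the tag, and local storage. */
          MPI_Recv(buf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
          printf("received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
      }

      MPI_Finalize();
      return 0;
  }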

Message-Passing Example IBM SP-2
  • Made out of essentially complete RS6000
    workstations.
  • Network interface integrated in I/O bus
    (bandwidth limited by I/O bus).

Message-Passing Example Intel Paragon
Message-Passing Programming Tools
  • Message-passing programming libraries include:
  • Message Passing Interface (MPI):
  • Provides a standard for writing concurrent
    message-passing programs.
  • MPI implementations include parallel libraries
    used by existing programming languages.
  • Parallel Virtual Machine (PVM):
  • Enables a collection of heterogeneous computers
    to be used as a coherent and flexible concurrent
    computational resource.
  • PVM support software executes on each machine in
    a user-configurable pool, and provides a
    computational environment for concurrent
    applications.
  • User programs written, for example, in C, Fortran
    or Java are provided access to PVM through the
    use of calls to PVM library routines.

Data Parallel Systems (SIMD in Flynn taxonomy)
  • Programming model:
  • Operations performed in parallel on each element
    of data structure
  • Logically single thread of control, performs
    sequential or parallel steps
  • Conceptually, a processor is associated with each
    data element
  • Architectural model:
  • Array of many simple, cheap processors each with
    little memory
  • Processors don't sequence through instructions
  • Attached to a control processor that issues
    instructions
  • Specialized and general communication, cheap
    global synchronization
  • Some recent machines:
  • Thinking Machines CM-1, CM-2 (and CM-5)
  • Maspar MP-1 and MP-2

Dataflow Architectures
  • Represent computation as a graph of essential
    dependences
  • Logical processor at each node, activated by
    availability of operands
  • Message (tokens) carrying tag of next instruction
    sent to next processor
  • Tag compared with others in matching store; a match
    fires execution

Systolic Architectures
  • Replace single processor with an array of regular
    processing elements
  • Orchestrate data flow for high throughput with
    less memory access
  • Different from pipelining:
  • Nonlinear array structure, multidirection data
    flow, each PE may have (small) local instruction
    and data memory
  • Different from SIMD: each PE may do something
    different
  • Initial motivation: VLSI enables inexpensive
    special-purpose chips
  • Represent algorithms directly by chips connected
    in regular pattern

Parallel Programs
  • Conditions of Parallelism:
  • Data Dependence
  • Control Dependence
  • Resource Dependence
  • Bernstein's Conditions
  • Asymptotic Notations for Algorithm Analysis
  • Parallel Random-Access Machine (PRAM)
  • Example: sum algorithm on P processor PRAM
  • Network Model of Message-Passing Multicomputers
  • Example: Asynchronous Matrix-Vector Product on a
    Ring
  • Levels of Parallelism in Program Execution
  • Hardware Vs. Software Parallelism
  • Parallel Task Grain Size
  • Example Motivating Problems with high levels of
    concurrency
  • Limited Concurrency: Amdahl's Law
  • Parallel Performance Metrics: Degree of
    Parallelism (DOP)
  • Concurrency Profile
  • Steps in Creating a Parallel Program:
  • Decomposition, Assignment, Orchestration, Mapping

Conditions of Parallelism Data Dependence
  • True Data or Flow Dependence: A statement S2 is
    data dependent on statement S1 if an execution
    path exists from S1 to S2 and if at least one
    output variable of S1 feeds in as an input
    operand used by S2
  • Denoted by S1 → S2
  • Antidependence: Statement S2 is antidependent on
    S1 if S2 follows S1 in program order and if the
    output of S2 overlaps the input of S1
  • Denoted by S1 ⇸ S2 (crossed arrow)
  • Output dependence: Two statements are output
    dependent if they produce the same output
    variable
  • Denoted by S1 ○→ S2
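An illustrative C fragment (not from the slides)
showing the three kinds of data dependence among
straight-line statements:

  /* Hypothetical fragment illustrating data dependences. */
  int a, b, c, d;

  void dependences(void)
  {
      a = b + c;   /* S1 */
      d = a * 2;   /* S2: flow dependence S1 -> S2 (S2 reads the a written by S1)   */
      b = 5;       /* S3: antidependence S1 -/-> S3 (S3 overwrites b, an input of S1) */
      d = c - 1;   /* S4: output dependence S2 o-> S4 (both statements write d)     */
  }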

Conditions of Parallelism Data Dependence
  • I/O dependence: Read and write are I/O
    statements. I/O dependence occurs not because the
    same variable is involved but because the same
    file is referenced by both I/O statements.
  • Unknown dependence:
  • The subscript of a variable is itself subscripted
    (indirect addressing).
  • The subscript does not contain the loop index.
  • A variable appears more than once with subscripts
    having different coefficients of the loop
    variable.
  • The subscript is nonlinear in the loop index
    variable.

Data and I/O Dependence Examples
S1: Load R1, A
S2: Add R2, R1
S3: Move R1, R3
S4: Store B, R1
(Dependence graph figure.)

S1: Read (4), A(I)   /* Read array A from tape unit 4  */
S2: Rewind (4)       /* Rewind tape unit 4             */
S3: Write (4), B(I)  /* Write array B onto tape unit 4 */
S4: Rewind (4)       /* Rewind tape unit 4             */

I/O dependence caused by accessing the same file
(tape unit 4) by the read and write statements.
Conditions of Parallelism
  • Control Dependence:
  • Order of execution cannot be determined before
    runtime due to conditional statements.
  • Resource Dependence:
  • Concerned with conflicts in using shared
    resources, including functional units (integer,
    floating point), memory areas, among parallel
    tasks.
  • Bernstein's Conditions:
  • Two processes P1, P2 with input sets I1, I2
    and output sets O1, O2 can execute in parallel
    (denoted by P1 || P2) if:
  • I1 ∩ O2 = ∅
  • I2 ∩ O1 = ∅
  • O1 ∩ O2 = ∅

Bernstein's Conditions: An Example
  • For the following instructions P1, P2, P3, P4, P5
    in program order, assuming:
  • Instructions are in program order
  • Each instruction requires one step to execute
  • Two adders are available
  • P1: C = D x E
  • P2: M = G + C
  • P3: A = B + C
  • P4: C = L + M
  • P5: F = G / E

Using Bernstein's conditions, after checking
statement pairs: P1 || P5, P2 || P3, P2 || P5,
P5 || P3, and P4 || P5 (a small checker sketch
follows below).
Parallel execution in three steps, assuming two
adders are available per step.
Dependence graph: data dependence (solid lines),
resource dependence (dashed lines).
Sequential execution shown for comparison.
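A small C sketch (illustrative, using hypothetical
bit-set encodings of the input and output sets) that
tests Bernstein's conditions for a pair of statements:

  #include <stdio.h>

  /* Variables encoded as bits of an unsigned set. */
  enum { A = 1, B = 2, C = 4, D = 8, E = 16, F = 32, G = 64, L = 128, M = 256 };

  /* Bernstein's conditions: Pi || Pj iff
     Ii ∩ Oj = ∅, Ij ∩ Oi = ∅ and Oi ∩ Oj = ∅. */
  static int can_run_in_parallel(unsigned Ii, unsigned Oi,
                                 unsigned Ij, unsigned Oj)
  {
      return (Ii & Oj) == 0 && (Ij & Oi) == 0 && (Oi & Oj) == 0;
  }

  int main(void)
  {
      unsigned I1 = D | E, O1 = C;   /* P1: C = D x E */
      unsigned I5 = G | E, O5 = F;   /* P5: F = G / E */
      unsigned I2 = G | C, O2 = M;   /* P2: M = G + C */
      unsigned I4 = L | M, O4 = C;   /* P4: C = L + M */

      printf("P1 || P5: %s\n", can_run_in_parallel(I1, O1, I5, O5) ? "yes" : "no");
      printf("P2 || P4: %s\n", can_run_in_parallel(I2, O2, I4, O4) ? "yes" : "no");
      return 0;   /* prints yes, then no (P4 writes C, which P2 reads) */
  }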
Asymptotic Notations for Algorithm Analysis
  • Asymptotic analysis of the computing time of an
    algorithm f(n) ignores constant execution factors
    and concentrates on determining the order of
    magnitude of algorithm performance.
  • Upper bound: O(g(n)),
    used in worst-case analysis of algorithm
    performance:
  • f(n) = O(g(n))
    iff there exist two positive constants c
    and n0 such that
    f(n) ≤ c g(n) for all n > n0
  • i.e. g(n) is an upper bound on f(n)
  • O(1) < O(log n) < O(n) < O(n log n) <
    O(n²) < O(n³) < O(2ⁿ)

Asymptotic Notations for Algorithm Analysis
  • Lower bound: Ω(g(n)),
    used in the analysis of the lower limit of
    algorithm performance:
  • f(n) = Ω(g(n))
    if there exist positive constants c,
    n0 such that
    f(n) ≥ c g(n) for all n > n0
  • i.e. g(n) is a lower bound on f(n)
  • Tight bound: Θ(g(n)),
    used in finding a tight limit on algorithm
    performance:
  • f(n) = Θ(g(n))
    if there exist constant positive
    integers c1, c2, and n0 such that
    c1 g(n) ≤ f(n) ≤ c2 g(n) for all n > n0
  • i.e. g(n) is both an upper
    and a lower bound on f(n)

The Growth Rate of Common Computing Functions
  log n     n    n log n      n²       n³           2ⁿ
    0       1       0          1        1            2
    1       2       2          4        8            4
    2       4       8         16       64           16
    3       8      24         64      512          256
    4      16      64        256     4096        65536
    5      32     160       1024    32768   4294967296
Theoretical Models of Parallel Computers
  • Parallel Random-Access Machine (PRAM):
  • n-processor, global shared-memory model.
  • Models idealized parallel computers with zero
    synchronization or memory access overhead.
  • Used in parallel algorithm development and in
    scalability and complexity analysis.
  • PRAM variants: More realistic models than pure
    PRAM:
  • EREW-PRAM: Simultaneous memory reads or writes
    to/from the same memory location are not allowed.
  • CREW-PRAM: Simultaneous memory writes to the
    same location are not allowed.
  • ERCW-PRAM: Simultaneous reads from the same
    memory location are not allowed.
  • CRCW-PRAM: Concurrent reads or writes to/from
    the same memory location are allowed.

Example: sum algorithm on P processor PRAM
begin
1. for j = 1 to l (= n/p) do
     Set B(l(s - 1) + j) := A(l(s - 1) + j)
2. for h = 1 to log n do
   2.1 if (k - h - q ≥ 0) then
         for j = 2^(k-h-q)(s - 1) + 1 to 2^(k-h-q)s do
           Set B(j) := B(2j - 1) + B(2j)
   2.2 else if (s ≤ 2^(k-h)) then
         Set B(s) := B(2s - 1) + B(2s)
3. if (s = 1) then set S := B(1)
end
  • Input: Array A of size n = 2^k
  • in shared memory
  • Initialized local variables:
  • the order n,
  • the number of processors p = 2^q ≤ n,
  • the processor number s
  • Output: The sum of the elements
  • of A stored in shared memory
  • Running time analysis:
  • Step 1 takes O(n/p): each processor executes
    n/p operations
  • The hth iteration of step 2 takes O(n/(2^h p)),
    since each processor has
  • to perform about n/(2^h p) operations
  • Step 3 takes O(1)
  • Total running time: O(n/p + log n)
    (a sequential C sketch of the summation tree
    follows below)
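A rough sequential C sketch of the same logarithmic
summation tree (illustrative only; on a PRAM each
inner iteration would be executed by a different
processor in one time unit):

  #include <stdio.h>

  #define N 8   /* n = 2^k elements */

  int main(void)
  {
      double A[N] = {1, 2, 3, 4, 5, 6, 7, 8};
      double B[N];

      /* Step 1: copy A into the working array B (done in parallel chunks on a PRAM). */
      for (int j = 0; j < N; j++)
          B[j] = A[j];

      /* Step 2: log n combining levels; each level halves the number of partial sums.
         On a PRAM each pair B(2j-1) + B(2j) is added by a different processor.        */
      for (int len = N; len > 1; len /= 2)
          for (int j = 0; j < len / 2; j++)
              B[j] = B[2 * j] + B[2 * j + 1];

      /* Step 3: processor 1 writes the final sum. */
      printf("sum = %g\n", B[0]);   /* 36 */
      return 0;
  }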

Example: Sum Algorithm on P Processor PRAM
For n = 8, p = 4: processor allocation for
computing the sum of 8 elements on a 4-processor
PRAM.
(Figure: summation tree over time units 1-5; the
operation represented by a node is executed by the
processor indicated below the node.)
The Power of The PRAM Model
  • Well-developed techniques and algorithms to
    handle many computational problems exist for the
    PRAM model.
  • Removes algorithmic details regarding
    synchronization and communication, concentrating
    on the structural properties of the problem.
  • Captures several important parameters of parallel
    computations: operations performed in unit time,
    as well as processor allocation.
  • The PRAM design paradigms are robust and many
    network algorithms can be directly derived from
    PRAM algorithms.
  • It is possible to incorporate synchronization and
    communication into the shared-memory PRAM model.

Performance of Parallel Algorithms
  • Performance of a parallel algorithm is typically
    measured in terms of worst-case analysis.
  • For problem Q with a PRAM algorithm that runs in
    time T(n) using P(n) processors, for an instance
    of size n:
  • The time-processor product C(n) = T(n) · P(n)
    represents the cost of the parallel algorithm.
  • For p < P(n), each of the T(n) parallel
    steps is simulated in O(P(n)/p) substeps. Total
    simulation takes O(T(n)P(n)/p).
  • The following four measures of performance are
    asymptotically equivalent:
  • P(n) processors and T(n) time
  • C(n) = P(n)T(n) cost and T(n) time
  • O(T(n)P(n)/p) time for any number of processors
    p < P(n)
  • O(C(n)/p + T(n)) time for any number of
    processors.

Network Model of Message-Passing Multicomputers
  • A network of processors can be viewed as a graph
    G = (N, E):
  • Each node i ∈ N represents a processor.
  • Each edge (i,j) ∈ E represents a two-way
    communication link between processors i and j.
  • Each processor is assumed to have its own local
    memory.
  • No shared memory is available.
  • Operation is synchronous or asynchronous (message
    passing).
  • Typical message-passing communication constructs:
  • send(X,i): a copy of X is sent to processor Pi,
    and execution continues.
  • receive(Y, j): execution is suspended until the
    data from processor Pj is received and stored in
    Y, then execution resumes.

Network Model of Multicomputers
  • Routing is concerned with delivering each message
    from source to destination over the network.
  • Additional important network topology parameters:
  • The network diameter is the maximum distance
    between any pair of nodes.
  • The maximum degree of any node in G.
  • Example:
  • Linear array: p processors P1, ..., Pp are
    connected in a linear array where:
  • Processor Pi is connected to Pi-1 and Pi+1 if
    they exist.
  • Diameter is p-1; maximum degree is 2.
  • A ring is a linear array of processors where
    processors P1 and Pp are directly connected.

A Four-Dimensional Hypercube
  • Two processors are connected if their binary
    indices differ in one bit position.
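A small illustrative C helper (not from the slides)
that tests this adjacency rule by checking whether two
node labels differ in exactly one bit:

  #include <stdio.h>

  /* Two hypercube nodes are neighbors iff their binary labels differ in one bit. */
  static int hypercube_neighbors(unsigned a, unsigned b)
  {
      unsigned diff = a ^ b;                          /* bits where the labels differ */
      return diff != 0 && (diff & (diff - 1)) == 0;   /* exactly one bit set          */
  }

  int main(void)
  {
      printf("%d\n", hypercube_neighbors(0x5, 0x7));  /* 0101 vs 0111 -> 1 (neighbors) */
      printf("%d\n", hypercube_neighbors(0x5, 0x6));  /* 0101 vs 0110 -> 0             */
      return 0;
  }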

Example: Asynchronous Matrix-Vector Product on a Ring
  • Input:
  • n x n matrix A; vector x of order n.
  • The processor number i. The number of
    processors p.
  • The ith submatrix B = A(1:n, (i-1)r+1 : ir) of
    size n x r, where r = n/p.
  • The ith subvector w = x((i-1)r+1 : ir) of size r.
  • Output:
  • Processor Pi computes the vector y = A1x1 + ... +
    Aixi and passes the result to the right.
  • Upon completion, P1 will hold the product Ax.
  • Begin
  • 1. Compute the matrix-vector product z = Bw
  • 2. If i = 1 then set y := 0
  •    else receive(y, left)
  • 3. Set y := y + z
  • 4. send(y, right)
  • 5. if i = 1 then receive(y, left)
  • End
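A compact C/MPI sketch of this ring algorithm
(illustrative only; it assumes n is divisible by the
process count, at least two processes, all-ones data,
and maps the slide's 1-based processor numbers to
0-based MPI ranks):

  #include <mpi.h>
  #include <stdio.h>
  #include <string.h>

  #define N 8                               /* matrix order (assumed divisible by p) */

  int main(int argc, char *argv[])
  {
      int i, p;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &i);
      MPI_Comm_size(MPI_COMM_WORLD, &p);

      int r = N / p;                         /* columns held per processor           */
      int left = (i - 1 + p) % p;            /* ring neighbors                       */
      int right = (i + 1) % p;
      double B[N][r], w[r], z[N], y[N];      /* local column block and subvector     */

      for (int a = 0; a < N; a++)            /* fill B and w with ones (placeholder) */
          for (int b = 0; b < r; b++)
              B[a][b] = 1.0;
      for (int b = 0; b < r; b++)
          w[b] = 1.0;

      for (int a = 0; a < N; a++) {          /* step 1: local product z = B w        */
          z[a] = 0.0;
          for (int b = 0; b < r; b++)
              z[a] += B[a][b] * w[b];
      }

      if (i == 0)                            /* step 2: first processor starts y = 0 */
          memset(y, 0, sizeof y);
      else                                   /* others receive partial sum from left */
          MPI_Recv(y, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      for (int a = 0; a < N; a++)            /* step 3: accumulate y = y + z         */
          y[a] += z[a];
      MPI_Send(y, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);  /* step 4: pass right   */

      if (i == 0) {                          /* step 5: full product Ax returns here */
          MPI_Recv(y, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("y[0] = %g\n", y[0]);       /* with all-ones data: y[0] = 8         */
      }

      MPI_Finalize();
      return 0;
  }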

Creating a Parallel Program
  • Assumption: a sequential algorithm to solve the
    problem is given
  • Or a different algorithm with more inherent
    parallelism is devised.
  • Most programming problems have several parallel
    solutions. The best solution may differ from that
    suggested by existing sequential algorithms.
  • One must:
  • Identify work that can be done in parallel.
  • Partition work and perhaps data among processes.
  • Manage data access, communication and
    synchronization.
  • Note: work includes computation, data access and
    I/O.
  • Main goal: Speedup (plus low programming
    effort and resource needs)
  • Speedup (p) = Performance(p) / Performance(1)
  • For a fixed problem:
  • Speedup (p) = Time(1) / Time(p)

Some Important Concepts
  • Task:
  • Arbitrary piece of undecomposed work in a parallel
    computation
  • Executed sequentially on a single processor;
    concurrency is only across tasks
  • E.g. a particle/cell in Barnes-Hut, a ray or ray
    group in Raytrace
  • Fine-grained versus coarse-grained tasks
  • Process (thread):
  • Abstract entity that performs the tasks assigned
    to processes
  • Processes communicate and synchronize to perform
    their tasks
  • Processor:
  • Physical engine on which a process executes
  • Processes virtualize the machine to the programmer:
  • First write the program in terms of processes,
    then map to processors

Levels of Parallelism in Program Execution
(Figure: parallelism levels from coarse grain at the
job/program level, through medium grain at the
procedure/subprogram level, to fine grain at the
loop/instruction level; moving toward finer grain
increases the degree of parallelism but also the
communications demand and mapping/scheduling
overhead.)
Hardware and Software Parallelism
  • Hardware parallelism:
  • Defined by machine architecture and hardware
    multiplicity (number of processors available).
  • Often a function of cost/performance tradeoffs.
  • Characterized in a single processor by the number
    of instructions k issued in a single cycle
    (k-issue processor).
  • A multiprocessor system with n k-issue
    processors can handle a maximum of nk threads of
    instructions simultaneously.
  • Software parallelism:
  • Defined by the control and data dependences of
    programs.
  • Revealed in program profiling or program flow
    graphs.
  • A function of algorithm, programming style and
    compiler optimization.

Computational Parallelism and Grain Size
  • Grain size (granularity) is a measure of the
    amount of computation involved in a task in a
    parallel computation:
  • Instruction Level:
  • At instruction or statement level.
  • Grain size of 20 instructions or less.
  • For scientific applications, parallelism at this
    level can range from 500 to 3000 concurrent
    operations.
  • Manual parallelism detection is difficult but
    assisted by parallelizing compilers.
  • Loop level:
  • Iterative loop operations.
  • Typically, 500 instructions or less per
    iteration.
  • Optimized on vector parallel computers.
  • Independent successive loop operations can be
    vectorized or run in SIMD mode.

Computational Parallelism and Grain Size
  • Procedure level:
  • Medium-sized grain: task, procedure, subroutine
    levels.
  • Less than 2000 instructions.
  • More difficult detection of parallelism than at
    finer-grain levels.
  • Less communication requirement than fine-grain
    parallelism.
  • Relies heavily on effective operating system
    support.
  • Subprogram level:
  • Job and subprogram level.
  • Thousands of instructions per grain.
  • Often scheduled on message-passing
    multicomputers.
  • Job (program) level, or Multiprogramming:
  • Independent programs executed on a parallel
    computer.
  • Grain size in tens of thousands of instructions.

Example Motivating Problems: Simulating Ocean Currents
  • Model as two-dimensional grids.
  • Discretize in space and time:
  • Finer spatial and temporal resolution => greater
    accuracy.
  • Many different computations per time step:
  • Set up and solve equations.
  • Concurrency across and within grid computations.

Example Motivating Problems: Simulating Galaxy Evolution
  • Simulate the interactions of many stars evolving
    over time.
  • Computing forces is expensive:
  • O(n2) brute force approach.
  • Hierarchical methods take advantage of the force
    law.
  • Many time-steps, plenty of concurrency across
    stars within one.

Example Motivating Problems: Rendering Scenes
by Ray Tracing
  • Shoot rays into scene through pixels in image
    plane.
  • Follow their paths:
  • They bounce around as they strike objects.
  • They generate new rays: ray tree per input ray.
  • Result is color and opacity for that pixel.
  • Parallelism across rays.
  • All above case studies have abundant concurrency.

Limited Concurrency: Amdahl's Law
  • Most fundamental limitation on parallel speedup.
  • If fraction s of sequential execution is
    inherently serial, speedup ≤ 1/s.
  • Example: 2-phase calculation:
  • Sweep over n-by-n grid and do some independent
    computation.
  • Sweep again and add each value to a global sum.
  • Time for first phase = n2/p.
  • Second phase serialized at the global variable,
    so time = n2.
  • Speedup ≤ 2n2 / (n2/p + n2), or at most 2.
  • Possible Trick: divide second phase into two:
  • Accumulate into a private sum during sweep.
  • Add per-process private sums into the global sum.
  • Parallel time is n2/p + n2/p + p, and speedup
    at best 2n2 / (2n2/p + p).
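A tiny C illustration of the two bounds for this
2-phase example (the grid size n is an assumed value):

  #include <stdio.h>

  int main(void)
  {
      double n = 1000.0;                               /* grid dimension (assumed)   */
      for (int p = 2; p <= 1024; p *= 4) {
          double serial   = 2.0 * n * n;               /* both phases sequential     */
          double naive    = n * n / p + n * n;         /* phase 2 fully serialized   */
          double improved = n * n / p + n * n / p + p; /* private sums + p global adds */
          printf("p=%4d  naive speedup=%5.2f  improved speedup=%7.2f\n",
                 p, serial / naive, serial / improved);
      }
      return 0;   /* naive speedup approaches 2; improved approaches p              */
  }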

Amdahl's Law Example: A Pictorial Depiction
Parallel Performance Metrics: Degree of
Parallelism (DOP)
  • For a given time period, DOP reflects the number
    of processors in a specific parallel computer
    actually executing a particular parallel program.
  • Average Parallelism:
  • given maximum parallelism m,
  • n homogeneous processors,
  • computing capacity of a single processor D,
  • Total amount of work W (instructions or
    computations):
    W = D * Integral from t1 to t2 of DOP(t) dt,
    or as a discrete summation
    W = D * Sum over i = 1..m of (i * ti),
    where ti is the total time during which DOP = i
    and the sum of the ti equals t2 - t1.
The average parallelism A:
    A = (1/(t2 - t1)) * Integral from t1 to t2 of DOP(t) dt
In discrete form:
    A = (Sum over i = 1..m of i * ti) /
        (Sum over i = 1..m of ti)
Example: Concurrency Profile of a
Divide-and-Conquer Algorithm
  • Execution observed from t1 = 2 to t2 = 27.
  • Peak parallelism m = 8.
  • A = (1x5 + 2x3 + 3x4 + 4x6 + 5x2 + 6x2 + 8x3) /
    (5 + 3 + 4 + 6 + 2 + 2 + 3)
    = 93/25 = 3.72
(Figure: concurrency profile, degree of parallelism
(DOP) versus time.)
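A short C check of this computation from the
(DOP, duration) pairs of the profile above:

  #include <stdio.h>

  int main(void)
  {
      /* (degree of parallelism, time spent at that degree) from the profile */
      int dop[]  = {1, 2, 3, 4, 5, 6, 8};
      int dur[]  = {5, 3, 4, 6, 2, 2, 3};
      double work = 0.0, span = 0.0;

      for (int i = 0; i < 7; i++) {
          work += (double)dop[i] * dur[i];   /* sum of i * ti              */
          span += dur[i];                    /* observed interval t2 - t1  */
      }
      printf("A = %.0f/%.0f = %.2f\n", work, span, work / span);  /* 93/25 = 3.72 */
      return 0;
  }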
Parallel Performance Example
  • The execution time T for three parallel programs
    is given in terms of processor count P and
    problem size N.
  • In each case, we assume that the total
    computation work performed by an
    optimal sequential algorithm scales as N + N2.
  • For the first parallel algorithm: T = N + N2/P.
  • This algorithm partitions the
    computationally demanding O(N2) component of the
    algorithm but replicates the O(N) component on
    every processor. There are no other sources of
    overhead.
  • For the second parallel algorithm: T = (N + N2)/P
    + 100.
  • This algorithm optimally divides all the
    computation among all processors but introduces
    an additional cost of 100.
  • For the third parallel algorithm: T = (N + N2)/P
    + 0.6P2.
  • This algorithm also partitions all the
    computation optimally but introduces an
    additional cost of 0.6P2.
  • All three algorithms achieve a speedup of about
    10.8 when P = 12 and N = 100. However, they
    behave differently in other situations as shown
    next.
  • With N = 100, all three algorithms perform poorly
    for larger P, although Algorithm (3) does
    noticeably worse than the other two.
  • When N = 1000, Algorithm (2) is much better than
    Algorithm (1) for larger P.

Parallel Performance Example (continued)
All algorithms achieve a speedup of about 10.8 when
P = 12 and N = 100.
With N = 1000, Algorithm (2) performs much better
than Algorithm (1) for larger P.
Algorithm 1: T = N + N2/P
Algorithm 2: T = (N + N2)/P + 100
Algorithm 3: T = (N + N2)/P + 0.6P2
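A small C table generator (illustrative) reproducing
this comparison for an assumed problem size N:

  #include <stdio.h>

  int main(void)
  {
      double N = 100.0;                           /* try 100 and 1000            */
      double seq = N + N * N;                     /* optimal sequential work     */

      printf("   P    Alg1     Alg2     Alg3\n");
      for (int P = 1; P <= 1024; P *= 2) {
          double t1 = N + N * N / P;                  /* replicates the O(N) part    */
          double t2 = (N + N * N) / P + 100.0;        /* fixed overhead of 100       */
          double t3 = (N + N * N) / P + 0.6 * P * P;  /* overhead growing as 0.6 P^2 */
          printf("%4d %7.1f %8.1f %8.1f\n", P, seq / t1, seq / t2, seq / t3);
      }
      return 0;
  }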
Steps in Creating a Parallel Program
  • 4 steps:
  • Decomposition, Assignment, Orchestration, Mapping.
  • Done by programmer or system software (compiler,
    runtime, ...).
  • Issues are the same, so assume the programmer does
    it all explicitly.

Decomposition
  • Break up computation into concurrent tasks to be
    divided among processes:
  • Tasks may become available dynamically.
  • Number of available tasks may vary with time.
  • Together with assignment, also called
    partitioning.
  • i.e. identify concurrency and decide the level
    at which to exploit it.
  • Grain-size problem:
  • To determine the number and size of grains or
    tasks in a parallel program.
  • Problem- and machine-dependent.
  • Solutions involve tradeoffs between parallelism,
    communication and scheduling/synchronization
    overhead.
  • Grain packing:
  • To combine multiple fine-grain nodes into a
    coarse-grain node (task) to reduce communication
    delays and overall scheduling overhead.
  • Goal: Enough tasks to keep processes busy, but
    not too many:
  • Number of tasks available at a time is an upper
    bound on achievable speedup.

Assignment
  • Specifying mechanisms to divide work up among
    processes:
  • Together with decomposition, also called
    partitioning.
  • Balance workload, reduce communication and
    management cost.
  • Partitioning problem:
  • To partition a program into parallel branches,
    modules to give the shortest possible execution
    time on a specific parallel architecture.
  • Structured approaches usually work well:
  • Code inspection (parallel loops) or understanding
    of the application.
  • Well-known heuristics.
  • Static versus dynamic assignment.
  • As programmers, we worry about partitioning
    first:
  • Usually independent of architecture or
    programming model.
  • But cost and complexity of using primitives may
    affect decisions.

Orchestration
  • Naming data.
  • Structuring communication.
  • Synchronization.
  • Organizing data structures and scheduling tasks
    temporally.
  • Goals:
  • Reduce cost of communication and synchronization
    as seen by processors.
  • Preserve locality of data reference (incl. data
    structure organization).
  • Schedule tasks to satisfy dependences early.
  • Reduce overhead of parallelism management.
  • Closest to architecture (and programming model
    and language).
  • Choices depend a lot on communication abstraction,
    efficiency of primitives.
  • Architects should provide appropriate primitives
    efficiently.

Mapping
  • Each task is assigned to a processor in a manner
    that attempts to satisfy the competing goals of
    maximizing processor utilization and minimizing
    communication costs.
  • Mapping can be specified statically or determined
    at runtime by load-balancing algorithms (dynamic
    scheduling).
  • Two aspects of mapping:
  • Which processes will run on the same processor,
    if necessary.
  • Which process runs on which particular processor:
  • mapping to a network topology.
  • One extreme: space-sharing:
  • Machine divided into subsets, only one app at a
    time in a subset.
  • Processes can be pinned to processors, or left to
    the OS.
  • Another extreme: complete resource management
    control to the OS:
  • The OS uses the performance techniques we will
    discuss later.
  • The real world is between the two:
  • User specifies desires in some aspects, system
    may ignore them.

Program Partitioning Example
Example 2.4, page 64; Fig. 2.6, page 65; Fig. 2.7,
page 66, in Advanced Computer Architecture, Hwang.
Static Multiprocessor Scheduling
Dynamic multiprocessor scheduling is an NP-hard
problem. Node Duplication: to eliminate idle
time and communication delays, some nodes may be
duplicated in more than one processor.
Fig. 2.8, page 67; Example 2.5, page 68, in
Advanced Computer Architecture, Hwang.
Successive Refinement
  • Partitioning is often independent of
    architecture, and may be done first:
  • View machine as a collection of communicating
    processors:
  • Balancing the workload.
  • Reducing the amount of inherent communication.
  • Reducing extra work.
  • The above three issues are conflicting.
  • Then deal with interactions with architecture:
  • View machine as an extended memory hierarchy:
  • Extra communication due to architectural
    interactions.
  • Cost of communication depends on how it is
    structured.
  • This may inspire changes in partitioning.

Partitioning for Performance
  • Balancing the workload and reducing wait time at
    synchronization points.
  • Reducing inherent communication.
  • Reducing extra work.
  • These algorithmic issues have extreme trade-offs:
  • Minimize communication => run on 1 processor
    => extreme load imbalance.
  • Maximize load balance => random assignment of
    tiny tasks
    => no control over communication.
  • A good partition may imply extra work to compute
    or manage it.
  • The goal is to compromise between the above
    issues.
  • Fortunately, this is often not difficult in
    practice.

Load Balancing and Synch Wait Time Reduction
  • Limit on speedup:
    Speedup_problem(p) ≤ Sequential Work /
    Max (Work on any processor)
  • Work includes data access and other costs.
  • Not just equal work, but processors must be busy
    at the same time.
  • Four parts to load balancing and reducing synch
    wait time:
  • 1. Identify enough concurrency.
  • 2. Decide how to manage it.
  • 3. Determine the granularity at which to exploit
    it.
  • 4. Reduce serialization and cost of
    synchronization.

Managing Concurrency
  • Static versus Dynamic techniques:
  • Static:
  • Algorithmic assignment based on input; won't
    change.
  • Low runtime overhead.
  • Computation must be predictable.
  • Preferable when applicable (except in
    multiprogrammed/heterogeneous environments).
  • Dynamic:
  • Adapt at runtime to balance load.
  • Can increase communication and reduce locality.
  • Can increase task management overheads.

Dynamic Load Balancing
  • To achieve the best performance of a parallel
    computing system running a parallel problem,
    it is essential to maximize processor utilization
    by distributing the computation load evenly, or
    balancing the load among the available
    processors.
  • Optimal static load balancing, optimal mapping or
    scheduling, is an intractable NP-complete
    problem, except for specific problems on specific
    networks.
  • Hence heuristics are usually used to select
    processors for processes.
  • Even the best static mapping may not offer the
    best execution time due to changing conditions at
    runtime, and the process may need to be done
    dynamically.
  • The methods used for balancing the computational
    load dynamically among processors can be broadly
    classified as:
  • 1. Centralized dynamic load balancing.
  • 2. Decentralized dynamic load balancing.

Processor Load Balance Performance
Dynamic Tasking with Task Queues
  • Centralized versus distributed queues (a
    centralized-queue sketch follows below).
  • Task stealing with distributed queues:
  • Can compromise communication and locality, and
    increase synchronization.
  • Whom to steal from, how many tasks to steal, ...
  • Termination detection.
  • Maximum imbalance related to size of task.
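A minimal sketch of centralized dynamic tasking in C
with POSIX threads (a hypothetical example: a shared
counter acts as the task queue and workers grab the
next task index under a lock):

  #include <pthread.h>
  #include <stdio.h>

  #define NUM_TASKS   64
  #define NUM_WORKERS  4

  static int next_task = 0;                        /* centralized "queue" of task indices */
  static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
  static double results[NUM_TASKS];

  static void do_task(int t) { results[t] = t * 0.5; }   /* stand-in for real work */

  static void *worker(void *arg)
  {
      (void)arg;
      for (;;) {
          pthread_mutex_lock(&qlock);
          int t = next_task < NUM_TASKS ? next_task++ : -1;  /* dequeue next task */
          pthread_mutex_unlock(&qlock);
          if (t < 0)                      /* queue empty: termination detected     */
              return NULL;
          do_task(t);                     /* load balances itself dynamically      */
      }
  }

  int main(void)
  {
      pthread_t w[NUM_WORKERS];
      for (int i = 0; i < NUM_WORKERS; i++)
          pthread_create(&w[i], NULL, worker, NULL);
      for (int i = 0; i < NUM_WORKERS; i++)
          pthread_join(w[i], NULL);
      printf("last result = %g\n", results[NUM_TASKS - 1]);
      return 0;
  }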

Implications of Load Balancing
  • Extends the speedup limit expression to:
    Speedup_problem(p) ≤ Sequential Work /
    Max (Work + Synch Wait Time)
  • Generally, the responsibility of software.
  • Architecture can support task stealing and synch
    efficiently:
  • Fine-grained communication, low-overhead access
    to queues:
  • Efficient support allows smaller tasks, better
    load balancing.
  • Naming logically shared data in the presence of
    task stealing:
  • Need to access data of stolen tasks, esp.
    multiply-stolen tasks.
  • => Hardware shared address space advantageous.
  • Efficient support for point-to-point
    communication.

Reducing Inherent Communication
  • Measure: communication to computation ratio.
  • Focus here is on inherent communication:
  • Determined by assignment of tasks to processes.
  • Actual communication can be greater.
  • Assign tasks that access the same data to the
    same process.
  • Optimal solution to reduce communication and
    achieve an optimal load balance is NP-hard in the
    general case.
  • Simple heuristic solutions work well in practice:
  • Due to the specific structure of applications.

Implications of Communication-to-Computation Ratio
  • Architects must examine application needs:
  • If the denominator is execution time, the ratio
    gives average BW needs.
  • If operation count, it gives extremes in impact of
    latency and bandwidth:
  • Latency: assume no latency hiding.
  • Bandwidth: assume all latency hidden.
  • Reality is somewhere in between.
  • Actual impact of communication depends on
    structure and cost as well.
  • Need to keep communication balanced across
    processors as well.

Reducing Extra Work (Overheads)
  • Common sources of extra work:
  • Computing a good partition:
  • e.g. partitioning in Barnes-Hut or sparse matrix
    computations.
  • Using redundant computation to avoid
    communication.
  • Task, data and process management overhead:
  • Applications, languages, runtime systems, OS.
  • Imposing structure on communication:
  • Coalescing messages, allowing effective naming.
  • Architectural Implications:
  • Reduce need by making communication and
    orchestration efficient.

Extended Memory-Hierarchy View of Multiprocessors
  • Levels in extended hierarchy:
  • Registers, caches, local memory, remote memory.
  • Glued together by the communication architecture.
  • Levels communicate at a certain granularity of
    data transfer.
  • Need to exploit spatial and temporal locality in
    the hierarchy:
  • Otherwise extra communication may also be caused.
  • Especially important since communication is
    expensive.

Extended Hierarchy
  • Idealized view: local cache hierarchy + single
    main memory.
  • But reality is more complex:
  • Centralized Memory: + caches of other processors.
  • Distributed Memory: some local, some remote; +
    network topology.
  • Management of levels:
  • Caches managed by hardware.
  • Main memory depends on programming model:
  • SAS: data movement between local and remote is
    implicit.
  • Message passing: explicit.
  • Improve performance through architecture or
    program locality.
  • Tradeoff with parallelism; need good node
    performance and parallelism.

Artifactual Communication in Extended Hierarchy
  • Accesses not satisfied in the local portion cause
    communication:
  • Inherent communication, implicit or explicit,
    causes transfers:
  • Determined by the program.
  • Artifactual communication:
  • Determined by program implementation and
    architecture interactions:
  • Poor allocation of data across distributed
    memories.
  • Unnecessary data in a transfer.
  • Unnecessary transfers due to system granularities.
  • Redundant communication of data.
  • Finite replication capacity (in cache or main
    memory).
  • Inherent communication assumes unlimited
    capacity, small transfers, perfect knowledge of
    what is needed.
  • More on artifactual communication later; first
    consider replication-induced further
    communication.

Structuring Communication
  • Given the amount of communication (inherent or
    artifactual), the goal is to reduce cost.
  • Cost of communication as seen by a process
    (a direct transcription appears below):
    C = f * (o + l + (nc/m)/B + tc - overlap)
  • f = frequency of messages
  • o = overhead per message (at both ends)
  • l = network delay per message
  • nc = total data sent
  • m = number of messages
  • B = bandwidth along path (determined by network,
    NI, assist)
  • tc = cost induced by contention per message
  • overlap = amount of latency hidden by overlap
    with computation or communication
  • The portion in parentheses is the cost of a
    message (as seen by the processor).
  • That portion, ignoring overlap, is the latency of
    a message.
  • Goal: reduce terms in latency and increase
    overlap.
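A C transcription of this cost model (illustrative;
all parameter values in the example call are made up):

  #include <stdio.h>

  /* Communication cost as seen by a process:
     C = f * (o + l + (nc/m)/B + tc - overlap)          */
  static double comm_cost(double f,       /* frequency (number) of messages  */
                          double o,       /* overhead per message, both ends */
                          double l,       /* network delay per message       */
                          double nc,      /* total data sent                 */
                          double m,       /* number of messages              */
                          double B,       /* bandwidth along the path        */
                          double tc,      /* contention cost per message     */
                          double overlap) /* latency hidden by overlap       */
  {
      return f * (o + l + (nc / m) / B + tc - overlap);
  }

  int main(void)
  {
      /* Hypothetical numbers: 1000 messages, 5 us overhead, 2 us delay,
         8 MB total data, 100 MB/s bandwidth, 1 us contention, no overlap. */
      double c = comm_cost(1000, 5e-6, 2e-6, 8e6, 1000, 100e6, 1e-6, 0);
      printf("communication cost = %g s\n", c);
      return 0;
  }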

Reducing Overhead
  • Can reduce the number of messages m or the
    overhead per message o.
  • o is usually determined by hardware or system
    software:
  • Program should try to reduce m by coalescing
    messages.
  • More control when communication is explicit.
  • Coalescing data into larger messages:
  • Easy for regular, coarse-grained communication.
  • Can be difficult for irregular, naturally
    fine-grained communication:
  • May require changes to algorithm and extra work:
  • Coalescing data and determining what and to whom
    to send.
  • Will discuss more in implications for programming
    models later.

Reducing Network Delay
  • Network delay component = f * h * th:
  • h = number of hops traversed in network
  • th = link/switch latency per hop
  • Reducing f: communicate less, or make messages
    larger.
  • Reducing h:
  • Map communication patterns to network topology:
  • e.g. nearest-neighbor on mesh and ring.
  • How important is this?
  • Used to be a major focus of parallel algorithm
    design.
  • Depends on number of processors, and how th
    compares with other components.
  • Less important on modern machines:
  • Overheads, processor count, multiprogramming.

Overlapping Communication
  • Cannot afford to stall for high latencies.
  • Overlap with computation or communication to hide
    latency.
  • Requires extra concurrency (slackness), higher
    bandwidth.
  • Techniques:
  • Prefetching.
  • Block data transfer.
  • Proceeding past communication.
  • Multithreading.

Summary of Tradeoffs
  • Different goals often have conflicting demands:
  • Load Balance:
  • Fine-grain tasks.
  • Random or dynamic assignment.
  • Communication:
  • Usually coarse-grain tasks.
  • Decompose to obtain locality: not random/dynamic.
  • Extra Work:
  • Coarse-grain tasks.
  • Simple assignment.
  • Communication Cost:
  • Big transfers: amortize overhead and latency.
  • Small transfers: reduce contention.

Relationship Between Perspectives
  • Speedup_prob(p):
  • Goal is to reduce the denominator components.
  • Both programmer and system have a role to play.
  • Architecture cannot do much about load imbalance
    or too much communication.
  • But it can:
  • Reduce the incentive for creating ill-behaved
    programs (efficient naming, communication and
    synchronization).
  • Reduce artifactual communication.
  • Provide efficient naming for flexible assignment.
  • Allow effective overlapping of communication.

Generic Distributed Memory Organization
(Figure: generic distributed-memory organization,
with design questions annotated:)
  • OS supported? Network protocols?
  • Multi-stage interconnection network (MIN)?
    Custom-designed?
  • Global virtual shared address space?
  • Message transaction DMA?
  • Network bandwidth?
  • Bandwidth demand?
  • Independent processes?
  • Communicating processes?
  • Latency? O(log2 P) increase?
  • Cost scalability of system?
  • Node: O(10) bus-based SMP.
  • Custom-designed CPU? Node/system integration
    level? How far? Cray-on-a-Chip? SMP-on-a-Chip?