About This Presentation
Title:

Parallel Computer Architecture

Description:

A parallel computer is a collection of processing elements that cooperate to solve large problems fast. Broad issues involved: resource allocation, data access, communication and synchronization, and performance and scalability.

Number of Views:242
Avg rating:3.0/5.0
Slides: 100
Provided by: Shaaban
Learn more at: http://meseec.ce.rit.edu
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Parallel Computer Architecture


1
Parallel Computer Architecture
  • A parallel computer is a collection of processing
    elements that cooperate to solve large problems
    fast.
  • Broad issues involved:
  • Resource Allocation:
  • Number of processing elements (PEs).
  • Computing power of each element.
  • Amount of physical memory used.
  • Data access, Communication and Synchronization:
  • How the elements cooperate and communicate.
  • How data is transmitted between processors.
  • Abstractions and primitives for cooperation.
  • Performance and Scalability:
  • Performance enhancement from parallelism: speedup.
  • Scalability of performance to larger
    systems/problems.

2
Exploiting Program Parallelism
3
The Need And Feasibility of Parallel Computing
  • Application demands: more computing cycles
  • Scientific computing: CFD, biology, chemistry,
    physics, ...
  • General-purpose computing: video, graphics, CAD,
    databases, transaction processing, gaming
  • Mainstream multithreaded programs are similar to
    parallel programs
  • Technology Trends:
  • Number of transistors on chip growing rapidly
  • Clock rates expected to go up, but only slowly
  • Architecture Trends:
  • Instruction-level parallelism is valuable but
    limited
  • Coarser-level parallelism, as in multiprocessors,
    is the most viable approach
  • Economics:
  • Today's microprocessors have multiprocessor
    support, eliminating the need to design
    expensive custom PEs
  • Lower parallel system cost.
  • Multiprocessor systems offer a cost-effective
    replacement of uniprocessor systems in mainstream
    computing.

4
Scientific Computing Demand
5
Scientific Supercomputing Trends
  • Proving ground and driver for innovative
    architecture and advanced techniques
  • Market is much smaller relative to commercial
    segment
  • Dominated by vector machines starting in the 70s
  • Meanwhile, microprocessors have made huge gains
    in floating-point performance
  • High clock rates.
  • Pipelined floating point units.
  • Instruction-level parallelism.
  • Effective use of caches.
  • Large-scale multiprocessors replace vector
    supercomputers
  • Well under way already

6
Raw Uniprocessor Performance: LINPACK
7
Raw Parallel Performance: LINPACK
8
Parallelism in Microprocessor VLSI Generations
9
The Goal of Parallel Computing
  • Goal of applications in using parallel machines:
    Speedup
  • Speedup (p processors) =
      Performance (p processors) / Performance (1 processor)
  • For a fixed problem size (input data set),
    performance = 1/time
  • Speedup fixed problem (p processors) =
      Time (1 processor) / Time (p processors)

10
Elements of Modern Computers
(Diagram labels: Mapping, Programming, Binding (Compile, Load))
11
Elements of Modern Computers
  • Computing Problems:
  • Numerical computing: science and technology
    numerical problems demand intensive integer and
    floating-point computations.
  • Logical reasoning: artificial intelligence (AI)
    demands logic inferences, symbolic
    manipulations, and large space searches.
  • Algorithms and Data Structures:
  • Special algorithms and data structures are needed
    to specify the computations and communication
    present in computing problems.
  • Most numerical algorithms are deterministic, using
    regular data structures.
  • Symbolic processing may use heuristics or
    non-deterministic searches.
  • Parallel algorithm development requires
    interdisciplinary interaction.

12
Elements of Modern Computers
  • Hardware Resources:
  • Processors, memory, and peripheral devices form
    the hardware core of a computer system.
  • The processor instruction set, processor
    connectivity, and memory organization influence
    the system architecture.
  • Operating Systems:
  • Manages the allocation of resources to running
    processes.
  • Mapping matches algorithmic structures with the
    hardware architecture and vice versa: processor
    scheduling, memory mapping, interprocessor
    communication.
  • Parallelism is exploited at algorithm design,
    program writing, compilation, and run time.

13
Elements of Modern Computers
  • System Software Support:
  • Needed for the development of efficient programs
    in high-level languages (HLLs).
  • Assemblers, loaders.
  • Portable parallel programming languages.
  • User interfaces and tools.
  • Compiler Support:
  • Preprocessor compiler: a sequential compiler plus a
    low-level library of the target parallel
    computer.
  • Precompiler: some program flow analysis,
    dependence checking, and limited optimizations for
    parallelism detection.
  • Parallelizing compiler: can automatically detect
    parallelism in source code and transform
    sequential code into parallel constructs.

14
Approaches to Parallel Programming
(a) Implicit Parallelism
(b) Explicit Parallelism
15
Evolution of Computer Architecture
Legend: I/E = Instruction Fetch and Execute;
SIMD = Single Instruction stream over
Multiple Data streams; MIMD = Multiple
Instruction streams over Multiple
Data streams; MPPs = Massively Parallel
Processors
16
Parallel Architectures History
  • Historically, parallel architectures tied to
    programming models
  • Divergent architectures, with no predictable
    pattern of growth.

17
Programming Models
  • Programming methodology used in coding
    applications
  • Specifies communication and synchronization
  • Examples
  • Multiprogramming
  • No communication or synchronization at
    program level
  • Shared memory address space
  • Message passing
  • Explicit point-to-point communication
  • Data parallel
  • More regimented, global actions on data
  • Implemented with shared address space or message
    passing

18
Flynn's 1972 Classification of Computer
Architecture
  • Single Instruction stream over a Single Data
    stream (SISD): conventional sequential
    machines.
  • Single Instruction stream over Multiple Data
    streams (SIMD): vector computers, arrays of
    synchronized processing elements.
  • Multiple Instruction streams and a Single Data
    stream (MISD): systolic arrays for pipelined
    execution.
  • Multiple Instruction streams over Multiple Data
    streams (MIMD): parallel computers:
  • Shared memory multiprocessors.
  • Multicomputers: unshared distributed memory;
    message passing used instead.

19
Flynn's Classification of Computer Architecture
  • Fig. 1.3, page 12,
    in
  • Advanced Computer Architecture: Parallelism,
    Scalability, Programmability, Hwang, 1993.

20
Current Trends In Parallel Architectures
  • The extension of computer architecture to
    support communication and cooperation:
  • OLD: Instruction Set Architecture
  • NEW: Communication Architecture
  • Defines:
  • Critical abstractions, boundaries, and primitives
    (interfaces)
  • Organizational structures that implement
    interfaces (hardware or software)
  • Compilers, libraries and OS are important bridges
    today

21
Modern Parallel Architecture: Layered Framework
22
Shared Address Space Parallel Architectures
  • Any processor can directly reference any memory
    location
  • Communication occurs implicitly as a result of
    loads and stores
  • Convenient:
  • Location transparency
  • Similar programming model to time-sharing on
    uniprocessors
  • Except processes run on different processors
  • Good throughput on multiprogrammed workloads
  • Naturally provided on a wide range of platforms
  • Wide range of scale: few to hundreds of
    processors
  • Popularly known as the shared memory machine or
    model
  • Ambiguous: memory may be physically distributed
    among processors

23
Shared Address Space (SAS) Model
  • Process: virtual address space plus one or more
    threads of control
  • Portions of the address spaces of processes are shared
  • Writes to shared addresses are visible to other
    threads (in other processes too)
  • Natural extension of the uniprocessor model
  • Conventional memory operations are used for
    communication (see the sketch below)
  • Special atomic operations are needed for
    synchronization
  • The OS uses shared memory to coordinate processes
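To make the SAS model concrete, here is a minimal hedged sketch
in Python (an illustrative addition, not from the slides): the
threads of one process share an address space, so ordinary
loads and stores communicate data, while an atomic primitive
(a lock here) provides the synchronization the model calls for.

    import threading

    shared = {"sum": 0}       # ordinary shared memory: plain reads/writes communicate
    lock = threading.Lock()   # atomic operation used only for synchronization

    def worker(values):
        local = sum(values)           # private computation, no communication
        with lock:                    # synchronize before touching shared state
            shared["sum"] += local    # communicate the result via a shared location

    threads = [threading.Thread(target=worker,
                                args=(range(i * 100, (i + 1) * 100),))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(shared["sum"])              # 0 + 1 + ... + 399 = 79800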

24
Models of Shared-Memory Multiprocessors
  • The Uniform Memory Access (UMA) Model
  • The physical memory is shared by all processors.
  • All processors have equal access to all memory
    addresses.
  • Distributed memory or Nonuniform Memory Access
    (NUMA) Model
  • Shared memory is physically distributed locally
    among processors.
  • The Cache-Only Memory Architecture (COMA) Model
  • A special case of a NUMA machine where all
    distributed main memory is converted to caches.
  • No memory hierarchy at each processor.

25
Models of Shared-Memory Multiprocessors
Uniform Memory Access (UMA) Model
Legend: Interconnect = bus, crossbar, or multistage
network; P = processor; M = memory; C = cache;
D = cache directory
Distributed memory or Nonuniform Memory Access
(NUMA) Model
Cache-Only Memory Architecture (COMA)
26
Uniform Memory Access Example: Intel Pentium Pro
Quad
  • All coherence and multiprocessing glue in
    processor module
  • Highly integrated, targeted at high volume
  • Low latency and bandwidth

27
Uniform Memory Access Example: SUN Enterprise
  • 16 cards of either type: processors + memory, or
    I/O
  • All memory accessed over bus, so symmetric
  • Higher bandwidth, higher latency bus

28
Distributed Shared-Memory Multiprocessor System
Example: Cray T3E
  • Scale up to 1024 processors, 480MB/s links
  • Memory controller generates communication
    requests for nonlocal references
  • No hardware mechanism for coherence (SGI Origin
    etc. provide this)

29
Message-Passing Multicomputers
  • Composed of multiple autonomous computers
    (nodes).
  • Each node consists of a processor, local memory,
    attached storage and I/O peripherals.
  • Programming model more removed from basic
    hardware operations
  • Local memory is only accessible by local
    processors.
  • A message passing network provides point-to-point
    static connections among the nodes.
  • Inter-node communication is carried out by
    message passing through the static connection
    network
  • Process communication achieved using a
    message-passing programming environment.

30
Message-Passing Abstraction
  • Send specifies buffer to be transmitted and
    receiving process
  • Recv specifies sending process and application
    storage to receive into
  • Memory-to-memory copy, but processes must be named
  • Optional tag on send and matching rule on receive
  • User process names local data and entities in
    process/tag space too
  • In its simplest form, the send/recv match achieves a
    pairwise synchronization event
  • Many overheads: copying, buffer management,
    protection

31
Message-Passing Example: IBM SP-2
  • Made out of essentially complete RS6000
    workstations
  • Network interface integrated in I/O bus
    (bandwidth limited by I/O bus)

32
Message-Passing Example: Intel Paragon
33
Message-Passing Programming Tools
  • Message-passing programming libraries include:
  • Message Passing Interface (MPI):
  • Provides a standard for writing concurrent
    message-passing programs (see the sketch below).
  • MPI implementations include parallel libraries
    used by existing programming languages.
  • Parallel Virtual Machine (PVM):
  • Enables a collection of heterogeneous computers
    to be used as a coherent and flexible concurrent
    computational resource.
  • PVM support software executes on each machine in
    a user-configurable pool, and provides a
    computational environment for concurrent
    applications.
  • User programs written, for example, in C, Fortran
    or Java are provided access to PVM through
    calls to PVM library routines.
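As a concrete illustration of the send/recv abstraction, here is
a minimal hedged sketch using the mpi4py Python bindings for MPI
(an illustrative addition, not from the slides; the tag value and
two-process layout are arbitrary choices). Run with something
like: mpiexec -n 2 python demo.py

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()                    # this process's number

    if rank == 0:
        data = list(range(10))
        comm.send(data, dest=1, tag=42)       # send names the receiving process and a tag
    elif rank == 1:
        data = comm.recv(source=0, tag=42)    # recv names the sender; blocks until matched
        print("rank 1 received", sum(data))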

34
Data Parallel Systems: SIMD in Flynn's Taxonomy
  • Programming model:
  • Operations performed in parallel on each element
    of a data structure
  • Logically a single thread of control performs
    sequential or parallel steps
  • Conceptually, a processor is associated with each
    data element
  • Architectural model:
  • Array of many simple, cheap processors, each with
    little memory
  • Processors don't sequence through instructions
  • Attached to a control processor that issues
    instructions
  • Specialized and general communication, cheap
    global synchronization
  • Some recent machines:
  • Thinking Machines CM-1, CM-2 (and CM-5)
  • Maspar MP-1 and MP-2

35
Dataflow Architectures
  • Represent computation as a graph of essential
    dependences
  • A logical processor at each node is activated by the
    availability of operands
  • Messages (tokens) carrying the tag of the next instruction
    are sent to the next processor
  • The tag is compared with others in the matching store; a match
    fires execution

36
Systolic Architectures
  • Replace single processor with an array of regular
    processing elements
  • Orchestrate data flow for high throughput with
    less memory access
  • Different from pipelining:
  • Nonlinear array structure, multidirectional data
    flow, each PE may have (small) local instruction
    and data memory
  • Different from SIMD: each PE may do something
    different
  • Initial motivation: VLSI enables inexpensive
    special-purpose chips
  • Represent algorithms directly by chips connected
    in a regular pattern

37
Parallel Programs
  • Conditions of Parallelism
  • Data Dependence
  • Control Dependence
  • Resource Dependence
  • Bernstein's Conditions
  • Asymptotic Notations for Algorithm Analysis
  • Parallel Random-Access Machine (PRAM)
  • Example: sum algorithm on a p-processor PRAM
  • Network Model of Message-Passing Multicomputers
  • Example: Asynchronous Matrix-Vector Product on a
    Ring
  • Levels of Parallelism in Program Execution
  • Hardware vs. Software Parallelism
  • Parallel Task Grain Size
  • Example Motivating Problems with high levels of
    concurrency
  • Limited Concurrency: Amdahl's Law
  • Parallel Performance Metrics: Degree of
    Parallelism (DOP)
  • Concurrency Profile
  • Steps in Creating a Parallel Program
  • Decomposition, Assignment, Orchestration, Mapping

38
Conditions of Parallelism: Data Dependence
  • True Data or Flow Dependence: A statement S2 is
    data dependent on statement S1 if an execution
    path exists from S1 to S2 and if at least one
    output variable of S1 feeds in as an input
    operand used by S2
  • denoted by S1 → S2
  • Antidependence: Statement S2 is antidependent on
    S1 if S2 follows S1 in program order and if the
    output of S2 overlaps the input of S1
  • denoted by a crossed arrow from S1 to S2
  • Output dependence: Two statements are output
    dependent if they produce the same output
    variable
  • denoted by S1 o→ S2

39
Conditions of Parallelism: Data Dependence
  • I/O dependence: Read and write are I/O
    statements. I/O dependence occurs not because the
    same variable is involved but because the same
    file is referenced by both I/O statements.
  • Unknown dependence:
  • The subscript of a variable is itself subscripted
    (indirect addressing).
  • The subscript does not contain the loop index.
  • A variable appears more than once with subscripts
    having different coefficients of the loop
    variable.
  • The subscript is nonlinear in the loop index
    variable.

40
Data and I/O Dependence Examples

S1: Load R1, A       /R1 ← Memory(A)/
S2: Add R2, R1       /R2 ← (R2) + (R1)/
S3: Move R1, R3      /R1 ← (R3)/
S4: Store B, R1      /Memory(B) ← (R1)/
(Dependence graph of S1–S4)

S1: Read (4), A(I)   /Read array A from tape unit 4/
S2: Rewind (4)       /Rewind tape unit 4/
S3: Write (4), B(I)  /Write array B onto tape unit 4/
S4: Rewind (4)       /Rewind tape unit 4/
I/O dependence caused by accessing the same file
by the read and write statements
41
Conditions of Parallelism
  • Control Dependence:
  • Order of execution cannot be determined before
    runtime due to conditional statements.
  • Resource Dependence:
  • Concerned with conflicts in using shared
    resources, including functional units (integer,
    floating point) and memory areas, among parallel
    tasks.
  • Bernstein's Conditions:
  • Two processes P1, P2 with input sets I1, I2
    and output sets O1, O2 can execute in parallel
    (denoted by P1 || P2) if:
  • I1 ∩ O2 = ∅
  • I2 ∩ O1 = ∅
  • O1 ∩ O2 = ∅

42
Bernstein's Conditions: An Example
  • For the following instructions P1, P2, P3, P4, P5:
  • Instructions are in program order
  • Each instruction requires one step to execute
  • Two adders are available
  • P1: C = D × E
  • P2: M = G + C
  • P3: A = B + C
  • P4: C = L + M
  • P5: F = G ÷ E

Using Bernstein's conditions, after checking
statement pairs: P1 || P5, P2 || P3,
P2 || P5, P5 || P3, P4 || P5
Parallel execution in three steps, assuming two
adders are available per step
(Dependence graph: data dependences shown as solid
lines, resource dependences as dashed lines)
Sequential execution
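The pairwise check is mechanical, so a short hedged Python
sketch (an illustrative addition, not from the slides) can
reproduce the parallel pairs directly from the input and
output variable sets of P1..P5:

    def bernstein_parallel(in1, out1, in2, out2):
        # P1 || P2 iff I1∩O2, I2∩O1 and O1∩O2 are all empty
        return not (in1 & out2) and not (in2 & out1) and not (out1 & out2)

    stmts = {
        "P1": ({"D", "E"}, {"C"}),   # C = D x E
        "P2": ({"G", "C"}, {"M"}),   # M = G + C
        "P3": ({"B", "C"}, {"A"}),   # A = B + C
        "P4": ({"L", "M"}, {"C"}),   # C = L + M
        "P5": ({"G", "E"}, {"F"}),   # F = G / E
    }

    names = list(stmts)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if bernstein_parallel(*stmts[a], *stmts[b]):
                print(a, "||", b)
    # Prints the pairs on the slide: P1||P5, P2||P3, P2||P5, P3||P5, P4||P5
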
43
Asymptotic Notations for Algorithm Analysis
  • Asymptotic analysis of computing time of an
    algorithm f(n) ignores constant execution factors
    and concentrates on determining the order of
    magnitude of algorithm performance.
  • Upper bound:
    Used in worst-case analysis of algorithm
    performance.
  • f(n) = O(g(n))
  • iff there exist two positive constants c
    and n0 such that
  • f(n) ≤ c g(n) for all n > n0
  • ⇒ i.e. g(n) is an upper bound on
    f(n)
  • O(1) < O(log n) < O(n) < O(n log n) <
    O(n²) < O(n³) < O(2ⁿ)

44
Asymptotic Notations for Algorithm Analysis
  • Lower bound:
    Used in the analysis of the lower limit of
    algorithm performance
  • f(n) = Ω(g(n))
  • if there exist positive constants c,
    n0 such that
  • f(n) ≥ c g(n) for all n > n0
  • ⇒ i.e. g(n) is a lower
    bound on f(n)
  • Tight bound:
  • Used in finding a tight limit on algorithm
    performance
  • f(n) = Θ(g(n))
  • if there exist constant positive
    integers c1, c2, and n0 such that
  • c1 g(n) ≤ f(n) ≤ c2 g(n) for all n > n0
  • ⇒ i.e. g(n) is both an upper
    and lower bound on f(n)

45
The Growth Rate of Common Computing Functions
  log n      n    n log n      n²        n³           2ⁿ
    0        1       0          1         1            2
    1        2       2          4         8            4
    2        4       8         16        64           16
    3        8      24         64       512          256
    4       16      64        256      4096        65536
    5       32     160       1024     32768   4294967296
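The table can be regenerated with a short Python sketch (an
illustrative addition, not from the slides; log is taken base 2
and n runs over powers of two):

    header = ["log n", "n", "n log n", "n^2", "n^3", "2^n"]
    print(("{:>12}" * len(header)).format(*header))
    for k in range(6):
        n = 2 ** k
        row = [k, n, n * k, n ** 2, n ** 3, 2 ** n]
        print(("{:>12}" * len(row)).format(*row))
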
46
Theoretical Models of Parallel Computers
  • Parallel Random-Access Machine (PRAM):
  • n-processor, global shared-memory model.
  • Models idealized parallel computers with zero
    synchronization or memory access overhead.
  • Used in parallel algorithm development and in
    scalability and complexity analysis.
  • PRAM variants: more realistic models than pure
    PRAM:
  • EREW-PRAM: Simultaneous memory reads or writes
    to/from the same memory location are not allowed.
  • CREW-PRAM: Simultaneous reads are allowed, but
    simultaneous writes to the same location are not.
  • ERCW-PRAM: Simultaneous reads from the same
    memory location are not allowed, but simultaneous
    writes are.
  • CRCW-PRAM: Concurrent reads or writes to/from
    the same memory location are allowed.

47
Example: Sum Algorithm on a p-Processor PRAM
  • Input: Array A of size n = 2^k
    in shared memory
  • Initialized local variables:
  • the order n,
  • the number of processors p = 2^q ≤ n,
  • the processor number s
  • Output: The sum of the elements
    of A, stored in shared memory

begin
 1.  for j = 1 to l (= n/p) do
         Set B(l(s - 1) + j) := A(l(s - 1) + j)
 2.  for h = 1 to log n do
     2.1  if (k - h - q ≥ 0) then
              for j = 2^(k-h-q)(s - 1) + 1 to 2^(k-h-q)s do
                  Set B(j) := B(2j - 1) + B(2j)
     2.2  else if (s ≤ 2^(k-h)) then
              Set B(s) := B(2s - 1) + B(2s)
 3.  if (s = 1) then set S := B(1)
end

  • Running time analysis:
  • Step 1 takes O(n/p): each processor executes
    n/p operations
  • The hth iteration of step 2 takes O(n/(2^h p)), since each
    processor has to perform (n/(2^h p)) operations
  • Step 3 takes O(1)
  • Total running time: T(n) = O(n/p + log n)
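The data flow of step 2 is a pairwise (tree) reduction. A hedged
sequential Python sketch of that reduction is shown below (an
illustrative addition, not from the slides); it only mimics the
combining pattern, whereas a real PRAM would execute each
round's operations in parallel:

    import math

    def pram_style_sum(A):
        n = len(A)                     # assume n is a power of two
        B = [0] * (n + 1)              # 1-based working array, as in the pseudocode
        for j in range(1, n + 1):      # step 1: copy A into B
            B[j] = A[j - 1]
        for h in range(1, int(math.log2(n)) + 1):   # step 2: log n combining rounds
            for j in range(1, n // 2 ** h + 1):
                B[j] = B[2 * j - 1] + B[2 * j]
        return B[1]                    # step 3: the sum ends up in B(1)

    print(pram_style_sum(list(range(1, 9))))   # 1 + 2 + ... + 8 = 36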

48
Example: Sum Algorithm on a p-Processor PRAM
For n = 8, p = 4: processor allocation for
computing the sum of 8 elements on a 4-processor
PRAM
(Figure: computation tree over time units 1-5; the
operation represented by each node is executed by
the processor indicated below the node.)
49
The Power of The PRAM Model
  • Well-developed techniques and algorithms to
    handle many computational problems exist for the
    PRAM model
  • Removes algorithmic details regarding
    synchronization and communication, concentrating
    on the structural properties of the problem.
  • Captures several important parameters of parallel
    computations: operations performed in unit time,
    as well as processor allocation.
  • The PRAM design paradigms are robust and many
    network algorithms can be directly derived from
    PRAM algorithms.
  • It is possible to incorporate synchronization and
    communication into the shared-memory PRAM model.

50
Performance of Parallel Algorithms
  • Performance of a parallel algorithm is typically
    measured in terms of worst-case analysis.
  • For problem Q with a PRAM algorithm that runs in
    time T(n) using P(n) processors, for an instance
    of size n:
  • The time-processor product C(n) = T(n) · P(n)
    represents the cost of the parallel algorithm.
  • For p < P(n), each of the T(n) parallel
    steps is simulated in O(P(n)/p) substeps; the total
    simulation takes O(T(n)P(n)/p).
  • The following four measures of performance are
    asymptotically equivalent:
  • P(n) processors and T(n) time
  • C(n) = P(n)T(n) cost and T(n) time
  • O(T(n)P(n)/p) time for any number of processors p
    < P(n)
  • O(C(n)/p + T(n)) time for any number of
    processors.

51
Network Model of Message-Passing Multicomputers
  • A network of processors can be viewed as a graph
    G = (N, E)
  • Each node i ∈ N represents a processor
  • Each edge (i, j) ∈ E represents a two-way
    communication link between processors i and j.
  • Each processor is assumed to have its own local
    memory.
  • No shared memory is available.
  • Operation is synchronous or asynchronous (message
    passing).
  • Typical message-passing communication constructs:
  • send(X, i): a copy of X is sent to processor Pi;
    execution continues.
  • receive(Y, j): execution is suspended until the data
    from processor Pj is received and stored in Y;
    then execution resumes.

52
Network Model of Multicomputers
  • Routing is concerned with delivering each message
    from source to destination over the network.
  • Additional important network topology parameters:
  • The network diameter is the maximum distance
    between any pair of nodes.
  • The maximum degree of any node in G.
  • Example:
  • Linear array: p processors P1, ..., Pp are
    connected in a linear array where
  • Processor Pi is connected to Pi-1 and Pi+1 if
    they exist.
  • Diameter is p-1; maximum degree is 2.
  • A ring is a linear array of processors where
    processors P1 and Pp are directly connected.

53
A Four-Dimensional Hypercube
  • Two processors are connected if their binary
    indices differ in one bit position.

54
Example: Asynchronous Matrix-Vector Product on a
Ring
  • Input:
  • n x n matrix A; vector x of order n
  • The processor number i; the number of
    processors p
  • The ith submatrix B = A(1:n, (i-1)r + 1 : ir) of
    size n x r, where r = n/p
  • The ith subvector w = x((i-1)r + 1 : ir) of size
    r
  • Output:
  • Processor Pi computes the vector y = A1x1 + ... +
    Aixi and passes the result to the right
  • Upon completion, P1 will hold the product Ax
  • Begin
  • 1. Compute the matrix-vector product z = Bw
  • 2. If i = 1 then set y := 0
  •        else receive(y, left)
  • 3. Set y := y + z
  • 4. send(y, right)
  • 5. if i = 1 then receive(y, left)
  • End
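A hedged sequential simulation of this ring algorithm in Python
with NumPy follows (an illustrative addition, not from the
slides); each loop iteration plays the role of one processor,
and carrying y into the next iteration models send(y, right):

    import numpy as np

    def ring_matvec(A, x, p):
        n = A.shape[0]
        r = n // p                            # assume p divides n
        y = None
        for i in range(1, p + 1):             # simulate the ring hop by hop
            B = A[:, (i - 1) * r : i * r]     # processor i's column block
            w = x[(i - 1) * r : i * r]        # processor i's piece of x
            z = B @ w                         # step 1: local product
            y = z if i == 1 else y + z        # steps 2-3: accumulate partial sum
            # step 4: send(y, right) is modeled by the next iteration
        return y                              # step 5: P1 receives y = Ax

    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8))
    x = rng.standard_normal(8)
    print(np.allclose(ring_matvec(A, x, 4), A @ x))   # True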

55
Creating a Parallel Program
  • Assumption: A sequential algorithm to solve the
    problem is given
  • Or a different algorithm with more inherent
    parallelism is devised.
  • Most programming problems have several parallel
    solutions. The best solution may differ from that
    suggested by existing sequential algorithms.
  • One must:
  • Identify work that can be done in parallel
  • Partition work and perhaps data among processes
  • Manage data access, communication and
    synchronization
  • Note: work includes computation, data access and
    I/O
  • Main goal: Speedup (plus low programming
    effort and resource needs)
  • Speedup (p) =
      Performance (p) / Performance (1)
  • For a fixed problem:
  • Speedup (p) = Time (1) / Time (p)

56
Some Important Concepts
  • Task:
  • Arbitrary piece of undecomposed work in a parallel
    computation
  • Executed sequentially on a single processor;
    concurrency is only across tasks
  • E.g. a particle/cell in Barnes-Hut, a ray or ray
    group in Raytrace
  • Fine-grained versus coarse-grained tasks
  • Process (thread):
  • Abstract entity that performs the tasks assigned
    to processes
  • Processes communicate and synchronize to perform
    their tasks
  • Processor:
  • Physical engine on which a process executes
  • Processes virtualize the machine to the programmer:
  • first write the program in terms of processes, then
    map to processors

57
Levels of Parallelism in Program Execution

(Diagram: levels of parallelism in program execution,
from coarse grain through medium grain to fine grain,
annotated with increasing communication demand and
mapping/scheduling overhead and with higher degree of
parallelism.)
58
Hardware and Software Parallelism
  • Hardware parallelism
  • Defined by machine architecture, hardware
    multiplicity (number of processors available) and
    connectivity.
  • Often a function of cost/performance tradeoffs.
  • Characterized in a single processor by the number
    of instructions k issued in a single cycle
    (k-issue processor).
  • A multiprocessor system with n k-issue
    processors can handle a maximum of nk
    threads.
  • Software parallelism
  • Defined by the control and data dependence of
    programs.
  • Revealed in program profiling or program flow
    graph.
  • A function of algorithm, programming style and
    compiler optimization.

59
Computational Parallelism and Grain Size
  • Grain size (granularity) is a measure of the
    amount of computation involved in a task in a
    parallel computation.
  • Instruction level:
  • At the instruction or statement level.
  • Grain size of 20 instructions or less.
  • For scientific applications, parallelism at this
    level ranges from 500 to 3000 concurrent
    statements.
  • Manual parallelism detection is difficult but is
    assisted by parallelizing compilers.
  • Loop level:
  • Iterative loop operations.
  • Typically, 500 instructions or less per
    iteration.
  • Optimized on vector parallel computers.
  • Independent successive loop operations can be
    vectorized or run in SIMD mode.

60
Computational Parallelism and Grain Size
  • Procedure level:
  • Medium-sized grain: task, procedure, and subroutine
    levels.
  • Less than 2000 instructions.
  • Parallelism detection is more difficult than at
    finer-grain levels.
  • Lower communication requirements than fine-grain
    parallelism.
  • Relies heavily on effective operating system
    support.
  • Subprogram level:
  • Job and subprogram level.
  • Thousands of instructions per grain.
  • Often scheduled on message-passing
    multicomputers.
  • Job (program) level, or multiprogramming:
  • Independent programs executed on a parallel
    computer.
  • Grain size in tens of thousands of instructions.

61
Example Motivating Problems: Simulating Ocean
Currents
  • Model as two-dimensional grids
  • Discretize in space and time
  • finer spatial and temporal resolution => greater
    accuracy
  • Many different computations per time step
  • set up and solve equations
  • Concurrency across and within grid computations

62
Example Motivating Problems Simulating Galaxy
Evolution
  • Simulate the interactions of many stars evolving
    over time
  • Computing forces is expensive
  • O(n2) brute force approach
  • Hierarchical Methods take advantage of force law
    G

m1m2
r2
  • Many time-steps, plenty of concurrency across
    stars within one

63
Example Motivating Problems: Rendering Scenes
by Ray Tracing
  • Shoot rays into the scene through pixels in the image
    plane
  • Follow their paths
  • They bounce around as they strike objects
  • They generate new rays: a ray tree per input ray
  • Result is color and opacity for that pixel
  • Parallelism across rays
  • All the above case studies have abundant concurrency

64
Limited Concurrency: Amdahl's Law
  • Most fundamental limitation on parallel speedup.
  • If fraction s of sequential execution is
    inherently serial, speedup ≤ 1/s
  • Example: 2-phase calculation
  • sweep over n-by-n grid and do some independent
    computation
  • sweep again and add each value to a global sum
  • Time for first phase = n²/p
  • Second phase serialized at the global variable, so
    its time = n²
  • Speedup ≤ 2n² / (n²/p + n²), or at most 2
  • Possible trick: divide the second phase into two
  • Accumulate into a private sum during the sweep
  • Add per-process private sums into the global sum
  • Parallel time is n²/p + n²/p + p, and speedup is
    at best 2n²p / (2n² + p²) (see the numeric
    sketch below)
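A hedged numeric sketch of this two-phase example in Python (an
illustrative addition, not from the slides), comparing the
serialized second phase with the private-sum version:

    def speedup_serialized(n, p):
        seq = 2 * n * n                 # two full sweeps done sequentially
        par = n * n / p + n * n         # phase 1 parallel, phase 2 serialized
        return seq / par

    def speedup_private_sums(n, p):
        seq = 2 * n * n
        par = 2 * n * n / p + p         # both sweeps parallel, plus p serial adds
        return seq / par

    n = 1000
    for p in (4, 16, 64, 256):
        print(p, round(speedup_serialized(n, p), 2),
              round(speedup_private_sums(n, p), 2))
    # The first speedup saturates below 2; the second stays close to p.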

65
Amdahl's Law Example: A Pictorial Depiction
66
Parallel Performance Metrics: Degree of
Parallelism (DOP)
  • For a given time period, DOP reflects the number
    of processors in a specific parallel computer
    actually executing a particular parallel
    program.
  • Average Parallelism:
  • given maximum parallelism m
  • n homogeneous processors
  • computing capacity of a single processor Δ
  • Total amount of work W (instructions,
    computations):
      W = Δ ∫ DOP(t) dt   (integrated from t1 to t2)
  • or, as a discrete summation:
      W = Δ Σ i·t_i   (i = 1 to m),
    where t_i is the total time during which DOP = i
    and Σ t_i = t2 - t1

The average parallelism A:
      A = (1 / (t2 - t1)) ∫ DOP(t) dt   (from t1 to t2)
In discrete form:
      A = (Σ i·t_i) / (Σ t_i)   (i = 1 to m)
67
Example: Concurrency Profile of a
Divide-and-Conquer Algorithm
  • Execution observed from t1 = 2 to t2 = 27
  • Peak parallelism m = 8
  • A = (1x5 + 2x3 + 3x4 + 4x6 + 5x2 + 6x2 + 8x3) /
        (5 + 3 + 4 + 6 + 2 + 2 + 3)
  •   = 93/25 = 3.72

(Figure: concurrency profile plotting degree of
parallelism (DOP) against time from t1 to t2)
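A hedged recomputation of the average parallelism A for this
profile in Python (an illustrative addition, not from the
slides), where t_i is the total time spent at DOP = i:

    profile = {1: 5, 2: 3, 3: 4, 4: 6, 5: 2, 6: 2, 8: 3}   # DOP -> time units

    work = sum(dop * t for dop, t in profile.items())       # 93 processor-time units
    elapsed = sum(profile.values())                          # 25 time units (t2 - t1)
    print(work / elapsed)                                    # 3.72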
68
Parallel Performance Example
  • The execution time T for three parallel programs
    is given in terms of processor count P and
    problem size N.
  • In each case, we assume that the total
    computation work performed by an
    optimal sequential algorithm scales as N + N².
  • For the first parallel algorithm: T = N + N²/P
  • This algorithm partitions the
    computationally demanding O(N²) component of the
    algorithm but replicates the O(N) component on
    every processor. There are no other sources of
    overhead.
  • For the second parallel algorithm: T = (N + N²)/P
    + 100
  • This algorithm optimally divides all the
    computation among all processors but introduces
    an additional cost of 100.
  • For the third parallel algorithm: T = (N + N²)/P
    + 0.6P²
  • This algorithm also partitions all the
    computation optimally but introduces an
    additional cost of 0.6P².
  • All three algorithms achieve a speedup of about
    10.8 when P = 12 and N = 100. However, they
    behave differently in other situations, as shown
    next and in the sketch below.
  • With N = 100, all three algorithms perform poorly
    for larger P, although Algorithm (3) does
    noticeably worse than the other two.
  • When N = 1000, Algorithm (2) is much better than
    Algorithm (1) for larger P.
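A hedged Python sketch computing the speedups of the three
algorithms (an illustrative addition, not from the slides; the
sequential time is taken as N + N²):

    def speedups(N, P):
        seq = N + N ** 2
        t1 = N + N ** 2 / P                    # algorithm 1: O(N) part replicated
        t2 = (N + N ** 2) / P + 100            # algorithm 2: fixed overhead of 100
        t3 = (N + N ** 2) / P + 0.6 * P ** 2   # algorithm 3: overhead grows as 0.6 P²
        return seq / t1, seq / t2, seq / t3

    print([round(s, 1) for s in speedups(100, 12)])     # about [10.8, 10.7, 10.9]
    print([round(s, 1) for s in speedups(1000, 128)])   # algorithm 2 leads, 3 falls behind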

69
Parallel Performance Example (continued)
All algorithms achieve Speedup = 10.8 when P
= 12 and N = 100
With N = 1000, Algorithm (2) performs much better
than Algorithm (1) for larger P.
Algorithm 1: T = N + N²/P
Algorithm 2: T = (N + N²)/P + 100
Algorithm 3: T = (N + N²)/P + 0.6P²
70
Steps in Creating a Parallel Program
  • 4 steps
  • Decomposition, Assignment, Orchestration,
    Mapping
  • Done by programmer or system software (compiler,
    runtime, ...)
  • Issues are the same, so assume programmer does it
    all explicitly

71
Decomposition
  • Break up computation into concurrent tasks to be
    divided among processes
  • Tasks may become available dynamically.
  • No. of available tasks may vary with time.
  • Together with assignment, also called
    partitioning.
  • i.e., identify concurrency and decide the level
    at which to exploit it.
  • Grain-size problem:
  • To determine the number and size of grains or
    tasks in a parallel program.
  • Problem- and machine-dependent.
  • Solutions involve tradeoffs between parallelism,
    communication and scheduling/synchronization
    overhead.
  • Grain packing:
  • To combine multiple fine-grain nodes into a
    coarse-grain node (task) to reduce communication
    delays and overall scheduling overhead.
  • Goal: enough tasks to keep processes busy, but
    not too many.
  • The number of tasks available at a time is an upper bound
    on the achievable speedup.

72
Assignment
  • Specifying mechanisms to divide work up among
    processes
  • Together with decomposition, also called
    partitioning.
  • Balance workload, reduce communication and
    management cost
  • Partitioning problem:
  • To partition a program into parallel branches and
    modules to give the shortest possible execution
    time on a specific parallel architecture.
  • Structured approaches usually work well
  • Code inspection (parallel loops) or understanding
    of application.
  • Well-known heuristics.
  • Static versus dynamic assignment.
  • As programmers, we worry about partitioning
    first
  • Usually independent of architecture or
    programming model.
  • But cost and complexity of using primitives may
    affect decisions.

73
Orchestration
  • Naming data.
  • Structuring communication.
  • Synchronization.
  • Organizing data structures and scheduling tasks
    temporally.
  • Goals:
  • Reduce cost of communication and synchronization as
    seen by processors
  • Preserve locality of data reference (incl. data
    structure organization)
  • Schedule tasks to satisfy dependences early
  • Reduce overhead of parallelism management
  • Closest to the architecture (and the programming model
    and language).
  • Choices depend a lot on comm. abstraction,
    efficiency of primitives.
  • Architects should provide appropriate primitives
    efficiently.

74
Mapping
  • Each task is assigned to a processor in a manner
    that attempts to satisfy the competing goals of
    maximizing processor utilization and minimizing
    communication costs.
  • Mapping can be specified statically or determined
    at runtime by load-balancing algorithms (dynamic
    scheduling).
  • Two aspects of mapping:
  • Which processes will run on the same processor,
    if necessary
  • Which process runs on which particular processor
  • mapping to a network topology
  • One extreme: space-sharing
  • Machine divided into subsets, only one app at a
    time in a subset
  • Processes can be pinned to processors, or left to
    the OS.
  • Another extreme: complete resource management
    control given to the OS
  • The OS uses the performance techniques we will
    discuss later.
  • The real world is between the two.
  • The user specifies desires in some aspects; the system
    may ignore them

75
Program Partitioning Example
Example 2.4, page 64; Fig. 2.6, page 65; Fig. 2.7,
page 66, in Advanced Computer Architecture, Hwang
76
Static Multiprocessor Scheduling
Dynamic multiprocessor scheduling is an NP-hard
problem. Node duplication: to eliminate idle
time and communication delays, some nodes may be
duplicated in more than one processor.
Fig. 2.8, page 67; Example 2.5, page 68, in
Advanced Computer Architecture, Hwang
77
(No Transcript)
78
Successive Refinement
  • Partitioning is often independent of the
    architecture, and may be done first:
  • View the machine as a collection of communicating
    processors
  • Balancing the workload.
  • Reducing the amount of inherent communication.
  • Reducing extra work.
  • The above three issues conflict.
  • Then deal with interactions with the architecture:
  • View the machine as an extended memory hierarchy
  • Extra communication due to architectural
    interactions.
  • Cost of communication depends on how it is
    structured.
  • This may inspire changes in partitioning.

79
Partitioning for Performance
  • Balancing the workload and reducing wait time at
    synch points
  • Reducing inherent communication.
  • Reducing extra work.
  • These algorithmic issues have extreme trade-offs:
  • Minimize communication => run on 1 processor
    => extreme load imbalance.
  • Maximize load balance => random assignment of
    tiny tasks
    => no control over communication.
  • A good partition may imply extra work to compute or
    manage it.
  • The goal is to compromise between the above
    extremes
  • Fortunately, this is often not difficult in practice.

80
Load Balancing and Synch Wait Time Reduction
  • Limit on speedup:
      Speedup_problem(p) ≤ Sequential Work /
                           Max Work on any Processor
  • Work includes data access and other costs.
  • Not just equal work: processors must also be busy at the
    same time.
  • Four parts to load balancing and reducing synch
    wait time:
  • 1. Identify enough concurrency.
  • 2. Decide how to manage it.
  • 3. Determine the granularity at which to exploit
    it.
  • 4. Reduce serialization and the cost of
    synchronization.

81
Managing Concurrency
  • Static versus dynamic techniques
  • Static:
  • Algorithmic assignment based on input; won't
    change
  • Low runtime overhead
  • Computation must be predictable
  • Preferable when applicable (except in
    multiprogrammed/heterogeneous environments)
  • Dynamic:
  • Adapt at runtime to balance load
  • Can increase communication and reduce locality
  • Can increase task management overheads

82
Dynamic Load Balancing
  • To achieve the best performance of a parallel
    computing system running a parallel problem,
    it is essential to maximize processor utilization
    by distributing the computation load evenly, or
    balancing the load, among the available
    processors.
  • Optimal static load balancing (optimal mapping or
    scheduling) is an intractable NP-complete
    problem, except for specific problems on specific
    networks.
  • Hence heuristics are usually used to select
    processors for processes.
  • Even the best static mapping may not offer the best
    execution time due to changing conditions at
    runtime, and the process may need to be done
    dynamically.
  • The methods used for balancing the computational
    load dynamically among processors can be broadly
    classified as:
  • 1. Centralized dynamic load balancing.
  • 2. Decentralized dynamic load balancing.

83
Processor Load Balance Performance
84
Dynamic Tasking with Task Queues
  • Centralized versus distributed queues.
  • Task stealing with distributed queues.
  • Can compromise communication and locality, and
    increase synchronization.
  • Whom to steal from, how many tasks to steal, ...
  • Termination detection
  • Maximum imbalance is related to task size (see the sketch below)
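A hedged sketch of centralized dynamic tasking in Python (an
illustrative addition, not from the slides): workers repeatedly
pull tasks from one shared queue, so faster workers
automatically take more tasks and the load balances itself at
runtime.

    import queue
    import threading

    tasks = queue.Queue()
    for size in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]:    # arbitrary task "sizes"
        tasks.put(size)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                size = tasks.get_nowait()          # grab the next available task
            except queue.Empty:
                return                             # simple termination detection
            value = sum(range(size * 1000))        # stand-in for real computation
            with results_lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(results))                            # all 10 tasks completed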

85
Implications of Load Balancing
  • Extends the speedup limit expression to:
      Speedup_problem(p) ≤ Sequential Work /
                           Max (Work + Synch Wait Time)
  • Generally the responsibility of software
  • Architecture can support task stealing and
    synchronization efficiently
  • Fine-grained communication, low-overhead access
    to queues
  • Efficient support allows smaller tasks, better
    load balancing
  • Naming logically shared data in the presence of
    task stealing
  • Need to access data of stolen tasks, esp.
    multiply-stolen tasks
  • => a hardware shared address space is advantageous
  • Efficient support for point-to-point
    communication

86
Reducing Inherent Communication
  • Measure: communication-to-computation ratio
  • Focus here is on inherent communication
  • Determined by the assignment of tasks to processes
  • Actual communication can be greater
  • Assign tasks that access the same data to the same
    process
  • An optimal solution that reduces communication and
    achieves an optimal load balance is NP-hard in the
    general case
  • Simple heuristic solutions work well in practice
  • Due to the specific structure of applications.

87
Implications of Communication-to-Computation Ratio
  • Architects must examine application needs
  • If denominator is execution time, ratio gives
    average BW needs
  • If operation count, gives extremes in impact of
    latency and bandwidth
  • Latency: assume no latency hiding
  • Bandwidth: assume all latency hidden
  • Reality is somewhere in between
  • Actual impact of communication depends on
    structure and cost as well
  • Need to keep communication balanced across
    processors as well.

88
Reducing Extra Work (Overheads)
  • Common sources of extra work:
  • Computing a good partition
  • e.g. partitioning in Barnes-Hut or sparse
    matrix
  • Using redundant computation to avoid
    communication
  • Task, data and process management overhead
  • Applications, languages, runtime systems, OS
  • Imposing structure on communication
  • Coalescing messages, allowing effective naming
  • Architectural implications:
  • Reduce need by making communication and
    orchestration efficient

89
Extended Memory-Hierarchy View of Multiprocessors
  • Levels in the extended hierarchy:
  • Registers, caches, local memory, remote memory
    (topology)
  • Glued together by the communication architecture
  • Levels communicate at a certain granularity of
    data transfer
  • Need to exploit spatial and temporal locality in the
    hierarchy
  • Otherwise, extra communication may also be caused
  • Especially important since communication is
    expensive

90
Extended Hierarchy
  • Idealized view: local cache hierarchy + single
    main memory
  • But reality is more complex:
  • Centralized memory: + caches of other processors
  • Distributed memory: some local, some remote;
    + network topology
  • Management of levels:
  • Caches managed by hardware
  • Main memory depends on the programming model:
  • SAS: data movement between local and remote is
    transparent
  • Message passing: explicit
  • Improve performance through architecture or
    program locality
  • Tradeoff with parallelism: need good node
    performance and parallelism

91
Artifactual Communication in Extended Hierarchy
  • Accesses not satisfied in the local portion cause
    communication
  • Inherent communication, implicit or explicit,
    causes transfers
  • Determined by the program
  • Artifactual communication:
  • Determined by program implementation and
    architecture interactions
  • Poor allocation of data across distributed
    memories
  • Unnecessary data in a transfer
  • Unnecessary transfers due to system granularities
  • Redundant communication of data
  • Finite replication capacity (in cache or main
    memory)
  • Inherent communication assumes unlimited
    capacity, small transfers, perfect knowledge of
    what is needed.
  • More on artifactual communication later; first
    consider replication-induced communication further

92
Structuring Communication
  • Given an amount of communication (inherent or
    artifactual), the goal is to reduce its cost
  • Cost of communication as seen by the process:
      C = f * (o + l + (n_c / m)/B + t_c - overlap)
  • f = frequency of messages
  • o = overhead per message (at both ends)
  • l = network delay per message
  • n_c = total data sent
  • m = number of messages
  • B = bandwidth along the path (determined by network,
    NI, assist)
  • t_c = cost induced by contention per message
  • overlap = amount of latency hidden by overlap
    with computation or other communication
  • The portion in parentheses is the cost of a message (as
    seen by the processor)
  • That portion, ignoring overlap, is the latency of a
    message
  • Goal: reduce the terms in latency and increase
    overlap (see the sketch below)
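A hedged Python sketch evaluating this cost expression for some
made-up parameter values (an illustrative addition, not from
the slides), to show how coalescing the same data into fewer,
larger messages changes the result:

    def comm_cost(f, o, l, n_c, m, B, t_c, overlap):
        per_message = o + l + (n_c / m) / B + t_c
        return f * (per_message - overlap)

    # Same total data n_c sent as 1000 small messages or 10 large ones;
    # all numbers are arbitrary illustrative units.
    small = comm_cost(f=1000, o=5.0, l=2.0, n_c=1e6, m=1000, B=100.0,
                      t_c=1.0, overlap=0.0)
    large = comm_cost(f=10, o=5.0, l=2.0, n_c=1e6, m=10, B=100.0,
                      t_c=1.0, overlap=0.0)
    print(small, large)   # the coalesced version pays far less total overhead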

93
Reducing Overhead
  • Can reduce no. of messages m or overhead per
    message o
  • o is usually determined by hardware or system
    software
  • Program should try to reduce m by coalescing
    messages
  • More control when communication is explicit
  • Coalescing data into larger messages
  • Easy for regular, coarse-grained communication
  • Can be difficult for irregular, naturally
    fine-grained communication
  • May require changes to algorithm and extra work
  • coalescing data and determining what and to whom
    to send
  • Will discuss more in implications for programming
    models later

94
Reducing Network Delay
  • Network delay component = f · h · t_h
  • h = number of hops traversed in the network
  • t_h = link + switch latency per hop
  • Reducing f: communicate less, or make messages
    larger
  • Reducing h:
  • Map communication patterns to the network topology
  • e.g. nearest-neighbor on mesh and ring;
    all-to-all
  • How important is this?
  • It used to be a major focus of parallel algorithms
  • Depends on the number of processors and on how t_h
    compares with other components
  • Less important on modern machines
  • Overheads, processor count, multiprogramming

95
Overlapping Communication
  • Cannot afford to stall for high latencies
  • Overlap with computation or communication to hide
    latency
  • Requires extra concurrency (slackness), higher
    bandwidth
  • Techniques
  • Prefetching
  • Block data transfer
  • Proceeding past communication
  • Multithreading

96
Summary of Tradeoffs
  • Different goals often have conflicting demands:
  • Load Balance:
  • Fine-grain tasks
  • Random or dynamic assignment
  • Communication:
  • Usually coarse-grain tasks
  • Decompose to obtain locality: not random/dynamic
  • Extra Work:
  • Coarse-grain tasks
  • Simple assignment
  • Communication Cost:
  • Big transfers: amortize overhead and latency
  • Small transfers: reduce contention

97
Relationship Between Perspectives
98
Summary
  • Speedup_prob(p) ≤ Sequential Work /
      Max (Work + Synch Wait Time + Comm Cost + Extra Work)
  • Goal is to reduce the denominator components
  • Both programmer and system have a role to play
  • Architecture cannot do much about load imbalance
    or too much communication
  • But it can:
  • reduce the incentive for creating ill-behaved
    programs (efficient naming, communication and
    synchronization)
  • reduce artifactual communication
  • provide efficient naming for flexible assignment
  • allow effective overlapping of communication

99
Generic Distributed Memory Organization
OS Supported? Network protocols?
Multi-stage interconnection network
(MIN)? Custom-designed?
Global virtual Shared address space?
Message transaction DMA?
  • Network bandwidth?
  • Bandwidth demand?
  • Independent processes?
  • Communicating processes?
  • Latency? O(log2P) increase?
  • Cost scalability of system?

Node: O(10) bus-based SMP
Custom-designed CPU? Node/system integration
level? How far? Cray-on-a-Chip? SMP-on-a-Chip?