A Survey of Parallel Computer Architectures - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
A Survey of Parallel Computer Architectures
  • CSC521 Advanced Computer Architecture
  • Dr. Craig Reinhart
  • The General and Logical Theory of Automata
  • Lin, Shu-Hsien (ANDY)
  • 4/24/2008

2
What are parallel computer architectures?
  • Instruction pipelining
  • Multiple CPU functional units
  • Separate CPU and I/O processors

3
Instruction Pipelining
  • Instruction pipelining decomposes instruction
    execution into a linear series of autonomous
    stages, allowing each stage to simultaneously
    perform a portion of the execution process (such
    as decoding, calculating the effective address,
    fetching operands, executing, and storing).

4
Pipelining vs. Single-Cycle Processors
  • A single-cycle processor takes 16 nanoseconds to
    execute four instructions, while a pipelined
    processor takes only 7 nanoseconds.
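  • The arithmetic behind those numbers, as a minimal Python sketch;
    the four 1 ns stages and the 4 ns single-cycle datapath are
    assumptions inferred from the figures:

    STAGES = 4          # assumed pipeline depth
    STAGE_NS = 1        # assumed stage time in nanoseconds
    INSTRUCTIONS = 4

    # Single-cycle: each instruction takes the full datapath time.
    single_cycle_ns = INSTRUCTIONS * STAGES * STAGE_NS       # 16 ns
    # Pipelined: fill the pipe once, then one instruction completes
    # per stage time.
    pipelined_ns = (STAGES + INSTRUCTIONS - 1) * STAGE_NS    # 7 ns
    print(single_cycle_ns, pipelined_ns)                     # 16 7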

5
Multiple CPU Functional Units (1)
  • This approach provides independent functional
    units for arithmetic and Boolean operations that
    execute concurrently.

6
Multiple CPU Functional Units (2)
  • A parallel computer has three types of components:
  • Processors
  • Memory modules
  • Communication / synchronization network

7
Single Processor vs. Multiprocessor
Energy-Efficient Performance
  • The two figures compare the performance of a
    single processor and of multiple processors.
  • The upper-right figure shows that increasing the
    clock frequency of a single processor by 20%
    delivers a 13% performance gain but requires 73%
    more power.
  • The lower figure shows that adding a second
    processor at the under-clocked frequency
    effectively delivers 73% more performance.
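  • The 73% power figure matches a common rule of thumb; a hedged
    back-of-envelope sketch (the cubic scaling is an assumption, not
    stated on the slide):

    # Assuming dynamic power P ~ C * V^2 * f, with supply voltage
    # scaled in proportion to frequency, power grows roughly as f^3.
    def relative_power(relative_freq):
        return relative_freq ** 3          # V ~ f  =>  P ~ f^3

    print(f"{(relative_power(1.2) - 1) * 100:.0f}% more power")  # ~73%
    # The 13% and 73% performance figures are measured results from
    # the figures, not derivable from this rule alone.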

8
Separate CPU and I/O processors
  • Dedicated I/O processors free the CPU from I/O
    control responsibilities. Solutions range from
    relatively simple I/O controllers to complex
    peripheral processing units.

9
Example: Intel IXP2800
10
High-level Taxonomy of Parallel Computer
Architectures
  • A parallel architecture provides an explicit,
    high-level framework for parallel programming
    solutions by providing multiple processors,
    whether simple or complex, that cooperate to
    solve problems through concurrent execution.

11
Flynn's Taxonomy Classifies Architectures (1)
  • SISD (single instruction, single data stream)
  • -- Defines serial computers.
  • -- An ordinary computer.
  • MISD (multiple instruction, single data stream)
  • -- Would involve multiple processors applying
    different instructions to a single datum; this
    hypothetical possibility is generally deemed
    impractical.

12
Flynn's Taxonomy Classifies Architectures (2)
  • SIMD (single instruction, multiple data streams)
  • -- Involves multiple processors simultaneously
    executing the same instruction on different data.
  • -- Massively parallel "army-of-ants" approach:
    processors execute the same sequence of
    instructions (or else NO-OP) in lockstep (TMC
    CM-2).
  • MIMD (multiple instruction, multiple data
    streams)
  • -- Involves multiple processors autonomously
    executing diverse instructions on diverse data.
  • -- True, symmetric, parallel computing (Sun
    Enterprise).

13
Pipelined Vector Processors
  • Vector processors are characterized by multiple,
    pipelined functional units.
  • The architecture provides parallel vector
    processing by sequentially streaming vector
    elements through a functional unit pipeline and
    by streaming the output results of one unit into
    the pipeline of another as input.

14
Register-to-Register Vector Architecture Operation
  • Each pipeline stage in the hypothetical
    architecture has a cycle time of 20 nanoseconds.
  • With six stages, 120 ns elapse from the time
    operands a1 and b1 enter stage 1 until result c1
    is available.
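  • Once the pipeline fills, one result emerges per 20 ns cycle; a
    small sketch of the implied timing model (the six-stage depth is
    inferred from 120 ns / 20 ns):

    STAGES = 6       # inferred from 120 ns / 20 ns
    CYCLE_NS = 20

    def vector_time_ns(n):
        # First result after STAGES cycles, then one result per cycle.
        return (STAGES + n - 1) * CYCLE_NS

    print(vector_time_ns(1))    # 120 ns for c1
    print(vector_time_ns(64))   # 1380 ns for a 64-element vector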

15
SIMD Architectures
  • SIMD architectures employ
  • a central control unit
  • multiple processors
  • an interconnection network (IN)
  • The IN is used for either processor-to-processor
    or processor-to-memory communication.

16
SIMD Architecture Computation Example
  • A SIMD-architecture vector computer.
  • For example, adding two real arrays A and B, as
    shown in the figure below.
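  • A minimal sketch of how that addition maps onto a SIMD machine:
    the control unit broadcasts one ADD at a time, and each processing
    element applies it to its own element in lockstep (the array
    contents and PE count here are invented for illustration):

    A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # invented contents
    B = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
    NUM_PES = 4                                    # hypothetical PE count
    C = [0.0] * len(A)

    # One broadcast ADD per step; the PEs work in lockstep, each on
    # its own element of the current block.
    for block in range(0, len(A), NUM_PES):
        for pe in range(NUM_PES):                  # conceptually simultaneous
            i = block + pe
            if i < len(C):
                C[i] = A[i] + B[i]
    print(C)   # [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]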

17
SIMD Architecture Problems
  • Some SIMD problems, e.g.:
  • SIMD cannot use commodity processors.
  • SIMD cannot support multiple users.
  • SIMD is less efficient on conditionally executed
    parallel code.

18
Bit-Plane Array Processing
  • Processor arrays structured for numerical SIMD
    execution have been employed for large-scale
    scientific calculations, for example, image
    processing and nuclear energy modeling.
  • In bit-plane architectures, the array of
    processors is arranged in a symmetrical grid
    (such as 64x64) and associated with multiple
    planes of memory bits that correspond to the
    dimensions of the processor grid.

19
Associative Memory Processing Organization
  • The right figure shows the characteristic
    functional units of an associative memory
    processor.
  • A program controller (serial computer) reads and
    executes instructions, invoking a specialized
    array controller when associative memory
    instructions are encountered.
  • Special registers enable the program controller
    and associative memory to share data.

20
Associative Memory Comparison Operation
  • The right figure shows a row-oriented comparison
    operation for a generic bit-serial architecture.
  • All of the associative processing elements start
    at a specified memory column and compare the
    contents of four consecutive bits in their row
    against the comparison register contents, setting
    a bit in the A register to indicate whether or
    not their row contains a match.
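  • A hedged functional (not bit-serial) model of that compare in
    Python, with invented memory contents: every word extracts the
    same 4-bit field and compares it against the comparison register,
    setting its A-register bit on a match:

    words = [0b10110100, 0b10100100, 0b10110111]   # invented memory rows
    compare = 0b1011                               # comparison register
    START_COL = 4                                  # field begins at bit 4

    A_reg = []
    for w in words:                     # all rows compare in parallel
        field = (w >> START_COL) & 0b1111
        A_reg.append(1 if field == compare else 0)
    print(A_reg)   # [1, 0, 1]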

21
Associative Memory Logical OR Operation
  • The right figure shows a logical OR operation
    performed on a bit-column and the bit-vector in
    register A, with register B receiving the
    results.
  • A zero in the mask register indicates that the
    associated word is not to be included in the
    current operation.
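  • A companion sketch of the masked OR (register contents invented):
    words whose mask bit is 0 are excluded and keep their old B value:

    A_reg   = [1, 0, 1, 0]
    col_bit = [0, 0, 1, 1]   # the selected bit-column, one bit per word
    mask    = [1, 1, 0, 1]   # 0 = exclude this word from the operation
    B_reg   = [0, 0, 0, 0]

    B_reg = [(a | c) if m else b
             for a, c, m, b in zip(A_reg, col_bit, mask, B_reg)]
    print(B_reg)   # [1, 0, 0, 1]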

22
Systolic Flow of Data From and to Memory
  • Systolic architectures (systolic arrays) are
    pipelined multiprocessors in which data is pulsed
    in rhythmic fashion from memory, through a
    network of processors, and back to memory.

23
Systolic Matrix Multiplication
  • The right figure is a simple systolic array
    computing the product of
    A = | a b |   and   B = | e f |
        | c d |            | g h |
  • The zero inputs shown moving through the array
    are used for synchronization.
  • Each processor begins with an accumulator set to
    zero and, during each cycle, adds the product of
    its two inputs to the accumulator.
  • After five cycles the matrix product is complete.
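  • A minimal Python simulation of that 2x2 array, with invented
    numeric values standing in for a..h; A's rows enter skewed from
    the left, B's columns skewed from the top, and the zeros provide
    the synchronization delays:

    N = 2
    A = [[1, 2], [3, 4]]      # stands in for [[a, b], [c, d]]
    B = [[5, 6], [7, 8]]      # stands in for [[e, f], [g, h]]

    # Skewed input streams: row i of A is delayed i cycles, column j
    # of B is delayed j cycles; zeros fill the gaps.
    CYCLES = 2 * N + 1        # five cycles, as on the slide
    a_in = [[0] * CYCLES for _ in range(N)]
    b_in = [[0] * CYCLES for _ in range(N)]
    for i in range(N):
        for k in range(N):
            a_in[i][i + k] = A[i][k]
    for j in range(N):
        for k in range(N):
            b_in[j][j + k] = B[k][j]

    acc = [[0] * N for _ in range(N)]    # accumulators start at zero
    a_reg = [[0] * N for _ in range(N)]  # operand latched rightward
    b_reg = [[0] * N for _ in range(N)]  # operand latched downward
    for t in range(CYCLES):
        # Each PE reads the stream (at the edges) or its neighbor's latch.
        new_a = [[a_in[i][t] if j == 0 else a_reg[i][j - 1]
                  for j in range(N)] for i in range(N)]
        new_b = [[b_in[j][t] if i == 0 else b_reg[i - 1][j]
                  for j in range(N)] for i in range(N)]
        for i in range(N):
            for j in range(N):
                acc[i][j] += new_a[i][j] * new_b[i][j]  # multiply-accumulate
        a_reg, b_reg = new_a, new_b

    print(acc)   # [[19, 22], [43, 50]] == A x B after five cycles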

24
MIMD Architectures
  • MIMD architectures employ multiple instruction
    streams, using local data.
  • MIMD computers support parallel solutions that
    require processors to operate in a largely
    autonomous manner.

25
MIMD Distributed Memory Architecture Structure
  • Distributed memory architectures connect
    processing nodes (consisting of an autonomous
    processor and its local memory) with a
    processor-to-processor interconnection network.

26
Example Distributed Memory Architectures
  • The right figure shows the IBM RS/6000 SP, a
    distributed memory machine.

27
Interconnection Network Topologies
  • Various interconnection network topologies have
    been proposed to support architectural
    expandability and provide efficient performance
    for parallel programs with differing
    interprocessor communication patterns.
  • The right figure (a-e) depicts these topologies:
    a) ring, b) mesh, c) tree, d) hypercube, and e)
    tree mapped to a reconfigurable mesh.
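  • As one concrete illustration, a short sketch of hypercube
    connectivity: in a d-dimensional hypercube, each node links to the
    d nodes whose binary labels differ in exactly one bit:

    def hypercube_neighbors(node, d):
        # Flipping each of the d label bits yields one neighbor.
        return [node ^ (1 << k) for k in range(d)]

    print(hypercube_neighbors(0b000, 3))   # [1, 2, 4]
    print(hypercube_neighbors(0b101, 3))   # [4, 7, 1]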

28
Shared-Memory Architecture
  • Shared memory architectures accomplish
    interprocessor coordination by providing a
    global, shared memory that each processor can
    address.
  • The right figure shows some major alternatives
    for connecting multiple processors to shared
    memory.
  • Figure a) shows a bus interconnection, b) shows a
    2x2 crossbar, and c) shows an 8x8 Omega MIN
    routing a P3 request to M3.
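  • The Omega routing in figure c) follows the standard
    destination-tag rule (a well-known property of Omega networks, not
    spelled out on the slide): at each of the log2(8) = 3 switch
    stages, one destination-address bit, most significant first,
    selects the upper (0) or lower (1) switch output:

    def omega_route(dest, stages=3):
        # Destination-tag routing: the MSB steers the first stage.
        return ["upper" if ((dest >> (stages - 1 - s)) & 1) == 0
                else "lower" for s in range(stages)]

    print(omega_route(3))   # to reach M3: ['upper', 'lower', 'lower']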

29
MIMD/SIMD Operations
  • A MIMD architecture can be controlled in SIMD
    fashion.
  • The master/slave relation of a SIMD architecture's
    controller and processors can be mapped onto the
    node/descendants relation of a subtree.
  • When the root processor node of a subtree
    operates as a SIMD controller, it transmits
    instructions to descendant nodes that execute the
    instructions on local memory data.

30
Dataflow Architectures
  • The fundamental feature of dataflow architectures
    is an execution paradigm in which instructions
    are enabled for execution as soon as all of their
    operands become available.
  • The dataflow graph for a program fragment is
    shown in the right figure.

31
Dataflow Token-Matching Example
  • Step 1: execution of (3 x a) produces the result
    15, and the instruction at node 3 requires an
    operand.
  • Step 2: the matching unit matches this token and
    the result token of (5 x b) against the node 3
    instruction.
  • Step 3: the matching unit creates the instruction
    token (template).
  • Step 4: the node store unit obtains the relevant
    instruction opcode from memory.
  • Step 5: the node store unit then fills in the
    relevant token fields and assigns the instruction
    to a processor. Executing the instruction creates
    a new result token to be used as input to the
    node 4 instruction.
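  • A loose Python sketch of that token-matching cycle. Only the node
    numbers 3 and 4, the value a = 5 (so 3 x a = 15), and the firing
    rule come from the slide; the graph shape and b = 2 are invented:

    graph = {
        3: {"op": "add",  "needs": 2, "dest": 4},
        4: {"op": "emit", "needs": 1, "dest": None},
    }
    waiting = {}   # node id -> operand tokens matched so far

    def send_token(node, value):
        # The matching unit collects tokens; a node fires as soon as
        # all of its operands are available.
        waiting.setdefault(node, []).append(value)
        if len(waiting[node]) == graph[node]["needs"]:
            ops = waiting.pop(node)
            if graph[node]["op"] == "add":
                send_token(graph[node]["dest"], ops[0] + ops[1])
            else:
                print("result token:", ops[0])

    send_token(3, 3 * 5)   # result token of (3 x a), a = 5  -> 15
    send_token(3, 5 * 2)   # result token of (5 x b), b = 2; node 3 fires
    # prints: result token: 25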

32
Reduction Architecture Demand Token Production (1)
  • The figure is a simplified version of a
    graph-reduction architecture that maps the
    program below onto tree-structured processors
    and passes tokens that demand or return results.
  • The right figure depicts all the demand tokens
    produced by the program, as demands for the
    values of references propagate down the tree.
  • The example program is shown below:
    a = b + c
    b = d + e
    c = f + g
    d = 1, e = 3, f = 5, g = 7

33
Reduction Architecture Demand Token Production (2)
  • The right figure depicts the last two result
    tokens as they are passed up to the root node.
  • The example program is shown below:
    a = b + c
    b = d + e
    c = f + g
    d = 1, e = 3, f = 5, g = 7
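  • A hedged sketch of the demand-driven evaluation these two slides
    describe, using the program above: demanding a sends demand tokens
    down the tree, and result tokens return upward, reducing
    a = (1 + 3) + (5 + 7) = 16:

    defs = {
        "a": ("+", "b", "c"),
        "b": ("+", "d", "e"),
        "c": ("+", "f", "g"),
        "d": 1, "e": 3, "f": 5, "g": 7,
    }

    def demand(name):
        node = defs[name]
        if isinstance(node, tuple):      # demand token travels down
            _, left, right = node
            result = demand(left) + demand(right)  # results return up
            defs[name] = result          # reduce the graph node in place
            return result
        return node                      # leaf already holds a value

    print(demand("a"))   # (1 + 3) + (5 + 7) = 16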

34
Wavefront Array Matrix Multiplication (1)
  • Figures (a) through (c), on this and the
    following slides, depict the wavefront array
    concept, using the matrix multiplication example
    that illustrated systolic operation on slide 23.
  • Figure (a), shown on the right, depicts the
    initial situation: the buffers have been filled
    from memory.

35
Wavefront Array Matrix Multiplication (2)
  • In figure (b), PE(1,1) adds the product ae to its
    accumulator and transmits operands a and e to its
    neighbors; thus, the first computational
    wavefront is shown propagating from PE(1,1) to
    PE(1,2) and PE(2,1).

36
Wavefront Array Matrix Multiplication (3)
  • Figure (c) shows the first computational
    wavefront continuing to propagate, while a second
    wavefront is propagated by PE(1,1).

37
Conclusion
  • What are parallel computer architectures? In R.W.
    Hockney's terms:
  • "A confusing menagerie of computer designs."