Lecture 2 Parallel Programming Platforms - PowerPoint PPT Presentation


PPT – Lecture 2 Parallel Programming Platforms PowerPoint presentation | free to download - id: 66f624-NzVlN


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation

Lecture 2 Parallel Programming Platforms


Lecture 2 Parallel Programming Platforms Parallel Computing Spring 2010 * * Implicit Parallelism: Trends in Microprocessor Architectures Microprocessor clock speeds ... – PowerPoint PPT presentation

Number of Views:17
Avg rating:3.0/5.0
Date added: 20 March 2020
Slides: 27
Provided by: xw37


Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms
  • Parallel Computing
  • Spring 2015

Implicit Parallelism Trends in Microprocessor
  • Microprocessor clock speeds have posted
    impressive gains over the past two decades (two
    to three orders of magnitude).
  • Higher levels of device integration have made
    available a large number of transistors.
  • The question of how best to utilize these
    resources is an important one.
  • Current processors use these resources in
    multiple functional units and execute multiple
    instructions in the same cycle.
  • The precise manner in which these instructions
    are selected and executed provides impressive
    diversity in architectures.

Implicit Parallelism
  • Pipelining and Superscalar execution
  • Very Long Instruction Word Processors

Pipelining and Superscalar Execution
  • Pipelining overlaps various stages of instruction
    execution to achieve performance.
  • At a high level of abstraction, an instruction
    can be executed while the next one is being
    decoded and the next one is being fetched.
  • This is akin to an assembly line for manufacture
    of cars.

Pipelining and Superscalar Execution
  • Limitations
  • The speed of a pipeline is eventually limited by
    the slowest stage.
  • For this reason, conventional processors rely on
    very deep pipelines (20 stage pipelines in
    state-of-the-art Pentium processors).
  • However, in typical program traces, every 5-6th
    instruction is a conditional jump! This requires
    very accurate branch prediction.
  • The penalty of a misprediction grows with the
    depth of the pipeline, since a larger number of
    instructions will have to be flushed.

Pipelining and Superscalar Execution
  • One simple way of alleviating these bottlenecks
    is to use multiple pipelines.
  • The question then becomes one of selecting these

Superscalar Execution An Example
Example of a two-way superscalar execution of
Superscalar Execution
  • Scheduling of instructions is determined by a
    number of factors
  • True Data Dependency The result of one operation
    is an input to the next.
  • Resource Dependency Two operations require the
    same resource.
  • Branch Dependency Scheduling instructions across
    conditional branch statements cannot be done
    deterministically a-priori.
  • The scheduler, a piece of hardware looks at a
    large number of instructions in an instruction
    queue and selects appropriate number of
    instructions to execute concurrently based on
    these factors.
  • The complexity of this hardware is an important
    constraint on superscalar processors.

Very Long Instruction Word (VLIW) Processors
  • The hardware cost and complexity of the
    superscalar scheduler is a major consideration in
    processor design.
  • To address this issues, VLIW processors rely on
    compile time analysis to identify and bundle
    together instructions that can be executed
  • These instructions are packed and dispatched
    together, and thus the name very long instruction
  • This concept was used with some commercial
    success in the Multiflow Trace machine (circa
  • Variants of this concept are employed in the
    Intel IA64 processors.

Very Long Instruction Word (VLIW) Processors
  • Issue hardware is simpler.
  • Compiler has a bigger context from which to
    select co-scheduled instructions.
  • Compilers, however, do not have runtime
    information such as cache misses. Scheduling is,
    therefore, inherently conservative.
  • Branch and memory prediction is more difficult.
  • VLIW performance is highly dependent on the
    compiler. A number of techniques such as loop
    unrolling, speculative execution, branch
    prediction are critical.
  • Typical VLIW processors are limited to 4-way to
    8-way parallelism.

Limitations of Memory System Performance
  • Memory system, and not processor speed, is often
    the bottleneck for many applications.
  • Memory system performance is largely captured by
    two parameters, latency and bandwidth.
  • Latency is the time from the issue of a memory
    request to the time the data is available at the
  • Bandwidth is the rate at which data can be pumped
    to the processor by the memory system.

Improving Effective Memory Latency Using Caches
  • Caches are small and fast memory elements between
    the processor and DRAM.
  • This memory acts as a low-latency high-bandwidth
  • If a piece of data is repeatedly used, the
    effective latency of this memory system can be
    reduced by the cache.
  • The fraction of data references satisfied by the
    cache is called the cache hit ratio of the
    computation on the system.
  • Cache hit ratio achieved by a code on a memory
    system often determines its performance.

Memory System Performance Summary
  • Exploiting spatial and temporal locality in
    applications is critical for amortizing memory
    latency and increasing effective memory
  • The ratio of the number of operations to number
    of memory accesses is a good indicator of
    anticipated tolerance to memory bandwidth.
  • Memory layouts and organizing computation
    appropriately can make a significant impact on
    the spatial and temporal locality.

Alternate Approaches for Hiding Memory Latency
  • Prefetching
  • Multithreading
  • Spatial locality in accessing memory words.

Tradeoffs of Multithreading and Prefetching
  • Bandwidth requirements of a multithreaded system
    may increase very significantly because of the
    smaller cache residency of each thread.
  • Multithreaded systems become bandwidth bound
    instead of latency bound.
  • Multithreading and prefetching only address the
    latency problem and may often exacerbate the
    bandwidth problem.
  • Multithreading and prefetching also require
    significantly more hardware resources in the form
    of storage.

Explicitly Parallel Platforms
Flynns taxonomy of computer architectures
(control mechanism)
  • Depending on the execution and data streams
    computer architectures can be distinguished into
    the following groups.
  • (1) SISD (Single Instruction Single Data) This
    is a sequential computer.
  • (2) SIMD (Single Instruction Multiple Data)
    This is a parallel machine like the TM CM-200.
    SIMD machines are suited for data-parallel
    programs where the same set of instructions are
    executed on a large data set.
  • Some of the earliest parallel computers such as
    the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1
    belonged to this class of machines
  • (3) MISD (Multiple Instructions Single Data)
    Some consider a systolic array a member of this
  • (4) MIMD (Multiple Instructions Multiple Data)
    All other parallel machines. A MIMD architecture
    can be an MPMD or an SPMD. In a Multiple Program
    Multiple Data organization, each processor
    executes its own program as opposed to a single
    program that is executed by all processors on a
    Single Program Multiple Data architecture.
  • Examples of such platforms include current
    generation Sun Ultra Servers, SGI Origin Servers,
    multiprocessor PCs, workstation clusters, and the
    IBM SP
  • Note Some consider CM-5 as a combination of a
    MIMD and SIMD as it contains control hardware
    that allows it to operatein a SIMD mode.

SIMD and MIMD Processors
A typical SIMD architecture (a) and a typical
MIMD architecture (b).
Taxonomy based on Address-Space Organization
(memory distribution)
  • Message-Passing Architecture
  • In a distributed memory machine each processor
    has its own exclusive address space. On such
    platforms, interactions between processes running
    on different nodes must be accomplished using
    messages, hen the name message-passing. Such
    machines are commonly referred to as
  • Examples Cray T3D/T3E, IBM SP1/SP2, workstation
  • Shared-Address-Space Architecture
  • Provides hardware support for read/write to a
    shared address space. Machines built this way are
    often called multiprocessors.
  • (1) A shared memory machine has a single address
    space shared by all processors (UMA, for Uniform
    Memory Access).
  • The time taken by a processor to access any
    memory word in the system is identical.
  • Examples SGI Power Challenge, SMP machines.
  • (2) A distributed shared memory system is a
    hybrid between the two previous ones. A global
    address space is shared among the processors but
    is distributed among them. Each processor can
    access its own memory faster than it can access
    the memory of a remote processor (NUMA for
    Non-Uniform Memory Access). Example SGI Origin
  • Note The existence of a cache in shared-memory
    parallel machines cause cache coherence problems
    when a cached variable is modified by a processor
    and the shared-variable is requested by another
    processor. cc-NUMA for cachecoherent NUMA
    architectures (Origin 2000).

NUMA and UMA Shared-Address-Space Platforms
Typical shared-address-space architectures (a)
Uniform-memory access shared-address-space
computer (b) Uniform-memory-access
shared-address-space computer with caches and
memories (c) Non-uniform-memory-access
shared-address-space computer with local memory
Message Passing vs. Shared Address Space
  • Message passing requires little hardware support,
    other than a network.
  • Shared address space platforms can easily emulate
    message passing. The reverse is more difficult to
    do (in an efficient manner).

Taxonomy based on processor granularity
  • The granularity sometimes refers to the power of
    individual processors. Sometimes is also used to
    denote the degree of parallelism.
  • (1) A coarse-grained architecture consists of
    (usually few) powerful processors (eg old Cray
  • (2) a fine-grained architecture consists of
    (usually many inexpensive) processors (eg TM
    CM-200, CM-2).
  • (3) a medium-grained architecture is between the
    two (eg CM-5).
  • Process Granularity refers to the amount of
    computation assigned to a particular processor of
    a parallel machine for a given parallel program.
    It also refers, within a single program, to the
    amount of computation performed before
    communication is issued. If the amount of
    computation is small (low degree of concurrency)
    a process is fine-grained. Otherwise granularity
    is coarse.

Taxonomy based on processor synchronization
  • (1) In a fully synchronous system a global clock
    is used to synchronize all operations performed
    by the processors.
  • (2) An asynchronous system lacks any
    synchronization facilities. Processor
    synchronization needs to be explicit in a users
  • (3) A bulk-synchronous system comes in between a
    fully synchronous and an asynchronous system.
    Synchronization of processors is required only at
    certain parts of the execution of a parallel

Physical Organization of Parallel Platforms
ideal architecture(PRAM)
  • The Parallel Random Access Machine (PRAM) is one
    of the simplest ways to model a parallel
  • A PRAM consists of a collection of (sequential)
    processors that can synchronously access a global
    shared memory in unit time. Each processor can
    thus access its shared memory as fast (and
    efficiently) as it can access its own local
  • The main advantages of the PRAM is its simplicity
    in capturing parallelism and abstracting away
    communication and synchronization issues related
    to parallel computing.
  • Processors are considered to be in abundance and
    unlimited in number. The resulting PRAM
    algorithms thus exhibit unlimited parallelism
    (number of processors used is a function of
    problem size).
  • The abstraction thus offered by the PRAM is a
    fully synchronous collection of processors and a
    shared memory which makes it popular for parallel
    algorithm design.
  • It is, however, this abstraction that also makes
    the PRAM unrealistic from a practical point of
  • Full synchronization offered by the PRAM is too
    expensive and time demanding in parallel machines
    currently in use.
  • Remote memory (i.e. shared memory) access is
    considerably more expensive in real machines than
    local memory access
  • UMA machines with unlimited parallelism are
    difficult to build.

Four Subclasses of PRAM
  • Depending on how concurrent access to a single
    memory cell (of the shared memory) is resolved,
    there are various PRAM variants.
  • ER (Exclusive Read) or EW (Exclusive Write) PRAMs
    do not allow concurrent access of the shared
  • It is allowed, however, for CR (Concurrent Read)
    or CW (Concurrent Write) PRAMs.
  • Combining the rules for read and write access
    there are four PRAM variants
  • EREW
  • access to a memory location is exclusive. No
    concurrent read or write operations are allowed.
  • Weakest PRAM model
  • CREW
  • Multiple read accesses to a memory location are
    allowed. Multiple write accesses to a memory
    location are serialized.
  • ERCW
  • Multiple write accesses to a memory location are
    allowed. Multiple read accesses to a memory
    location are serialized.
  • Can simulate an EREW PRAM
  • CRCW
  • Allows multiple read and write accesses to a
    common memory location.
  • Most powerful PRAM model
  • Can simulate both EREW PRAM and CREW PRAM

Resolve concurrent write access
  • (1) in the arbitrary PRAM, if multiple processors
    write into a single shared memory cell, then an
    arbitrary processor succeeds in writing into this
  • (2) in the common PRAM, processors must write the
    same value into the shared memory cell.
  • (3) in the priority PRAM the processor with the
    highest priority (smallest or largest indexed
    processor) succeeds in writing.
  • (4) in the combining PRAM if more than one
    processors write into the same memory cell, the
    result written into it depends on the combining
    operator. If it is the sum operator, the sum of
    the values is written, if it is the maximum
    operator the maximum is written.

Note An algorithm designed for the common PRAM
can be executed on a priority or arbitrary PRAM
and exhibit similar complexity. The same holds
for an arbitrary PRAM algorithm when run on a
priority PRAM.
About PowerShow.com