Architecture Classifications
Transcript and Presenter's Notes



1
Architecture Classifications
  • In his 1972 taxonomy of parallel architectures,
    Flynn categorised HPC architectures into four
    classes, based on how many instruction and data
    streams can be observed in the architecture.
  • They are:
  • SISD - Single Instruction, Single Data
  • Operates sequentially on a single stream of
    instructions held in a single memory.
  • Classic Von Neumann architecture.
  • Machines may still consist of multiple
    processors, operating on independent data - these
    can be considered as multiple SISD systems.
  • SIMD - Single Instruction, Multiple Data
  • A single instruction stream (broadcast to all
    PEs), acting on multiple data.
  • The most common machines in this architecture
    class are vector processors.
  • These can deliver results several times faster
    than scalar processors.

PE = Processing Element
2
Architecture Classifications
  • MISD - Multiple Instruction, Single Data
  • There is a debate about whether an architecture
    with uniformly shared memory and separate caches
    is MISD or MIMD (MIMD is favoured).
  • No other practical implementations of this
    architecture exist.
  • MIMD - Multiple Instruction, Multiple Data
  • Independent instruction streams, acting on
    different (but related) data
  • Note the difference between multiple SISD and
    MIMD

3
Architecture Classifications
  • MIMD: SMP, NUMA, MPP, Cluster
  • SISD: Machine with a single scalar processor
  • SIMD: Machine with vector processors

4
Architecture Classifications
  • Shared memory (uniform memory access)
  • Processors share access to a common memory space.
  • Implemented over a shared memory bus or a
    communication network.
  • Memory locks are required.
  • Local cache is critical - without it, bus
    contention (or network traffic) reduces the
    system's efficiency.
  • For this reason, pure shared memory systems do
    not scale (scalability is the measure of how well
    system performance improves linearly with the
    number of processing elements).
  • Naturally, cache introduces problems of coherency
    (ensuring that stale cache lines are invalidated
    when other processors alter shared memory).
  • Support for critical sections is required (a
    minimal sketch follows the diagram below).

[Diagram: PE 0 ... PE n connected by an interconnect
to a single shared memory]
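
A minimal sketch of the critical-section idea on a shared-memory
system, using OpenMP (an assumption - the slides do not name a
shared-memory programming API). Every thread updates the same counter
in shared memory, and the critical section serialises the update so
that no increment is lost.

    /* OpenMP shared-counter sketch; compile with e.g. gcc -fopenmp */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        long counter = 0;                /* lives in shared memory */

        #pragma omp parallel for
        for (long i = 0; i < 1000000; i++) {
            #pragma omp critical         /* serialise the update   */
            counter++;
        }

        printf("counter = %ld\n", counter);  /* 1000000 with the lock */
        return 0;
    }

Without the critical section the result is usually smaller than
1000000, because concurrent increments overwrite each other - exactly
the locking and coherency issues listed above.
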
5
Architecture Classifications
  • Shared memory (Non-uniform memory access)
  • A PE may be fetching from local or remote memory
    - hence the non-uniform access times.
  • NUMA
  • cc-NUMA (cache-coherent Non-Uniform Memory
    Access)
  • Groups of processors are connected by a fast
    local interconnect, forming SMP nodes.
  • These nodes are then connected to each other by a
    high-speed interconnect.
  • Global address space.

[Diagram: m SMP nodes joined by an interconnect;
node 1 holds PE 1 ... PE n with Shared Memory 1, and
node m holds PE (m-1)n+1 ... PE m·n with Shared
Memory m]
6
Architecture Classifications
  • Distributed Memory
  • Each processor has its own local memory.
  • When processors need to exchange (or share) data,
    they must do so through explicit communication.
  • Message passing (e.g. via the MPI library; a
    minimal sketch follows the diagram below).
  • Typically larger latencies between PEs
    (especially if they communicate over network
    interconnects).
  • Scalability, however, is good if the problems can
    be sufficiently contained within PEs.
  • Typically, coarse-grained work units are
    distributed.

[Diagram: PE 0 ... PE n, each with its own local
memory M 0 ... M n, connected by an interconnect]
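
A minimal message-passing sketch in C using MPI, the library named
above (the ranks, tag and array contents are illustrative only). Each
process holds its own copy of data in its own local memory; rank 0
must send it explicitly before rank 1 can see it.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        int data[4] = {1, 2, 3, 4};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* explicit communication: copy the array to rank 1 */
            MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d %d %d %d\n",
                   data[0], data[1], data[2], data[3]);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. mpirun -np 2 ./a.out.
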
7
In-processor Parallelism
  • Pipelines
  • Instruction pipelines
  • Reduces the idle time of hardware components.
  • Good performance with independent instructions.
  • Performing more operations per clock cycle.
  • The discrepancy between peak and actual
    performance is often caused by pipeline effects.
  • Difficult to keep pipelines full.
  • Branch prediction helps.

8
In-processor Parallelism
  • Vector architectures
  • Fast I/O - powerful buses and interconnects.
  • Large memory bandwidth and low-latency access.
  • No cache is needed, because of the above.
  • Perform operations involving large matrices,
    commonly encountered in engineering areas.

9
In-processor Parallelism
  • Commodity processors increasingly provide
    performance as good as dedicated vector
    processors.
  • Price/performance is also far better.
  • Commodity processors now offer good performance
    for vectorizable code.
  • Explicit support for vectorization with SIMD
    instructions on COTS processors (a short sketch
    follows this list):
  • AltiVec on PowerPC
  • SSE (Streaming SIMD Extensions) on x86
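
A minimal SSE sketch in C (x86 only; it assumes a compiler that
provides the <xmmintrin.h> intrinsics header, as gcc, clang and MSVC
do). A single _mm_add_ps operation adds four floats at once - the
single-instruction, multiple-data idea in miniature.

    #include <stdio.h>
    #include <xmmintrin.h>

    int main(void)
    {
        /* _mm_set_ps lists elements from highest to lowest lane */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 c = _mm_add_ps(a, b);   /* four additions in one op */

        float out[4];
        _mm_storeu_ps(out, c);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;                      /* prints: 11 22 33 44 */
    }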

10
Multiprocessor Parallelism
  • Use multiple processors on the same program.
  • Divide the workload up between the processors.
  • Often achieved by dividing up a data structure
    (a minimal sketch follows this list).
  • Each processor works on its own data.
  • Typically the processors need to communicate.
  • Shared memory is one approach; explicit messaging
    (as on distributed-memory systems) is
    increasingly common.
  • Load balancing is critical for maintaining good
    performance.
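
A minimal sketch of dividing a data structure between processors,
using POSIX threads (an assumption - the slide does not name a
threading API). The array is split into equal chunks, each thread
sums its own chunk, and the partial sums are combined at the end.

    /* compile with e.g. gcc -pthread partition.c (hypothetical name) */
    #include <stdio.h>
    #include <pthread.h>

    #define N        1000000
    #define NTHREADS 4

    static double data[N];
    static double partial[NTHREADS];

    static void *sum_chunk(void *arg)
    {
        long t  = (long)arg;
        long lo = t * (N / NTHREADS);              /* my chunk        */
        long hi = (t == NTHREADS - 1) ? N : lo + N / NTHREADS;
        double s = 0.0;
        for (long i = lo; i < hi; i++)
            s += data[i];
        partial[t] = s;                            /* no lock needed  */
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        for (long i = 0; i < N; i++)
            data[i] = 1.0;

        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&threads[t], NULL, sum_chunk, (void *)t);

        double total = 0.0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(threads[t], NULL);
            total += partial[t];                   /* combine results */
        }
        printf("total = %g\n", total);             /* prints 1e+06    */
        return 0;
    }

Because the chunks are equal and the work per element is constant,
the load is balanced automatically; irregular work per element would
need a smarter distribution.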

11
Multiprocessor Parallelism
[Diagrams: a single processor with its own memory; a
symmetric multiprocessor in which several CPUs share
one memory; an MPP system in which each CPU has its
own memory and the nodes are connected by a network]
12
Clusters
  • Built using COTS components.
  • Brought about by improved processor speed as well
    as networking and switching technology.
  • Mass-produced commodity off-the-shelf (COTS)
    hardware, rather than expensive proprietary
    hardware built solely for supercomputers.

13
Clusters
  • Clusters are simpler to manage
  • Single image, single identity
  • Often run familiar operating systems.
  • Linux is probably the most popular
  • Commodity compilers and support
  • Node-for-node swap-out on failure.
  • Can run multi-processor parallel tasks.
  • Or run sequential tasks for multiple users
    (job-level parallelism).

14
Clusters
  • Clustering of SMPs
  • Attractive method of achieving high performance.
  • SMPs reduce the network overhead

15
Parallel Efficiency
  • The main issues that affect parallel efficiency
    are:
  • Ratio of computation to communication
  • Higher computation usually yields better
    performance.
  • Communication bandwidth and latency
  • Latency has the biggest impact.
  • Scalability
  • How do the bandwidth and latency scale with the
    number of processors? (Standard definitions of
    speedup and efficiency follow this list.)
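
To make scalability concrete, the standard definitions (not stated
explicitly in the deck) for p processing elements are:

    speedup      S(p) = T(1) / T(p)
    efficiency   E(p) = S(p) / p

where T(p) is the run time on p PEs. Ideal (linear) scaling means
S(p) = p, i.e. E(p) = 1; communication latency and load imbalance
push E(p) below 1.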

16
Dependency and Parallelism
  • Granularity of parallelism: the size of the
    computations that are being performed in parallel.
  • Four types of parallelism (in order of increasing
    granularity):
  • Instruction-level parallelism (e.g. pipeline)
  • Thread-level parallelism (e.g. run a multi-thread
    java program)
  • Process-level parallelism (e.g. run an MPI job in
    a cluster)
  • Job-level parallelism (e.g. run a batch of
    independent single-processor jobs in a cluster)

17
Dependency and Parallelism
  • Dependency: if event A must occur before event B,
    then B is dependent on A.
  • Two types of dependency:
  • Control dependency: waiting for the instruction
    which controls the execution flow to be completed
  • IF (X != 0) THEN Y = 1.0/X - Y has a control
    dependency on (X != 0)
  • Data dependency: dependency caused by
    calculations or memory access (the three cases
    below are written out in C after this list)
  • Flow dependency: A = X + Y ; B = A + C
  • Anti-dependency: B = A + C ; A = X + Y
  • Output dependency: A = 2 ; X = A + 1 ; A = 5
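
The three data-dependency cases above, written out as a small C
program (variable names follow the slide; the initial values are
arbitrary).

    #include <stdio.h>

    int main(void)
    {
        int A, B, X = 3, Y = 4, C = 5;

        /* Flow (true) dependency: the second statement reads the A
         * written by the first, so it cannot be moved earlier.     */
        A = X + Y;
        B = A + C;

        /* Anti-dependency: A is read first and written afterwards;
         * the write must not be moved ahead of the read.           */
        B = A + C;
        A = X + Y;

        /* Output dependency: two writes to A; reordering them
         * would change the final value of A.                       */
        A = 2;
        X = A + 1;
        A = 5;

        printf("A=%d B=%d X=%d\n", A, B, X);
        return 0;
    }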

18
Identifying Dependency
  • Draw a Directed Acyclic Graph (DAG) to identify
    the dependencies among a sequence of instructions.
  • Anti-dependency: a variable appears as a parent
    in a calculation and then as a child in a later
    calculation.
  • Output dependency: a variable appears as a child
    in a calculation and then as a child again in a
    later calculation.