Architecture Classifications
Transcript and Presenter's Notes



1
Architecture Classifications
  • In his 1972 taxonomy of parallel architectures,
    Flynn categorised HPC architectures into four
    classes, based on how many instruction and data
    streams can be observed in the architecture.
  • They are:
  • SISD - Single Instruction, Single Data
  • Operates sequentially on a single stream of
    instructions held in a single memory.
  • Classic Von Neumann architecture.
  • Machines may still consist of multiple
    processors, operating on independent data - these
    can be considered as multiple SISD systems.
  • SIMD - Single Instruction, Multiple Data
  • A single instruction stream (broadcast to all
    PEs), acting on multiple data.
  • The most common machines in this architecture
    class are vector processors.
  • These can deliver results several times faster
    than scalar processors.

PE = Processing Element
2
Architecture Classifications
  • MISD - Multiple Instruction, Single Data
  • There is a debate about whether an architecture
    with uniformly shared memory and separate caches
    is MISD or MIMD (MIMD is favoured).
  • No other practical implementations of this
    architecture exist.
  • MIMD - Multiple Instruction, Multiple Data
  • Independent instruction streams, acting on
    different (but related) data
  • Note the difference between multiple SISD and
    MIMD

3
Architecture Classifications
  • MIMD: SMP, NUMA, MPP, Cluster
  • SISD: Machine with a single scalar processor
  • SIMD: Machine with vector processors

4
Architecture Classifications
  • Shared memory (uniform memory access)
  • Processors share access to a common memory space.
  • Implemented over a shared memory bus or a
    communication network.
  • Memory locks are required.
  • Local cache is critical - without it, bus
    contention (or network traffic) reduces the
    system's efficiency.
  • For this reason, pure shared memory systems do
    not scale (scalability is the measure of how well
    system performance improves linearly with the
    number of processing elements).
  • Naturally, cache introduces problems of coherency
    (ensuring that stale cache lines are invalidated
    when other processors alter shared memory).
  • Support for critical sections is required (a
    minimal sketch follows the diagram below).

[Diagram: PE 0 ... PE n connected by an interconnect
to a single shared memory]
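
A minimal sketch of the critical-section idea on a shared-memory
system, using OpenMP (an assumption - the slides do not name a
shared-memory programming API). Every thread updates the same counter
in shared memory, and the critical section serialises the update so
that no increment is lost.

    /* OpenMP shared-counter sketch; compile with e.g. gcc -fopenmp */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        long counter = 0;                /* lives in shared memory */

        #pragma omp parallel for
        for (long i = 0; i < 1000000; i++) {
            #pragma omp critical         /* serialise the update   */
            counter++;
        }

        printf("counter = %ld\n", counter);  /* 1000000 with the lock */
        return 0;
    }

Without the critical section the result is usually smaller than
1000000, because concurrent increments overwrite each other - exactly
the locking and coherency issues listed above.
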
5
Architecture Classifications
  • Shared memory (Non-uniform memory access)
  • A PE may be fetching from local or remote memory
    - hence the non-uniform access times.
  • NUMA
  • cc-NUMA (cache-coherent Non-Uniform Memory
    Access)
  • Groups of processors are connected by a fast
    local interconnect, forming SMP nodes.
  • These nodes are then connected to each other by a
    high-speed interconnect.
  • Global address space.

[Diagram: m SMP nodes joined by an interconnect;
node 1 holds PE 1 ... PE n with Shared Memory 1, and
node m holds PE (m-1)n+1 ... PE m·n with Shared
Memory m]
6
Architecture Classifications
  • Distributed Memory
  • Each processor has its own local memory.
  • When processors need to exchange (or share) data,
    they must do so through explicit communication.
  • Message passing (e.g. via the MPI library; a
    minimal sketch follows the diagram below).
  • Typically larger latencies between PEs
    (especially if they communicate over network
    interconnects).
  • Scalability, however, is good if the problems can
    be sufficiently contained within PEs.
  • Typically, coarse-grained work units are
    distributed.

[Diagram: PE 0 ... PE n, each with its own local
memory M 0 ... M n, connected by an interconnect]
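
A minimal message-passing sketch in C using MPI, the library named
above (the ranks, tag and array contents are illustrative only). Each
process holds its own copy of data in its own local memory; rank 0
must send it explicitly before rank 1 can see it.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        int data[4] = {1, 2, 3, 4};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* explicit communication: copy the array to rank 1 */
            MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d %d %d %d\n",
                   data[0], data[1], data[2], data[3]);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. mpirun -np 2 ./a.out.
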
7
In-processor Parallelism
  • Pipelines
  • Instruction pipelines
  • Reduces the idle time of hardware components.
  • Good performance with independent instructions.
  • Performing more operations per clock cycle.
  • The discrepancy between peak and actual
    performance is often caused by pipeline effects.
  • Difficult to keep pipelines full.
  • Branch prediction helps.

8
In-processor Parallelism
  • Vector architectures
  • Fast I/O - powerful buses and interconnects.
  • Large memory bandwidth and low-latency access.
  • No cache is needed, because of the above.
  • Perform operations involving large matrices,
    commonly encountered in engineering areas.

9
In-processor Parallelism
  • Commodity processors increasingly provide
    performance as good as dedicated vector
    processors.
  • Price/performance is also far better.
  • Commodity processors now offer good performance
    for vectorizable code.
  • Explicit support for vectorization with SIMD
    instructions on COTS processors (a short sketch
    follows this list):
  • AltiVec on PowerPC
  • SSE (Streaming SIMD Extensions) on x86
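
A minimal SSE sketch in C (x86 only; it assumes a compiler that
provides the <xmmintrin.h> intrinsics header, as gcc, clang and MSVC
do). A single _mm_add_ps operation adds four floats at once - the
single-instruction, multiple-data idea in miniature.

    #include <stdio.h>
    #include <xmmintrin.h>

    int main(void)
    {
        /* _mm_set_ps lists elements from highest to lowest lane */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 c = _mm_add_ps(a, b);   /* four additions in one op */

        float out[4];
        _mm_storeu_ps(out, c);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;                      /* prints: 11 22 33 44 */
    }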

10
Multiprocessor Parallelism
  • Use multiple processors on the same program.
  • Divide the workload up between the processors.
  • Often achieved by dividing up a data structure
    (a minimal sketch follows this list).
  • Each processor works on its own data.
  • Typically the processors need to communicate.
  • Shared memory is one approach; explicit messaging
    (as on distributed-memory systems) is
    increasingly common.
  • Load balancing is critical for maintaining good
    performance.
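
A minimal sketch of dividing a data structure between processors,
using POSIX threads (an assumption - the slide does not name a
threading API). The array is split into equal chunks, each thread
sums its own chunk, and the partial sums are combined at the end.

    /* compile with e.g. gcc -pthread partition.c (hypothetical name) */
    #include <stdio.h>
    #include <pthread.h>

    #define N        1000000
    #define NTHREADS 4

    static double data[N];
    static double partial[NTHREADS];

    static void *sum_chunk(void *arg)
    {
        long t  = (long)arg;
        long lo = t * (N / NTHREADS);              /* my chunk        */
        long hi = (t == NTHREADS - 1) ? N : lo + N / NTHREADS;
        double s = 0.0;
        for (long i = lo; i < hi; i++)
            s += data[i];
        partial[t] = s;                            /* no lock needed  */
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        for (long i = 0; i < N; i++)
            data[i] = 1.0;

        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&threads[t], NULL, sum_chunk, (void *)t);

        double total = 0.0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(threads[t], NULL);
            total += partial[t];                   /* combine results */
        }
        printf("total = %g\n", total);             /* prints 1e+06    */
        return 0;
    }

Because the chunks are equal and the work per element is constant,
the load is balanced automatically; irregular work per element would
need a smarter distribution.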

11
Multiprocessor Parallelism
[Diagrams: a single processor with its own memory; a
symmetric multiprocessor in which several CPUs share
one memory; an MPP system in which each CPU has its
own memory and the nodes are connected by a network]
12
Clusters
  • Built using COTS components.
  • Brought about by improved processor speed as well
    as networking and switching technology.
  • Mass-produced commodity off-the-shelf (COTS)
    hardware, rather than expensive proprietary
    hardware built solely for supercomputers.

13
Clusters
  • Clusters are simpler to manage
  • Single image, single identity
  • Often run familiar operating systems.
  • Linux is probably the most popular
  • Commodity compilers and support
  • Node-for-node swap-out on failure.
  • Can run multi-processor parallel tasks.
  • Or run sequential tasks for multiple users
    (job-level parallelism).

14
Clusters
  • Clustering of SMPs
  • Attractive method of achieving high performance.
  • SMPs reduce the network overhead

15
Parallel Efficiency
  • The main issues that affect parallel efficiency
    are:
  • Ratio of computation to communication
  • Higher computation usually yields better
    performance.
  • Communication bandwidth and latency
  • Latency has the biggest impact.
  • Scalability
  • How do the bandwidth and latency scale with the
    number of processors? (Standard definitions of
    speedup and efficiency follow this list.)
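
To make scalability concrete, the standard definitions (not stated
explicitly in the deck) for p processing elements are:

    speedup      S(p) = T(1) / T(p)
    efficiency   E(p) = S(p) / p

where T(p) is the run time on p PEs. Ideal (linear) scaling means
S(p) = p, i.e. E(p) = 1; communication latency and load imbalance
push E(p) below 1.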

16
Dependency and Parallelism
  • Granularity of parallelism: the size of the
    computations that are being performed in parallel.
  • Four types of parallelism (in order of increasing
    granularity):
  • Instruction-level parallelism (e.g. pipeline)
  • Thread-level parallelism (e.g. run a multi-thread
    java program)
  • Process-level parallelism (e.g. run an MPI job in
    a cluster)
  • Job-level parallelism (e.g. run a batch of
    independent single-processor jobs in a cluster)

17
Dependency and Parallelism
  • Dependency: if event A must occur before event B,
    then B is dependent on A.
  • Two types of dependency:
  • Control dependency: waiting for the instruction
    which controls the execution flow to be completed
  • IF (X != 0) THEN Y = 1.0/X - Y has a control
    dependency on (X != 0)
  • Data dependency: dependency caused by
    calculations or memory access (the three cases
    below are written out in C after this list)
  • Flow dependency: A = X + Y ; B = A + C
  • Anti-dependency: B = A + C ; A = X + Y
  • Output dependency: A = 2 ; X = A + 1 ; A = 5
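
The three data-dependency cases above, written out as a small C
program (variable names follow the slide; the initial values are
arbitrary).

    #include <stdio.h>

    int main(void)
    {
        int A, B, X = 3, Y = 4, C = 5;

        /* Flow (true) dependency: the second statement reads the A
         * written by the first, so it cannot be moved earlier.     */
        A = X + Y;
        B = A + C;

        /* Anti-dependency: A is read first and written afterwards;
         * the write must not be moved ahead of the read.           */
        B = A + C;
        A = X + Y;

        /* Output dependency: two writes to A; reordering them
         * would change the final value of A.                       */
        A = 2;
        X = A + 1;
        A = 5;

        printf("A=%d B=%d X=%d\n", A, B, X);
        return 0;
    }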

18
Identifying Dependency
  • Draw a Directed Acyclic Graph (DAG) to identify
    the dependencies among a sequence of instructions.
  • Anti-dependency: a variable appears as a parent
    in a calculation and then as a child in a later
    calculation.
  • Output dependency: a variable appears as a child
    in a calculation and then as a child again in a
    later calculation.