Transcript and Presenter's Notes

Title: Parallel Programming Platforms


1
Parallel Programming Platforms
Chapter 2
Reference: http://www-users.cs.umn.edu/~karypis/parbook/
http://www.eel.tsint.edu.tw/teacher/ttsu/teach01.htm
2
Introduction
  • The traditional logical view of a sequential
    computer consists of
  • a memory connected to a processor via a datapath.
  • All three components (processor, memory, and
    datapath) present bottlenecks to the overall
    processing rate of a computer system.

3
Introduction
  • A number of architectural innovations over the
    years have addressed these bottlenecks. One of
    the most important innovations is multiplicity
    in
  • processor units,
  • datapaths, and
  • memory units.
  • This multiplicity is either entirely hidden from
    the programmer, as in the case of implicit
    parallelism, or exposed to the programmer in
    different forms.

4
Introduction
  • Learning objectives in this chapter
  • An overview of important architecture concepts as
    they relate to parallel processing.
  • To provide sufficient detail for programmers to
    be able to write efficient code on a variety of
    platforms.
  • It develops cost models and abstractions for
    quantifying the performance of various parallel
    algorithms, and identifying bottlenecks resulting
    from various programming constructs.

5
Introduction
  • Parallelizing sub-optimal serial codes often has
    the undesirable effects of unreliable speedups and
    misleading runtimes.
  • It advocates optimizing serial performance of
    codes before attempting parallelization.
  • The tasks of serial and parallel optimization
    often have very similar characteristics.

6
Outline
  1. Implicit Parallelism
  2. Limitations of Memory System Performance
  3. Dichotomy of Parallel Computing Platforms
  4. Physical Organization of Parallel Platforms
  5. Communication Costs in Parallel Machines
  6. Routing Mechanisms for Interconnection Networks
  7. Impact of Process-Processor Mapping and Mapping
    Techniques
  8. Case Studies

7
Implicit Parallelism
  • Trend in Microprocessor Architecture
  • Pipelining and Superscalar Execution
  • Very Long Instruction Word Processors (VLIW)

8
Trend in Microprocessor Architecture
  • Clock speeds of microprocessors have posted
    impressive gains (two to three orders of
    magnitude) over the past 20 years.
  • However, these increments are severely diluted
    by the limitations of memory technology.
  • Consequently, techniques that enable execution of
    multiple instructions in a single clock cycle
    have become popular.

9
Trend in Microprocessor Architecture
  • Mechanisms used by various processors for
    supporting multiple instruction execution.
  • Pipelining and Superscalar Execution
  • Very Long Instruction Word Processors (VLIW)

10
Pipelining and Superscalar Execution
  • By overlapping various stages in instruction
    execution, pipelining enables faster execution.
  • To increase the speed of a single pipeline, one
    would break down the tasks into smaller and
    smaller units, thus lengthening the pipeline and
    increasing overlap in execution.

11

Pipelining and Superscalar Execution
  • For example, the Pentium 4, which operates at 2.0
    GHz, has a 20-stage pipeline.
  • Long instruction pipelines therefore need
    effective techniques for predicting branch
    destinations so that pipelines can be
    speculatively filled.
  • An obvious way to improve instruction execution
    rate beyond this level is to use multiple
    pipelines.
  • During each clock cycle, multiple instructions
    are piped into the processor in parallel.

12
Superscalar Execution Example 2.1
Example of a two-way superscalar execution of
instructions.
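(The figure itself is not reproduced in this transcript. The three code
fragments discussed on the following slides are sketched below as a hedged
reconstruction; addresses other than @1000 and @1004 are illustrative, and
the exact listing in Fig. 2.1 may differ, but the dependencies match the
discussion.)

    Code fragment 1 (independent loads; two-way issue possible):
      1. load  R1, @1000
      2. load  R2, @1008
      3. add   R1, @1004
      4. add   R2, @100C
      5. add   R1, R2
      6. store R1, @2000

    Code fragment 2 (every instruction depends on R1; no dual issue):
      1. load  R1, @1000
      2. add   R1, @1004
      3. add   R1, @1008
      4. add   R1, @100C
      5. store R1, @2000

    Code fragment 3 (reordering of fragment 1; needs out-of-order issue):
      1. load  R1, @1000
      2. add   R1, @1004
      3. load  R2, @1008
      4. add   R2, @100C
      5. add   R1, R2
      6. store R1, @2000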
13
  • Consider the first code fragment in Fig. 2.1(a)
  • t0: The first and second instructions are
    independent and therefore can be issued
    concurrently.
  • t1: The next two instructions (rows 3, 4) are also
    mutually independent, although they must be
    executed after the first two instructions (t0).
  • They can be issued concurrently at t1 since the
    processor is pipelined.
  • t2: Only the add instruction (row 5) is issued.
  • t3: Only the store instruction (row 6) is issued.
  • The two instructions (rows 5, 6) cannot be executed
    concurrently since the result of the former is
    used by the latter.

14
Superscalar Execution
  • Scheduling of instructions is determined by a
    number of factors
  • True Data Dependency: The result of one operation
    is an input to the next.
  • Resource Dependency: Two operations require the
    same resource.
  • Branch Dependency: Scheduling instructions across
    conditional branch statements cannot be done
    deterministically a priori.

15
Superscalar Execution
  • Scheduling of instructions is determined by a
    number of factors
  • The scheduler, a piece of hardware, looks at a
    large number of instructions in an instruction
    queue and selects an appropriate number of
    instructions to execute concurrently based on
    these factors.
  • The complexity of this hardware is an important
    constraint on superscalar processors.

16
Dependency- True data dependency
  • The result of an instruction may be required
    by subsequent instructions.
  • Consider the 2nd code fragment: there is a true
    data dependency between load R1, @1000 and add
    R1, @1004.
  • Since the resolution is done at runtime, it must
    be supported in hardware. The complexity of this
    hardware can be high.
  • The amount of instruction-level parallelism in
    a program is often limited and is a function of
    coding technique.

17
Dependency- True data dependency
  • In the 2nd code fragment, there can be no
    simultaneous issue, leading to poor resource
    utilization.
  • The 3rd code fragment illustrates that, in many
    cases, it is possible to extract more parallelism
    by reordering the instructions and by altering
    the code.
  • The code reorganization corresponds to
    exposing parallelism in a form that can be used
    by the instruction issue mechanism.

18
Dependency- Resource dependency
  • The form of dependency in which two instructions
    compete for a single processor resource.
  • As an example, consider the co-scheduling of two
    floating point operations on a dual issue machine
    with a single floating point unit.
  • Although there might be no data dependencies
    between the instructions, they cannot be
    scheduled together since both need the floating
    point unit.

19
Dependency- Branch or procedural dependencies
  • Since the branch destination is known only at the
    point of execution, scheduling instructions a
    priori across branches may lead to errors.
  • These dependencies are referred to as branch or
    procedural dependencies and are typically handled
    by speculatively scheduling across branches and
    rolling back in case of errors.

20
Dependency- Branch or procedural dependencies
  • On average, a branch instruction is encountered
    between every five to six instructions.
  • Therefore, just as in populating instruction
    pipelines, accurate branch prediction is critical
    for efficient superscalar execution.
  • The ability of a processor to detect and schedule
    concurrent instructions is critical to superscalar
    performance.

21
Dependency- Branch or procedural dependencies
  • The 3rd code fragment is merely a semantically
    equivalent reordering of the 1st code fragment.
    However, there is a data dependency between load
    R1, @1000 and add R1, @1004.
  • Therefore, these instructions cannot be issued
    simultaneously. However, if the processor has the
    ability to look ahead, it would realize that it
    is possible to schedule the 3rd instruction with
    the 1st instruction.
  • In this way, the same execution schedule can be
    derived for the 1st and 3rd code fragments.
    However, the processor needs the ability to issue
    instructions out-of-order to accomplish the
    desired ordering.

22
Dependency- Branch or procedural dependencies
  • Most current microprocessors are capable of
    out-of-order issue and completion.
  • This model, also referred to as dynamic instruction
    issue, exploits maximum instruction-level
    parallelism. The processor uses a window of
    instructions from which it selects instructions
    for simultaneous issue. This window corresponds
    to the look-ahead of the scheduler. (Dynamic
    Dependency Analysis)

23
Dependency-Branch or procedural dependencies
  • In Fig. 2.1(c)
  • These are essentially wasted cycles from the
    point of view of the execution unit. If, during a
    particular cycle, no instructions are issued on
    the execution units, it is referred to as
    vertical waste. If only part of the execution
    units are used during a cycle, it is termed
    horizontal waste.
  • In all, only three of the eight available cycles
    are used for computation. This implies that the
    code fragment will yield no more than
    three-eighths of the peak rated FLOPS count of
    the processor.

24
Dependency- Branch or procedural dependencies
  • Often, due to limited parallelism, resource
    dependencies, or the inability of a processor to
    extract parallelism, the resources of superscalar
    processors are heavily under-utilized.
  • Current microprocessors typically support up to
    four-issue superscalar execution.

25
Very Long Instruction Word Processors (VLIW)
  • The parallelism extracted by superscalar
    processors is often limited by the instruction
    look-ahead.
  • The hardware logic for Dynamic Dependency
    Analysis is typically in the range of 5-10% of
    the total logic on conventional microprocessors.
  • The complexity grows roughly quadratically with
    the number of issues and becomes a bottleneck.

26
Very Long Instruction Word Processors (VLIW)
  • An alternate concept for exploiting
    instruction-level parallelism, used in very long
    instruction word (VLIW) processors, relies on the
    compiler to resolve dependencies and resource
    availability at compile time.

27
Very Long Instruction Word Processors (VLIW)
  • Instructions that can be executed concurrently are
    packed into groups and parceled off to the
    processor as a single long instruction word to be
    executed on multiple functional units at the same
    time.

28
Very Long Instruction Word Processors (VLIW)
  • VLIW advantages
  • Since scheduling is done in software, the
    decoding and instruction issue mechanisms are
    simpler in VLIW processors.
  • The compiler has a larger context from which to
    select instructions and can use a variety of
    transformations to optimize parallelism when
    compared to a hardware issue unit.
  • Additional parallel instructions are typically
    made available to the compiler to control
    parallel execution.

29
Very Long Instruction Word Processors (VLIW)
  • VLIW disadvantages
  • Compilers do not have the dynamic program state
    (e.g. the branch history buffer) available to
    make scheduling decisions.
  • This reduces the accuracy of branch and memory
    prediction, but allows the use of more
    sophisticated static prediction schemes.
  • Other runtime situations are extremely
    difficult to predict accurately.
  • This limits the scope and performance of static
    compiler-based scheduling.

30
Very Long Instruction Word Processors (VLIW)
  • VLIW is very sensitive to the compiler's ability
    to detect data and resource dependencies and R/W
    hazards, and to schedule instructions for maximum
    parallelism. Loop unrolling, branch prediction,
    and speculative execution all play important
    roles in the performance of VLIW processors.
  • While superscalar and VLIW processors have been
    successful in exploiting implicit parallelism,
    they are generally limited to smaller scales of
    concurrency, typically in the range of four- to
    eight-way parallelism.

31
Limitations of Memory System Performance
32
Limitations of Memory System Performance
  • The memory system, and not processor speed, is
    often the bottleneck for many applications.
  • Memory system performance is largely captured by
    two parameters, latency and bandwidth.

33
Limitations of Memory System Performance
  • Latency is the time from the issue of a memory
    request to the time the data is available at the
    processor.
  • Bandwidth is the rate at which data can be pumped
    to the processor by the memory system.

34
Example 2.2: Effect of memory latency on
performance
  • Consider a processor operating at 1 GHz (1 ns
    clock) connected to a DRAM with a latency of 100
    ns (no caches), i.e., 100 cycles.
  • Assume that the processor has two multiply-add
    units and is capable of executing four
    instructions in each cycle of 1 ns. The peak
    processor rating is therefore 4 GFLOPS.
  • (4 FLOPS/cycle x 10^9 cycles/s = 4 x 10^9 FLOPS)

35
Example 2.2: Effect of memory latency on
performance
  • Since the memory latency is equal to 100 cycles
    and the block size is one word, every time a
    memory request is made, the processor must wait
    100 cycles before it can process the data.
  • It is easy to see that the peak speed of this
    computation is limited to one floating point
    operation every 100 ns (100 x 10^-9 = 10^-7 s),
    i.e., a speed of 10 MFLOPS (1/10^-7 = 10^7 =
    10 x 10^6 FLOPS).

36
Limitations of Memory System Performance
  • Improve Effective Memory Latency Using Caches
  • Impact of Memory Bandwidth
  • Alternate Approaches for Hiding Memory Latency
  • Multithreading for Latency Hiding
  • Prefetching for Latency Hiding
  • Tradeoffs of Multithreading and Prefetching

37
Improve Effective Memory Latency Using Caches
  • One innovation addresses the speed mismatch by
    placing a smaller and faster memory between the
    processor and the DRAM.
  • The fraction of data references satisfied by the
    cache is called the cache hit ratio.
  • The notion of repeated references to a data item
    in a small time window is called temporal
    locality.
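  • (A hedged aside, not stated on the slide: with hit ratio h, cache
    access time t_c, and DRAM latency t_m, the effective access time is
    roughly t_eff = h x t_c + (1 - h) x t_m. For example, h = 0.9,
    t_c = 1 ns, t_m = 100 ns gives t_eff = 0.9 + 10 = 10.9 ns.)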

38
Improve Effective Memory Latency Using Caches
  • The effective computation rate of many
    applications is bounded not by the processing
    rate of the CPU, but by the rate at which data
    can be pumped into the CPU. Such computations are
    referred to as being memory bound.

39
Example 2.3: Impact of caches on memory system
performance
  • As in Example 2.2, consider a 1 GHz processor with
    a 100 ns latency DRAM, and introduce a cache of
    size 32 KB with a latency of 1 ns, i.e., one
    cycle. We use this setup to multiply two matrices
    A and B of dimension 32 x 32.
  • A: 32 x 32 = 2^10 = 1K words; B: 32 x 32 = 2^10 =
    1K words; together 1K + 1K = 2K words, about 2000
    words.
  • Fetching the two matrices into cache takes about
    2000 x 100 ns = 200 µs.
  • Multiplying two n x n matrices takes 2n^3
    operations: 2 x 32^3 = 64K operations.

40
Example 2.3: Impact of caches on memory system
performance
  • The processor has two multiply-add units and is
    capable of executing four instructions in each
    cycle of 1 ns.
  • The 64K operations therefore take 64K / 4 = 16K
    cycles (or 16 µs) at four instructions per cycle.
  • Total time: 200 µs + 16 µs = 216 µs
  • Computation rate: 64K / 216 µs = 303.4 x 10^6,
    about 303 MFLOPS
  • Compare with Example 2.2:
  • Improvement ratio 303 / 10 = 30.3, about
    thirty-fold.
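
(A minimal C sketch of the computation this example assumes; the slide
shows no code and the function name is illustrative. Once A and B are
cache-resident, the 2 x 32^3 = 64K multiply-add operations run at
processor speed.)

    #define N 32

    /* C = A x B for N x N matrices: 2*N*N*N = 64K floating point
       operations (one multiply and one add per innermost iteration). */
    void matmul(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];  /* one multiply-add */
                C[i][j] = sum;
            }
    }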

41
Impact of Memory Bandwidth
  • Memory Bandwidth
  • The rate at which data can be moved between the
    processor and memory.
  • It is determined by the memory bus as well as the
    memory units.
  • A single memory request returns a contiguous
    block of four words. The single unit of four
    words in this case is also referred to as a
    cache line.

42
Impact of Memory Bandwidth
  • In the following example, the data layouts are
    assumed to be such that consecutive data words in
    memory are used by successive instructions. In
    other words, if we take a computation-centric
    view, there is spatial locality of memory
    access.

43
Example2.4 Effect of block size dot-product of
two vectors
  • A peak speed of 10 MFLOPS as illustrated in
    example 2.2
  • If the block size is increased to four words
    i.e., the processor can fetch a four-word cache
    line every 100 cycles
  • For each pair of words, the dot-product performs
    one multiply-add, i.e., 2 FLOPs,
  • then four words need 8 FLOPs can be performed in
    2x100 200 cycles
  • The corresponds to a FLOP every 200/8 25 ns, for
    a peak of 1/25ns109/25 40 MFLOPs
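
(A minimal C sketch of the dot product being analyzed; the slide shows no
code and the names are illustrative. Each iteration consumes one word from
each vector and performs one multiply-add, i.e., 2 FLOPs.)

    /* Dot product of two n-element vectors: one multiply-add
       (2 FLOPs) per pair of words fetched from memory. */
    double dot_product(const double *a, const double *b, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];  /* one multiply-add per element pair */
        return sum;
    }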

44
Impact of Memory Bandwidth
  • If we take a data-layout centric point of view,
    the computation is ordered so that successive
    computations require contiguous data.
  • If the computation (or access pattern) does not
    have spatial locality, then effective bandwidth
    can be much smaller than the peak bandwidth.

45
Row Major vs. Column Major
  • Row-major traversal
  • for (i = 0; i < 100; i++)
  •   for (j = 0; j < 100; j++)
  •     a[i][j] = b[i][j] + c[i][j];
  • Column-major traversal
  • for (j = 0; j < 100; j++)
  •   for (i = 0; i < 100; i++)
  •     a[i][j] = b[i][j] + c[i][j];
46
Impact of strided access: Example 2.5
  • Consider the following code fragment
  • for (i = 0; i < 1000; i++) {
  •   column_sum[i] = 0.0;
  •   for (j = 0; j < 1000; j++)
  •     column_sum[i] += A[j][i];
  • }
  • The code fragment sums columns of the matrix A
    into a vector column_sum.
  • Assumption: the matrix has been stored in a
    row-major fashion in memory.

47
  • Example 2.5: Impact of strided access
    (figures, not reproduced: "Example 2.5 column sum" and
    "Example 2.6 column sum II")
48
Eliminating strided access: Example 2.6
  • We can fix the above code as follows
  • for (i = 0; i < 1000; i++)
  •   column_sum[i] = 0.0;
  • for (j = 0; j < 1000; j++)
  •   for (i = 0; i < 1000; i++)
  •     column_sum[i] += A[j][i];
  • In this case, the matrix is traversed in
    row-major order and performance can be expected
    to be significantly better.

49
Memory System Performance Summary
  • The series of examples presented in this section
    illustrates the following concepts
  • Exploiting spatial and temporal locality in
    applications is critical for amortizing memory
    latency and increasing effective memory
    bandwidth.
  • The ratio of the number of operations to the
    number of memory accesses is a good indicator of
    anticipated tolerance to memory bandwidth.
  • Memory layouts and appropriate organization of
    the computation can make a significant impact on
    spatial and temporal locality.

50
Alternate Approaches for Hiding Memory Latency
  • Imagine sitting at your computer browsing the web
    during peak network traffic hours. The lack of
    response from your browser can be alleviated in
    several ways.
  • Multithreading for latency hiding: like opening
    multiple browsers and accessing different pages
    in each browser; while we are waiting for one
    page to load, we could be reading others.
  • Prefetching for latency hiding: like anticipating
    which pages we are going to browse ahead of time
    and issuing requests for them in advance.
  • Spatial locality in accessing memory words: like
    accessing a whole bunch of pages in one go.

51
Multithreading for Latency Hiding
  • A thread is a single stream of control in the
    flow of a program.
  • We illustrate threads with a simple example
    (Example 2.7)
  • for (i = 0; i < n; i++)
  •   c[i] = dot_product(get_row(a, i), b);
  • Each dot product is independent of the others, and
    therefore represents a concurrent unit of
    execution. We can safely rewrite the above code
    segment as (see the sketch after this slide)
  • for (i = 0; i < n; i++)
  •   c[i] = create_thread(dot_product, get_row(a, i), b);
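
(The create_thread call above is pseudocode. A hedged sketch of how it
might map onto POSIX threads is shown below; the task struct and function
names are illustrative, not from the slides.)

    #include <pthread.h>
    #include <stddef.h>

    typedef struct { const double *row, *b; double *out; int n; } task_t;

    /* Thread body: compute one dot product. */
    static void *dot_task(void *arg)
    {
        task_t *t = (task_t *)arg;
        double sum = 0.0;
        for (int i = 0; i < t->n; i++)
            sum += t->row[i] * t->b[i];
        *t->out = sum;
        return NULL;
    }

    /* c[i] = dot_product(row i of a, b), one thread per row. */
    void matvec_threaded(const double *a, const double *b, double *c,
                         int rows, int n, pthread_t *tid, task_t *tasks)
    {
        for (int i = 0; i < rows; i++) {
            tasks[i] = (task_t){ a + (size_t)i * n, b, &c[i], n };
            pthread_create(&tid[i], NULL, dot_task, &tasks[i]);
        }
        for (int i = 0; i < rows; i++)
            pthread_join(tid[i], NULL);
    }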

52
Multithreading for Latency Hiding Example 2.7
  • In the code, the first instance of this function
    accesses a pair of vector elements and waits for
    them.
  • In the meantime, the second instance of this
    function can access two other vector elements in
    the next cycle, and so on.
  • After l units of time, where l is the latency of
    the memory system, the first function instance
    gets the requested data from memory and can
    perform the required computation.
  • In the next cycle, the data items for the next
    function instance arrive, and so on.
  • In this way, in every clock cycle, we can perform
    a computation.

53
Multithreading for Latency Hiding
  • The execution schedule in the previous example is
    predicated upon two assumptions
  • the memory system is capable of servicing
    multiple outstanding requests, and
  • the processor is capable of switching threads at
    every cycle.

54
Multithreading for Latency Hiding
  • It also requires the program to have an explicit
    specification of concurrency in the form of
    threads.
  • Machines such as the HEP and Tera rely on
    multithreaded processors that can switch the
    context of execution in every cycle.
  • Consequently, they are able to hide latency
    effectively.

55
Prefetching for Latency Hiding
  • Misses on loads cause programs to stall.
  • Why not advance the loads so that by the time the
    data is actually needed, it is already there!
  • The only drawback is that you might need more
    space to store advanced loads.
  • However, if the advanced loads are overwritten,
    we are no worse than before!

56
Example 2.8: Hiding latency with prefetching
  • Consider the problem of adding two vectors a and
    b using a single loop.
  • In the first iteration of the loop
  • The processor requests a[0] and b[0].
  • Since these are not in the cache, the processor
    must pay the memory latency.
  • While these requests are being serviced, the
    processor also requests a[1] and b[1].
  • Assuming that each request is generated in one
    cycle (1 ns) and memory requests are satisfied in
    100 ns
  • After 100 such requests, the first set of data
    items is returned by the memory system.
  • Subsequently, one pair of vector components will
    be returned every cycle.
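
(A minimal C sketch of this idea, assuming a compiler that honors
GCC-style __builtin_prefetch hints; the prefetch distance of 16
iterations is an illustrative guess, not a value from the slide.)

    /* c[i] = a[i] + b[i], issuing prefetches ahead of use so that the
       memory latency of future iterations overlaps with current work. */
    void vec_add_prefetch(const double *a, const double *b,
                          double *c, int n)
    {
        const int dist = 16;  /* illustrative prefetch distance */
        for (int i = 0; i < n; i++) {
            if (i + dist < n) {
                __builtin_prefetch(&a[i + dist], 0, 1);  /* read hint */
                __builtin_prefetch(&b[i + dist], 0, 1);
            }
            c[i] = a[i] + b[i];
        }
    }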

57
Example 2.9: Impact of bandwidth on multithreaded
programs
  • Consider a computation running on a machine with
    a 1 GHz clock, 4-word cache line, single cycle
    access to the cache, and 100 ns latency to DRAM.
    The computation has a cache hit ratio at 1 KB of
    25% and at 32 KB of 90%.
  • Case 1: a single-threaded execution in which the
    entire cache (32 KB) is available to the serial
    context.
  • Case 2: a multithreaded execution with 32 threads
    where each thread has a cache residency of 1 KB.
  • Assume the computation makes one data request in
    every cycle of 1 ns.

58
Example 2.9
  • In the first case (a single thread)
  • DRAM latency: 100 ns
  • 4 words/cycle of computation
  • = 4 words/ns (4-way issue)
  • In 100 ns the CPU must be supplied with 400 words
  • In 1 s it must be supplied with 10^7 x 400 words
    = 4000 MB
  • 10% of this comes from DRAM: 10% x 4000 MB
    = 400 MB/s
  • Required DRAM bandwidth: 400 MB/s

59
Example 2.9
  • In the second case (32 threads)
  • DRAM latency: 100 ns
  • 4 words/cycle of computation
  • = 4 words/ns (4-way issue)
  • In 100 ns the CPU must be supplied with 400 words
  • In 1 s it must be supplied with 10^7 x 400 words
    = 4000 MB
  • 75% of this comes from DRAM: 75% x 4000 MB
    = 3000 MB/s = 3 GB/s
  • Required DRAM bandwidth: 3 GB/s

60
Tradeoffs of Multithreading and Prefetching
  • Bandwidth requirements of a multithreaded system
    may increase very significantly because of the
    smaller cache residency of each thread.
  • Multithreaded systems become bandwidth bound
    instead of latency bound.
  • Multithreading and prefetching only address the
    latency problem and may often exacerbate the
    bandwidth problem.
  • Multithreading and prefetching also require
    significantly more hardware resources in the form
    of storage.

61
Dichotomy of Parallel Computing Platforms
62
Dichotomy of Parallel Computing Platforms
  • Logical
  • Control Structure of Parallel Platforms (the
    former)
  • Communication Model of Parallel Platforms
    (chap. 10) (the latter)
  • Shared-Address-Space Platforms (chap. 7)
  • Message-Passing Platforms (chap. 6)
  • Physical
  • Architecture of an Ideal Parallel Computer
  • Interconnection Networks for Parallel Computers
  • Network Topologies
  • Evaluating Static Interconnection Networks
  • Evaluating Dynamic Interconnection Networks
  • Cache Coherence in Multiprocessor Systems

63
Control Structure of Parallel Programs
  • Parallelism can be expressed at various levels of
    granularity, from the instruction level to
    processes.
  • Between these extremes exist a range of models,
    along with corresponding architectural support.

64
Example 2.10: Parallelism from a single instruction
on multiple processors
  • Consider the following code segment that adds two
    vectors
  • for (i = 0; i < 1000; i++)
  •   c[i] = a[i] + b[i];
  • c[0] = a[0] + b[0], c[1] = a[1] + b[1], etc. can
    be executed independently of each other.
  • If there is a mechanism for executing the same
    instruction on all the processors, each with its
    appropriate data, we could execute this loop much
    faster.

65
SIMD and MIMD
66
SIMD
  • SIMD (Single instruction stream, multiple data
    stream) Architecture
  • A single control unit dispatches instructions to
    each processing unit.
  • In an SIMD parallel computer, the same
    instruction is executed synchronously by all
    processing units.
  • These architectural enhancements rely on the
    highly structured (regular) nature of the
    underlying computations, for example in image
    processing and graphics, to deliver improved
    performance.

67
MIMD
  • MIMD (Multiple instruction stream, multiple data
    stream) Architecture
  • Computers in which each processing element can
    execute a different program independently of the
    other processing elements.
  • A simple variant of this model, called single
    program multiple data (SPMD), relies on multiple
    instances of the same program executing on
    different data.
  • The SPMD model is widely used by many parallel
    platforms and requires minimal architectural
    support. Examples of such platforms include the
    Sun Ultra Servers, multiprocessor PCs,
    workstation clusters, and the IBM SP.

68
SIMD vs. MIMD
  • SIMD computers require less hardware than MIMD
    computers because they need only one global
    control unit.
  • Furthermore, SIMD computers require less memory
    because only one copy of the program needs to be
    stored.
  • In contrast, platforms supporting the SPMD
    paradigm can be built from inexpensive
    off-the-shelf components with relatively little
    effort in a short amount of time.

69
SIMD Disadvantages
  • Since the underlying serial processors change so
    rapidly, SIMD computers suffer from fast
    obsolescence.
  • The irregular nature of many applications makes
    SIMD architectures less suitable.
  • Example 2.11 illustrates a case in which SIMD
    architectures yield poor resource utilization in
    the presence of conditional execution.

70
Example 2.11: Conditional Execution in SIMD
Processors
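(The slide's figure is not reproduced here. A hedged sketch of the kind of
data-dependent conditional it illustrates is shown below; every SIMD
processing element executes the same instruction stream on its own copy of
A, B, and C.)

    /* Under a single instruction stream, the "if" and "else" branches
       must be issued one after the other: PEs whose predicate is false
       are masked off and idle, so only a fraction of the machine does
       useful work in each phase. */
    if (B == 0)
        C = A;      /* phase 1: only PEs with B == 0 are active */
    else
        C = A / B;  /* phase 2: only PEs with B != 0 are active */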
71
Communication Model of Parallel Platforms
  • There are two primary forms of data exchange
    between parallel tasks
  • Shared-Address-Space Platforms (chap. 7)
  • Message-Passing Platforms (chap. 6)

72
Shared-Address-Space Platforms
  • Part (or all) of the memory is accessible to all
    processors.
  • Processors interact by modifying data objects
    stored in this shared-address-space.
  • If the time taken by a processor to access any
    memory word in the system (global or local) is
    identical, the platform is classified as a
    uniform memory access (UMA) machine; otherwise it
    is a non-uniform memory access (NUMA) machine.

73
Shared-Address-Space Platforms
74
Shared-Address-Space Platforms
  • The Shared-Address-Space view of a parallel
    platform supports a common data space that is
    accessible to all processors.
  • Processors interact by modifying data objects
    stored in this Shared-Address-Space.
  • Memory in Shared-Address-Space platforms can be
    local or global
  • Shared-Address-Space platforms supporting SPMD
    programming are also referred to as
    multiprocessors.

75
Shared-Address-Space vs. Shared Memory Machines
  • It is important to note the difference between
    the terms shared address space and shared memory.
  • We refer to the former as a programming
    abstraction and to the latter as a physical
    machine attribute.
  • It is possible to provide a shared address space
    using a physically distributed memory

76
Message-Passing Platforms
  • These platforms comprise a set of processors,
    each with its own (exclusive) memory.
  • Instances of such a view come naturally from
    clustered workstations and
    non-shared-address-space multicomputers.
  • These platforms are programmed using (variants
    of) send and receive primitives.
  • Libraries such as MPI and PVM provide such
    primitives.
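
(A minimal MPI sketch of the send/receive style of programming referred to
above; a hedged illustration rather than code from the slides, assuming at
least two ranks.)

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;  /* data owned by rank 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }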

77
Message Passing vs. Shared Address Space
Platforms
  • Message passing requires little hardware support,
    other than a network.
  • Shared address space platforms can easily
    emulate message passing. The reverse is more
    difficult to do (in an efficient manner).