Title: Hardware Parallel/Distributed Processing, High Performance Computing, Top 500 List, Grid Computing
1. Hardware Parallel/Distributed Processing, High Performance Computing, Top 500 List, Grid Computing
CMPE 49B Special Topics in CMPE: Multi-core Programming
(picture: ASCI White, the most powerful computer in the world in 2001)
2. Von Neumann Architecture
(diagram: CPU, RAM, and I/O devices connected by a shared bus)
3. Memory Hierarchy
(diagram, fastest to slowest: registers, cache, real memory, disk, CD)
4. History of Computer Architecture
- 4 generations (identified by logic technology):
  - Vacuum tubes
  - Transistors
  - Integrated circuits
  - VLSI (very large scale integration)
5. Performance Trends
6. Performance Trends
- Traditional mainframe/supercomputer performance: 25% increase per year
- But microprocessor performance: 50% increase per year since the mid-80s
7. Moore's Law
- Transistor density doubles every 18 months
- Moore is a co-founder of Intel
- 60% increase per year
- Exponential growth
- PC costs decline
- PCs are the building bricks of all future systems
8. VLSI Generation
9. Bit-Level Parallelism (up to mid-80s)
- 4-bit microprocessors replaced by 8-bit, 16-bit, 32-bit, etc.
- Doubling the width of the datapath reduces the number of cycles required to perform a full 32-bit operation
- By the mid-80s, the benefits of this kind of parallelism had been reaped (full 32-bit word operations combined with the use of caches)
10. Instruction-Level Parallelism (mid-80s to mid-90s)
- Basic steps in instruction processing (instruction decode, integer arithmetic, address calculation) could be performed in a single cycle
- Pipelined instruction processing
- Reduced instruction set computing (RISC)
- Superscalar execution
- Branch prediction
11. Thread/Process-Level Parallelism (mid-90s to present)
- On average, control transfers occur roughly once in every five instructions, so exploiting instruction-level parallelism at a larger scale is not possible
- Use multiple independent threads or processes instead
- Threads and processes run concurrently
12. Evolution of the Infrastructure
- Electronic Accounting Machine Era: 1930-1950
- General Purpose Mainframe and Minicomputer Era: 1959-present
- Personal Computer Era: 1981-present
- Client/Server Era: 1983-present
- Enterprise Internet Computing Era: 1992-present
13. Sequential vs Parallel Processing
- Sequential:
  - physical limits reached
  - easy to program
  - expensive supercomputers
- Parallel:
  - raw power unlimited
  - more memory, multiple caches
  - made up of COTS components, so cheap
  - difficult to program
14. What is Multi-Core Programming?
- Answer: it is basically parallel programming on a single computer box (e.g. a desktop, a notebook, a blade); a minimal sketch follows
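As a concrete taste of this (an illustration, not code from the course), the following Python sketch runs the same function on the cores of a single box using the standard multiprocessing module:

```python
# Minimal sketch: parallel programming on a single box, with one
# worker process per core via Python's multiprocessing module.
import os
from multiprocessing import Process

def work(task_id):
    # Each task runs in its own OS process, potentially on its own core.
    print(f"task {task_id} running in process {os.getpid()}")

if __name__ == "__main__":
    n = os.cpu_count() or 2
    procs = [Process(target=work, args=(i,)) for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```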
15. Amdahl's Law
- The serial percentage of a program is fixed, so the speed-up obtained by employing parallel processing is bounded.
- Led to pessimism in the parallel processing community and prevented development of parallel machines for a long time.
$$\text{Speedup} = \frac{1}{s + \frac{1-s}{P}} \le \frac{1}{s}$$
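As a worked instance with an assumed serial fraction of $s = 0.05$: no matter how many processors are used, $\text{Speedup} \le 1/s = 20$; even with $P = 1024$, the formula gives $1/(0.05 + 0.95/1024) \approx 19.6$.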
16. Gustafson's Law
- The serial percentage depends on the number of processors/input.
- Broke/disproved Amdahl's law.
- Demonstrated achieving more than 1000-fold speedup using 1024 processors.
- Justified parallel processing.
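The scaled-speedup formula usually quoted for Gustafson's law (stated here as standard background, since the slide's own formula was not transcribed) is $\text{Speedup}(P) = s + (1-s)P$, where $s$ is the serial fraction measured on the parallel machine. With $P = 1024$ and an assumed $s = 0.01$, this gives $0.01 + 0.99 \times 1024 \approx 1013.8$, in line with the more-than-1000-fold speedup cited above.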
17. Grand Challenge Applications
- Important scientific and engineering problems identified by the U.S. High Performance Computing and Communications Program ('92)
18. Flynn's Taxonomy
- Classifies computer architectures according to:
  - the number of instruction streams the machine can process at a time
  - the number of data elements on which it can operate simultaneously
                              Data Streams
                              Single    Multiple
  Instruction      Single     SISD      SIMD
  Streams          Multiple   MISD      MIMD
19. SPMD Model (Single Program Multiple Data)
- Each processor executes the same program asynchronously
- Synchronization takes place only when processors need to exchange data
- SPMD is an extension of SIMD (relaxes synchronized instruction execution)
- SPMD is a restriction of MIMD (uses only one source/object; see the sketch below)
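A minimal SPMD sketch in Python (names and decomposition are assumptions for illustration): every process runs the same program, behavior differs only through the process id, and the queue is the single point where data is exchanged.

```python
# SPMD sketch: NPROCS processes run the identical function `program`;
# the data each one touches is determined solely by its id (mypid).
from multiprocessing import Process, Queue

NPROCS = 4

def program(mypid, q):
    # Same source for every process; data decomposition depends on mypid.
    local = sum(range(mypid * 10, (mypid + 1) * 10))
    q.put((mypid, local))          # exchanging data is the only sync point

if __name__ == "__main__":
    q = Queue()
    ps = [Process(target=program, args=(r, q)) for r in range(NPROCS)]
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    print(sorted(q.get() for _ in range(NPROCS)))
```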
20. Parallel Processing Terminology
- Embarrassingly parallel:
  - applications which are trivial to parallelize
  - large amounts of independent computation
  - little communication
- Data parallelism:
  - model of parallel computing in which a single operation can be applied to all data elements simultaneously
  - amenable to SIMD or SPMD style of computation (see the sketch below)
- Control parallelism:
  - many different operations may be executed concurrently
  - requires MIMD/SPMD style of computation
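The data-parallel style can be sketched in a few lines of Python (the scale function and the data are placeholders chosen for illustration): one operation is applied to all elements, with no communication between them, which is also what makes the problem embarrassingly parallel.

```python
# Data parallelism in miniature: one operation (scaling) is applied to
# every element of the data set by a pool of worker processes.
from multiprocessing import Pool

def scale(x):
    return 2.0 * x

if __name__ == "__main__":
    data = list(range(1_000_000))
    with Pool() as pool:                       # one worker per core by default
        result = pool.map(scale, data, chunksize=10_000)
    print(result[:5])                          # [0.0, 2.0, 4.0, 6.0, 8.0]
```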
21. Parallel Processing Terminology
- Scalability:
  - if the size of the problem is increased, the number of processors that can be effectively used can be increased (i.e. there is no limit on parallelism)
  - the cost of a scalable algorithm grows slowly as the input size and the number of processors are increased
  - data parallel algorithms are more scalable than control parallel algorithms
- Granularity:
  - fine-grain machines employ a massive number of weak processors, each with a small memory
  - coarse-grain machines have a smaller number of powerful processors, each with large amounts of memory
22. Shared Memory Machines
- Memory is globally shared, therefore processes (threads) see a single address space
- Coordination of accesses to locations is done by use of locks provided by thread libraries (see the sketch below)
- Example machines: Sequent, Alliant, SUN Ultra, dual/quad-board Pentium PC
- Example thread libraries: POSIX threads, Linux threads
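A small illustration of lock-based coordination, using Python's threading module as a stand-in for the POSIX/Linux thread libraries named above: all threads share one address space, and the lock makes the read-modify-write of the shared counter atomic.

```python
# Four threads increment one shared counter; the lock serializes the
# non-atomic `counter += 1` so no updates are lost.
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:                 # without the lock, updates could be lost
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                     # 400000
```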
23. Shared Memory Machines
- Can be classified as:
  - UMA: uniform memory access
  - NUMA: nonuniform memory access
- based on the amount of time a processor takes to access local and global memory
(diagrams (a), (b), (c): alternative shared-memory organizations; processors P and memory modules M connected through an interconnection network or bus, either with the memory modules separate from the processors or with a memory module attached to each processor)
24. Distributed Memory Machines
- Each processor has its own local memory (not directly accessible by others)
- Processors communicate by passing messages to each other (see the sketch below)
- Example machines: IBM SP2, Intel Paragon, COWs (clusters of workstations)
- Example message passing libraries: PVM, MPI
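A minimal message-passing sketch. It uses mpi4py, a Python binding for MPI (an assumption made for illustration; the slide names the MPI library itself):

```python
# Two distributed-memory processes communicate only by messages.
# Run with e.g.:  mpiexec -n 2 python send_recv.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"payload": [1, 2, 3]}, dest=1, tag=0)   # rank 0 sends
elif rank == 1:
    msg = comm.recv(source=0, tag=0)                   # rank 1 receives
    print("rank 1 got", msg)
```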
25. Beowulf Clusters
- Use COTS: ordinary PCs and networking equipment
- Have the best price/performance ratio
(picture: PC cluster)
26. Multi-Core Computing
- A multi-core microprocessor is one which combines two or more independent processors into a single package, often a single integrated circuit.
- A dual-core device contains just two independent microprocessors.
27. Comparison of Different Architectures
(diagram: single-core architecture)
28. Comparison of Different Architectures
(diagram: multiprocessor)
29. Comparison of Different Architectures
(diagram: Hyper-Threading Technology; two CPU states share one execution unit and one cache)
30. Comparison of Different Architectures
(diagram: multi-core architecture)
31. Comparison of Different Architectures
(diagram: multi-core architecture with shared cache; two CPU states and two execution units over one cache)
32. Comparison of Different Architectures
(diagram: multi-core with Hyper-Threading Technology)
34. Top 10 Most Powerful Computers in the World (as of 6/2006)
35. Most Powerful Computers in the World (as of 11/2007)
36. Top 500 Lists
- http://www.top500.org/list/2007/11
- http://www.top500.org/list/2007/06
- ...
37. Application Areas in the Top 500 List
38. Top 500 Statistics
- http://www.top500.org/stats
39. Grid Computing
- Provides access to computing power and various resources just like accessing electrical power from the electrical grid
- Allows coupling of geographically distributed resources
- Provides inexpensive access to resources irrespective of their physical location or access point
- The Internet or dedicated networks can be used to interconnect distributed computational resources and present them as a single unified resource
- Resources: supercomputers, clusters, storage systems, data resources, special devices
40. Grid Computing
- The GRID is, in effect, a set of software tools which, when combined with hardware, lets users tap processing power off the Internet as easily as electrical power can be drawn from the electricity grid.
- Examples of Grids:
  - TeraGrid (USA): http://www.teragrid.org
  - EGEE Grid (Europe): http://www.eu-egee.org/
  - TR-Grid (Turkey): http://www.grid.org.tr/
  - Sun Grid Compute Utility (commercial, pay-per-use): http://www.network.com/
41. Models of Parallel Computers
1. Message Passing Model
  - Distributed memory
  - Multicomputer
2. Shared Memory Model
  - Multiprocessor
  - Multi-core
3. Theoretical Model
  - PRAM
- New architectures: combinations of 1 and 2.
42. Theoretical PRAM Model
- Used by parallel algorithm designers
- Algorithm designers do not want to worry about low-level details; they want to concentrate on algorithmic details
- Extends the classic RAM model
- Consists of:
  - a control unit (common clock); operation is synchronous
  - global shared memory
  - an unbounded set of processors, each with its own private memory
43. Theoretical PRAM Model
- Some characteristics:
  - Each processor has a unique identifier: mypid = 0, 1, 2, ...
  - All processors operate synchronously under the control of a common clock
  - In each unit of time, each processor is allowed to execute an instruction or stay idle
44. Various PRAM Models
(diagram: PRAM variants ordered from weakest to strongest according to how write conflicts to the same memory location are handled: EREW, CREW, CRCW)
45. Algorithmic Performance Parameters
- $N$: input size
- $T^*(N)$: time complexity of the best sequential algorithm
- $P$: number of processors
- $T_P(N)$: time complexity of the parallel algorithm when run on P processors
- $T_1(N)$: time complexity of the parallel algorithm when run on one processor
46. Algorithmic Performance Parameters
47. Algorithmic Performance Parameters
- Work = Processors × Time
- Informally: how much time it would take to simulate the parallel algorithm on a serial machine
- Formally: $W(N) = P \cdot T_P(N)$
48. Algorithmic Performance Parameters
- Work efficient:
  - Informally: a work-efficient parallel algorithm does no more work than the best serial algorithm
  - Formally: a work-efficient algorithm satisfies $P \cdot T_P(N) = \Theta(T^*(N))$
49. Algorithmic Performance Parameters
- Scalability:
  - Informally, scalability implies that if the size of the problem is increased, the number of processors effectively used can be increased (i.e. there is no limit on parallelism)
  - Formally, scalability means that the number of processors the algorithm can use effectively grows without bound as the input size N grows
50. Algorithmic Performance Parameters
- Some remarks:
  - The cost of a scalable algorithm grows slowly as the input size and the number of processors are increased
  - The level of control parallelism is usually a constant, independent of problem size
  - The level of data parallelism is an increasing function of problem size
  - Data parallel algorithms are more scalable than control parallel algorithms
51. Goals in Designing Parallel Algorithms
- Scalability:
  - algorithm cost grows slowly, preferably in a polylogarithmic manner
- Work efficiency:
  - we do not want to waste CPU cycles
  - may be an important point when we are worried about power consumption or money paid for CPU usage
52. Summing N Numbers in Parallel
(diagram: tree summation; in step 1 adjacent pairs are added, then pairs of partial sums such as x1..x4 are combined, down to the final result)
- An array of N numbers can be summed in log(N) steps using N/2 processors (see the sketch below)
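A sequential Python simulation of this algorithm (a sketch, assuming N is a power of two): the additions inside each step are mutually independent, so N/2 processors could execute every step in unit time, giving log(N) steps overall.

```python
# Tree reduction: in each step, position i absorbs the partial sum that
# lies `stride` positions away; strides double, so there are log2(n) steps.
def parallel_sum(x):
    x = list(x)
    n = len(x)                              # assume n is a power of two
    stride = 1
    while stride < n:                       # log2(n) steps
        for i in range(0, n, 2 * stride):   # these adds are independent, so
            x[i] += x[i + stride]           # N/2 processors could run each
        stride *= 2                         # step in one time unit
    return x[0]

print(parallel_sum(range(1, 9)))            # 36
```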
53. Prefix Summing N Numbers in Parallel
(diagram, N = 8: after step 1, position i holds x_i + x_(i+1); after step 2, sums of spans of four such as x1..x4 and x5..x8; after step 3, position i holds the sum x_i..x8)
- Computing the partial sums of an array of N numbers can be done in log(N) steps using N processors (see the sketch below)
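A sequential Python simulation of the log(N)-step scan shown in the diagram (a sketch; the diagram accumulates toward x8, so position i ends up holding x_i + ... + x_8): in step k each element adds in the value 2^k positions ahead, and with one processor per element every step takes unit time.

```python
# Parallel prefix (scan) simulation: strides double each step, so
# ceil(log2(n)) steps suffice; all updates in a step are independent.
def parallel_prefix(x):
    x = list(x)
    n = len(x)
    stride = 1
    while stride < n:                       # log2(n) steps
        nxt = x[:]                          # all updates happen "at once"
        for i in range(n - stride):
            nxt[i] += x[i + stride]         # accumulate toward the end,
        x = nxt                             # matching the slide's diagram
        stride *= 2
    return x

print(parallel_prefix([1] * 8))             # [8, 7, 6, 5, 4, 3, 2, 1]
```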
54. Prefix Paradigm for Parallel Algorithm Design
- Prefix computation forms a paradigm for parallel algorithm development, just like other well-known paradigms such as divide and conquer, dynamic programming, etc.
- Prefix paradigm:
  - if possible, transform your problem into a prefix-type computation
  - apply the efficient logarithmic prefix computation
- Examples of problems solved by the prefix paradigm:
  - solving linear recurrence equations
  - tridiagonal solvers
  - problems on trees
  - adaptive triangular mesh refinement
55. Solving Linear Recurrence Equations
- Given the linear recurrence equation $z_i = a_i z_{i-1} + b_i z_{i-2}$
- we can rewrite it as $\begin{pmatrix} z_i \\ z_{i-1} \end{pmatrix} = \begin{pmatrix} a_i & b_i \\ 1 & 0 \end{pmatrix} \begin{pmatrix} z_{i-1} \\ z_{i-2} \end{pmatrix}$
- if we expand it, we get the solution in terms of partial products of the coefficient matrices and the initial values z1 and z0
- use prefix to compute the partial products (see the sketch below)
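A Python sketch of the idea, under the recurrence form reconstructed above: because 2x2 matrix multiplication is associative, the partial products can be computed with the logarithmic prefix computation; here they are built serially for clarity.

```python
# Solve z_i = a_i*z_{i-1} + b_i*z_{i-2} via prefix products of 2x2 matrices.
def mat_mul(A, B):
    return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
            [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

def solve(a, b, z0, z1):
    # Prefix products P_i = M_i * ... * M_2, built serially here; in
    # parallel they take log(n) steps, exactly like prefix sums.
    zs = [z0, z1]
    P = [[1, 0], [0, 1]]
    for ai, bi in zip(a, b):
        P = mat_mul([[ai, bi], [1, 0]], P)
        zs.append(P[0][0] * z1 + P[0][1] * z0)
    return zs

# Fibonacci: z_i = z_{i-1} + z_{i-2}
print(solve([1] * 8, [1] * 8, 0, 1))   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```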
56. Pointer Jumping Technique
(diagram, steps 1-3: in each step every list node adds in its successor's value and re-points to its successor's successor)
- A linked list of N numbers can be prefix-summed in log(N) steps using N processors (see the sketch below)
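A sequential Python simulation of pointer jumping (a sketch with an assumed list representation: vals[i] is node i's value, nxt[i] its successor or None): in every round each node adds in its successor's value and then jumps its pointer to its successor's successor, so the remaining distances double and log(N) rounds suffice.

```python
# Pointer jumping on a linked list: values accumulate while pointers
# leapfrog, so after round k each node has summed 2**k list elements.
def pointer_jumping_prefix_sum(vals, nxt):
    vals, nxt = list(vals), list(nxt)
    done = False
    while not done:
        done = True
        new_vals, new_nxt = vals[:], nxt[:]
        for i in range(len(vals)):          # conceptually all in parallel
            if nxt[i] is not None:
                new_vals[i] = vals[i] + vals[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]    # jump over the successor
                done = False
        vals, nxt = new_vals, new_nxt
    return vals

# list 0 -> 1 -> ... -> 7, each node holding the value 1
print(pointer_jumping_prefix_sum([1] * 8, [1, 2, 3, 4, 5, 6, 7, None]))
# [8, 7, 6, 5, 4, 3, 2, 1]
```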
57. Euler Tour Technique
- Tree problems:
  - preorder numbering
  - postorder numbering
  - number of descendants
  - level of each node
- To solve such problems, first transform the tree by linearizing it into a linked list, and then apply the prefix computation
58. Computing Level of Each Node by Euler Tour Technique
- Weight assignment: each downward edge <parent(v), v> gets weight 1, each upward edge <v, parent(v)> gets weight -1
- level(v) = pw(<v, parent(v)>), level(root) = 0
(diagram: a sample tree on nodes a-i, its Euler tour with the initial weights w(<u,v>) of +1/-1 along the tour, and the resulting prefix sums pw(<u,v>))
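A Python sketch of the level computation (illustrative; the tree, its representation, and the point at which the running sum is read off are assumptions for the demo): the tree is linearized into its Euler tour, downward edges are weighted +1 and upward edges -1, and a prefix sum along the tour yields the levels.

```python
# Euler tour technique for node levels: linearize the tree into its tour,
# then prefix-sum the +1/-1 edge weights (both steps shown serially here;
# in parallel the prefix pass is the logarithmic computation above).
def euler_tour_levels(children, root):
    tour = []                                 # edges in tour order
    def dfs(u):
        for v in children.get(u, []):
            tour.append((u, v, +1))           # go down: weight +1
            dfs(v)
            tour.append((v, u, -1))           # come back up: weight -1
    dfs(root)

    level = {root: 0}
    s = 0
    for (u, v, w) in tour:                    # prefix sum over the weights
        s += w
        if w == +1:
            level[v] = s                      # first arrival at v gives its level
    return level

tree = {"a": ["b", "c"], "b": ["d", "e"]}
print(euler_tour_levels(tree, "a"))
# {'a': 0, 'b': 1, 'd': 2, 'e': 2, 'c': 1}
```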
59. Computing Number of Descendants by Euler Tour Technique
- Weight assignment: each edge of the tour gets weight 0 or 1 (see diagram)
- #descendants(v) = pw(<parent(v), v>) - pw(<v, parent(v)>), #descendants(root) = n
(diagram: the same sample tree on nodes a-i with its 0/1 initial weights w(<u,v>) along the Euler tour and the resulting prefix sums pw(<u,v>))
60. Preorder Numbering by Euler Tour Technique
- Weight assignment: each edge of the tour gets weight 0 or 1 (see diagram)
- preorder(v) = 1 + pw(<v, parent(v)>), preorder(root) = 1
(diagram: the sample tree on nodes a-i with its 0/1 initial weights w(<u,v>) along the Euler tour, the resulting prefix sums pw(<u,v>), and the preorder numbers 1-9 at the nodes)
61. Postorder Numbering by Euler Tour Technique
- Weight assignment: each edge of the tour gets weight 0 or 1 (see diagram)
- postorder(v) = pw(<parent(v), v>), postorder(root) = n
(diagram: the sample tree on nodes a-i with its 0/1 initial weights w(<u,v>) along the Euler tour, the resulting prefix sums pw(<u,v>), and the postorder numbers 1-9 at the nodes)