Title: Hardware Parallel/Distributed Processing, High Performance Computing, Top 500 List, Grid Computing
1. Hardware Parallel/Distributed Processing, High Performance Computing, Top 500 List, Grid Computing
CMPE 49B Special Topics in CMPE: Multi-core Programming
(picture: ASCI White, the most powerful computer in the world in 2001)
2. Von Neumann Architecture
(diagram: CPU, RAM, and I/O devices connected by a shared bus)
3. Memory Hierarchy
(diagram, fastest to slowest: registers, cache, real memory, disk, CD)
4. History of Computer Architecture
- 4 generations (identified by logic technology):
  - Vacuum tubes
  - Transistors
  - Integrated circuits
  - VLSI (very large scale integration)
5. Performance Trends
6. Performance Trends
- Traditional mainframe/supercomputer performance: 25% increase per year
- But microprocessor performance: 50% increase per year since the mid-80s
7. Moore's Law
- Transistor density doubles every 18 months
- Moore is a co-founder of Intel
- 60% increase per year
- Exponential growth
- PC costs decline
- PCs are the building bricks of all future systems
8. VLSI Generation
9. Bit-Level Parallelism (up to mid-80s)
- 4-bit microprocessors replaced by 8-bit, 16-bit, 32-bit, etc.
- Doubling the width of the datapath reduces the number of cycles required to perform a full 32-bit operation
- By the mid-80s, the benefits of this kind of parallelism had been reaped (full 32-bit word operations combined with the use of caches)
10. Instruction-Level Parallelism (mid-80s to mid-90s)
- Basic steps in instruction processing (instruction decode, integer arithmetic, address calculation) could be performed in a single cycle
- Pipelined instruction processing
- Reduced instruction set computing (RISC)
- Superscalar execution
- Branch prediction
11. Thread/Process-Level Parallelism (mid-90s to present)
- On average, control transfers occur roughly once in every five instructions, so exploiting instruction-level parallelism at a larger scale is not possible
- Use multiple independent threads or processes instead
- Threads and processes run concurrently
12. Evolution of the Infrastructure
- Electronic Accounting Machine Era: 1930-1950
- General Purpose Mainframe and Minicomputer Era: 1959-present
- Personal Computer Era: 1981-present
- Client/Server Era: 1983-present
- Enterprise Internet Computing Era: 1992-present
13. Sequential vs Parallel Processing
- Sequential:
  - physical limits reached
  - easy to program
  - expensive supercomputers
- Parallel:
  - raw power unlimited
  - more memory, multiple caches
  - made up of COTS components, so cheap
  - difficult to program
14. What is Multi-Core Programming?
- Answer: it is basically parallel programming on a single computer box (e.g. a desktop, a notebook, a blade); a minimal sketch follows
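As a concrete taste of this (an illustration, not code from the course), the following Python sketch runs the same function on the cores of a single box using the standard multiprocessing module:

```python
# Minimal sketch: parallel programming on a single box, with one
# worker process per core via Python's multiprocessing module.
import os
from multiprocessing import Process

def work(task_id):
    # Each task runs in its own OS process, potentially on its own core.
    print(f"task {task_id} running in process {os.getpid()}")

if __name__ == "__main__":
    n = os.cpu_count() or 2
    procs = [Process(target=work, args=(i,)) for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```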
15. Amdahl's Law
- The serial percentage of a program is fixed, so the speed-up obtained by employing parallel processing is bounded.
- Led to pessimism in the parallel processing community and prevented development of parallel machines for a long time.
$$\text{Speedup} = \frac{1}{s + \frac{1-s}{P}} \le \frac{1}{s}$$
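As a worked instance with an assumed serial fraction of $s = 0.05$: no matter how many processors are used, $\text{Speedup} \le 1/s = 20$; even with $P = 1024$, the formula gives $1/(0.05 + 0.95/1024) \approx 19.6$.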
16. Gustafson's Law
- The serial percentage depends on the number of processors/input.
- Broke/disproved Amdahl's law.
- Demonstrated achieving more than 1000-fold speedup using 1024 processors.
- Justified parallel processing.
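The scaled-speedup formula usually quoted for Gustafson's law (stated here as standard background, since the slide's own formula was not transcribed) is $\text{Speedup}(P) = s + (1-s)P$, where $s$ is the serial fraction measured on the parallel machine. With $P = 1024$ and an assumed $s = 0.01$, this gives $0.01 + 0.99 \times 1024 \approx 1013.8$, in line with the more-than-1000-fold speedup cited above.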
17. Grand Challenge Applications
- Important scientific and engineering problems identified by the U.S. High Performance Computing and Communications Program ('92)
18. Flynn's Taxonomy
- Classifies computer architectures according to:
  - the number of instruction streams the machine can process at a time
  - the number of data elements on which it can operate simultaneously
                              Data Streams
                              Single    Multiple
  Instruction      Single     SISD      SIMD
  Streams          Multiple   MISD      MIMD
19. SPMD Model (Single Program Multiple Data)
- Each processor executes the same program asynchronously
- Synchronization takes place only when processors need to exchange data
- SPMD is an extension of SIMD (relaxes synchronized instruction execution)
- SPMD is a restriction of MIMD (uses only one source/object; see the sketch below)
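A minimal SPMD sketch in Python (names and decomposition are assumptions for illustration): every process runs the same program, behavior differs only through the process id, and the queue is the single point where data is exchanged.

```python
# SPMD sketch: NPROCS processes run the identical function `program`;
# the data each one touches is determined solely by its id (mypid).
from multiprocessing import Process, Queue

NPROCS = 4

def program(mypid, q):
    # Same source for every process; data decomposition depends on mypid.
    local = sum(range(mypid * 10, (mypid + 1) * 10))
    q.put((mypid, local))          # exchanging data is the only sync point

if __name__ == "__main__":
    q = Queue()
    ps = [Process(target=program, args=(r, q)) for r in range(NPROCS)]
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    print(sorted(q.get() for _ in range(NPROCS)))
```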
20. Parallel Processing Terminology
- Embarrassingly parallel:
  - applications which are trivial to parallelize
  - large amounts of independent computation
  - little communication
- Data parallelism:
  - model of parallel computing in which a single operation can be applied to all data elements simultaneously
  - amenable to SIMD or SPMD style of computation (see the sketch below)
- Control parallelism:
  - many different operations may be executed concurrently
  - requires MIMD/SPMD style of computation
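The data-parallel style can be sketched in a few lines of Python (the scale function and the data are placeholders chosen for illustration): one operation is applied to all elements, with no communication between them, which is also what makes the problem embarrassingly parallel.

```python
# Data parallelism in miniature: one operation (scaling) is applied to
# every element of the data set by a pool of worker processes.
from multiprocessing import Pool

def scale(x):
    return 2.0 * x

if __name__ == "__main__":
    data = list(range(1_000_000))
    with Pool() as pool:                       # one worker per core by default
        result = pool.map(scale, data, chunksize=10_000)
    print(result[:5])                          # [0.0, 2.0, 4.0, 6.0, 8.0]
```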
21. Parallel Processing Terminology
- Scalability:
  - if the size of the problem is increased, the number of processors that can be effectively used can be increased (i.e. there is no limit on parallelism)
  - the cost of a scalable algorithm grows slowly as the input size and the number of processors are increased
  - data parallel algorithms are more scalable than control parallel algorithms
- Granularity:
  - fine-grain machines employ a massive number of weak processors, each with a small memory
  - coarse-grain machines have a smaller number of powerful processors, each with large amounts of memory
22. Shared Memory Machines
- Memory is globally shared, therefore processes (threads) see a single address space
- Coordination of accesses to locations is done by use of locks provided by thread libraries (see the sketch below)
- Example machines: Sequent, Alliant, SUN Ultra, dual/quad-board Pentium PC
- Example thread libraries: POSIX threads, Linux threads
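A small illustration of lock-based coordination, using Python's threading module as a stand-in for the POSIX/Linux thread libraries named above: all threads share one address space, and the lock makes the read-modify-write of the shared counter atomic.

```python
# Four threads increment one shared counter; the lock serializes the
# non-atomic `counter += 1` so no updates are lost.
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:                 # without the lock, updates could be lost
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                     # 400000
```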
23. Shared Memory Machines
- Can be classified as:
  - UMA: uniform memory access
  - NUMA: nonuniform memory access
- based on the amount of time a processor takes to access local and global memory
(diagrams (a), (b), (c): alternative shared-memory organizations; processors P and memory modules M connected through an interconnection network or bus, either with the memory modules separate from the processors or with a memory module attached to each processor)
24. Distributed Memory Machines
- Each processor has its own local memory (not directly accessible by others)
- Processors communicate by passing messages to each other (see the sketch below)
- Example machines: IBM SP2, Intel Paragon, COWs (clusters of workstations)
- Example message passing libraries: PVM, MPI
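A minimal message-passing sketch. It uses mpi4py, a Python binding for MPI (an assumption made for illustration; the slide names the MPI library itself):

```python
# Two distributed-memory processes communicate only by messages.
# Run with e.g.:  mpiexec -n 2 python send_recv.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"payload": [1, 2, 3]}, dest=1, tag=0)   # rank 0 sends
elif rank == 1:
    msg = comm.recv(source=0, tag=0)                   # rank 1 receives
    print("rank 1 got", msg)
```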
25. Beowulf Clusters
- Use COTS: ordinary PCs and networking equipment
- Have the best price/performance ratio
(picture: PC cluster)
26. Multi-Core Computing
- A multi-core microprocessor is one which combines two or more independent processors into a single package, often a single integrated circuit.
- A dual-core device contains just two independent microprocessors.
27. Comparison of Different Architectures
(diagram: single-core architecture)
28. Comparison of Different Architectures
(diagram: multiprocessor)
29. Comparison of Different Architectures
(diagram: Hyper-Threading Technology; two CPU states share one execution unit and one cache)
30. Comparison of Different Architectures
(diagram: multi-core architecture)
31. Comparison of Different Architectures
(diagram: multi-core architecture with shared cache; two CPU states and two execution units over one cache)
32. Comparison of Different Architectures
(diagram: multi-core with Hyper-Threading Technology)
34. Top 10 Most Powerful Computers in the World (as of 6/2006)
35. Most Powerful Computers in the World (as of 11/2007)
36. Top 500 Lists
- http://www.top500.org/list/2007/11
- http://www.top500.org/list/2007/06
- ...
37. Application Areas in the Top 500 List
38. Top 500 Statistics
- http://www.top500.org/stats
39. Grid Computing
- Provides access to computing power and various resources just like accessing electrical power from the electrical grid
- Allows coupling of geographically distributed resources
- Provides inexpensive access to resources irrespective of their physical location or access point
- The Internet or dedicated networks can be used to interconnect distributed computational resources and present them as a single unified resource
- Resources: supercomputers, clusters, storage systems, data resources, special devices
40. Grid Computing
- The GRID is, in effect, a set of software tools which, when combined with hardware, lets users tap processing power off the Internet as easily as electrical power can be drawn from the electricity grid.
- Examples of Grids:
  - TeraGrid (USA): http://www.teragrid.org
  - EGEE Grid (Europe): http://www.eu-egee.org/
  - TR-Grid (Turkey): http://www.grid.org.tr/
  - Sun Grid Compute Utility (commercial, pay-per-use): http://www.network.com/
41. Models of Parallel Computers
1. Message Passing Model
  - Distributed memory
  - Multicomputer
2. Shared Memory Model
  - Multiprocessor
  - Multi-core
3. Theoretical Model
  - PRAM
- New architectures: combinations of 1 and 2.
42. Theoretical PRAM Model
- Used by parallel algorithm designers
- Algorithm designers do not want to worry about low-level details; they want to concentrate on algorithmic details
- Extends the classic RAM model
- Consists of:
  - a control unit (common clock); operation is synchronous
  - global shared memory
  - an unbounded set of processors, each with its own private memory
43. Theoretical PRAM Model
- Some characteristics:
  - Each processor has a unique identifier: mypid = 0, 1, 2, ...
  - All processors operate synchronously under the control of a common clock
  - In each unit of time, each processor is allowed to execute an instruction or stay idle
44. Various PRAM Models
(diagram: PRAM variants ordered from weakest to strongest according to how write conflicts to the same memory location are handled: EREW, CREW, CRCW)
45. Algorithmic Performance Parameters
- $N$: input size
- $T^*(N)$: time complexity of the best sequential algorithm
- $P$: number of processors
- $T_P(N)$: time complexity of the parallel algorithm when run on P processors
- $T_1(N)$: time complexity of the parallel algorithm when run on one processor
46. Algorithmic Performance Parameters
47. Algorithmic Performance Parameters
- Work = Processors × Time
- Informally: how much time it would take to simulate the parallel algorithm on a serial machine
- Formally: $W(N) = P \cdot T_P(N)$
48. Algorithmic Performance Parameters
- Work efficient:
  - Informally: a work-efficient parallel algorithm does no more work than the best serial algorithm
  - Formally: a work-efficient algorithm satisfies $P \cdot T_P(N) = \Theta(T^*(N))$
49. Algorithmic Performance Parameters
- Scalability:
  - Informally, scalability implies that if the size of the problem is increased, the number of processors effectively used can be increased (i.e. there is no limit on parallelism)
  - Formally, scalability means that the number of processors the algorithm can use effectively grows without bound as the input size N grows
50. Algorithmic Performance Parameters
- Some remarks:
  - The cost of a scalable algorithm grows slowly as the input size and the number of processors are increased
  - The level of control parallelism is usually a constant, independent of problem size
  - The level of data parallelism is an increasing function of problem size
  - Data parallel algorithms are more scalable than control parallel algorithms
51. Goals in Designing Parallel Algorithms
- Scalability:
  - algorithm cost grows slowly, preferably in a polylogarithmic manner
- Work efficiency:
  - we do not want to waste CPU cycles
  - may be an important point when we are worried about power consumption or money paid for CPU usage
52. Summing N Numbers in Parallel
(diagram: tree summation; in step 1 adjacent pairs are added, then pairs of partial sums such as x1..x4 are combined, down to the final result)
- An array of N numbers can be summed in log(N) steps using N/2 processors (see the sketch below)
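A sequential Python simulation of this algorithm (a sketch, assuming N is a power of two): the additions inside each step are mutually independent, so N/2 processors could execute every step in unit time, giving log(N) steps overall.

```python
# Tree reduction: in each step, position i absorbs the partial sum that
# lies `stride` positions away; strides double, so there are log2(n) steps.
def parallel_sum(x):
    x = list(x)
    n = len(x)                              # assume n is a power of two
    stride = 1
    while stride < n:                       # log2(n) steps
        for i in range(0, n, 2 * stride):   # these adds are independent, so
            x[i] += x[i + stride]           # N/2 processors could run each
        stride *= 2                         # step in one time unit
    return x[0]

print(parallel_sum(range(1, 9)))            # 36
```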
53. Prefix Summing N Numbers in Parallel
(diagram, N = 8: after step 1, position i holds x_i + x_(i+1); after step 2, sums of spans of four such as x1..x4 and x5..x8; after step 3, position i holds the sum x_i..x8)
- Computing the partial sums of an array of N numbers can be done in log(N) steps using N processors (see the sketch below)
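A sequential Python simulation of the log(N)-step scan shown in the diagram (a sketch; the diagram accumulates toward x8, so position i ends up holding x_i + ... + x_8): in step k each element adds in the value 2^k positions ahead, and with one processor per element every step takes unit time.

```python
# Parallel prefix (scan) simulation: strides double each step, so
# ceil(log2(n)) steps suffice; all updates in a step are independent.
def parallel_prefix(x):
    x = list(x)
    n = len(x)
    stride = 1
    while stride < n:                       # log2(n) steps
        nxt = x[:]                          # all updates happen "at once"
        for i in range(n - stride):
            nxt[i] += x[i + stride]         # accumulate toward the end,
        x = nxt                             # matching the slide's diagram
        stride *= 2
    return x

print(parallel_prefix([1] * 8))             # [8, 7, 6, 5, 4, 3, 2, 1]
```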
54. Prefix Paradigm for Parallel Algorithm Design
- Prefix computation forms a paradigm for parallel algorithm development, just like other well-known paradigms such as divide and conquer, dynamic programming, etc.
- Prefix paradigm:
  - if possible, transform your problem into a prefix-type computation
  - apply the efficient logarithmic prefix computation
- Examples of problems solved by the prefix paradigm:
  - solving linear recurrence equations
  - tridiagonal solvers
  - problems on trees
  - adaptive triangular mesh refinement
55. Solving Linear Recurrence Equations
- Given the linear recurrence equation $z_i = a_i z_{i-1} + b_i z_{i-2}$
- we can rewrite it as $\begin{pmatrix} z_i \\ z_{i-1} \end{pmatrix} = \begin{pmatrix} a_i & b_i \\ 1 & 0 \end{pmatrix} \begin{pmatrix} z_{i-1} \\ z_{i-2} \end{pmatrix}$
- if we expand it, we get the solution in terms of partial products of the coefficient matrices and the initial values z1 and z0
- use prefix to compute the partial products (see the sketch below)
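A Python sketch of the idea, under the recurrence form reconstructed above: because 2x2 matrix multiplication is associative, the partial products can be computed with the logarithmic prefix computation; here they are built serially for clarity.

```python
# Solve z_i = a_i*z_{i-1} + b_i*z_{i-2} via prefix products of 2x2 matrices.
def mat_mul(A, B):
    return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
            [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

def solve(a, b, z0, z1):
    # Prefix products P_i = M_i * ... * M_2, built serially here; in
    # parallel they take log(n) steps, exactly like prefix sums.
    zs = [z0, z1]
    P = [[1, 0], [0, 1]]
    for ai, bi in zip(a, b):
        P = mat_mul([[ai, bi], [1, 0]], P)
        zs.append(P[0][0] * z1 + P[0][1] * z0)
    return zs

# Fibonacci: z_i = z_{i-1} + z_{i-2}
print(solve([1] * 8, [1] * 8, 0, 1))   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```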
56. Pointer Jumping Technique
(diagram, steps 1-3: in each step every list node adds in its successor's value and re-points to its successor's successor)
- A linked list of N numbers can be prefix-summed in log(N) steps using N processors (see the sketch below)
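A sequential Python simulation of pointer jumping (a sketch with an assumed list representation: vals[i] is node i's value, nxt[i] its successor or None): in every round each node adds in its successor's value and then jumps its pointer to its successor's successor, so the remaining distances double and log(N) rounds suffice.

```python
# Pointer jumping on a linked list: values accumulate while pointers
# leapfrog, so after round k each node has summed 2**k list elements.
def pointer_jumping_prefix_sum(vals, nxt):
    vals, nxt = list(vals), list(nxt)
    done = False
    while not done:
        done = True
        new_vals, new_nxt = vals[:], nxt[:]
        for i in range(len(vals)):          # conceptually all in parallel
            if nxt[i] is not None:
                new_vals[i] = vals[i] + vals[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]    # jump over the successor
                done = False
        vals, nxt = new_vals, new_nxt
    return vals

# list 0 -> 1 -> ... -> 7, each node holding the value 1
print(pointer_jumping_prefix_sum([1] * 8, [1, 2, 3, 4, 5, 6, 7, None]))
# [8, 7, 6, 5, 4, 3, 2, 1]
```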
57. Euler Tour Technique
- Tree problems:
  - preorder numbering
  - postorder numbering
  - number of descendants
  - level of each node
- To solve such problems, first transform the tree by linearizing it into a linked list, and then apply the prefix computation
58. Computing Level of Each Node by Euler Tour Technique
- Weight assignment: each downward edge <parent(v), v> gets weight 1, each upward edge <v, parent(v)> gets weight -1
- level(v) = pw(<v, parent(v)>), level(root) = 0
(diagram: a sample tree on nodes a-i, its Euler tour with the initial weights w(<u,v>) of +1/-1 along the tour, and the resulting prefix sums pw(<u,v>))
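A Python sketch of the level computation (illustrative; the tree, its representation, and the point at which the running sum is read off are assumptions for the demo): the tree is linearized into its Euler tour, downward edges are weighted +1 and upward edges -1, and a prefix sum along the tour yields the levels.

```python
# Euler tour technique for node levels: linearize the tree into its tour,
# then prefix-sum the +1/-1 edge weights (both steps shown serially here;
# in parallel the prefix pass is the logarithmic computation above).
def euler_tour_levels(children, root):
    tour = []                                 # edges in tour order
    def dfs(u):
        for v in children.get(u, []):
            tour.append((u, v, +1))           # go down: weight +1
            dfs(v)
            tour.append((v, u, -1))           # come back up: weight -1
    dfs(root)

    level = {root: 0}
    s = 0
    for (u, v, w) in tour:                    # prefix sum over the weights
        s += w
        if w == +1:
            level[v] = s                      # first arrival at v gives its level
    return level

tree = {"a": ["b", "c"], "b": ["d", "e"]}
print(euler_tour_levels(tree, "a"))
# {'a': 0, 'b': 1, 'd': 2, 'e': 2, 'c': 1}
```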
59. Computing Number of Descendants by Euler Tour Technique
- Weight assignment: each edge of the tour gets weight 0 or 1 (see diagram)
- #descendants(v) = pw(<parent(v), v>) - pw(<v, parent(v)>), #descendants(root) = n
(diagram: the same sample tree on nodes a-i with its 0/1 initial weights w(<u,v>) along the Euler tour and the resulting prefix sums pw(<u,v>))
60. Preorder Numbering by Euler Tour Technique
- Weight assignment: each edge of the tour gets weight 0 or 1 (see diagram)
- preorder(v) = 1 + pw(<v, parent(v)>), preorder(root) = 1
(diagram: the sample tree on nodes a-i with its 0/1 initial weights w(<u,v>) along the Euler tour, the resulting prefix sums pw(<u,v>), and the preorder numbers 1-9 at the nodes)
61. Postorder Numbering by Euler Tour Technique
- Weight assignment: each edge of the tour gets weight 0 or 1 (see diagram)
- postorder(v) = pw(<parent(v), v>), postorder(root) = n
(diagram: the sample tree on nodes a-i with its 0/1 initial weights w(<u,v>) along the Euler tour, the resulting prefix sums pw(<u,v>), and the postorder numbers 1-9 at the nodes)