
Transcript and Presenter's Notes

Title: Hardware, Parallel/Distributed Processing, High Performance Computing, Top 500 List, Grid Computing


1
Hardware, Parallel/Distributed Processing, High
Performance Computing, Top 500 List, Grid Computing
CMPE 49B Spec. Topics in CMPE: Multi-core
Programming
(picture of ASCI White, the most
powerful computer in the world in 2001)
2
Von Neumann Architecture
(diagram: CPU, RAM, and I/O devices connected by a single BUS)
  • sequential computer

3
Memory Hierarchy
(diagram, fastest to slowest: Registers, Cache, Real Memory, Disk, CD)
4
History of Computer Architecture
  • 4 Generations (identified by logic technology)
  • Tubes
  • Transistors
  • Integrated Circuits
  • VLSI (very large scale integration)

5
PERFORMANCE TRENDS
6
PERFORMANCE TRENDS
  • Traditional mainframe/supercomputer performance:
    25% increase per year
  • But microprocessor performance: 50% increase per
    year since the mid 80s.

7
Moore's Law
  • Transistor density doubles every 18 months.
  • Moore is a co-founder of Intel.
  • 60% increase per year
  • Exponential growth
  • PC costs decline.
  • PCs are the building blocks of all future systems.

8
VLSI Generation
9
Bit Level Parallelism (up to mid 80s)
  • 4-bit microprocessors replaced by 8-bit, 16-bit,
    32-bit, etc.
  • doubling the width of the datapath reduces the
    number of cycles required to perform a full
    32-bit operation
  • by the mid 80s the benefits of this kind of
    parallelism had been reaped (full 32-bit word
    operations combined with the use of caches)

10
Instruction Level Parallelism (mid 80s to mid
90s)
  • Basic steps in instruction processing
    (instruction decode, integer arithmetic, address
    calculation) could each be performed in a single
    cycle
  • Pipelined instruction processing
  • Reduced instruction set (RISC)
  • Superscalar execution
  • Branch prediction

11
Thread/Process Level Parallelism (mid 90s to
present)
  • On average, control transfers occur roughly once
    every five instructions, so exploiting instruction
    level parallelism at a much larger scale is not
    possible
  • Use multiple independent threads or processes
  • Concurrently running threads, processes

12
Evolution of the Infrastructure
  • Electronic Accounting Machine Era: 1930-1950
  • General Purpose Mainframe and Minicomputer Era:
    1959-Present
  • Personal Computer Era: 1981-Present
  • Client/Server Era: 1983-Present
  • Enterprise Internet Computing Era: 1992-Present

13
Sequential vs Parallel Processing
  • Sequential processing
  • physical limits reached
  • easy to program
  • expensive supercomputers
  • Parallel processing
  • raw power unlimited
  • more memory, multiple caches
  • made up of COTS, so cheap
  • difficult to program

14
What is Multi-Core Programming ?
  • Answer: It is basically parallel programming on
    a single computer box (e.g. a desktop, a
    notebook, a blade)

15
Amdahl's Law
  • The serial percentage of a program is fixed, so the
    speed-up obtained by employing parallel
    processing is bounded.
  • Led to pessimism in the parallel processing
    community and prevented development of parallel
    machines for a long time.

  Speedup = 1 / (s + (1 - s) / P)
  • In the limit (as P goes to infinity),
    Speedup = 1/s
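A minimal numeric sketch of the bound (assuming s is the serial fraction and P
the number of processors; the sample values are illustrative, not from the slides):

  # Amdahl's law: speedup is capped by the serial fraction s
  def amdahl_speedup(s, P):
      return 1.0 / (s + (1.0 - s) / P)

  for P in (2, 16, 1024):
      print(P, amdahl_speedup(0.05, P))   # with s = 5%, speedup never exceeds 1/0.05 = 20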
16
Gustafson's Law
  • The serial percentage is dependent on the number of
    processors/input.
  • Broke/disproved Amdahl's law.
  • Demonstrated achieving more than 1000-fold
    speedup using 1024 processors.
  • Justified parallel processing.

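A minimal sketch of the scaled-speedup formula commonly associated with
Gustafson's law (s is the serial fraction of the scaled run; the formula and the
sample value below are illustrative, not taken from the slides):

  # Gustafson's law: scaled speedup grows almost linearly with P
  def gustafson_speedup(s, P):
      return P - s * (P - 1)

  print(gustafson_speedup(0.01, 1024))   # about 1013-fold speedup on 1024 processors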
17
Grand Challenge Applications
  • Important scientific/engineering problems
    identified by the U.S. High Performance Computing and
    Communications Program (1992)

18
Flynn's Taxonomy
  • classifies computer architectures according to
  • Number of instruction streams it can process at a
    time
  • Number of data elements on which it can operate
    simultaneously

                            Data Streams
                            Single     Multiple
  Instruction   Single      SISD       SIMD
  Streams       Multiple    MISD       MIMD
19
SPMD Model (Single Program Multiple Data)
  • Each processor executes the same program
    asynchronously
  • Synchronization takes place only when processors
    need to exchange data
  • SPMD is an extension of SIMD (relaxes synchronized
    instruction execution)
  • SPMD is a restriction of MIMD (uses only one
    source/object)

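A minimal SPMD-style sketch (using Python's standard multiprocessing module;
the file and function names are illustrative): every process runs the same
program on its own slice of the data, and synchronization happens only when the
partial results are collected.

  # spmd_sum.py
  from multiprocessing import Pool

  def worker(chunk):                 # the same program is run by every process
      return sum(chunk)              # ... each on its own piece of the data

  if __name__ == "__main__":
      data = list(range(1000))
      chunks = [data[i::4] for i in range(4)]     # one slice per process
      with Pool(4) as pool:
          partials = pool.map(worker, chunks)     # processes synchronize only here
      print(sum(partials))                        # 499500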
20
Parallel Processing Terminology
  • Embarrassingly Parallel
  • applications which are trivial to parallelize
  • large amounts of independent computation
  • little communication
  • Data Parallelism
  • model of parallel computing in which a single
    operation can be applied to all data elements
    simultaneously
  • amenable to SIMD or SPMD style of computation
  • Control Parallelism
  • many different operations may be executed
    concurrently
  • requires MIMD/SPMD style of computation

21
Parallel Processing Terminology
  • Scalability
  • If the size of the problem is increased, the number of
    processors that can be effectively used can be
    increased (i.e. there is no limit on
    parallelism).
  • The cost of a scalable algorithm grows slowly as the
    input size and the number of processors are increased.
  • Data parallel algorithms are more scalable than
    control parallel algorithms
  • Granularity
  • fine grain machines employ a massive number of
    weak processors, each with a small memory
  • coarse grain machines employ a smaller number of
    powerful processors, each with a large amount of memory

22
Shared Memory Machines
  • Memory is globally shared, therefore processes
    (threads) see a single address space
  • Coordination of accesses to locations is done by
    use of locks provided by thread libraries
  • Example machines: Sequent, Alliant, SUN Ultra,
    Dual/Quad Board Pentium PC
  • Example thread libraries: POSIX threads, Linux
    threads

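A minimal shared-memory sketch (using Python's standard threading module in
place of the POSIX/Linux thread libraries named above; names are illustrative):
the threads see one address space, and a lock coordinates access to the shared
location.

  # shared_counter.py
  import threading

  counter = 0                         # lives in the single shared address space
  lock = threading.Lock()

  def work():
      global counter
      for _ in range(100000):
          with lock:                  # lock coordinates access to the shared location
              counter += 1

  threads = [threading.Thread(target=work) for _ in range(4)]
  for t in threads: t.start()
  for t in threads: t.join()
  print(counter)                      # 400000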
23
Shared Memory Machines
  • can be classified as
  • UMA: uniform memory access
  • NUMA: nonuniform memory access
  • based on the amount of time a processor takes to
    access local and global memory.

(diagrams (a)-(c): processor-memory (P M) pairs connected by an interconnection
network, and processors (P) accessing separate memory modules (M) over an
interconnection network or bus)
24
Distributed Memory Machines
  • Each processor has its own local memory (not
    directly accessible by the others)
  • Processors communicate by passing messages to
    each other
  • Example machines: IBM SP2, Intel Paragon, COWs
    (clusters of workstations)
  • Example message passing libraries: PVM, MPI

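A minimal message-passing sketch (using Python's standard multiprocessing pipes
as a stand-in for MPI or PVM; names are illustrative): each process owns its
memory, and data moves between processes only as explicit messages.

  # msg_passing.py
  from multiprocessing import Process, Pipe

  def worker(conn, numbers):
      conn.send(sum(numbers))         # the local result leaves the process only as a message
      conn.close()

  if __name__ == "__main__":
      parent_end, child_end = Pipe()
      p = Process(target=worker, args=(child_end, [1, 2, 3, 4]))
      p.start()
      print(parent_end.recv())        # 10, received from the worker's private memory
      p.join()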
25
Beowulf Clusters
  • Use COTS, ordinary PCs and networking equipment
  • Has the best price/performance ratio

(picture of a PC cluster)
26
Multi-Core Computing
  • A multi-core microprocessor is one which combines
    two or more independent processors into a single
    package, often a single integrated circuit.
  • A dual-core device contains only two independent
    microprocessors.

27
Comparison of Different Architectures
Single Core Architecture
28
Comparison of Different Architectures
Multiprocessor
29
Comparison of Different Architectures
(diagram: two CPU states sharing a single execution unit and cache)
Hyper-Threading Technology
30
Comparison of Different Architectures
Multi-Core Architecture
31
Comparison of Different Architectures
(diagram: two CPU states, each with its own execution unit, sharing a single cache)
Multi-Core Architecture with Shared Cache
32
Comparison of Different Architectures
Multi-Core with Hyper-Threading Technology
33
(No Transcript)
34
Top 10 Most Powerful Computers in the World (as
of 6/2006)
35
Most Powerful Computers in the World (as of
11/2007)
36
Top 500 Lists
  • http://www.top500.org/list/2007/11
  • http://www.top500.org/list/2007/06
  • ...

37
Application Areas in Top 500 List
38
Top 500 Statistics
  • http://www.top500.org/stats

39
Grid Computing
  • provides access to computing power and various
    resources just like accessing electrical power
    from the electrical grid
  • allows coupling of geographically distributed
    resources
  • provides inexpensive access to resources
    irrespective of their physical location or access
    point
  • the Internet or dedicated networks can be used to
    interconnect distributed computational resources
    and present them as a single unified resource
  • Resources: supercomputers, clusters, storage
    systems, data resources, special devices

40
Grid Computing
  • the GRID is, in effect, a set of software tools
    which, when combined with hardware, would let
    users tap processing power off the Internet as
    easily as electrical power can be drawn from
    the electricity grid.
  • Examples of Grids
  • TeraGrid (USA): http://www.teragrid.org
  • EGEE Grid (Europe): http://www.eu-egee.org/
  • TR-Grid (Turkey): http://www.grid.org.tr/
  • Sun Grid Compute Utility (commercial,
    pay-per-use): http://www.network.com/

41
Models of Parallel Computers
  1. Message Passing Model
  • Distributed memory
  • Multicomputer
  2. Shared Memory Model
  • Multiprocessor
  • Multi-core
  3. Theoretical Model
  • PRAM
  • New architectures: combinations of 1 and 2.

42
Theoretical PRAM Model
  • Used by parallel algorithm designers
  • Algorithm designers do not want to worry about
    low level details: they want to concentrate on
    algorithmic details
  • Extends the classic RAM model
  • Consists of
  • a control unit (common clock), synchronous operation
  • global shared memory
  • an unbounded set of processors, each with its
    own private memory

43
Theoretical PRAM Model
  • Some characteristics
  • Each processor has a unique identifier,
    mypid = 0, 1, 2, ...
  • All processors operate synchronously under the
    control of a common clock
  • In each unit of time, each processor is allowed to
    execute an instruction or stay idle

44
Various PRAM Models
(diagram: PRAM model variants ordered from weakest to strongest according to how
write conflicts to the same memory location are handled)
45
Algorithmic Performance Parameters
  • Notation
  • input size: n
  • time complexity of the best sequential algorithm
  • number of processors: P
  • time complexity of the parallel algorithm when
    run on P processors
  • time complexity of the parallel algorithm when
    run on 1 processor
46
Algorithmic Performance Parameters
  • Speed-Up = (time of the best sequential algorithm) /
    (time of the parallel algorithm on P processors)
  • Efficiency = Speed-Up / P

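A minimal sketch of these two ratios (the timings below are made-up illustrative
numbers, not measurements from the slides):

  # perf_params.py
  def speedup(t_best_serial, t_parallel):
      return t_best_serial / t_parallel

  def efficiency(t_best_serial, t_parallel, p):
      return speedup(t_best_serial, t_parallel) / p

  # e.g. best sequential time 100 s, parallel time 8 s on 16 processors
  print(speedup(100.0, 8.0))          # 12.5
  print(efficiency(100.0, 8.0, 16))   # 0.78125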
47
Algorithmic Performance Parameters
  • Work = Processors x Time
  • Informally: how much time the parallel algorithm
    would take to simulate on a serial machine
  • Formally: Work = P x (time of the parallel
    algorithm on P processors)

48
Algorithmic Performance Parameters
  • Work Efficient
  • Informally: a work efficient parallel algorithm
    does no more work than the best serial algorithm
  • Formally: a work efficient algorithm satisfies
    Work = O(time of the best sequential algorithm)

49
Algorithmic Performance Parameters
  • Scalability
  • Informally, scalability implies that if the size
    of the problem is increased, the number of
    processors effectively used can be increased
    (i.e. there is no limit on parallelism)
  • Formally, scalability means

50
Algorithmic Performance Parameters
  • Some remarks
  • Cost of a scalable algorithm grows slowly as input
    size and the number of processors are increased
  • Level of control parallelism is usually a
    constant independent of problem size
  • Level of data parallelism is an increasing
    function of problem size
  • Data parallel algorithms are more scalable than
    control parallel algorithms

51
Goals in Designing Parallel Algorithms
  • Scalability
  • Algorithm cost grows slowly, preferably in a
    polylogarithmic manner
  • Work Efficient
  • We do not want to waste CPU cycles
  • May be an important point when we are worried
    about power consumption or money paid for CPU
    usage

52
Summing N numbers in Parallel
(diagram: pairwise, tree-like combining of the array elements over log(N) steps,
ending with the overall result)
  • An array of N numbers can be summed in log(N) steps
    using N/2 processors
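A minimal sketch that simulates the combining steps serially (each pass of the
outer loop stands for one parallel step; the inner loop's iterations are the
work the N/2 processors would do simultaneously; names are illustrative):

  # tree_sum.py
  def parallel_sum(x):
      x = list(x)
      stride = 1
      while stride < len(x):                        # log(N) steps
          for i in range(0, len(x), 2 * stride):    # each pair handled by one processor
              if i + stride < len(x):
                  x[i] += x[i + stride]
          stride *= 2
      return x[0]

  print(parallel_sum(range(1, 9)))   # 36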
53
Prefix Summing N numbers in Parallel
(diagram: recursive doubling on x1..x8; after each step position i holds the sum
of a longer run ending at x8, and after step 3 position i holds x_i + ... + x8)
  • Computing partial sums of an array of N numbers
    can be done in log(N) steps using N processors

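A minimal sketch that simulates the log(N) doubling steps serially (each pass of
the outer loop stands for one synchronous step of the N processors); this
version accumulates prefixes from the left, whereas the slide's diagram
accumulates sums ending at x8:

  # prefix_sum.py
  def parallel_prefix(x):
      x = list(x)
      d = 1
      while d < len(x):                  # log(N) steps
          prev = x[:]                    # everyone reads the previous step's values
          for i in range(d, len(x)):     # one processor per element
              x[i] = prev[i] + prev[i - d]
          d *= 2
      return x                           # x[i] = sum of the first i+1 elements

  print(parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8]))   # [1, 3, 6, 10, 15, 21, 28, 36]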
54
Prefix Paradigm for Parallel Algorithm Design
  • Prefix computation forms a paradigm for parallel
    algorithm development, just like other well known
    paradigms such as divide and conquer, dynamic
    programming, etc.
  • Prefix Paradigm
  • If possible, transform your problem to a prefix
    type computation
  • Apply the efficient logarithmic prefix
    computation
  • Examples of problems solved by the prefix paradigm
  • Solving linear recurrence equations
  • Tridiagonal solver
  • Problems on trees
  • Adaptive triangular mesh refinement

55
Solving Linear Recurrence Equations
  • Given the linear recurrence equation
  • we can rewrite it as
  • if we expand it, we get the solution in terms of
    partial products of coefficients and the initial
    values z1 and z0
  • use prefix to compute partial products

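A hedged worked example (the slide's actual equation is not preserved in this
transcript; the sketch assumes a recurrence of the form
z_i = a_i*z_{i-1} + b_i*z_{i-2}, which fits the mention of coefficients and the
initial values z1 and z0): each step is rewritten as a 2x2 matrix, and the
partial products of these matrices, a prefix computation, yield every z_i.

  # recurrence_prefix.py
  def matmul(A, B):
      return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
              for i in range(2)]

  def solve_recurrence(a, b, z0, z1):
      # z_i = a[i]*z_{i-1} + b[i]*z_{i-2}  written as  [z_i, z_{i-1}]^T = M_i [z_{i-1}, z_{i-2}]^T
      Ms = [[[ai, bi], [1, 0]] for ai, bi in zip(a, b)]
      zs, prod = [z0, z1], [[1, 0], [0, 1]]
      for M in Ms:                       # partial products of the M_i: a prefix computation
          prod = matmul(M, prod)
          zs.append(prod[0][0] * z1 + prod[0][1] * z0)
      return zs

  # Fibonacci numbers: z_i = z_{i-1} + z_{i-2}
  print(solve_recurrence([1] * 8, [1] * 8, 0, 1))   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]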
56
Pointer Jumping Technique
(diagram: three pointer-jumping steps on a linked list; at each step every node
adds in its successor's value and then points to its successor's successor)
  • A linked list of N numbers can be prefix-summed
    in log(N) steps using N processors
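A minimal sketch that simulates pointer jumping serially (arrays stand in for
the list: nxt[i] is node i's successor and val[i] its value; each outer pass is
one synchronous step in which every node adds its successor's value and then
jumps its pointer). The sums here run from each node to the tail of the list:

  # pointer_jumping.py
  def list_prefix_sums(val, nxt):
      val, nxt = list(val), list(nxt)
      n = len(val)
      for _ in range((n - 1).bit_length()):      # ceil(log2(N)) steps
          old_val, old_nxt = val[:], nxt[:]      # all nodes read, then all write
          for i in range(n):                     # one processor per node
              if old_nxt[i] is not None:
                  val[i] = old_val[i] + old_val[old_nxt[i]]
                  nxt[i] = old_nxt[old_nxt[i]]
      return val                                 # val[i] = sum from node i to the tail

  # list 0 -> 1 -> 2 -> 3 with values 1, 2, 3, 4
  print(list_prefix_sums([1, 2, 3, 4], [1, 2, 3, None]))   # [10, 9, 7, 4]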
57
Euler Tour Technique
Tree Problems
  • Preorder numbering
  • Postorder numbering
  • Number of Descendants
  • Level of each node
  • To solve such problems, first transform the tree
    by linearizing it into a linked list, and then
    apply the prefix computation

58
Computing Level of Each Node by Euler Tour
Technique
(figure: an example tree on nodes a..i; each directed Euler-tour edge <u,v> is given
an initial weight w(<u,v>) of +1 or -1, and prefix sums pw(<u,v>) are computed along
the tour)
  level(v) = pw(<v, parent(v)>),  level(root) = 0
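A minimal sketch of the idea (the sign convention here is my own and may be the
mirror image of the slide's figure): downward tour edges get weight +1 and
upward edges -1, so the prefix sum at the edge entering v equals level(v), with
level(root) = 0. The prefix sums would be computed with the parallel prefix
algorithm; here they are accumulated serially.

  # euler_tour_levels.py
  def node_levels(tree, root):
      tour, weights = [], []

      def dfs(u, parent):
          for v in tree.get(u, []):
              if v != parent:
                  tour.append((u, v)); weights.append(+1)   # descend into v
                  dfs(v, u)
                  tour.append((v, u)); weights.append(-1)   # climb back up to u
      dfs(root, None)

      prefix, running = [], 0          # prefix sums along the tour (serial stand-in)
      for w in weights:
          running += w
          prefix.append(running)

      level = {root: 0}
      for (u, v), p in zip(tour, prefix):
          if v not in level:           # first visit to v comes via the downward edge
              level[v] = p
      return level

  tree = {'a': ['b', 'c'], 'b': ['d', 'e'], 'c': []}
  print(node_levels(tree, 'a'))        # {'a': 0, 'b': 1, 'd': 2, 'e': 2, 'c': 1}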
59
Computing Number of Descendants by Euler Tour
Technique
(figure: the same example tree; each Euler-tour edge <u,v> is given an initial
weight w(<u,v>) of 0 or 1, and prefix sums pw(<u,v>) are computed along the tour)
  # of descendants(v) = pw(<parent(v), v>) - pw(<v, parent(v)>),
  # of descendants(root) = n
60
Preorder Numbering by Euler Tour Technique
(figure: the same example tree; each Euler-tour edge <u,v> is given an initial
weight w(<u,v>) of 0 or 1, and prefix sums pw(<u,v>) are computed along the tour)
  preorder(v) = 1 + pw(<v, parent(v)>),  preorder(root) = 1
61
Postorder Numbering by Euler Tour Technique
(figure: the same example tree; each Euler-tour edge <u,v> is given an initial
weight w(<u,v>) of 0 or 1, and prefix sums pw(<u,v>) are computed along the tour)
  postorder(v) = pw(<parent(v), v>),  postorder(root) = n