CMPE 49B Sp. Top. in CMPE: Multi-Core Programming


1

SWE 594 Multicore Programming
picture of Tianhe, the most powerful computer in
the world in Nov-2010
2
Von Neumann Architecture
(Diagram: CPU, RAM, and devices connected by a BUS)
  • sequential computer

3
Memory Hierarchy
  • Fast → Slow: Registers, Cache, Real Memory, Disk, CD
4
History of Computer Architecture
  • 4 Generations (identified by logic technology)
  • Tubes
  • Transistors
  • Integrated Circuits
  • VLSI (very large scale integration)

5
PERFORMANCE TRENDS
6
PERFORMANCE TRENDS
  • Traditional mainframe/supercomputer performance: 25% increase per year
  • But microprocessor performance: 50% increase per year since the mid 80s.

7
Moore's Law
  • Transistor density doubles every 18 months
  • Moore is a co-founder of Intel.
  • 60% increase per year
  • Exponential growth
  • PC costs decline.
  • PCs are the building blocks of all future systems.

Intel 62-core Xeon Phi (2012): 5 billion transistors
8
VLSI Generation
9
Bit-Level Parallelism (up to mid 80s)
  • 4-bit microprocessors replaced by 8-bit, 16-bit, 32-bit, etc.
  • doubling the width of the datapath reduces the number of cycles required to perform a full 32-bit operation
  • by the mid 80s the benefits of this kind of parallelism had been reaped (full 32-bit word operations combined with the use of caches)

10
Instruction-Level Parallelism (mid 80s to mid 90s)
  • Basic steps in instruction processing (instruction decode, integer arithmetic, address calculation) could be performed in a single cycle
  • Pipelined instruction processing
  • Reduced instruction set computing (RISC)
  • Superscalar execution
  • Branch prediction

11
Thread/Process-Level Parallelism (mid 90s to present)
  • On average, control transfers occur roughly once in every five instructions, so exploiting instruction-level parallelism at a larger scale is not possible
  • Use multiple independent threads or processes
  • Concurrently running threads and processes

12
Evolution of the Infrastructure
  • Electronic Accounting Machine Era: 1930-1950
  • General Purpose Mainframe and Minicomputer Era: 1959-Present
  • Personal Computer Era: 1981-Present
  • Client/Server Era: 1983-Present
  • Enterprise Internet Computing Era: 1992-Present

13
Sequential vs Parallel Processing
  • Sequential processing: physical limits reached; easy to program; expensive supercomputers
  • Parallel processing: raw power unlimited; more memory, multiple caches; made up of COTS, so cheap; difficult to program

14
What is Multi-Core Programming ?
  • Answer: it is basically parallel programming on a single computer box (e.g. a desktop, a notebook, a blade)

15
Another Important Benefit of Multi-Core: Reduced Energy Consumption
  • Single core at 2 GHz executes a workload of N clock cycles; in a dual core at 1 GHz, each core executes a workload of N/2 clock cycles.
  • Single core: energy per cycle E_c = C·Vdd², so Energy = E_c·N
  • Dual core: halving the clock frequency allows roughly half the supply voltage, so energy per cycle E_c' = C·(0.5·Vdd)² = 0.25·C·Vdd² = 0.25·E_c
  • Dual core: Energy = 2·(E_c'·0.5·N) = 0.25·(E_c·N) = 0.25 × the single-core Energy
16
SPMD Model (Single Program Multiple Data)
  • Each processor executes the same program
    asynchronously
  • Synchronization takes place only when processors
    need to exchange data
  • SPMD is an extension of SIMD (relaxes synchronized instruction execution)
  • SPMD is a restriction of MIMD (uses only one source/object); a minimal code sketch follows below

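As an illustration (not part of the original slides), here is a minimal SPMD-style sketch assuming MPI as the runtime; the reduction example and all identifiers are illustrative choices, not the course's code:

/* Minimal SPMD sketch (assumption: MPI as the SPMD runtime).
 * Every process runs this same program; behaviour differs only through the rank.
 * Compile with mpicc, run e.g. with mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* start the SPMD execution             */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* my identifier: 0 .. size-1           */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes            */

    int local = rank + 1;                 /* each process computes asynchronously */
    int sum = 0;
    /* processes synchronize only when they need to exchange data */
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of 1..%d = %d\n", size, sum);
    MPI_Finalize();
    return 0;
}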
17
Parallel Processing Terminology
  • Embarrassingly Parallel: applications which are trivial to parallelize; large amounts of independent computation; little communication
  • Data Parallelism: a model of parallel computing in which a single operation can be applied to all data elements simultaneously; amenable to the SIMD or SPMD style of computation
  • Control Parallelism: many different operations may be executed concurrently; requires the MIMD/SPMD style of computation

18
Parallel Processing Terminology
  • Scalability: if the size of the problem is increased, the number of processors that can be effectively used can be increased (i.e. there is no limit on parallelism); the cost of a scalable algorithm grows slowly as the input size and the number of processors are increased; data parallel algorithms are more scalable than control parallel algorithms
  • Granularity: fine-grain machines employ a massive number of weak processors, each with a small memory; coarse-grain machines employ a smaller number of powerful processors, each with a large amount of memory

19
Models of Parallel Computers
  • 1. Message Passing Model: distributed memory; multicomputer
  • 2. Shared Memory Model: multiprocessor; multi-core
  • 3. Theoretical Model: PRAM
  • New architectures are combinations of 1 and 2.

20
Theoretical PRAM Model
  • Used by parallel algorithm designers
  • Algorithm designers do not want to worry about low-level details; they want to concentrate on algorithmic details
  • Extends the classic RAM model
  • Consists of:
  • a control unit (common clock), synchronous operation
  • a global shared memory
  • an unbounded set of processors, each with its own private memory

21
Theoretical PRAM Model
  • Some characteristics
  • Each processor has a unique identifier, mypid = 0, 1, 2, ...
  • All processors operate synchronously under the control of a common clock
  • In each unit of time, each processor is allowed to execute an instruction or stay idle

22
Various PRAM Models
(Figure: PRAM model variants ordered from weakest to strongest according to how write conflicts to the same memory location are handled.)
23
Flynn's Taxonomy
  • classifies computer architectures according to
  • Number of instruction streams it can process at a
    time
  • Number of data elements on which it can operate
    simultaneously

                        Data Streams
                        Single     Multiple
Instruction   Single    SISD       SIMD
Streams       Multiple  MISD       MIMD
24
Shared Memory Machines
  • Memory is globally shared, therefore processes (threads) see a single address space
  • Coordination of accesses to shared locations is done by means of locks provided by thread libraries
  • Example Machines: Sequent, Alliant, SUN Ultra, Dual/Quad-Board Pentium PC
  • Example Thread Libraries: POSIX threads, Linux threads (a minimal pthreads sketch follows below)

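A minimal shared-memory sketch (an added illustration, assuming POSIX threads; the counter example is hypothetical): two threads update a shared counter in the single address space, and a mutex from the thread library coordinates the accesses.

/* Shared-memory coordination with a lock from the thread library (POSIX threads). */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                           /* lives in the single shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);                 /* coordinate access to the shared location */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);            /* prints 2000000 */
    return 0;
}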
25
Shared Memory Machines
  • can be classified as
  • UMA: uniform memory access
  • NUMA: non-uniform memory access
  • based on the amount of time a processor takes to
    access local and global memory.

(Diagrams (a), (b), (c): processors P and memory modules M connected by an interconnection network or bus, illustrating machines with local memories and machines with globally shared memory.)
26
Distributed Memory Machines
  • Each processor has its own local memory (not directly accessible by the others)
  • Processors communicate by passing messages to each other (a minimal message-passing sketch follows below)
  • Example Machines: IBM SP2, Intel Paragon, COWs (clusters of workstations)
  • Example Message Passing Libraries: PVM, MPI

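A minimal message-passing sketch (an added illustration, assuming MPI; the value 42 and the ranks used are arbitrary): process 1 sends a value from its local memory to process 0, since no memory is shared between them.

/* Message passing between two processes with private memories (assumption: MPI). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {
        value = 42;                                /* lives in process 1's local memory */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d from rank 1\n", value);
    }
    MPI_Finalize();
    return 0;
}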
27
Beowulf Clusters
  • Use COTS, ordinary PCs and networking equipment
  • Has the best price/performance ratio

PC cluster
28
Multi-Core Computing
  • A multi-core microprocessor is one which combines
    two or more independent processors into a single
    package, often a single integrated circuit.
  • A dual-core device contains only two independent
    microprocessors.

29
Comparison of Different Architectures
Single Core Architecture
30
Comparison of Different Architectures
Multiprocessor
31
Comparison of Different Architectures
Hyper-Threading Technology: (diagram) two CPU states sharing a single execution unit and cache
32
Comparison of Different Architectures
Multi-Core Architecture
33
Comparison of Different Architectures
Multi-Core Architecture with Shared Cache: (diagram) two CPU states, each with its own execution unit, sharing a single cache
34
Comparison of Different Architectures
Multi-Core with Hyper-Threading Technology
35
36
Top 500 Most Powerful Supercomputer Lists
  • http://www.top500.org/
  • ..

37
Grid Computing
  • Provide access to computing power and various resources just like accessing electrical power from the electrical grid
  • Allows coupling of geographically distributed resources
  • Provides inexpensive access to resources irrespective of their physical location or access point
  • The Internet or dedicated networks can be used to interconnect distributed computational resources and present them as a single unified resource
  • Resources: supercomputers, clusters, storage systems, data resources, special devices

38
Grid Computing
  • The GRID is, in effect, a set of software tools which, when combined with hardware, lets users tap processing power off the Internet as easily as electrical power can be drawn from the electricity grid.
  • Examples of Grids:
  • TeraGrid (USA)
  • EGEE Grid (Europe)
  • TR-Grid (Turkey)

39
GRID COMPUTING
(Figure: analogy between the Power Grid and the Compute Grid)
40
Application domains: Archeology, Astronomy, Astrophysics, Civil Protection, Comp. Chemistry, Earth Sciences, Finance, Fusion, Geophysics, High Energy Physics, Life Sciences, Multimedia, Material Sciences
>250 sites, 48 countries, >50,000 CPUs, >20 PetaBytes, >10,000 users, >150 VOs, >150,000 jobs/day
41
Virtualization
  • Virtualization is abstraction of computer
    resources.
  • Make a single physical resource, such as a server, an operating system, an application, or a storage device, appear to function as multiple logical resources
  • It may also mean making multiple physical
    resources such as storage devices or servers
    appear as a single logical resource
  • Server virtualization enables companies to run
    more than one operating system at the same time
    on a single machine

42
Advantages of Virtualization
  • Most servers run at just 10-15% capacity; virtualization can increase server utilization to 70% or higher.
  • Higher utilization means fewer computers are required to process the same amount of work, and fewer machines means less power consumption.
  • Legacy applications can also be run on older versions of an operating system
  • Other advantages: easier administration, fault tolerance, security

43
VMware Virtual Platform
(Diagram: virtual machines running on top of real machines)
  • VMware is now a company worth tens of billions of dollars!

44
Cloud Computing
  • Style of computing in which IT-related capabilities are provided as a service, allowing users to access technology-enabled services from the Internet ("in the cloud") without knowledge of, expertise with, or control over the technology infrastructure that supports them.
  • General concept that incorporates software as a service (SaaS), Web 2.0 and other recent, well-known technology trends, in which the common theme is reliance on the Internet for satisfying the computing needs of the users.

44
45
Cloud Computing
  • Virtualisation provides separation between
    infrastructure and user runtime environment
  • Users specify virtual images as their deployment
    building blocks
  • Pay-as-you-go allows users to use the service
    when they want and only pay for what they use
  • Elasticity of the cloud allows users to start
    simple and explore more complex deployment over
    time
  • Simple interface allows easy integration with
    existing systems

46
Cloud Unique Features
  • Ease of use
  • REST and HTTP(S)
  • Runtime environment
  • Hardware virtualisation
  • Gives users full control
  • Elasticity
  • Pay-as-you-go
  • Cloud providers can buy hardware faster than you!

47
Example Cloud Amazon Web Services
  • EC2 (Elastic Compute Cloud) is the computing service of Amazon
  • Based on hardware virtualisation
  • Users request virtual machine instances, pointing
    to an image (public or private) stored in S3
  • Users have full control over each instance (e.g.
    access as root, if required)
  • Requests can be issued via SOAP and REST

48
Example Cloud Amazon Web Services
  • Pricing information: http://aws.amazon.com/ec2/

49
PARALLEL PERFORMANCE MODELS and ALGORITHMS
50
Amdahl's Law
  • The serial percentage of a program is fixed, so the speed-up obtained by employing parallel processing is bounded.
  • Led to pessimism in the parallel processing community and prevented the development of parallel machines for a long time.

  • Speedup = 1 / (s + (1 - s)/P), where s is the serial fraction of the program and P is the number of processors
  • In the limit (as P grows): Speedup = 1/s
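A short worked instance (the numbers are an added illustration, not from the slides): with a serial fraction $s = 0.1$ and $P = 16$ processors,
$$\text{Speedup} = \frac{1}{0.1 + 0.9/16} = 6.4, \qquad \lim_{P \to \infty}\text{Speedup} = \frac{1}{0.1} = 10,$$
so no matter how many processors are added, the speed-up can never exceed 10.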
51
Gustafson's Law
  • The serial percentage depends on the number of processors / the input size.
  • Demonstrated more than 1000-fold speedup using 1024 processors.
  • Justified parallel processing

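The usual statement of the law (added here for reference; it does not appear textually on the slide): if $s$ is the serial fraction measured on the parallel system with $P$ processors, the scaled speedup is
$$S(P) = s + (1 - s)\,P .$$
For example, with $s \approx 0.5\%$ and $P = 1024$ this gives $S \approx 1019$, consistent with the more-than-1000-fold figure above.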
52
Algorithmic Performance Parameters
  • Notation

Input size
Time Complexity of the best sequential algorithm
Number of processors
Time complexity of the parallel algorithm when
run on P processors
Time complexity of the parallel algorithm when run on one processor
53
Algorithmic Performance Parameters
  • Speed-Up
  • Efficiency

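The slide gives these two measures as formulas; in standard notation (the symbols are assumed here: $T^*(n)$ is the time of the best sequential algorithm and $T_P(n)$ the parallel time on $P$ processors):
$$\text{Speedup } S(P) = \frac{T^*(n)}{T_P(n)}, \qquad \text{Efficiency } E(P) = \frac{S(P)}{P} = \frac{T^*(n)}{P\,T_P(n)} .$$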
54
Algorithmic Performance Parameters
  • Work = Processors × Time
  • Informally: how much time the parallel algorithm would take to simulate on a serial machine
  • Formally (see the note below):

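A standard formal definition (added; using the assumed notation, with $T_P(n)$ the parallel running time on $P$ processors):
$$W(n) = P \cdot T_P(n) .$$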
55
Algorithmic Performance Parameters
  • Work Efficient
  • Informally: a work-efficient parallel algorithm does no more work than the best serial algorithm
  • Formally: a work-efficient algorithm satisfies the condition given below

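A standard way to state the condition (added; with $T^*(n)$ the best sequential time and $T_P(n)$ the parallel time on $P$ processors, notation assumed as before):
$$P \cdot T_P(n) = O\!\left(T^*(n)\right),$$
i.e. the total work of the parallel algorithm is within a constant factor of the best sequential running time.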
56
Algorithmic Performance Parameters
  • Scalability
  • Informally, scalability implies that if the size
    of the problem is increased, the number of
    processors effectively used can be increased
    (i.e. there is no limit on parallelism)
  • Formally, scalability means

57
Algorithmic Performance Parameters
  • Some remarks
  • The cost of a scalable algorithm grows slowly as the input size and the number of processors are increased
  • Level of control parallelism is usually a
    constant independent of problem size
  • Level of data parallelism is an increasing
    function of problem size
  • Data parallel algorithms are more scalable than
    control parallel algorithms

58
Goals in Designing Parallel Algorithms
  • Scalability
  • Algorithm cost grows slowly, preferably in a
    polylogarithmic manner
  • Work Efficient
  • We do not want to waste CPU cycles
  • May be an important point when we are worried
    about power consumption or money paid for CPU
    usage

59
Summing N numbers in Parallel
(Figure: balanced binary summation tree over the array elements; the final result appears at the root.)
  • An array of N numbers can be summed in log(N) steps using N/2 processors (a code sketch follows below)
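An illustrative sketch of this algorithm (added; OpenMP is an assumption for the threading, and N = 8 is just an example size): in the step with stride s, elements 2s apart are added pairwise, so after log2(N) steps x[0] holds the total.

/* Pairwise parallel summation in log2(N) steps (assumption: OpenMP). */
#include <stdio.h>

int main(void)
{
    enum { N = 8 };                                 /* N assumed to be a power of two      */
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    for (int stride = 1; stride < N; stride *= 2) { /* log2(N) steps                       */
        #pragma omp parallel for                    /* up to N/2 additions run in parallel */
        for (int i = 0; i < N; i += 2 * stride)
            x[i] += x[i + stride];
    }
    printf("sum = %.0f\n", x[0]);                   /* prints 36 */
    return 0;
}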
60
Prefix Summing N numbers in Parallel
(Figure: parallel prefix computation on 8 elements x1..x8; after step 1, step 2 and step 3 = log(8) steps, position i holds the sum of x_i through x_8.)
  • Computing the partial sums of an array of N numbers can be done in log(N) steps using N processors (a code sketch follows below)

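An illustrative sketch (added; OpenMP assumed, and it computes the more common left-to-right inclusive prefix sums rather than the right-to-left sums drawn in the figure): in the step with offset d, every element i >= d adds in element i - d, so log2(N) steps suffice with one processor per element.

/* Inclusive prefix sums in log2(N) parallel steps (assumption: OpenMP). */
#include <stdio.h>
#include <string.h>

int main(void)
{
    enum { N = 8 };
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8}, b[N];

    for (int d = 1; d < N; d *= 2) {
        #pragma omp parallel for          /* all N updates of one step run in parallel          */
        for (int i = 0; i < N; i++)
            b[i] = (i >= d) ? a[i] + a[i - d] : a[i];
        memcpy(a, b, sizeof a);           /* step barrier: results become inputs of the next step */
    }
    for (int i = 0; i < N; i++)
        printf("%.0f ", a[i]);            /* prints: 1 3 6 10 15 21 28 36 */
    printf("\n");
    return 0;
}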
61
Prefix Paradigm for Parallel Algorithm Design
  • Prefix computation forms a paradigm for parallel algorithm development, just like other well known paradigms such as divide and conquer, dynamic programming, etc.
  • Prefix Paradigm: if possible, transform your problem to a prefix-type computation, then apply the efficient logarithmic prefix computation
  • Examples of problems solved by the prefix paradigm:
  • Solving linear recurrence equations
  • Tridiagonal solvers
  • Problems on trees
  • Adaptive triangular mesh refinement

62
Solving Linear Recurrence Equations
  • Given a linear recurrence equation, we can rewrite it so that, when expanded, the solution is expressed in terms of partial products of the coefficients and the initial values z1 and z0
  • Use prefix to compute the partial products (a sketch of one standard formulation follows below)

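A possible concrete form (added; the particular second-order recurrence below is an assumption, since the slide's equations were not preserved): for $z_i = a_i z_{i-1} + b_i z_{i-2}$,
$$\begin{pmatrix} z_i \\ z_{i-1} \end{pmatrix} = \begin{pmatrix} a_i & b_i \\ 1 & 0 \end{pmatrix}\begin{pmatrix} z_{i-1} \\ z_{i-2} \end{pmatrix} \;\Rightarrow\; \begin{pmatrix} z_i \\ z_{i-1} \end{pmatrix} = M_i M_{i-1} \cdots M_2 \begin{pmatrix} z_1 \\ z_0 \end{pmatrix}, \quad M_k = \begin{pmatrix} a_k & b_k \\ 1 & 0 \end{pmatrix},$$
so every $z_i$ is obtained from a prefix computation over the $2 \times 2$ coefficient matrices (matrix multiplication is associative, which is all the prefix computation requires).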
63
Pointer Jumping Technique
(Figure: pointer jumping over a linked list in 3 steps; in each step every node adds in the value of its successor and then points past it.)
  • A linked list of N numbers can be prefix-summed in log(N) steps using N processors (a code sketch follows below)
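An illustrative sketch of pointer jumping (added; the direction of summation, toward the tail, and the concrete list are assumptions): every node, conceptually in parallel, adds its successor's value to its own and jumps its pointer over that successor; after log2(N) rounds each node holds the sum from itself to the end of the list.

/* Pointer jumping (prefix sums on a linked list); the OpenMP pragma is optional. */
#include <stdio.h>
#include <string.h>

#define N 8

int main(void)
{
    int    next[N] = {1, 2, 3, 4, 5, 6, 7, -1};   /* -1 marks the end of the list        */
    double val[N]  = {1, 2, 3, 4, 5, 6, 7, 8};
    int    nnew[N];
    double vnew[N];

    for (int step = 0; step < 3; step++) {        /* log2(8) = 3 rounds                  */
        #pragma omp parallel for                  /* conceptually: all nodes at once     */
        for (int i = 0; i < N; i++) {
            if (next[i] != -1) {
                vnew[i] = val[i] + val[next[i]];  /* add in the successor's value        */
                nnew[i] = next[next[i]];          /* jump the pointer over the successor */
            } else {
                vnew[i] = val[i];
                nnew[i] = -1;
            }
        }
        memcpy(val, vnew, sizeof val);            /* double buffering keeps rounds synchronous */
        memcpy(next, nnew, sizeof next);
    }
    for (int i = 0; i < N; i++)
        printf("%.0f ", val[i]);                  /* prints: 36 35 33 30 26 21 15 8 */
    printf("\n");
    return 0;
}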
64
Euler Tour Technique
Tree Problems
  • Preorder numbering
  • Postorder numbering
  • Number of Descendants
  • Level of each node
  • To solve such problems, first transform the tree by linearizing it into a linked list and then apply the prefix computation

65
Computing Level of Each Node by Euler Tour Technique
(Figure: an example tree, its Euler tour, the +1/-1 weight assignment w(<u,v>) on the tour edges, and the resulting prefix sums pw(<u,v>).)
level(v) = pw(<v,parent(v)>),   level(root) = 0
66
Computing Number of Descendants by Euler Tour Technique
(Figure: the same example tree and Euler tour with a 0/1 weight assignment w(<u,v>) and the resulting prefix sums pw(<u,v>).)
# of descendants(v) = pw(<parent(v),v>) - pw(<v,parent(v)>),   # of descendants(root) = n
67
Preorder Numbering by Euler Tour Technique
(Figure: the example tree and Euler tour with a 0/1 weight assignment w(<u,v>) and the resulting prefix sums pw(<u,v>).)
preorder(v) = 1 + pw(<v,parent(v)>),   preorder(root) = 1
68
Postorder Numbering by Euler Tour Technique
(Figure: the example tree and Euler tour with a 0/1 weight assignment w(<u,v>) and the resulting prefix sums pw(<u,v>).)
postorder(v) = pw(<parent(v),v>),   postorder(root) = n
69
Binary Tree Traversal
  • Preorder
  • Inorder
  • Postorder

70
Brent's Theorem
  • Given a parallel algorithm with computation time (depth) D, if the parallel algorithm performs W operations in total, then P processors can execute the algorithm in time D + (W - D)/P
  • For the proof, consider the DAG representation of the computation

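A worked instance (added illustration): the parallel summation of N numbers performs W = N - 1 additions at depth D = log2(N), so by Brent's theorem P processors run it in about log2(N) + (N - 1 - log2(N))/P time; choosing P = N/log2(N) gives O(log N) time with O(N) total work, i.e. a work-efficient schedule.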
71
Work Efficiency
  • Parallel Summation
  • Parallel Prefix Summation