CMPE 49B Sp. Top. in CMPE: Multi-Core Programming


1

SWE 594 Multicore Programming
picture of Tianhe, the most powerful computer in
the world in Nov-2010
2
Von Neumann Architecture
(Diagram: CPU, RAM, and devices connected by a BUS)
  • sequential computer

3
Memory Hierarchy
  • Fast → Slow: Registers, Cache, Real Memory, Disk, CD
4
History of Computer Architecture
  • 4 Generations (identified by logic technology)
  • Tubes
  • Transistors
  • Integrated Circuits
  • VLSI (very large scale integration)

5
PERFORMANCE TRENDS
6
PERFORMANCE TRENDS
  • Traditional mainframe/supercomputer performance: 25% increase per year
  • But microprocessor performance: 50% increase per year since the mid 80s.

7
Moore's Law
  • Transistor density doubles every 18 months
  • Moore is a co-founder of Intel.
  • 60% increase per year
  • Exponential growth
  • PC costs decline.
  • PCs are the building blocks of all future systems.

Intel 62-core Xeon Phi (2012): 5 billion transistors
8
VLSI Generation
9
Bit-Level Parallelism (up to mid 80s)
  • 4-bit microprocessors replaced by 8-bit, 16-bit, 32-bit, etc.
  • doubling the width of the datapath reduces the number of cycles required to perform a full 32-bit operation
  • by the mid 80s the benefits of this kind of parallelism had been reaped (full 32-bit word operations combined with the use of caches)

10
Instruction-Level Parallelism (mid 80s to mid 90s)
  • Basic steps in instruction processing (instruction decode, integer arithmetic, address calculation) could be performed in a single cycle
  • Pipelined instruction processing
  • Reduced instruction set computing (RISC)
  • Superscalar execution
  • Branch prediction

11
Thread/Process-Level Parallelism (mid 90s to present)
  • On average, control transfers occur roughly once in every five instructions, so exploiting instruction-level parallelism at a larger scale is not possible
  • Use multiple independent threads or processes
  • Concurrently running threads and processes

12
Evolution of the Infrastructure
  • Electronic Accounting Machine Era: 1930-1950
  • General Purpose Mainframe and Minicomputer Era: 1959-Present
  • Personal Computer Era: 1981-Present
  • Client/Server Era: 1983-Present
  • Enterprise Internet Computing Era: 1992-Present

13
Sequential vs Parallel Processing
  • Sequential processing: physical limits reached; easy to program; expensive supercomputers
  • Parallel processing: raw power unlimited; more memory, multiple caches; made up of COTS, so cheap; difficult to program

14
What is Multi-Core Programming ?
  • Answer: it is basically parallel programming on a single computer box (e.g. a desktop, a notebook, a blade)

15
Another Important Benefit of Multi-Core: Reduced Energy Consumption
  • Single core at 2 GHz executes a workload of N clock cycles; in a dual core at 1 GHz, each core executes a workload of N/2 clock cycles.
  • Single core: energy per cycle E_c = C·Vdd², so Energy = E_c·N
  • Dual core: halving the clock frequency allows roughly half the supply voltage, so energy per cycle E_c' = C·(0.5·Vdd)² = 0.25·C·Vdd² = 0.25·E_c
  • Dual core: Energy = 2·(E_c'·0.5·N) = 0.25·(E_c·N) = 0.25 × the single-core Energy
16
SPMD Model (Single Program Multiple Data)
  • Each processor executes the same program
    asynchronously
  • Synchronization takes place only when processors
    need to exchange data
  • SPMD is an extension of SIMD (relaxes synchronized instruction execution)
  • SPMD is a restriction of MIMD (uses only one source/object); a minimal code sketch follows below

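As an illustration (not part of the original slides), here is a minimal SPMD-style sketch assuming MPI as the runtime; the reduction example and all identifiers are illustrative choices, not the course's code:

/* Minimal SPMD sketch (assumption: MPI as the SPMD runtime).
 * Every process runs this same program; behaviour differs only through the rank.
 * Compile with mpicc, run e.g. with mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* start the SPMD execution             */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* my identifier: 0 .. size-1           */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes            */

    int local = rank + 1;                 /* each process computes asynchronously */
    int sum = 0;
    /* processes synchronize only when they need to exchange data */
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of 1..%d = %d\n", size, sum);
    MPI_Finalize();
    return 0;
}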
17
Parallel Processing Terminology
  • Embarrassingly Parallel: applications which are trivial to parallelize; large amounts of independent computation; little communication
  • Data Parallelism: a model of parallel computing in which a single operation can be applied to all data elements simultaneously; amenable to the SIMD or SPMD style of computation
  • Control Parallelism: many different operations may be executed concurrently; requires the MIMD/SPMD style of computation

18
Parallel Processing Terminology
  • Scalability: if the size of the problem is increased, the number of processors that can be effectively used can be increased (i.e. there is no limit on parallelism); the cost of a scalable algorithm grows slowly as the input size and the number of processors are increased; data parallel algorithms are more scalable than control parallel algorithms
  • Granularity: fine-grain machines employ a massive number of weak processors, each with a small memory; coarse-grain machines employ a smaller number of powerful processors, each with a large amount of memory

19
Models of Parallel Computers
  • 1. Message Passing Model: distributed memory; multicomputer
  • 2. Shared Memory Model: multiprocessor; multi-core
  • 3. Theoretical Model: PRAM
  • New architectures are combinations of 1 and 2.

20
Theoretical PRAM Model
  • Used by parallel algorithm designers
  • Algorithm designers do not want to worry about low-level details; they want to concentrate on algorithmic details
  • Extends the classic RAM model
  • Consists of:
  • a control unit (common clock), synchronous operation
  • a global shared memory
  • an unbounded set of processors, each with its own private memory

21
Theoretical PRAM Model
  • Some characteristics
  • Each processor has a unique identifier, mypid = 0, 1, 2, ...
  • All processors operate synchronously under the control of a common clock
  • In each unit of time, each processor is allowed to execute an instruction or stay idle

22
Various PRAM Models
(Figure: PRAM model variants ordered from weakest to strongest according to how write conflicts to the same memory location are handled.)
23
Flynn's Taxonomy
  • classifies computer architectures according to
  • Number of instruction streams it can process at a
    time
  • Number of data elements on which it can operate
    simultaneously

                        Data Streams
                        Single     Multiple
Instruction   Single    SISD       SIMD
Streams       Multiple  MISD       MIMD
24
Shared Memory Machines
  • Memory is globally shared, therefore processes (threads) see a single address space
  • Coordination of accesses to shared locations is done by means of locks provided by thread libraries
  • Example Machines: Sequent, Alliant, SUN Ultra, Dual/Quad-Board Pentium PC
  • Example Thread Libraries: POSIX threads, Linux threads (a minimal pthreads sketch follows below)

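A minimal shared-memory sketch (an added illustration, assuming POSIX threads; the counter example is hypothetical): two threads update a shared counter in the single address space, and a mutex from the thread library coordinates the accesses.

/* Shared-memory coordination with a lock from the thread library (POSIX threads). */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                           /* lives in the single shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);                 /* coordinate access to the shared location */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);            /* prints 2000000 */
    return 0;
}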
25
Shared Memory Machines
  • can be classified as
  • UMA: uniform memory access
  • NUMA: non-uniform memory access
  • based on the amount of time a processor takes to
    access local and global memory.

(Diagrams (a), (b), (c): processors P and memory modules M connected by an interconnection network or bus, illustrating machines with local memories and machines with globally shared memory.)
26
Distributed Memory Machines
  • Each processor has its own local memory (not directly accessible by the others)
  • Processors communicate by passing messages to each other (a minimal message-passing sketch follows below)
  • Example Machines: IBM SP2, Intel Paragon, COWs (clusters of workstations)
  • Example Message Passing Libraries: PVM, MPI

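A minimal message-passing sketch (an added illustration, assuming MPI; the value 42 and the ranks used are arbitrary): process 1 sends a value from its local memory to process 0, since no memory is shared between them.

/* Message passing between two processes with private memories (assumption: MPI). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {
        value = 42;                                /* lives in process 1's local memory */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d from rank 1\n", value);
    }
    MPI_Finalize();
    return 0;
}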
27
Beowulf Clusters
  • Use COTS, ordinary PCs and networking equipment
  • Has the best price/performance ratio

PC cluster
28
Multi-Core Computing
  • A multi-core microprocessor is one which combines
    two or more independent processors into a single
    package, often a single integrated circuit.
  • A dual-core device contains only two independent
    microprocessors.

29
Comparison of Different Architectures
Single Core Architecture
30
Comparison of Different Architectures
Multiprocessor
31
Comparison of Different Architectures
Hyper-Threading Technology: (diagram) two CPU states sharing a single execution unit and cache
32
Comparison of Different Architectures
Multi-Core Architecture
33
Comparison of Different Architectures
Multi-Core Architecture with Shared Cache: (diagram) two CPU states, each with its own execution unit, sharing a single cache
34
Comparison of Different Architectures
Multi-Core with Hyper-Threading Technology
35
36
Top 500 Most Powerful Supercomputer Lists
  • http://www.top500.org/
  • ..

37
Grid Computing
  • Provide access to computing power and various resources just like accessing electrical power from the electrical grid
  • Allows coupling of geographically distributed resources
  • Provides inexpensive access to resources irrespective of their physical location or access point
  • The Internet or dedicated networks can be used to interconnect distributed computational resources and present them as a single unified resource
  • Resources: supercomputers, clusters, storage systems, data resources, special devices

38
Grid Computing
  • The GRID is, in effect, a set of software tools which, when combined with hardware, lets users tap processing power off the Internet as easily as electrical power can be drawn from the electricity grid.
  • Examples of Grids:
  • TeraGrid (USA)
  • EGEE Grid (Europe)
  • TR-Grid (Turkey)

39
GRID COMPUTING
(Figure: analogy between the Power Grid and the Compute Grid)
40
Application domains: Archeology, Astronomy, Astrophysics, Civil Protection, Comp. Chemistry, Earth Sciences, Finance, Fusion, Geophysics, High Energy Physics, Life Sciences, Multimedia, Material Sciences
>250 sites, 48 countries, >50,000 CPUs, >20 PetaBytes, >10,000 users, >150 VOs, >150,000 jobs/day
41
Virtualization
  • Virtualization is abstraction of computer
    resources.
  • Make a single physical resource, such as a server, an operating system, an application, or a storage device, appear to function as multiple logical resources
  • It may also mean making multiple physical
    resources such as storage devices or servers
    appear as a single logical resource
  • Server virtualization enables companies to run
    more than one operating system at the same time
    on a single machine

42
Advantages of Virtualization
  • Most servers run at just 10-15% capacity; virtualization can increase server utilization to 70% or higher.
  • Higher utilization means fewer computers are required to process the same amount of work, and fewer machines means less power consumption.
  • Legacy applications can also be run on older versions of an operating system
  • Other advantages: easier administration, fault tolerance, security

43
VMware Virtual Platform
(Diagram: virtual machines running on top of real machines)
  • VMware is now a company worth tens of billions of dollars!

44
Cloud Computing
  • Style of computing in which IT-related capabilities are provided as a service, allowing users to access technology-enabled services from the Internet ("in the cloud") without knowledge of, expertise with, or control over the technology infrastructure that supports them.
  • General concept that incorporates software as a service (SaaS), Web 2.0 and other recent, well-known technology trends, in which the common theme is reliance on the Internet for satisfying the computing needs of the users.

44
45
Cloud Computing
  • Virtualisation provides separation between
    infrastructure and user runtime environment
  • Users specify virtual images as their deployment
    building blocks
  • Pay-as-you-go allows users to use the service
    when they want and only pay for what they use
  • Elasticity of the cloud allows users to start
    simple and explore more complex deployment over
    time
  • Simple interface allows easy integration with
    existing systems

46
Cloud Unique Features
  • Ease of use
  • REST and HTTP(S)
  • Runtime environment
  • Hardware virtualisation
  • Gives users full control
  • Elasticity
  • Pay-as-you-go
  • Cloud providers can buy hardware faster than you!

47
Example Cloud Amazon Web Services
  • EC2 (Elastic Compute Cloud) is the computing service of Amazon
  • Based on hardware virtualisation
  • Users request virtual machine instances, pointing
    to an image (public or private) stored in S3
  • Users have full control over each instance (e.g.
    access as root, if required)
  • Requests can be issued via SOAP and REST

48
Example Cloud Amazon Web Services
  • Pricing information: http://aws.amazon.com/ec2/

49
PARALLEL PERFORMANCE MODELS and ALGORITHMS
50
Amdahl's Law
  • The serial percentage of a program is fixed, so the speed-up obtained by employing parallel processing is bounded.
  • Led to pessimism in the parallel processing community and prevented the development of parallel machines for a long time.

  • Speedup = 1 / (s + (1 - s)/P), where s is the serial fraction of the program and P is the number of processors
  • In the limit (as P grows): Speedup = 1/s
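A short worked instance (the numbers are an added illustration, not from the slides): with a serial fraction $s = 0.1$ and $P = 16$ processors,
$$\text{Speedup} = \frac{1}{0.1 + 0.9/16} = 6.4, \qquad \lim_{P \to \infty}\text{Speedup} = \frac{1}{0.1} = 10,$$
so no matter how many processors are added, the speed-up can never exceed 10.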
51
Gustafson's Law
  • The serial percentage depends on the number of processors / the input size.
  • Demonstrated more than 1000-fold speedup using 1024 processors.
  • Justified parallel processing

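The usual statement of the law (added here for reference; it does not appear textually on the slide): if $s$ is the serial fraction measured on the parallel system with $P$ processors, the scaled speedup is
$$S(P) = s + (1 - s)\,P .$$
For example, with $s \approx 0.5\%$ and $P = 1024$ this gives $S \approx 1019$, consistent with the more-than-1000-fold figure above.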
52
Algorithmic Performance Parameters
  • Notation

Input size
Time Complexity of the best sequential algorithm
Number of processors
Time complexity of the parallel algorithm when
run on P processors
Time complexity of the parallel algorithm when run on one processor
53
Algorithmic Performance Parameters
  • Speed-Up
  • Efficiency

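The slide gives these two measures as formulas; in standard notation (the symbols are assumed here: $T^*(n)$ is the time of the best sequential algorithm and $T_P(n)$ the parallel time on $P$ processors):
$$\text{Speedup } S(P) = \frac{T^*(n)}{T_P(n)}, \qquad \text{Efficiency } E(P) = \frac{S(P)}{P} = \frac{T^*(n)}{P\,T_P(n)} .$$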
54
Algorithmic Performance Parameters
  • Work = Processors × Time
  • Informally: how much time the parallel algorithm would take to simulate on a serial machine
  • Formally (see the note below):

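A standard formal definition (added; using the assumed notation, with $T_P(n)$ the parallel running time on $P$ processors):
$$W(n) = P \cdot T_P(n) .$$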
55
Algorithmic Performance Parameters
  • Work Efficient
  • Informally: a work-efficient parallel algorithm does no more work than the best serial algorithm
  • Formally: a work-efficient algorithm satisfies the condition given below

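A standard way to state the condition (added; with $T^*(n)$ the best sequential time and $T_P(n)$ the parallel time on $P$ processors, notation assumed as before):
$$P \cdot T_P(n) = O\!\left(T^*(n)\right),$$
i.e. the total work of the parallel algorithm is within a constant factor of the best sequential running time.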
56
Algorithmic Performance Parameters
  • Scalability
  • Informally, scalability implies that if the size
    of the problem is increased, the number of
    processors effectively used can be increased
    (i.e. there is no limit on parallelism)
  • Formally, scalability means

57
Algorithmic Performance Parameters
  • Some remarks
  • The cost of a scalable algorithm grows slowly as the input size and the number of processors are increased
  • Level of control parallelism is usually a
    constant independent of problem size
  • Level of data parallelism is an increasing
    function of problem size
  • Data parallel algorithms are more scalable than
    control parallel algorithms

58
Goals in Designing Parallel Algorithms
  • Scalability
  • Algorithm cost grows slowly, preferably in a
    polylogarithmic manner
  • Work Efficient
  • We do not want to waste CPU cycles
  • May be an important point when we are worried
    about power consumption or money paid for CPU
    usage

59
Summing N numbers in Parallel
(Figure: balanced binary summation tree over the array elements; the final result appears at the root.)
  • An array of N numbers can be summed in log(N) steps using N/2 processors (a code sketch follows below)
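An illustrative sketch of this algorithm (added; OpenMP is an assumption for the threading, and N = 8 is just an example size): in the step with stride s, elements 2s apart are added pairwise, so after log2(N) steps x[0] holds the total.

/* Pairwise parallel summation in log2(N) steps (assumption: OpenMP). */
#include <stdio.h>

int main(void)
{
    enum { N = 8 };                                 /* N assumed to be a power of two      */
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    for (int stride = 1; stride < N; stride *= 2) { /* log2(N) steps                       */
        #pragma omp parallel for                    /* up to N/2 additions run in parallel */
        for (int i = 0; i < N; i += 2 * stride)
            x[i] += x[i + stride];
    }
    printf("sum = %.0f\n", x[0]);                   /* prints 36 */
    return 0;
}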
60
Prefix Summing N numbers in Parallel
(Figure: parallel prefix computation on 8 elements x1..x8; after step 1, step 2 and step 3 = log(8) steps, position i holds the sum of x_i through x_8.)
  • Computing the partial sums of an array of N numbers can be done in log(N) steps using N processors (a code sketch follows below)

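An illustrative sketch (added; OpenMP assumed, and it computes the more common left-to-right inclusive prefix sums rather than the right-to-left sums drawn in the figure): in the step with offset d, every element i >= d adds in element i - d, so log2(N) steps suffice with one processor per element.

/* Inclusive prefix sums in log2(N) parallel steps (assumption: OpenMP). */
#include <stdio.h>
#include <string.h>

int main(void)
{
    enum { N = 8 };
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8}, b[N];

    for (int d = 1; d < N; d *= 2) {
        #pragma omp parallel for          /* all N updates of one step run in parallel          */
        for (int i = 0; i < N; i++)
            b[i] = (i >= d) ? a[i] + a[i - d] : a[i];
        memcpy(a, b, sizeof a);           /* step barrier: results become inputs of the next step */
    }
    for (int i = 0; i < N; i++)
        printf("%.0f ", a[i]);            /* prints: 1 3 6 10 15 21 28 36 */
    printf("\n");
    return 0;
}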
61
Prefix Paradigm for Parallel Algorithm Design
  • Prefix computation forms a paradigm for parallel algorithm development, just like other well known paradigms such as divide and conquer, dynamic programming, etc.
  • Prefix Paradigm: if possible, transform your problem to a prefix-type computation, then apply the efficient logarithmic prefix computation
  • Examples of problems solved by the prefix paradigm:
  • Solving linear recurrence equations
  • Tridiagonal solvers
  • Problems on trees
  • Adaptive triangular mesh refinement

62
Solving Linear Recurrence Equations
  • Given a linear recurrence equation, we can rewrite it so that, when expanded, the solution is expressed in terms of partial products of the coefficients and the initial values z1 and z0
  • Use prefix to compute the partial products (a sketch of one standard formulation follows below)

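A possible concrete form (added; the particular second-order recurrence below is an assumption, since the slide's equations were not preserved): for $z_i = a_i z_{i-1} + b_i z_{i-2}$,
$$\begin{pmatrix} z_i \\ z_{i-1} \end{pmatrix} = \begin{pmatrix} a_i & b_i \\ 1 & 0 \end{pmatrix}\begin{pmatrix} z_{i-1} \\ z_{i-2} \end{pmatrix} \;\Rightarrow\; \begin{pmatrix} z_i \\ z_{i-1} \end{pmatrix} = M_i M_{i-1} \cdots M_2 \begin{pmatrix} z_1 \\ z_0 \end{pmatrix}, \quad M_k = \begin{pmatrix} a_k & b_k \\ 1 & 0 \end{pmatrix},$$
so every $z_i$ is obtained from a prefix computation over the $2 \times 2$ coefficient matrices (matrix multiplication is associative, which is all the prefix computation requires).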
63
Pointer Jumping Technique
(Figure: pointer jumping over a linked list in 3 steps; in each step every node adds in the value of its successor and then points past it.)
  • A linked list of N numbers can be prefix-summed in log(N) steps using N processors (a code sketch follows below)
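An illustrative sketch of pointer jumping (added; the direction of summation, toward the tail, and the concrete list are assumptions): every node, conceptually in parallel, adds its successor's value to its own and jumps its pointer over that successor; after log2(N) rounds each node holds the sum from itself to the end of the list.

/* Pointer jumping (prefix sums on a linked list); the OpenMP pragma is optional. */
#include <stdio.h>
#include <string.h>

#define N 8

int main(void)
{
    int    next[N] = {1, 2, 3, 4, 5, 6, 7, -1};   /* -1 marks the end of the list        */
    double val[N]  = {1, 2, 3, 4, 5, 6, 7, 8};
    int    nnew[N];
    double vnew[N];

    for (int step = 0; step < 3; step++) {        /* log2(8) = 3 rounds                  */
        #pragma omp parallel for                  /* conceptually: all nodes at once     */
        for (int i = 0; i < N; i++) {
            if (next[i] != -1) {
                vnew[i] = val[i] + val[next[i]];  /* add in the successor's value        */
                nnew[i] = next[next[i]];          /* jump the pointer over the successor */
            } else {
                vnew[i] = val[i];
                nnew[i] = -1;
            }
        }
        memcpy(val, vnew, sizeof val);            /* double buffering keeps rounds synchronous */
        memcpy(next, nnew, sizeof next);
    }
    for (int i = 0; i < N; i++)
        printf("%.0f ", val[i]);                  /* prints: 36 35 33 30 26 21 15 8 */
    printf("\n");
    return 0;
}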
64
Euler Tour Technique
Tree Problems
  • Preorder numbering
  • Postorder numbering
  • Number of Descendants
  • Level of each node
  • To solve such problems, first transform the tree by linearizing it into a linked list and then apply the prefix computation

65
Computing Level of Each Node by Euler Tour Technique
(Figure: an example tree, its Euler tour, the +1/-1 weight assignment w(<u,v>) on the tour edges, and the resulting prefix sums pw(<u,v>).)
level(v) = pw(<v,parent(v)>),   level(root) = 0
66
Computing Number of Descendants by Euler Tour Technique
(Figure: the same example tree and Euler tour with a 0/1 weight assignment w(<u,v>) and the resulting prefix sums pw(<u,v>).)
# of descendants(v) = pw(<parent(v),v>) - pw(<v,parent(v)>),   # of descendants(root) = n
67
Preorder Numbering by Euler Tour Technique
(Figure: the example tree and Euler tour with a 0/1 weight assignment w(<u,v>) and the resulting prefix sums pw(<u,v>).)
preorder(v) = 1 + pw(<v,parent(v)>),   preorder(root) = 1
68
Postorder Numbering by Euler Tour Technique
(Figure: the example tree and Euler tour with a 0/1 weight assignment w(<u,v>) and the resulting prefix sums pw(<u,v>).)
postorder(v) = pw(<parent(v),v>),   postorder(root) = n
69
Binary Tree Traversal
  • Preorder
  • Inorder
  • Postorder

70
Brent's Theorem
  • Given a parallel algorithm with computation time (depth) D, if the parallel algorithm performs W operations in total, then P processors can execute the algorithm in time D + (W - D)/P
  • For the proof, consider the DAG representation of the computation

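A worked instance (added illustration): the parallel summation of N numbers performs W = N - 1 additions at depth D = log2(N), so by Brent's theorem P processors run it in about log2(N) + (N - 1 - log2(N))/P time; choosing P = N/log2(N) gives O(log N) time with O(N) total work, i.e. a work-efficient schedule.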
71
Work Efficiency
  • Parallel Summation
  • Parallel Prefix Summation