Parallel Computing - PowerPoint PPT Presentation

About This Presentation
Title:

Parallel Computing

Description:

Parallel Computer Examples. Parallel Programming Models. Parallel Computing in Bioinformatics ... 4 machines on desks, 4 in the shelf. dedicated GigaBit switch ...

Slides: 84
Provided by: johnr236
Transcript and Presenter's Notes



1
Parallel Computing and Bioinformatics
  • Frank Dehne
  • School of Computer Science
  • Carleton University, Ottawa, Canada
  • www.dehne.net

2
Overview
  • Parallel Computers and Parallel Computing
  • Parallel Computer Examples
  • Parallel Programming Models
  • Parallel Computing in Bioinformatics
  • Parallel BLAST
  • Parallel Clustal
  • Parallel Minimum Vertex Cover

3
Parallel Computers and Parallel Computing
4
(Diagram: a sequential computer, processors connected through an interconnect to memory, vs. a parallel computer, multiple processors connected through an interconnect to shared memory)
5
1) Parallel computing for performance
(Diagram: one computation split across multiple processors)
6
2) Parallel computing for throughput
  • SPPS: serial program, parallel subsystem
  • Examples
  • Web serving
  • Render farms
  • Your average enterprise server
  • NCBI

(Diagram: requests are dispatched to parallel servers and the results collected)
7
3) Parallel computing for dependability
(Diagram: an active system and a standby replica on an interconnect; the metric is total accumulated outages per year)
8
Parallel vs. Distributed
  • Parallel
  • Tightly coupled.
  • In one physical location.
  • All system parameters known.
  • Distributed
  • Loosely coupled.
  • Distributed over many locations.
  • System is dynamic and its parameters are unknown.

9
What is a parallel algorithm?
  • An algorithm designed to make use of multiple
    processors
  • Highly dependent on the machine architecture!
  • No analogue to the von Neumann model

10
Why Study Parallelism?
  • Fundamental Issues
  • What can be parallelized?
  • When is linear speedup possible?
  • What is the minimum number of steps required to
    compute X?
  • Practical Concerns
  • Computationally intensive problems
  • Data intensive problems
  • Real-time constraints
  • Need for fault tolerance
  • There are a lot of cheap PCs around that you
    might want to reuse

11
Parallel Computer Examples
12
Cray XT4
14
Cray X1E
15
Cray X1 Node
  • Cray X1 builds a larger virtual vector, called
    an MSP
  • 4 SSPs (each a 2-pipe vector processor) make up
    an MSP
  • Compiler will (try to) vectorize/parallelize
    across the MSP

(Figure: Cray X1 MSP node built from custom blocks; 12.8 Gflops (64 bit), 25.6 Gflops (32 bit); 2 MB Ecache, 25-41 GB/s; at a frequency of 400/800 MHz; 25.6 GB/s to local memory and network; 12.8 - 20.5 GB/s. Figure source: J. Levesque, Cray)
16
Cray X1 Parallel Vector Architecture
  • Cray combines several technologies in the X1
  • 12.8 Gflop/s Vector processors (MSP)
  • Shared caches (unusual on earlier vector
    machines)
  • 4 processor nodes sharing up to 64 GB of memory
  • Single System Image to 4096 Processors
  • Remote put/get between nodes (faster than message
    passing)

17
IBM Blue Gene
18
IBM Blue Gene
19
Processor Clusters
LINUX PCs on a fast switch
64 processors on a GigaBit switch
20
HPCVL Cluster
  • CISCO 6502 switch
  • Redhat Linux
  • Sun Grid Engine Enterprise Edition Scheduler
  • LAM-MPI
  • GNU Toolchain

128 processors
21
Lab cluster
  • 8 Intel Core-2 Duo (16 cores) with 4GB memory
    each
  • 4 machines on desks, 4 in the shelf
  • dedicated GigaBit switch
  • Linux Redhat

22
HPCVL SunFire
360 processors
23
Multi Core Processors
  • several processors on one chip
  • a reaction to the performance barrier, caused
    mainly by overheating
  • instead of increasing the clock rate, use parallelism

24
Intel Core 2 Duo
25
IBM Cell processor
26
IBM Cell processor
27
SUN UltraSPARC T1
28
SUN UltraSPARC T1
  • 8 cores
  • 4 hardware supported threads per core
  • 32 hardware supported threads

29
Parallel Programming Models
30
Models of parallel computation
  • Historically (1970s - early 1990s), each parallel
    machine was unique, along with its programming
    model and language
  • Nowadays we separate the programming model from
    the underlying machine model.
  • 3 or 4 dominant programming models
  • This is still research -- HPCS study is about
    comparing models
  • Can now write portably correct code that runs on
    lots of machines
  • Writing portably fast code requires tuning for
    the architecture
  • Not always worth it; sometimes programmer time
    is more important
  • Challenge: design algorithms to make this tuning
    easy

31
Summary of models
  • Programming models
  • Shared memory
  • Message passing
  • Data parallel
  • Machine models
  • Shared memory
  • Distributed memory cluster
  • SIMD and vectors
  • Hybrids

32
A generic parallel architecture
(Diagram: processors P, each with a local memory M, connected by an interconnection network; a global memory is attached to the network)
Where is the memory physically located?
33
Simple example: sum of f(A[i]) for i = 1 to n
  • Parallel decomposition
  • Each evaluation of f and each partial sum is a
    task
  • Assign n/p numbers to each of p processes
  • each computes independent private results and
    partial sum
  • one (or all) collects the p partial sums and
    computes the global sum
  • Classes of Data
  • (Logically) Shared
  • the original n numbers, the global sum
  • (Logically) Private
  • the individual function values
  • what about the individual partial sums?

34
Programming Model 1: Shared Memory
  • Program is a collection of threads of control.
  • Can be created dynamically, mid-execution, in
    some languages
  • Each thread has a set of private variables, e.g.,
    local stack variables
  • Also a set of shared variables, e.g., static
    variables, shared common blocks, or global heap.
  • Threads communicate implicitly by writing and
    reading shared variables.
  • Threads coordinate by synchronizing on shared
    variables

(Diagram: threads P0 ... Pn, each with private memory; a shared variable s lives in shared memory and is read and written by all threads)
35
Shared Memory Code for Computing a Sum
static int s = 0

Thread 1:
  for i = 0, n/2-1:
    s = s + f(A[i])

Thread 2:
  for i = n/2, n-1:
    s = s + f(A[i])
  • Problem: a race condition on variable s in the
    program
  • A race condition or data race occurs when
  • two processors (or two threads) access the same
    variable, and at least one does a write.
  • The accesses are concurrent (not synchronized) so
    they could happen simultaneously

36
Improved Code for Computing a Sum
static int s = 0

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1:
    local_s1 = local_s1 + f(A[i])
  s = s + local_s1

Thread 2:
  local_s2 = 0
  for i = n/2, n-1:
    local_s2 = local_s2 + f(A[i])
  s = s + local_s2
  • Since addition is associative, it's OK to
    rearrange the order
  • Most computation is on private variables
  • Sharing frequency is also reduced, which might
    improve speed
  • But there is still a race condition on the update
    of shared s
  • The race condition can be fixed by adding locks
  • Only one thread can hold a lock at a time; others
    wait for it

37
Shared memory programming model
  • Mostly used for machines with small numbers of
    processors.
  • Popular Programming Languages/Libraries
  • OpenMP: http://www.openmp.org/,
    http://www.llnl.gov/computing/tutorials/openMP/
  • Intel Threading Building Blocks:
    http://www.threadingbuildingblocks.org/

38
Programming Model 2: Message Passing
  • Program consists of a collection of named
    processes.
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO
    shared data.
  • Logically shared data is partitioned over local
    processes.
  • Processes communicate by explicit send/receive
    pairs
  • Coordination is implicit in every communication
    event.

(Diagram: processes P0 ... Pn, each with its own private memory only, connected by a network)
39
Message Passing Code for Computing a Sum
Processor 1:
  for i = 0, n/2-1:
    s = s + f(A[i])
  send proc2, s
  receive proc2, remote_s
  s = s + remote_s

Processor 2:
  for i = n/2, n-1:
    s = s + f(A[i])
  send proc1, s
  receive proc1, remote_s
  s = s + remote_s
  • send/receive acts like the telephone system or
    post office
  • a deadlock occurs if the send/receive are in
    different order

40
Message-passing programming model
  • Programming Language/Library: MPI (has become
    the de facto standard)
  • MPICH: http://www-unix.mcs.anl.gov/mpi/
  • LAM MPI: http://www.lam-mpi.org/
  • OpenMPI: http://www.open-mpi.org/

41
Programming Model 3: Data Parallel
  • Single thread of control consisting of parallel
    operations.
  • Parallel operations applied to all (or a defined
    subset) of a data structure, usually an array
  • Communication is implicit in parallel operators
  • Elegant and easy to understand and reason about
  • Matlab and APL are sequential data-parallel
    languages
  • Matlab*P: an experimental data-parallel version of
    Matlab
  • Drawbacks
  • Not all problems fit this model
  • Difficult to map onto coarse-grained machines

A = array of all data
fA = f(A)
s = sum(fA)
42
Vector Processors
  • Vector instructions operate on a vector of
    elements
  • These are specified as operations on vector
    registers
  • A supercomputer vector register holds 32-64
    elements
  • The number of elements is larger than the amount
    of parallel hardware, called vector pipes or
    lanes, say 2-4
  • The hardware performs a full vector operation in
    (elements per vector register) / (number of pipes) steps

r3 = r1 + r2
(logically, performs #elements adds in parallel;
actually, performs #pipes adds in parallel)
43
Machine Model 4: Hybrids (Catchall Category)
  • Most modern high-performance machines are hybrids
    of several of these categories
  • Cluster of shared-memory processors
  • Cluster of multi core processors
  • Cray X1: a more complicated hybrid of vector,
    shared-memory, and cluster
  • What's the right programming model for these?

44
Parallel Computing in Bioinformatics
45
Parallel BLAST
46
Basic BLAST Algorithm
  • Build words: find short, statistically
    significant sub-sequences in the query
  • Find seeds: scan sequences in the database for
    matching words
  • Extend: use (nearby) seeds to form local
    alignments called HSPs
  • Score: combine groups of consistent HSPs into the
    local alignment with the best score

47
Parallel BLAST Shared Memory
  • NCBI BLAST, Washington Univ. BLAST: multithreading

48
Parallel BLAST Distributed Memory
  • BeoBLAST, Hi-per BLAST: replicated database

49
Parallel BLAST Distributed Memory
  • mpiBLAST: distributed database

50
Parallel CLUSTAL
51
Multiple Sequence Alignment
Clustal W
52
Sequential Clustal W
53
Sequential Clustal W
1. Do pairwise alignment of all sequences
and calculate the distance matrix:

  Scerevisiae  1
  Celegans     2  0.640
  Drosophila   3  0.634  0.327
  Human        4  0.630  0.408  0.420
  Mouse        5  0.619  0.405  0.469  0.289

2. Create a guide tree based on this
pairwise distance matrix.
3. Align progressively following the guide tree:
start by aligning the most closely related pairs of
sequences; at each step align two sequences or
one sequence to an existing subalignment.
54
Parallel Clustal
  • Parallel pairwise (PW) alignment matrix
  • Parallel guide tree calculation
  • Parallel progressive alignment

  Scerevisiae  1
  Celegans     2  0.640
  Drosophila   3  0.634  0.327
  Human        4  0.630  0.408  0.420
  Mouse        5  0.619  0.405  0.469  0.289
55
Parallel Clustal
56
Parallel Clustal
57
Our Parallel Clustal vs. SGI
SGI data taken from "Performance Optimization of
Clustal W: Parallel Clustal W, HT Clustal, and
MULTICLUSTAL" by Dmitri Mikhailov, Haruna Cofer,
and Roberto Gomperts
58
Parallel Clustal Extension
  • Minimum Vertex Cover
  • remove erroneous sequences (e.g. data corrupted
    by measurement error)
  • identify clusters of highly similar sequences
    (these could be multiple measurements of the same
    gene or protein sequence; such sets corrupt
    CLUSTAL's progressive alignment scheme)

59
Minimum Vertex Cover
  • Conflict Graph
  • vertex = sequence
  • edge = conflict (pairwise alignment with very poor
    score or with very good score)
  • TASK: remove the smallest set of sequences that
    eliminates all conflicts
  • NP-complete!

60
Fixed Parameter Tractability
  • Idea Many reduction proofs for NP-completeness
    use instances that are not relevant in practice
  • An NP-complete problem P is fixed parameter
    tractable if every instance can be characterized
    by two parameters (n, k) such that P(n, k) is
    solvable in time poly(n) · f(k)

61
FPT Methods
  • Phase 1
  • Kernelization
  • Reduce problem size from (n,k) to (g(k),k)
  • Phase 2
  • Bounded Tree Search
  • Exhaustive tree search
  • time f(k)
  • exponential in g(k)

62
Kernelization
  • Buss's Algorithm for k-vertex cover:
  • Let G = (V, E) and let S be the subset of vertices
    with degree k or more.
  • Remove S and all incident edges:
  • G -> G', k -> k' = k - |S|.
  • IF G' has more than k x k' edges
  • THEN no k-vertex cover exists
  • ELSE start bounded tree search on G'

63
Bounded Tree Search
64
Case 1: a simple path of length 3 (v - v1 - v2 - v3)
in graph G'
Search tree: the current cover VC branches into
VC ∪ {v, v2}, VC ∪ {v1, v2}, and VC ∪ {v1, v3};
remove the selected vertices from G'; k' = k' - 2
65
Case 2: a 3-cycle (v, v1, v2)
in graph G'
Search tree: the current cover VC branches into
VC ∪ {v, v1}, VC ∪ {v1, v2}, and VC ∪ {v, v2};
remove the selected vertices from G'; k' = k' - 2
66
Case 3: a simple path of length 2 (v - v1 - v2)
in graph G'
Search tree: extend the current cover VC to VC ∪ {v1};
remove v1, v2 from G'; k' = k' - 1
67
Case 4: a simple path of length 1 (v - v1)
in graph G'
Search tree: extend the current cover VC to VC ∪ {v};
remove v, v1 from G'; k' = k' - 1
68
Sequential Tree Search
  • Depth first search
  • backtrack when k' = 0 and G' ≠ ∅ ("dead end")
  • stop when a solution is found (G' = ∅, k' ≥ 0)

69
Parallel Bounded Tree Search
  • Depth first search
  • backtrack when k' = 0 and G' ≠ ∅ ("dead end")
  • stop when a solution is found (G' = ∅, k' ≥ 0)

(Diagram: a sequential breadth-first phase expands the tree into p subtrees, numbered 1 to p, which are then searched depth-first in parallel)
70
Analysis Balls-in-bins
sequential depth-first search path: total
length L, number of solutions m
expected sequential time (random distribution): L / (m+1)
parallel search paths: p
expected parallel time (random distribution): L / (p (m+1))
expected speedup: p / (1 + (m+1)/L)
if m << L then expected speedup ≈ p
71
Simulation Experiment
72
Implementation
  • test platform:
  • 32-node HPCVL Beowulf cluster
  • gcc and LAM/MPI on Linux Redhat
  • code-s: sequential k-vertex cover
  • code-p: parallel k-vertex cover

73
Test Data
  • n protein sequences
  • same protein from n species
  • each protein sequence a few hundred amino acid
    residues in length
  • obtained from the National Center for
    Biotechnology Information
    (http://www.ncbi.nlm.nih.gov/)

74
Test Data
  • Somatostatin: n = 559, k = 273, k' = 255
  • WW: n = 425, k = 322, k' = 318
  • PHD: n = 670, k = 603, k' = 603
  • Kinase: n = 647, k = 497, k' = 397
  • SH2: n = 730, k = 461, k' = 397
  • Thrombin: n = 646, k = 413, k' = 413

Previously not solvable
75
Sequential Times
Kinase, SH2, Thrombin: n/a
76
Code-p on Virtual Proc.
77
Parallel Times
78
Speedup Somatostatin
79
Speedup WW
80
Speedup Rand. Graph (easy)
81
Speedup Grid Graph (hard)
82
Clustal XP Portal
83
Recommended Reading
  • A.Y. Zomaya (Ed.), Parallel Computing for
    Bioinformatics and Computational Biology, Wiley,
    2006
  • A. Grama, A. Gupta, G. Karypis, and V. Kumar,
    Introduction to Parallel Computing, 2nd edition,
    Addison-Wesley, 2003