Parallel Computing - PowerPoint PPT Presentation

About This Presentation
Title:

Parallel Computing

Description:

Parallel Computer Examples. Parallel Programming Models. Parallel Computing in Bioinformatics ... 4 machines on desks, 4 in the shelf. dedicated GigaBit switch ...

Slides: 84
Provided by: johnr236
Transcript and Presenter's Notes



1
Parallel Computing and Bioinformatics
  • Frank Dehne
  • School of Computer Science
  • Carleton University, Ottawa, Canada
  • www.dehne.net

2
Overview
  • Parallel Computers and Parallel Computing
  • Parallel Computer Examples
  • Parallel Programming Models
  • Parallel Computing in Bioinformatics
  • Parallel BLAST
  • Parallel Clustal
  • Parallel Minimum Vertex Cover

3
Parallel Computers and Parallel Computing
4
(Diagram: a sequential computer, processors connected through an interconnect to memory, vs. a parallel computer, multiple processors connected through an interconnect to shared memory)
5
1) Parallel computing for performance
(Diagram: one computation split across multiple processors)
6
2) Parallel computing for throughput
  • SPPS: serial program, parallel subsystem
  • Examples
  • Web serving
  • Render farms
  • Your average enterprise server
  • NCBI

(Diagram: requests are dispatched to parallel servers and the results collected)
7
3) Parallel computing for dependability
(Diagram: an active system and a standby replica on an interconnect; the metric is total accumulated outages per year)
8
Parallel vs. Distributed
  • Parallel
  • Tightly coupled.
  • In one physical location.
  • All system parameters known.
  • Distributed
  • Loosely coupled.
  • Distributed over many locations.
  • System is dynamic and its parameters are unknown.

9
What is a parallel algorithm?
  • An algorithm designed to make use of multiple
    processors
  • Highly dependent on the machine architecture!
  • No analogue to the von Neumann model

10
Why Study Parallelism?
  • Fundamental Issues
  • What can be parallelized?
  • When is linear speedup possible?
  • What is the minimum number of steps required to
    compute X?
  • Practical Concerns
  • Computationally intensive problems
  • Data intensive problems
  • Real-time constraints
  • Need for fault tolerance
  • There are a lot of cheap PCs around that you
    might want to reuse

11
Parallel Computer Examples
12
Cray XT4
14
Cray X1E
15
Cray X1 Node
  • Cray X1 builds a larger virtual vector, called
    an MSP
  • 4 SSPs (each a 2-pipe vector processor) make up
    an MSP
  • Compiler will (try to) vectorize/parallelize
    across the MSP

(Figure: Cray X1 MSP node built from custom blocks; 12.8 Gflops (64 bit), 25.6 Gflops (32 bit); 2 MB Ecache, 25-41 GB/s; at a frequency of 400/800 MHz; 25.6 GB/s to local memory and network; 12.8 - 20.5 GB/s. Figure source: J. Levesque, Cray)
16
Cray X1 Parallel Vector Architecture
  • Cray combines several technologies in the X1
  • 12.8 Gflop/s Vector processors (MSP)
  • Shared caches (unusual on earlier vector
    machines)
  • 4 processor nodes sharing up to 64 GB of memory
  • Single System Image to 4096 Processors
  • Remote put/get between nodes (faster than message
    passing)

17
IBM Blue Gene
18
IBM Blue Gene
19
Processor Clusters
LINUX PCs on a fast switch
64 processors on a GigaBit switch
20
HPCVL Cluster
  • CISCO 6502 switch
  • Redhat Linux
  • Sun Grid Engine Enterprise Edition Scheduler
  • LAM-MPI
  • GNU Toolchain

128 processors
21
Lab cluster
  • 8 Intel Core-2 Duo (16 cores) with 4GB memory
    each
  • 4 machines on desks, 4 in the shelf
  • dedicated GigaBit switch
  • Linux Redhat

22
HPCVL SunFire
360 processors
23
Multi Core Processors
  • several processors on one chip
  • a reaction to the performance barrier, caused
    mainly by overheating
  • instead of increasing the clock rate, use parallelism

24
Intel Core 2 Duo
25
IBM Cell processor
26
IBM Cell processor
27
SUN UltraSPARC T1
28
SUN UltraSPARC T1
  • 8 cores
  • 4 hardware supported threads per core
  • 32 hardware supported threads

29
Parallel Programming Models
30
Models of parallel computation
  • Historically (1970s - early 1990s), each parallel
    machine was unique, along with its programming
    model and language
  • Nowadays we separate the programming model from
    the underlying machine model.
  • 3 or 4 dominant programming models
  • This is still research -- HPCS study is about
    comparing models
  • Can now write portably correct code that runs on
    lots of machines
  • Writing portably fast code requires tuning for
    the architecture
  • Not always worth it; sometimes programmer time
    is more important
  • Challenge: design algorithms to make this tuning
    easy

31
Summary of models
  • Programming models
  • Shared memory
  • Message passing
  • Data parallel
  • Machine models
  • Shared memory
  • Distributed memory cluster
  • SIMD and vectors
  • Hybrids

32
A generic parallel architecture
(Diagram: processors P, each with a local memory M, connected by an interconnection network; a global memory is attached to the network)
Where is the memory physically located?
33
Simple example: sum of f(A[i]) for i = 1 to n
  • Parallel decomposition
  • Each evaluation of f and each partial sum is a
    task
  • Assign n/p numbers to each of p processes
  • each computes independent private results and
    partial sum
  • one (or all) collects the p partial sums and
    computes the global sum
  • Classes of Data
  • (Logically) Shared
  • the original n numbers, the global sum
  • (Logically) Private
  • the individual function values
  • what about the individual partial sums?

34
Programming Model 1: Shared Memory
  • Program is a collection of threads of control.
  • Can be created dynamically, mid-execution, in
    some languages
  • Each thread has a set of private variables, e.g.,
    local stack variables
  • Also a set of shared variables, e.g., static
    variables, shared common blocks, or global heap.
  • Threads communicate implicitly by writing and
    reading shared variables.
  • Threads coordinate by synchronizing on shared
    variables

(Diagram: threads P0 ... Pn, each with private memory; a shared variable s lives in shared memory and is read and written by all threads)
35
Shared Memory Code for Computing a Sum
static int s = 0

Thread 1:
  for i = 0, n/2-1:
    s = s + f(A[i])

Thread 2:
  for i = n/2, n-1:
    s = s + f(A[i])
  • Problem: a race condition on variable s in the
    program
  • A race condition or data race occurs when
  • two processors (or two threads) access the same
    variable, and at least one does a write.
  • The accesses are concurrent (not synchronized) so
    they could happen simultaneously

36
Improved Code for Computing a Sum
static int s = 0

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1:
    local_s1 = local_s1 + f(A[i])
  s = s + local_s1

Thread 2:
  local_s2 = 0
  for i = n/2, n-1:
    local_s2 = local_s2 + f(A[i])
  s = s + local_s2
  • Since addition is associative, it's OK to
    rearrange the order
  • Most computation is on private variables
  • Sharing frequency is also reduced, which might
    improve speed
  • But there is still a race condition on the update
    of shared s
  • The race condition can be fixed by adding locks
  • Only one thread can hold a lock at a time; others
    wait for it

37
Shared memory programming model
  • Mostly used for machines with small numbers of
    processors.
  • Popular Programming Languages/Libraries
  • OpenMP: http://www.openmp.org/,
    http://www.llnl.gov/computing/tutorials/openMP/
  • Intel Threading Building Blocks:
    http://www.threadingbuildingblocks.org/

38
Programming Model 2: Message Passing
  • Program consists of a collection of named
    processes.
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO
    shared data.
  • Logically shared data is partitioned over local
    processes.
  • Processes communicate by explicit send/receive
    pairs
  • Coordination is implicit in every communication
    event.

(Diagram: processes P0 ... Pn, each with its own private memory only, connected by a network)
39
Message Passing Code for Computing a Sum
Processor 1:
  for i = 0, n/2-1:
    s = s + f(A[i])
  send proc2, s
  receive proc2, remote_s
  s = s + remote_s

Processor 2:
  for i = n/2, n-1:
    s = s + f(A[i])
  send proc1, s
  receive proc1, remote_s
  s = s + remote_s
  • send/receive acts like the telephone system or
    post office
  • a deadlock occurs if the send/receive are in
    different order

40
Message-passing programming model
  • Programming Language/Library: MPI (has become
    the de facto standard)
  • MPICH: http://www-unix.mcs.anl.gov/mpi/
  • LAM MPI: http://www.lam-mpi.org/
  • OpenMPI: http://www.open-mpi.org/

41
Programming Model 3: Data Parallel
  • Single thread of control consisting of parallel
    operations.
  • Parallel operations applied to all (or a defined
    subset) of a data structure, usually an array
  • Communication is implicit in parallel operators
  • Elegant and easy to understand and reason about
  • Matlab and APL are sequential data-parallel
    languages
  • Matlab*P: an experimental data-parallel version of
    Matlab
  • Drawbacks
  • Not all problems fit this model
  • Difficult to map onto coarse-grained machines

A = array of all data
fA = f(A)
s = sum(fA)
42
Vector Processors
  • Vector instructions operate on a vector of
    elements
  • These are specified as operations on vector
    registers
  • A supercomputer vector register holds 32-64
    elements
  • The number of elements is larger than the amount
    of parallel hardware, called vector pipes or
    lanes, say 2-4
  • The hardware performs a full vector operation in
    (elements per vector register) / (number of pipes) steps

r3 = r1 + r2
(logically, performs #elements adds in parallel;
actually, performs #pipes adds in parallel)
43
Machine Model 4: Hybrids (Catchall Category)
  • Most modern high-performance machines are hybrids
    of several of these categories
  • Cluster of shared-memory processors
  • Cluster of multi core processors
  • Cray X1: a more complicated hybrid of vector,
    shared-memory, and cluster
  • What's the right programming model for these?

44
Parallel Computing in Bioinformatics
45
Parallel BLAST
46
Basic BLAST Algorithm
  • Build words: find short, statistically
    significant sub-sequences in the query
  • Find seeds: scan sequences in the database for
    matching words
  • Extend: use (nearby) seeds to form local
    alignments called HSPs
  • Score: combine groups of consistent HSPs into the
    local alignment with the best score

47
Parallel BLAST Shared Memory
  • NCBI BLAST, Washington Univ. BLAST: multithreading

48
Parallel BLAST Distributed Memory
  • BeoBLAST, Hi-per BLAST: replicated database

49
Parallel BLAST Distributed Memory
  • mpiBLAST: distributed database

50
Parallel CLUSTAL
51
Multiple Sequence Alignment
Clustal W
52
Sequential Clustal W
53
Sequential Clustal W
1. Do pairwise alignment of all sequences
and calculate the distance matrix:

  Scerevisiae  1
  Celegans     2  0.640
  Drosophila   3  0.634  0.327
  Human        4  0.630  0.408  0.420
  Mouse        5  0.619  0.405  0.469  0.289

2. Create a guide tree based on this
pairwise distance matrix.
3. Align progressively following the guide tree:
start by aligning the most closely related pairs of
sequences; at each step align two sequences or
one sequence to an existing subalignment.
54
Parallel Clustal
  • Parallel pairwise (PW) alignment matrix
  • Parallel guide tree calculation
  • Parallel progressive alignment

  Scerevisiae  1
  Celegans     2  0.640
  Drosophila   3  0.634  0.327
  Human        4  0.630  0.408  0.420
  Mouse        5  0.619  0.405  0.469  0.289
55
Parallel Clustal
56
Parallel Clustal
57
Our Parallel Clustal vs. SGI
SGI data taken from "Performance Optimization of
Clustal W: Parallel Clustal W, HT Clustal, and
MULTICLUSTAL" by Dmitri Mikhailov, Haruna Cofer,
and Roberto Gomperts
58
Parallel Clustal Extension
  • Minimum Vertex Cover
  • remove erroneous sequences (e.g. data corrupted
    by measurement error)
  • identify clusters of highly similar sequences
    (these could be multiple measurements of the same
    gene or protein sequence; such sets corrupt
    CLUSTAL's progressive alignment scheme)

59
Minimum Vertex Cover
  • Conflict Graph
  • vertex = sequence
  • edge = conflict (pairwise alignment with very poor
    score or with very good score)
  • TASK: remove the smallest set of sequences that
    eliminates all conflicts
  • NP-complete!

60
Fixed Parameter Tractability
  • Idea Many reduction proofs for NP-completeness
    use instances that are not relevant in practice
  • An NP-complete problem P is fixed parameter
    tractable if every instance can be characterized
    by two parameters (n, k) such that P(n, k) is
    solvable in time poly(n) · f(k)

61
FPT Methods
  • Phase 1
  • Kernelization
  • Reduce problem size from (n,k) to (g(k),k)
  • Phase 2
  • Bounded Tree Search
  • Exhaustive tree search
  • time f(k)
  • exponential in g(k)

62
Kernelization
  • Buss's Algorithm for k-vertex cover:
  • Let G = (V, E) and let S be the subset of vertices
    with degree k or more.
  • Remove S and all incident edges:
  • G -> G', k -> k' = k - |S|.
  • IF G' has more than k x k' edges
  • THEN no k-vertex cover exists
  • ELSE start bounded tree search on G'

63
Bounded Tree Search
64
Case 1: a simple path of length 3 (v - v1 - v2 - v3)
in graph G'
Search tree: the current cover VC branches into
VC ∪ {v, v2}, VC ∪ {v1, v2}, and VC ∪ {v1, v3};
remove the selected vertices from G'; k' = k' - 2
65
Case 2: a 3-cycle (v, v1, v2)
in graph G'
Search tree: the current cover VC branches into
VC ∪ {v, v1}, VC ∪ {v1, v2}, and VC ∪ {v, v2};
remove the selected vertices from G'; k' = k' - 2
66
Case 3: a simple path of length 2 (v - v1 - v2)
in graph G'
Search tree: extend the current cover VC to VC ∪ {v1};
remove v1, v2 from G'; k' = k' - 1
67
Case 4: a simple path of length 1 (v - v1)
in graph G'
Search tree: extend the current cover VC to VC ∪ {v};
remove v, v1 from G'; k' = k' - 1
68
Sequential Tree Search
  • Depth first search
  • backtrack when k' = 0 and G' ≠ ∅ ("dead end")
  • stop when a solution is found (G' = ∅, k' ≥ 0)

69
Parallel Bounded Tree Search
  • Depth first search
  • backtrack when k' = 0 and G' ≠ ∅ ("dead end")
  • stop when a solution is found (G' = ∅, k' ≥ 0)

(Diagram: a sequential breadth-first phase expands the tree into p subtrees, numbered 1 to p, which are then searched depth-first in parallel)
70
Analysis Balls-in-bins
sequential depth-first search path: total
length L, number of solutions m
expected sequential time (random distribution): L / (m+1)
parallel search paths: p
expected parallel time (random distribution): L / (p (m+1))
expected speedup: p / (1 + (m+1)/L)
if m << L then expected speedup ≈ p
71
Simulation Experiment
72
Implementation
  • test platform:
  • 32-node HPCVL Beowulf cluster
  • gcc and LAM/MPI on Linux Redhat
  • code-s: sequential k-vertex cover
  • code-p: parallel k-vertex cover

73
Test Data
  • n protein sequences
  • same protein from n species
  • each protein sequence a few hundred amino acid
    residues in length
  • obtained from the National Center for
    Biotechnology Information
    (http://www.ncbi.nlm.nih.gov/)

74
Test Data
  • Somatostatin: n = 559, k = 273, k' = 255
  • WW: n = 425, k = 322, k' = 318
  • PHD: n = 670, k = 603, k' = 603
  • Kinase: n = 647, k = 497, k' = 397
  • SH2: n = 730, k = 461, k' = 397
  • Thrombin: n = 646, k = 413, k' = 413

Previously not solvable
75
Sequential Times
Kinase, SH2, Thrombin: n/a
76
Code-p on Virtual Proc.
77
Parallel Times
78
Speedup Somatostatin
79
Speedup WW
80
Speedup Rand. Graph (easy)
81
Speedup Grid Graph (hard)
82
Clustal XP Portal
83
Recommended Reading
  • A.Y. Zomaya (Ed.), Parallel Computing for
    Bioinformatics and Computational Biology, Wiley,
    2006
  • A. Grama, A. Gupta, G. Karypis, and V. Kumar,
    Introduction to Parallel Computing, 2nd edition,
    Addison-Wesley, 2003