Parallel Computational Biochemistry - PowerPoint PPT Presentation

About This Presentation

Title:

Parallel Computational Biochemistry

Slides: 52

Provided by: peopleScs3

Transcript and Presenter's Notes

1
Parallel Computational Biochemistry
2
Proteins, DNA, etc.
DNA encodes the information necessary to
produce proteins
Proteins are the main molecular building blocks
of life (for example, structural proteins,
enzymes)
3
Proteins, DNA, etc.

Proteins are formed from a chain of molecules
called amino acids

4
Proteins, DNA, etc.

The DNA sequence encodes the amino acid sequence
that constitutes the protein

5
Proteins, DNA, etc.

There are twenty amino acids found in proteins,
denoted by A, C, D, E, F, G, H, I, ...

6
Multiple Sequence Alignment
7
Databases of Biological Sequences
NCBI: 14,976,310 sequences, 15,849,921,438
nucleotides
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKW
VHDPENMAAGLVSG DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWS
RIFPNPLPRPQNFDE SKQDVTEVEINENELKRLDEYANKDALNHYREIF
KDLKSRGLYFILNMYH WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEF
ARFSAYIAWKFDDLVDE YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELS
RRHMYNIIQAHARAYDGI KSVSKKPVGIIYANSSFQPLTDKDMEAVEMA
ENDNRWWFFDAIIRGEITR GNEKIVRDDLKGRLDWIGVNYYTRTVVKRT
EKGYVSLGGYGHGCERNSVS LAGLPTSDFGWEFFPEGLYDVLTKYWNRY
HLYMYVTENGIADDADYQRPY YLVSHVYQVHRAINSGADVRGYLHWSLA
DNYEWASGFSMRFGLLKVDYNT KRLYWRPSALVYREIATNGAITDEIEH
LNSVPPVKPLRH
Swiss-Prot: 104,559 sequences,
38,460,707 residues
PDB: 17,175 structures
8
Sequence comparison

Compare one sequence (target) to many sequences
(database search)
Compare more than two sequences simultaneously

9
Applications

Phylogenetic analysis
Identification of conserved motifs and domains
Structure prediction

10
(No Transcript)
11
Phylogenetic Analysis
12
Structure Prediction
>RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPN
TDWYKWVHDPENMAAGLVSG DLPENGPGYWGNYKTFHDNAQKMGLKIAR
LNVEWSRIFPNPLPRPQNFDE SKQDVTEVEINENELKRLDEYANKDALN
HYREIFKDLKSRGLYFILNMYH WPLPLWLHDPIRVRRGDFTGPSGWLST
RTVYEFARFSAYIAWKFDDLVDE YSTMNEPNVVGGLGYVGVKSGFPPGY
LSFELSRRHMYNIIQAHARAYDGI KSVSKKPVGIIYANSSFQPLTDKDM
EAVEMAENDNRWWFFDAIIRGEITR GNEKIVRDDLKGRLDWIGVNYYTR
TVVKRTEKGYVSLGGYGHGCERNSVS LAGLPTSDFGWEFFPEGLYDVLT
KYWNRYHLYMYVTENGIADDADYQRPY YLVSHVYQVHRAINSGADVRGY
LHWSLADNYEWASGFSMRFGLLKVDYNT KRLYWRPSALVYREIATNGAI
TDEIEHLNSVPPVKPLRH
Protein sequences
Protein structures
Genomic sequences
13
Our Contributions

Parallel min vertex cover for improved sequence
alignments
(to appear in Journal of Computer and System
Sciences)
Parallel Clustal W (ICCSA 2003)
In progress: Clustal XP portal at
http://cgm.dehne.net

14
Clustal W
15
Progressive Alignment
1. Do pairwise alignment of all sequences
and calculate distance matrix

Scerevisiae  1
Celegans     2  0.640
Drosophila   3  0.634  0.327
Human        4  0.630  0.408  0.420
Mouse        5  0.619  0.405  0.469  0.289

2. Create a guide tree based on this
pairwise distance matrix
3. Align progressively following the guide tree:
start by aligning the most closely related pairs of
sequences; at each step, align two sequences or
one sequence to an existing subalignment
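Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration: it uses UPGMA (average-linkage clustering) as a simple stand-in for the tree-building method, which is an assumption and not necessarily what Clustal W does internally. The function name `upgma` and the nested-tuple tree representation are hypothetical; the distances are the ones from the matrix on this slide.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree by repeatedly merging the two closest clusters.

    names: list of leaf labels.
    dist:  dict mapping a pair (a, b) of labels to their distance.
    Returns the tree as nested tuples.
    """
    size = {name: 1 for name in names}            # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Find the pair of current clusters with the smallest distance.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances (UPGMA rule).
        for c in size:
            if c != a and c != b:
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))


# Lower-triangular distance matrix from the slide.
names = ["Scerevisiae", "Celegans", "Drosophila", "Human", "Mouse"]
lower = [[], [0.640], [0.634, 0.327], [0.630, 0.408, 0.420],
         [0.619, 0.405, 0.469, 0.289]]
dist = {(names[i], names[j]): lower[i][j]
        for i in range(len(names)) for j in range(i)}
tree = upgma(names, dist)
print(tree)
```

Human and Mouse have the smallest distance (0.289), so they are joined first; the progressive alignment in step 3 would then follow the tree from such closely related pairs outward.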
16
Parallel Clustal

Parallel pairwise (PW) alignment matrix
Parallel guide tree calculation
Parallel progressive alignment

Scerevisiae  1
Celegans     2  0.640
Drosophila   3  0.634  0.327
Human        4  0.630  0.408  0.420
Mouse        5  0.619  0.405  0.469  0.289
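The pairwise stage parallelizes naturally because every entry of the matrix can be computed independently. The sketch below shows that idea with Python's standard `concurrent.futures` module; it is not the authors' implementation. The function `toy_distance` is a hypothetical placeholder for a real pairwise alignment score, and a thread pool is used only to keep the example short (CPU-bound work in Python would normally use processes or MPI).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_distance(a, b):
    """Placeholder distance (assumption): fraction of positions that differ.

    A real pipeline would compute a pairwise alignment score here.
    """
    longest = max(len(a), len(b))
    same = sum(x == y for x, y in zip(a, b))
    return 1.0 - same / longest


def pairwise_matrix(seqs, workers=4):
    """Compute all pairwise distances, one independent task per pair.

    seqs: dict mapping sequence name -> sequence string.
    Returns a dict mapping (name_a, name_b) -> distance.
    """
    pairs = list(combinations(sorted(seqs), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = pool.map(
            lambda p: toy_distance(seqs[p[0]], seqs[p[1]]), pairs)
        return dict(zip(pairs, values))


seqs = {"a": "ACDE", "b": "ACDF", "c": "GGGG"}
matrix = pairwise_matrix(seqs)
print(matrix)
```

With n sequences there are n(n-1)/2 independent tasks, which is why this stage tends to scale well as workers are added.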
17
Relative Speedup
18
Clustal XP vs. SGI

SGI data taken from "Performance Optimization of
Clustal W: Parallel Clustal W, HT Clustal, and
MULTICLUSTAL"
by Dmitri Mikhailov, Haruna Cofer, and Roberto
Gomperts

19
Parallel Clustal - Improvements

Optimization of input parameters:
scoring matrices, gap penalties; requires many
repetitive Clustal W calculations with various
input parameters.
Minimum vertex cover:
use minimum vertex cover to remove erroneous
sequences and identify clusters of highly
similar sequences.

20
Minimum Vertex Cover

Task: remove the smallest number of gene sequences
that eliminates all conflicts
NP-complete

Conflict graph:
vertex = sequence
edge = conflict (e.g. alignment with very poor
score)
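As a small illustration of the conflict graph, the sketch below turns a table of pairwise scores into edges and then finds a minimum vertex cover by brute force. The threshold rule (an edge whenever the distance exceeds a cutoff), the function names, and the sample data are all assumptions made for the example; the brute-force search is exponential and only suitable for tiny graphs.

```python
from itertools import combinations


def conflict_graph(dist, threshold):
    """Edges join two sequences whose distance exceeds the threshold."""
    return {frozenset(pair) for pair, d in dist.items() if d > threshold}


def min_vertex_cover(edges):
    """Smallest set of vertices touching every edge (brute force)."""
    vertices = sorted(set().union(*edges)) if edges else []
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            chosen = set(candidate)
            if all(edge & chosen for edge in edges):
                return chosen
    return set()


# Hypothetical distances: "bad" aligns poorly with everything else.
dist = {
    ("s1", "s2"): 0.2, ("s1", "s3"): 0.3, ("s2", "s3"): 0.1,
    ("s1", "bad"): 0.9, ("s2", "bad"): 0.8, ("s3", "bad"): 0.95,
}
edges = conflict_graph(dist, threshold=0.7)
cover = min_vertex_cover(edges)
print(cover)
```

Removing the sequences in the cover eliminates every conflict, which is exactly the task stated on this slide.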

21
FPT Algorithms

Phase 1: Kernelization
Reduce problem to size f(k)

Phase 2: Bounded Tree Search
Exhaustive tree search, exponential in f(k)

22
Kernelization

Buss's algorithm for k-vertex cover
Let G = (V, E) and let S be the subset of vertices
with degree greater than k (every such vertex must be in any k-vertex cover).
Remove S and all incident edges:
G -> G', k -> k' = k - |S|.
IF G' has more than k x k' edges
THEN no k-vertex cover exists
ELSE start bounded tree search on G'
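The kernelization step can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `buss_kernel` and the edge-set representation are assumptions, and the sketch uses the strict condition (degree greater than k), which is the usual statement of Buss's rule.

```python
def buss_kernel(edges, k):
    """Apply Buss's reduction for k-vertex cover.

    edges: iterable of 2-element vertex pairs.
    Returns (kernel_edges, k_prime, forced) or None if no k-cover exists.
    """
    edges = {frozenset(e) for e in edges}
    degree = {}
    for e in edges:
        for v in e:
            degree[v] = degree.get(v, 0) + 1
    # A vertex with more than k neighbours must be in every k-cover,
    # otherwise all of its neighbours would have to be.
    forced = {v for v, d in degree.items() if d > k}
    if len(forced) > k:
        return None
    k_prime = k - len(forced)
    kernel = {e for e in edges if not (e & forced)}
    # Remaining vertices have degree <= k, so k' of them cover
    # at most k * k' edges.
    if len(kernel) > k * k_prime:
        return None
    return kernel, k_prime, forced


star_plus_edge = [("c", "l1"), ("c", "l2"), ("c", "l3"), ("c", "l4"),
                  ("x", "y")]
result = buss_kernel(star_plus_edge, k=2)
print(result)
```

Here the star centre `c` is forced into the cover, leaving a kernel with a single edge and a remaining budget of k' = 1 for the tree search.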

23
Bounded Tree Search
24
Case 1: simple path of length 3
remove selected vertices from G'; k' -= 2
25
Case 2: 3-cycle
remove selected vertices from G'; k' -= 2
26
Case 3: simple path of length 2
remove v1, v2 from G'; k' -= 1
27
Case 4: simple path of length 1
remove v, v1 from G'; k' -= 1
28
Sequential Tree Search

Depth first search
backtrack when k' = 0 and G' ≠ ∅ ("dead end")
stop when solution found (G' = ∅, k' >= 0)
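The depth-first search with backtracking can be illustrated with the simplest possible branching rule: pick any remaining edge and try each of its endpoints in turn. This is a deliberately simplified sketch; it does not implement the four path and cycle cases from the preceding slides, and the function name `vertex_cover_search` is hypothetical.

```python
def vertex_cover_search(edges, k):
    """Depth-first bounded search for a vertex cover of size <= k.

    edges: list of (u, v) pairs. Returns a set of vertices, or None.
    """
    edges = [tuple(e) for e in edges]
    if not edges:
        return set()          # graph is empty: solution found
    if k == 0:
        return None           # edges remain but budget is spent: dead end
    u, v = edges[0]
    for pick in (u, v):       # one endpoint of this edge must be in the cover
        remaining = [e for e in edges if pick not in e]
        sub = vertex_cover_search(remaining, k - 1)
        if sub is not None:
            return sub | {pick}
    return None               # both branches failed: backtrack


path = [("a", "b"), ("b", "c"), ("c", "d")]
print(vertex_cover_search(path, 1))
cover = vertex_cover_search(path, 2)
print(cover)
```

The search tree has depth at most k and branching factor 2, so its size depends only on k and not on the size of the graph, which is what makes the approach fixed-parameter tractable.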

29
Parallel Tree Search

Basic idea:
Build the top log p levels of the search tree (T')
Every processor starts a depth-first search at one leaf
of T'
Randomize the depth-first search by selecting a random
child

30
Analysis: Balls-in-bins
sequential depth-first search path: total length L, solutions m
expected sequential time (rand. distr.): L/(m+1)
parallel search path
expected parallel time (rand. distr.): p + L/(p(m+1))
expected speedup: p / (1 + (m+1)/L)
if m << L then expected speedup ≈ p
31
Simulation Experiment
L = 1,000,000
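The balls-in-bins estimate for the sequential case can be checked with a small Monte Carlo simulation. The sketch below is not the experiment from the slides: it uses a smaller L to keep the run short, and its parallel model (the path is cut into p equal segments, each scanned by one processor from its start, with setup cost ignored) is an assumption made for illustration. All names are hypothetical.

```python
import random


def first_hit_times(L, m, p, rng):
    """Place m solutions at random on a path of length L.

    Returns (sequential_steps, parallel_steps) until the first solution
    is reached. Assumes p divides L.
    """
    solutions = rng.sample(range(L), m)
    sequential = min(solutions) + 1
    segment = L // p
    # Each processor scans its own segment; the first hit is the
    # solution with the smallest offset inside its segment.
    parallel = min(pos % segment for pos in solutions) + 1
    return sequential, parallel


def simulate(L=10_000, m=9, p=8, trials=2000, seed=0):
    """Average the two step counts over many random placements."""
    rng = random.Random(seed)
    seq_total = 0
    par_total = 0
    for _ in range(trials):
        s, q = first_hit_times(L, m, p, rng)
        seq_total += s
        par_total += q
    return seq_total / trials, par_total / trials


seq_mean, par_mean = simulate()
print(seq_mean, par_mean, seq_mean / par_mean)
```

With these parameters the sequential average should land near L/(m+1) = 1000, and the ratio of the two averages gives an empirical speedup to compare against p.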
32
Implementation

test platform:
32-node HPCVL Beowulf cluster
each node: dual 1.4 GHz Intel Xeon, 512 MB RAM,
60 GB disk
gcc and LAM/MPI on Red Hat Linux 7.2
code-s: sequential k-vertex cover
code-p: parallel k-vertex cover

33
Test Data

Protein sequences
Same protein from several hundred species
Each protein sequence a few hundred amino acid
residues in length
Obtained from the National Center for
Biotechnology Information (http://www.ncbi.nlm.nih.gov/)

34
Test Data

Somatostatin
neuropeptide involved in the regulation of many
functions in different organ systems
Clustal Threshold = 10, |V| = 559, |E| = 33652,
k = 273, k' = 255

35
Test Data

WW
small protein domain that binds proline rich
sequences in other proteins and is involved in
cellular signaling
Clustal Threshold = 10, |V| = 425, |E| = 40182,
k = 322, k' = 318

36
Test Data

Kinase
large family of enzymes involved in cellular
regulation
Clustal Threshold = 16, |V| = 647, |E| = 113122,
k = 497, k' = 397

37
Test Data

SH2 (src-homology domain 2)
involved in targeting proteins to specific sites
in cells by binding to phosphotyrosine
Clustal Threshold = 10, |V| = 730, |E| = 95463,
k = 461, k' = 397

38
Test Data

Thrombin
protease in the blood coagulation cascade that
promotes blood clotting by converting
fibrinogen to fibrin
Clustal Threshold = 15, |V| = 646, |E| = 62731,
k = 413, k' = 413

39
Test Data

PHD (pleckstrin homology domain)
involved in cellular signaling
Clustal Threshold = 10, |V| = 670, |E| = 147054,
k = 603, k' = 603

40
Test Data

Random Graph
|V| = 220, |E| = 2155, k = 122, k' = 122
Grid Graph
|V| = 289, |E| = 544, k = 145, k' = 145

41
Test Data

VC V / 2 k' k

42
Sequential Times
Kinase, SH2, Thrombin: n/a
43
Code-p on Virtual Proc.
44
Parallel Times
45
Speedup Somatostatin
46
Speedup WW
47
Speedup Rand. Graph
48
Speedup Grid Graph
49
Clustal XP
X = Extended, P = Parallel
in progress
50
Clustal XP
http://cgm.dehne.net
51
(No Transcript)
