Parallel Computational Biochemistry - PowerPoint PPT Presentation

Slides: 52
Provided by: peopleScs3
1
Parallel Computational Biochemistry
2
Proteins, DNA, etc.
  • DNA encodes the information necessary to
    produce proteins
  • Proteins are the main molecular building blocks
    of life (for example, structural proteins,
    enzymes)
3
Proteins, DNA, etc.
  • Proteins are formed from a chain of molecules
    called amino acids

4
Proteins, DNA, etc.
  • The DNA sequence encodes the amino acid sequence
    that constitutes the protein
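
As a small illustration of how a DNA sequence determines an amino acid sequence, the sketch below translates a coding strand codon by codon using the standard genetic code. The function name `translate` and the example DNA string are illustrative assumptions, not part of the presentation.

```python
# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding DNA strand into one-letter amino acid codes.

    Translation stops at the first stop codon ('*'); trailing bases that do
    not form a complete codon are ignored.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical DNA fragment chosen for illustration only.
print(translate("ATGTATTCT"))
```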

5
Proteins, DNA, etc.
  • There are twenty amino acids found in proteins,
    denoted by A, C, D, E, F, G, H, I, ...

6
Multiple Sequence Alignment
7
Databases of Biological Sequences
NCBI: 14,976,310 sequences; 15,849,921,438 nucleotides
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Swiss-Prot: 104,559 sequences; 38,460,707 residues
PDB: 17,175 structures
8
Sequence comparison
  • Compare one sequence (target) to many sequences
    (database search)
  • Compare more than two sequences simultaneously
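
To make the database-search case concrete, here is a minimal sketch that scores a target against every database entry with a Needleman-Wunsch global alignment and reports the best hit. The scoring values (match +1, mismatch -1, gap -2), the function names, and the toy sequences are assumptions for illustration; real tools use substitution matrices and affine gap penalties.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of strings a and b (Needleman-Wunsch).

    Only the previous row of the dynamic-programming table is kept.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def best_hit(target, database):
    """Return the (name, score) of the database entry that best matches target."""
    scored = ((name, nw_score(target, seq)) for name, seq in database.items())
    return max(scored, key=lambda item: item[1])


# Toy database with made-up names and sequences.
database = {"seq_a": "GATTACA", "seq_b": "GCATGCU", "seq_c": "TTTT"}
print(best_hit("GATTACA", database))
```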

9
Applications
  • Phylogenetic analysis
  • Identification of conserved motifs and domains
  • Structure prediction

10
(No Transcript)
11
Phylogenetic Analysis
12
Structure Prediction
>RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Protein sequences
Protein structures
Genomic sequences
13
Our Contributions
  • Parallel min vertex cover for improved sequence
    alignments
    • (to appear in Journal of Computer and System
      Sciences)
  • Parallel Clustal W (ICCSA 2003)
  • In progress: Clustal XP portal at
    http://cgm.dehne.net

14
Clustal W
15
Progressive Alignment
1. Do pairwise alignment of all sequences
and calculate the distance matrix

Scerevisiae  1
Celegans     2  0.640
Drosophila   3  0.634  0.327
Human        4  0.630  0.408  0.420
Mouse        5  0.619  0.405  0.469  0.289

2. Create a guide tree based on this
pairwise distance matrix
3. Align progressively following the guide tree:
start by aligning the most closely related pairs of
sequences; at each step align two sequences, or
one sequence to an existing subalignment
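
Step 2 can be sketched with a simple clustering of the distance matrix from step 1. The sketch below uses UPGMA (average linkage) because it is short; this is a swapped-in technique for illustration, since Clustal W itself builds its guide tree with neighbor joining. The function name `upgma` and the nested-tuple tree representation are assumptions.

```python
def upgma(names, lower):
    """Build a guide tree by UPGMA from a lower-triangular distance matrix.

    lower[i][j] (j < i) is the distance between names[i] and names[j].
    Returns (tree, order): a nested-tuple tree and the leaf sets in merge order.
    """
    n = len(names)
    clusters = {i: (names[i], frozenset([names[i]])) for i in range(n)}
    dist = {frozenset((i, j)): lower[i][j] for i in range(n) for j in range(i)}
    order = []
    next_id = n
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)            # closest pair of clusters
        a, b = sorted(pair)
        tree_a, leaves_a = clusters.pop(a)
        tree_b, leaves_b = clusters.pop(b)
        del dist[pair]
        for c in clusters:                        # size-weighted average distance
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            total = len(leaves_a) + len(leaves_b)
            dist[frozenset((next_id, c))] = (
                len(leaves_a) * d_ac + len(leaves_b) * d_bc) / total
        clusters[next_id] = ((tree_a, tree_b), leaves_a | leaves_b)
        order.append(leaves_a | leaves_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree, order


names = ["Scerevisiae", "Celegans", "Drosophila", "Human", "Mouse"]
lower = [[], [0.640], [0.634, 0.327], [0.630, 0.408, 0.420],
         [0.619, 0.405, 0.469, 0.289]]
tree, order = upgma(names, lower)
print(tree)
```

With the matrix from the slide, the closest pair (Human and Mouse, distance 0.289) is merged first, which is also the pair the progressive alignment in step 3 would align first.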
16
Parallel Clustal
  • Parallel pairwise (PW) alignment matrix
  • Parallel guide tree calculation
  • Parallel progressive alignment

Scerevisiae  1
Celegans     2  0.640
Drosophila   3  0.634  0.327
Human        4  0.630  0.408  0.420
Mouse        5  0.619  0.405  0.469  0.289
17
Relative Speedup
18
Clustal XP vs. SGI
  • SGI data taken from "Performance Optimization of
    Clustal W: Parallel Clustal W, HT Clustal, and
    MULTICLUSTAL"
    by Dmitri Mikhailov, Haruna Cofer, and Roberto
    Gomperts

19
Parallel Clustal - Improvements
  • Optimization of input parameters
    • scoring matrices, gap penalties: requires many
      repeated Clustal W calculations with various
      input parameters
  • Minimum Vertex Cover
    • use minimum vertex cover to remove erroneous
      sequences and identify clusters of highly
      similar sequences

20
Minimum Vertex Cover
  • TASK: remove the smallest number of gene sequences
    that eliminates all conflicts
  • NP-complete
  • Conflict Graph
    • vertex: sequence
    • edge: conflict (e.g. alignment with very poor
      score)
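
A conflict graph of this kind could be built from pairwise alignment scores as in the sketch below. The rule "score below a threshold means conflict", the function name, and the toy scores are assumptions for illustration; the slides do not spell out the exact criterion.

```python
def build_conflict_graph(scores, threshold):
    """Return (vertices, edges) of a conflict graph.

    scores maps a pair of sequence names (a, b) to their pairwise alignment
    score. Two sequences are in conflict, and joined by an edge, when their
    score falls below the threshold.
    """
    vertices = set()
    edges = set()
    for (a, b), score in scores.items():
        vertices.update((a, b))
        if score < threshold:
            edges.add(frozenset((a, b)))
    return vertices, edges


# Made-up scores: s3 aligns poorly with both other sequences.
scores = {("s1", "s2"): 85, ("s1", "s3"): 7, ("s2", "s3"): 4}
vertices, edges = build_conflict_graph(scores, threshold=10)
print(sorted(vertices), [sorted(e) for e in edges])
```

Here removing the single vertex s3 eliminates every conflict, so {s3} is a minimum vertex cover of this toy graph.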

21
FPT Algorithms
  • Phase 1: Kernelization
    • Reduce problem to size f(k)
  • Phase 2: Bounded Tree Search
    • Exhaustive tree search, exponential in f(k)

22
Kernelization
  • Buss's Algorithm for k-vertex cover
  • Let G = (V, E) and let S be the subset of vertices
    with degree greater than k; every k-vertex cover
    must contain all of S.
  • Remove S and all incident edges
  • G -> G', k -> k' = k - |S|.
  • IF G' has more than k x k' edges
  • THEN no k-vertex cover exists
  • ELSE start bounded tree search on G'
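
The kernelization step can be sketched as follows, assuming the usual strict form of the rule in which only vertices with more than k neighbours are forced into the cover. The function name `buss_kernel`, the edge-list input, and the return convention (None when no k-vertex cover can exist) are illustrative assumptions.

```python
def buss_kernel(edges, k):
    """One pass of Buss-style kernelization for k-vertex cover.

    edges is an iterable of (u, v) pairs. Returns None if no k-vertex cover
    can exist, otherwise (kernel_edges, k_prime, forced) where forced is the
    set S of high-degree vertices placed in the cover.
    """
    edge_set = {frozenset(e) for e in edges}
    degree = {}
    for e in edge_set:
        for v in e:
            degree[v] = degree.get(v, 0) + 1

    forced = {v for v, d in degree.items() if d > k}   # the set S
    if len(forced) > k:
        return None
    k_prime = k - len(forced)

    # G': remove S and all incident edges.
    kernel = {e for e in edge_set if not (e & forced)}

    # Every remaining vertex has degree <= k, so k' vertices cover
    # at most k * k' edges.
    if len(kernel) > k * k_prime:
        return None
    return kernel, k_prime, forced


# Star with centre 0 and four leaves, plus one separate edge (5, 6).
star = [(0, 1), (0, 2), (0, 3), (0, 4), (5, 6)]
print(buss_kernel(star, 2))
```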

23
Bounded Tree Search
24
Case 1: simple path of length 3
remove selected vertices from G'; k' -= 2
25
Case 2: 3-cycle
remove selected vertices from G'; k' -= 2
26
Case 3: simple path of length 2
remove v1, v2 from G'; k' -= 1
27
Case 4: simple path of length 1
remove v, v1 from G'; k' -= 1
28
Sequential Tree Search
  • Depth-first search
  • backtrack when k' = 0 and G' is non-empty ("dead end")
  • stop when a solution is found (G' is empty, k' >= 0)
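
The depth-first search with backtracking can be illustrated with the simplest textbook branching rule: pick any remaining edge and try each of its endpoints in turn. This is not the four-case analysis from the preceding slides, only a minimal sketch of the same search pattern; the function name is an assumption.

```python
def vertex_cover_search(edges, k):
    """Depth-first bounded search for a vertex cover of size at most k.

    edges is a list of (u, v) pairs. Returns a set of vertices covering
    every edge, or None if no cover of size <= k exists.
    """
    if not edges:              # G' is empty: solution found
        return set()
    if k == 0:                 # budget exhausted but edges remain: dead end
        return None
    u, v = edges[0]
    for pick in (u, v):        # one endpoint of the edge must be in the cover
        remaining = [e for e in edges if pick not in e]
        sub = vertex_cover_search(remaining, k - 1)
        if sub is not None:
            return sub | {pick}
    return None                # both branches failed: backtrack


path = [(1, 2), (2, 3), (3, 4)]
print(vertex_cover_search(path, 2))
```

The search tree has depth at most k and two children per node, so this rule explores at most 2^k leaves; the case analysis on the slides reduces that base.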

29
Parallel Tree Search
  • Basic Idea
    • Build the top log p levels of the search tree (T')
    • every processor starts a depth-first search at one
      leaf of T'
    • randomize the depth-first search by selecting a
      random child
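
The basic idea can be emulated on one machine as in the sketch below: expand the top levels of a search tree into a frontier of subproblems, then run a randomized depth-first search from each leaf. The simple two-way edge branching, the function names, and the sequential loop standing in for p processors are all assumptions for illustration; the actual implementation distributes the leaves over MPI processes.

```python
import random


def expand(edges, k, levels):
    """Expand the top `levels` levels of the search tree (T').

    Returns the leaves as (remaining_edges, remaining_k, chosen_vertices).
    """
    frontier = [(list(edges), k, frozenset())]
    for _ in range(levels):
        next_frontier = []
        for es, kk, chosen in frontier:
            if not es or kk == 0:          # already solved or a dead end
                next_frontier.append((es, kk, chosen))
                continue
            for pick in es[0]:             # branch on both endpoints
                rest = [e for e in es if pick not in e]
                next_frontier.append((rest, kk - 1, chosen | {pick}))
        frontier = next_frontier
    return frontier


def random_dfs(edges, k, rng):
    """Depth-first search that visits the children in random order."""
    if not edges:
        return set()
    if k == 0:
        return None
    children = list(edges[0])
    rng.shuffle(children)
    for pick in children:
        sub = random_dfs([e for e in edges if pick not in e], k - 1, rng)
        if sub is not None:
            return sub | {pick}
    return None


# levels = log2(p) gives up to p leaves; here p = 4.
path = [(1, 2), (2, 3), (3, 4)]
leaves = expand(path, 2, 2)
rng = random.Random(0)
solutions = []
for es, kk, chosen in leaves:              # each leaf would go to one processor
    sub = random_dfs(es, kk, rng)
    if sub is not None:
        solutions.append(set(chosen) | sub)
print(len(leaves), solutions)
```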

30
Analysis: Balls-in-bins
  • sequential depth-first search path: total
    length L, solutions m
  • expected sequential time (rand. distr.): L/(m+1)
  • parallel search path
  • expected parallel time (rand. distr.): p +
    L/(p(m+1))
  • expected speedup: p / (1 + (m+1)/L)
  • if m << L then expected speedup ≈ p
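
The balls-in-bins model above can be checked with a small Monte Carlo sketch: m solutions are placed uniformly at random on a search path of length L, the sequential search walks the path from the start, and each of p processors walks its own segment of length L/p. The function names, the assumption that p divides L, and the omission of any start-up cost for building the top of the tree are simplifications for illustration.

```python
import random


def sequential_steps(solutions):
    """Steps until a single search starting at position 0 hits a solution."""
    return min(solutions) + 1


def parallel_steps(solutions, L, p):
    """Steps until the first of p processors hits a solution.

    Processor i starts at position i * (L // p) and walks its own segment;
    assumes p divides L.
    """
    segment = L // p
    return min(s % segment for s in solutions) + 1


def estimate_speedup(L, m, p, trials, seed=0):
    """Ratio of total sequential to total parallel steps over random trials."""
    rng = random.Random(seed)
    seq_total = 0
    par_total = 0
    for _ in range(trials):
        solutions = rng.sample(range(L), m)
        seq_total += sequential_steps(solutions)
        par_total += parallel_steps(solutions, L, p)
    return seq_total / par_total


print(estimate_speedup(L=100_000, m=10, p=8, trials=500))
```

In this simplified model a processor never needs more steps than the sequential search, so the estimated ratio is at least 1 and approaches p when m is much smaller than L.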
31
Simulation Experiment
L = 1,000,000
32
Implementation
  • test platform
    • 32-node HPCVL Beowulf cluster
    • each node: dual 1.4 GHz Intel Xeon, 512 MB RAM,
      60 GB disk
    • gcc and LAM/MPI on Red Hat Linux 7.2
  • code-s: sequential k-vertex cover
  • code-p: parallel k-vertex cover

33
Test Data
  • Protein sequences
    • Same protein from several hundred species
    • Each protein sequence is a few hundred amino acid
      residues in length
    • Obtained from the National Center for
      Biotechnology Information
      (http://www.ncbi.nlm.nih.gov/)

34
Test Data
  • Somatostatin
    • neuropeptide involved in the regulation of many
      functions in different organ systems
    • Clustal threshold = 10, |V| = 559, |E| = 33652,
      k = 273, k' = 255

35
Test Data
  • WW
    • small protein domain that binds proline-rich
      sequences in other proteins and is involved in
      cellular signaling
    • Clustal threshold = 10, |V| = 425, |E| = 40182,
      k = 322, k' = 318

36
Test Data
  • Kinase
    • large family of enzymes involved in cellular
      regulation
    • Clustal threshold = 16, |V| = 647, |E| = 113122,
      k = 497, k' = 397

37
Test Data
  • SH2 (src-homology domain 2)
    • involved in targeting proteins to specific sites
      in cells by binding to phosphotyrosine
    • Clustal threshold = 10, |V| = 730, |E| = 95463,
      k = 461, k' = 397

38
Test Data
  • Thrombin
    • protease in the blood coagulation cascade that
      promotes blood clotting by converting
      fibrinogen to fibrin
    • Clustal threshold = 15, |V| = 646, |E| = 62731,
      k = 413, k' = 413

39
Test Data
  • PHD (pleckstrin homology domain)
    • involved in cellular signaling
    • Clustal threshold = 10, |V| = 670, |E| = 147054,
      k = 603, k' = 603

40
Test Data
  • Random Graph
    • |V| = 220, |E| = 2155, k = 122, k' = 122
  • Grid Graph
    • |V| = 289, |E| = 544, k = 145, k' = 145

41
Test Data
  • VC V / 2 k' k

42
Sequential Times
Kinase, SH2, Thrombin: n/a
43
Code-p on Virtual Proc.
44
Parallel Times
45
Speedup Somatostatin
46
Speedup WW
47
Speedup Rand. Graph
48
Speedup Grid Graph
49
Clustal XP
X = Extended, P = Parallel
in progress
50
Clustal XP
http://cgm.dehne.net
51
(No Transcript)