Title: Sequence Alignment Algorithms

1
Sequence Alignment Algorithms: Application to
Bioinformatics Tool Development

Dr. S. Parthasarathy
Reader and Head
Department of Bioinformatics
Bharathidasan University
Tiruchirappalli 620 024
(E-mail: partha_at_cnld.bdu.ac.in)

2
Plan

Introduction to Bioinformatics
Sequence alignment algorithms
Global alignment: Needleman-Wunsch algorithm
Local alignment: Smith-Waterman algorithm
Predicting a fold for a protein sequence
Methodology
Algorithm, Coding and Tool Development
Benchmarking
Conclusions

PredictFold
3
Introduction

Why do we need Bioinformatics?
What is Bioinformatics?
Where is Bioinformatics used?

4
Why?

Biological Data Explosion
How did Biological Data Explosion happen?
Sequence databases are far LARGER than structure
databases
Why so?

5
Introduction: Biological Data and Genome Projects

Latest Revolution
On 26 June 2000: announcement of the completion of
the draft of the Human Genome
"Genetic Code of Human Life is Cracked by
Scientists"
Human Genome contains 3.2 x 10^9 bp
Unit of (genome) sequence length
bp (base pairs)
Mbp (mega base pairs) = 10^6 bp
Gbp (giga base pairs) = 10^9 bp
huge (human genome equivalent) = 3.2 Gbp
Unit of genetic distance
centiMorgan (cM): an arbitrary unit named for
Thomas Hunt Morgan
(e.g. 1 cM = 0.01 recombination frequency)

6
Introduction: Biological Data and Genome Projects
[Figure: publications dated 16 February 2001 and 15 February 2001]
7
Biological Data: Recombinant DNA Technology

Old Revolution
1940: Role of DNA as the genetic material was confirmed
1953: Discovery of DNA structure by James Watson and Francis Crick
1966: Establishment of the Genetic Code
1967: DNA ligase was isolated (joins two strands of DNA together): "Molecular Glue"
1970: Isolation of restriction enzyme: "Molecular Scissors"
1972: Recombinant DNA molecules were generated at Stanford University, USA
1973: DNA fragments were joined to the plasmid pSC101 isolated from E. coli. They could replicate when introduced into E. coli.
The discoveries of 1972 and 1973 triggered off the biggest scientific revolution: Genetic Engineering

8
Biological Data explosion

GenBank, NCBI, USA
44 Gbp of DNA in 40 million sequences (up to
2004)
GenBank, National Center for Biotechnology
Information, USA
Protein Data Bank (PDB), RCSB, USA
29,000 structures (2004)
PDB, Research Collaboratory for Structural
Bioinformatics, USA
QUALITY of Data - HIGH
Experimental error in modern genomic sequencing
is extremely low
QUANTITY of Data - HUGE
With recombinant DNA technology and genomic
sequencing, the size of sequence databases is
increasing very rapidly
SEQUENCE Versus STRUCTURE Databases
Sequence databases are far LARGER than structure
databases
Leads to Bioinformatics

9
What?

What is Bioinformatics?
Define Bioinformatics

10
Bioinformatics - Definition
\[ F(i,j) = \max\{\, F(i-1,j-1) + s(x_i,y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \,\} \]
Bioinformatics
atcggcatgcatcagtcatgcaactg
PEPTIDESE QSEDITPEP
Bioinformatics is an integration of mathematical,
statistical and computer methods to analyze
biological data. We use computer programs to
make inference from the biological data, to make
connections among them and to derive useful and
interesting predictions. The marriage of biology
and computer science has created a new field
called Bioinformatics. - Arthur M. Lesk
11
Biology: Basic Definitions

Cell - It is the building block of living
organisms
Eukaryotic Cells or organisms have the nucleus
separated from the cytoplasm by a nuclear
membrane and the genetic material borne on a
number of chromosomes consisting of DNA and
Protein
Chromosome
The physical basis of heredity. Deeply staining
rod-like structures present with the nuclei of
eukaryotes
Contains DNA and protein arranged in compact
manner
Replicate identically during cell division
The same number of chromosomes is present in all cells of a
particular species (e.g. human: 22, X and Y)

12
Genome: Basic Definitions

Genome
A complete set of chromosomes inherited from one
parent
Gene
One of the units of inherited material carried on
chromosomes. Genes are arranged in a linear
fashion on DNA. Each represents one character,
which is recognized by its effect on the
individual bearing the gene in its cells. There
are many thousands of genes in each nucleus.
DNA (Deoxyribonucleic Acid)
DNA is made up of FOUR bases
a, t, g, c: adenine, thymine, guanine,
cytosine
Protein
Protein is made up of TWENTY different amino
acids
A, T, G, C, ...: Alanine, Threonine, Glycine,
Cysteine, ...

13
Central Dogma
DNA:     CCTGAGCCAACTATTGATGAA
mRNA:    CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
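The three lines of this slide (DNA, mRNA, protein) can be reproduced with a short sketch. It assumes the DNA shown is the coding strand and that the standard genetic code applies; the function names `transcribe` and `translate` are illustrative and not part of the presentation.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T with U (assumption: coding strand)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon ('*')."""
    protein = []
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    for pos in range(0, usable, 3):
        codon = mrna[pos:pos + 3].upper().replace("U", "T")
        residue = CODON_TABLE[codon]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```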
14
Genome Data: Human and Model Organisms

Most mapping and sequencing technologies were
developed from studies of simpler non-human
organisms
Non-Human/Model organisms
Bacterium: Escherichia coli - 4.6 Mbp
Yeast: Saccharomyces cerevisiae - 12.1 Mbp
Fruit fly: Drosophila melanogaster - 180.0 Mbp
Roundworm: C. elegans - 95.5 Mbp
Laboratory mouse: Mus musculus - 3.0 Gbp
Human: more complex genome
Human: Homo sapiens - 3.2 Gbp

15
Genome Data: Human (Homo sapiens)

Genome: 1
Chromosomes: 23
Genes / DNAs: 30,000
Nucleotides: 3.2 x 10^9 bp

16
Bioinformatics in Genome Research

Data Collection and Interpretation
Collecting and Storing Data
Sequence generated by genome research will be
used as primary information source for human
biology and medicine
The vast amount of data produced will first need
to be collected, stored and distributed
Interpretation of Data
Recognizing where genes begin and end
Searching a database for a particular DNA
sequence may uncover homologous sequences
in a known gene from a model organism, revealing
insights into the function of the corresponding
human gene

17
Understanding Gene Function

Correct protein function depends on the 3-D or
folded structure the protein assumes in
biological environments
Understanding protein structure will be essential
in determining gene function

Gene -> Protein Structure -> Function
18
Where?

Where is Bioinformatics used?
What are the uses of Bioinformatics?
Applications of Bioinformatics

19
Bioinformatics Tasks

Sequence Analysis (Protein sequences)
Similarity and Homology
Pairwise local/global alignment
GCG: SeqLab, SeqWeb
Scoring matrices: PAM, BLOSUM
Database search
BLAST, FASTA
Multiple alignment
ClustalW, PRINTS, BLOCKS
Secondary Structure Prediction (from Sequence)
Proteins: α-Helix, β-Sheet, Turn or Coil
Protein Folding

20
Bioinformatics Tasks

Structure analysis: Experimental Determination
X-ray crystallography: 3-dimensional coordinates (structure)
Nuclear Magnetic Resonance (NMR)
PDB: Protein Data Bank
RasMol: molecular viewing software
High-throughput crystallographic structure determination
High-flux synchrotron radiation sources (data collection)
Multi-wavelength anomalous diffraction method (data interpretation)
Bioinformatics: Structure Prediction
Homology Modelling: InsightII, SwissPDBViewer, Biosuite
ab initio method: Monte Carlo simulation
Protein Structure Classification
SCOP - Structural Classification Of Proteins
CATH - Class, Architecture, Topology, Homologous superfamily
FSSP - Fold classification based on Structure-Structure alignment of Proteins, obtained by DALI (Distance-matrix ALIgnment)

21
Bioinformatics Tasks

Protein Engineering
Mutations
Alter particular amino acid/base for desired
effect
Site directed mutagenesis
Identify the potential sites where we can do
alterations
Applications
Agricultural: genetically modified plants,
vegetables, GM food
Pharmaceutical: molecular modelling based drug
design
Medical: gene therapy
DNA Bending
Application to Genomes
(Ref: M. G. Munteanu, K. Vlahovicek,
S. Parthasarathy, I. Simon and S. Pongor, Rod models
of DNA: sequence-dependent anisotropic elastic
modelling of local phenomena, Trends in
Biochemical Sciences, 23 (1998) 341-347)

22
Bioinformatics Tasks: Genomics and Proteomics

Genomics is the study of the structure, content,
evolution and functions of genes in genomes
Aims of Genomics
To establish an integrated web-based database and
research interface
To assemble physical, genetic and cytological maps
of the genome
To identify and annotate the complete set of
genes encoded within a genome
To provide the resources for comparison with
other genomes

23
Proteomics: Proteome

Proteome is the complete collection of proteins
in a cell/tissue/organism at a particular time.
Unlike genomes, which are stable over the
lifetime of the organism, proteomes change rapidly as
each cell responds to its changing environment
and produces new proteins in different
amounts.
Genome is a more stable entity. An organism has
only one genome but many proteomes.
For an organism, there may be
one body-wide proteome,
about 200 tissue proteomes,
about a trillion (10^12) individual cell
proteomes.

24

Proteomics: Definition

The study of proteomes that includes determining
the 3D shapes of proteins, their roles inside
cells, the molecules with which they interact,
and defining which proteins are present and how
much of each is present at a given time.

25

Proteomics: Applications

To correlate proteins on the basis of their
expression profiles.
To observe patterns in protein synthesis; changes
in the observed patterns can be used as an
indicator of the state of the cell and its gene
expression.
To characterize bacterial pathogens and to
develop novel antimicrobials.
To identify regions of the bacterial genome that
encode pathogenic determinants.
To develop drugs and in toxicology: Structural
Proteomics
Proteomics as a tool for plant genetics and
breeding

26
Systems Biology

Systems Biology is a new perspective and emerging
field for research in the post-genomic era.
It aims at system level understanding of
biological systems.
It studies whole cells/tissues/organisms not by a
traditional reductionist approach but by
holistic means in a reiterative attempt to model
the complete cell/tissue/organism.
It is an integrated and interacting network of
genes, proteins and biochemical reactions which
give rise to life.

27
Systems Biology
28
Sequence Alignment Algorithms

Similarity and Homology
Sequence Comparison - Issues
Types of alignments
Algorithms Used

29
Sequence similarity and homology

"Nature is a tinkerer and not an inventor." New
sequences are adapted from pre-existing sequences
rather than invented de novo. There exists
significant similarity between a new sequence and
already known sequences, which is fortunate for
computational sequence analysis.
Similarity: a measurement of resemblance and
differences, independent of the source of
resemblance.
Homology: the sequences and the organisms in
which they occur are descended from a common
ancestor.
If two related sequences are homologous, then we
can transfer information about structure and/or
function, by homology.

30
3-D Structure and Homology

3-D structure patterns (motifs) of proteins are
much more evolutionarily conserved than amino
acid sequences - This type of Homology search
could prove more fruitful
Particular motifs may serve similar functions in
several different proteins, information that
would be valuable in genome analysis
Only a few protein motifs can be recognised at
the sequence level
Development of more analytic capabilities to
facilitate grouping protein sequences into motif
families will make homology searches more useful

31
Sequence Comparison: Issues

Types of alignment
Global: end-to-end matching
(Needleman-Wunsch)
Local: portions or subsequences matching
(Smith-Waterman)
Scoring system used to rank alignments
PAM and BLOSUM matrices
Algorithms used to find optimal (or good) scoring
alignments
Heuristic
Dynamic Programming
Hidden Markov Model (HMM)
Statistical methods used to evaluate the
significance of an alignment score
Z-score, P-value and E-value

32
Substitution Matrices

PAM (Point Accepted Mutation)
BLOSUM (BLOcks SUbstitution Matrix)

Sequences    PAM    BLOSUM
Close        40     90
Default      250    62
Distant      500    30
33
Types of Algorithms

Heuristic
A heuristic is an algorithm that will yield
reasonable results, even if it is not provably
optimal or lacks even a performance guarantee.
In most cases, heuristic methods can be very
fast, but they make additional assumptions and
will miss the best match for some sequence pairs.
Dynamic Programming
The algorithm for finding optimal alignments,
given an additive alignment score, by dynamic programming.
(We will discuss it shortly.)
These types of algorithms are guaranteed to
find the optimal scoring alignment or set of
alignments.
HMM: based on probability theory; very
versatile.

34
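Global Alignment: Needleman-Wunsch Algorithm

Formula
\[
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i,y_j) & \text{D (diagonal)} \\
F(i-1,j) - d & \text{H (horizontal)} \\
F(i,j-1) - d & \text{V (vertical)}
\end{cases}
\]

Cells used to fill F(i,j):
F(i-1,j-1)  (D)     F(i,j-1)  (V)
F(i-1,j)    (H)     F(i,j)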
35
Global Alignment: Needleman-Wunsch Algorithm

Gap penalties
Linear score: f(g) = -gd
Affine score: f(g) = -d - (g-1)e
d = gap-open penalty, e = gap-extend penalty,
g = gap length
Trace back
Start from the value in the bottom-right corner and
trace back to the top-left corner (i.e. the alignment
always runs end to end).
Algorithm complexity
It takes O(nm) time and O(nm) memory, where n and
m are the lengths of the sequences.
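A minimal sketch of global alignment with a linear gap penalty d, filling the (n+1) x (m+1) matrix F and tracing back from the bottom-right corner. The function name `needleman_wunsch`, the scoring function `match_score` and the toy values (+2 match, -1 mismatch, d = 2) are assumptions chosen for illustration, not values from the presentation.

```python
def needleman_wunsch(x, y, s, d):
    """Global alignment of x and y with substitution score s(a, b) and linear gap penalty d."""
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]; O(nm) memory.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d          # x[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        F[0][j] = -j * d          # y[:j] aligned entirely against gaps
    for i in range(1, n + 1):     # O(nm) time
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # residue pair
                F[i - 1][j] - d,                          # gap in y
                F[i][j - 1] - d,                          # gap in x
            )
    # Trace back from the bottom-right corner to the top-left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


def match_score(a, b):
    """Toy substitution score (assumption): +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


print(needleman_wunsch("GAT", "GT", match_score, 2))  # (2, 'GAT', 'G-T')
```

When several alignments share the optimal score, this traceback returns only one of them; the order of the checks decides which.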

36
Local Alignment: Smith-Waterman Algorithm

Same as the global alignment algorithm, with
TWO differences:
F(i,j) takes the value 0 (zero) if all other options
have a value less than 0.
The alignment can end anywhere in the matrix.
Take the highest value of F(i,j) over the whole
matrix and start the trace back from there.

37
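Local Alignment: Smith-Waterman Algorithm

Formula
\[
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i,y_j) & \text{D (diagonal)} \\
F(i-1,j) - d & \text{H (horizontal)} \\
F(i,j-1) - d & \text{V (vertical)} \\
0 & \text{if all other values are negative}
\end{cases}
\]

Cells used to fill F(i,j):
F(i-1,j-1)  (D)     F(i,j-1)  (V)
F(i-1,j)    (H)     F(i,j)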
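A minimal sketch of local alignment with a linear gap penalty d: every cell is floored at 0, and the traceback starts at the highest-scoring cell and stops at the first 0. The names `smith_waterman` and `match_score` and the toy scores (+2 match, -1 mismatch, d = 2) are illustrative assumptions.

```python
def smith_waterman(x, y, s, d):
    """Local alignment of x and y with substitution score s(a, b) and linear gap penalty d."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,                                         # start a fresh alignment
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until a cell with value 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


def match_score(a, b):
    """Toy substitution score (assumption): +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


print(smith_waterman("TTGATC", "AGATG", match_score, 2))  # (6, 'GAT', 'GAT')
```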
38
Web based server development

Design the web page to get the data
Use cgi-bin or Perl script to parse the submitted
data
Invoke the corresponding program to get the
appropriate results
Send the results either by e-mail or to the web
page directly

39
Application to Bioinformatics Tool Development

To predict a fold for a protein sequence

PredictFold
40
To predict a fold for a protein sequence
PredictFold

To predict possible folds for a given protein
sequence, whose structure is not known
To develop a fold recognition technique / tool
that is sensitive in detecting folds of given
protein sequences in the twilight zone (sequences
sharing less than 25% identity)
Application of the fold recognition strategy to
genomic annotation

41
Twilight Zone Sequences. Example: Cytochrome
Sequences

256b
>256BA CYTOCHROME B562 (OXIDIZED) - CHAIN A
ADLEDNMETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPPKLED
KSPDSPEMKDFRHGFDILVGQIDDALKLANEGKVKEAQAAAEQLKTTRN
AYHQKYR
>256BB CYTOCHROME B562 (OXIDIZED) - CHAIN B
ADLEDNMETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPPKLED
KSPDSPEMKDFRHGFDILVGQIDDALKLANEGKVKEAQAAAEQLKTTRN
AYHQKYR
2ccy
>2CCYA CYTOCHROME C(PRIME) - CHAIN A
QQSKPEDLLKLRQGLMQTLKSQWVPIAGFAAGKADLPADAAQRAENMAMV
AKLAPIGWAKGTEALPNGETKPEAFGSKSAEFLEGWKALATESTKLAAA
AKAGPDALKAQAAATGKVCKACHEEFKQD
>2CCYB CYTOCHROME C(PRIME) - CHAIN B
QQSKPEDLLKLRQGLMQTLKSQWVPIAGFAAGKADLPADAAQRAENMAMV
AKLAPIGWAKGTEALPNGETKPEAFGSKSAEFLEGWKALATESTKLAAA
AKAGPDALKAQAAATGKVCKACHEEFKQD

42
Example: Sequence similarity

lalign output for 256b and 2ccy follows

43
Example: Cytochrome Structures
[Figure: cytochrome structures 256b and 2ccy (sequence similarity 24%)]
44
Goals

Exploration of suitable fold recognition
techniques that are sensitive in detecting
similar folds despite low sequence similarity
Identification of functional motifs in proteins
at sequence (1D) and structure (3D) level
Development of a protocol that aids in the rapid
classification and annotation of genomic data
based on functional motifs

45
Methodology

Reduction of 3D-structure to 1D-environment
string. Environment at each residue position is a
function of local secondary structure and extent
of exposure to the solvent (based on 3D-1D
profile method developed by Eisenberg et al.,
1991).
Extract residue environment profiles of the
available protein structures.
A scoring matrix is generated from a library of
profiles. Each matrix element is the information
value of a residue in the given environment.
A library of environment strings is created for
the available protein fold structures.
The probe sequence is queried against this
library to look for best matches.

46
Workflow
47
Residue Environments
[Figure: residue environments, with secondary structure labelled Helix, Strand or Coil and exposure labelled Buried, Partially buried or Exposed]
48
Residue Environments

The residue environments are described by
the area (A) of the residue buried in the protein
the fraction (f) of side-chain area that is
covered by polar atoms (O and N)
the local secondary structure

49
Residue Environments

CLASS             AREA (A), Å^2     FRACTION (f)
BURIED 1 (B1)     A > 114           f < 0.45
BURIED 2 (B2)     A > 114           0.45 < f < 0.58
BURIED 3 (B3)     A > 114           f > 0.58
PARTIAL 1 (P1)    40 < A < 114      f < 0.67
PARTIAL 2 (P2)    40 < A < 114      f > 0.67
EXPOSED (E0)      A < 40            f > 0.67
50
Residue Environment classes

We have 6 classes based on the extent of exposure
to solvent
We have 3 classes based on secondary structure:
Alpha Helix (A), Beta Sheet (B) and Coil (C)
Total: 6 x 3 = 18 environments
B1A, B1B, B1C, B2A, B2B, B2C, B3A, B3B, B3C,
P1A, P1B, P1C, P2A, P2B, P2C, E0A, E0B, E0C.
For example:
B1A - Buried 1, Alpha Helix
P2B - Partially Buried 2, Beta Sheet
E0C - Exposed 0, Coil
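As a rough sketch, the 18 environment labels can be derived from the buried area A (in Å^2), the polar fraction f and the secondary structure code. The thresholds 114, 40, 0.45, 0.58 and 0.67 come from the class table; how values exactly on a boundary are assigned, and treating every residue with A at or below 40 as E0 regardless of f, are assumptions of this sketch. The function names are hypothetical.

```python
EXPOSURE_CLASSES = ("B1", "B2", "B3", "P1", "P2", "E0")
SECONDARY_STRUCTURES = ("A", "B", "C")  # Alpha Helix, Beta Sheet, Coil

# All 6 x 3 = 18 environment labels, e.g. "B1A", "P2B", "E0C".
ENVIRONMENTS = [e + s for e in EXPOSURE_CLASSES for s in SECONDARY_STRUCTURES]


def exposure_class(area: float, fraction: float) -> str:
    """Map buried area A and polar fraction f to one of the six exposure classes."""
    if area > 114:                       # buried
        if fraction < 0.45:
            return "B1"
        if fraction < 0.58:
            return "B2"
        return "B3"
    if area > 40:                        # partially buried
        return "P1" if fraction < 0.67 else "P2"
    return "E0"                          # exposed (f not checked: assumption)


def environment(area: float, fraction: float, secondary: str) -> str:
    """Combine exposure class and secondary structure code into an environment label."""
    if secondary not in SECONDARY_STRUCTURES:
        raise ValueError("secondary structure must be 'A', 'B' or 'C'")
    return exposure_class(area, fraction) + secondary


print(len(ENVIRONMENTS))               # 18
print(environment(130.0, 0.30, "A"))   # B1A
print(environment(80.0, 0.70, "B"))    # P2B
print(environment(20.0, 0.90, "C"))    # E0C
```

Applying `environment` to each residue of a known structure, in order, would give the 1D environment string used by the 3D-1D profile method.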

51
Scoring Table

The scoring table used in this case is a 20 x 18
matrix, constructed from a statistical analysis
of the profile library (consisting of 1200
protein structures) provided by the PROFILES_3D
module of Insight II (Accelrys Inc.)
The scores S_ij are calculated using the formula
\[ S_{ij} = \ln\!\left( \frac{P(i \mid j)}{P_i} \right) \times 100 \]
where P(i|j) is the probability of
finding residue i in the environment j and P_i
is the overall probability of finding residue i
in any environment.
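To make the formula S_ij = ln(P(i|j) / P_i) x 100 concrete, here is a sketch that estimates both probabilities from observed (residue, environment) counts. The counts below are invented toy numbers, not data from the PROFILES_3D library, and `score_table` is a hypothetical name. Pairs that were never observed are skipped; a real table would need some treatment for them (for example pseudocounts), which is not described in the slides.

```python
import math
from collections import Counter


def score_table(counts):
    """Return {(residue, environment): 100 * ln(P(residue | environment) / P(residue))}."""
    total = sum(counts.values())
    env_totals = Counter()
    res_totals = Counter()
    for (res, env), c in counts.items():
        env_totals[env] += c
        res_totals[res] += c
    scores = {}
    for (res, env), c in counts.items():
        if c == 0:
            continue  # ln(0) is undefined; unobserved pairs are left out in this sketch
        p_res_given_env = c / env_totals[env]   # P(i | j)
        p_res = res_totals[res] / total         # P_i
        scores[(res, env)] = 100.0 * math.log(p_res_given_env / p_res)
    return scores


# Toy counts (assumption): leucine prefers the buried helix, lysine the exposed coil.
toy_counts = {
    ("L", "B1A"): 30, ("L", "E0C"): 10,
    ("K", "B1A"): 10, ("K", "E0C"): 50,
}
table = score_table(toy_counts)
print(round(table[("L", "B1A")], 1))   # 62.9
print(round(table[("L", "E0C")], 1))   # -87.5
```

A positive score means the residue is found in that environment more often than its overall frequency would suggest; a negative score means it is found there less often.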

52
Scoring Table

The scoring table contains a measure of the
compatibility of the 20 amino acids with the 18
environmental classes.
The individual matrix elements are propensities
(information values) for the amino acid residues.

53
Scoring Table
54
Fold Library
1565 Functional forms
Scan PDB to identify all the structures having
these folds
Identify a representative structure with
resolution 2.5Å or better
Quality of the structure (Occupancy, R-Factor,
Stereochemistry)
968 Chains
55
DALI / FSSP Fold Library

DALI: http://www.ebi.ac.uk/dali
Touring protein fold space with Dali/FSSP. Liisa
Holm and Chris Sander, Nucleic Acids Research,
(1998), 26, 316-319
Mapping the Protein Universe. Liisa Holm and
Chris Sander, Science, (1996), 273, 595-602

56
Sequence Comparison: Details

Type of alignment
Local: portions or subsequences matching
Smith-Waterman algorithm
Scoring table: 3D-1D matrix
Algorithm used: Dynamic Programming
Alignment score: Z-score

57
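Local Alignment: Smith-Waterman Algorithm

Formula
\[
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i,y_j) & \text{D (diagonal)} \\
F(i-1,j) - d & \text{H (horizontal)} \\
F(i,j-1) - d & \text{V (vertical)} \\
0 & \text{if all other values are negative}
\end{cases}
\]

Cells used to fill F(i,j):
F(i-1,j-1)  (D)     F(i,j-1)  (V)
F(i-1,j)    (H)     F(i,j)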
58
Gap Penalties

Gap penalties
Linear score: f(g) = -gd
Affine score: f(g) = -d - (g-1)e
d = gap-open penalty, e = gap-extend penalty,
g = gap length
Gap penalty values used are
d = 500
e = 50
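A short sketch comparing the two gap scores, f(g) = -gd (linear) and f(g) = -d - (g-1)e (affine), using the values d = 500 and e = 50 quoted on this slide. The function names are illustrative.

```python
GAP_OPEN = 500    # d, gap-open penalty
GAP_EXTEND = 50   # e, gap-extend penalty


def linear_gap(g: int, d: int = GAP_OPEN) -> int:
    """Linear gap score: every gap position costs d."""
    return -g * d


def affine_gap(g: int, d: int = GAP_OPEN, e: int = GAP_EXTEND) -> int:
    """Affine gap score: d to open the gap, e for each further position."""
    return -d - (g - 1) * e


for g in (1, 2, 3):
    print(g, linear_gap(g), affine_gap(g))
# 1 -500 -500
# 2 -1000 -550
# 3 -1500 -600
```

With these values a single gap of length 3 costs 600 under the affine score, whereas three separate gaps of length 1 cost 1500, so the affine scheme favours a few long gaps over many short ones.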

59
Local Alignment

Trace back
Alignment can end anywhere in the matrix
Take the highest value of F(i,j) over the whole
matrix and start trace back from there.
Algorithm complexity
It takes O(nm) time and O(nm) memory, where n and
m are the lengths of the sequences.

60
Significance of an Alignment Score

Statistical methods used to evaluate the
significance of an alignment score:
Z-score, P-value and E-value
Significance of Score
Z-score = (score - mean) / std. dev.
Measures how unusual our original match is.
Z ≥ 5 are significant.
P-value measures the probability that the alignment
is no better than random. (Z and P depend on the
distribution of the scores)
P ≤ 10^-100: exact match.
E-value is the expected number of sequences that
give the same Z-score or better. (E = P x size
of the database)
E ≤ 0.02: sequences probably homologous
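A small sketch of the two formulas on this slide, Z = (score - mean) / std. dev. and E = P x database size. The background scores are invented toy numbers, and using the population standard deviation (`statistics.pstdev`) rather than the sample one is an assumption; the slides do not say which was used or how the background distribution was generated.

```python
import statistics


def z_score(score: float, background_scores) -> float:
    """Z = (score - mean) / std. dev. of a background score distribution."""
    mean = statistics.mean(background_scores)
    sd = statistics.pstdev(background_scores)   # population std. dev. (assumption)
    if sd == 0:
        raise ValueError("background scores have zero spread")
    return (score - mean) / sd


def e_value(p_value: float, database_size: int) -> float:
    """E = P x size of the database."""
    return p_value * database_size


background = [10, 12, 8, 11, 9]          # toy scores against unrelated entries
z = z_score(20, background)
print(round(z, 2))                       # 7.07
print(z >= 5)                            # True
print(f"{e_value(1e-5, 968):.5f}")       # 0.00968
```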

61
Benchmarking

All 968 proteins in the fold library were
profiled on each of the other members.
A histogram indicating the rank and the number of
sequences which got the self score as the
highest is shown in the figure.
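The rank histogram described here could be computed along the following lines. This is a sketch only: it assumes the all-against-all scores are available as a nested dictionary, uses invented toy scores, and ranks ties in the entry's favour (rank = 1 + number of strictly higher scores). The names `self_rank` and `rank_histogram` are hypothetical.

```python
from collections import Counter


def self_rank(scores_for_query: dict, query_id: str) -> int:
    """Rank (1 = best) of the query's own library entry among all library scores."""
    self_score = scores_for_query[query_id]
    return 1 + sum(1 for s in scores_for_query.values() if s > self_score)


def rank_histogram(all_scores: dict) -> Counter:
    """Count how many queries obtain each self rank."""
    return Counter(self_rank(scores, query) for query, scores in all_scores.items())


# Toy all-against-all scores (assumption): all_scores[query][library_entry]
all_scores = {
    "a": {"a": 90, "b": 40, "c": 10},
    "b": {"a": 70, "b": 60, "c": 20},
    "c": {"a": 5, "b": 15, "c": 80},
}
hist = rank_histogram(all_scores)
print(sorted(hist.items()))   # [(1, 2), (2, 1)]
```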

62
Benchmarking
63
Benchmarking

Report
797 retain the self as the highest score
63 report the self to have the second highest
score
There were about 100 proteins that have ranks
between 5 and 100.
Limitations
Prediction is restricted to the 968 folds in the
library
The algorithm is insensitive to partially folded
sequences
Specific to globular proteins and not for
membrane proteins
Sequences that fold in the presence of cofactors
and ligands are not accounted for

64
Web based server development

Design the web page to get the data
Use cgi-bin or Perl script to parse the submitted
data
Invoke the corresponding program to get the
appropriate results
Send the results either by e-mail or to the web
page directly
Prepare a user manual to describe the salient
features of the server

65
Conclusions

PredictFold: a program to predict possible folds
for a new protein sequence based on the 3D-1D
profile method
Benchmarking results show the reliability of the
method
There is a lot of scope for further improvement

66
Future Directions

To update the fold library by including more
known folds
To use the predicted secondary structure
information of the given sequence also
To optimise the source code for efficient,
automatic handling of genome sequences
To combine results from other algorithms (ORF,
HMM, etc.) to detect remote homologs
To develop and maintain a web-based server for fold
recognition

67
BT versus IT

Bioinformatics including Biotechnology (BT)
requires a lot of Information Technology (IT)
skills for genomic annotation projects
Bioinformatics is one of the potential areas for
IT professionals also
Genome Projects will be the next huge task for IT
industries (like the Y2K problem in the past)
BT will take on IT in the near future

68
Conclusions

Developing Web based Bioinformatics tools
Develop/modify useful algorithms
Generate computer source codes
Create/Maintain Web based server
Using existing Web based tools efficiently
Ethical issues
Bioethics and Biosafety: always ensure that no
bioinformatics tool harmful to the environment or
society is developed or used
by you
Cloning of human, Terminator technology, GM Food,
etc.

69
References (latest)

Arthur M. Lesk, Introduction to Bioinformatics, Oxford University Press, New Delhi (2003).
D. Higgins and W. Taylor (Eds), Bioinformatics: Sequence, Structure and Databanks, Oxford University Press, New Delhi (2000).
R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge Univ. Press, Cambridge, UK (1998).
A. Baxevanis and B. F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, (Third Edition) Wiley-Interscience, Hoboken, NJ (2005).
G. Gibson and S. V. Muse, A Primer of Genome Science, Sinauer Associates, USA (2002).
N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, Ane Books, New Delhi (2005).
Michael S. Waterman, Introduction to Computational Biology, Chapman & Hall, (1995).
J. A. Glasel and M. P. Deutscher (Eds), Introduction to Biophysical Methods for Protein and Nucleic Acid Research, Academic Press, New York (1995).
D. S. T. Nicholl, An Introduction to Genetic Engineering, (Second Edition) Cambridge Univ. Press, UK (2002).

70
References

3D-1D Profile method
J. U. Bowie, R. Lüthy and D. Eisenberg, Science, 253,
164-170 (1991).
Ostensible Recognition of Folds (ORF) method
Rajeev Aurora and George D. Rose, Proc. Natl.
Acad. Sci. (USA), 95(6), 2818-2823 (1998).
Superfamily Hidden Markov Model (SHMM) method
A. Krogh, M. Brown, I. S. Mian, K. Sjolander and
D. Haussler, J. Mol. Biol. 235(5), 1501-31 (1994).

71
Important Bioinformatics Resources

Databases and Tools
NCBI, NIH - www.ncbi.nlm.nih.gov
EMBL, EBI - www.ebi.ac.uk
ExPASy, Swiss - www.expasy.org
DDBJ - www.ddbj.nig.ac.jp
PDB - www.rcsb.org/pdb
Software
Accelrys - www.accelrys.com/products
GCG, Insight II, Cerius II, Discovery Studio
TCS - www.atc.tcs.co.in/biosuite/
BIOSUITE
Jalaja Technologies - www.jalaja.com
GENOCLUSTER

Thank You
