Iosif Vaisman - PowerPoint PPT Presentation

About This Presentation
Title:

Iosif Vaisman

Description:

Yeast chromosome 3 350,000. Escherichia coli (bacterium) genome 4,600,000. Largest yeast chromosome now mapped 5,800,000. Entire yeast genome 15,000,000 ... – PowerPoint PPT presentation

Number of Views:62
Avg rating:3.0/5.0
Slides: 76
Provided by: MML76
Learn more at: http://mason.gmu.edu
Category:
Tags: iosif | vaisman

less

Transcript and Presenter's Notes

Title: Iosif Vaisman


1
Introduction to Bioinformatics
  • Iosif Vaisman

Email ivaisman_at_gmu.edu
2
NIH working definition of bioinformatics and
computational biology (July 2000)
The NIH Biomedical Information Science and
Technology Initiative Consortium agreed on the
following definitions of bioinformatics and
computational biology recognizing that no
definition could completely eliminate overlap
with other activities or preclude variations in
interpretation by different individuals and
organizations. Bioinformatics Research,
development, or application of computational
tools and approaches for expanding the use of
biological, medical, behavioral or health data,
including those to acquire, store, organize,
archive, analyze, or visualize such
data. Computational Biology The development and
application of data-analytical and theoretical
methods, mathematical modeling and computational
simulation techniques to the study of biological,
behavioral, and social systems.
3
Bioinformatics bibliography(papers with the word
bioinformatics in title or abstract)
4
Dynamics of Database Growth
5
Comparative Sequence Sizes
  • Yeast chromosome 3
    350,000
  • Escherichia coli (bacterium) genome
    4,600,000
  • Largest yeast chromosome now mapped
    5,800,000
  • Entire yeast genome
    15,000,000
  • Smallest human chromosome (Y)
    50,000,000
  • Largest human chromosome (1)
    250,000,000
  • Entire human genome
    3,000,000,000

6
The String Alignment Problem
string - a sequence of characters from some
alphabet
given two strings acbcdb and cadbd
one of possible alignments
a c - - b c d b - c a d b - d -
score 3 . (2) 5 . (-1) 1
scoring function exact match 2 mismatch
-1 insertion -1
7
The String Alignment Problem
given two strings CTCATG and TACTTG
C T C A T G T A C T T G
score 3 . (2) 3 . (-1) 3
C T C A - T - G . T - A C T T G
score 4 . (2) 4 . (-1) 4
8
Entropy and Redundancy of Language
CUR F W D DIS AND P A
SED IEND ROUGHT EATH EASE AIN
BLES FR B BR AND AG
9
Entropy and Redundancy of Language
CUR FW D DISAND
P
BLESFRBBRAND
AG
The sequences are 65 identical
A CURSED FIEND WROUGHT DEATH DISEASE AND
PAIN
A BLESSED FRIEND BROUGHT BREATH AND EASE
AGAIN
10
Substitution Matrices
  • Dayhoff (or MDM, or PAM) - Derived from global
    alignments of closely related sequences PAM100 -
    number referes to evolutionary distance
    (Percentage of Acceptable point Mutations per 108
    years)

300 million years
200 million years
100 million years
11
Substitution Matrices
  • BLOSUM (BLOcks SUbstitution Matrix) -Derived
    from local, ungapped alignments of distantly
    related sequences BLOSUM62 - number refers to
    the minimum percent identity

Reference Henikoff Henikoff Proteins 1749,
1993
12
Selecting a Matrix
  • Compared sequences are related 200 PAM or 250
    PAM
  • Database scanning 120 PAM
  • Local alignment search 40 PAM, 120 PAM, 250 PAM
  • Detection of related sequences using BLAST
    BLOSUM 62

Low PAM short segments, high similarity High
PAM long segments, low similarity
THERE IS NO ONE SIZE FITS ALL MATRIX !
13
Matrix Example
A B C D E F G H I
K .. 1.5 0.2 0.3 0.3 0.3 -0.5 0.7 -0.1
0.0 0.0 .. A 1.1 -0.4 1.1 0.7 -0.7
0.6 0.4 -0.2 0.4 .. B 1.5 -0.5
-0.6 -0.1 0.2 -0.1 0.2 -0.6 .. C
1.5 1.0 -1.0 0.7 0.4 -0.2 0.3 .. D
1.5 -0.7 0.5 0.4 -0.2 0.3
.. E 1.5 -0.6 -0.1
0.7 -0.7 .. F
1.5 -0.2 -0.3 -0.1 .. G
1.5 -0.3 0.1 .. H
1.5 -0.2 .. I
1.5 .. K
14
Dayhoffs Acceptable Point Mutations
Ala A Arg R 30 Asn N 109 17 Asp D 154 0
532 Cys C 33 10 0 0 Gln Q 93 120 50 76
0 Glu E 266 0 94 831 0 422 Gly G 579 10 156
162 10 30 112 His H 21 103 226 43 10 243 23
10 Ile I 66 30 36 13 17 8 35 0 3 Leu
L 95 17 37 0 0 75 15 17 40 253 Lys K
57 477 322 85 0 147 104 60 23 43 39 Met M
29 17 0 0 0 20 7 7 0 57 207
90 Phe F 20 7 7 0 0 0 0 17 20 90
167 0 17 Pro P 345 67 27 10 10 93 40 49
50 7 43 43 4 7 Ser S 772 137 432 98 117
47 86 450 26 20 32 168 20 40 269 Thr T 590
20 169 57 10 37 31 50 14 129 52 200 28
10 73 696 Trp W 0 27 3 0 0 0 0 0
3 0 13 0 0 10 0 17 0 Tyr Y 20 3
36 0 30 0 10 0 40 13 23 10 0 260
0 22 23 6 Val V 365 20 13 17 33 27 37
97 30 661 303 17 77 10 50 43 186 0 17
A R N D C Q E G H I L K
M F P S T W Y Ala Arg Asn Asp
Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser
Thr Trp Tyr
15
Search and alignment entropy
  • Information content per position pam10 -
    3.43 bits pam120 - 0.98 bits
    pam160 - 0.70 bits pam250 - 0.38
    bits blosum62 - 0.70 bits
  • Information requirements for search -
    30 bits for alignment - 16 bit

16
Search and alignment entropy
Recommended matrices for different query length
  • Query length Substitution matrix Gap
    costs
  • lt35 PAM-30 ( 9,1)
  • 35-50 PAM-70 (10,1)
  • 50-85 BLOSUM-80 (10,1)
  • gt85 BLOSUM-62 (11,1)

17
FASTA Algorithm
1
First run (identities)
18
FASTA Algorithm
The score of the highest scoring initial region
is saved as the init1 score.
19
FASTA Algorithm
3
Joining threshold - eliminates disjointed segments
Non-overlapping regions are joined. The score
equals sum of the scores of the regions minus a
gap penalty. The score of the highest scoring
region, at the end of this step, is saved as the
initn score.
20
FASTA Algorithm
4
Alignment optimization using dynamic programming
The score for this alignment is the opt score.
21
FASTA Algorithm
FastA uses a simple linear regression against the
natural log of the search set sequence length to
calculate a normalized z-score for the sequence
pair. Using the distribution of the z-score, the
program can estimate the number of sequences that
would be expected to produce, purely by chance, a
z-score greater than or equal to the z-score
obtained in the search. This is reported as the
E() score.
22
FASTA Results
  • When init1init0opt
    100 homology over the matched stretch.
  • When initn gt init1
    more than 1 matching region in the database
    with poorly matching separating regions.
  • When opt gt initn
    the matching regions are greatly
    improved by adding gaps in one or both of the
    sequences.

23
BLAST - Basic Local Alignment Search Tool
  • Blast programs use a heuristic search algorithm.
    The programs use the statistical methods of
    Karlin and Altschul (1990,1993).
  • Blast programs were designed for fast database
    searching, with minimal sacrifice of sensitivity
    to distant related sequences.

24
BLAST Algorithm
1
Query sequence of length L
Maximium of L-w1 words (typically w 3 for
proteins)
For each word from the query sequence find the
list of words with high score using a
substitution matrix (PAM or BLOSUM)
Word list
25
BLAST Algorithm
2
Database sequences
Word list
Exact matches of words from the word list to the
database sequences
26
BLAST Algorithm
3
Maximal Segment Pairs (MSPs)
For each exact word match, alignment is extended
in both directions to find high score segments
27
Gapped BLAST
  • The Gapped Blast algorithm allows gaps to be
    introduces into the alignments. That means that
    similar regions are not broken into several
    segments.
  • This method reflects biological relationships
    much better.

28
BLAST family of programs
  • blastp - amino acid query sequence against a
    protein sequence database
  • blastn - nucleotide query sequence against a
    nucleotide sequence database
  • blastx - nucleotide query sequence translated
    in all reading frames against a protein
    database
  • tblastn - protein query sequence against a
    nucleotide sequence database dynamically
    translated in all reading frames
  • tblastx - six-frame translations of a
    nucleotide query sequence against the
    six-frame translations of a nucleotide sequence
    database.

29
Database Searches
  • Run Blast first, then depending on your results
    run a finer tool (Fasta, Smith-Waterman, etc.)
  • Where possible use translated sequence.
  • E() lt 0.05 is statistically significant, usually
    biologically interesting. Check also 0.05 lt E()
    lt10 because you might find interesting hits.
  • Pay attention to abnormal composition of the
    query sequence, it usually causes biased scoring.
  • Split large query sequence ( if gt1000 for DNA,
    gt200 for protein).
  • If the query has repeated segments, remove them
    and repeat the search.

30
Documenting the Search
  • Algorithm(s)
  • Substitution matrix
  • Gap penalty (FASTA)
  • Name of database
  • Version of database
  • Computer used

31
MULTIPLE SEQUENCE ALIGNMENT
32
Computational complexity
Alignment of protein sequences with 200 amino
acid residues
33
Multiple alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWY
QQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--
YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDF
YPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLT
CLVKGFYPSD--IAVEWESNG--
Column cost the sum of costs for all possible
pairs
34
Multiple alignment
A correct multiple alignment corresponds to an
evolutionary history no correct way to
determine practical way - to find an alignment
with the maximum score
35
Multiple sequence alignment
Given k (k gt 2) sequences, s1,, sk, each
sequence consisting of characters from an
alphabet A multiple alignment is a a rectangular
array, consisting of characters from the
alphabet A (A "-"), that satisfies the
following 3 conditions 1. There are exactly k
rows. 2. Ignoring the gap character, row number
i is exactly the sequence si. 3. Each column
contains at least one character different from
"-".
36
Consensus
Plurality - minimum number of votes for a
consensus Threshold - scoring matrix value below
which a symbol may not vote
for a coalition. Sensitivity - minimum score to
select consensus Profiles - blocks of
prealigned sequences
37
Multiple alignment algorithm
1. Pairwise alignments (progressive pairwise
alignments) 2. Distance matrix calculation 3.
Guide tree creation (hierarchical clustering) 4.
New sequence addition
38
Scoring system (distances)
Sreal(ij) - observed similarity score for two
aligned sequences i and j Siden(ij) - average
of the two scores for each sequence aligned
with itself Srand(ij) - average score determined
from 100 global randomizations of the two
sequences
The distances D(ij) are used to generate the
distance matrix from which the approximate guide
tree is generated.
39
Multiple alignment
40
Multiple alignment
Segment - line joining two vertices Each unit
m-dimensional cube in the lattice contains 2m
-1 segments
41
Multiple alignment
Alignment Path for 3 Sequences
(0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1),
(4,3,2)
42
Multiple alignment
V S N - S - S N A - - - - A S
Pairwise Projections of the Alignment
43
Alignment statistics
Rablpb Humcetp
Rabcetp Bovbpi Humlbpa
Ratlbp Maccetp Humbpi 1
2 3 4 5 6 7 8
478 67 65 19 19
18 42 43 1 0 82 80
39 39 36 64 65 0
1 0 5 5 12 2 2
327 483 58 16 16 16
39 41 2 400 0 75 38
38 35 62 63 5 0
0 5 5 12 1 1
318 284 482 18 18 17 40
43 3 390 367 0 38 38
35 64 64 4 1 0
5 5 12 1 1 96
84 95 494 95 74 20 21 4
198 192 194 0 98 84
40 41 30 29 28 0
0 7 6 5
44
Alignment score
Rablpb Humcetp
Rabcetp Bovbpi Humlbpa
Ratlbp Maccetp Humbpi
1 2 3 4 5
6 7 8 1 4077 2 5358
4129 3 5323 5650 4096 4 8103
8229 8112 4210 5 8109 8243
8118 4332 4219 6 8535 8672
8575 5511 5519 4261 7 6474
6531 6500 8103 8119 8572
4103 8 6392 6434 6378 8033
8035 8520 5508 4083 1
2 3 4 5 6 7
8
45
Alignment visualization
Identity
Summary view
46
Alignment visualization
Physico-chemical properties
Differences mode
47
Alignment visualization (tree)
48
Sequence Logos a quantitative graphical display
for binding sites and proteins
Reference Schneider, T.D. Meth. Enzym 274445,
1996
49
Sequence Logos
50
Sequence Logos
51
Multiple Alignment Programs
  • Pileup (GCG) Needleman and Wunsch algorithm for
    pairwise alignment and UPGMA method for tree
    construction
  • CLUSTAL Wilbur and Lipman algorithm for pairwise
    alignment (CABIOS 8189, 1992)
  • PIMA pattern-matching based algorithm (PNAS
    87118, 1990)
  • TreeAlign phylogenetic algorithm (Meth. Enzymol.
    18626, 1990)

52
Patterns in protein sequences
53
Regular Expressions
Patterns described in a standard way are known as
regular expressions
54
Regular Expressions
AC-x-V-x(4)-ED. Ala or Cys-any-Val-any-any-
any-any-any but Glu or Asp ...LKHVAYVFQALIYWI
K... ...AVEMAGVKYLQVQHGS... ...LYTGAIVTNNDGPYMA..
. ...KEYKCKVEKELTDICN...
55
PROSITE Database
Current version contains 1079 documentation
entries that describe 1459 different patterns,
rules and profiles/matrices ST-x(2)-DE
Casein kinase II phosphorylation site
AG-x(4)-G-K-ST ATP/GTP-binding site
motif A (P-loop) Y-x-NQH-K-DE-IVA-F-LM-R
-ED Heat shock hsp90 proteins family
signature http//www.expasy.ch/prosite
56
Blocks Database
Blocks are multiply aligned ungapped segments
corresponding to the most highly conserved
regions of proteins
N-6 Adenine-specific DNA methylases
proteins width9 seqs78
DMA_VIBCHQ08318 (85) SCTQWWPPF 77
HEMK_MYCLEP45832 (181) DLFVAQPTL 100
MT57_ECOLIP25240 (111) DGALGNPPF 13
MTC1_CHVN1Q01511 (172) NFVFLDPPY 8
MTC1_COREQP42828 (71) QLSFSCPPF 49
MTH2_HAEHAP00473 (32) KIAFFDPQY 52
MTH3_HAEINP43871 (23) HAIISDIPY 73
MTM1_MICAMP50190 (306) AAVLTNPPF 14
MTM2_MORBOP23192 (25) QLAVIDPPY 10
MTMU_MYCSPP43641 (37) QVIYADPPW 13
MTR1_RHOSHP14751 (60) QLIICDPPY
8 ....................................
http//www.blocks.fhcrc.org/
57
Pfam Database
Pfam is a large collection of multiple sequence
alignments and hidden Markov models covering
many common protein domains
Zinc finger, C2H2 type
TYY1_HUMAN/383-407 YVCPF.DGCN...KKFAQSTNLKSHILT..
.H ZG52_XENLA/61-83 YTCT...QCN...KQFSHSAQLRAHI
ST...H KRUP_DROME/306-328 YTCE...ICD...GKFSDSNQL
KSHMLV...H YKQ8_CAEEL/78-102
YKCT...VCR...KDISSSESLRTHMFKQ.HH
DEFI_CHICK/268-292 YECP...NCK...KRFSHSGSYSSHISSK
.KC ZFH1_DROME/389-413 FGCD...NCG...KRFSHSGSFSSH
MTSK.KC YL57_CAEEL/42-65 YLCY...YCG...KTLSDRLE
YQQHMLK..VH ZFA_MOUSE/542-564
FKCD...ICL...LTFSDTKEVQQHALV...H
BASO_HUMAN/719-742 FQCD...ICK...KTFKNACSVKIHHKN.
.MH HUNB_DROME/297-319 FQCD...KCS...YTCVNKSMLNSH
RKS...H SFP1_YEAST/598-623 FKCPV.IGCE...KTYKNQNG
LKYHRLH..GH ZG29_XENLA/62-84
FVCT...VCG...KTYKYKHGLNTHLHS...H
http//pfam.wustl.edu/
58
Other Motif Databases
PRINTS a compendium of protein fingerprints. A
fingerprint is a group of conserved motifs used
to characterise a protein family http//bioinf.ma
n.ac.uk/dbbrowser/PRINTS/ DOMO a protein
domain database http//www.infobiogen.fr/gracy/do
mo/home.htm ProDom a protein domain database
http//protein.toulouse.inra.fr/prodom.html
59
InterPro Database
InterPro integrated resource for the commonly
used signature databases - Pfam, PRINTS,
PROSITE, ProDom and SWISS-PROT
TrEMBL. Current release of InterPro (3.2)
contains 3939 entries, representing 1009
domains, 2850 families, 65 repeats and 15
post-translational modification
sites. http//www.ebi.ac.uk/interpro
60
InterPro Database
61
From genes to proteins
DNA
PROMOTER ELEMENTS
TRANSCRIPTION
RNA
SPLICE SITES
SPLICING
mRNA
START CODON
STOP CODON
TRANSLATION
PROTEIN
62
From genes to proteins
63
(No Transcript)
64
Chromosome 19 gene map
65
Computational Gene Prediction
  • Where the genes are unlikely to be located?
  • How do transcription factors know where to bind a
    region of DNA?
  • Where are the transcription, splicing, and
    translation start and stop signals?
  • What does coding region do (and non-coding
    regions do not) ?
  • Can we learn from examples?
  • Does this sequence look familiar?

66
Measures of Prediction Accuracy
Nucleotide Level
67
Measures of Prediction Accuracy
Exon Level
WRONGEXON
CORRECTEXON
MISSING EXON
REALITY
PREDICTION
68
Spliced Alignment (Procrustes)
  • New genomic sequence
  • Selection of candidate exons AUG --- GU initial
    exons AG --- GU internal exons AG --- UAA or
    UAG or UGA terminal exons
  • Filtration (based on the codon usge statistics)
  • Construction of all possible chains of candidate
    exons
  • Finding a chain with the maximum global
    similarity to the target protein

69
Spliced Alignment (Procrustes)
70
Predicted Exon Assembly(Procrustes)
71
PCR Primers Prediction (GenePrimer)
Exon 1085..1182 (98) hit using first 2 primers
Exon 1628..1676 (49) missed Exon 1900..2001
(102) hit using first 8 primers Exon 2110..2184
(75) missed Exon 2516..2722 (207) hit using
first 4 primers Exon 3385..3472 (88) missed
Exon 3546..3746 (201) hit using first primer ...
72
GRAIL gene identification program
73
Suboptimal Solutions for the Human Growth Hormone
Gene (GeneParser)
74
GeneMark Accuracy Evaluation
75
Bibliography http//linkage.rockefeller.edu/wli/ge
ne/list.html and http//www-hto.usc.edu/software/p
rocrustes/fans_ref/
Gene Discovery Exercise http//metalab.unc.edu/pha
rmacy/Bioinfo/Gene
Write a Comment
User Comments (0)
About PowerShow.com