Bioinformatics 101 BimCore www.bimcore.emory.edu - PowerPoint PPT Presentation

Slides: 50

Provided by: jesses6

1
Bioinformatics 101: BimCore www.bimcore.emory.edu

Bioinformatics is the computational management,
analysis and dissemination of biological
information.
3 main types of data: genome sequences, protein
sequences and structure, and functional genomics
(expression data).
Genomics: mapping, sequencing and analysis of
genomes (sequences, structural and functional).
Proteomics: qualitative and quantitative
comparison of a proteome (complete set of
proteins of an organism) under different
conditions to unravel biological processes.

2
Information available

Databases
Genome sequences - FlyBase, GenBank, EMBL,
SwissProt, PDB, ESTs, BACs, etc.
Experimental information
Published information
PubMed, MedLine

3
How to locate relevant information

Text searches
Query matches
Sort by relevance
>> sequence-to-sequence comparisons

4
Definitions

Homolog: Sequences that share a common
ancestor; may have similar function.
Paralogue: Homologous sequences within a species,
separated by a gene duplication event; may
have similar function.
Orthologue: Homologous sequences in different
species, separated by a speciation event;
probably same function.
Analog: Non-homologous proteins that have similar
folding or similar functional sites, which arose
through convergent evolution.

5
Evolution and Alignment

Similarity is an observable quantity (%
identity).
Homology is a conclusion drawn from the data:
that two genes share a common evolutionary
history.
Alignments reflect amino acid substitutions, and
insertions and deletions.
Certain regions are more conserved than others.
In biomolecular sequences (DNA, RNA, AA seqs),
high sequence similarity usually implies
significant structural and / or functional
similarity.

6
Sequence alignment based on Edit Distance

Transforming one string into another by a series
of edit operations on the individual characters.
The operations allowed are insertion of a
character in the first string, deletion of a
character from the first string, or the
substitution of a character in the first string
with a character in the second string.
To limit the transforming operations to one
string, an insertion in one string can be
considered a deletion in the other string and
vice versa.

7
Edit Distance Example

v-intner-          vintner
RIMDMDMMI          RRRMRRR
wri-t-ers          writers
Edit Distance = 5  Edit Distance = 6
(R)eplacements = 1, (I)nsertions/(D)eletions = 1,
(M)atches = 0
Edit Distance (sometimes referred to as
Levenshtein distance, Sov. Phys. Dokl. 10:707-10,
1966) is the minimum number of edit operations to
transform one string into the other.
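
The minimum over all edit scripts can be found with dynamic programming. The following is a minimal sketch, not part of the original slides; the function name `edit_distance` is illustrative, and it assumes the unit costs from the example (replacement 1, insertion/deletion 1, match 0).

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements turning a into b."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1        # (M)atch = 0, (R)eplacement = 1
            curr.append(min(prev[j] + 1,       # (D)eletion of ca
                            curr[j - 1] + 1,   # (I)nsertion of cb
                            prev[j - 1] + cost))
        prev = curr
    return prev[-1]


print(edit_distance("vintner", "writers"))  # 5, the gapped alignment on the slide
```

Only two rows of the table are kept at a time, so memory grows with the shorter dimension while time stays proportional to the product of the two lengths.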

8
Similarity Example

v-intner-          vintner
RIMDMDMMI          RRRMRRR
wri-t-ers          writers
Similarity = 4     Similarity = 1
(R)eplacements = 0, (I)nsertions/(D)eletions = 0,
(M)atches = 1
Similarity is the value of the alignment that
maximizes the total alignment score.
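
Both the edit-distance and the similarity numbers come from the same column-by-column sum with different weights. A small sketch of that sum, assuming `-` marks a gap (the function and parameter names are illustrative, not from the slides):

```python
def score_alignment(top: str, bottom: str, match: int, mismatch: int, indel: int) -> int:
    """Sum the per-column scores of two aligned strings of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel      # (I)nsertion or (D)eletion column
        elif x == y:
            total += match      # (M)atch column
        else:
            total += mismatch   # (R)eplacement column
    return total


gapped = ("v-intner-", "wri-t-ers")
ungapped = ("vintner", "writers")

# Similarity weights from the slide: match 1, everything else 0.
print(score_alignment(*gapped, match=1, mismatch=0, indel=0))    # 4
print(score_alignment(*ungapped, match=1, mismatch=0, indel=0))  # 1

# Edit-distance weights: match 0, everything else 1.
print(score_alignment(*gapped, match=0, mismatch=1, indel=1))    # 5
print(score_alignment(*ungapped, match=0, mismatch=1, indel=1))  # 6
```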

9
DNA Scoring Scheme

Default Scoring Matrix (Smith Waterman) for DNA
    A   B   C   D   G   H   K   M   N   R   S   T   U   V   W   X   Y
A  10  -9  -9  10  -9  10  -9  10  10  10  -9  -9  -9  10  10  10  -9
B  -9  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
C  -9  10  10  -9  -9  10  -9  10  10  -9  10  -9  -9  10  -9  10  10
D  10  10  -9  10  10  10  10  10  10  10  10  10  10  10  10  10  10
G  -9  10  -9  10  10  -9  10  -9  10  10  10  -9  -9  10  -9  10  -9
H  10  10  10  10  -9  10  10  10  10  10  10  10  10  10  10  10  10
K  -9  10  -9  10  10  10  10  -9  10  10  10  10  10  10  10  10  10
M  10  10  10  10  -9  10  -9  10  10  10  10  -9  -9  10  10  10  10
N  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
R  10  10  -9  10  10  10  10  10  10  10  10  -9  -9  10  10  10  -9
S  -9  10  10  10  10  10  10  10  10  10  10  -9  -9  10  -9  10  10
T  -9  10  -9  10  -9  10  10  -9  10  -9  -9  10  10  -9  10  10  10
U  -9  10  -9  10  -9  10  10  -9  10  -9  -9  10  10  -9  10  10  10
V  10  10  10  10  10  10  10  10  10  10  10  -9  -9  10  10  10  10
W  10  10  -9  10  -9  10  10  10  10  10  -9  10  10  10  10  10  10
X  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
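
The rows shown appear consistent with a simple rule: two symbols score 10 when their IUPAC ambiguity codes can stand for at least one common base, and -9 otherwise. The sketch below encodes that rule; it is an interpretation of the table, not something the slides state, and it assumes U is treated as T and X as "any base".

```python
# Standard IUPAC nucleotide ambiguity codes. Treating U as T and X as N
# is an assumption made here to match the columns of the slide's table.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "X": "ACGT",
}


def dna_score(x: str, y: str, match: int = 10, mismatch: int = -9) -> int:
    """Score two symbols: match if their base sets overlap, else mismatch."""
    return match if set(IUPAC[x]) & set(IUPAC[y]) else mismatch


columns = "ABCDGHKMNRSTUVWXY"
print([dna_score("A", c) for c in columns])  # reproduces row A of the table
```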

10
Amino Acid Scoring Schemes

Identity scoring
Genetic code scoring
Chemical similarity
Observed substitutions

11
Amino acid substitution matrix

Align protein structures or sequences of known
functionally similar proteins.
Look at the frequency of amino acid
substitutions.
Compute a log-odds ratio matrix:
M(ai,aj) = log(observed freq(ai,aj) / expected freq(ai,aj))
+ is more likely than random
- is less likely than random
0 is at the random base rate
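
A toy version of the log-odds calculation, as a hedged sketch: the input pairs are made up, the function name is illustrative, the logarithm base (2) is an assumption since the slide only says "log", and real matrices additionally scale and round the values.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Log-odds scores from (a, b) residue pairs read off alignment columns."""
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[frozenset((a, b))] += 1   # unordered: (a, b) equals (b, a)
        residue_counts[a] += 1
        residue_counts[b] += 1
    n_pairs = sum(pair_counts.values())
    n_residues = sum(residue_counts.values())

    scores = {}
    for pair, count in pair_counts.items():
        residues = sorted(pair)
        a, b = residues[0], residues[-1]      # a == b for identical pairs
        observed = count / n_pairs
        pa = residue_counts[a] / n_residues
        pb = residue_counts[b] / n_residues
        # Chance of seeing the unordered pair if residues paired at random.
        expected = pa * pb if a == b else 2 * pa * pb
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Hypothetical data: 6 A/A columns, 2 A/B columns, 2 B/B columns.
pairs = [("A", "A")] * 6 + [("A", "B")] * 2 + [("B", "B")] * 2
for pair, score in sorted(log_odds_matrix(pairs).items()):
    print(pair, round(score, 2))
```

In this toy set the identical pairs come out positive (seen more often than chance) and the A/B pair negative (seen less often than chance).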

12
PAM: Percent Accepted Mutations

Globally align protein sequences (at least 85%
identity)
Calculate the frequency of amino acid
substitutions
Divide by the normalized frequency of amino acids
->
log ratios.
A 1 PAM matrix specifies a unit of evolutionary
change: 1 accepted point mutation per 100
residues.
Create a higher PAM matrix by multiplying the 1
PAM mutation probability matrix by itself
repeatedly (e.g. 250 times for PAM 250).
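
The extrapolation step is plain matrix multiplication of mutation probabilities. The sketch below uses a made-up 3-letter alphabet so the numbers stay readable; the matrix values and helper names are hypothetical, and the conversion of the resulting probabilities to log-odds scores is left out.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Multiply m by itself `power` times, starting from the identity matrix."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical "1 PAM" matrix: row i gives the probability that residue i
# is replaced by each residue; about 99% of residues stay unchanged.
pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

pam250 = mat_pow(pam1, 250)
for row in pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1 after the multiplication, while the diagonal shrinks: at 250 PAM a residue is far less likely to be unchanged than at 1 PAM.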

13
PAM 250 matrix

    A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   Z
A   2   0  -2   0   0  -4   1  -1  -1  -1  -2  -1   0   1   0  -2   1   1   0  -6  -3   0
B   0   2  -4   3   2  -5   0   1  -2   1  -3  -2   2  -1   1  -1   0   0  -2  -5  -3   2
C  -2  -4  12  -5  -5  -4  -3  -3  -2  -5  -6  -5  -4  -3  -5  -4   0  -2  -2  -8   0  -5
D   0   3  -5   4   3  -6   1   1  -2   0  -4  -3   2  -1   2  -1   0   0  -2  -7  -4   3
E   0   2  -5   3   4  -5   0   1  -2   0  -3  -2   1  -1   2  -1   0   0  -2  -7  -4   3
F  -4  -5  -4  -6  -5   9  -5  -2   1  -5   2   0  -4  -5  -5  -4  -3  -3  -1   0   7  -5
G   1   0  -3   1   0  -5   5  -2  -3  -2  -4  -3   0  -1  -1  -3   1   0  -1  -7  -5  -1
H  -1   1  -3   1   1  -2  -2   6  -2   0  -2  -2   2   0   3   2  -1  -1  -2  -3   0   2
I  -1  -2  -2  -2  -2   1  -3  -2   5  -2   2   2  -2  -2  -2  -2  -1   0   4  -5  -1  -2
K  -1   1  -5   0   0  -5  -2   0  -2   5  -3   0   1  -1   1   3   0   0  -2  -3  -4   0
L  -2  -3  -6  -4  -3   2  -4  -2   2  -3   6   4  -3  -3  -2  -3  -3  -2   2  -2  -1  -3
M  -1  -2  -5  -3  -2   0  -3  -2   2   0   4   6  -2  -2  -1   0  -2  -1   2  -4  -2  -2
N   0   2  -4   2   1  -4   0   2  -2   1  -3  -2   2  -1   1   0   1   0  -2  -4  -2   1
P   1  -1  -3  -1  -1  -5  -1   0  -2  -1  -3  -2  -1   6   0   0   1   0  -1  -6  -5   0
Q   0   1  -5   2   2  -5  -1   3  -2   1  -2  -1   1   0   4   1  -1  -1  -2  -5  -4   3
R  -2  -1  -4  -1  -1  -4  -3   2  -2   3  -3   0   0   0   1   6   0  -1  -2   2  -4   0
S   1   0   0   0   0  -3   1  -1  -1   0  -3  -2   1   1  -1   0   2   1  -1  -2  -3   0
T   1   0  -2   0   0  -3   0  -1   0   0  -2  -1   0   0  -1  -1   1   3   0  -5  -3  -1

14
BLOSUM: Blocks Substitution Matrix

Align ungapped protein sequences from the BLOCKS
database
Look at the frequency of amino acid
substitutions.
Compute a log-odds ratio matrix (i.e. as in the
PAM calculations)
Higher BLOSUM: cluster groups by varying
similarity and recalculate the matrix.
The number following indicates the percent
identity within the set: BLOSUM62 = 62% identity.

15
BLOSUM62 matrix

    A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
A   4  -2   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -1  -2  -1
B  -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
C   0  -3   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -1  -2  -4
D  -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
E  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5
F  -2  -3  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1  -1   3  -3
G   0  -1  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -1  -3  -2
H  -2  -1  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2  -1   2   0
I  -1  -3  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -1  -3
K  -1  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -1  -2   1
L  -1  -4  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -1  -3
M  -1  -3  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -1  -2
N  -2   1  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -1  -2   0
P  -1  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -1  -3  -1
Q  -1   0  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1  -1   2
R  -1  -2  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -1  -2   0
S   1   0  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -1  -2   0
T   0  -1  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -1  -2  -1

16
Which matrix?

BLOSUM finds short, highly similar sequences.
BLOSUM is usually best for local similarity
searches.
BLOSUM62 is the default for BLAST.
BLOSUM80        BLOSUM62        BLOSUM45
PAM1            PAM120          PAM250
Less divergent       <->        More divergent

17
Which matrix?

If sequences are thought to be related, a PAM250
is best.
When sequences are not known to be related,
PAM120 tends to give more sensitivity.
To pick up short segments of highly similar
sequences, PAM40 is a good choice.

18
Global Alignment

Optimal Alignment over the entire length of both
sequences.

19
Global Alignment

Needleman-Wunsch Algorithm
Every residue of the two sequences has to
participate.
Guaranteed to calculate an optimal similarity
score.
Cannot detect domains.
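
A compact sketch of Needleman-Wunsch global alignment follows. It assumes a simple linear gap penalty and match/mismatch scores rather than a substitution matrix; the default values and the function name are illustrative choices, not taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap              # a[:i] aligned entirely against gaps
    for j in range(1, cols):
        score[0][j] = j * gap              # b[:j] aligned entirely against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner: every residue participates.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
# With the similarity weights of the earlier example (match 1, others 0):
print(needleman_wunsch("vintner", "writers", match=1, mismatch=0, gap=0)[0])  # 4
```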

20
Local Alignment

Locates the highest scoring alignment regardless
of position and length.

21
Local Alignment

Smith-Waterman Algorithm
Guaranteed to find the optimal local alignment,
so no significant match to a given query is missed.
Can find regions of strong similarity (domains).
Computationally expensive.
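
Smith-Waterman differs from the global method in one key way: a cell may reset to 0, so an alignment can start and end anywhere. A minimal scoring-only sketch, assuming a linear gap penalty and illustrative default scores (traceback omitted):

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best_score, end_in_a, end_in_b) of the best local alignment.

    The end positions are 1-based indices of the last aligned residues.
    """
    prev = [0] * (len(b) + 1)
    best = (0, 0, 0)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a new local alignment start at this cell.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            if cell > best[0]:
                best = (cell, i, j)
        prev = curr
    return best


print(smith_waterman("TTTACGTTT", "GGACGGG"))  # (3, 6, 5): the shared "ACG"
```

Every cell of the query-by-subject table is filled, which is why the method is expensive when a query is compared against a whole database.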

22
BLAST: Basic Local Alignment Search Tool

Objective: find all local regions of similarity
distinguishable from random.
Local alignments
Gaps permitted
Statistically sound, but no guarantee of
optimality
Fast
Less sensitive for (shorter) sequences.

23
BLAST: Three-step algorithm

1. Compile a list of high scoring words of length w
(w = 4 for proteins, w = 12 for nucleic acids)
2. Scan the database for word hits with score
greater than threshold T
3. Extend each word hit in both directions to find
High Scoring Pairs with scores greater than S.
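
The seeding idea behind the first two steps can be sketched with exact word matches. This is a simplification: it ignores the threshold T (real protein searches also accept similar, non-identical words) and leaves out the extension step, and the function name is illustrative.

```python
from collections import defaultdict


def word_hits(query, subject, w):
    """Return sorted (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    # Step 1 (simplified): index every word of length w in the query.
    index = defaultdict(list)
    for q in range(len(query) - w + 1):
        index[query[q:q + w]].append(q)

    # Step 2 (simplified): scan the subject and look each word up in the index.
    hits = []
    for s in range(len(subject) - w + 1):
        for q in index.get(subject[s:s + w], ()):
            hits.append((q, s))
    return sorted(hits)


print(word_hits("ACGTACGT", "TTACGTAA", w=4))
# [(0, 2), (1, 3), (3, 1), (4, 2)]
```

Hits that share the same diagonal (subject position minus query position), such as (0, 2) and (1, 3) here, are the natural candidates for extension into a High Scoring Pair.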

24
BLAST Word size

A word is any short sequence < 6 letters
Protein (1 - 2), nucleotide (1 - 6)
A larger word size results in a search that is:
Faster
Less sensitive
More selective

25
Other BLAST programs

Program   Query            Database
BLASTN    nucleic acid     nucleic acid
BLASTP    protein          protein
BLASTX    translated NA    protein
TBLASTN   protein          translated NA
TBLASTX   translated NA    translated NA

29
Blast Databases

Nucleotide Sequence Databases
nr: All GenBank+RefSeq Nucleotides+EMBL+DDBJ+PDB
sequences (but no EST, STS, GSS, or phase 0, 1 or
2 HTGS sequences). No longer "non-redundant".
est: Database of GenBank+EMBL+DDBJ sequences from
EST Divisions
est_human, est_mouse, est_others
gss: Genome Survey Sequence, includes
single-pass genomic data, exon-trapped sequences,
and Alu PCR sequences.
htgs: Unfinished High Throughput Genomic Sequences,
phases 0, 1 and 2 (finished, phase 3 HTG
sequences are in nr)
pat: Nucleotides from the Patent division of GenBank.
mito: Database of mitochondrial sequences
vector: Vector subset of GenBank(R), NCBI, in
ftp://ftp.ncbi.nih.gov/blast/db/
pdb: Sequences from the 3-dimensional structure
from Brookhaven Protein Data Bank
GENOMES (yeast, E. coli, Drosophila genome)
month: All new or revised GenBank+EMBL+DDBJ+PDB
sequences released in the last 30 days.
alu: Select Alu repeats from REPBASE, suitable
for masking Alu repeats from query sequences.
dbsts: Database of GenBank+EMBL+DDBJ sequences
from STS Divisions.
chromosome: Searches Complete Genomes, Complete
Chromosome, or contigs from the NCBI Reference
Sequence project.

30
Blast Databases

Peptide Sequence Databases
nr: All non-redundant GenBank CDS
translations+RefSeq Proteins+PDB+SwissProt+PIR+PRF
swissprot: Last major release of the SWISS-PROT
protein sequence database (no updates)
pat: Proteins from the Patent division of GenPept.
pdb: Sequences derived from the 3-dimensional
structure from Brookhaven Protein Data Bank
yeast: Yeast (Saccharomyces cerevisiae) genomic
CDS translations
ecoli: Escherichia coli genomic CDS translations
Drosophila genome: Drosophila genome proteins
provided by Celera and Berkeley Drosophila
Genome Project (BDGP).
month: All new or revised GenBank CDS
translation+PDB+SwissProt+PIR+PRF released in the
last 30 days.

31
Other Blast Programs

MegaBlast - optimized for aligning sequences that
differ slightly as a result of sequencing or
other similar "errors".
Uses a larger word size (16)
Is up to 10 times faster than more common
sequence similarity programs
Able to efficiently handle much longer DNA
sequences
Non-affine gapping parameters (open = 0,
extension = variable)

32
Other Blast Programs

TraceBlast - optimized for cross species
comparisons.
word size (11)
Expect value (10)

33
Other BLAST programs

Gapped BLAST (BLAST 2.0)
Extends words from no-gap to gap, generating
gapped alignments
PSI-BLAST
Position-Specific Iterated BLAST: uses gapped
BLAST to generate a profile over multiple
iterations; the profile is then used in place of
the input sequence and scoring matrix.

34
Limitations of BLAST

Needs islands of strong homology
Limits on the combination of scoring and penalty
values
Variants (blastx, tblastn, tblastx) use 6-frame
translation, yet still miss sequences with
frameshifts, etc.
Finds and reports only local alignments.
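
To make the 6-frame translation and its frameshift limitation concrete, here is a small sketch. It assumes the standard genetic code and upper- or lower-case A/C/G/T input only (ambiguity codes would raise a KeyError); the names `translate` and `six_frames` are illustrative and not taken from any BLAST implementation.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Translate the three forward and three reverse-complement reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


coding = "ATGCTAGGCGCTTCTATCTTC"             # coding sequence from this deck
print(six_frames(coding)["+1"])              # MLGASIF

shifted = coding[:8] + "G" + coding[8:]      # one inserted base (a frameshift)
frames = six_frames(shifted)
print(frames["+1"], frames["+2"])            # MLGRFYL C*GASIF
```

After the inserted base, the start of the protein is in frame +1 and the rest in frame +2, so no single translated frame contains the whole protein. That is how a frameshift can cause a translated search to miss or truncate a match.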

35
Rules of Thumb

For short amino acid sequences (20 - 40 residues),
50% identity can happen by chance.
(Increase expect value and decrease word size.)
If A and B are homologous, and B and C are
homologous, then A and C are homologous, even if
you cannot see it.
Locate and filter regions of low complexity.
They lead to false positive alignments
(similarity without homology).

36
FASTA: Fast Alignment

Rapid global alignment
Allows alignment to shift frames
Not a strong mathematical basis

37
FASTA

Show diagrams.

38
LALIGN

A FASTA derivative for local alignments
Compares two protein sequences to identify
regions of similarity
Will report several sequence alignments within a
given sequence
Works for internal repeats that are missed by
FASTA because of gaps.

39
Precomputed Alignments

Related sequences, related structures, related
articles, summaries, etc.
InterPro
Pfam
ProDom
Smart
etc.

42
BimCore: BioMolecular Computing Resource

Founded in 1992 as a subscription-based
computing support service for Emory
bioinformatics-based research.
Bioinformatics is the MIS for molecular
biology.
Our mission is to serve as a human interface
between researchers and computing technology
enabling investigators to refine scope of
biological investigations and accelerate
discovery.

43
BimCore: BioMolecular Computing Resource

Sequence Analysis, Molecular Modeling,
Microarray Analysis
BimCore provides:
Software and computational hardware (evaluate
software, purchase Emory site licenses)
Training (workshops, courses and online
tutorials)
Collaborations (evaluation, direct approach,
training and interpretation)
Computer expertise (programming, evaluate
needs, direct purchase and set-up)

44
The Biology of Molecular Biology, Robert J. Huskey
http://www.people.virginia.edu/~rjh9u/humbiol.html
45
Sequence Analysis Facility

DNA:     ...ATATAA...GTA...ATGCTAGGCGCTTCTATCTTC
         ..................UACGAUCCGCGAAGAUAGAAG

RNA:     UAC GAU CCG CGA AGA UAG AAG...
protein: M L G A S I F ..
46
Sequence Analysis Facility

Given a sequence (DNA or protein) search against
biological databases for similarities in type
and function.
Generate and review sequence alignments.
Discover evolutionary path of a DNA or protein
sequence (phylogeny).
Given a sequence of DNA (e.g. ACGTGTGGG),
locate genes and mutations.
Assemble sequence fragments into larger
component sequence.

47
Molecular Modeling Center

Analyze protein structure, interpret the
mechanism of action.
Generate a model structure by Homology
Modeling.
Mutate a protein structure, predict the effect on
the protein.
Design inhibition of protein function, drug
design and molecular docking. Emerson

48
Microarray Data Analysis Facility