Title: Bioinformatics 101 BimCore www.bimcore.emory.edu
1Bioinformatics 101 BimCore www.bimcore.emory.edu
- Bioinformatics is the computational management,
analysis and dissemination of biological
information.
- 3 main types of data: genome sequences, protein
sequences and structure, and functional genomics
(expression data).
- Genomics: mapping, sequencing and analysis of
genomes (sequences, structural and functional).
- Proteomics: qualitative and quantitative
comparison of a proteome (complete set of
proteins of an organism) under different
conditions to unravel biological processes.
2Information available
- Databases
- Genome sequences: Flybase, GenBank, EMBL,
- SwissProt, PDB, ESTs, BACs, etc.
- Experimental information
- Published information
- PubMed, MedLine
3How to locate relevant information
- Text searches
- Query matches
- Sort by relevance
- >> sequence to sequence comparisons
4Definitions
- Homolog: Sequences that share a common
ancestor; may have similar function.
- Paralogue: Similar sequences within a species
(homologs that arose by gene duplication); may
have similar function.
- Orthologue: The same sequence in different
species, separated by a speciation event;
probably same function.
- Analog: Non-homologous proteins that have similar
folding or similar functional sites, which arose
through convergent evolution.
5Evolution and Alignment
- Similarity is an observable quantity (%
identity).
- Homology is a conclusion drawn from the data:
two genes share a common evolutionary
history.
- Alignments reflect amino acid substitutions, and
insertions and deletions.
- Certain regions are more conserved than others.
- In biomolecular sequences (DNA, RNA, AA seqs),
high sequence similarity usually implies
significant structural and / or functional
similarity.
6Sequence alignment based on Edit Distance
- Transforming one string into another by a series
of edit operations on the individual characters.
- The operations allowed are insertion of a
character in the first string, deletion of a
character from the first string, or the
substitution of a character in the first string
with a character in the second string.
- To limit the transforming operations to one
string, an insertion in one string can be
considered a deletion in the other string and
vice versa.
7Edit Distance Example
- v-intner-            vintner
- RIMDMDMMI            RRRMRRR
- wri-t-ers            writers
- Edit Distance = 5    Edit Distance = 6
- (R)eplacements = 1, (I)nsertions/(D)eletions = 1,
(M)atches = 0
- Edit Distance (sometimes referred to as
Levenshtein distance, Sov. Phys. Dokl. 10:707-10,
1966) is the minimum number of edit operations to
transform one string into the other.
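The definition above can be sketched as a short dynamic program. This is a minimal illustration, not a reference implementation; the function name `edit_distance` is chosen here for the example, and every replacement, insertion and deletion is assumed to cost 1, as on the slide.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: each replacement, insertion or deletion costs 1."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            replace_cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + replace_cost))  # match or replace
        prev = curr
    return prev[-1]


print(edit_distance("vintner", "writers"))
```

For the slide's pair of words the minimum is reached by the gapped alignment on the left (5 operations); the ungapped alignment on the right needs 6.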
8Similarity Example
- v-intner-            vintner
- RIMDMDMMI            RRRMRRR
- wri-t-ers            writers
- Similarity = 4       Similarity = 1
- (R)eplacements = 0, (I)nsertions/(D)eletions = 0,
(M)atches = 1
- Similarity is the value of the alignment that
maximizes the total alignment value.
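Both scoring schemes can be applied to a fixed alignment with the same small routine. The sketch below is illustrative only: `score_alignment` and its parameter names are assumptions made for this example, and "-" is assumed to mark a gap.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, replace: int = 0, indel: int = 0) -> int:
    """Sum per-column scores of two already aligned, equal-length strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel      # insertion or deletion column
        elif x == y:
            total += match      # identical characters
        else:
            total += replace    # replacement column
    return total


# Similarity scheme from this slide (match = 1, everything else 0).
print(score_alignment("v-intner-", "wri-t-ers"))
print(score_alignment("vintner", "writers"))
# Edit-distance scheme (match = 0, replace = 1, indel = 1).
print(score_alignment("v-intner-", "wri-t-ers", match=0, replace=1, indel=1))
```

With the similarity scheme the two alignments score 4 and 1; with the edit-distance scheme they cost 5 and 6.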
9DNA Scoring Scheme
- Default Scoring Matrix (Smith Waterman) for DNA
      A   B   C   D   G   H   K   M   N   R   S   T   U   V   W   X   Y
  A  10  -9  -9  10  -9  10  -9  10  10  10  -9  -9  -9  10  10  10  -9
  B  -9  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
  C  -9  10  10  -9  -9  10  -9  10  10  -9  10  -9  -9  10  -9  10  10
  D  10  10  -9  10  10  10  10  10  10  10  10  10  10  10  10  10  10
  G  -9  10  -9  10  10  -9  10  -9  10  10  10  -9  -9  10  -9  10  -9
  H  10  10  10  10  -9  10  10  10  10  10  10  10  10  10  10  10  10
  K  -9  10  -9  10  10  10  10  -9  10  10  10  10  10  10  10  10  10
  M  10  10  10  10  -9  10  -9  10  10  10  10  -9  -9  10  10  10  10
  N  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
  R  10  10  -9  10  10  10  10  10  10  10  10  -9  -9  10  10  10  -9
  S  -9  10  10  10  10  10  10  10  10  10  10  -9  -9  10  -9  10  10
  T  -9  10  -9  10  -9  10  10  -9  10  -9  -9  10  10  -9  10  10  10
  U  -9  10  -9  10  -9  10  10  -9  10  -9  -9  10  10  -9  10  10  10
  V  10  10  10  10  10  10  10  10  10  10  10  -9  -9  10  10  10  10
  W  10  10  -9  10  -9  10  10  10  10  10  -9  10  10  10  10  10  10
  X  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
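The values in this matrix appear to follow a simple rule: two IUPAC nucleotide codes score 10 when the sets of bases they stand for overlap, and -9 otherwise. The sketch below encodes that reading. It is an interpretation of the table, not the documented behaviour of any particular program; `dna_score` is a name made up for the example, and treating U as T and X as "any base" are assumptions.

```python
# Bases represented by each IUPAC nucleotide code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "U": "T",                       # assumption: U is scored like T
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
    "X": "ACGT",                    # assumption: X means "any base"
}


def dna_score(x: str, y: str, match: int = 10, mismatch: int = -9) -> int:
    """Score two IUPAC codes: match if they can denote the same base."""
    overlap = set(IUPAC[x.upper()]) & set(IUPAC[y.upper()])
    return match if overlap else mismatch


print(dna_score("A", "R"), dna_score("A", "Y"))
```

A is compatible with R (A or G) but not with Y (C or T), which matches the A row of the table.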
10Amino Acid Scoring Schemes
- Identity scoring
- Genetic code scoring
- Chemical similarity
- Observed substitutions
11Amino acid substitution matrix
- Align protein structures or sequences of known
functionally similar proteins.
- Look at the frequency of amino acid
substitutions.
- Compute a log-odds ratio matrix
- M(ai,aj) = log(observed freq(ai,aj) / expected
freq(ai,aj))
- + is more likely than random
- - is less likely than random
- 0 is at the random base rate
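The log-odds calculation can be illustrated on a toy set of aligned residue pairs. This is a sketch under stated assumptions, not the procedure behind any published matrix: `log_odds_matrix` is a hypothetical helper, expected frequencies are derived from the residue frequencies in the same data, and logs are taken in base 2.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Return {(a, b): log2(observed / expected)} for unordered residue pairs."""
    # Count each aligned column as an unordered pair, e.g. ("A", "G").
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    total_pairs = sum(pair_counts.values())

    # Background residue frequencies, taken from the same columns.
    residue_counts = Counter()
    for (a, b), n in pair_counts.items():
        residue_counts[a] += n
        residue_counts[b] += n
    total_residues = 2 * total_pairs

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        pa = residue_counts[a] / total_residues
        pb = residue_counts[b] / total_residues
        # Chance of seeing the unordered pair if residues paired at random.
        expected = pa * pa if a == b else 2 * pa * pb
        scores[(a, b)] = math.log2(observed / expected)
    return scores


toy_pairs = [("A", "A")] * 8 + [("A", "G")] * 2 + [("G", "G")] * 10
for pair, score in sorted(log_odds_matrix(toy_pairs).items()):
    print(pair, round(score, 2))
```

In the toy data identical pairs are over-represented, so they get positive scores, while the A/G pair is rarer than chance and scores negative.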
12PAM Percent Accepted Mutations
- Globally align protein sequences (at least 85%
identity)
- Calculate the frequency of amino acid
substitutions
- Divide by the normalized frequency of amino acids
-> log ratios.
- A 1 PAM matrix specifies a unit of evolutionary
change: 1 accepted point mutation per 100
residues.
- Create a higher PAM matrix by multiplying the 1
PAM by itself.
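The "multiply PAM 1 by itself" step applies to the mutation probability matrix; the log-odds scores are computed afterwards from the resulting probabilities. The sketch below shows the idea on a made-up two-letter alphabet. `PAM1_TOY`, `FREQS` and the helper names are assumptions for illustration and are not real PAM data.

```python
import math

# Toy PAM 1 mutation probabilities: 1% chance that a residue is replaced.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]
FREQS = [0.5, 0.5]  # assumed background frequencies of the two letters


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def log_odds(prob, freqs):
    """Convert mutation probabilities into 10 * log10 odds scores."""
    return [[10 * math.log10(prob[i][j] / freqs[j]) for j in range(len(freqs))]
            for i in range(len(freqs))]


pam250_toy = mat_power(PAM1_TOY, 250)
print(pam250_toy)
print(log_odds(pam250_toy, FREQS))
```

After 250 steps the toy matrix has drifted close to the background frequencies, which is why high-numbered PAM matrices suit more divergent sequences.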
13PAM 250 matrix
      A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   Z
  A   2   0  -2   0   0  -4   1  -1  -1  -1  -2  -1   0   1   0  -2   1   1   0  -6  -3   0
  B   0   2  -4   3   2  -5   0   1  -2   1  -3  -2   2  -1   1  -1   0   0  -2  -5  -3   2
  C  -2  -4  12  -5  -5  -4  -3  -3  -2  -5  -6  -5  -4  -3  -5  -4   0  -2  -2  -8   0  -5
  D   0   3  -5   4   3  -6   1   1  -2   0  -4  -3   2  -1   2  -1   0   0  -2  -7  -4   3
  E   0   2  -5   3   4  -5   0   1  -2   0  -3  -2   1  -1   2  -1   0   0  -2  -7  -4   3
  F  -4  -5  -4  -6  -5   9  -5  -2   1  -5   2   0  -4  -5  -5  -4  -3  -3  -1   0   7  -5
  G   1   0  -3   1   0  -5   5  -2  -3  -2  -4  -3   0  -1  -1  -3   1   0  -1  -7  -5  -1
  H  -1   1  -3   1   1  -2  -2   6  -2   0  -2  -2   2   0   3   2  -1  -1  -2  -3   0   2
  I  -1  -2  -2  -2  -2   1  -3  -2   5  -2   2   2  -2  -2  -2  -2  -1   0   4  -5  -1  -2
  K  -1   1  -5   0   0  -5  -2   0  -2   5  -3   0   1  -1   1   3   0   0  -2  -3  -4   0
  L  -2  -3  -6  -4  -3   2  -4  -2   2  -3   6   4  -3  -3  -2  -3  -3  -2   2  -2  -1  -3
  M  -1  -2  -5  -3  -2   0  -3  -2   2   0   4   6  -2  -2  -1   0  -2  -1   2  -4  -2  -2
  N   0   2  -4   2   1  -4   0   2  -2   1  -3  -2   2  -1   1   0   1   0  -2  -4  -2   1
  P   1  -1  -3  -1  -1  -5  -1   0  -2  -1  -3  -2  -1   6   0   0   1   0  -1  -6  -5   0
  Q   0   1  -5   2   2  -5  -1   3  -2   1  -2  -1   1   0   4   1  -1  -1  -2  -5  -4   3
  R  -2  -1  -4  -1  -1  -4  -3   2  -2   3  -3   0   0   0   1   6   0  -1  -2   2  -4   0
  S   1   0   0   0   0  -3   1  -1  -1   0  -3  -2   1   1  -1   0   2   1  -1  -2  -3   0
  T   1   0  -2   0   0  -3   0  -1   0   0  -2  -1   0   0  -1  -1   1   3   0  -5  -3  -1
14BLOSUM Block Substitution Matrix
- Align ungapped protein sequences from the BLOCKS
database
- Look at the frequency of amino acid
substitutions.
- Compute a log-odds ratio matrix (i.e., as in the
PAM calculations)
- Higher BLOSUM: cluster groups by varying
similarity and recalculate the matrix.
- Number following indicates percent identity
within the set, BLOSUM62 = 62% identity.
15BLOSUM62 matrix
      A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
  A   4  -2   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -1  -2  -1
  B  -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
  C   0  -3   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -1  -2  -4
  D  -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
  E  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5
  F  -2  -3  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1  -1   3  -3
  G   0  -1  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -1  -3  -2
  H  -2  -1  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2  -1   2   0
  I  -1  -3  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -1  -3
  K  -1  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -1  -2   1
  L  -1  -4  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -1  -3
  M  -1  -3  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -1  -2
  N  -2   1  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -1  -2   0
  P  -1  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -1  -3  -1
  Q  -1   0  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1  -1   2
  R  -1  -2  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -1  -2   0
  S   1   0  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -1  -2   0
  T   0  -1  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -1  -2  -1
16Which matrix
- BLOSUM finds short, highly similar sequences.
- BLOSUM is usually best for local similarity
searches.
- BLOSUM62 is the default for BLAST.
- BLOSUM80          BLOSUM62          BLOSUM45
- PAM1              PAM120            PAM250
- Less divergent                      More divergent
17Which matrix
- If sequences are thought to be related, a PAM250
is best.
- When sequences are not known to be related,
PAM120 tends to give more sensitivity.
- To pick up short segments of highly similar
sequences, PAM40 is a good choice.
18 Global Alignment
- Optimal Alignment over the entire length of both
sequences.
19Global Alignment
- Needleman-Wunsch Algorithm
- Every residue of the two sequences has to
participate.
- Guaranteed to calculate an optimal similarity
score.
- Cannot detect domains.
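A compact way to see why every residue "has to participate" is the global alignment recurrence itself. The sketch below computes only the optimal score, with a linear gap penalty; the function name and the default match, mismatch and gap values are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -1) -> int:
    """Optimal global alignment score of a and b (linear gap penalty)."""
    # First row: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] with the empty prefix of b
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diagonal,            # align ca with cb
                            prev[j] + gap,       # ca against a gap
                            curr[j - 1] + gap))  # cb against a gap
        prev = curr
    return prev[-1]  # bottom-right cell: both sequences fully consumed


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("vintner", "writers", match=1, mismatch=0, gap=0))
```

With match = 1 and no penalties, the second call reproduces the similarity of 4 from the earlier vintner/writers example.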
20Local Alignment
- Locates the highest scoring alignment regardless
of position and length.
21Local Alignment
- Smith-Waterman Algorithm
- Guaranteed to find all significant matches to a
given query.
- Can find regions of strong similarity (domains).
- Computationally expensive.
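Local alignment differs from the global recurrence in two ways: a cell is never allowed to drop below zero, and the answer is the best cell anywhere in the table. A minimal score-only sketch follows; the function name and the scoring values are illustrative assumptions, and the gap penalty is linear.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1,
                         gap: int = -1) -> int:
    """Best local alignment score between any substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            # The 0 lets an alignment restart anywhere, which is what makes
            # the result local rather than global.
            cell = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(smith_waterman_score("XXXGATTACAYYY", "ZZGATTACAWW"))
```

The shared GATTACA segment is found regardless of the unrelated flanking characters. The table has len(a) x len(b) cells, which is the source of the computational cost noted above.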
22BLAST Basic Local Alignment Search Tool
- Objective: find all local regions of similarity
distinguishable from random.
- Local alignments
- Gaps permitted
- Statistically sound, but no guarantee of
optimality
- Fast
- Less sensitive for (shorter) sequences.
23BLAST Three step algorithm
- Compile a list of high scoring words of length w
(w = 4 for proteins, w = 12 for nucleic acids)
- Scan the database for word hits with score
greater than threshold T
- Extend each word hit in both directions to find
High-scoring Segment Pairs (HSPs) with scores
greater than S.
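The three steps can be sketched in a heavily simplified form. This is not the BLAST implementation: only exact word matches are used as seeds (real BLAST also accepts similar "neighborhood" words scoring above T), extension is ungapped, and the function names, scores and `x_drop` cutoff are assumptions for the example.

```python
from collections import defaultdict


def find_seeds(query: str, subject: str, w: int):
    """Steps 1-2: index the query words of length w, then scan the subject."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))  # word at query[i:] matches subject[j:]
    return hits


def extend_ungapped(query, subject, i, j, w, match=1, mismatch=-1, x_drop=2):
    """Step 3: extend a seed both ways; return (score, q_start, q_end)."""
    score = best = w * match
    best_end = i + w
    qi, sj = i + w, j + w
    # Extend right until the score falls more than x_drop below the best.
    while qi < len(query) and sj < len(subject) and best - score <= x_drop:
        score += match if query[qi] == subject[sj] else mismatch
        qi += 1
        sj += 1
        if score > best:
            best, best_end = score, qi
    score = best
    best_start = i
    qi, sj = i - 1, j - 1
    # Extend left with the same stopping rule.
    while qi >= 0 and sj >= 0 and best - score <= x_drop:
        score += match if query[qi] == subject[sj] else mismatch
        if score > best:
            best, best_start = score, qi
        qi -= 1
        sj -= 1
    return best, best_start, best_end


query, subject = "CCCACGTACGTCCC", "GGGACGTACGTGGG"
seeds = find_seeds(query, subject, w=4)
score, start, end = extend_ungapped(query, subject, *seeds[0], w=4)
print(seeds[0], score, query[start:end])
```

A segment would be reported as a high-scoring pair only if its score exceeds the cutoff S; that filtering step is left out here.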
24BLAST Word size
- A word is any short sequence <= 6 letters
- Protein (1 - 2), nucleotide (1 - 6)
- High word size results in
- Faster
- Less sensitive
- More selective
25Other BLAST programs
- Program    Query            Database
- BLASTN     nucleic acid     nucleic acid
- BLASTP     protein          protein
- BLASTX     translated NA    protein
- TBLASTN    protein          translated NA
- TBLASTX    translated NA    translated NA
29Blast Databases
- Nucleotide Sequence Databases
- nr: All GenBank + RefSeq Nucleotides + EMBL +
DDBJ + PDB sequences (but no EST, STS, GSS, or
phase 0, 1 or 2 HTGS sequences). No longer
"non-redundant".
- est: Database of GenBank + EMBL + DDBJ sequences
from EST Divisions: est_human, est_mouse,
est_others
- gss: Genome Survey Sequence, includes
single-pass genomic data, exon-trapped sequences,
and Alu PCR sequences.
- htgs: Unfinished High Throughput Genomic
Sequences: phases 0, 1 and 2 (finished, phase 3
HTG sequences are in nr)
- pat: Nucleotides from the Patent division of
GenBank.
- mito: Database of mitochondrial sequences
- vector: Vector subset of GenBank(R), NCBI, in
ftp://ftp.ncbi.nih.gov/blast/db/
- pdb: Sequences from the 3-dimensional structure
from Brookhaven Protein Data Bank
- GENOMES: yeast, E. coli, Drosophila genome
- month: All new or revised GenBank + EMBL + DDBJ +
PDB sequences released in the last 30 days.
- alu: Select Alu repeats from REPBASE, suitable
for masking Alu repeats from query sequences.
- dbsts: Database of GenBank + EMBL + DDBJ
sequences from STS Divisions.
- chromosome: Searches Complete Genomes, Complete
Chromosome, or contigs from the NCBI Reference
Sequence project.
30Blast Databases
- Peptide Sequence Databases
- nr: All non-redundant GenBank CDS translations +
RefSeq Proteins + PDB + SwissProt + PIR + PRF
- swissprot: Last major release of the SWISS-PROT
protein sequence database (no updates)
- pat: Proteins from the Patent division of
GenPept.
- pdb: Sequences derived from the 3-dimensional
structure from Brookhaven Protein Data Bank
- yeast: Yeast (Saccharomyces cerevisiae) genomic
CDS translations
- ecoli: Escherichia coli genomic CDS translations
- Drosophila genome: Drosophila genome proteins
provided by Celera and Berkeley Drosophila Genome
Project (BDGP).
- month: All new or revised GenBank CDS
translation + PDB + SwissProt + PIR + PRF released
in the last 30 days.
31Other Blast Programs
- MegaBlast: optimized for aligning sequences that
differ slightly as a result of sequencing or
other similar "errors".
- Uses a larger word size (16)
- Is up to 10 times faster than more common
sequence similarity programs
- Able to efficiently handle much longer DNA
sequences
- Non-affine gapping parameters (open = 0,
extension variable)
32Other Blast Programs
- TraceBlast: optimized for cross species
comparisons.
- Word size (11)
- Expect value (10)
33Other BLAST programs
- Gapped BLAST (BLAST 2.0)
- Extends words from no-gap to gap, generating
gapped alignments
- PSI-BLAST
- Position specific iterated BLAST: uses gapped
BLAST to generate a profile from multiple
iterations, which is used instead of the input
and distance matrix.
34Limitations of BLAST
- Needs islands of strong homology
- Limits on the combination of scoring and penalty
values
- Variants (blastx, tblastn, tblastx) use 6-frame
translation, yet still miss sequences with frame
shifts, etc.
- Finds and reports only local alignments.
35Rules of Thumb
- For short amino acid sequences (20 - 40
residues), 50% identity can happen by chance.
(Increase expect value and decrease word size.)
- If A and B are homologous, and B and C are
homologous, then A and C are homologous, even if
you cannot see it.
- Locate and filter regions of low complexity.
They lead to false positive alignments (similarity
without homology).
36FASTA Fast Alignment
- Rapid global alignment
- Allows alignment to shift frames
- Not a strong mathematical basis
38LALIGN
- A FASTA derivative for local alignments
- Compares two protein sequences to identify
regions of similarity
- Will report several sequence alignments within a
given sequence
- Works for internal repeats that are missed by
FASTA because of gaps.
39Precomputed Alignments
- Related sequences, related structures, related
articles, summaries, etc.
- InterPro
- Pfam
- ProDom
- SMART
- etc.
42BimCore BioMolecular Computing Resource
- Founded in 1992 as a subscription based
computing support service for Emory
bioinformatics based research.
- Bioinformatics is the MIS for molecular
biology.
- Our mission is to serve as a human interface
between researchers and computing technology,
enabling investigators to refine the scope of
biological investigations and accelerate
discovery.
43BimCore BioMolecular Computing Resource
- Sequence Analysis
- Molecular Modeling
- Microarray Analysis
- BimCore provides
- Software and computational hardware (evaluate
software, purchase Emory site licenses)
- Training (workshops, courses and online
tutorials)
- Collaborations (evaluation, direct approach,
training and interpretation)
- Computer expertise (programming, evaluate
needs, direct purchase and set-up)
44The Biology of Molecular Biology, Robert J. Huskey
http://www.people.virginia.edu/~rjh9u/humbiol.html
45Sequence Analysis Facility
- DNA:      ...ATATAA...GTA...ATGCTAGGCGCTTCTATCTTC
- RNA:      ..................UACGAUCCGCGAAGAUAGAAG
- Codons:   UAC GAU CCG CGA AGA UAG AAG...
- Protein:  M L G A S I F ..
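The DNA-to-protein step on this slide can be checked with a few lines of code. The sketch below reads the DNA strand shown (starting at ATG) three bases at a time using the standard genetic code; `translate` and the table construction are written for this example only, and the reading frame is assumed to start at the first base passed in.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a coding-strand sequence into one-letter amino acid codes."""
    dna = dna.upper().replace("U", "T")  # accept RNA spelling as well
    usable = len(dna) - len(dna) % 3     # ignore an incomplete final codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGCTAGGCGCTTCTATCTTC"))
```

The output corresponds to the M L G A S I F shown on the slide; "*" in the table marks a stop codon.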
46Sequence Analysis Facility
- Given a sequence (DNA or protein), search against
biological databases for similarities in type
and function.
- Generate and review sequence alignments.
- Discover evolutionary path of a DNA or protein
sequence (phylogeny).
- Given a sequence of DNA (e.g. ACGTGTGGG),
locate genes and mutations.
- Assemble sequence fragments into larger
component sequence.
47Molecular Modeling Center
- Analyze protein structure, interpret the
mechanism of action.
- Generate a model structure by Homology
Modeling.
- Mutate a protein structure, predict effect on
the protein.
- Design inhibition of protein function, drug
design and molecular docking.
Emerson
48Microarray Data Analysis Facility
- Determine the expression level of every gene in
an organism, tissue or cell culture.
- Monitor the changes in expression levels in
response to environmental changes, such as stress,
disease, drug.
- Lead to a study and understanding of the
biological response.
49
- Sequence Analysis Facility:
- Given a sequence (DNA or protein), search against
biological databases for similarities in type and
function.
- Generate and review sequence alignments.
- Discover evolutionary path of a DNA or protein
sequence (phylogeny).
- Given a sequence of DNA (e.g. ACGTGTGGG), locate
genes and mutations.
- Assemble sequence fragments into larger
component sequence.
- Molecular Modeling Center:
- Generate a model structure by Homology Modeling.
- Mutate a protein structure, predict effect on
the protein.
- Design inhibition of protein function, drug
design and molecular docking.
- Microarray Data Analysis Facility:
- Determine the expression level of every gene in
an organism, tissue or cell culture.
- Monitor the changes in expression levels in
response to environmental changes, such as stress,
disease, drug.
- Lead to a study and understanding of the
biological response.