Title: BIO341 Gene Discovery Section 4 Bioinformatics and genome analysis
1BIO341 Gene Discovery Section 4 Bioinformatics and genome analysis
- Jasper Rees
- Department of Biochemistry, UWC
- www.biotechnology.uwc.ac.za/teaching/BIO341
2Bioinformatics and genome analysis
- Bioinformatics - the analysis of biological
information usually applied to molecular data,
though formally covering all biological systems.
Refers especially to the computational analysis
of large datasets of DNA, protein and structural
data
3Simple gene analysis
- Restriction maps
- Plasmid maps
- ORFs and coding sequences
- Database searching
- Two-way (pairwise) comparison of sequences
- Multiple sequence alignments
4More complex analysis
- Sequencing project assembly
- ORF prediction based on statistical analysis of DNA sequence
- Domain identification
- Structure comparison
- Structure prediction
- Promoter and splice junction prediction
- Genome analysis
5Pairwise alignments
- Comparison of two sequences, DNA-DNA or Protein-Protein
- Identification of best matching region
- Local alignment of best match and surrounding homology
- Score for similarity and gapping
- Display as alignments or dotplots
- Computational requirement increases as the product of the lengths of the sequences (L1 x L2)
6Algorithms
- Most frequently used methods of alignment are:
- Needleman and Wunsch (global alignment)
- Smith and Waterman (local alignment)
- FASTA (Pearson)
- BLAST
- BLAST and FASTA make approximations to achieve speed of alignment/database searching
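As a rough illustration of the exact (non-approximate) approach, the sketch below computes only the best Smith and Waterman local alignment score for two sequences. The scoring values (match, mismatch, gap) are assumptions chosen for the example, and a single linear gap penalty is used for simplicity. The nested loops fill a table of size L1 x L2, which is where the cost described above comes from.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Assumed scoring: +1 match, -1 mismatch, -2 per gap position (linear gaps).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,                    # start a new local alignment here
                h[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                h[i - 1][j] + gap,    # gap in b
                h[i][j - 1] + gap,    # gap in a
            )
            best = max(best, h[i][j])
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

A full implementation would also trace back through the table to recover the alignment itself, not just its score.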
7Gap insertion costs
- Score for insertion of a gap into a sequence
- Score for extension of a gap into a sequence
- Score for end alignments
- Vary insertion/extension parameters to optimise alignments depending on the similarity of the sequences
- Lower penalties give more gaps
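The separate insertion and extension scores can be sketched as a simple penalty function. The default values below are illustrative assumptions, and the convention used (the opening penalty covers the first gap position) is only one of several; some programs charge the opening penalty plus an extension penalty for every position.

```python
def gap_cost(length, gap_open=10, gap_extend=1):
    """Penalty for one gap of the given length.

    Assumed convention: gap_open is charged for the first position,
    gap_extend for each additional position.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One long gap is cheaper than several short ones of the same total length.
print(gap_cost(3), 3 * gap_cost(1))
```

Lowering gap_open or gap_extend makes gaps cheaper relative to mismatches, so an aligner will introduce more of them.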
8Similarity Scoring Matrices
- Matrices used to score replacement of characters in a sequence
- DNA matrix, can be binary, or more complex
- Protein matrix, either binary, or based on mutation analysis, biophysical data or other ease of replacement options
9Identity DNA scoring matrix
- 1 for identity, 0 for mismatch
10Complex DNA scoring matrix
- For Example
- 4 for identity, 2 for transition, 0 for
transversion
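A minimal sketch of this scoring scheme, using the 4/2/0 values from the slide. The function names are hypothetical, and the ungapped scoring helper assumes two sequences of equal length.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(x, y, identity=4, transition=2, transversion=0):
    """Score one pair of bases: identity, transition or transversion."""
    x, y = x.upper(), y.upper()
    if x == y:
        return identity
    # Transition: purine <-> purine or pyrimidine <-> pyrimidine
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return transition
    return transversion


def score_ungapped(a, b):
    """Sum the per-position scores of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(dna_score(x, y) for x, y in zip(a, b))


print(score_ungapped("ACGT", "GCGA"))  # A/G transition, two identities, T/A transversion
```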
11Protein Matrices
- Identity matrix: 1 for identity, 0 for mismatch
- Mutation data matrix derived from the analysis of the ease of substitution of one amino acid for another in protein evolution
- Biophysical data matrix based on analysis of energy cost of substituting one residue for another in protein structures, solubility analysis, etc
12Mutation Data Matrix
- PAM - Point Accepted Mutation: various levels, differing sensitivities, derived from families of ancient, globular proteins
- BLOSUM - constructed directly from multiple sequence alignments. Gives results closer to observed relationships, but has no evolutionary model.
13PAM 250 Matrix (partial)
- Positive scores allow alignment
- Negative scores indicate poor alignment
- Larger magnitude is better (positive) or worse (negative)
- Scale is logarithmic (log-odds)
14BLOSUM Matrix
- Positive scores allow alignment
- Negative scores indicate poor alignment
- Larger magnitude is better (positive) or worse (negative)
- Scale is logarithmic (log-odds)
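The logarithmic scale can be illustrated with a log-odds calculation: the score compares how often two residues are seen aligned with how often they would pair by chance. The sketch below assumes half-bit units (2 x log2, the scaling used for BLOSUM62); the frequencies in the usage lines are made-up round numbers, not values from any real matrix.

```python
import math


def log_odds(q_ij, p_i, p_j, scale=2.0):
    """Rounded log-odds score for aligning residues i and j.

    q_ij: observed frequency of the aligned pair (assumed input)
    p_i, p_j: background frequencies of each residue (assumed inputs)
    scale: 2.0 gives half-bit units
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))


# Pair seen twice as often as chance -> positive score
print(log_odds(0.125, 0.25, 0.25))
# Pair seen half as often as chance -> negative score
print(log_odds(0.03125, 0.25, 0.25))
```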
15Matrix Comparison
- Selection of matrix depends on extent of
similarity expected or of interest
16Blast Resources
- Blast Information and resource links
- Blast Tutorial
- Statistics of Similarity Searching
- General Rules for understanding results
- Glossary of technical terms
17BLAST
- Basic Local Alignment Search Tool
- For rapid searching of pre-processed databases
- 5 search strategies (BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX)
- Selection of databases and search targets
- Calculates best local match
- Gapped alignments
- Statistics and histograms
20Scoring an alignment
21BLASTN
- Compares nucleotide sequence with nucleotide sequence database
- Complete range of DNA databases
- Use to identify identity or close relationship between query sequence and database.
- Not especially sensitive for identification of homology
- Good for identity matching
22Nucleotide databases 1
- nr - non-redundant - database of all published and much unpublished DNA sequence data, merged from GenBank, EMBL, DDBJ and PDB
- Month - data added to nr within the last month
- GSS - genome survey sequences, preliminary genomic sequencing project data
- HTGS - high throughput genome sequences, unfinished genomic sequences, assembled
23Nucleotide databases 2
- EST - Expressed sequence tags
- Human EST - Human Expressed sequence tags
- Mouse EST - Mouse Expressed sequence tags
- Other EST - Expressed sequence tags from all other species
- Other databases: E. coli, Yeast, Mito, pdb, kabat, patents, vector, alu
24BLASTP
- Protein sequence against protein database
- Complete range of Protein databases
- Use to identify identity and distant relationship between query sequence and database.
- Sensitive for identification of homology
- Implementation as PSI-BLAST improves sensitivity
- Good for identity matching
25Protein Databases
- nr - non-redundant - all available data from protein and DNA sequence
- Month - most recent 30 days updates
- Swissprot - curated and annotated database
- PDB - protein sequences for 3D structures
- E.coli - E.coli
- Yeast - Saccharomyces
- Kabat - immunological sequences
26TBLASTN
- Compares a protein query sequence against a nucleotide sequence database translated into all reading frames.
- Sensitive (as with BLASTP) for homology and identity
- Used to identify possible coding sequences and homologies
- Especially useful with genome and EST data
27BLASTX
- Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- Sensitive for homology and identity matching
- You could use this option to find potential translation products of an unknown nucleotide sequence.
- Useful for new DNA sequences
28TBLASTX
- Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
- Very useful for homology searching
- Less useful with identities (matches everything 6 times!)
- Helps to select out conserved coding sequence from non-coding background
- Especially useful for cross-species analysis with genomic, cDNA and EST data
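The six-frame translation used by TBLASTN, BLASTX and TBLASTX can be sketched as follows: three reading frames on the given strand and three on its reverse complement. This is a minimal illustration that assumes the standard genetic code and unambiguous A/C/G/T input; the function names are hypothetical and it is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations keyed '+1'..'+3' (forward) and '-1'..'-3' (reverse)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

Because every nucleotide sequence yields six protein sequences, an identical sequence can match itself in all six frames, which is why identity searches are less useful here.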
29Blast Inputs
- Input sequence or accession number
- Choice of search program
- Choice of subsequence
- Choice of database
- Choice of codon table
30Blast Input Page
31Blast Matrix choice
32Species/Genus/Phyla selection
33Further species selection
34Boolean operators
- Logical operators used to specify selection
- AND, OR, NOT, (IS, BEFORE, NEXT TO)
- Use to get greater specificity of selection
- For example:
- Mammal NOT Human
- Limits selection to all mammals, but excluding humans.
- Vertebrate NOT mammal
- Would select all non-mammalian vertebrates
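Boolean selection behaves like set arithmetic, which the toy example below sketches with Python sets. The species lists are small made-up samples for illustration only, not the contents of any real taxonomy database.

```python
# Toy groups (illustrative samples only)
mammals = {"human", "mouse", "dog"}
vertebrates = mammals | {"chicken", "zebrafish"}
human = {"human"}

# Mammal NOT Human: set difference
mammal_not_human = mammals - human

# Vertebrate NOT mammal: all non-mammalian vertebrates in the sample
vertebrate_not_mammal = vertebrates - mammals

# AND is intersection, OR is union
vertebrate_and_mammal = vertebrates & mammals
human_or_chicken = human | {"chicken"}

print(sorted(mammal_not_human))
print(sorted(vertebrate_not_mammal))
```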
35Format Submission
36Email submission
37Blast Output - summary
Input sequence, name and size
DATABASE - number of sequences, total length
38Blast Output - Graphical
Regions of homology in matching sequences, colour coded for scores
Input sequence scale
39Blast Output - lists and stats
Database entry
Description of entry
Score
Stats
40Blast Statistics
- Statistical values calculated relative to the size of the database
- Depend on the length of the match
- Values expressed as exponentials
- 3e-17 is 3 x 10^-17
- Smaller E value is a better match, because it is statistically less likely to be a random event
- Exact match has E = 0 (cannot be random)
- E values greater than 10^-5 are questionable
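The dependence on database size can be seen from the usual bit-score form of the expect value, E = m x n x 2^(-S'), where m is the query length, n the database length and S' the bit score. The sketch below is a simplified illustration: BLAST itself uses effective (edge-corrected) lengths, and the numbers in the usage lines are made up.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Approximate E value from a bit score (raw lengths, no edge correction)."""
    return query_len * db_len * 2.0 ** (-bit_score)


# The exponential notation in BLAST output is ordinary scientific notation
assert math.isclose(float("3e-17"), 3 * 10 ** -17)

# Same hit, database twice as large: the E value doubles
small_db = expect_value(50, 100, 1_000_000)
large_db = expect_value(50, 100, 2_000_000)
print(small_db, large_db)

# Applying the rule of thumb from the slide
print("questionable" if small_db > 1e-5 else "likely significant")
```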
41Blast alignments - exact match
42Blast alignments - homology
43BLAST-2-Sequences
- Pairwise comparison of two sequences only
- All 5 versions of BLAST available
- (so all combinations of DNA/Protein possible)
- Graphical display
- Sequence alignments
- Statistical significance calculated from database size
44Statistical Significance and Histograms
- Start with Tutorial on
- The Statistics of Sequence Similarity Scores
- Covers matrices, gapping, global vs local
alignment, statistical significance.
45Multiple Sequence alignments
- Alignment of three or more sequences together
- Cannot do alignments simultaneously (excessively large computational problem)
- So various options used to develop a rapid strategy to align sequences
- Best option is to align all sequences as pairs, then build multiple sequence alignments by adding the most closely related sequences in order
46Clustal approach to MSA
- Compare all sequences with each other
- Pair each sequence with the closest partner
- Align closest partners
- Align next closest partner to create groups
- Align groups of sequences until completed
- Build phylogenetic tree
- Efficient method because only do pairwise
alignments, and only align closest pairs.
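The first two steps (compare all sequences with each other, then find the closest partners) can be sketched as below. This is a simplified illustration, not the Clustal implementation: it assumes toy sequences of equal length and uses the fraction of mismatched positions as the distance, whereas Clustal derives distances from pairwise alignments and builds a guide tree from them. The function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pairs_first(seqs):
    """Return (distance, name1, name2) for all pairs, closest pair first."""
    pairs = [
        (p_distance(seqs[i], seqs[j]), i, j)
        for i, j in combinations(sorted(seqs), 2)
    ]
    return sorted(pairs)


toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTGTACCA"}
for distance, first, second in closest_pairs_first(toy):
    print(first, second, distance)
```

In this toy set, a and b would be aligned first, and c would then be added to that pair.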
47Outputs from MSA analysis
- Sequence alignment
- Or frequency matrix (Profile)
- Similarity plot
- Phylogenetic tree
- Applies to all programs used for MSA
48Frequency matrix or Profiles
- First matrix created from aligned protein sequences
- Multiplied by mutation data matrix (PAM or BLOSUM)
- Creates a matrix which is a frequency weighted matrix specific to the alignment of sequences
- Provides very sensitive alignment tool
- Can do similarly for DNA sequences
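The first step, building the frequency matrix from an alignment, can be sketched as a per-column residue count. The example below uses a tiny made-up DNA alignment and a hypothetical function name; it stops at the raw frequencies and does not perform the weighting by a PAM or BLOSUM matrix.

```python
from collections import Counter


def frequency_profile(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


for position, freqs in enumerate(frequency_profile(["ACG", "ACT", "AGG", "ACG"]), 1):
    print(position, freqs)
```

Fully conserved columns end up with a single residue at frequency 1.0, so a profile rewards matches at conserved positions more than a general-purpose matrix can.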
49DWNN domain alignment
50Phylogenetic trees
- Way to show the best predicted evolutionary relationship between aligned sequences
- Confidence level depends on method used
- Should relate to evolutionary distance between sequences
- Display distances as length and position of branches
- Should show up orthologs and paralogs
- Need to root trees correctly for them to give correct picture
- Very distant sequences: hard to be certain of order
51DWNN as a phylogenetic marker
52PSI-BLAST
- Position-Specific Iterated BLAST
- Starts with a BLASTP search
- Generates a set of matches
- Select matches above a threshold
- Align sequences scoring above threshold
- Create a frequency matrix from this alignment
- Search database with frequency matrix
- Repeat the select, align, matrix and search steps until no new sequences are added above the threshold level
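The iteration itself can be sketched as a loop that stops when a round adds no new sequences. Everything here is a hypothetical stand-in: `search` and `build_profile` are caller-supplied functions representing the database search and the matrix construction, and the toy data in the usage lines only demonstrates the convergence logic, not real BLAST behaviour.

```python
def psi_iterate(query, search, build_profile, max_rounds=10):
    """Repeat profile searches until no new hits appear (or max_rounds is hit).

    search(x): returns the set of hits scoring above threshold (assumed)
    build_profile(hits): returns a profile built from the current hits (assumed)
    """
    hits = set(search(query))          # initial BLASTP-like search
    for _ in range(max_rounds):
        profile = build_profile(hits)  # align hits, build frequency matrix
        found = set(search(profile))   # search the database with the matrix
        if found <= hits:              # nothing new: converged
            break
        hits |= found
    return hits


# Toy "database": each entry lists the sequences it can detect
links = {"q": {"a"}, "a": {"b"}, "b": {"c"}, "c": set()}


def toy_search(x):
    if isinstance(x, str):
        return links[x]
    return set().union(*(links[name] for name in x))


print(sorted(psi_iterate("q", toy_search, frozenset)))
```

In the toy run the query finds only "a" directly, but later rounds reach "b" and "c" through the accumulated hits, which is the sense in which the iterated search is more sensitive than a single pass.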
53Advantages of PSI-BLAST
- Generates an alignment from single starting sequence
- Creates a specific matrix for each search strategy
- Final matrix should be the same for any family of proteins, whichever starting sequence is used for the search
- Much more sensitive than BLASTP alone
54Current disadvantages of PSI-BLAST
- Does not show alignment used to generate matrix
- Does not show matrix
- Cannot generate final sequence alignment
- May generate several alignments at one time if there are several domains in the protein
- Only uses BLASTP, not TBLASTN or BLASTX, so databases restricted to protein
55Databases of genomics data
- Databases of genomic sequence data and predicted/known genes (eg NCBI Genomes)
- Annotated databases, integrating genetic map, physical map (clones), sequence data, known genes, and predicted genes
- Databases based on AceDB
- Integration of physical and genetic mapping data
- Common interface for genomics data
56Databases of research papers
- Many different sources
- NCBI PubMed is the major site for medical and molecular data, but is missing many plant and agricultural papers.
- Agricola: agricultural/biological
- Current Contents: everything!
- www.sciencedirect.com: journals from Elsevier press, currently free to UWC
- Others at UWC library web site
57Search Engines
- Various types and strategies
- Web based spiders, crawlers etc (Google, Excite, Yahoo, Altavista)
- Database based (PubMed etc)
- PubMed provides comprehensive indexing
- Internally compared to give related references
- Linked access to sequences and literature sites
58Search Strategies
- Need to choose keywords carefully
- Use author names when possible
- Can sometimes select by dates
- Use review as a search term when appropriate
- Add more terms to get greater selectivity
- Avoid general terms, like cell, human, gene
- Use Boolean operators (and, or, not)
- Look for related articles (in PubMed) based on
internal text comparison to find most related
papers
59Genome annotation engines
- Input DNA sequence data
- Search with databases and predictive methods to identify possible coding sequences, promoters, splice junctions, exons, poly A sites, tRNA genes, repeat sequences
- Some sites do all of this, some dedicated to one type of analysis (eg promoters)