BIO341 Gene Discovery Section 4 Bioinformatics and genome analysis

1
BIO341 Gene Discovery
Section 4: Bioinformatics and genome analysis

Jasper Rees
Department of Biochemistry, UWC
www.biotechnology.uwc.ac.za/teaching/BIO341

2
Bioinformatics and genome analysis

Bioinformatics - the analysis of biological
information usually applied to molecular data,
though formally covering all biological systems.
Refers especially to the computational analysis
of large datasets of DNA, protein and structural
data

3
Simple gene analysis

Restriction maps
Plasmid maps
ORFs and coding sequences
Database searching
2 way comparison of sequences
Multiple sequence alignments

4
More complex analysis

Sequencing project assembly
ORF prediction based on statistical analysis of
DNA sequence
Domain identification
Structure comparison
Structure prediction
Promoter and splice junction prediction
Genome analysis

5
Pairwise alignments

Comparison of two sequences, DNA-DNA or
Protein-Protein
Identification of best matching region
Local alignment of best match and surrounding
homology
Score for similarity and gapping
Display as alignments or dotplots
Computational requirement increases as the product
of the lengths of the sequences (L1 x L2)
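
The dotplot display and the L1 x L2 cost mentioned above can be illustrated with a minimal sketch. The function names `dotplot` and `render` are made up for this example, and real dotplot tools usually compare short windows rather than single characters.

```python
def dotplot(seq_a, seq_b):
    """Return a grid of booleans: True where seq_a[i] equals seq_b[j].

    Every position of seq_a is compared with every position of seq_b,
    so the work grows as len(seq_a) * len(seq_b).
    """
    return [[a == b for b in seq_b] for a in seq_a]


def render(grid):
    """Draw the grid as text: '*' for a match, '.' for a mismatch."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in grid
    )


grid = dotplot("GATTACA", "GATCACA")
print(render(grid))
```

A run of matches along the main diagonal marks a region where the two sequences align without gaps.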

6
Algorithms

Most frequently used methods of alignment are
Needleman and Wunsch (global alignment)
Smith and Waterman (local alignment)
FASTA (Pearson)
BLAST
BLAST and FASTA make approximations to achieve
speed of alignment/database searching
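
As a rough illustration of the exact (non-approximate) local method, here is a minimal Smith and Waterman style scoring sketch. It assumes a simple linear gap penalty and made-up match/mismatch values, and it returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty).

    The table has (len(a)+1) x (len(b)+1) cells, which is where the
    L1 x L2 cost comes from. Only two rows are kept in memory.
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: that is what makes it local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Dropping the `0` from the `max` and scoring the end gaps would turn this into a global (Needleman and Wunsch style) calculation.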

7
Gap insertion costs

Score for insertion of a gap into a sequence
Score for extension of a gap into a sequence
Score for end alignments
Vary insertion/extension parameters to optimise
alignments depending on the similarity of the
sequences
Lower penalties give more gaps
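
The separate insertion and extension scores can be sketched as an affine gap penalty. The default values 10 and 1 are assumptions for illustration, and programs differ on whether the first gap position also pays the extension cost.

```python
def gap_penalty(length, gap_open=10, gap_extend=1):
    """Cost of one gap of the given length.

    Assumed convention: the first gapped position pays gap_open and
    each further position pays gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


one_long_gap = gap_penalty(4)          # a single gap of 4 positions
four_short_gaps = 4 * gap_penalty(1)   # four separate gaps of 1
print(one_long_gap, four_short_gaps)
```

With these values one long gap is much cheaper than several short ones, which matches the idea that a single insertion or deletion event is more likely than many independent ones.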

8
Similarity Scoring Matrices

Matrices used to score replacement of characters
in a sequence
DNA matrix, can be binary, or more complex
Protein matrix, either binary, or based on
mutation analysis, biophysical data or other
measures of ease of replacement

9
Identity DNA scoring matrix

1 for identity, 0 for mismatch

10
Complex DNA scoring matrix

For Example
4 for identity, 2 for transition, 0 for
transversion
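
The 4/2/0 example can be written out as a small scoring function. This is a sketch for ungapped, equal-length DNA strings only, and the function names are made up for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(x, y):
    """4 for identity, 2 for a transition, 0 for a transversion."""
    if x == y:
        return 4
    # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return 2
    return 0


def score_ungapped(a, b):
    """Sum the per-position scores of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(dna_score(x, y) for x, y in zip(a, b))


print(score_ungapped("ACGT", "GCTT"))
```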

11
Protein Matrices

Identity matrix 1 for identity, 0 for mismatch
Mutation data matrix derived from the analysis
of the ease of substitution of one amino acid for
another in protein evolution
Biophysical data matrix based on analysis of
energy cost of substituting one residue for
another in protein structures, solubility
analysis, etc

12
Mutation Data Matrix

PAM - Point Accepted Mutation. Various levels,
differing sensitivities, derived from families of
ancient, globular proteins
BLOSUM - constructed directly from multiple
sequence alignments. Gives results closer to
observed relationships. But no evolutionary model.

13
PAM 250 Matrix (partial)

Positive scores allow alignment
Negative scores indicate poor alignment
Larger magnitude is better (positive) or worse (negative)
Scale is Logarithmic

14
BLOSUM Matrix

Positive scores allow alignment
Negative scores indicate poor alignment
Larger magnitude is better (positive) or worse (negative)
Scale is Logarithmic
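
The logarithmic scale can be illustrated with a log-odds calculation: how often a pair of residues is seen aligned, compared with how often it would pair by chance. This is a sketch with made-up frequencies. The `scale=2.0` default (half-bit units) is an assumption for this example, and published matrices use their own scaling.

```python
import math


def log_odds(q_ij, p_i, p_j, scale=2.0):
    """Scaled, rounded log2 of observed over expected pair frequency.

    q_ij: observed frequency of the aligned pair (made-up input here)
    p_i, p_j: background frequencies of the two residues
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))


# Pair seen twice as often as chance, at chance level, and half as often.
print(log_odds(0.02, 0.1, 0.1), log_odds(0.01, 0.1, 0.1), log_odds(0.005, 0.1, 0.1))
```

Because the scores are logarithms, adding them along an alignment corresponds to multiplying the underlying odds.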

15
Matrix Comparison

Selection of matrix depends on extent of
similarity expected or of interest

16
Blast Resources

Blast Information and resource links
Blast Tutorial
Statistics of Similarity Searching
General Rules for understanding results
Glossary of technical terms

17
BLAST

Basic Local Alignment Search Tool
For rapid searching of pre-processed databases
5 search strategies
Selection of databases and search targets
Calculated best local match
Gapped alignments
Statistics and histograms

20
Scoring an alignment
21
BLASTN

Compares Nucleotide sequence with Nucleotide
Sequence Database
Complete range of DNA databases
Use to identify identity or close relationship
between query sequence and database.
Not especially sensitive for identification of
homology
Good for identity matching

22
Nucleotide databases 1

Nr - non redundant - Database of all published
and much unpublished DNA sequence data, merged
from GenBank, EMBL, DDBJ and PDB
Month - data added to nr within the last month
GSS - genome survey sequences, preliminary genomic
sequencing project data
HTGS - high throughput genome sequences,
unfinished genomic sequences, assembled

23
Nucleotide databases 2

EST - Expressed sequence tags
Human EST - Human Expressed sequence tags
Mouse EST - Mouse Expressed sequence tags
Other EST - Expressed sequence tags from all
other species
Other databases: E.coli, Yeast, Mito, pdb, kabat,
patents, vector, alu

24
BLASTP

Protein sequence against protein database
Complete range of Protein databases
Use to identify identity and distant relationship
between query sequence and database.
Sensitive for identification of homology
Implementation as PSI-BLAST improves sensitivity
Good for identity matching

25
Protein Databases

Nr - non redundant - all available data from
protein and DNA sequence
Month - most recent 30 days updates
Swissprot - curated and annotated database
PDB - protein sequences for 3D structures
E.coli - E.coli
Yeast - Saccharomyces
Kabat - immunological sequences

26
TBLASTN

Compares a protein query sequence against a
nucleotide sequence database translated into all
reading frames.
Sensitive (as with BLASTP) for homology and
identity
Used to identify possible coding sequences and
homologies
Especially useful with genome and EST data

27
BLASTX

Compares a nucleotide query sequence translated
in all reading frames against a protein sequence
database.
Sensitive for homology and identity matching
You could use this option to find potential
translation products of an unknown nucleotide
sequence.
Useful for new DNA sequences
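
Translation in all reading frames, as used by BLASTX, TBLASTN and TBLASTX, can be sketched as follows. This assumes the standard genetic code and writes unknown codons as `X`. The frame labels `+1` to `-3` are just a naming choice for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Map frame label ('+1'..'+3', '-1'..'-3') to protein string."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

A frame with a long stretch free of `*` is a candidate coding region. The search programs then compare each translated frame with the protein sequences.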

28
TBLASTX

Compares the six-frame translations of a
nucleotide query sequence against the six-frame
translations of a nucleotide sequence database.
Very useful for homology searching
Less useful with identities (matches everything 6
times!)
Helps to select out conserved coding sequence
from non-coding background
Especially useful for cross-species analysis with
genomic, cDNA and EST data

29
Blast Inputs

Input sequence or accession number
Choice of search program
Choice of subsequence
Choice of database
Choice of codon table

30
Blast Input Page
31
Blast Matrix choice
32
Species/Genus/Phyla selection
33
Further species selection
34
Boolean operators

Logical operators used to specify selection
AND, OR, NOT, (IS, BEFORE, NEXT TO)
Use to get greater specificity of selection
For example
Mammal NOT Human
Limits selection to all mammals, but excluding
humans.
Vertebrate NOT mammal
Would select all non-mammalian vertebrates
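
The effect of NOT can be pictured with sets. This is a toy sketch with a made-up handful of species, and it says nothing about how the BLAST server implements its filters.

```python
# Toy taxonomy: each group is just a set of species names.
vertebrates = {"human", "mouse", "chicken", "zebrafish"}
mammals = {"human", "mouse"}
human = {"human"}

mammal_not_human = mammals - human           # Mammal NOT Human
vertebrate_not_mammal = vertebrates - mammals  # Vertebrate NOT mammal
mammal_and_vertebrate = mammals & vertebrates  # AND keeps the overlap
mammal_or_human = mammals | human              # OR keeps everything in either

print(sorted(mammal_not_human))
print(sorted(vertebrate_not_mammal))
```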

35
Format Submission
36
Email submission
37
Blast Output - summary
Input sequence, name and size
DATABASE - number of sequences, total length
38
Blast Output - Graphical
Regions of homology in matching sequences, colour
coded for scores
Input sequence scale
39
Blast Output - lists and stats
Database entry
Description of entry
Score
Stats
40
Blast Statistics

Statistical values calculated relative to the
size of the database
Depend on the length of the match
Values expressed as exponentials
3e-17 is 3 x 10^-17
Smaller E value is better match, because
statistically less likely to be a random event
Exact match to a long query is reported as E = 0.0
(too small to represent, so effectively not random)
Values greater than 10^-5 are questionable
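
The dependence on database size can be sketched with the usual relation between a bit score and an expect value, E = m x n x 2^(-S), where m is the query length, n the total database length and S the bit score. This is a simplified sketch: real BLAST uses corrected "effective" lengths, so reported values will differ somewhat.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value from a bit score and the search space size.

    Simplification: raw lengths are used instead of BLAST's
    effective lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


small_db = expect_value(60, 300, 1_000_000)
large_db = expect_value(60, 300, 2_000_000)
print(f"{small_db:.1e}", f"{large_db:.1e}")
```

The same alignment score gives an E value twice as large in a database twice the size, and each extra bit of score halves it.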

41
Blast alignments - exact match
42
Blast alignments - homology
43
BLAST-2-Sequences

Pairwise comparison of two sequences only
All 5 versions of BLAST available
(so all combinations of DNA/Protein possible)
Graphical display
Sequence alignments
Statistical significance calculated from database
size

44
Statistical Significance and Histograms

Start with Tutorial on
The Statistics of Sequence Similarity Scores
Covers matrices, gapping, global vs local
alignment, statistical significance.

45
Multiple Sequence alignments

Alignment of three or more sequences together
Cannot do alignments simultaneously (excessively
large computational problem)
So various options used to develop a rapid
strategy to align sequences
Best option is to align all sequences as pairs, then
build the multiple sequence alignment by adding the
most closely related sequences first

46
Clustal approach to MSA

Compare all sequences with each other
Pair each sequence with the closest partner
Align closest partners
Align next closest partner to create groups
Align groups of sequences until completed
Build phylogenetic tree
Efficient method because only do pairwise
alignments, and only align closest pairs.
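
The "closest partners first" ordering can be sketched as follows. This is not the Clustal program itself: it assumes toy sequences of equal length, uses the fraction of mismatched positions as the distance, and joins groups by average distance. All of these are simplifications chosen for this example.

```python
from itertools import combinations


def distance(a, b):
    """Fraction of positions that differ (equal-length toy sequences)."""
    if len(a) != len(b):
        raise ValueError("toy example needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def group_distance(seqs, group1, group2):
    """Average pairwise distance between two groups of sequence names."""
    total = sum(distance(seqs[a], seqs[b]) for a in group1 for b in group2)
    return total / (len(group1) * len(group2))


def joining_order(seqs):
    """Return the list of (group, group) joins, closest pair first."""
    groups = [(name,) for name in seqs]
    joins = []
    while len(groups) > 1:
        g1, g2 = min(
            combinations(groups, 2),
            key=lambda pair: group_distance(seqs, pair[0], pair[1]),
        )
        groups.remove(g1)
        groups.remove(g2)
        groups.append(g1 + g2)
        joins.append((g1, g2))
    return joins


toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGAACTT", "D": "TCGAACAA"}
for step in joining_order(toy):
    print(step)
```

In a real progressive alignment each join would be an actual alignment of two sequences or two groups. Here only the order of joining is worked out.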

47
Outputs from MSA analysis

Sequence alignment
Or frequency matrix (Profile)
Similarity plot
Phylogenetic tree
Applies to all programs used for MSA

48
Frequency matrix or Profiles

First matrix created from aligned protein
Multiplied by mutation data matrix (PAM or
BLOSUM)
Creates a matrix which is a frequency weighted
matrix specific to the alignment of sequences
Provides very sensitive alignment tool
Can do similarly for DNA sequences
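
The first step, building a frequency matrix from an alignment, can be sketched like this. The weighting by a PAM or BLOSUM matrix is left out, and the four-sequence alignment is made up for the example.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append(
            {residue: n / len(column) for residue, n in counts.items()}
        )
    return profile


profile = column_frequencies(["ACD", "ACE", "AGE", "ACE"])
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

A fully conserved column has a single residue with frequency 1.0, while variable columns spread the frequency over several residues. That is what makes the matrix specific to each position.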

49
DWNN domain alignment
50
Phylogenetic trees

Way to show the best predicted evolutionary
relationship between aligned sequences
Confidence level depends on method used
Should relate to evolutionary distance between
sequences
Display distances as length and position of
branches
Should show up orthologs and paralogs
Need to root trees correctly for them to give
correct picture
For very distant sequences it is hard to be certain
of the branching order

51
DWNN as a phylogenetic marker
52
PSI-BLAST

Position-Specific Iterated BLAST
Starts with a BLASTP search
Generates a set of matches
Select matches above a threshold
Align sequences scoring above threshold
Create a frequency matrix from this alignment
Search database with frequency matrix
Repeat the selection, alignment and search steps
until no new sequences are added above the
threshold level

53
Advantages of PSI-BLAST

Generates an alignment from single starting
sequence
Creates a specific matrix for each search
strategy
Final matrix should be the same for any family of
proteins, whichever starting sequence is used
for the search
Much more sensitive than BLASTP alone

54
Current disadvantages of PSI-BLAST

Does not show alignment used to generate matrix
Does not show matrix
Cannot generate final sequence alignment
May generate several alignments at one time if
there are several domains in the protein
Only uses BLASTP, not TBLASTN or BLASTX, so
databases restricted to protein

55
Databases of genomics data

Databases of genomic sequence data and
predicted/known genes (eg NCBI Genomes)
Annotated databases, integrating genetic map,
physical map (clones), sequence data, known
genes, and predicted genes
Databases based on AceDB
Integration of physical and genetic mapping data
Common interface for genomics data

56
Databases of research papers

Many different sources
NCBI PubMed is major site of medical and
molecular data. But is missing many plant and
agricultural papers.
Agricola: agricultural/biological
Current Contents: everything!
www.sciencedirect.com: journals from Elsevier
press, currently free to UWC
Others at UWC library web site

57
Search Engines

Various types and strategies
Web based: spiders, crawlers etc (Google, Excite,
Yahoo, Altavista)
Database based: PubMed etc
PubMed provides comprehensive indexing
Internally compared to give related references
Linked access to sequences and literature sites

58
Search Strategies

Need to choose keywords carefully
Use author names when possible
Can sometimes select by dates
Use review as a search term when appropriate
Add more terms to get greater selectivity
Avoid general terms, like "cell", "human", "gene"
Use Boolean operators (and, or, not)
Look for related articles (in PubMed) based on
internal text comparison to find most related
papers

59
Genome annotation engines

Input DNA sequence data
Search with databases and predictive methods to
identify possible coding sequences, promoters,
splice junctions, exons, poly A sites, tRNA
genes, repeat sequences
Some sites do all of this, some dedicated to one
type of analysis (eg promoters)
