BIO341 Gene Discovery Section 4 Bioinformatics and genome analysis

1
BIO341 Gene Discovery Section 4 Bioinformatics
and genome analysis
  • Jasper Rees
  • Department of Biochemistry, UWC
  • www.biotechnology.uwc.ac.za/teaching/BIO341

2
Bioinformatics and genome analysis
  • Bioinformatics - the analysis of biological
    information usually applied to molecular data,
    though formally covering all biological systems.
    Refers especially to the computational analysis
    of large datasets of DNA, protein and structural
    data

3
Simple gene analysis
  • Restriction maps
  • Plasmid maps
  • ORFs and coding sequences
  • Database searching
  • 2 way comparison of sequences
  • Multiple sequence alignments
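
As a small illustration of the restriction-map item, here is a minimal sketch that lists where a recognition sequence occurs in a DNA string. The function name `find_sites` and the example sequence are made up for illustration. It searches only the strand given, which is enough for a palindromic site such as EcoRI (GAATTC).

```python
def find_sites(seq: str, site: str) -> list[int]:
    """Return 0-based start positions of every occurrence of site in seq."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping sites are also reported.
        start = seq.find(site, start + 1)
    return positions


ECORI = "GAATTC"  # EcoRI recognition sequence
example = "AAGAATTCGGGAATTCTT"  # hypothetical test sequence
print(find_sites(example, ECORI))  # [2, 10]
```

Fragment lengths for a simple map would follow from the differences between successive cut positions.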

4
More complex analysis
  • Sequencing project assembly
  • ORF prediction based on statistical analysis of
    DNA sequence
  • Domain identification
  • Structure comparison
  • Structure prediction
  • Promoter and splice junction prediction
  • Genome analysis

5
Pairwise alignments
  • Comparison of two sequences, DNA-DNA or
    Protein-Protein
  • Identification of best matching region
  • Local alignment of best match and surrounding
    homology
  • Score for similarity and gapping
  • Display as alignments or dotplots
  • Computational requirement increases as the
    product of the lengths of the sequences (L1 x L2)
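
A dotplot makes the L1 x L2 cost visible, since every residue of one sequence is compared with every residue of the other. The sketch below is a bare-bones text dotplot with no window or threshold filtering. The function name and the two short sequences are illustrative only.

```python
def dotplot(seq1: str, seq2: str) -> list[str]:
    """One row per residue of seq1, one column per residue of seq2.

    A '*' marks identical residues, '.' marks a mismatch.
    """
    return ["".join("*" if a == b else "." for b in seq2) for a in seq1]


for row in dotplot("GATTACA", "GATCA"):
    print(row)
```

Diagonal runs of `*` correspond to matching regions. Real dotplot programs usually score a sliding window rather than single residues to reduce noise.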

6
Algorithms
  • Most frequently used methods of alignment are
  • Needleman and Wunsch
  • Smith and Waterman
  • FASTA (Pearson)
  • BLAST
  • BLAST and FASTA make approximations to achieve
    speed of alignment/database searching
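
For comparison with the approximate methods, a minimal sketch of the exact Smith and Waterman local alignment score is given below. It assumes a simple match/mismatch score and a single linear gap penalty (the default values are arbitrary choices, not taken from the slides), and it returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b by dynamic programming."""
    prev = [0] * (len(b) + 1)  # previous row of the score table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets an alignment restart anywhere (local alignment).
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 3, from the shared "ACG"
```

The two nested loops visit every cell of an L1 x L2 table, which is the cost that BLAST and FASTA avoid by approximation. Removing the floor of 0 and initialising the edges with gap penalties would give a Needleman and Wunsch style global score instead.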

7
Gap insertion costs
  • Score for insertion of a gap into a sequence
  • Score for extension of a gap into a sequence
  • Score for end alignments
  • Vary insertion/extension parameters to optimise
    alignments depending on the similarity of the
    sequences
  • Lower penalties give more gaps
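
The insertion and extension scores are often combined into a single affine gap penalty. The sketch below uses one common convention, in which the opening cost covers the first gap position and each further position pays the extension cost. Programs differ on this detail, and the default values here are placeholders, not recommendations.

```python
def affine_gap_penalty(length: int, gap_open: float = 10.0,
                       gap_extend: float = 0.5) -> float:
    """Penalty for one gap of the given length (assumed convention, see above)."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# One gap of length 3 costs less than three separate gaps of length 1.
print(affine_gap_penalty(3), 3 * affine_gap_penalty(1))  # 11.0 30.0
```

Lowering `gap_open` makes many short gaps affordable, while lowering `gap_extend` favours fewer, longer gaps.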

8
Similarity Scoring Matrices
  • Matrices used to score replacement of characters
    in a sequence
  • DNA matrix, can be binary, or more complex
  • Protein matrix, either binary, or based on
    mutation analysis, biophysical data or other
    ease of replacement measures

9
Identity DNA scoring matrix
  • 1 for identity, 0 for mismatch

10
Complex DNA scoring matrix
  • For Example
  • 4 for identity, 2 for transition, 0 for
    transversion
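
The example scheme above (4 for identity, 2 for transition, 0 for transversion) can be written as a short function. A transition is a purine to purine (A, G) or pyrimidine to pyrimidine (C, T) change, and anything else is a transversion. The function name is illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def dna_score(x: str, y: str) -> int:
    """Score a pair of bases: 4 identity, 2 transition, 0 transversion."""
    x, y = x.upper(), y.upper()
    if x not in BASES or y not in BASES:
        raise ValueError("expected single bases from A, C, G, T")
    if x == y:
        return 4
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return 2
    return 0


# Build the full 4 x 4 matrix as a dictionary keyed by base pair.
matrix = {(x, y): dna_score(x, y) for x in "ACGT" for y in "ACGT"}
print(matrix[("A", "G")], matrix[("A", "C")])  # 2 0
```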

11
Protein Matrices
  • Identity matrix 1 for identity, 0 for mismatch
  • Mutation data matrix derived from the analysis
    of the ease of substitution of one amino acid for
    another in protein evolution
  • Biophysical data matrix based on analysis of
    energy cost of substituting one residue for
    another in protein structures, solubility
    analysis, etc

12
Mutation Data Matrix
  • PAM - Point Accepted Mutation: various levels,
    differing sensitivities, derived from families of
    ancient, globular proteins
  • BLOSUM - constructed directly from multiple
    sequence alignments. Gives results closer to
    observed relationships, but has no evolutionary
    model.

13
PAM 250 Matrix (partial)
  • Positive scores allow alignment
  • Negative scores indicate poor alignment
  • Larger magnitude is better (positive scores) or
    worse (negative scores)
  • Scale is Logarithmic

14
BLOSUM Matrix
  • Positive scores allow alignment
  • Negative scores indicate poor alignment
  • Larger magnitude is better (positive scores) or
    worse (negative scores)
  • Scale is Logarithmic
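
The logarithmic scale refers to log-odds scores, which compare how often two residues are seen aligned with how often they would pair by chance. The sketch below assumes base-2 logarithms and a scaling factor of 2 (half-bit units, as commonly described for BLOSUM62). Other matrices use different bases and scales, and the frequencies in the example are invented.

```python
import math


def log_odds_score(observed_pair_freq: float, freq_a: float,
                   freq_b: float, scale: float = 2.0) -> int:
    """Rounded log-odds score: scale * log2(observed / expected by chance)."""
    expected = freq_a * freq_b
    return round(scale * math.log2(observed_pair_freq / expected))


# Pair seen 4x more often than chance gives a positive score,
# pair seen 4x less often than chance gives a negative score.
print(log_odds_score(0.04, 0.1, 0.1), log_odds_score(0.0025, 0.1, 0.1))  # 4 -4
```

A score of 0 means the pair is aligned about as often as expected by chance.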

15
Matrix Comparison
  • Selection of matrix depends on extent of
    similarity expected or of interest

16
Blast Resources
  • Blast Information and resource links
  • Blast Tutorial
  • Statistics of Similarity Searching
  • General Rules for understanding results
  • Glossary of technical terms

17
BLAST
  • Basic Local Alignment Search Tool
  • For rapid searching of pre-processed databases
  • 5 search strategies
  • Selection of databases and search targets
  • Calculated best local match
  • Gapped alignments
  • Statistics and histograms

18
(No Transcript)
19
(No Transcript)
20
Scoring an alignment
21
BLASTN
  • Compares Nucleotide sequence with Nucleotide
    Sequence Database
  • Complete range of DNA databases
  • Use to identify identity or close relationship
    between query sequence and database.
  • Not especially sensitive for identification of
    homology
  • Good for identity matching

22
Nucleotide databases 1
  • Nr - non redundant - Database of all published
    and much unpublished DNA sequence data, merged
    from Genbank, EMBL, DDBJ and PDB
  • Month - data added to nr within the last month
  • GSS - genome survey sequences, preliminary
    genomic sequencing project data
  • HTGS - high throughput genome sequences,
    unfinished genomic sequences, assembled

23
Nucleotide databases 2
  • EST - Expressed sequence tags
  • Human EST - Human Expressed sequence tags
  • Mouse EST - Mouse Expressed sequence tags
  • Other EST - Expressed sequence tags from all
    other species
  • Other databases: E.coli, Yeast, Mito, pdb, kabat,
    patents, vector, alu

24
BLASTP
  • Protein sequence against protein database
  • Complete range of Protein databases
  • Use to identify identity and distant relationship
    between query sequence and database.
  • Sensitive for identification of homology
  • Implementation as PSI-BLAST improves sensitivity
  • Good for identity matching

25
Protein Databases
  • Nr - non redundant - all available data from
    protein and DNA sequence
  • Month - most recent 30 days updates
  • Swissprot - curated and annotated database
  • PDB - protein sequences for 3D structures
  • E.coli - E.coli
  • Yeast - Saccharomyces
  • Kabat - immunological sequences

26
TBLASTN
  • Compares a protein query sequence against a
    nucleotide sequence database translated into all
    reading frames.
  • Sensitive (as with BLASTP) for homology and
    identity
  • Used to identify possible coding sequences and
    homologies
  • Especially useful with genome and EST data

27
BLASTX
  • Compares a nucleotide query sequence translated
    in all reading frames against a protein sequence
    database.
  • Sensitive for homology and identity matching
  • You could use this option to find potential
    translation products of an unknown nucleotide
    sequence.
  • Useful for new DNA sequences

28
TBLASTX
  • Compares the six-frame translations of a
    nucleotide query sequence against the six-frame
    translations of a nucleotide sequence database.
  • Very useful for homology searching
  • Less useful with identities (matches everything 6
    times!)
  • Helps to select out conserved coding sequence
    from non-coding background
  • Especially useful for cross-species analysis with
    genomic, cDNA and EST data
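
The six-frame translation used by TBLASTN, BLASTX and TBLASTX can be sketched as follows. This assumes the standard genetic code and plain A/C/G/T input. Codons containing any other character are translated as X, and the frame labels (+1 to -3) are a naming choice made here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq: str) -> dict[str, str]:
    """Three frames on the given strand (+) and three on its reverse complement (-)."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

Because TBLASTX translates both the query and the database in six frames, an identical sequence can match in every frame pairing, which is why identities are reported many times over.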

29
Blast Inputs
  • Input sequence or accession number
  • Choice of search program
  • Choice of subsequence
  • Choice of database
  • Choice of codon table

30
Blast Input Page
31
Blast Matrix choice
32
Species/Genus/Phyla selection
33
Further species selection
34
Boolean operators
  • Logical operators used to specify selection
  • AND, OR, NOT, (IS, BEFORE, NEXT TO)
  • Use to get greater specificity of selection
  • For example
  • Mammal NOT Human
  • Limits selection to all mammals, but excluding
    humans.
  • Vertebrate NOT mammal
  • Would select all non-mammalian vertebrates
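
The two NOT examples above behave like set difference. A toy sketch with Python sets follows. The species lists are invented and far from complete.

```python
# Toy taxon sets, for illustration only.
mammals = {"human", "mouse", "cow"}
vertebrates = mammals | {"chicken", "zebrafish"}
human = {"human"}

mammal_not_human = mammals - human            # Mammal NOT Human
vertebrate_not_mammal = vertebrates - mammals  # Vertebrate NOT mammal

print(sorted(mammal_not_human))       # ['cow', 'mouse']
print(sorted(vertebrate_not_mammal))  # ['chicken', 'zebrafish']
```

AND corresponds to intersection (`&`) and OR to union (`|`) in the same way.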

35
Format Submission
36
Email submission
37
Blast Output - summary
Input sequence, name and size
DATABASE - number of sequences, total length
38
Blast Output - Graphical
Regions of homology in matching sequences Colour
coded for scores
Input sequence scale
39
Blast Output - lists and stats
Database entry
Description of entry
Score
Stats
40
Blast Statistics
  • Statistical values calculated relative to the
    size of the database
  • Depend on the length of the match
  • Values expressed as exponentials
  • 3e-17 is 3 x 10^-17
  • Smaller E value is better match, because
    statistically less likely to be a random event
  • Exact match has E = 0 (cannot be random)
  • Values greater than 10^-5 are questionable
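
The dependence on database size can be seen from the usual relation between a bit score S' and the expect value, E = m x n x 2^(-S'), where m is the query length and n the database length. The sketch below uses raw lengths, whereas BLAST itself applies corrected (effective) lengths, so treat the numbers as illustrative only.

```python
import math


def expect_value(bit_score: float, query_length: int,
                 database_length: int) -> float:
    """E = m * n * 2^(-S') for a bit score S' (uncorrected lengths assumed)."""
    return query_length * database_length * 2.0 ** (-bit_score)


# The same score is less significant in a database twice the size.
print(expect_value(10, 4, 256), expect_value(10, 4, 512))  # 1.0 2.0

# BLAST-style exponential notation parses directly as a number.
reported = float("3e-17")
print(math.isclose(reported, 3 * 10 ** -17), reported < 1e-5)  # True True
```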

41
Blast alignments - exact match
42
Blast alignments - homology
43
BLAST-2-Sequences
  • Pairwise comparison of two sequences only
  • All 5 versions of BLAST available
  • (so all combinations of DNA/Protein possible)
  • Graphical display
  • Sequence alignments
  • Statistical significance calculated from
    database size

44
Statistical Significance and Histograms
  • Start with Tutorial on
  • The Statistics of Sequence Similarity Scores
  • Covers matrices, gapping, global vs local
    alignment, statistical significance.

45
Multiple Sequence alignments
  • Alignment of three or more sequences together
  • Cannot do alignments simultaneously (excessively
    large computational problem)
  • So various options used to develop a rapid
    strategy to align sequences
  • Best option is to align all sequences as pairs,
    then build the multiple sequence alignment by
    adding the most closely related sequences in order

46
Clustal approach to MSA
  • Compare all sequences with each other
  • Pair each sequence with the closest partner
  • Align closest partners
  • Align next closest partner to create groups
  • Align groups of sequences until completed
  • Build phylogenetic tree
  • Efficient method because only do pairwise
    alignments, and only align closest pairs.
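
The first two steps (compare all sequences, then find the closest partners) can be sketched with a simple distance measure. This is a simplification and not Clustal's actual procedure. It assumes the sequences are already the same length and uses the fraction of differing positions as the distance, where a real program would score pairwise alignments and build a guide tree.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pairs(seqs: dict[str, str]) -> list[tuple[float, str, str]]:
    """All pairs sorted from most to least similar."""
    return sorted((p_distance(seqs[m], seqs[n]), m, n)
                  for m, n in combinations(sorted(seqs), 2))


toy = {"a": "ACGT", "b": "ACGA", "c": "TTTT"}  # hypothetical sequences
for distance, first, second in closest_pairs(toy):
    print(first, second, distance)
```

The pair at the top of the list would be aligned first, and the remaining sequences or groups added in order of increasing distance.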

47
Outputs from MSA analysis
  • Sequence alignment
  • Or frequency matrix (Profile)
  • Similarity plot
  • Phylogenetic tree
  • Applies to all programs used for MSA

48
Frequency matrix or Profiles
  • First matrix created from aligned protein
  • Multiplied by mutation data matrix (PAM or
    BLOSUM)
  • Creates a matrix which is a frequency weighted
    matrix specific to the alignment of sequences
  • Provides very sensitive alignment tool
  • Can do similarly for DNA sequences
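
The first step, counting residue frequencies per alignment column, can be sketched as below. The weighting by a PAM or BLOSUM matrix is left out, and the function name and toy alignment are illustrative.

```python
from collections import Counter


def column_frequencies(alignment: list[str]) -> list[dict[str, float]]:
    """Per-column residue frequencies for a list of aligned, equal-length sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({residue: n / len(column) for residue, n in counts.items()})
    return profile


for position, freqs in enumerate(column_frequencies(["ACDE", "ACDQ", "AGDE"]), start=1):
    print(position, freqs)
```

Fully conserved columns end up with a single residue at frequency 1.0, which is what makes a profile more position specific than a general matrix.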

49
DWNN domain alignment
50
Phylogenetic trees
  • Way to show the best predicted evolutionary
    relationship between aligned sequences
  • Confidence level depends on method used
  • Should relate to evolutionary distance between
    sequences
  • Display distances as length and position of
    branches
  • Should show up orthologs and paralogs
  • Need to root trees correctly for them to give
    correct picture
  • For very distant sequences it is hard to be
    certain of the branching order

51
DWNN as a phylogenetic marker
52
PSI-BLAST
  • Position-Specific Iterated BLAST
  • Starts with a BLASTP search
  • Generates a set of matches
  • Select matches above a threshold
  • Align sequences scoring above threshold
  • Create a frequency matrix from this alignment
  • Search database with frequency matrix
  • Repeat the select, align, build matrix and
    search steps until no new sequences are added
    above the threshold level
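
The control flow of the iteration can be sketched independently of BLAST itself. In the code below, `search` and `build_profile` are hypothetical stand-ins supplied by the caller. They are not real BLAST functions, and the toy "database" is just a table of which sequence finds which.

```python
def iterate_search(seed, search, build_profile, max_rounds=10):
    """Repeat profile building and searching until no new hits are added."""
    hits = {seed} | set(search(seed))  # initial search with the single sequence
    for _ in range(max_rounds):
        profile = build_profile(hits)        # stands in for align + frequency matrix
        new_hits = hits | set(search(profile))
        if new_hits == hits:                 # converged: nothing new above threshold
            break
        hits = new_hits
    return hits


# Toy stand-ins: each sequence "finds" its listed neighbours.
NEIGHBOURS = {"A": {"B"}, "B": {"C"}, "C": set()}


def toy_search(query):
    members = [query] if isinstance(query, str) else query
    found = set()
    for member in members:
        found |= NEIGHBOURS[member]
    return found


print(sorted(iterate_search("A", toy_search, frozenset)))  # ['A', 'B', 'C']
```

In the toy run, C is only reachable after B has been added, which mirrors how later rounds can pick up relatives that the starting sequence alone would miss.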

53
Advantages of PSI-BLAST
  • Generates an alignment from single starting
    sequence
  • Creates a specific matrix for each search
    strategy
  • Final matrix should be the same for any family of
    proteins, which ever the starting sequence used
    for the search
  • Much more sensitive than BLASTP alone

54
Current disadvantages of PSI-BLAST
  • Does not show alignment used to generate matrix
  • Does not show matrix
  • Cannot generate final sequence alignment
  • May generate several alignments at one time if
    there are several domains in the protein
  • Only uses BLASTP, not TBLASTN or BLASTX, so
    databases restricted to protein

55
Databases of genomics data
  • Databases of genomic sequence data and
    predicted/known genes (eg NCBI Genomes)
  • Annotated databases, integrating genetic map,
    physical map (clones), sequence data, known
    genes, and predicted genes
  • Databases based on AceDB
  • Integration of physical and genetic mapping data
  • Common interface for genomics data

56
Databases of research papers
  • Many different sources
  • NCBI PubMed is major site of medical and
    molecular data. But is missing many plant and
    agricultural papers.
  • Agricola - agricultural/biological
  • Current Contents - everything!
  • www.sciencedirect.com - journals from Elsevier
    press, currently free to UWC
  • Others at UWC library web site

57
Search Engines
  • Various types and strategies
  • Web based spiders, crawlers etc (Google, Excite,
    Yahoo, Altavista)
  • Database based: PubMed etc
  • PubMed provides comprehensive indexing
  • Internally compared to give related references
  • Linked access to sequences and literature sites

58
Search Strategies
  • Need to choose keywords carefully
  • Use author names when possible
  • Can sometimes select by dates
  • Use review as a search term when appropriate
  • Add more terms to get greater selectivity
  • Avoid general terms, like cell, human, gene
  • Use Boolean operators (and, or, not)
  • Look for related articles (in PubMed) based on
    internal text comparison to find most related
    papers

59
Genome annotation engines
  • Input DNA sequence data
  • Search with databases and predictive methods to
    identify possible coding sequences, promoters,
    splice junctions, exons, poly A sites, tRNA
    genes, repeat sequences
  • Some sites do all of this, some dedicated to one
    type of analysis (eg promoters)