Predictive methods using DNA sequences Unit 11 - PowerPoint PPT Presentation

Provided by: irenegab

1
Predictive methods using DNA sequences, Unit 11
  • BIOL221T Advanced Bioinformatics for
    Biotechnology

Irene Gabashvili, PhD
2
Reminders from last week
  • Polymorphism and mutations
  • Mapping and Sequencing
  • Genomic Map Elements
  • Types of Maps
  • Resources
  • Practical Use

3
Polymorphism - Types of variation
  • SNP[snp_class]: true single nucleotide
    polymorphism
  • in-del: insertion/deletion polymorphism
    ('-/)
  • Microsatellite/simple sequence repeat
  • [FUNC] Function_Class values:
  • "coding nonsynonymous"
  • "locus region", "intron", "exception"
  • "mrna utr", "splice site"
  • "coding synonymous"

4
Nonsynonymous Mutations
  • Missense: a type of nonsynonymous mutation (a
    different amino acid in the product of the
    mutated gene)
  • Example (sickle-cell disease): the replacement of
    A by T at the 17th nucleotide of the gene for the
    beta chain of hemoglobin changes the codon GAG
    (for glutamic acid) to GTG (which encodes
    valine). Thus the 6th amino acid in the chain
    becomes valine instead of glutamic acid.

5
Nonsynonymous Mutations
  • An example of a nonsense mutation: in one
    patient with cystic fibrosis (Patient B), the
    substitution of a T for a C at nucleotide 1609
    converted a glutamine codon (CAG) to a STOP codon
    (TAG). The protein produced by this patient had
    only the first 493 amino acids of the normal
    chain of 1480 and could not function.

6
Nonsynonymous Mutations
  • The new nucleotide changes a codon that specified
    an amino acid to one of the STOP codons (TAA,
    TAG, or TGA). Therefore, translation of the
    messenger RNA transcribed from this mutant gene
    will stop prematurely. The earlier in the gene
    that this occurs, the more truncated the protein
    product and the more likely that it will be
    unable to function. Mutations of this type are
    called nonsense mutations.

7
Insertions and Deletions (Indels): ADRB1[gene]
AND human[orgn] AND "in-del"[snp_class]
  • Base pairs may be added (insertions) or removed
    (deletions) from the DNA of a gene. The number
    can range from one to thousands.
  • As a result, translation of the gene can be
    "frameshifted". Indels of three nucleotides or
    multiples of three may be less serious.
  • Huntington's disease and the fragile X syndrome
    are examples of trinucleotide repeat diseases
    caused by insertions (expansion of the repeat)

8
Silent and splice-site mutations
  • For example, if the third base in the TCT codon
    for serine is changed to any one of the other
    three bases, serine will still be encoded. Such
    mutations are said to be silent because they
    cause no change in protein (synonymous)
  • Nucleotide signals at the splice sites guide the
    enzymatic machinery. If a mutation alters one of
    these signals, then the intron is not removed and
    remains as part of the final RNA molecule. This
    alters the sequence of the protein product.
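
The missense, nonsense, and silent cases described on the preceding slides can be checked with a small script. This is a minimal sketch, not part of the original deck: it builds the standard genetic code from a compact string and classifies a single-codon substitution. The function name `classify_substitution` is hypothetical.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change as silent, nonsense, stop-loss, or missense."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"        # synonymous: same amino acid (or stop)
    if alt_aa == "*":
        return "nonsense"      # amino acid codon became a STOP codon
    if ref_aa == "*":
        return "stop-loss"     # STOP codon became an amino acid codon
    return "missense"          # different amino acid


print(classify_substitution("GAG", "GTG"))  # sickle-cell example: Glu -> Val
print(classify_substitution("CAG", "TAG"))  # cystic fibrosis example: Gln -> STOP
print(classify_substitution("TCT", "TCC"))  # serine codon, third base changed
```

The three calls correspond to the sickle-cell, cystic fibrosis, and serine examples in the slides.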

9
Types of Maps (see MapViewer)
  • Cytogenetic
  • Genetic Linkage
  • Physical
  • Radiation Hybrid
  • Sequence-based

10
Genomic Map Elements
  • DNA markers, PCR-based
  • STS
  • Polymorphic markers
  • RFLPs, VNTRs, SNPs
  • DNA clones
  • BACs and PACs

11
Databases and Servers
  • BLAT
  • MapView
  • GeneCards
  • GeneLoc
  • Stanford Source
  • Bioinformatics Harvester

12
Predictive methods using DNA sequences (BO, chapter 5)
  • Gene Prediction methods
  • Gene Prediction Programs
  • How good are the methods?
  • Promoter Analysis
  • Strategies and Considerations
  • Markov models
  • HMMs in Gene Prediction
  • Discriminant Analysis in Gene Prediction

13
Sequence Signals Gene Structure
14
Sequence Signals Gene Structure
  • UCSC Genome Browser
  • Ensembl
  • NCBI's Gene Viewer

15
What is Computational Gene Finding?
  • Given an uncharacterized DNA sequence, find out:
  • Which region codes for a protein?
  • Which DNA strand is used to encode the gene?
  • Which reading frame is used in that strand?
  • Where does the gene start and end?
  • Where are the exon-intron boundaries in
    eukaryotes?
  • (optionally) Where are the regulatory sequences
    for that gene?
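
The strand and reading-frame questions above can be illustrated with a naive open reading frame (ORF) scan. This is only a sketch under simplifying assumptions (no introns, ATG as the only start codon, a hypothetical `min_codons` length filter); real gene finders combine many more signals.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan both strands and all three frames for ATG...STOP stretches.

    Returns (strand, frame, start, end) tuples; start and end are 0-based,
    end-exclusive, and measured on the strand that was scanned.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first in-frame start codon
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```

In prokaryotes a scan like this already gets close to the gene structure; in eukaryotes the exon-intron boundaries make it insufficient on its own.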

16
Gene Prediction Methods
  • Searching by Signal
  • Searching by Content
  • Homology-based Gene Prediction
  • Comparative Gene Prediction
  • Ab initio, intrinsic, template (1st and 2nd)
    vs extrinsic, look-up (3rd and 4th)

17
Eukaryotes vs Prokaryotes
  • Eukaryotes: genes separated by intergenic DNA,
    coding exons separated by large introns
  • Prokaryotes: ORFs adjacent to one another

18
Prokaryotic Vs. Eukaryotic Gene Finding
  • Prokaryotes
  • small genomes: 0.5 to 10 × 10^6 bp
  • high coding density (>90%)
  • no introns
  • Gene identification relatively easy, with success
    rate ~99%
  • Problems
  • overlapping ORFs
  • short genes
  • finding TSS and promoters
  • Eukaryotes
  • large genomes: 10^7 to 10^10 bp
  • low coding density (<50%)
  • intron/exon structure
  • Gene identification a complex problem, gene level
    accuracy ~50%
  • Problems
  • many

19
Gene Structure
20
Gene Finding Different Approaches
  • Similarity-based methods (extrinsic) - use
    similarity to annotated sequences
  • proteins
  • cDNAs
  • ESTs
  • Comparative genomics - Aligning genomic sequences
    from different species
  • Ab initio gene-finding (intrinsic)
  • Integrated approaches

21
Similarity-based methods
  • Based on sequence conservation due to functional
    constraints
  • Use local alignment tools (Smith-Waterman algo,
    BLAST, FASTA) to search protein, cDNA, and EST
    databases
  • Will not identify genes that code for proteins
    not already in databases (can identify ~50% of
    new genes)
  • Limits of the regions of similarity not well
    defined

22
Comparative Genomics
  • Based on the assumption that coding sequences are
    more conserved than non-coding
  • Two approaches
  • intra-genomic (gene families)
  • inter-genomic (cross-species)
  • Alignment of homologous regions
  • Difficult to define limits of higher similarity
  • Difficult to find optimal evolutionary distance
    (pattern of conservation differs between loci)

23
(No Transcript)
24
Summary for Extrinsic Approaches
  • Strengths
  • Rely on accumulated pre-existing biological data,
    thus should produce biologically relevant
    predictions
  • Weaknesses
  • Limited to pre-existing biological data
  • Errors in databases
  • Difficult to find limits of similarity

25
Signal Sensors
  • Signal: a string of DNA recognized by the
    cellular machinery

26
Signal Sensors
  • Various pattern recognition methods are used for
    identification of these signals
  • consensus sequences
  • weight matrices
  • weight arrays
  • decision trees
  • Hidden Markov Models (HMMs)
  • neural networks

27
Example of Consensus Sequence
  • obtained by choosing the most frequent base at
    each position of the multiple alignment of
    subsequences of interest
  • Aligned sites: TACGAT, TATAAT, TATAAT, GATACT,
    TATGAT, TATGTT
  • consensus sequence: TATAAT
  • consensus (IUPAC): TATRNT
  • Word analogy: MELON, MANGO, HONEY, SWEET, COOKY
    give the consensus MONEY
  • Leads to loss of information and can produce
    many false positive or false negative
    predictions

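
The consensus rule on this slide (most frequent symbol in each column) can be sketched in a few lines. The tie-breaking rule, alphabetical order, is an assumption made here so that the result is deterministic; the slide does not say how ties are resolved.

```python
from collections import Counter


def consensus(sites):
    """Most frequent symbol per aligned column; ties broken alphabetically."""
    result = []
    for column in zip(*sites):
        counts = Counter(column)
        best = max(counts.values())
        # Among the symbols with the top count, pick the alphabetically first.
        result.append(min(sym for sym, n in counts.items() if n == best))
    return "".join(result)


sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
words = ["MELON", "MANGO", "HONEY", "SWEET", "COOKY"]

print(consensus(sites))  # TATAAT
print(consensus(words))  # MONEY
```

Note that position 4 of the DNA sites is a 3:3 tie between A and G, which is exactly the information a plain consensus string throws away.
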
28
Example of (Positional) Weight Matrix
  • Computed by measuring the frequency of every
    element at every position of the site (weight)
  • Score for any putative site is the sum of the
    matrix values (converted to probabilities) for
    that sequence (log-likelihood score)
  • Disadvantages
  • cut-off value required
  • assumes independence between adjacent bases
  • Example sites: TACGAT, TATAAT, TATAAT, GATACT,
    TATGAT, TATGTT

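
A positional weight matrix for the six example sites can be sketched as follows. The pseudocount of 1, the uniform background of 0.25 per base, and the cutoff of 5.0 are illustrative assumptions, not values from the slide; the function names are hypothetical.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds matrix: one dict (base -> log2(p / background)) per position."""
    n = len(sites)
    pwm = []
    for column in zip(*sites):
        row = {}
        for base in "ACGT":
            p = (column.count(base) + pseudocount) / (n + 4 * pseudocount)
            row[base] = math.log2(p / background)
        pwm.append(row)
    return pwm


def score(pwm, candidate):
    """Sum of per-position weights; positions are treated as independent."""
    return sum(row[base] for row, base in zip(pwm, candidate))


def scan(pwm, seq, cutoff):
    """Return (position, score) for every window scoring at least cutoff."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        s = score(pwm, seq[i:i + width])
        if s >= cutoff:
            hits.append((i, s))
    return hits


sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
pwm = build_pwm(sites)
print(scan(pwm, "CCGTATAATGC", cutoff=5.0))
```

Both disadvantages listed on the slide are visible here: `cutoff` has to be chosen by hand, and `score` simply adds per-position terms, ignoring any dependence between adjacent bases.
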
29
Example of Decision Tree
30
Markov Models
31
Ingredients of a Markov Model
  • Collection of states
  • $S_1, S_2, \ldots, S_N$
  • State transition probabilities (transition
    matrix)
  • $A_{ij} = P(q_{t+1} = S_i \mid q_t = S_j)$
  • Initial state distribution
  • $\pi_i = P(q_1 = S_i)$
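
As a sketch of how these ingredients are used, the snippet below computes the log-probability of a DNA string under a first-order Markov chain and compares two models. The nested-dictionary layout `transition[prev][nxt]` is this example's own convention, and the "GC-rich" parameters are made-up illustrative numbers, not trained values.

```python
import math


def sequence_log_prob(seq, initial, transition):
    """log P(seq): initial[s] = P(q_1 = s),
    transition[prev][nxt] = P(q_{t+1} = nxt | q_t = prev)."""
    logp = math.log(initial[seq[0]])
    for prev, nxt in zip(seq, seq[1:]):
        logp += math.log(transition[prev][nxt])
    return logp


BASES = "ACGT"

# Model 1: every base equally likely, regardless of the previous base.
uniform_init = {b: 0.25 for b in BASES}
uniform_trans = {a: {b: 0.25 for b in BASES} for a in BASES}

# Model 2 (hypothetical): a GC-rich region.
gc_row = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
gc_init = dict(gc_row)
gc_trans = {a: dict(gc_row) for a in BASES}


def log_odds(seq):
    """Positive when the GC-rich model explains seq better than the uniform one."""
    return (sequence_log_prob(seq, gc_init, gc_trans)
            - sequence_log_prob(seq, uniform_init, uniform_trans))


print(log_odds("GCGC"), log_odds("ATAT"))
```

Content sensors in gene finders use the same idea, with one model trained on coding sequence and another on non-coding sequence.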

32
Hidden Markov Models
33
Ingredients of a HMM
  • Collection of states $S_1, S_2, \ldots, S_N$
  • State transition probabilities (transition
    matrix)
  • $A_{ij} = P(q_{t+1} = S_i \mid q_t = S_j)$
  • Initial state distribution
  • $\pi_i = P(q_1 = S_i)$
  • Observations $O_1, O_2, \ldots, O_M$
  • Observation probabilities
  • $B_j(k) = P(v_t = O_k \mid q_t = S_j)$
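
Given those ingredients, the most probable hidden-state path can be recovered with the Viterbi algorithm. The two-state model below ("N" for non-coding, "C" for coding) and all of its probabilities are toy assumptions chosen for illustration; `trans_p[prev][nxt]` is this example's own indexing convention.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state path, computed in log space."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
          for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for landing in s at this step.
            prev = max(states, key=lambda r: V[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (V[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        V.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


states = ("N", "C")
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {"N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},   # AT-rich
          "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}   # GC-rich

print("".join(viterbi("AAAAGGGG", states, start_p, trans_p, emit_p)))  # NNNNCCCC
```

HMM-based gene finders decode a sequence in a comparable way, with many more states (exons, introns, intergenic regions) and trained parameters.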

34
Examples of Gene Finders
  • FGENES: linear discriminant functions (DF) for
    content and signal sensors, and dynamic
    programming (DP) for finding the optimal
    combination of exons
  • GeneMark: HMMs enhanced with ribosomal binding
    site recognition
  • Genie: neural networks for splicing, HMMs for
    coding sensors, overall structure modeled by HMM
  • Genscan: weight matrices (WM), weight arrays (WA)
    and decision trees as signal sensors, HMMs for
    content sensors, overall HMM
  • HMMgene: HMM trained using conditional maximum
    likelihood
  • Morgan: decision trees for exon classification,
    also Markov models
  • MZEF: quadratic DF, predicts only internal exons

35
Genscan Example
  • Developed by Chris Burge (1997)
  • One of the most accurate ab initio programs
  • Uses explicit state duration HMM to model gene
    structure (different length distributions for
    exons)
  • Different model parameters for regions with
    different GC content

36
Ab initio Gene Finding is Difficult
  • Genes are separated by large intergenic regions
  • Genes are not continuous, but split in a number
    of (small) coding exons, separated by (larger)
    non-coding introns
  • In humans, coding sequences comprise only a few
    percent of the genome and an average of 5% of
    each gene
  • Sequence signals that are essential for
    elucidation of a gene structure are degenerate
    and highly unspecific
  • Alternative splicing
  • Repeat elements (>50% in humans); some contain
    coding regions

37
Problems with Ab initio Gene Finding
  • No biological evidence
  • In long genomic sequences many false positive
    predictions
  • Prediction accuracy high, but not sufficient

38
Evaluation of Gene Finding Programs
  • Calculating accuracy of programs' predictions
  • Many evaluation studies, among the earliest:
  • Burset and Guigó, 1996 (vertebrate sequences)
  • Pavy et al., 1999 (Arabidopsis thaliana)
  • Rogic et al., 2001 (mammalian sequences)

39
Measures of Prediction Accuracy, Part 1
  • Nucleotide level accuracy
  • Sensitivity
  • Specificity

Sensitivity = number of correct exons / number of actual exons
Specificity = number of correct exons / number of predicted exons
40
Measures of Prediction Accuracy, Part 2
  • Exon level accuracy

(Figure: REALITY and PREDICTION tracks, with exons labeled
MISSING EXON, WRONG EXON, and CORRECT EXON)
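
Exon-level accuracy can be sketched with exons represented as (start, end) pairs. The definitions used here are assumptions borrowed from common practice in evaluation studies: a predicted exon is "correct" only if both boundaries match exactly, a "missing" exon is an actual exon overlapped by no prediction, and a "wrong" exon is a prediction that overlaps no actual exon. The coordinates are invented.

```python
def overlaps(a, b):
    """True when two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level_accuracy(actual, predicted):
    """Exon-level sensitivity and specificity plus missing and wrong exons."""
    correct = set(actual) & set(predicted)          # exact boundary matches
    missing = [a for a in actual
               if not any(overlaps(a, p) for p in predicted)]
    wrong = [p for p in predicted
             if not any(overlaps(p, a) for a in actual)]
    return {
        "sensitivity": len(correct) / len(actual),     # correct / actual
        "specificity": len(correct) / len(predicted),  # correct / predicted
        "missing": missing,
        "wrong": wrong,
    }


actual = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (300, 420), (700, 800)]
print(exon_level_accuracy(actual, predicted))
```

In this toy case the second prediction overlaps a real exon but misses its end boundary, so it counts as neither correct nor wrong.
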
41
Integrated Approaches for Gene Finding
  • Programs that integrate results of similarity
    searches with ab initio techniques (GenomeScan,
    FGENESH, Procrustes)
  • Programs that use synteny between organisms
    (ROSETTA, SLAM)
  • Integration of programs predicting different
    elements of a gene (EuGène)
  • Combining predictions from several gene finding
    programs (combination of experts)

42
Combining Programs' Predictions
  • Set of methods used and the way they are
    integrated differs between individual programs
  • Different programs often predict different
    elements of an actual gene
  • they could complement each other yielding
    better prediction

43
Gene Prediction Links
  • http://genome.imim.es/geneid.html
  • http://genes.mit.edu/GENSCAN.html
  • FGENES, commercial, but can try
  • http://www.softberry.com/berry.phtml?topic=fgenes&group=programs&subgroup=gfind

44
GeneID
  • Hierarchical approach
  • Splice sites and stop codons predicted and scored
    using position-specific weight matrices
  • Exons built from identified defining sites.
    Scored as the sum of scores of defining sites
    plus the score of their coding potential
  • Gene structure assembled by maximizing the total
    score
  • Later versions of the program add sequence
    similarity searches

45
Genscan, FGENES, GeneWise
  • Genscan: built on an underlying hidden Markov
    model
  • FGENES: linear discriminant analysis to identify
    splice sites, exons, promoter elements
  • GeneWise: compares a genomic sequence with a
    protein sequence or with HMMs representing
    protein sequences

46
How good are the methods?
  • Different methods give different results. How to
    measure accuracy?
  • Sensitivity: proportion of coding nucleotides,
    exons, genes predicted correctly (true positives)
  • Specificity: proportion of predicted elements,
    genes that are real
  • Correlation coefficient: combines both
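
At the nucleotide level, these three measures can be computed from per-base labels. This is a minimal sketch that assumes each sequence position is marked "C" (coding) or "N" (non-coding); the label strings are invented. Specificity follows the slide's gene-finding definition (fraction of predicted coding bases that are real), which differs from the clinical definition.

```python
import math


def nucleotide_accuracy(actual, predicted):
    """Return (sensitivity, specificity, correlation coefficient)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == "C" and p == "C":
            tp += 1
        elif a == "N" and p == "C":
            fp += 1
        elif a == "C" and p == "N":
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)   # coding bases that were predicted
    specificity = tp / (tp + fp)   # predicted coding bases that are real
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom
    return sensitivity, specificity, cc


actual =    "NNCCCCNNNN"
predicted = "NCCCCNNNNN"
print(nucleotide_accuracy(actual, predicted))
```

The correlation coefficient is undefined when any of the four marginal totals is zero, for example when nothing is predicted as coding; the sketch does not guard against that case.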

47
Screening Test for Occult Cancer
  • 100 patients with occult cancer: 95 have "x" in
    their blood (SENSITIVITY = 95%)
  • 100 patients without occult cancer: 95 do not
    have "x" in their blood (SPECIFICITY = 95%)
  • 5 out of every 1000 randomly selected individuals
    will have occult cancer (PREVALENCE = 0.5%)

48
2 X 2 Table
Entire population: 100,000
If a patient has "x" in his blood, the chance of
occult cancer is 475 / 5450 ≈ 8.7%
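
The arithmetic behind the 2 x 2 table can be reproduced from the three numbers on the previous slide (sensitivity 95%, specificity 95%, prevalence 5 in 1000). The function name is hypothetical; specificity is used here in the clinical sense, the fraction of healthy people who test negative.

```python
def positive_predictive_value(sensitivity, specificity, prevalence,
                              population=100_000):
    """Return (true positives, false positives, P(disease | positive test))."""
    diseased = population * prevalence
    healthy = population - diseased
    true_pos = sensitivity * diseased          # diseased and test positive
    false_pos = (1 - specificity) * healthy    # healthy but test positive
    return true_pos, false_pos, true_pos / (true_pos + false_pos)


tp, fp, ppv = positive_predictive_value(0.95, 0.95, 0.005)
print(round(tp), round(fp), f"{ppv:.1%}")
```

With these inputs about 475 of the positive tests are true positives and roughly ten times as many are false positives, so a positive result means cancer in only about 8.7% of cases despite the test being "95% accurate".
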
49
Standard Terminology
True Positives (TPs)
False Negatives (FNs)
False Positives (FPs)
True Negatives (TNs)
Entire Population
50
Definitions
51
What is a Positive Test?
  • All the analysis has assumed that it is clear
    whether a test is positive or negative
  • In reality, many tests involve continuous values
    so that one result may be more positive than
    another
  • How should one define the cut-off at which a test
    is judged to be abnormal?

52
Continuously Valued Variables
(Figure; axis label: Result)
53
Continuously Valued Variables
  • Fewer false positives (more conservative)
  • More false negatives
  • Higher specificity
  • Lower sensitivity

(Figure; labels: Normal cutoff, Result)
54
Continuously Valued Variables
(Figure; axis label: Result)
  • Fewer false negatives (more aggressive)
  • More false positives
  • Higher sensitivity
  • Lower specificity
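
The trade-off on the last two slides can be seen by moving a cutoff across two overlapping sets of test values. The values below are invented for illustration, and a result is called positive when it is at or above the cutoff; specificity is used in the clinical sense (true negatives over all healthy subjects).

```python
def sens_spec_at_cutoff(diseased_values, healthy_values, cutoff):
    """Return (sensitivity, specificity) when value >= cutoff means positive."""
    true_pos = sum(v >= cutoff for v in diseased_values)
    true_neg = sum(v < cutoff for v in healthy_values)
    return true_pos / len(diseased_values), true_neg / len(healthy_values)


healthy = [1, 2, 3, 4, 5, 6]     # hypothetical test results, no disease
diseased = [4, 5, 6, 7, 8, 9]    # hypothetical test results, disease present

for cutoff in (4, 5.5, 7):
    sens, spec = sens_spec_at_cutoff(diseased, healthy, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A low cutoff (4) catches every diseased case but flags half of the healthy ones; a high cutoff (7) does the opposite. The same choice appears in gene finding as the score cutoff for a weight matrix.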

55
More on Projects
  • Vaccine development (Ramya)
  • http://immunax.dfci.harvard.edu/PEPVAC/
  • HIV and the Black Plague (Harshal)
  • CCR5 - chemokine (C-C motif) receptor 5
  • HIV drug resistance: http://hivdb.stanford.edu/
  • Gene Annotation (Chris)
  • Pharmacogenomics (Jennifer)

56
More on Projects
  • Disease network Jyoti
  • Disease networks (gout) Annie
  • Genotyping Nancy
  • Physiological Genomics Erin
  • Harmeet perl program for protein structure
    analysis?

57
More on Projects
  • Human Genetic Variation - Priyanka
  • Cloning (humans) Parag
  • Evolution Sukhpreet
  • Metabolic engineering - Danh
  • Protein structure - Tanzeema