BCB 444/544 - PowerPoint PPT Presentation


1
BCB 444/544
  • Lecture 27
  • Gene Prediction II
  • 27_Oct24

2
Required Reading (before lecture)
  • Mon Oct 22 - Lecture 26
  • Gene Prediction
  • Chp 8 - pp 97 - 112
  • Wed Oct 24 - Lecture 27 (will not be covered
    on Exam 2)
  • Promoter & Regulatory Element Prediction
  • Chp 9 - pp 113 - 126
  • Thurs Oct 25 - Review Session & Project Planning
  • Fri Oct 26 - EXAM 2

3
Assignments & Announcements
  • Mon Oct 22 - Study Guide for Exam 2 was posted,
    finally
  • Mon Oct 22 - HW4 Due
  • (no "correct" answer to post)
  • Thu Oct 25 - no Lab -> Optional Review Session
    for Exam
  • 544 Project Planning/Consult with DD & MT
  • Fri Oct 26 - Exam 2 - Will cover
  • Lectures 13-26 (thru Mon Oct 22)
  • Labs 5-8
  • HW 3 & 4
  • All assigned reading
  • Chps 6 (beginning with HMMs), 7-8, 12-16
  • Eddy: What is an HMM
  • Ginalski: Practical Lessons

4
BCB 544 "Team" Projects
  • 544 Extra HW2 is next step in Team Projects
  • Write 1 page outline
  • Schedule meeting with Michael & Drena to discuss
    topic
  • Read a few papers
  • Write a more detailed plan
  • You may work alone if you prefer
  • Last week of classes will be devoted to Projects
  • Written reports due Mon Dec 3 (no class that
    day)
  • Oral presentations (15-20') will be Wed-Fri Dec
    5,6,7
  • 1 or 2 teams will present during each class
    period
  • See Guidelines for Projects posted online

5
BCB 544 Only: New Homework Assignment
  • 544 Extra2 (posted online Thurs?)
  • No - sorry! sent by email on Sat
  • Due: PART 1 - ASAP
  • PART 2 - Fri Nov 2 by 5 PM
  • Part 1 - Brief outline of Project, email to Drena
    & Michael
  • after response/approval, then
  • Part 2 - More detailed outline of project
  • Read a few papers and summarize status of
    problem
  • Schedule meeting with Drena & Michael to
    discuss ideas

6
Seminars this Week
  • BCB List of URLs for Seminars related to
    Bioinformatics
  • http://www.bcb.iastate.edu/seminars/index.html
  • Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB
  • Dave Segal, UC Davis: Zinc Finger Protein Design
  • Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  • Guang Song, ComS, ISU: Probing functional
    mechanisms by structure-based modeling and
    simulations

7
Chp 8 - Gene Prediction
  • SECTION III GENE AND PROMOTER PREDICTION
  • Xiong Chp 8 Gene Prediction
  • Categories of Gene Prediction Programs
  • Gene Prediction in Prokaryotes
  • Gene Prediction in Eukaryotes

8
What is a Gene?
  • What is a gene? A segment of DNA, some of which is
    "structural," i.e., transcribed to give a
    functional RNA product, and some of which is
    "regulatory"
  • Genes can encode
  • mRNA (for protein)
  • other types of RNA (tRNA, rRNA, miRNA, etc.)
  • Genes differ in eukaryotes vs prokaryotes (&
    archaea), in both structure & regulation

9
Synthesis & Processing of Eukaryotic mRNA
Gene in DNA
10
What are cDNAs & ESTs?
  • cDNA libraries are important for determining gene
    structure & studying regulation of gene
    expression
  • Isolate RNA (always from a specific
    organism, region, and time point)
  • Convert RNA to complementary DNA
    (with reverse transcriptase)
  • Clone into cDNA vector
  • Sequence the cDNA inserts
  • Short cDNAs are called ESTs or
    Expressed Sequence Tags
  • ESTs are strong evidence for genes
  • Full-length cDNAs can be difficult to obtain

11
UniGene: Unique genes via ESTs
  • Find UniGene at NCBI
  • www.ncbi.nlm.nih.gov/UniGene
  • UniGene clusters contain many ESTs
  • UniGene data come from many cDNA libraries.
  • When you look up a gene in UniGene, you can
    obtain information re: level & tissue
    distribution of expression

12
Gene Prediction in Prokaryotes vs Eukaryotes
  • Eukaryotes
  • Large genomes: 10^7 - 10^10 bp
  • Often less than 2% coding
  • Complicated gene structure (splicing, long
    introns)
  • Prediction success: 50-95%
  • Prokaryotes
  • Small genomes: 0.5 - 10 x 10^6 bp
  • About 90% of genome is coding
  • Simple gene structure
  • Prediction success: 99%

13
Prediction is Easier in Microbial Genomes
  • Why?
  • Smaller genomes
  • Simpler gene structures
  • Many more sequenced genomes! (for comparative
    approaches)
  • Many microbial genomes have been fully sequenced;
    whole-genome "gene structure" and "gene
    function" annotations are available
  • e.g., GeneMark.hmm, Glimmer
  • TIGR Comprehensive Microbial Resource (CMR)
  • NCBI Microbial Genomes
14
Gene Prediction - The Problem
  • Problem
  • Given a new genomic DNA sequence, identify coding
    regions and their predicted RNA and protein
    sequences
  • ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
  • ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT

15
Computational Gene Prediction Approaches
  • Ab initio methods
  • Search by signal: find DNA sequences involved in
    gene expression
  • Search by content: test statistical properties
    distinguishing coding from non-coding DNA
  • Similarity-based methods
  • Database search: exploit similarity to proteins,
    ESTs, cDNAs
  • Comparative genomics: exploit aligned genomes
  • Do other organisms have similar sequence?
  • Hybrid methods - best

16
Computational Gene Prediction Algorithms
  • Neural Networks (NNs) (more on these later)
  • e.g., GRAIL
  • Linear discriminant analysis (LDA) (see text)
  • e.g., FGENES, MZEF
  • Markov Models (MMs) & Hidden Markov Models (HMMs)
  • e.g., GeneSeqer - uses MMs
  • GENSCAN - uses 5th order HMMs (see text)
  • HMMgene - uses conditional maximum
    likelihood (see text)

17
Gene Prediction Strategies
  • What sequence signals can be used?
  • Transcription: TF binding sites, promoter,
    initiation site, terminator, CpG islands, etc.
  • Processing signals: splice donor/acceptor sites,
    polyA signal
  • Translation: start (AUG = Met) & stop (UGA, UAA,
    UAG)
  • ORFs, codon usage
  • What other types of information can be used?
  • Homology (sequence comparison, BLAST)
  • cDNAs & ESTs (experimental data, pairwise
    alignment)
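
The start/stop signals above can be turned into a very small ORF scan. The sketch below is only an illustration, assuming an uppercase DNA string over A, C, G, T, a single start codon (ATG), and the three standard stop codons; the function names and the `min_codons` cutoff are hypothetical and not taken from any of the programs named in this lecture.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA equivalents of UAA, UAG, UGA


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """List (strand, frame, start, end) for each ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    that was scanned. min_codons counts the stop codon as well.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    example = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
    for orf in find_orfs(example):
        print(orf)
```

On the example sequence from the "Gene Prediction - The Problem" slide, the forward strand contains one complete ORF, ATGGGGCAGGGTCAGATATAA, at positions 6-27. Real gene finders add many more signals (alternative starts, splice sites, length models), so this is only the simplest possible "search by signal".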

18
Signals Search
  • Approach: build models (PSSMs, profiles, HMMs,
    ...) and search against DNA. Detected instances
    provide evidence for genes
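
As a rough illustration of the "build a model, then search" idea, the sketch below builds a log-odds PSSM from a handful of aligned sites and scans a sequence with it. The training sites, the uniform background of 0.25, the pseudocount, and the score threshold are all made-up assumptions for the example, not values from the lecture.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (list of dicts) from equal-length aligned sites."""
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        scores = {}
        for base in "ACGT":
            # Pseudocounts keep unseen bases from getting -infinity scores
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def scan(seq, pssm, threshold):
    """Return (position, score) for every window scoring >= threshold."""
    width = len(pssm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(pssm[j][seq[i + j]] for j in range(width))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits


if __name__ == "__main__":
    # Hypothetical training sites resembling a -10 promoter box
    training_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
    pssm = build_pssm(training_sites)
    print(scan("GGCGTATAATGCGC", pssm, threshold=5.0))
```

Each reported hit is a candidate signal; as the slide says, such hits are evidence for a gene rather than proof of one.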

19
DNA Signals Used in Gene Prediction
  • Exploit the regular gene structure
  • ATG - Exon1 - Intron1 - Exon2 - ... - ExonN - STOP
  • Recognize coding bias
  • CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-
  • Recognize splice sites
  • Intron - cAGt - Exon - gGTgag - Intron
  • Model the duration of regions
  • Introns tend to be much longer than exons, in
    mammals
  • Exons are biased to have a given minimum length
  • Use cross-species comparison
  • Gene structure is conserved in mammals
  • Exons are more similar (85%) than introns

20
Content Search
  • Observation: encoding a protein affects the
    statistical properties of a DNA sequence
  • Nucleotide composition
  • Hexamer frequency
  • GC content (CpG islands, exon/intron)
  • Uneven usage of synonymous codons (codon bias)
  • Method: evaluate these differences (coding
    statistics) to differentiate between coding and
    non-coding regions

21
Human Codon Usage
22
Predicting Genes based on Codon Usage
Differences
  • Algorithm
  • Slide a window along the sequence
  • Use codon frequencies to compute the probability
    of coding versus non-coding
  • Plot the log-likelihood ratio
23
Similarity-Based Methods: Database Search
  • In different genomes: translate DNA into all 6
    reading frames and search against proteins
    (TBLASTX, BLASTX, etc.)
  • Within same genome: search against EST/cDNA
    databases (EST2genome, BLAT, etc.)
  • Problems
  • Will not find new or RNA genes (non-coding
    genes)
  • Limits of similarity are hard to define
  • Small exons might be overlooked
24
Similarity-Based Methods: Comparative Genomics
  • Idea: functional regions are more conserved than
    non-functional ones; high similarity in alignment
    indicates gene
  • Advantages
  • May find uncharacterized or RNA genes
  • Problems
  • Finding suitable evolutionary distance
  • Finding limits of high similarity (functional
    regions)

25
Human-Mouse Homology
  • Comparison of 1196 orthologous genes
  • Sequence identity between genes in human vs mouse
  • Exons: 84.6%
  • Protein: 85.4%
  • Introns: 35%
  • 5' UTRs: 67%
  • 3' UTRs: 69%

26
Gene Prediction Flowchart
Fig 5.15 Baxevanis & Ouellette 2005
27
Predicting Genes - Basic steps
  • Obtain genomic sequence
  • BLAST it!
  • Perform database similarity search
    (with EST & cDNA databases, if
    available)
  • Translate in all 6 reading frames
  • (i.e., "6-frame translation")
  • Compare with protein sequence databases
  • Use Gene Prediction software to locate genes
  • Compare results obtained using different
    programs
  • Analyze regulatory sequences, too
  • Refine gene prediction

28
Predicting Genes - a few Details
  • 1st, mask to "remove" repetitive elements
    (ALUs, etc.)
  • Perform database search on translated DNA
    (BlastX, TFasta)
  • Use several programs to predict genes & find
    ORFs (GENSCAN, GeneSeqer, GeneMark.hmm, GRAIL)
  • Search for functional motifs in translated ORFs
    & in neighboring DNA sequences (InterPro,
    Transfac)
  • Repeat

29
Thanks to Volker Brendel, ISU for the following
Figs & Slides
  • Slightly modified from
  • BBSI Genome Informatics Module
  • http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
  • V Brendel vbrendel_at_iastate.edu

Brendel et al (2004) Bioinformatics 20: 1157
30
GeneSeqer
Genomic Sequence
Fast Search
Spliced Alignment
EST or protein database (Suffix Array/Suffix Tree)
Output
Assembly
Brendel 2005
31
Spliced Alignment Algorithm
GeneSeqer - Brendel et al. - ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Brendel et al (2004) Bioinformatics 20: 1157
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157
  • Perform pairwise alignment with large gaps in one
    sequence (due to introns)
  • Align genomic DNA with cDNA, ESTs, protein
    sequences
  • Score semi-conserved sequences at splice
    junctions
  • Using Bayesian probability model & 1st order MM
  • Score coding constraints in translated exons
  • Using Bayesian model

Brendel 2005
32
Signals Pre-mRNA Splicing
Brendel 2005
33
Brendel - Spliced Alignment I: Compare with cDNA
or EST probes
Brendel 2005
34
Brendel - Spliced Alignment II: Compare with
protein probes
Brendel 2005
35
Splice Site Detection
Do DNA sequences surrounding splice "consensus"
sequences contribute to splicing signal?
YES
i = ith position in sequence
Ī = avg information content over all positions
> 20 nt from splice site
σ_Ī = avg sample standard deviation of Ī
Brendel 2005
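
The slide's own formula did not survive in this transcript, so the sketch below uses the common Shannon-style measure for DNA, I_i = 2 + sum over bases of f_iB * log2(f_iB), as an assumption about what "information content" means here. The `background_stats` helper is one possible reading of the averages defined above (mean and sample standard deviation over positions more than 20 nt from the splice site); both function names are hypothetical.

```python
import math
from statistics import mean, stdev


def information_content(aligned_seqs):
    """Per-position information content (bits) of equal-length aligned DNA."""
    n = len(aligned_seqs)
    profile = []
    for column in zip(*aligned_seqs):
        info = 2.0  # maximum for a 4-letter alphabet
        for base in "ACGT":
            f = column.count(base) / n
            if f > 0:
                info += f * math.log2(f)
        profile.append(info)
    return profile


def background_stats(profile, site_index, min_distance=20):
    """Mean and sample SD of information content far from the splice site."""
    far = [value for i, value in enumerate(profile)
           if abs(i - site_index) > min_distance]
    return mean(far), stdev(far)


if __name__ == "__main__":
    # Toy alignment: invariant GGT core flanked by fully variable positions
    toy_sites = ["AGGTA", "CGGTC", "TGGTG", "GGGTT"]
    print(information_content(toy_sites))
```

Positions whose information content stands well above the background mean are the ones that plausibly contribute to the splicing signal; how far above counts as significant is a choice this sketch leaves open.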
36
Information Content vs Position
Which sequences are exons which are introns?
How can you tell?
Brendel 2005
37
Donor (GT) & Acceptor (AG) Sites Used for Model
Training
Brendel 2005
38
Markov Model for Spliced Alignment
Brendel 2005
39
Evaluation of Predictions
Predicted Positives
True Positives
False Positives
Coverage
Recall
Do not memorize this!
40
Evaluation of Predictions - in English
Coverage
IMPORTANT: Sensitivity alone does not tell us
much about performance, because 100% sensitivity
can be trivially achieved by labeling all test
cases positive!
In English? Sensitivity is the fraction of all
positive instances having a true positive
prediction.
Recall
IMPORTANT: in medical jargon, Specificity is
sometimes defined differently (what we define
here as "Specificity" is sometimes referred to as
"Positive predictive value")
In English? Specificity is the fraction of all
predicted positives that are, in fact, true
positives.
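
The two "In English" definitions translate directly into code. The sketch below follows this lecture's usage, in which "Specificity" means TP / (TP + FP), i.e. positive predictive value; the function names and the toy labels are illustrative assumptions.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for paired binary labels (1 = positive)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of all actual positives that were predicted positive."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity_ppv(tp, fp):
    """Fraction of predicted positives that are true positives
    (the lecture's "Specificity"; elsewhere called PPV or precision)."""
    return tp / (tp + fp) if tp + fp else 0.0


if __name__ == "__main__":
    actual = [1, 1, 1, 0, 0, 0, 0, 0]      # e.g. 1 = nucleotide is coding
    label_everything = [1] * len(actual)   # the "trivial" predictor
    tp, fp, fn, tn = confusion_counts(actual, label_everything)
    print(sensitivity(tp, fn), specificity_ppv(tp, fp))
```

Labeling every case positive gives perfect sensitivity but poor specificity, which is the point of the warning above: the two numbers need to be read together.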
41
Best Measures for Comparison?
  • ROC curves (Receiver Operating Characteristic)
    (?!!)
  • http://en.wikipedia.org/wiki/Roc_curve
  • Correlation Coefficient
  • Matthews correlation coefficient (MCC)
  • MCC = 1 for a perfect prediction
  • 0 for a completely random assignment
  • -1 for a "perfectly incorrect" prediction

In signal detection theory, a receiver operating
characteristic (ROC), or ROC curve, is a plot of
sensitivity vs (1 - specificity) for a binary
classifier system as its discrimination threshold
is varied. The ROC can also be represented
equivalently by plotting fraction of true
positives (TPR = true positive rate) vs fraction
of false positives (FPR = false positive rate)
Do not memorize this!
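
For reference, a small sketch of both measures, using the usual textbook definitions rather than anything specific to this course: MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and ROC points computed as (FPR, TPR) at each score threshold, where FPR = FP / (FP + TN). Note that the ROC definition relies on the conventional specificity, TN / (TN + FP), not the positive predictive value. Function names and toy data are assumptions for illustration.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct threshold, high to low.

    Assumes labels contain at least one positive (1) and one negative (0).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


if __name__ == "__main__":
    print(mcc(5, 0, 0, 5), mcc(0, 5, 5, 0), mcc(1, 1, 1, 1))
    print(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

The three MCC calls correspond to the perfect, "perfectly incorrect", and uninformative cases listed on the slide.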
42
GeneSeqer Performance?
  • Plots such as these (similar to ROCs) are much
    better than using a "single number" to compare
    different methods
  • Such plots illustrate the trade-off: Sn vs Sp
  • Note: the above are not ROC curves (plots of Sn
    vs 1-Sp)

Brendel 2005
43
GeneSeqer Results on Different Genomes
Brendel 2005
44
Performance of GeneSeqer vs Others?
  • Comparison with ab initio gene prediction
  • vs GENSCAN, an HMM-based ab initio method
  • "Winner" depends on
  • Availability of ESTs
  • Level of similarity to protein homologs

Brendel 2005
45
GeneSeqer vs GENSCAN (Exon prediction)
Target protein alignment score
GENSCAN - Burge, MIT
Brendel 2005
46
GeneSeqer vs GENSCAN (Intron prediction)
GENSCAN - Burge, MIT
Brendel 2005
47
GeneSeqer Input: http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Brendel 2005
48
GeneSeqer Output
Brendel 2005
49
GeneSeqer Gene Evidence Summary
Brendel 2005
50
Gene Prediction - Problems & Status?
  • Common errors?
  • False positive intergenic regions
  • 2 annotated genes actually correspond to a single
    gene
  • False negative intergenic region
  • One annotated gene structure actually contains 2
    genes
  • False negative gene prediction
  • Missing gene (no annotation)
  • Other
  • Partially incorrect gene annotation
  • Missing annotation of alternative transcripts
  • Current status?
  • For ab initio prediction in eukaryotes: HMMs have
    better overall performance for detecting
    intron/exon boundaries
  • Limitation? Training data: predictions are
    organism-specific
  • Combined ab initio/homology-based predictions:
    improved accuracy
  • Limitation? Availability of identifiable
    sequence homologs in databases

51
Recommended Gene Prediction Software
  • Ab initio
  • GENSCAN http://genes.mit.edu/GENSCAN.html
  • GeneMark.hmm http://exon.gatech.edu/GeneMark/
  • others: GRAIL, FGENES, MZEF, HMMgene
  • Similarity-based
  • BLAST, GenomeScan, EST2Genome, Twinscan
  • Combined
  • GeneSeqer, ROSETTA
  • Consensus: because results depend on organism &
    specific task, always use more than one
    program!
  • Two servers that report consensus predictions
  • GeneComber
  • DIGIT

52
Other Gene Prediction Resources at ISU
http://www.bioinformatics.iastate.edu/bioinformatics2go/
53
Other Gene Prediction Resources: GaTech, MIT,
Stanford, etc.
Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/
  • Current Protocols in Bioinformatics (BCB/ISU owns
    a copy - currently in my lab!)
  • Chapter 4: Finding Genes
  • 4.1 An Overview of Gene Identification
    Approaches, Strategies, and Considerations
  • 4.2 Using MZEF To Find Internal Coding Exons
  • 4.3 Using GENEID to Identify Genes
  • 4.4 Using GlimmerM to Find Genes in Eukaryotic
    Genomes
  • 4.5 Prokaryotic Gene Prediction Using GeneMark
    and GeneMark.hmm
  • 4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
  • 4.7 Application of FirstEF to Find Promoters and
    First Exons in the Human Genome
  • 4.8 Using TWINSCAN to Predict Gene Structures in
    Genomic DNA Sequences
  • 4.9 GrailEXP and Genome Analysis Pipeline for
    Genome Annotation
  • 4.10 Using RepeatMasker to Identify Repetitive
    Elements in Genomic Sequences