1
Gene Prediction
  • Chuong Huynh
  • NIH/NLM/NCBI
  • July 18, 2002
  • huynh_at_ncbi.nlm.nih.gov

Acknowledgement: Daniel Lawson, Neil Hall
2
Basic Gene Prediction Flow Chart
Obtain new genomic DNA sequence
  1. Translate in all six reading frames and
     compare to protein sequence databases
  2. Perform a database similarity search against an
     expressed sequence tag (EST) database of the same
     organism, or against cDNA sequences if available
Use a gene prediction program to locate genes
Analyze regulatory sequences in the gene
3
Why is gene prediction important?
  • Increased volume of genome data generated
  • Paradigm shift from gene by gene sequencing
    (small scale) to large-scale genome sequencing.
  • No more one gene at a time. A lot of data.
  • Foundation for all further investigation.
    Knowledge of the protein-coding regions underpins
    functional genomics.

Note: this presentation covers the prediction of
protein-coding genes only. It does not cover the
prediction of promoters or other sequences that
regulate the activity of protein-coding genes.
4
What is gene prediction?
  • Detecting meaningful signals in uncharacterised
    DNA sequences.
  • Knowledge of the interesting information in DNA.
  • Sorting the chaff from the wheat
  • Gene prediction is recognising protein-coding
    regions in genomic sequence

8
Artemis: Free Genome Visualization/Annotation
Workbench
9
Knowing what to look for
  • What is a gene?
  • Not a full transcript with control regions
  • The coding sequence (ATG -> STOP)

10
ORF Finding in Prokaryotes
  • The simplest method of finding DNA sequences that
    encode proteins is to search for open reading
    frames (ORFs)
  • An ORF is a DNA sequence that contains a
    contiguous set of codons, each of which specifies
    an amino acid
  • Six possible reading frames
  • Good for prokaryotic systems (no/little
    post-transcriptional modification)
  • Runs from Met (AUG) on mRNA -> stop codon TER
    (UAA, UAG, UGA)
  • http://www.ncbi.nih.gov/gorf/ (NCBI ORF Finder)
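
The ORF search described on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not how NCBI ORF Finder is implemented: the function names and the `min_codons` parameter are assumptions made here, the scan works on DNA (ATG, TAA/TAG/TGA) rather than mRNA, and each ORF starts at the first ATG seen after the previous stop.

```python
# Minimal six-frame ORF scan (illustrative sketch, not NCBI ORF Finder).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf) tuples for ATG..stop ORFs.

    start/end are 0-based offsets on the scanned strand; end is exclusive
    and includes the stop codon. min_codons counts the stop codon too.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the current candidate ATG, if any
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```

In practice `min_codons` would be set much higher (for example 100) so that only long ORFs, which are more likely to be coding, are reported.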

11
Annotation of eukaryotic genomes
[Figure: Genomic DNA -> (transcription) -> Unprocessed RNA
-> (RNA processing) -> Mature mRNA (Gm3 ... AAAAAAA)
-> (translation) -> Nascent polypeptide -> (folding)
-> Active enzyme -> Function (Reactant A -> Product B).
Side labels: ab initio gene prediction (w/o prior
knowledge); comparative gene prediction (use other
biological data); functional identification.]
12
Gene finding Issues
  • Issues regarding gene finding in general
  • Genome size
  • (larger genome, more genes, but ...)
  • Genome composition
  • Genome complexity
  • (more complexity -> lower coding density, fewer
    genes per kb)
  • cis-splicing (processing of mRNA in eukaryotes)
  • trans-splicing (in kinetoplastids)
  • alternative splicing (e.g. in different tissues
    of higher organisms)
  • Variation of the genetic code from the universal
    code

13
Gene finding: genome
  • Genome composition
  • Long ORFs tend to be coding
  • More putative ORFs are present in GC-rich
    genomes, because the stop codons (UAA, UAG, UGA)
    are AT-rich and occur less often by chance
  • Genome complexity
  • Simple repetitive sequences (e.g. dinucleotide
    repeats) and dispersed repeats tend to be
    non-coding
  • May need to mask sequence prior to gene prediction
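
The link between GC content and the number of chance ORFs can be made concrete with a back-of-the-envelope calculation. The sketch below assumes independent bases with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2, which real genomes only approximate; the function names are made up for this example.

```python
def stop_codon_probability(gc):
    """Chance that a random codon is TAA, TAG or TGA at GC fraction `gc`.

    Assumes independent bases with P(G) = P(C) = gc/2 and
    P(A) = P(T) = (1 - gc)/2.
    """
    at = (1.0 - gc) / 2.0   # probability of A (or of T)
    g = gc / 2.0            # probability of G (or of C)
    p_taa = at * at * at
    p_tag = at * at * g
    p_tga = at * g * at
    return p_taa + p_tag + p_tga


def expected_random_orf_codons(gc):
    """Mean spacing between chance stop codons, in codons (geometric mean)."""
    return 1.0 / stop_codon_probability(gc)


for gc in (0.3, 0.5, 0.7):
    print(gc, round(expected_random_orf_codons(gc), 1))
```

Under these assumptions a chance stop turns up roughly every 13 codons at 30% GC, every 21 codons at 50% GC and every 52 codons at 70% GC, so GC-rich genomes contain more long open frames that are not genes.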

14
Gene finding: coding density
  • As the coding/non-coding length ratio decreases,
    exon prediction becomes more complex

[Figure labels: Human, Fugu, worm, E. coli]
15
Gene finding: splicing
  • cis-splicing of genes
  • Finding multiple (short) exons is harder than
    finding a single (long) exon.
  • trans-splicing of genes
  • A trans-splice acceptor is no different to a
    normal splice acceptor

[Figure labels: worm, E. coli]
16
Gene finding: alternative splicing
  • Alternatively spliced isoforms are very difficult
    to predict.

[Figure labels: Human A, Human B, Human C]
17
ab initio prediction
  • What is ab initio gene prediction?
  • Prediction from first principles using the raw
    DNA sequence only.

Requires training sets of known gene structures
to generate statistical tests for the likelihood
of a prediction being real.
18
Gene finding: ab initio
  • What features of an ORF can we use?
  • Size - large open reading frames
  • DNA composition - codon usage / 3rd position
    codon bias
  • Kozak sequence CCGCCAUGG
  • Ribosome binding sites
  • Termination signal (stops)
  • Splice junction boundaries (acceptor/donor)
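
As a small illustration of the "DNA composition" feature, the sketch below counts codons in a candidate coding sequence and reports the fraction of codons whose third base is G or C (often called GC3). The function names are chosen for this example; a real predictor would compare such statistics against a training set from the same organism.

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in an in-frame coding sequence (trailing bases ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def gc3(cds):
    """Fraction of codons whose third base is G or C."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    gc_third = sum(n for codon, n in counts.items() if codon[2] in "GC")
    return gc_third / total


print(codon_counts("ATGGCCAAATAA"))
print(gc3("ATGGCCAAATAA"))
```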

19
Gene finding: features
  • Think of a CDS gene prediction as a linear
    series of sequence features

Initiation codon
Coding sequence (exon)
N times:
  Splice donor (5')
  Non-coding sequence (intron)
  Splice acceptor (3')
  Coding sequence (exon)
Termination codon
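
One way to make the "linear series of features" idea concrete is to encode each feature as a letter and check the order with a regular expression. The one-letter codes and function name below are invented for this sketch and do not come from any gene-finding package.

```python
import re

# One-letter codes for the features listed above (codes are made up here).
CODES = {
    "initiation_codon": "I",
    "exon": "E",
    "splice_donor": "D",
    "intron": "N",
    "splice_acceptor": "A",
    "termination_codon": "T",
}

# Initiation, exon, then N repeats of donor-intron-acceptor-exon, then stop.
GENE_MODEL = re.compile(r"IE(?:DNAE)*T")


def is_valid_gene_model(features):
    """True if the feature names follow the linear order of a CDS model."""
    try:
        encoded = "".join(CODES[name] for name in features)
    except KeyError:
        return False  # unknown feature name
    return GENE_MODEL.fullmatch(encoded) is not None


two_exon_gene = ["initiation_codon", "exon", "splice_donor", "intron",
                 "splice_acceptor", "exon", "termination_codon"]
print(is_valid_gene_model(two_exon_gene))
```

A single-exon gene (N = 0) is simply initiation codon, exon, termination codon, which the same pattern accepts.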
20
A model ab initio predictor
  • Locate and score all sequence features used in
    gene models
  • Use dynamic programming to build the
    highest-scoring model from the available features
  • e.g. Genefinder (Green)
  • Run a 5' -> 3' pass of the sequence through a
    Markov model based on a typical gene model
  • e.g. TBparse (Krogh), GENSCAN (Burge) or GLIMMER
    (Salzberg)
  • Run a 5' -> 3' pass of the sequence through a
    neural net trained with confirmed gene models
  • e.g. GRAIL (Oak Ridge)
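
The dynamic-programming step can be illustrated with a deliberately simplified problem: given scored candidate exons, pick the non-overlapping set with the highest total score. This is a sketch under strong assumptions (no reading-frame compatibility, no intron or splice-site scores) and is not how Genefinder itself works; the function name and input format are made up here.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Pick non-overlapping candidate exons with the highest total score.

    exons: list of (start, end, score) with end exclusive.
    Returns (total_score, chosen_exons_in_order).
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i] = (score, chain) using only the first i exons (sorted by end)
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # Number of earlier exons that end at or before this exon's start.
        j = bisect_right(ends, start, 0, i)
        with_this = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(with_this if with_this[0] > skip[0] else skip)
    return best[-1]


candidates = [(0, 100, 5.0), (50, 150, 8.0), (120, 200, 4.0), (160, 220, 3.0)]
print(best_exon_chain(candidates))
```

Each candidate is either added to the best chain that ends before it starts or skipped, so the optimum is found without enumerating every combination of exons.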

21
Ab initio Gene finding programs
  • Most gene finding software packages use some
    variant of Hidden Markov Models (HMM).
  • Predict coding, intergenic, and intron sequences
  • Need to be trained on a specific organism.
  • Never perfect!

22
What is an HMM?
  • A statistical model that represents a gene.
  • Similar to a weight matrix that can recognise
    gaps and treat them in a systematic way.
  • Has different states that represent introns,
    exons, and intergenic regions.
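
To show how states and probabilities combine, here is a toy two-state HMM (coding vs non-coding) decoded with the Viterbi algorithm. All probabilities are invented for illustration and are not trained values; real gene finders use many more states (introns, exon phases, intergenic regions) and organism-specific parameters.

```python
import math

STATES = ("coding", "noncoding")

# All probabilities below are invented for illustration, not trained values.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for an upper-case DNA string."""
    if not seq:
        return []
    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best previous state for s at position t + 1
    for base in seq[1:]:
        pointers, new_score = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("AAAAAAGGGGGG"))
```

With these toy parameters the AT-rich half of the example sequence is labelled non-coding and the GC-rich half coding, because a single state switch costs less than emitting the "wrong" bases six times.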

23
Malaria Gene Prediction Tool
  • Hexamer ftp://ftp.sanger.ac.uk/pub/pathogens/software/hexamer/
  • Genefinder email colin_at_u.washington.edu
  • GlimmerM http://www.tigr.org/softlab/glimmerm
  • Phat http://www.stat.berkeley.edu/users/scawley/Phat
  • Already trained for malaria! The more
    experimentally derived genes are used to train a
    gene prediction tool, the more reliable the
    predictor.

24
GlimmerM: Salzberg et al. (1999) Genomics 59: 24-31
  • Adaptation of the prokaryotic gene finder Glimmer.
  • Delcher et al. (1999) NAR 27: 4636-4641
  • Based on an interpolated HMM (IHMM).
  • Uses only short chains of bases (Markov chains)
    to generate probabilities.
  • Trained identically to Phat

25
An end to ab initio prediction
  • ab initio gene prediction is inaccurate
  • High false positive rates for most predictors
  • Exon prediction sensitivity can be good
  • Rarely used as a final product
  • Human annotation runs multiple algorithms and
    scores exons predicted by more than one predictor.
  • Used as a starting point for
    refinement/verification
  • Predictions need correction and validation
  • Why not just build gene models by comparative
    means?

26
PAUSE(continue)
27
Annotation of eukaryotic genomes
[Figure: Genomic DNA -> (transcription) -> Unprocessed RNA
-> (RNA processing) -> Mature mRNA (Gm3 ... AAAAAAA)
-> (translation) -> Nascent polypeptide -> (folding)
-> Active enzyme -> Function (Reactant A -> Product B).
Side labels: ab initio gene prediction (w/o prior
knowledge); comparative gene prediction (use other
biological data); functional identification.]
28
If a cell was human?
  • The cell knows how to splice a gene together.
  • We know some of these signals but not all and
    not all of the time
  • So compare with known examples from the species
    and others

Central dogma of molecular biology:
Genome -> Transcriptome -> Proteome
29
When a human looks at a cell
  • Compare with the rest of the
    genome/transcriptome/proteome data

30
Comparative gene prediction
  • Use knowledge of known coding sequences to
    identify coding regions of genomic DNA by similarity
  • transcriptome - transcribed DNA sequence
  • proteome - peptide sequence
  • genome - related genomic sequence

31
Transcript-based prediction datasets
  • Generation of large numbers of Expressed
    Sequence Tags (ESTs)
  • Quick, cheap but random
  • Subtractive hybridisation to find rare
    transcripts
  • Use multiple libraries for different
    life-stages/conditions
  • Single-pass sequence prone to errors
  • Generation of small number of full length cDNA
    sequences
  • Slow and laborious but focused
  • Large-scale sequencing of (presumed) full length
    cDNAs
  • Systematic, multiplexed cloning/sequencing of
    CDS
  • Expensive and only viable if part of bigger
    project

32
Gene Prediction in Eukaryotes Simplified
  • For highly conserved proteins
  • Translate DNA sequence in all 6 reading frames
  • Use BLASTX or FASTX to compare the sequence to a
    protein sequence database
  • Or
  • Compare a protein against a nucleic acid
    database, including genomic sequence translated
    in all six possible reading frames, with the
    TBLASTN or TFASTX/TFASTY programs.
  • Note: this gives an approximation of the gene
    structure only.
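
The six-frame translation that BLASTX and TBLASTN perform internally can be sketched as follows. The sketch assumes the standard (universal) genetic code and upper-case A/C/G/T input; organisms with a variant code would need a different table. The function names are chosen for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq):
    """Translate an upper-case DNA string; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return {(strand, frame): protein} for all six reading frames."""
    seq = seq.upper()
    rev = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return {(strand, frame): translate(s[frame:])
            for strand, s in (("+", seq), ("-", rev))
            for frame in range(3)}


for key, protein in six_frame_translation("ATGGCCTAA").items():
    print(key, protein)
```

Each of the six translations would then be compared to a protein database; a stop codon appears as `*` in the output.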

33
Transcript-based prediction: How it works
  • Align transcript data to genomic sequence using
    a pair-wise sequence comparison

[Figure labels: Gene Model, EST, cDNA]
34
Transcript-based gene prediction algorithm
  • BLAST (Altschul) (36 hours)
  • Widely used and understood
  • HSPs often have ragged ends, so alignments can
    extend into the introns
  • EST_GENOME (Mott) (3 days)
  • Dynamic programming post-process of BLAST
  • Slow and sometimes cryptic
  • BLAT (Kent) (1/2 hour)
  • Next generation of alignment algorithm
  • Designed for looking at nearly identical sequences
  • Faster and more accurate than BLAST

35
Peptide-based gene prediction algorithm
  • BLAST (Altschul)
  • Widely used and understood
  • Smith-Waterman
  • Preliminary to further processing
  • Used in preference to DNA-based similarities for
    evolutionarily diverged species, as peptide
    conservation is significantly higher than
    nucleotide conservation

36
Genomic-based gene prediction algorithm
  • BLAST (Altschul)
  • Can be used in TBLASTX mode
  • BLAT (Kent)
  • Can be used in a translated DNA vs translated
    DNA mode
  • Significantly faster than BLAST
  • WABA (Kent)
  • Designed to allow for 3rd position codon wobble
  • Slow with some outstanding problems
  • Only really used in the C. elegans vs C. briggsae
    analysis

37
Comparative gene predictors
  • This can be viewed as an extension of the ab
    initio prediction tools where coding exons are
    defined by similarities and not codon bias
  • GAZE (Howe) is an extension of Phil Green's
    Genefinder in which transcript data is used to
    define coding exons. Other features are scored as
    in the original Genefinder implementation. This
    is being evaluated and used in the C. elegans
    project.
  • GENEWISE (Birney) is an HMM-based gene predictor
    which attempts to predict the closest CDS to a
    supplied peptide sequence. This is the workhorse
    predictor for the ENSEMBL project.

38
Comparative gene predictors
  • A new generation of comparative gene prediction
    tools is being developed to utilise the large
    amount of genomic sequence available.
  • Twinscan (WashU) attempts to predict genes using
    related genomic sequences.
  • Doublescan (Sanger) is an HMM-based gene
    predictor which attempts to predict 2 orthologous
    CDSs from genomic regions pre-defined as
    matching.
  • Both of these predictors are in development and
    will be used for the C. elegans vs C. briggsae
    match and the Mouse vs Human match later this year.

39
Summary
  • Genes are complex structures that are difficult
    to predict with the required level of
    accuracy/confidence
  • We can predict stops better than starts
  • We can only give gross confidence levels to
    predictions (i.e. confirmed, partially confirmed
    or predicted)
  • Gene prediction is only part of the annotation
    procedure
  • Movement from ab initio to comparative
    methodology as sequence data becomes
    available/affordable
  • Curation of gene models is an active process:
    the set of gene models for a genome is fluid and
    WILL change over time.

40
The Annotation Process
DNA SEQUENCE
Useful Information
ANALYSIS SOFTWARE
Annotator
41
Annotation Process
42
Artemis
  • Artemis is a free DNA sequence viewer and
    annotation tool that allows visualization of
    sequence features and the results of analyses
    within the context of the sequence, and its
    six-frame translation.
  • http://www.sanger.ac.uk/Software/Artemis/

43
atcttttacttttttcatcatctatacaaaaaatcatagaatattcatca
tgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt
tttaattaattcacattttatatctttaagtataatatcatttaacatt
atgttatcttcctcagtgtttttcattattatttgcatgtacagtttatc
a tttttatgtaccaaactatatcttatattaaatggatctctacttata
aagttaaaatctttttttaattttttcttttcacttccaattttatattc
cg cagtacatcgaattctaaaaaaaaaaataaataatatataatatata
ataaataatatataataaataatatataatatataataaataatatataa
tat ataatatataataaataatatataatatataatatataataaataa
tatataataaataatatataatatataatatataatactttggaaagatt
attt atatgaatatatacacctttaataggatacacacatcatatttat
atatatacatataaatattccataaatatttatacaacctcaaataaaat
aaaca tacatatatatatataaatatatacatatatgtatcattacgta
aaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggt
attagg agatatatttactgattcctcatttttataaatgttaaaatta
ttatccctagtccaaatatccacatttattaaattcacttgaatattgtt
ttttaaa ttgctagatatattaatttgagatttaaaattctgacctata
taaacctttcgagaatttataggtagacttaaacttatttcatttgataa
actaatat tatcatttatgtccttatcaaaatttattttctccatttca
gttattttaaacatattccaaatattgttattaaacaagggcggacttaa
acgaagtaa ttcaatcttaactccctccttcacttcactcattttatat
attccttaatttttactatgtttattaaattaacatatatataaacaaat
atgtcactaa taatatatatatatatatatatatatatatatattataa
atgttttactctattttcacatcttgtccttttttttttaaaaatcccaa
ttcttattcat taaataataatgtatttttttttttttttttttttttt
attaattattatgttactgttttattatatacactcttaatcatatatat
atatttatatat atatatatatatatatatatattattcccttttcatg
ttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatattt
ttataacatatgt attattaaaatgtatatataaaaatatatattccat
ttattattatttttttatatacattgttataagagtatcttctcccttct
ggtttatattacta ccatttcactttgaacttttcataaaaattaatag
aatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaata tatatatatatatatatacatataatatatattt
catctaatcatttaaaattattattatatattttttaaaaaatatattta
tgataacataaaaaga atttaattttaattaaatatatataattacata
catctaatattattatatatatataataagttttccaaatagaatactta
tatattatatatatata tatatatatatatattcttccataaaaagaat
aaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacatt
gaatatatagttgtattt ataaaattaaagaaaaagcataaagttacca
tttaatagtggagattagtaacattttcttcattatcaaaaatatttatt
tcctaattttttttttttg taaaatatatttaaaaatgtaatagattat
gtattaaataatataaatatagcaaaatgttcaattttagaaatttgcct
ctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataa
agtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatg
acatgttataatataatataa taaataaaaattatgtaatatatcataa
tcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaat
acatatataactaacattcata tctttatttttgtagatgatataaaaa
attttataaactcttatgaagggatatatttttcatcatccaataaattt
ataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaa
atgttattacaataaatacagatctgtatgtagttgatttcctttttaat
gagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagat
ttatatttttatttatttattatatattattttttaatttttcttttata
tatttattttatttagtgtataaaa tgatatcctttatatttatattta
catgggatattcaaataataacaaaaatgagtatacacatatatatatat
atatatatatatatgtatattttttt tttttttttatgttcctatagga
aagggaagaattcactgatttgtagtgtttacaatattagggaatgcaac
tttacacttttgaaaaaaattcagtta agcaaaaatattaataacatta
aaaagacactgatagcaaaatgtaatgaatatataataacattagaaaat
aagaaaattactttttatttcttaaata aagattatagtataaatcaaa
gtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtca
aaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatatacc
aattagatattaaaaattcccatattagttatacacttattgatagtttc
aatttaaatttatcctacctcagagaatct ataaataataaaaaaaagc
atataaataaaataaatgatgtatcaaataatgacccaaaaaaggataat
aatgaaaaaaatacttcatctaataatataa cacataacaattataatg
acatatcaaataataataataataataataatattaatggggtgaaagac
catataaataataacactctggaaaataatga tgaaccaatcttatcta
tatataatgaagatcttaatgttttatatatatgccaaaatatgtataac
gtcctttttgttttgaatttaaataacctaagt
44
DNA in Artemis
[Figure labels: GC content; black bar = stop codon;
forward translations; reverse translations; DNA and
amino acids]
45
Extra Slides
46
Gene prediction
  • What is gene prediction?
  • Why is gene prediction important?
  • Ab initio gene prediction (w/o prior knowledge)
  • Comparative gene prediction (use other
    biological data)
  • Summary

47
Genome annotation is central to functional
genomics
ORFeome based functional genomics
Gene Knockout
RNAi phenotypes
Expression Microarray
48
Gene finding
  • Artemis genome viewer
  • Coding sequence vs non coding sequence
  • Gene finding software
  • Homology between species
  • ESTs

50
Pretty Handy Annotation Tool (PHAT)
  • Based on a generalised hidden Markov model (GHMM)
  • Free, easily installed and run.
  • Is good at predicting multi-exon genes but will in
    some cases miss genes altogether and will
    over-predict.
  • Cawley et al. (2001) Mol. Bio. Para. 118, p167
  • http://www.stat.berkeley.edu/users/scawley/Phat/

51
Phat
http://linkage.rockefeller.edu/wli/gene/krogh98.pdf
52
GlimmerM
  • Under-predicts splicing
  • Hardly ever misses a gene completely.
  • Does over-predict.
  • Free with TIGR license

53
Comparison Of Gene Finders