Gene Structure and Identification - PowerPoint PPT Presentation

Provided by: csta2
Title: Gene Structure and Identification


1
Gene Structure and Identification
  • Eukaryotic Genes and Genomes
  • Gene Finding

Assigned reading: Ch 5. Previous reading: Ch 1, Ch 3, Ch 11, Ch 12.
BIO520 Bioinformatics, Jim Lund
2
Complex Genome DNA
  • 10% highly repetitive (300 Mbp)
      • NOT GENES
  • 25% moderately repetitive (750 Mbp)
      • Some genes
  • 25% exons and introns (800 Mbp)
  • 40%?
      • Regulatory regions
      • Intergenic regions

3
Eukaryotic Gene Expression
  • Gene: Enhancer, Promoter, Transcribed Region, Terminator
  • Transcription (RNA Polymerase II)
  • Primary transcript: 5' Exon1, Intron1, Exon2 3'
  • Cap, Splice, Cleave/Polyadenylate
  • Processed mRNA: 7mG ... An
  • Transport
  • Translation
  • Polypeptide: N ... C
4
Yeast
  • ORFs = genes!

Small ORFs (RNA genes); regulatory sequences
5
Eukaryotes, cont'd
  • Large eukaryotes
      • introns common, LONGER than exons
      • promoter/enhancer
      • genome sparse
  • Fungi
      • introns common, short relative to exons
      • promoter/enhancer
      • genome dense

6
Intron Prevalence
(Chart axes: % of genes; introns)
7
Intron Size
(Chart axes: % of genes; introns)
8
Exon Size
(Chart axes: % of genes; exon size in bp)
9
Fungi
  • Sew together exons
  • ORF regions
  • consensus sequences
  • domain/polypeptide matches

10
Exon/Intron Structure
Genomic: CCACATT gt n(30-10,000) a n(5-20) ag CAGAA
Spliced: ...CCACATTCAGAA...
Protein: ...Pro-His-Ser-Glu...
11
Alternative Splice
Genomic: CCACATT gt n(30-10,000) a n(5-20) agcag AA
Spliced: ...CCACATTAA...
Protein: ...Pro-His-STOP
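The two splice outcomes on slides 10 and 11 can be reproduced with a short sketch. The intron body used here is a made-up filler (the slides only give its length range), and the function names are illustrative rather than taken from any library.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def splice(pre_mrna, intron_start, intron_end):
    """Remove pre_mrna[intron_start:intron_end] and join the flanking exons."""
    return pre_mrna[:intron_start] + pre_mrna[intron_end:]

def translate(cds):
    """Translate codon by codon, stopping after the first stop codon ('*')."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

# Toy pre-mRNA. The intron filler is hypothetical, chosen only to fit gt...a...ag.
exon1 = "CCACATT"
intron = "gt" + "c" * 30 + "a" + "t" * 7 + "ag"
exon2 = "CAGAA"
pre = exon1 + intron + exon2

start = len(exon1)
# Usual acceptor: the intron ends at its own "ag".
normal = splice(pre, start, start + len(intron))
# Alternative acceptor: the downstream "cAG" is used, so 3 more bases are removed.
alternative = splice(pre, start, start + len(intron) + 3)

print(normal, translate(normal))            # CCACATTCAGAA PHSE
print(alternative, translate(alternative))  # CCACATTAA PH*
```

Shifting the acceptor by three bases keeps the reading frame here, yet still creates a stop codon at the new junction. This is why a gene finder has to evaluate candidate splice sites together with the ORF they produce.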
12
Gene prediction targets
  • Internal exons (acceptor to donor)
  • Initial exons (5' end to donor)
  • Terminal exons (acceptor to 3' end)
  • Single exon genes (5' end to 3' end)

13
Gene prediction
  • Sequence based
  • Consensus sites
  • Signal sequences
  • Homology
  • Confirm prediction is a protein
  • Known coding sequences
  • cDNAs, SAGE
  • Comparative analysis
  • Identify exons, promoter/enhancer elements

14
Codon Bias/Nucleotide Frequency-useful?
  • High bias = high confidence
  • Low bias = low confidence
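One way to see why codon bias helps is to score each reading frame against a codon usage table. The sketch below is a toy under stated assumptions: the "training gene", the pseudo-frequency, and the uniform 1/64 background are all hypothetical choices, not values from the course.

```python
import math
from collections import Counter

def codon_counts(seq, frame=0):
    """Count non-overlapping codons of seq read in the given frame (0, 1 or 2)."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))

def codon_log_odds(seq, usage, frame=0, pseudo=1e-3):
    """Sum of log2(usage[codon] / (1/64)) over the codons of one frame.

    usage maps codon -> frequency in known genes. Codons absent from the
    table get a small pseudo-frequency so the logarithm stays defined.
    """
    score = 0.0
    for codon, n in codon_counts(seq, frame).items():
        score += n * math.log2(usage.get(codon, pseudo) * 64)
    return score

# Hypothetical, strongly biased usage table learned from a toy training gene.
training = "GCTGAAGCTGAAAAAGCTGAA"
counts = codon_counts(training)
total = sum(counts.values())
usage = {codon: n / total for codon, n in counts.items()}

query = "GCTGAAAAAGCT"
scores = [codon_log_odds(query, usage, frame) for frame in range(3)]
best_frame = max(range(3), key=lambda f: scores[f])
print(scores, best_frame)
```

With a strongly biased table the correct frame stands out clearly. With a nearly uniform table all three frames score close to zero, which matches the "low bias, low confidence" point.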

15
Codon Bias/Nucleotide Frequency-useful?
  • High bias = high confidence
  • Low bias = low confidence

16
Finding Functional Sequences
  • Known Consensus Sequences
  • Consensus Sequence Generation
  • Functional Tests

17
Consensus Inference
  • Position Weight Matrices
  • Sequence Logos
  • Hidden Markov Models
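As a rough illustration of the first two items, the sketch below builds a log-odds position weight matrix from aligned sites and computes the per-column information content that sets letter heights in a sequence logo. The example sites, the pseudocount, and the uniform background of 0.25 are assumptions made for the demonstration.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2 odds} dict per column of the aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm

def score(pwm, window):
    """Log-odds score of one window of the same width as the matrix."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Return (score, position) of the best-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))

def information_content(sites):
    """Bits per column (2 minus entropy), with no small-sample correction."""
    bits = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        entropy = 0.0
        for base in "ACGT":
            p = column.count(base) / len(column)
            if p > 0:
                entropy -= p * math.log2(p)
        bits.append(2.0 - entropy)
    return bits

# Hypothetical aligned TATA-like sites, for illustration only.
sites = ["TATAAA", "TATAAT", "TATATA", "TATAAA"]
pwm = build_pwm(sites)
hit_score, hit_pos = best_hit(pwm, "GGCGCTATAAAGGCGC")
print(hit_pos, round(hit_score, 2))
print([round(b, 2) for b in information_content(sites)])
```

Fully conserved columns reach 2 bits, while variable columns contribute less. The same matrix can be slid along a genomic sequence to rank candidate sites.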

18
Translation Initiation Sites
19
Splicing Consensus
Vertebrates: A64 G73 GT A62 A68 G84 T63 ... Y80 N Y80 Y87 R75 A Y95 ... C65 AG NN
(numbers give the % frequency of the base at that position)
Fungi: GTRNGT (N)30-1000 CTRAC (N)5-15 YAG
Alternate Splicing!??
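The fungal consensus on this slide translates fairly directly into a regular expression. This is a minimal sketch: the IUPAC table covers only the codes used on the slide, the test sequence is invented, and lazy quantifiers are one possible choice (they report the shortest intron from a given donor site).

```python
import re

# IUPAC codes used in the slide's consensus: R = purine, Y = pyrimidine, N = any.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def iupac_to_regex(pattern):
    """Expand an IUPAC consensus string into a regular-expression fragment."""
    return "".join(IUPAC[ch] for ch in pattern)

# GTRNGT (N)30-1000 CTRAC (N)5-15 YAG, with lazy spacers.
FUNGAL_INTRON = re.compile(
    iupac_to_regex("GTRNGT")
    + "[ACGT]{30,1000}?"
    + iupac_to_regex("CTRAC")
    + "[ACGT]{5,15}?"
    + iupac_to_regex("YAG")
)

# Invented test sequence: 6 nt of exon, a 62-nt intron, 6 nt of exon.
intron = "GTAAGT" + "C" * 40 + "CTAAC" + "T" * 8 + "TAG"
genomic = "ATGGCC" + intron + "GGATCC"

match = FUNGAL_INTRON.search(genomic)
print(match.start(), match.end())  # 6 68
```

A consensus match like this only nominates candidates. Real introns deviate from the consensus, and random sequence matches it too, so hits still need support from ORF or homology evidence.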
20
Linguistic Approach
  • Non-repetitive DNA!!
  • Long ORF
  • similar to known protein
  • ORF extended by reasonable splices
  • ORF begins with good ATG
  • Promoter/terminator flanks
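Two of these criteria, a long ORF that begins with an ATG, are simple to sketch. The function below scans only the three forward frames and ignores splicing, so it is a simplified illustration. The name `find_orfs` and the `min_codons` threshold are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the index of the A of the ATG, end is one past the stop codon,
    and min_codons counts the codons before the stop (ATG included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs

print(find_orfs("CCATGAAACCCGGGTAACC"))  # [(2, 17, 2)]
```

In practice the same scan would be run on the reverse complement as well, and the length threshold would be far higher than the toy value used here.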

21
DATABASE SEARCH
  • BLASTN
      • DNA-DNA comparison (ALWAYS!)
      • Not sensitive (DNA conservation low)
  • BLASTX/TBLASTX
      • BLASTX: 6-frame ORFs vs. polypeptide database
      • TBLASTX: 6 frames vs. 6 frames of a DNA database

www.ncbi.nlm.nih.gov
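The "6 frames" that BLASTX and TBLASTX compare are the three forward and three reverse-complement translations of the query. A minimal sketch of that step follows. It uses the standard genetic code, and the frame labels (+1 to -3) are one possible naming scheme, not BLAST's internal one.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate every complete codon; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Return {'+1': ..., '+2': ..., '+3': ..., '-1': ..., '-2': ..., '-3': ...}."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames

for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Because proteins are conserved much better than DNA, comparing these translations against a protein database finds distant homologs that BLASTN misses.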
22
Protein Database Matches
  • Very helpful for the known
  • What about the unknown???

23
Transcript Initiation
  • Basal Promoters
  • Enhancers/Silencers/Regulatory Sites
  • Boundary elements?
  • Transcription Initiation

Prokaryotes vs. eukaryotes; organism-to-organism
24
Basal Promoter Analysis
Myers and Maniatis, Genes VI, 831
  • TATA-box: -25 to -30, TBP
  • CCAAT-box: -212 to -57, CTF/NF1
  • GC-box: -164 to +1, SP1
  • K C W K Y Y Y Y: +1 to +5, cap signal
25
Basal Promoter Analysis
Cao and Moi, Ped Res 51415-421 (2002)
26
mRNA processing
  • Exon/Intron
  • Alternate splicing
  • Polyadenylation/Cleavage
  • Stability

27
PolyA sites
  • Metazoans
      • AATAAA, ATTAAA
      • 15-20 bp 5' of the poly(A) addition site.
      • YGTGTTYY (diffuse GT-rich sequence)
      • 3' UTR of 100-700 bp typical.
  • Yeast: different
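The metazoan rule can be checked with a small search. The sketch below looks for either hexamer and keeps those at the stated distance from a candidate poly(A) addition site. It measures distance from the end of the hexamer to the site, which is an assumption (the slide does not say where the distance is measured from), and the test sequence is invented.

```python
import re

# Lookahead so that overlapping hexamers would all be reported.
POLYA_SIGNAL = re.compile(r"(?=(AATAAA|ATTAAA))")

def polya_signals(seq, site, min_dist=15, max_dist=20):
    """Return (position, hexamer, distance) for signals 5' of `site`.

    distance = number of bases between the end of the hexamer and the
    poly(A) addition site (an assumed convention).
    """
    hits = []
    for m in POLYA_SIGNAL.finditer(seq.upper()):
        distance = site - (m.start() + 6)
        if min_dist <= distance <= max_dist:
            hits.append((m.start(), m.group(1), distance))
    return hits

# Invented 3' end: signal at position 10, addition site 17 bases after it.
utr = "G" * 10 + "AATAAA" + "C" * 17 + "A" * 10
site = 10 + 6 + 17
print(polya_signals(utr, site))  # [(10, 'AATAAA', 17)]
```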

28
Translation
  • Initiation site
      • 1st AUG used 95% of the time.
  • Translational regulatory elements
      • translational enhancers
      • upstream ORFs

29
Tools-WWW
  • Genscan
  • Genie
  • GRAIL II integrated gene parsing
  • GenLang
  • HMMGene (lock ESTs, etc.)
  • GENEMARK

30
Hidden Markov Models
  • Probabilistic Models
  • Applicable to linear sequences
  • P(all states) = 1; infer probabilities of all
    states from observed (hidden states unobserved)
  • Work best when local correlations unimportant
  • Genefinding, phylogeny, secondary structure,
    genetic mapping
  • Parameters are set using a Training Set of
    gene annotations
  • Quantitative probabilities
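A two-state toy model shows how the hidden states are inferred from the observed bases. The sketch below runs the Viterbi algorithm in log space. All probabilities are invented for illustration (a GC-rich "coding" state and an AT-rich "noncoding" state) and would normally be estimated from a training set.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical toy parameters. Each row sums to 1.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}

def viterbi(seq):
    """Return the most probable state path for seq."""
    # v[t][s] = log probability of the best path that ends in state s at base t.
    v = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            scores[s] = (v[-1][prev] + math.log(TRANS[prev][s])
                         + math.log(EMIT[s][base]))
            pointers[s] = prev
        v.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi("ATATATATGCGCGCGC")
print("".join("C" if s == "coding" else "n" for s in path))  # nnnnnnnnCCCCCCCC
```

Gene finders such as Genscan and HMMgene use far richer state sets (exons, introns, splice sites, frames), but the decoding idea is the same.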

31
Accuracy Assessment
PP = predicted coding; PN = predicted non-coding
AP = real positives; AN = real negatives
TP = number of correct positives; TN = number of correct negatives
FP = number of false positives; FN = number of false negatives
$Sn = TP/AP$, $Sp = TP/PP$
$AC = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TP}{TP+FP} + \frac{TN}{TN+FP} + \frac{TN}{TN+FN}\right) - 1$
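These measures are easy to compute at the base level once every position has a predicted and a real label. The sketch below assumes labels are given as strings of 'C' (coding) and 'N' (non-coding), which is an illustrative encoding. It does not guard against empty classes, where a denominator would be zero.

```python
def accuracy(predicted, actual):
    """Return (Sn, Sp, AC) from per-base labels: 'C' coding, 'N' non-coding."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == "C" and a == "C" for p, a in pairs)
    tn = sum(p == "N" and a == "N" for p, a in pairs)
    fp = sum(p == "C" and a == "N" for p, a in pairs)
    fn = sum(p == "N" and a == "C" for p, a in pairs)

    sn = tp / (tp + fn)   # TP / AP
    sp = tp / (tp + fp)   # TP / PP
    ac = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                + tn / (tn + fp) + tn / (tn + fn)) - 1
    return sn, sp, ac

actual    = "NNCCCCNNNN"
predicted = "NCCCCNNNNN"   # exon predicted one base too far upstream
sn, sp, ac = accuracy(predicted, actual)
print(sn, sp, round(ac, 4))  # 0.75 0.75 0.5833
```

Note that Sp as defined on the slide is TP/PP, the fraction of predicted coding bases that are real, rather than the TN-based specificity used in other fields.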
32
Accuracy Levels
(Table columns: Bp level, Exon level)
33
NEXT
  • Regulatory Sequences
  • Known Consensus Sequences
  • Consensus Sequence Generation
  • Functional (Lab) Data
  • Real examples

34
Gene Regulatory Sequences
  • Functional sites
  • Consensus
  • Experimental tests
  • Inferred sites
  • Transcriptome analysis

35
Regulatory Sites
  • Transcript initiation
  • mRNA processing
  • Translation sites

36
Regulatory Factors
  • lacI, trpR, CAP, araC.
  • GAL4, NDT80

Known from experiment. Infer from genome? Infer from expression data?
37
EUKARYOTES
  • More complex signals
  • More genes
  • More dispersed signals
  • Combinatoric regulation common

38
Enhancer Elements
  • DNA element: Protein
  • Octamer: OCT1, OCT2
  • κB: NF-κB
  • ATF: ATF
  • AP1: AP1
  • ..

39
Consensus Sequence Databases
  • WWW-based
  • TFD (transcription factor database)
  • BCM Search launcher

40
Transcriptome Analyses
  • Microarray transcription analysis
  • Expression
  • Transcription factor binding
  • MEME analysis of clusters

More later....
41
Practical Gene Finding
  • Use ALL tools
  • Comparative
  • BLASTN, BLASTX
  • Predictive: stitch together a consensus
  • HMM, GRAIL
  • ORF finders
  • Findpatterns (and WWW pattern searches)
  • cDNA OR protein OR genetic evidence

42
FRAMES-aldolase gene
43
If aldolase is so tough, how do you really do it?
  • Combine DNA sequence
  • with other data!

44
Genome-cDNA
(Diagram labels: P; DNA sequencing; Align (GAP); cDNA)
45
Comparative Genomics
  • Conservation of coding regions
  • Identification of transcription signals
  • words in common
  • Example: yeast comparisons
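The "words in common" idea can be sketched as a shared k-mer search between two sequences, for example upstream regions of orthologous genes. The sequences and the word length k = 6 below are invented for illustration. Real comparisons would also need to account for words shared by chance.

```python
def kmers(seq, k):
    """Return the set of all overlapping words of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_words(seq_a, seq_b, k=6):
    """Return the sorted words of length k present in both sequences."""
    return sorted(kmers(seq_a, k) & kmers(seq_b, k))

# Invented upstream regions that share a TATA-like element.
region_a = "GGCTATAAAGGC"
region_b = "TTCCTATAAATT"
print(shared_words(region_a, region_b))  # ['CTATAA', 'TATAAA']
```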