Gene Structure and Identification - PowerPoint PPT Presentation

Provided by: chucks96
1
Gene Structure and Identification
  • Genes and Genomes
  • ORFs and more
  • Consensus Sequences
  • Gene Finding

Reading sections 1.3, 9.1-9.6
BIO520 Bioinformatics Jim Lund
2
Gene
  • The functional and physical unit of heredity
    passed from parent to offspring. Genes are pieces
    of DNA, and most genes contain the information
    for making a specific protein.

3
Gene-Informatics
  • Genes are character strings embedded in much
    larger strings called the genome. A gene usually
    encodes a protein. Genes are composed of ordered
    elements associated with the fundamental genetic
    processes including transcription, splicing, and
    translation.

4
ACGT to Gene
  • Cells recognize genes from DNA sequence.

5
Genes
  • Protein Coding
  • RNA genes
  • rRNA
  • tRNA
  • siRNA, miRNA, snRNA, snoRNA

6
Genomes
  • Genome seq. has only limited use by itself
  • Markers, SNPs, etc.
  • Functional annotation
  • Identify proteins and their functions.
  • And regulatory regions, etc.
  • Parts list: a source for understanding all
    biology, ushering in the post-genomic age of
    biology.

7
Genomes
2002: Mus musculus, 2,500,000,000 bp
8
Characteristics of Protein Coding Genes
  • ORF
  • long (usually >100 aa)
  • similar to known proteins: likely a gene
  • Basal signals
  • Transcription, splicing, translation
  • Regulatory signals
  • Depend on organism
  • Prokaryotes vs Eukaryotes
  • Vertebrate vs. fungi, e.g.

9
Infer Gene Structure: Gene Model
  • Promoter
  • Strength
  • Regulation
  • mRNA
  • Exons
  • Splicing
  • Stability
  • ORF -> protein

10
Genomes: Gene Content
E. coli: 4000 genes × 1 kbp/gene = 4 Mbp; genome = 4 Mbp!
Gene-rich
11
Genomes: Gene Content
Human: 26,755 genes × 2 kbp ≈ 54 Mbp mRNA
Introns: 300 Mbp? Regulatory regions: 300 Mbp?
2344 Mbp: ???
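The back-of-the-envelope arithmetic on the two "Gene Content" slides can be checked with a few lines of Python. This is only a sketch: the helper name `transcribed_mbp` is hypothetical, and the 3000 Mbp human genome size is an assumption (it is not stated on this slide, though "10% = 300 Mbp" on the next slide implies roughly that figure).

```python
def transcribed_mbp(n_genes: int, kbp_per_gene: float) -> float:
    """Total gene sequence in Mbp (1 Mbp = 1000 kbp)."""
    return n_genes * kbp_per_gene / 1000.0


# Gene counts and per-gene sizes are the slide figures.
# Third value is the genome size in Mbp; 3000 for human is an assumption.
genomes = {
    "E. coli": (4000, 1.0, 4.0),
    "Human": (26755, 2.0, 3000.0),
}

for name, (n_genes, kbp, genome_mbp) in genomes.items():
    mbp = transcribed_mbp(n_genes, kbp)
    print(f"{name}: {mbp:.1f} Mbp in genes, {mbp / genome_mbp:.1%} of genome")
```

Under these assumptions the E. coli genome is essentially all genes, while human mRNA-coding sequence is a small fraction of the genome.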
12
Complex Genome DNA
  • 10% highly repetitive (300 Mbp)
  • NOT GENES
  • 25% moderate repetitive (750 Mbp)
  • Some genes
  • 10% exons and introns (340 Mbp)
  • 55% ?
  • Regulatory regions
  • Intergenic regions

Hard!!
13
Easy problem: Bacterial Gene Finding
  • Dense Genomes
  • Short intergenic regions
  • Uninterrupted ORFs
  • Conserved signals
  • Abundant comparative information
  • Complete Genomes

14
E. coli genome
  • 4415 genes
  • Average distance between genes: 118 bp
  • 318 aa, average protein length
  • 57 proteins longer than 1000 aa.
  • 318 shorter than 100 aa.
  • 2584 operons, 70% contain one gene.
  • 1.5% repetitive DNA (mostly viral fragments).

15
Prokaryotic Gene Expression
[Diagram: a promoter, cistrons 1, 2, ..., N, and a terminator on the DNA. RNA polymerase transcribes them into a single mRNA (5' to 3'), and each cistron is translated into its own polypeptide (N- to C-terminus).]
16
Prokaryotic gene prediction
  • ORFs
  • Biased nucleotide distribution
  • Periodicity of 3
  • Codon bias (codon usage statistics)
  • Measured by, e.g., the Codon Adaptation Index (CAI).
  • Signal sequences
  • Homology
  • Other biological info: for E. coli, partial
    N-terminal protein sequences.

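The "periodicity of 3" signal arises because the three codon positions of a coding sequence tend to differ in base composition. One simple way to look at it is GC content per codon position; the sketch below assumes the input is already in frame, and the function name `gc_by_codon_position` is made up for illustration.

```python
def gc_by_codon_position(cds: str) -> list:
    """Return the GC fraction at codon positions 1, 2 and 3.

    cds is assumed to be an in-frame DNA sequence; trailing bases that
    do not complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    fractions = []
    for pos in range(3):
        bases = cds[pos:usable:3]  # every third base, starting at pos
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions


# Toy sequence (invented for illustration, not a real gene).
print(gc_by_codon_position("ATGGCCAAG"))
```

In a real coding region the three values usually differ from each other, whereas in non-coding DNA they tend to be similar.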
17
Prokaryotic signal sequences
  • Ribosome binding site (RBS)/Shine-Dalgarno
    element
  • 3-9 purines complementary to sequence at 3' end
    of the 16S rRNA in the small subunit of the
    ribosome.
  • Located 4-7 bp 5' of the AUG.
  • Promoter
  • -35 consensus site (TTGACA)
  • -10 consensus site (TATAAT)
  • Signal peptides
  • Regulatory protein binding sites (4 to 8 bp)

18
ORFs
P(ORF) = (61/64)^n
P(20) = (61/64)^20 = 0.38
P(100) = 0.008
P(200) ≈ 10^-4
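These values follow from treating each codon as an independent draw from 64 equally likely codons, 3 of which are stops, so a run of n codons stays open with probability (61/64)^n. A minimal Python check is below; uniform codon frequencies are a simplifying assumption, and `p_open` is a hypothetical name.

```python
def p_open(n_codons: int) -> float:
    """Probability that n_codons random codons contain no stop codon.

    Assumes all 64 codons are equally likely and independent, so each
    codon is a non-stop with probability 61/64.
    """
    return (61 / 64) ** n_codons


for n in (20, 100, 200):
    print(f"P({n}) = {p_open(n):.2g}")
```

The rapid decay is why long ORFs (usually >100 aa) are unlikely to occur by chance.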
19
ORF finding tools
  • VectorNTI
  • Analyze/ORF
  • TESTCODE (Fickett's)
  • CodonPreference
  • WWW tools
  • ORF Finder (NCBI)
  • BCM Search Launcher...

20
ORFs in E. coli
[Figure: ORF map of E. coli sequence in all six reading frames: 1, 2, 3, -1, -2, -3.]
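A six-frame ORF scan like the one pictured can be sketched in a few lines. This is an illustration only, not how any of the listed tools work internally: it assumes ATG as the only start codon and the three standard stop codons, and the function names are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq: str, offset: int, min_codons: int):
    """Yield (start, end) of ATG...stop ORFs in one reading frame.

    start is the index of the A in ATG; end is one past the stop codon.
    The codon count excludes the stop codon.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq: str, min_codons: int = 100) -> dict:
    """Return {frame: [(start, end), ...]} for frames 1, 2, 3, -1, -2, -3.

    Coordinates for the minus frames are relative to the reverse complement.
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    result = {}
    for offset in range(3):
        result[offset + 1] = list(orfs_in_frame(seq, offset, min_codons))
        result[-(offset + 1)] = list(orfs_in_frame(rc, offset, min_codons))
    return result


# Toy sequence (invented); a low threshold so the short ORF is reported.
print(six_frame_orfs("CCATGGCTGCTGCTGCTGCTTAAGG", min_codons=5))
```

The default threshold of 100 codons mirrors the "usually >100 aa" rule of thumb from the earlier slide.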
21
Codon Bias
  • Genetic code degenerate
  • Codon usage varies
  • Organism to organism
  • Gene to gene
  • High bias correlates with high level expression
  • Bias correlates with tRNA isoacceptors
  • Change bias or tRNAs, change expression

22
Codon Bias
Gly  GGG  6  0.21
Gly  GGA  6  0.17
Gly  GGT  6  0.38
Gly  GGC  6  0.24
23
Codon Bias: Gene Differences
Codon     GAL4   ADH1
Gly GGG   0.21   0
Gly GGA   0.17   0
Gly GGT   0.38   0.93
Gly GGC   0.24   0.07
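Fractions like these are relative frequencies among synonymous codons: count each glycine codon in a coding sequence and divide by the total number of glycine codons. The sketch below shows the calculation on an invented toy sequence; it does not reproduce the GAL4 or ADH1 numbers, and the function name is hypothetical.

```python
from collections import Counter

GLY_CODONS = ("GGG", "GGA", "GGT", "GGC")


def glycine_codon_fractions(cds: str) -> dict:
    """Fraction of each glycine codon among all glycine codons in cds.

    cds is assumed to be an in-frame coding sequence (DNA, 5' to 3').
    """
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    counts = Counter(codon for codon in codons if codon in GLY_CODONS)
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in GLY_CODONS}
    return {codon: counts[codon] / total for codon in GLY_CODONS}


# Toy sequence (made up for illustration, not a real gene).
toy = "ATGGGTGGTGGCGGTGCTTAA"
print(glycine_codon_fractions(toy))
```

Running the same count over a whole genome versus a single highly expressed gene is one way to see the gene-to-gene differences the table illustrates.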
24
Nucleotide Bias
  • Coding DNA vs non-Coding DNA
  • often GC content higher than bulk
  • Empirical statistics (Fickett's TESTCODE)
  • Useful
  • ORF matches typical organism bias
  • ORF obscured by STOP codons

DNA sequence errors?
25
We found ORFs: now what?
  • Work backwards
  • Locate adjacent cistrons
  • Locate RBS
  • Locate promoter
  • Locate terminator
  • Locate regulatory sites

26
Operon Structure
Promoter?
27
Translation: Ribosome Binding Site, Shine-Dalgarno Site
nnAGGAGGnnnnnATG  Consensus
Not always used; example E. coli gene: nnAaGAGGnnnnATG
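Because real genes deviate from the consensus, a search for the Shine-Dalgarno site has to tolerate mismatches and a variable spacing to the start codon. The sketch below is one simple way to do that: it looks for the best match to AGGAGG ending 4 to 7 bases upstream of a given ATG. The spacing range and the one-mismatch tolerance are illustrative assumptions, and `find_rbs` is a hypothetical name.

```python
SD_CONSENSUS = "AGGAGG"


def mismatches(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def find_rbs(seq: str, atg_pos: int, min_gap: int = 4, max_gap: int = 7,
             max_mismatches: int = 1):
    """Return (position, site, gap) of the best Shine-Dalgarno-like site
    upstream of the ATG at atg_pos, or None if nothing matches.

    gap is the number of bases between the end of the site and the ATG.
    """
    seq = seq.upper()
    best = None
    for gap in range(min_gap, max_gap + 1):
        start = atg_pos - gap - len(SD_CONSENSUS)
        if start < 0:
            continue
        site = seq[start:start + len(SD_CONSENSUS)]
        mm = mismatches(site, SD_CONSENSUS)
        if mm <= max_mismatches and (best is None or mm < best[0]):
            best = (mm, start, site, gap)
    return None if best is None else best[1:]


# The two patterns from the slide, with arbitrary bases filled in for "n".
print(find_rbs("TTAGGAGGCCCCCATG", atg_pos=13))
print(find_rbs("TTAAGAGGCCCCATG", atg_pos=12))
```

The second call corresponds to the non-consensus example, where one base of the site differs and the spacing is one base shorter.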
28
Bacterial Promoter
-35: T82 T84 G78 A65 C54 A45 (number = % occurrence of that base)
Spacer: 16-18 bp
-10: T80 A95 T45 A60 A50 T96
+1: (A,G)
Alternate sigma factors: CCCTTGAA.CCCGATNT
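Since no position in the consensus is conserved in every promoter, a naive exact-match search misses most real sites. A minimal sketch of a mismatch-tolerant scan for a -35 box (TTGACA), a 16-18 bp spacer and a -10 box (TATAAT) is shown below. The mismatch threshold is an arbitrary illustrative choice and the function name is hypothetical; real promoter finders generally weight positions by their frequencies rather than counting mismatches.

```python
BOX_35 = "TTGACA"
BOX_10 = "TATAAT"


def mismatches(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def find_promoter_like(seq: str, max_mismatches: int = 2) -> list:
    """Return (pos_35, pos_10, spacer, total_mismatches) for each candidate.

    pos_35 and pos_10 are the start indices of the two hexamers; spacer
    is the number of bases between them (16, 17 or 18).
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(BOX_35) + 1):
        mm35 = mismatches(seq[i:i + 6], BOX_35)
        for spacer in (16, 17, 18):
            j = i + 6 + spacer
            box10 = seq[j:j + 6]
            if len(box10) < 6:
                continue  # ran off the end of the sequence
            total = mm35 + mismatches(box10, BOX_10)
            if total <= max_mismatches:
                hits.append((i, j, spacer, total))
    return hits


# Toy sequence (invented): perfect boxes separated by a 17 bp spacer.
toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGGGGA"
print(find_promoter_like(toy))
```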
29
Terminators
  • Rho-independent
  • Stem/loop (structural only)
  • 3' U tail
  • Rho-dependent
  • C-rich
  • G-poor
  • loose consensus

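The stem/loop followed by a U tail can be searched for directly in DNA as an inverted repeat followed by a run of T. The sketch below is deliberately crude: it requires a perfect stem, and the stem length (6), loop range (3-8) and tail length (4) are illustrative assumptions rather than values from the slides. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_hairpin_t_tail(seq: str, stem: int = 6, min_loop: int = 3,
                        max_loop: int = 8, tail: int = 4) -> list:
    """Return (start, loop_length) for each perfect hairpin followed
    immediately by a run of T.

    start is the index of the first base of the left arm of the stem.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop - tail + 1):
        left = seq[i:i + stem]
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop            # start of the right arm
            right = seq[j:j + stem]
            after = seq[j + stem:j + stem + tail]
            if len(after) < tail:
                break                      # longer loops only run further off the end
            if right == reverse_complement(left) and after == "T" * tail:
                hits.append((i, loop))
    return hits


# Toy sequence (invented): GCCGCC / AAAA loop / GGCGGC, then TTTT.
print(find_hairpin_t_tail("AAGCCGCCAAAAGGCGGCTTTTA"))
```

A practical detector would also allow mismatches and bulges in the stem and score its stability.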
30
Difficulties in gene prediction
  • Frame shifts
  • sequencing errors
  • Overlapping ORFs
  • Rare (a few percent)
  • Short ORFs
  • Unusual genes
  • bp composition
  • signal sequences

31
Programs for prokaryotic gene prediction
  • Glimmer
  • ORPHEUS
  • GeneMark
  • 90% sensitivity and specificity
  • GENSCAN
  • Vector NTI (ORF analysis)