Biological Motivation: Gene Finding


1
Biological Motivation: Gene Finding
  • Anne R. Haake
  • Rhys Price Jones

2
Gene Finding
  • Why do it?
  • Find and annotate all the genes within the large
    volume of DNA sequence data
  • How many genes in an organism? Homologies?
  • Gain understanding of problems in basic science
  • e.g. gene regulation: what are the mechanisms
    involved in transcription, splicing, etc.?
  • Different emphasis in these goals has some effect
    on the design of computational approaches for
    gene finding.

3
Gene Finding by Biological Methods
  • Extract mRNA
  • Reverse transcribe to cDNA
  • Label cDNA
  • Detect by using the labeled cDNA as a probe
    against a DNA library
  • Gene found

4
Gene Finding by Computational Methods
  • Dependent on good experimental data to build
    reliable predictive models
  • Various aspects of gene structure/function
    provide information used in gene finding programs

5
Figure 12.3

6
The Informatics View of Genes
  • Genes are character strings embedded in much
    larger strings called the genome
  • Genes are composed of ordered elements associated
    with the fundamental genetic processes including
    transcription, splicing, and translation.

7
Gene Finding
  • Cells recognize genes from the DNA sequence
  • They find genes via their bioprocesses
  • Not so easy for us...

8

CTAGCAGGGACCCCAGCGCCC
GAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAA
CTTTTTTTCAGGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGC
CCACGAAAGAGGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAA
GCAGTTTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTG
GGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAAC
TGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGA
ATCTCAAGACTCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCT
AAGAAACTAAAAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTT
TTCCCCCTTCATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGT
GCCATGCCTGCATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTC
AGCCAACAAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTA
TTGTTATGAGACTGGATATAT...
9

G CTAGCAGGGACCCCAGCGCCCGAGAGACCAT
GCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCA
GGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGA
GGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGTTTTTA
AAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTGGGTATTGTAG
AATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAACTGAAGCTGAT
TGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGAATCTCAAGAC
TCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCTAAGAAACTAA
AAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTTTTCCCCCTTC
ATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTG
CATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAACAAA
AATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTATTGTTATGAG
ACTGGATATAT...
10
Types of Genes
  • Protein coding: most genes
  • RNA genes: rRNA, tRNA, snRNA (small nuclear RNA),
    snoRNA (small nucleolar RNA)

11
3 Major Categories of Information Used in Gene Finding Programs
  • Signals/features: a sequence pattern with
    functional significance, e.g. splice donor and
    acceptor sites, start and stop codons, promoter
    features such as TATA boxes, TF binding sites,
    CpG islands
  • Content/composition: statistical properties of
    coding vs. non-coding regions
  • e.g. codon bias, length of ORFs (in prokaryotes),
    GC content
  • Similarity: compare the DNA sequence to known
    sequences in a database
  • Not only known proteins but also ESTs and cDNAs
12
Looking for Protein Coding Genes
  • Look for ORF (begins with start codon, ends with
    stop codon, no internal stops!)
  • long (usually > 60-100 aa)
  • If homologous to known protein more likely
  • Look for basal signals
  • Transcription, splicing, translation
  • Look for regulatory signals
  • Depends on organism
  • Prokaryotes vs Eukaryotes
  • Vertebrate vs fungi
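
The ORF criteria above (start codon, stop codon, no internal stops, minimum length) can be sketched in Python. This is a minimal illustration, not a real gene finder: the function name `find_orfs`, the forward-strand-only scan, the single start codon ATG, and the default threshold of 60 codons are all assumptions made for the example.

```python
START_CODONS = {"ATG"}                 # assumption: only ATG is treated as a start
STOP_CODONS = {"TAA", "TAG", "TGA"}    # the three standard stop codons


def find_orfs(seq, min_codons=60):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the index of the start codon, end is the index just past
    the stop codon, and frame is 0, 1 or 2. The length counted against
    min_codons includes the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it at the first stop
    return orfs
```

Scanning the reverse complement in the same way would cover the other three reading frames.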

13
Easier problem: Gene Finding in Bacterial Genomes
  • Why?
  • Dense Genomes
  • Short intergenic regions
  • Uninterrupted ORFs
  • Conserved signals
  • Abundant comparative information
  • Complete Genomes available for many

14
What do Prokaryotic Genes look like?
[Diagram: a gene drawn 5' to 3', labeled with promoter region (maybe), ribosome binding site (maybe), start codon, open reading frame, stop codon, termination sequence (maybe)]
15
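Prokaryotic Gene Expression
[Diagram: DNA with a promoter, cistrons 1, 2, ..., N and a terminator is transcribed by RNA polymerase into one mRNA (5' to 3') carrying coding regions 1, 2, ..., N, with SD sites in the polycistronic message; translation yields separate polypeptides 1, 2, 3, each with an N and a C terminus]
Slide modified from http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt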
16
Open Reading Frame (ORF)
  • Any stretch of DNA that potentially encodes a
    protein
  • The identification of an ORF is the first
    indication that a segment of DNA may be part of a
    functional gene

17
Open Reading Frames
  • Each grouping of the nucleotides into consecutive
    triplets constitutes a reading frame. There are
    three different reading frames in the 5'->3'
    direction and a further three in the reverse
    direction on the opposite strand.
  • A sequence of triplets that contains no stop
    codon is an Open Reading Frame (ORF)

A C G T A A C T G A C T A G G T G A A T
CGT AAC TGA CTA GGT GAA
GTA ACT GAC TAG GTG AAT
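
As a hedged illustration of the six reading frames, the sketch below splits the example sequence from this slide into triplets for each frame. The helper names (`reverse_complement`, `codons`, `six_frames`) and the "+1" to "-3" frame labels are choices made for this example only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into complete triplets, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Map frame labels (+1..+3, -1..-3) to their lists of triplets."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = codons(seq, k)   # forward strand
        frames[f"-{k + 1}"] = codons(rc, k)    # opposite strand
    return frames


if __name__ == "__main__":
    for name, triplets in six_frames("ACGTAACTGACTAGGTGAAT").items():
        print(name, " ".join(triplets))
```

Frames +2 and +3 reproduce the two triplet rows shown on the slide.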
18
ORFs as gene candidates
  • A gene candidate is an open reading frame that
    begins with a start codon (usually ATG, GTG or
    TTG, but this is species-dependent)
  • Most prokaryotic genes code for proteins that are
    60 or more amino acids in length
  • The probability that a random sequence of n
    codons has no stop codons is (61/64)^n
  • When n is 50, there is a probability of about 91%
    that the random sequence contains a stop codon
  • When n is 100, this probability exceeds 99%
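
The stop-codon probabilities can be checked with a short calculation. The sketch assumes that n counts codons and that all 64 codons are equally likely and independent, which real genomes only approximate; the function names are made up for this example.

```python
def prob_no_stop(n_codons):
    """Probability that n random codons contain no stop codon.

    Assumes each of the 64 codons is equally likely, so a codon is a
    non-stop codon with probability 61/64.
    """
    return (61 / 64) ** n_codons


def prob_at_least_one_stop(n_codons):
    """Probability that n random codons contain one or more stop codons."""
    return 1 - prob_no_stop(n_codons)


if __name__ == "__main__":
    for n in (50, 100):
        print(n, round(prob_at_least_one_stop(n), 4))
```

Under these assumptions, long stop-free stretches are unlikely to arise by chance, which is why ORF length is a useful signal.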

19
Codon Bias
  • The genetic code is degenerate
  • Equivalent triplet codons code for the same amino
    acid
  • http://www.pangloss.com/seidel/Protocols/codon.html
  • Codon usage varies
  • organism to organism
  • gene to gene
  • Biological basis
  • Avoidance of codons similar to stop
  • Preference for codons that correspond to abundant
    tRNAs within the organism
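
One way to measure codon bias is to count how often each synonymous codon is used within a coding sequence. The sketch below does this for the four glycine codons; `synonymous_fractions` is a hypothetical helper, and it assumes the input is an in-frame coding sequence starting at its first codon.

```python
from collections import Counter

GLY_CODONS = ("GGG", "GGA", "GGT", "GGC")   # the four glycine codons


def synonymous_fractions(cds, synonymous_codons):
    """Return the fraction of uses of each codon among a synonymous set.

    cds is assumed to be an in-frame coding sequence; any trailing
    partial codon is ignored.
    """
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in synonymous_codons)
    if total == 0:
        return {c: 0.0 for c in synonymous_codons}
    return {c: counts[c] / total for c in synonymous_codons}
```

Comparing these fractions between genes or organisms gives the kind of per-codon usage figures that codon usage tables report.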

20
Codon Bias: Gene Differences
Codon      GAL4   ADH1
Gly GGG    0.21   0
Gly GGA    0.17   0
Gly GGT    0.38   0.93
Gly GGC    0.24   0.07
Slide modified from http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt
21
Codon Bias: Organism Differences
  • Yeast genome: Arg specified by AGA 48% of the time
    (other five equivalent codons ~10% each)
  • Fruit fly genome: Arg specified by CGC 33% of the
    time (other five ~13% each)
  • Complete set of codon usage biases can be found
    at http://www.kazusa.or.jp/codon/

22
GC content
  • GC relative to AT is a distinguishing factor of
    bacterial genomes
  • Varies dramatically across species
  • Serves as a means to identify bacterial species
  • For various biological reasons
  • Mutational bias of particular DNA polymerases
  • DNA repair mechanisms
  • horizontal gene transfer (transformation,
    transduction, conjugation)

23
GC Content
  • GC content may be different in recently acquired
    genes than elsewhere
  • This can lead to variations in the frequency of
    codon usage within coding regions
  • There may be significant differences in codon
    bias within different genes of a single
    bacterium's genome
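
A simple way to look for regions whose GC content differs from the rest of a genome is a sliding-window scan. The sketch below is only illustrative: the window and step sizes are arbitrary defaults, and ambiguous bases such as N count toward the window length but not toward GC.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=100, step=50):
    """Return (start, gc_fraction) for each full window along seq."""
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]
```

Windows whose GC fraction departs strongly from the genome-wide value are candidates for recently acquired DNA, though this signal alone is not conclusive.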

24
Ribosome Binding Sites
  • RBS is also known as a Shine-Dalgarno sequence
    (species-dependent) that should bind well with
    the 3' end of 16S rRNA (part of the ribosome)
  • Usually found within 4-18 nucleotides of the
    start codon of a true gene

25
Shine-Dalgarno Sequence
  • Is a nucleotide sequence (consensus AGGAGG)
    that is present in the 5'-untranslated region of
    prokaryotic mRNAs.
  • This sequence serves as a binding site for
    ribosomes and is thought to influence the reading
    frame.
  • If a subsequence aligning well with the
    Shine-Dalgarno sequence is found within 4-18
    nucleotides of an ORF's start codon, that
    improves the ORF's candidacy.
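
The Shine-Dalgarno check can be sketched as a scan of the region upstream of a candidate start codon. Several details here are assumptions: the distance is measured as the gap between the end of the 6-base motif and the start codon, matches are ungapped, and the score is a plain count of positions agreeing with AGGAGG. The name `best_sd_match` is hypothetical.

```python
SD_CONSENSUS = "AGGAGG"


def best_sd_match(seq, start, min_gap=4, max_gap=18):
    """Find the best ungapped match to the SD consensus upstream of start.

    start is the index of the first base of the start codon. gap is the
    number of bases between the end of the candidate motif and the start
    codon (an assumed reading of "within 4-18 nucleotides").
    Returns (score, position, gap), or None if no window fits.
    """
    seq = seq.upper()
    k = len(SD_CONSENSUS)
    best = None
    for gap in range(min_gap, max_gap + 1):
        pos = start - gap - k
        if pos < 0:
            break                              # ran off the 5' end
        window = seq[pos:pos + k]
        score = sum(a == b for a, b in zip(window, SD_CONSENSUS))
        if best is None or score > best[0]:
            best = (score, pos, gap)
    return best
```

A score of 6 is a perfect match; what lower score still counts as aligning well is a judgment call that depends on the species.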

26
Bacterial Promoter
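-35 region: T82 T84 G78 A65 C54 A45
Spacer: 16-18 bp
-10 region: T80 A95 T45 A60 A50 T96
+1: (A,G)
(The number after each base is the percentage of promoters with that base at that position.)
Not so simple: remember, these are consensus sequences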
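
A rough promoter scan can be built from the consensus bases above (TTGACA at -35, TATAAT at -10, separated by 16-18 bp). The sketch below counts unweighted matches, which is a simplification: it ignores the per-position percentages, and the `min_score` threshold of 9 out of 12 is an arbitrary assumption.

```python
MINUS_35 = "TTGACA"   # consensus read from the -35 positions
MINUS_10 = "TATAAT"   # consensus read from the -10 positions


def matches(a, b):
    """Count positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))


def scan_promoters(seq, min_score=9, spacers=(16, 17, 18)):
    """Return (position, spacer, score) for candidate promoters.

    position is the index of the first base of the -35 hexamer and
    score is the number of matching bases out of 12.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq)):
        for spacer in spacers:
            j = i + 6 + spacer                 # start of the -10 hexamer
            if j + 6 > len(seq):
                continue
            score = matches(seq[i:i + 6], MINUS_35) + matches(seq[j:j + 6], MINUS_10)
            if score >= min_score:
                hits.append((i, spacer, score))
    return hits
```

Because real promoters often deviate from the consensus, a position-weighted score would usually be a better fit than this match count.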
27
Termination Sequences
  • 3'-U tail
  • Stem/loop
  • Inverted repeat immediately preceding the runs of
    uracil
  • Termination sequence
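
The terminator features above (an inverted repeat that can form a stem/loop, immediately followed by a run of uracils) can be searched for in a DNA sequence, where the uracils appear as T. This is a deliberately strict sketch: it requires a perfect stem of fixed length, and the stem length, loop sizes and minimum T run are assumed values, not established thresholds.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_terminators(seq, stem=5, loops=range(3, 9), min_t=4):
    """Return (stem_start, loop_length, t_run_start) for candidate terminators.

    A candidate is a perfect inverted repeat (two arms of length stem
    separated by a loop) whose right arm ends immediately before a run
    of at least min_t T bases.
    """
    seq = seq.upper()
    hits = []
    for t_start in range(2 * stem + min(loops), len(seq) - min_t + 1):
        if seq[t_start:t_start + min_t] != "T" * min_t:
            continue
        right = seq[t_start - stem:t_start]    # arm just before the T run
        for loop in loops:
            left_start = t_start - 2 * stem - loop
            if left_start < 0:
                break
            left = seq[left_start:left_start + stem]
            if left == reverse_complement(right):
                hits.append((left_start, loop, t_start))
    return hits
```

Real terminators tolerate mismatches and variable stem lengths, so a practical search would score stem stability rather than demand an exact repeat.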