Identifying ORFs depends on searching for start and stop codons - PowerPoint PPT Presentation

Provided by: Instr171
Transcript and Presenter's Notes

1
  • Identifying ORFs depends on searching for start
    and stop codons.
  • Confirming that an ORF has been discovered
    depends on locating a promoter.
  • Failure to find a promoter suggests that the
    start and stop codons are part of an operon.
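The start/stop codon search described above can be sketched in a few lines of Python. This is only an illustration, not a production gene finder: the function name find_orfs is hypothetical, only the forward strand is scanned, ATG is assumed to be the only start codon, and the default minimum of 60 codons is borrowed from the figure quoted later in this deck.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=60):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the index of the A in ATG; end is one past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                # Count codons from ATG up to, but not including, the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

A complete search would also scan the reverse complement, giving six reading frames in total.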

2
  • In prokaryotes, the ribosome binds to nts 5' to
    (upstream of) the start site of translation.
  • These nts are called the Shine-Dalgarno
    sequence.
  • Typically the sequence is 5'-AGGAGGU-3'.
  • Mutations in the Shine-Dalgarno sequence can
    prevent mRNA from being translated.
  • The Shine-Dalgarno sequence is an important
    marker for an ORF.
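As a rough sketch of how this marker could be used, the snippet below checks whether the core of the consensus appears shortly upstream of a candidate start codon. The function name, the choice of the six-base core AGGAGG (written as DNA), and the 20 nt search window are all assumptions made for illustration; they do not come from the slides.

```python
SD_CORE = "AGGAGG"  # core of the 5'-AGGAGGU-3' consensus, written as DNA

def has_shine_dalgarno(dna, start_index, window=20):
    """Check for the SD core within `window` nts upstream of a start codon.

    start_index is the position of the A in the candidate ATG.
    """
    upstream = dna.upper()[max(0, start_index - window):start_index]
    return SD_CORE in upstream
```

Real ribosome binding sites often deviate from the consensus, so an exact substring match like this will miss weaker sites.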

3
  • A common practice is to use computer software to
    generate a translation from a DNA or mRNA
    sequence.
  • The process is called conceptual translation.
  • The program reads the DNA or mRNA one codon
    (three nts) at a time.
  • It then generates the corresponding protein
    sequence.
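A minimal sketch of conceptual translation follows. It assumes the standard genetic code, reads a single frame starting at the first nucleotide, and stops at the first stop codon; the names CODON_TABLE and translate are chosen here for illustration only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate DNA or mRNA codon by codon until the first stop codon."""
    seq = seq.upper().replace("U", "T")  # accept mRNA as well as DNA
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, translate("ATGGCCAAGTAA") reads the codons ATG, GCC, AAG and then a stop, giving the peptide MAK.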

4
  • The protein sequence can be compared to other
    known protein sequences in a proteomics database
  • Operons contain specific sequences for the
    termination of transcription.
  • These sequences are called intrinsic terminators

5
  • Intrinsic terminators have two features:
  • (1) an inverted repeat, e.g.
  • 5'-CGGATGCATCCG-3'
  • 3'-GCCTACGTAGGC-5'
  • (2) approximately 6 uracils following the
    inverted repeat
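The two features can be turned into a simple scan, sketched below under strong simplifying assumptions: the inverted repeat is treated as a perfect palindrome with no loop between its halves (as in the example above), the sequence is given as the DNA coding strand so the uracil run appears as a run of T, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_inverted_repeat(seq):
    """True if the sequence equals its own reverse complement."""
    s = seq.upper()
    return s == reverse_complement(s)

def looks_like_terminator(dna, stem_len=12, t_run=6):
    """True if a perfect inverted repeat is directly followed by a T run."""
    dna = dna.upper()
    for i in range(len(dna) - stem_len - t_run + 1):
        window = dna[i:i + stem_len]
        tail = dna[i + stem_len:i + stem_len + t_run]
        if is_inverted_repeat(window) and tail == "T" * t_run:
            return True
    return False
```

Real terminators usually have a few unpaired nts forming a loop between the two halves of the repeat, so a practical scanner would need to allow a gap and some mismatches.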

6
  • RNA molecules can form secondary structure,
    usually through base pairing within the inverted
    repeats.
  • In a typical intrinsic terminator, the inverted
    repeat is 7 to 20 nts in length and rich in GC.
  • The significance of the secondary structure is
    that it causes the RNA polymerase to pause for
    about 1 minute.
  • This is a very long time, since the rate of
    incorporation is about 100 nts/sec.

7
  • If the pause occurs where the RNA-DNA hybrid
    consists of uracil (RNA) paired with adenine
    (DNA template), the weak base pairing lets the
    nascent mRNA dissociate from the template,
    effectively ending transcription.
  • So the pause is sufficient to terminate
    transcription.

8
  • The relative abundance of GC versus AT is a
    distinguishing attribute of a bacterial genome.
  • Measuring GC content can be used to identify
    bacteria.
  • GC content varies from 25% to 75% across
    different prokaryote genomes.
  • The GC content of bacteria is altered over time
    by DNA polymerase and repair mechanisms

9
  • As a result, the ratio of GC to AT tends to be
    uniform within a bacterial genome.
  • Studying GC content has revealed that bacterial
    genes have been transferred as groups, not as
    single genes.
  • These sequences range in length from hundreds to
    thousands of nts.
  • The process of transferring genes is known as
    horizontal gene transfer.
  • This leads to bacterial genomes appearing to be
    patchworks of regions with distinctive GC
    content.
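One way to look for such a patchwork is to compute GC content in sliding windows along the genome and look for windows that depart from the genome-wide value. The sketch below shows only the calculation; the function names, window size and step are arbitrary choices for illustration.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in a sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq, window=1000, step=500):
    """GC content of each full window, as (window_start, gc_fraction) pairs."""
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]
```

Windows whose GC fraction differs sharply from the rest of the genome are candidates for horizontally transferred regions, although deciding what counts as "sharply" needs a statistical threshold that is not covered here.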

10
  • In a bacterial genome, between 85% and 88% of
    the nts are associated with coding genes.
  • In E. coli, the average coding sequence length is
    950 bp and these sequences are separated by only
    118 bp on average.
  • It is theorized that having such a compact genome
    may aid the ability to divide quickly.
  • It is interesting that bacteria can not only
    acquire large sequences with distinctive GC
    content but can lose them as well.
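As a quick consistency check on the E. coli figures above, the coding fraction can be estimated from the average gene and spacer lengths. The calculation assumes, as a simplification, that the genome is just an alternation of one average gene and one average spacer; the function name is hypothetical.

```python
def coding_fraction(mean_gene_bp, mean_spacer_bp):
    """Coding fraction when each gene is followed by one spacer."""
    return mean_gene_bp / (mean_gene_bp + mean_spacer_bp)

# 950 bp genes separated by 118 bp spacers
print(f"{coding_fraction(950, 118):.1%}")
```

The result is about 89%, which is close to the 85% to 88% range quoted for bacterial genomes.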

11
  • Genes in bacterial genomes have the following
    characteristics:
  • A long ORF (60 or more codons in length)
  • Matches to simple promoter sequences; RNA
    polymerase recognizes the -35 and -10 sequences
    of promoters.
  • A recognizable transcriptional termination signal
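A promoter match of this kind can be sketched as a search for the two boxes at roughly the right spacing. The slides do not give the box sequences; the code below assumes the commonly cited E. coli consensus (TTGACA at -35 and TATAAT at -10) separated by 16 to 18 nts, and the function names and mismatch tolerance are illustrative choices.

```python
MINUS_35 = "TTGACA"  # assumed E. coli -35 consensus, not given in the slides
MINUS_10 = "TATAAT"  # assumed E. coli -10 consensus, not given in the slides

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_promoters(dna, max_mismatch=2, spacers=range(16, 19)):
    """Return (pos_35, pos_10) index pairs for candidate promoters."""
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 5):
        if mismatches(dna[i:i + 6], MINUS_35) > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap  # start of the candidate -10 box
            box = dna[j:j + 6]
            if len(box) == 6 and mismatches(box, MINUS_10) <= max_mismatch:
                hits.append((i, j))
    return hits
```

Allowing mismatches matters because few real promoters match the consensus exactly, but a loose tolerance also produces many false positives in AT-rich regions.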

12
  • Comparisons of the nt (or amino acid) sequence
    with those of other bacterial species give good
    matches.
  • Eukaryotic genomes are very different from
    prokaryotic genomes.
  • Compartmentalization of cellular functions means
    more control is required.
  • Most eukaryotes are multicellular, and although
    all their cells have the same genome, the
    expression of genes differs between cells.

13
  • Eukaryote genomes have more nts than seem to be
    required for genes and control sequences, i.e.
    junk DNA.
  • This makes eukaryote genomes more complex and
    flexible than prokaryote genomes.
  • Diploid eukaryote genomes have two copies of each
    gene. In humans, the genes are carried on the
    autosomes (chromosomes 1 to 22) and the sex
    chromosomes.
  • The chromosomes are larger than those of
    prokaryotes.

14
  • In humans, the shortest chromosome is about
    55 Mb and the longest about 250 Mb; the whole
    genome is about 3,200 Mb.
  • Sequencing eukaryote genomes is more difficult
    than sequencing prokaryote genomes.
  • Simply searching for overlapping fragments is not
    sufficient for sequencing eukaryote genomes.
  • STS (sequence tagged sites) and EST (expressed
    sequence tags) are used as additional markers.
  • These sequences are placed on a physical map of
    the chromosome and then used to arrange the DNA
    fragments.

16
  • Discovering eukaryote genes can be accomplished
    using software such as GrailEXP and GENSCAN,
    which rely on neural networks and dynamic
    programming techniques.
  • Neural networks rely on searching for
    statistical similarities between a known sequence
    and an unknown sequence.
  • These programs search for intron/exon boundaries,
    promoters, and putative ORFs.

17
  • Prokaryotes use a single RNA polymerase, while
    eukaryotes have three different RNA polymerases,
    each composed of 8 to 12 proteins.
  • Each eukaryote RNA polymerase recognizes a
    different promoter.
  • RNA polymerase I promoters span -45 to +20; it
    transcribes genes for ribosomal RNAs.

18
  • RNA polymerase II binds to promoters that extend
    from far upstream to -25. The transcribed genes
    are protein-coding.
  • RNA polymerase III binds to promoters located at
    +50 to +100, within the transcribed region. The
    transcribed genes are tRNA and other small RNAs.
  • In eukaryotes, every gene has its own promoter,
    unlike prokaryotes, which can have one promoter
    for multiple genes.

19
  • RNA polymerase II promoters contain a sequence
    known as the basal promoter.
  • This is the location where the RNA polymerase II
    initiation complex is assembled and transcription
    begins.
  • In addition, there are upstream promoter elements
    to which proteins other than RNA polymerase II
    bind.
  • It is estimated that 5 upstream promoter elements
    are required to uniquely identify any particular
    gene.
  • Note that the initiation complex can assemble on
    the core (basal) promoter alone, without the
    proteins associated with the upstream elements,
    but this is inefficient.

20
  • RNA polymerase II does not bind directly to the
    basal promoter unlike the prokaryote RNA
    polymerase.
  • Instead, basal transcription factors bind to both
    RNA polymerase II and the promoter.
  • These proteins include a TATA-binding protein
    (TBP) and at least 12 TBP-associated factors
    (TAFs).
  • Just like prokaryotes, eukaryotes have a TATA
    box.
  • The sequence is 5'-TATAWAW-3', where W is either
    A or T with equal frequency.

21
  • The transcription start site (+1) is associated
    with an initiator (Inr) sequence:
  • 5'-YYCARR-3', where Y = C or T and R = G or A.
  • The +1 nt is almost always A.
  • Differences in basal transcription factors exist
    between cell types in the same organism.
  • This leads to differential expression of genes
    according to cell type.
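The TATA box and Inr consensus sequences use ambiguity codes (W, Y, R), which map naturally onto regular-expression character classes. The sketch below converts such a motif into a Python regex and reports match positions; the helper names are hypothetical and only the three ambiguity codes used in these slides are handled.

```python
import re

# Ambiguity codes used in the slides: W = A/T, Y = C/T, R = A/G.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "Y": "[CT]", "R": "[AG]",
}

def motif_regex(motif):
    """Compile a consensus motif with ambiguity codes into a regex."""
    return re.compile("".join(IUPAC[ch] for ch in motif))

TATA_BOX = motif_regex("TATAWAW")
INR = motif_regex("YYCARR")

def find_motif(pattern, dna):
    """Start positions of non-overlapping matches in a DNA sequence."""
    return [m.start() for m in pattern.finditer(dna.upper())]
```

Short motifs like these occur often by chance, so a hit is only suggestive; finding a TATA box and an Inr at a plausible distance from each other is more informative than either match alone.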

22
  • The presence of an Inr and a TATA box are
    important eukaryotic markers for putative genes.
  • In prokaryotes, the RNA polymerase has a strong
    affinity for the promoter. The same is not true
    for eukaryotes.
  • Eukaryotes have a low rate of transcription
    regardless of the promoter sequence.
  • So eukaryotes depend on proteins that act as
    positive regulators to help in transcription.

23
  • Some of these positive regulatory proteins are
    constitutive; that is, they work on various genes
    and do not respond to external signals.
  • Others are regulatory; these interact with
    specific genes and do respond to external
    signals.
  • Collectively these proteins are called
    transcription factors.
  • Most transcription factors bind to specific DNA
    sequences

25
  • Some transcription factors recognize consensus
    sequences close to the transcription start site.
  • A good example is the CAAT box at position -80 in
    most eukaryotic genes. This sequence is always
    found in the same orientation.
  • Others recognize sequences located further away
    from the transcription start site. These are
    generally called enhancers.
  • Enhancers are not orientation specific and can be
    anywhere from -500 to +500 relative to the
    transcription start site.

26
  • Interestingly, some enhancers are located tens of
    thousands of nts upstream from the transcription
    start site.
  • These enhancers bend the DNA into a specific
    shape that brings the transcription factors
    together.
  • The resulting assembly is called an enhanceosome.
  • Other enhancers help mediate the response to
    heat, or control differential gene expression in
    different cell types or during development.

27
  • The nucleus is a physical barrier that allows the
    RNA transcript to be modified before it reaches
    the ribosome.
  • The unmodified RNA transcripts are called hnRNA
    (heterogeneous nuclear RNA).
  • The modifications are capping, splicing and
    polyadenylation.
  • Capping occurs at the 5' end (includes
    methylation) and is performed on all hnRNA.
  • Splicing is the removal of introns and the
    joining of exons.
  • Polyadenylation is the addition of about 250
    adenines at the 3' end of the hnRNA.
  • These modifications are cell type specific.

28
  • A major difficulty for the bioinformatician is
    that in eukaryotes ORF length alone does not
    indicate a protein-coding sequence.
  • Splicing can remove stop codons that lie in the
    introns.
  • One possibility is to search for exon/intron
    boundaries. However, the rules for splicing can
    vary according to cell type.

29
Introns and Exons
  • 8 different kinds of introns have been found in
    eukaryotes.
  • The GU-AG rule is useful for predicting protein
    coding sequences.
  • In the GU-AG rule, the first two nts at the 5'
    end of an intron are GU and the last two nts at
    the 3' end are AG.
  • There is also an internal branch site located
    18 to 40 nts upstream of the 3' splice junction.
  • Most of the information for splicing is within
    the intron itself and not the exon.
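The GU-AG rule gives a very simple first filter for candidate introns. The sketch below tests only the two boundary dinucleotides plus a minimum length; the default of 60 nts is the minimal intron length quoted elsewhere in this deck, the function name is hypothetical, and the branch site is not checked.

```python
def obeys_gu_ag(intron, min_len=60):
    """True if a candidate intron starts with GU, ends with AG, and is long enough.

    Accepts DNA or RNA; T is converted to U before testing.
    """
    s = intron.upper().replace("T", "U")
    return len(s) >= min_len and s.startswith("GU") and s.endswith("AG")
```

Because GU and AG dinucleotides are extremely common, this test rules candidates out far more reliably than it rules them in; gene finders combine it with the longer consensus around each junction.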

30
  • Typically an intron has a minimal length of 60
    nts. There does not seem to be a maximum length.
  • Exons can vary in length from 450 nts to over
    2000 nts. There are some examples of exons
    shorter than 100 nts, but these are not the rule.
  • Genes do not have a minimum number of introns.
    Some genes have over 100.
  • Thus the length and number of introns do not
    seem important.
  • However, the position of the introns in genes is
    highly conserved, suggesting that it is
    important.

31
  • The 5' and 3' splice junctions all seem to be the
    same. Yet it is known that in some genes splicing
    occurs within only one intron.
  • It is not known how the splicing proteins can
    differentiate between different 5' and 3' splice
    junctions.
  • It does not seem to be based on the hnRNA
    sequence.

32
  • For most genes, the hnRNA is spliced into a
    single type of mRNA.
  • Estimates are that 20% of genes are
    alternatively spliced.
  • The actual molecules involved in splicing are
    snRNA (small nuclear RNA) and additional
    proteins.
  • It is possible that the variability in consensus
    sequences of splice junctions and branch sites
    may be necessary for different snRNAs.

33
  • GC content in eukaryotic genomes is not as varied
    as in prokaryotes.
  • Computer algorithms use GC content for the
    following reason:
  • There is a correlation between GC content and
    genes, upstream promoter sequences, codon choice,
    gene length and gene density.

34
  • In eukaryote genomes the sequence CG occurs at
    only 20% of the frequency expected from chance
    alone.
  • Genes were found that had a 1 to 2 kb GC-rich
    sequence at the 5' end.
  • These sequences were located from -1500 to +500
    relative to the gene.
  • These GC sequences are called CpG islands.
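The "expected frequency based on chance" can be made concrete with the observed/expected CpG ratio: the number of CG dinucleotides actually present, divided by the number expected if C and G were placed independently (count of C times count of G, divided by sequence length). The sketch below computes that ratio; the function name is chosen for illustration.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: CG count * length / (C count * G count)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # no CG dinucleotide is possible
    return seq.count("CG") * len(seq) / (c * g)
```

A ratio near 0.2 corresponds to the genome-wide depletion described above, while a stretch with a ratio much closer to 1 is a candidate CpG island.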

35
  • Many of the GCs are part of the binding sites for
    transcription factors, such as Sp1.
  • In the human genome, there are 45,000 CpG islands
    and half are associated with known house-keeping
    genes.
  • House-keeping genes are expressed in all tissues
    and at high levels.
  • Other CpG islands are associated with promoter
    sequences.

36
  • CpG islands are associated with methylation.
  • In methylation, DNA methylase attaches methyl
    groups to cytosine.
  • The cytosine must occur in a 5'-CG-3'
    dinucleotide.
  • Methylated cytosines have a high mutation rate.
  • High levels of methylated CG are associated with
    low levels of acetylated histones and vice versa.

37
  • High levels of acetylated histones are associated
    with gene expression.
  • Methylated cytosines can suppress transcription
    when located in promoter sequences
  • Interestingly methylation patterns vary from one
    cell type to another.

38
  • Histones are positively charged proteins that
    have a high affinity for DNA (negatively charged).
  • The positive charge on histones is reduced by
    acetylation.
  • Histones with a lower affinity for DNA leave the
    DNA more accessible to RNA polymerase.

39
  • Areas of active transcription are euchromatin.
  • Areas without active transcription are
    heterochromatin.
  • Heterochromatin is densely packed DNA, whereas
    euchromatin is loosely packed.