Course Outline - PowerPoint PPT Presentation

1 / 61
About This Presentation
Title:

Course Outline

Description:

Course Outline – PowerPoint PPT presentation

Number of Views:629
Avg rating:3.0/5.0
Slides: 62
Provided by: Ben5152
Category:
Tags: course | fugu | outline

less

Transcript and Presenter's Notes

Title: Course Outline


1
Course Outline
2
What Happens when HGP is Completed ?
  • Known gene number, location, function and
    regulation.
  • Chromosome structure and organization.
  • Non-coding DNA types, amount, distribution and
    function.
  • Gene expression, protein synthesis, and
    post-translation events,
  • protein interactions and networks.
  • Evolutionary conservation among organisms,
    (structure and function).
  • Correlation of SNPs with health and disease.
  • Genes involved in complex traits and multi-gene
    diseases.

3
The Next Step
Locate all the genes and describe their
function. This will probably take another 15-20
years !
4
Reminder Genome Browsers and Gene Prediction
http//genome.ucsc.edu/
5
Eukaryotes vs Prokaryotes
Typical human bacterial cells drawn to
scale.
  • Eukaryotic cells are
  • characterized by
  • membrane-bound
  • compartments,
  • which are absent
  • in prokaryotes.

BIOS Scientific Publishers Ltd, 1999
6
Gene Prediction is Different for Eukaryotes and
Prokaryotes
  • Prokaryotic genes
  • Eukaryotic genes

http//csbl.bmb.uga.edu/resources/slides/gene-find
ing-2.ppt275,4,Review
7
Eukaryotes Splice Signals
8
The operon model of prokaryotic gene regulation
  • Small genomes, high gene density -
  • Example Haemophilus influenza genome
  • 85 genic.
  • Groups of genes coding for related
  • proteins are arranged in units known as
  • operons.
  • No introns - One gene, one protein.
  • Open reading frames - One ORF per gene, ORFs
    begin with start, end with stop codon.

http//www.emc.maricopa.edu/faculty/farabee/BIOBK/
BioBookGENCTRL.html
9
Gene finding is simple !
http//tardigrada.cap.ed.ac.uk/teaching/genomics/t
ech2/img2.htm
10
Prokaryotic Gene Prediction Tools
  • Glimmer
  • http//cbcb.umd.edu/software/glimmer/
  • GeneMark
  • http//exon.gatech.edu/genemark/genemark_prok_gms_
    plus.cgi
  • ORNL Annotation Pipeline
  • http//compbio.ornl.gov/GP3/pro.html
  • FramePlot - protein-coding region prediction tool
    for high GC-content bacteria
  • http//www.nih.go.jp/jun/cgi-bin/frameplot.pl

11
Gene Prediction in Eukaryotes
  • Complex gene structure.
  • Large genomes (0.1 to 3 billion bases).
  • Exons and introns (interrupted).
  • Low coding density (lt30).
  • 2-3 in humans, 25 in Fugu, 60 in yeast
  • Alternate splicing (40-60 of all genes).
  • Considerable number of pseudogenes.

NATURE BIOTECHNOLOGY VOLUME 25 NUMBER 8 AUGUST
2007
12
Gene Prediction in Eukaryotes
13
Non-coding In Genomes
http//fig.cox.miami.edu/cmallery/150/gene/noncod
ing.genes.jpg
14
Non-Protein Coding Gene Tools
  • tRNA
  • tRNA-ScanSE (http//www.genetics.wustl.edu/eddy/tR
    NAscan-SE/)
  • FAStRNA (http//bioweb.pasteur.fr/seqanal/interfac
    es/fastrna.html)
  • snoRNA
  • snoRNA database (http//rna.wustl.edu/snoRNAdb/)
  • microRNA
  • Sfold (http//www.bioinfo.rpi.edu/applications/sfo
    ld/index.pl)
  • SIRNA (http//bioweb.pasteur.fr/seqanal/interfaces
    /sirna.html)

http//harlequin.jax.org/GenomeAnalysis/GeneFindin
g04.ppt257,38,Non-protein Coding Gene Tools and
Information
15
Approaches to Gene Finding
  • Direct Look for something that looks like one
    gene (homology) - Homology based gene prediction
  • Exact or near-exact matches (similarity searches)
    of
  • EST, cDNA, or proteins from the same, or closely
    related
  • organisms (comparative genomics).
  • Indirect
  • Look for something that looks like all genes (ab
    initio).
  • Hybrid
  • Combine homology and ab initio.

16
(No Transcript)
17
(No Transcript)
18
General Things to Remember about
(Protein-Coding) Gene Prediction Software
  • Work best on genes that are reasonably similar to
    something seen previously.
  • Finds protein coding regions far better than
    non-coding regions. First and last exons are
    difficult to annotate because they contain UTRs.
    Small genes are not statistically significant and
    therefore hard to predict.
  • In the absence of external (direct) information,
    alternative forms will not be identified.
  • It is imperfect ! (Its biology, after all).

19
Finding Eukaryotic Genes Computationally
  • Gene finding based on homology evidence BLAST,
    FASTA, BLAT etc.
  • Content-based Methods
  • CpG islands, GC content, hexamer repeats,
    composition statistics, codon frequencies
  • Feature-based Methods
  • donor sites, acceptor sites, promoter sites,
    start/stop codons, polyA signals, feature lengths
  • Similarity-based Methods
  • sequence homology (DNA, protein), EST searches
  • Pattern-based
  • HMMs, Artificial Neural Networks
  • Most effective is a combination of all the above !

20
Content-Based Methods
  • CpG islands (in and near approximately 40 of
    promoters of mammalian genes (about 70 in human
    promoters)
  • Very abundant near gene start site
  • High GC content found in 5 ends of genes
  • Codon Bias
  • Some codons are strongly preferred in coding
    regions, others are not
  • Positional Bias
  • 3rd base tends to be G/C rich in coding regions

21
Feature-Based Methods
  • Based on identifying gene signals (promoter
    elements, splice sites, start/stop codons, polyA
    sites, etc.)
  • Wide range of methods
  • Consensus sequences
  • Weight matrices
  • Neural networks
  • Decision trees
  • Hidden Markov Models (HMMs)

22
Eukaryotic Genomic Hints that a Gene is Nearby
  • PolII RNA promoter elements
  • GC box, TATA box, CCAAT region.
  • Kozak consensus sequence (eukaryotic ribosome
    binding site-RBS consensus (gcc)gccRccAUGG,
    where R is a purine (adenine or guanine) three
    bases upstream of the start codon (AUG).
  • Splicing signals donor, acceptor.
  • Termination signal.
  • Polyadenylation signal.
  • Promoter elements.

23
Pol II Promoter Elements
  • 5 cap region/signal
  • (a modified guanine nucleotide that has been
    added to the "front" or 5' end of a eukaryotic
    messenger RNA shortly after the start of
    transcription, important for ribosome
    recognition).
  • nCAGTnG
  • TATA box (30 bp upstream)
  • - TATAAA
  • CCAAT box (80 bp upstream)
  • - TAGCCAATG
  • GC box (200 bp upstream)
  • - GGGCGG
  • None of these are essential for gene expression
  • Each of these may have more than one copy, except
    the TATA box, to produce greater effect
  • There are other promoter elements

24
Control of Gene Expression Promoter Prediction
25
How does it Works Motif Identification
  • Exon-Intron Borders Splice Sites

Exon Intron
Exon  gaggcatcagGTttgtagactgtgtttcAG
tgcacccact ccgccgctgaGTgagccgtgtc
tattctAGgacgcgcggg tgtgaattagGTaagaggtt
atatctccAGatggagatca ccatgaggagGTgagtg
ccattatttccAGgtatgagacg
Splice site Splice site
Motif Extraction Programs at http//www-btls.jst.g
o.jp/ Tools for genome analysis
http//www-btls.jst.go.jp/cgi-bin/Tools/index.cgi?
langen
http//www.seas.gwu.edu/simhaweb/cs177/spring2004
/zeeberglecture1Part2.ppt317,25,How it works I
Motif identification
26
Prediction of Splice Junction Sites
Splice site prediction tools - 3 splice site
CAG/GT 5 splice
site MAG/GTRAGT
(M is A or C R is A or G). http//l25.itba.mi.cn
r.it/webgene/wwwspliceview.html
http//dot.imgen.bcm.tmc.edu9331/seq-search/gene
-search.html Splice predictor
http//bioinformatics.iastate.edu/cgi-bin/sp.cgi
Splice site prediction by Neural Network
http//www.fruitfly.org/seq_tools/splice.html
27
Prediction of Translation Start Site
  • Translation start ATG
  • How to predict a translation start

ATG
GCCATGGCGA .. ACGATGCTGT . GACATGGTAC
AGGATGGGCT GCGATGTGGC
AUG codon finder tool http//www.cbs.dtu.dk/
services/NetStart/ http//l25.itba.mi.cnr.it/web
gene/wwwaug.html
28
Polyadenylation Termination Signals
  • Polyadenylation signal
  • AA/TTAAA
  • Located 20 bp upstream of poly-A cleavage site
  • Termination Signal
  • AGTGTTCA
  • Located 30 bp downstream of poly-A cleavage site

29
Why Polyadenylation is Really Useful
Complementary Base Pairing
AAAAAAAAAAA TTTTTTTTTTT
30
Links for Gene Finding Software
POLY-A signal prediction - 3 end of most
eukaryotic mRNAs (60-200 residues),
post-transcription added http//dot.imgen.bcm.tmc
.edu9331/seq-search/gene-search.html
http//l25.itba.mi.cnr.it/genebin/wwwHC_POLYA
http//genomic.sanger.ac.uk/gf/gf.shtml http//ru
lai.cshl.org/tools/polyadq/polyadq_form.html
Repeated elements (Repeat Masker) http//ftp.ge
nome.washington.edu/RM/RepeatMasker.html
http//l25.itba.mi.cnr.it/genebin/wwwrepeat.pl
GC rich areas, http//rulai.cshl.org/tools/CpG_p
romoter/
31
The Annotation Pipeline
  • Mask repeats using RepeatMasker
    (http//www.repeatmasker.org/).
  • Run sequence through several gene prediction
    programs.
  • Validation
  • Take predicted genes and do similarity search
    against ESTs and genes from other organisms.
  • Do similarity search for non-coding sequences to
    find ncRNA.

32
Eukaryotic Gene Prediction Tools
  • Genscan (ab initio), GenomeScan (hybrid)
  • (http//genes.mit.edu/)
  • (http//genes.mit.edu/genomescan.html)
  • Twinscan (hybrid)
  • (http//genes.cs.wustl.edu/)
  • FGENESH (ab initio)
  • (http//www.softberry.com/berry.phtml?topicgfind)
  • GeneMark.hmm (ab initio)
  • (http//opal.biology.gatech.edu/GeneMark/eukhmm.cg
    i)
  • MZEF (ab initio)
  • (http//rulai.cshl.org/tools/genefinder/)
  • GrailEXP (hybrid)
  • (http//grail.lsd.ornl.gov/grailexp/)
  • GeneID (hybrid)
  • (http//www1.imim.es/software/geneid/geneid.html
    )
  • FirstEF
  • - http//rulai.cshl.org/tools/FirstEF/

33
Gene Finding Software Promoter Prediction
Promoter Prediction McPromoter
http//genes.mit.edu/McPromoter.html Human
Promoter Prediction http//www.softberry.com/berr
y.phtml?topicfpromgroupprogramssubgrouppromot
er Consite tool http//asp.ii.uib.no8090/cgi-bi
n/CONSITE/consite?rmt_input_single DNA Motif
search (in TRANSFAC) http//motif.genome.jp/
TFBind http//tfbind.ims.u-tokyo.ac.jp/
TFSearch http//www.cbrc.jp/research/db/TFSEARCH
.html http//molsun1.cbrc.aist.go.jp/research/db/
TFSEARCH.html Promoter prediction - DNA region
that RNA polymerase binds before initiating
transcription (TATA box prediction) http//www.fr
uitfly.org/seq_tools/promoter.html Transcription
factor binding sites prediction http//www.cbil.u
penn.edu/tess/index.html
34
Example - gene finding tool http//genes.mit.e
du/genomescan.html
35
When selecting ORF, you can blast the amino
acid sequence and get a protein.
http//www.ncbi.nlm.nih.gov/gorf/gorf.html
6 frames
36
GENSCAN output for sequence
Optimal exon
Initial exon
Internal exon
Terminal exon
Single Exon gene
Sub-Optimal exon
http//genes.mit.edu/
37
Gene Prediction Pipeline - Example
  • Get a genomic sequence from genome browser
  • BRCA1 genomic NG_005905 Fasta format.
  • Apply to gene prediction tools provided
  • GENSCAN http//genes.mit.edu/GENSCAN.html
  • ORFinder http//www.ncbi.nlm.nih.gov/gorf/go
    rf.html
  • Also, worth trying
  • GeneBuilder http//l25.itba.mi.cnr.it/webgene/g
    enebuilder.html
  • Compare results.
  • Try CONSITE example for Transcription factor
    recognition of promoter region
  • (http//asp.ii.uib.no8090/cgi-bin/CONSITE/consite
    ?rmt_input_single)

Step-by-step practical exercise for gene
annotation http//www1.imim.es/eblanco/seminars/
docs/Reus2005/exercise1/index.html
38
Annotation of Putative Genes
Annotation category
  • Matches known protein sequence
  • Strong similarity to protein sequence
  • Similar to known protein
  • Similar to unknown protein
  • Similar to EST (i.e., putative protein)
  • No EST or protein matches (i.e., hypothetical
    protein)

39
Sequencing and Assembling a Genome
  • To sequence a genome, the first task is to cut it
    into many small, overlapping pieces.
  • Then clone each piece.

http//media.wiley.com/assets/1312/61/chapter05.pp
t317,15,Sequencing and Assembling a Genome (I)
40
Sequencing and Assembling a Genome
  • Each piece must be sequenced.
  • Sequencing machines cannot do an entire sequence
    at once.
  • They can only produce short sequences smaller
    than 1 Kb.
  • These pieces are called reads.
  • It is necessary to assemble the reads into
    contigs.

http//media.wiley.com/assets/1312/61/chapter05.pp
t322,16,Sequencing and Assembling a Genome (II)
41
Cleaning DNA Sequences
  • In order to sequence genomes, DNA sequences are
    often cloned in a vector (plasmid, YAC, or
    cosmide).
  • Sequences of the vector can be mixed with your
    DNA sequence.
  • Before working with your DNA sequence, you should
    always clean it with VecScreen

http//www.ncbi.nlm.nih.gov/VecScreen/VecScreen.ht
ml
42
Gene Assembly
Usually genes are found in fragments, and need to
be assembled Simple example input
ACCGT CGTGC TTAC TACCGT
Output --ACCGT-- ----CGTGC
TTAC----- -TACCGT--
Spaces are ignored. Fragments are
assembled according to overlapping areas.
___________ TTACCGTGC
A multiple alignment of the fragments result in
consensus sequence.
Assembled piece 9 bases
43
Sequence Assembly - Real life is more complicated
Errors in sequencing Example
Output --ACCGT--
----CGTGC TTAC-----
-TGCCGT--
Input ACCFT CGTGC TTAC TGCCGT
TTACCGTGC
The consensus is still correct because of
majority voting.
Insertion error Example
Output
--ACC-GT-- ----CAGTGC TTAC------ -TACC-GT--
Input ACCGT CAGTGC TTAC TACCGT
TTACC?GTGC
44
Gene Assembly - Real life is more complicated
Deletion error Example Input ACCGT Output
--ACCGT-- CGTGC
----CGTGC TTAC TTAC----- TACGT
-TAC_GT--
Repeats
TTACCGTGC
x
x
x
Lack of coverage
Target DNA
Uncovered area
45
Sequence Assembly
CAP3 http//pbil.univ-lyon1.fr/cap3.php http//bi
oweb.pasteur.fr/seqanal/interfaces/cap3.html
http//deepc2.psi.iastate.edu/aat/cap/cap.html
Example sequences (seq) http//genome.cs.mtu.edu/
cap/data/
46
Compute a Restriction Map
NEBcutter V2.0 http//tools.neb.com/NEBcutter2/in
dex.php http//bioinformatics.org/sms2/rest_map.ht
ml WatCut An on-line tool for restriction
analysis, silent mutation scanning, and SNP-RFLP
analysis http//watcut.uwaterloo.ca/watcut/watcut
/template.php Sequence Extractor generates a
clickable restriction map and PCR primer map of a
DNA sequence. Protein translations and
intron/exon boundaries are also shown
http//bioinformatics.org/seqext/
47
Perform PCR Using a Computer
  • Polymerase Chain Reaction (PCR) is a method for
    amplifying DNA.
  • PCR is used for many applications, including
  • Gene cloning
  • Forensic analysis and paternity tests
  • PCR amplifies the DNA between two anchors called
    PCR primers.
  • Hydrogen bonding of single-stranded nucleic acids
    is referred to as "annealing two complementary
    sequences will form hydrogen bonds between their
    complementary bases (G to C, and A to T or U) and
    form a stable double-stranded molecule.

primers
http//media.wiley.com/assets/1312/61/chapter05.pp
t309,6,Making PCR with a Computer
48
Degenerate Primers
Degenerate Primers are used for amplification of
sequences from different organisms a set of
primers which have a number of options at several
positions in the sequence so as to allow
annealing to and amplification of a variety of
related sequences.
http//www.mcb.uct.ac.za/pcroptim.htm http//www.s
cielo.br/img/revistas/mioc/v100n6/a08fig01.gif
49
PCR Primer Design 10 Rules
1.  Primer length should be 18-22 bases in
length This length is long enough for adequate
specificity, and short enough for primers to bind
easily to the template at the annealing
temperature.   2.  Base composition determines
DNA stability (40-60 GC content higher GC
makes more stable DNA)   3.  GC clamp Primers
should end (3') in a G or C, or CG or GC this
promotes specific binding   4.  Runs of three or
more Cs or Gs at the 3'-ends of primers may
promote mis-priming at G or C-rich sequences
(because of stability of annealing), and should
be avoided.   http//www.premierbiosoft.com/tech
_notes/PCR_Primer_Design.html
50
PCR Primer Design 10 Rules
5.  Primer Tms (melting temp.) between 52-58oC
are preferred definition the temperature at
which one half of the DNA duplex will dissociate
to become single stranded and indicates the
duplex stability.   Annealing temp
determines reaction specificity. 6.  Secondary
structure 3'-ends of primers should not be
complementary (ie. base pair), as otherwise
primer self dimers will be synthesized
preferentially to any other product  7. 
Secondary structure Cross dimers formed
between pairs of primers. 8. Target sequence
Amplicon length (usually 100-500 bps). 9. Check
cross homology with related genes and possible
pseudo-genes. Amplicon location (distance from
3'end). 10. Intron spanning.
http//www.premierbiosoft.com/tech_notes/PCR_Prime
r_Design.html
51
Primer Design Tools
Primer3 http//biotools.umassmed.edu/bioapps/pr
imer3_www.cgi PrimerQuest (also option for
RT-PCR) http//www.idtdna.com/Scitools/Applicatio
ns/PrimerQuest PCR primers based upon
multialignments (PriFi use demo) http//cgi-www.
daimi.au.dk/cgi-chili/PriFi/main?config.x101conf
ig.y30 PCR primers designed from protein
multiple sequence alignments http//bioinformatics
.weizmann.ac.il/blocks/codehop.html GeneFisher
input single or multiple sequence(s), either
nucleotide or amino acid. For multiple sequences
a multiple alignment will be calculated (or
upload already aligned sequences).
http//bibiserv.techfak.uni-bielefeld.de/genefishe
r2/submission.html (example sequences in
web-site).
52
Example
53
Special Features Primer Design
Create overlapping PCR products in large
sequences. http//www2.eur.nl/fgg/kgen/primer/Over
lapping_Primers.html Creating primers around
exons in genomic DNA. http//www2.eur.nl/fgg/kgen/
primer/Genomic_Primers.html Creating primers
around SNPs in genomic DNA. http//www2.eur.nl/fg
g/kgen/primer/SNP_Primers.html Creating primers
around the Open Reading Frame of cDNAs.
http//www2.eur.nl/fgg/kgen/primer/cDNA_Primers.h
tml
54
Examine Primer Properties
NetPrimer http//www.premierbiosoft.com/netprime
r/netprlaunch/netprlaunch.html Blast primer
sequence against NCBI and provide primer
propertieshttp//www.idtdna.com/analyzer/Applica
tions/OligoAnalyzer/ OligoCalculator
Oligonucleotide properties calculator http//www.
basic.northwestern.edu/biotools/oligocalc.html
OligoAnalyzer http//www.idtdna.com/analyzer/Ap
plications/OligoAnalyzer/
55
Other Primer Utilities
UCSC In-Silico PCR UCSC http//genome.ucsc.ed
u/cgi-bin/hgPcr?commandstart Electronic PCR
(e-PCR) is computational procedure that is used
to identify sequence tagged sites(STSs), within
DNA sequences http//www.ncbi.nlm.nih.gov/sutils/
e-pcr/ In silico simulation of molecular
biology experiments http//insilico.ehu.es/
http//insilico.ehu.es/PCR/
56
Primers Design Pipeline
  • Use accurate sequence data as much as possible
    !
  • Restrict primer search to regions that best
    reflect goals
  • Locate candidate primers
  • Discard candidate primers that show
    undesirable self-hybridization
  • Verify the site-specificity of the primer

57
Primers Design Pipeline (example)
  • Question
  • Hybrid cells are formed containing human BRCA1
    cDNA in mouse cells.
  • We want to identify cells which had incorporated
    the human cDNA, using PCR primers.
  • Steps
  • Look for mouse and human BRCA1 in NCBI/GENE
    databse.
  • Locate BRCA1 cDNA sequence (is there only 1
    transcript ?).
  • If there are more than 1 transcripts, compare
    between them (lets take only 3 for this example)
    and find the region that would identify all
    transcripts (hint use http//www.ebi.ac.uk/Tools
    /clustalw/index.html).
  • Find primer-pairs for PCR (http//frodo.wi.mit.e
    du/primer3/input.htm).
  • Check primer specificity (human only, not
    mouse)
  • UCSC http//genome.ucsc.edu/cgi-bin/hgPcr?comman
    dstart and/or
  • NCBI (BLAST) http//genome.ucsc.edu/cgi-bin/hgPc
    r?commandstart
  • Check primer properties (http//www.basic.north
    western.edu/biotools/oligocalc.html)

58
Bioinformatics - Past and Present
ORTHOLOG GENES (Taxonomy)
SEQUENCE ALIGNMENT
CODING REGIONS
CONSERVED DOMAINS
SEQUENCES LITERATURE
3-D STRUCTURE
GENE FAMILIES
SIGNAL PEPTIDE
MUTATIONS POLYMORPHISM
GENOME MAPS
CELLULAR LOCATION
59
Bioinformatics - Present and Future
ORTHOLOG GENES (Taxonomy)
SEQUENCE ALIGNMENT
CODING REGIONS
CONSERVED DOMAINS
GENE EXPRESSION, GENES FUNCTION, DRUG PERSONAL
THERAPY
3-D STRUCTURE
GENE FAMILIES
SIGNAL PEPTIDE
MUTATIONS POLYMORPHISM
GENOME MAPS
CELLULAR LOCATION
60
Are We Done ?
Now this is not the end. It is not even the
beginning of the end. But it is perhaps, the
end of the beginning. Winston Churchill, 1942
(3 years into WW2)
http//www.globecartoon.com/neweconomy/10.html
61
Thank-you !
Dr. Metsada Pasmanik-Chor Bioinformatics Unit,
001 Sherman Bldg. Faculty of Life Science,
TAU Tel x 6992 E-mail metsada_at_bioinfo.tau.ac.il
Bioinfo. Unit webpage http//www.tau.ac.il/lifes
ci/bioinfo/
Write a Comment
User Comments (0)
About PowerShow.com