High Throughput Genomic DNA Sequencing and Bioinformatics - PowerPoint PPT Presentation

About This Presentation
Slides: 63
Provided by: Mur2

Transcript and Presenter's Notes

1
High Throughput Genomic DNA Sequencing and
Bioinformatics
2
The Human Genome Project
  • The human genome has now officially been sequenced.
  • That was a big job.
  • How did they do it?
  • Is there anything that a knowledge of
    bioinformatics tells us that we should watch out
    for in the human genome sequence?

3
What is DNA Sequencing?
  • A DNA sequence is the order of the bases on one
    strand.
  • By convention, we order the DNA sequence from 5'
    to 3', from left to right.
  • Often, only one strand of the DNA sequence is
    written, but usually both strands have been
    sequenced as a check.
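
A minimal sketch of how the second strand can be derived from the written one, assuming plain A/C/G/T strings. The helper name `reverse_complement` is illustrative and does not come from the slides.

```python
# Illustrative helper: derive the opposite strand of a DNA sequence.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, also written 5' to 3'."""
    # Complement every base, then reverse so the result reads 5' to 3'.
    return seq.translate(COMPLEMENT)[::-1]


top_strand = "ATGGCA"
bottom_strand = reverse_complement(top_strand)
print(top_strand, bottom_strand)
```

Sequencing both strands gives two reads that should be reverse complements of each other, which is what makes the second strand useful as a check.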

4
DNA Sequencing was Awarded the Nobel Prize
  • Walter Gilbert and Fred Sanger were awarded the
    Nobel Prize in Chemistry for the development of
    two different methods of DNA sequencing.
  • http://www.nobel.se/chemistry/laureates/1980/
  • (Oh yes, and Paul Berg for recombinant DNA: a big
    year!)

5
Two Methods of DNA Sequencing
  • Maxam-Gilbert method, in which a DNA fragment
    is end-labeled with 32P phosphate and
    chemically cleaved to leave a signature pattern
    of bands.
  • Sanger Method, in which a DNA sequence is
    annealed to an oligonucleotide primer, which is
    then extended by DNA polymerase using a mixture
    of dNTP and ddNTP (chain terminating) substrates.
    This is the main method used now.

6
Sanger Method is a Form of DNA Synthesis
  • DNA to be sequenced acts as a template for the
    enzymatic synthesis of new DNA strand starting at
    a defined primer.
  • Polymerases used are Pol I type polymerases.
  • Incorporation of a dideoxynucleotide blocks
    further synthesis of the new DNA strand.

7
Remember the Rules of In Vivo DNA Replication
8
Remember the Rules of In Vivo DNA Replication
9
How the Reaction Works
  • If the DNA is double stranded, the reaction is
    started by heating until the two strands of DNA
    separate.
  • Lower the temperature and the primer sticks to
    its intended location by H bonds.
  • DNA polymerase starts elongating the primer.
  • If allowed to go to completion, a new strand of
    DNA would be the result.

10
How the Reaction Works
  • If we start with a billion identical pieces of
    template DNA, we'll get a billion new copies of
    one of its strands.
  • We run the reactions, however, in the presence of
    a dideoxyribonucleotide.
  • A dideoxynucleotide is just like a regular
    nucleotide, except it has no 3' hydroxyl group -
    once it's added to the end of a DNA strand,
    there's no way to continue elongating it.

11
(No Transcript)
12
(No Transcript)
13
Original Sanger Sequencing
  • A mixture of dNTPs and a single ddNTP is used in
    the reaction tubes.
  • We can start with 4 different reaction tubes,
    each with all four dNTPs (dATP, dGTP, dTTP, dCTP)
    and ONLY one of ddATP, ddCTP, ddGTP, or ddTTP.
  • The key is MOST of the nucleotides are regular
    ones, and just a fraction of them are
    dideoxynucleotides.

14
An Example of a T tube
  • MOST of the time when a 'T' is required to make
    the new strand, the enzyme will get a good one
    and it continues to elongate.
  • MOST of the time after adding a T, the enzyme
    will go ahead and add more nucleotides.
  • However, about 1% of the time, the enzyme will
    get a dideoxy-T, and that strand can never again
    be elongated.
  • It eventually breaks away from the enzyme,
    leaving a dead-end DNA that can't be further
    extended.

15
Original Sanger Sequencing
  • Sooner or later ALL of the copies will get
    terminated by a T.
  • But each time the enzyme makes a new strand, the
    place it gets stopped will be random.
  • In millions of starts, there will be strands
    stopping at every possible T along the way.
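
The random stopping described here can be put into numbers. This is a sketch under a simplifying assumption that is not stated on the slides: every T position carries the same independent chance `p` of receiving a dideoxy-T (1% is used only as an illustrative value).

```python
def termination_probability(k: int, p: float = 0.01) -> float:
    """Chance that a strand stops exactly at the k-th T (k starts at 1)."""
    # The strand must survive the first k-1 T positions, then pick up a ddT.
    return (1 - p) ** (k - 1) * p


def fraction_terminated(n: int, p: float = 0.01) -> float:
    """Fraction of strands stopped somewhere within the first n T positions."""
    return 1 - (1 - p) ** n


for k in (1, 10, 100):
    print(k, termination_probability(k))
print(fraction_terminated(500))
```

Under this model every T position gets some terminated strands, but later positions get fewer, which is one reason the signal fades toward the long end of a read.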

16
Specific Primers Start the Sequence
  • ALL of the strands we make started at one exact
    position.
  • ALL of them end with a T. There are billions of
    them ... many millions at each possible T
    position.
  • To find out where all the T's are in our newly
    synthesized strand, all we have to do is find out
    the sizes of all the terminated products!
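
As a toy illustration of reading a sequence from product sizes alone, the sketch below lists the fragment lengths each of the four tubes would produce and then rebuilds the strand by sorting them. The function names are hypothetical, and the primer length is ignored for simplicity.

```python
def terminated_lengths(new_strand: str, base: str) -> list[int]:
    """Lengths of the products that end in a dideoxy `base`."""
    return [i + 1 for i, b in enumerate(new_strand) if b == base]


def read_ladder(tubes: dict[str, list[int]]) -> str:
    """Sort all fragments by size and note which tube each one came from."""
    by_length = {n: base for base, lengths in tubes.items() for n in lengths}
    return "".join(by_length[n] for n in sorted(by_length))


strand = "GATTACA"
tubes = {base: terminated_lengths(strand, base) for base in "ACGT"}
print(tubes["T"])          # sizes of the products from the T tube
print(read_ladder(tubes))  # sequence recovered from the sizes
```

Sorting by size is what the gel does physically: the shortest products run fastest, so reading the lanes from bottom to top gives the new strand 5' to 3'.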

17
(No Transcript)
18
Non-Radioactive DNA Labels
  • Add a chemical tag to each ddNTP that can emit a
    fluorescent color when excited by a laser.
  • We can add a different dye to each ddNTP, so each
    base fluoresces at a different wavelength.
  • Run the reactions in only one tube, not 4 tubes!
  • This is easier and faster. A big contribution to
    high throughput sequencing.

19
Automated DNA Sequencing
  • We don't even have to 'read' the sequence from
    the gel - the computer does that for us!
  • This is a plot of the colors detected in one
    'lane' of a gel (one sample), scanned from
    smallest fragments to largest.
  • The computer even interprets the colors by
    printing the nucleotide sequence across the top
    of the plot.
  • This is just a fragment of the entire file, which
    would span around 700 or so nucleotides of
    accurate sequence.
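
A deliberately simplified sketch of how a computer could turn the four color signals into letters: at each peak, call the base whose dye channel is brightest. The data layout (one dictionary of channel intensities per peak) is an assumption for illustration; real base callers also handle peak detection, dye mobility shifts, and quality scoring.

```python
def call_bases(peaks: list[dict[str, float]]) -> str:
    """For each peak, call the base whose dye channel has the highest signal."""
    return "".join(max(peak, key=peak.get) for peak in peaks)


# Made-up intensities for three peaks, ordered from smallest fragment up.
peaks = [
    {"A": 0.05, "C": 0.02, "G": 0.91, "T": 0.04},
    {"A": 0.88, "C": 0.03, "G": 0.06, "T": 0.02},
    {"A": 0.04, "C": 0.07, "G": 0.03, "T": 0.79},
]
print(call_bases(peaks))
```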

20
Automated DNA Sequence Readouts
21
(No Transcript)
22
The Biology of DNA Sequencing
  • Virtually all DNA sequencing (both automated and
    manual) relies on the Sanger method:
  • DNA replication with dideoxy chain termination
  • separation of the resulting molecules by
    polyacrylamide gel electrophoresis.
  • The DNA fragment to be sequenced must first be
    cloned into a vector (plasmid or lambda).
  • Then the cloned DNA must be copied in a test tube
    (in vitro ) by a DNA polymerase enzyme to obtain
    a sufficient quantity to be sequenced.

23
(No Transcript)
24
Sample DNA Sequence from ABI sequencer
25
  • Automated sequencing machines, particularly those
    made by PE Applied Biosystems, use 4 colors of
    dye, so they can read all 4 bases at once.

26
Challenges of DNA Sequencing
  • One technician with an automated DNA sequencer
    can produce over 20 kb of raw sequence data per
    day.
  • The real challenge of DNA sequencing is in the
    analysis of the data.

27
J. Craig Venter
  • Proposes a whole-genome shotgun sequencing method
    to NIH in 1991. Proposal rejected.
  • Sets up The Institute for Genomic Research (TIGR)
    in 1992 (private and non-profit)
  • TIGR publishes the first complete genome sequence
    in 1995 (Haemophilus influenzae)
  • Forms Celera Genomics in 1998 to sequence the human
    genome in three years (private, for-profit)
  • The Sequence of the Human Genome is published in
    Science, February 2001
  • Venter departs Celera, 2002

28
Human Genome Project Sequencing Strategy
  • Clone-based physical mapping
  • Digest genome and make Bacterial Artificial
    Chromosomes (BACs, 150,000 bp each)
  • Digest BACs to create fingerprints
  • Organize BACs to form contigs
  • Select BAC clones for sequencing
  • Shear BACs and shotgun clone
  • Sequence clones and assemble overlaps

29
(No Transcript)
30
(No Transcript)
31
Celera Sequencing Strategy
  • Whole-genome shotgun sequencing of five
    individuals with 5-fold coverage
  • Computer assembles overlapping sequences to form
    contigs
  • Contigs are assembled into scaffolds
  • Scaffolds are mapped to the genome by two or more
    Sequence Tagged Site (STS) markers
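
To give a feel for what "5-fold coverage" means, here is a sketch using the idealized Lander-Waterman model, which assumes reads land uniformly at random. That model is brought in for illustration and is not mentioned on the slide.

```python
import math


def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of times each base is sequenced."""
    return num_reads * read_len / genome_len


def fraction_uncovered(c: float) -> float:
    """Expected fraction of bases hit by no read at coverage c (Poisson)."""
    return math.exp(-c)


c = coverage(num_reads=10_000, read_len=500, genome_len=1_000_000)
print(c, fraction_uncovered(c))
```

At 5-fold coverage the model leaves roughly 0.7% of bases unsequenced, so gaps between contigs are expected even before repeats and cloning bias are considered.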

32
(No Transcript)
33
Technology Breakthroughs
  • Development of Expressed Sequence Tag (EST)
    method to discover and map human genes
  • Development of Bacterial Artificial Chromosomes
    (BACs) to clone large DNA fragments
  • Development of an automated high-throughput
    capillary DNA sequencer in 1998 (Applied
    Biosystems ABI PRISM 3700 DNA Analyzer)
  • Development of powerful computers and software to
    analyze sequence data

34
Genome Questions
  • Has every base in our genome been sequenced?
  • What is the total number of genes and where are
    they located?
  • How many genes have an unknown function?
  • What percent of our DNA encodes genes and what is
    the remainder?
  • Do we share DNA sequences with other organisms?
  • How much sequence variation is there between
    individuals?

35
Genome Sequencing: HTG, GSS, (WGS)
[Flow diagram; labels: whole BAC insert (or genome), shredding, cloning and isolating, sequencing, GSS division or trace archive, assembly, draft sequence (HTG division)]
36
GSS Division: Genome Survey Sequences
  • Genomic equivalent of ESTs
  • BAC and other first pass surveys
  • BAC end sequences
  • Whole Genome Shotgun (some)
  • RAPIDS and other anonymous loci

[Figure labels: SP6 end, T7 end]
37
Working Draft Sequence
38
Technology Limitations
  • Sequences can only be determined in approximately
    400-800 base pair sections known as reads.
  • This is due to both the biochemistry of the DNA
    polymerase enzyme and the resolution of
    polyacrylamide gel electrophoresis.
  • Most genes contain many thousands of bp and many
    modern sequencing projects are intended to
    produce complete sequences of large genomic
    regions (millions of bp)

39
Assembly of Contigs
  • As a result, all sequencing projects must involve
    the division of the target DNA into a set of
    overlapping 500 bp fragments.
  • Then these fragments are assembled into complete
    sequences (contigs)
  • Contig = contiguous sequenced region
  • Assembly of overlapping fragments is a
    computational problem
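
A toy version of that computational problem: a greedy assembler that repeatedly merges the two sequences with the longest exact end overlap. This is a sketch for error-free reads on a single strand, not the algorithm used by any of the programs named in this presentation.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(reads)
    while True:
        best = (0, -1, -1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))
```

Greedy merging is easy to follow but can be misled by repeated sequence, which is one reason real assemblers use additional information.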

40
(No Transcript)
41
Contig Assembly Problems
  • 1) The 500 bp reads of sequence data have errors
    of both incorrectly determined bases and
    insertions/deletions.
  • 2) The error rate is highest at the beginning and
    end of each read - precisely the regions that
    must be overlapped.
  • 3) Some sequence from cloning vectors is often
    included at the ends of sequence reads.

42
Sequence Assembly Algorithms
  • Different from similarity searching
  • Look for ungapped overlaps at the ends of fragments
  • (method of Wilbur and Lipman, SIAM J. Appl.
    Math. 44: 557-567, 1984)
  • High degree of identity over a short region
  • Want to exclude chance matches, but not be thrown
    off by sequencing errors
  • Vector removal uses similar approach, but less
    stringent
  • should recognize small regions of identity and
    tolerate more mismatches
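
The trade-off between excluding chance matches and tolerating sequencing errors can be sketched as a brute-force search for an ungapped end overlap within a mismatch budget. This is not the Wilbur and Lipman method; the parameter names and default thresholds are assumptions chosen for illustration.

```python
def end_overlap(a: str, b: str, min_len: int = 20,
                max_mismatch_rate: float = 0.05) -> int:
    """Longest ungapped overlap of the end of `a` with the start of `b`.

    `min_len` guards against chance matches; `max_mismatch_rate` sets how
    many sequencing errors are tolerated inside the overlap.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-n:], b[:n]))
        if mismatches <= max_mismatch_rate * n:
            return n
    return 0


read_a = "AAAACGTACGTTGCA"
read_b = "CGTACGATGCATTTT"  # overlaps read_a with one miscalled base
print(end_overlap(read_a, read_b, min_len=8, max_mismatch_rate=0.1))
print(end_overlap(read_a, read_b, min_len=8, max_mismatch_rate=0.0))
```

A vector screen could reuse the same idea with a smaller `min_len` and a larger `max_mismatch_rate`, matching the less stringent settings the slide describes.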

43
Celera Innovation: Clone End Tracking
  • Create 3 libraries with 2, 10, and 50 kb inserts
  • Use information from clone ends: distance and
    orientation
  • Can span some gaps between contigs and determine
    the size of gaps
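
A sketch of the gap-size arithmetic, assuming the two end reads of one clone land in neighbouring contigs A and B, point toward each other, and come from a library with a known insert size. The function and parameter names are hypothetical.

```python
def estimate_gap(insert_size: int, contig_a_len: int,
                 fwd_start_in_a: int, rev_end_in_b: int) -> int:
    """Estimate the gap between contig A and contig B from one clone.

    fwd_start_in_a: where the forward end read begins within contig A.
    rev_end_in_b:   distance from the start of contig B to the outer end
                    of the reverse end read.
    """
    covered_in_a = contig_a_len - fwd_start_in_a
    return insert_size - covered_in_a - rev_end_in_b


# A 10 kb clone: 4,000 bp fall in contig A and 3,500 bp in contig B.
print(estimate_gap(10_000, 50_000, 46_000, 3_500))
```

Insert sizes vary within a library, so in practice many clones spanning the same gap would be averaged rather than trusting a single pair.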

44
Overlap at ends, not internal
45
Software determines strategy
  • Based on their faith in the speed and
    reliability of sequence analysis/assembly
    software, researchers have generally taken one of
    three different approaches to planning sequencing
    projects
  • Ordered sub-cloning
  • Primer walking
  • Shotgun sequencing

46
Ordered cloning
  • People who don't trust software generally put a
    lot of time into dividing large pieces of DNA
    into small ordered overlapping fragments
  • This strategy requires much more initial cloning
    work in the laboratory, but it minimizes the
    number of actual sequencing reads required to
    complete a project.
  • It is easy to assemble the reads since it is
    known how they should fit together to form the
    final contig.

47
Primer Walking
  • Make a new primer from the end of each new
    sequence read
  • It requires very fast and accurate analysis of
    sequence reads since each step uses information
    from the previous read
  • Skips sub-cloning step entirely since all
    sequencing reactions can be done on one large
    clone
  • Expensive to make a lot of primers, but the
    price of primer synthesis keeps dropping and
    there is an economy of scale
  • Assembly problems are minimized since both the
    order and the amount of overlap of reads are known
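
One walking step can be sketched as picking the next primer from near the end of the latest read. The primer length and the choice to skip the error-prone tail are assumptions for illustration; real primer design also checks melting temperature, secondary structure, and uniqueness.

```python
def next_primer(read: str, primer_len: int = 20,
                offset_from_end: int = 50) -> str:
    """Choose the next primer from near the 3' end of the current read.

    The last `offset_from_end` bases are skipped because read ends tend
    to have the highest error rates.
    """
    end = len(read) - offset_from_end
    if end < primer_len:
        raise ValueError("read too short to place a primer")
    return read[end - primer_len:end]


read = "A" * 30 + "CGTACGTACGTACGTACGTA" + "T" * 50
print(next_primer(read))
```

Because the primer is taken from the read itself, the next reaction starts inside known sequence and its read overlaps the previous one by a known amount.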

48
Shotgun Sequencing
  • Shotgun sequencing takes maximum advantage of the
    speed and low cost of automated sequencing
  • It relies totally on software to assemble a jumble
    of essentially random sequence reads into a
    coherent and accurate contig
  • TIGR demonstrated proof of concept on the
    genomes of Haemophilus influenzae, Methanococcus
    jannaschii, and Mycoplasma genitalium
  • Celera Genomics demonstrated the ability to
    shotgun sequence the entire human genome (?)

49
Human Genome Assembly
  • The HGP vs. Celera race to sequence the entire
    human genome was a classic battle of different
    strategies
  • The HGP used an ordered cloning approach
  • Breaking the genome into mapped BAC clones, then
    shotgun sequencing the BACs
  • Celera used a modified shotgun method
  • Random clones of various sizes (size selected
    libraries)
  • Plus relative mapping of clone ends (they must be
    located in the assembly at the correct distance
    and orientation)
  • Created custom software to handle the assembly
  • Celera did make use of the scaffold built by
    the HGP

50
Other Large Sequencing Projects
  • Phylogenetic identification/analysis
  • medical studies of bacteria
  • environmental samples
  • EST sequencing - differential expression
  • cDNA studies
  • alternate splicing
  • full length transcripts
  • Genotyping
  • score known alleles
  • identify new mutations

51
Automation
  • The "pipeline" approach
  • Vector removal
  • Assembly of identical and/or overlapping
    fragments
  • Identify genes
  • Lookup on genome if fully sequenced organism
  • Or genome contigs for partially sequenced
    organisms
  • BLAST search of GenBank for similar genes
  • Lookup in specialized database of "predicted
    genes"
  • e.g. Ensembl
  • Project specific analysis
  • differentials between sets
  • Phylogenetics

52
DATABASE!!
  • What these projects all share is a need to keep
    track of a lot of data
  • Hundreds to thousands of sequences
  • Many fields of information about each one
  • Organism, library, plate ID for each clone
  • the sequence itself
  • cluster/contig membership
  • best BLAST hit (accession number, E-value, alignment)
  • genome position
  • Can't keep track just using folders and text
    files on your hard drive
  • Design the database to include all possible
    fields
  • (it's a lot harder to add info later)
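
A minimal sketch of such a database using Python's built-in `sqlite3` module. The table and column names are invented here to mirror the fields listed on this slide; a real project would pick its own schema.

```python
import sqlite3

# In-memory database for illustration; a project would use a file or server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE reads (
        read_id         TEXT PRIMARY KEY,
        organism        TEXT,
        library         TEXT,
        plate_id        TEXT,
        sequence        TEXT NOT NULL,
        contig_id       TEXT,     -- cluster/contig membership
        blast_accession TEXT,     -- best BLAST hit
        blast_evalue    REAL,
        genome_position INTEGER
    )
    """
)
conn.execute(
    "INSERT INTO reads (read_id, organism, library, plate_id, sequence) "
    "VALUES (?, ?, ?, ?, ?)",
    ("r001", "Homo sapiens", "lib01", "plate07", "ATGGCGTACC"),
)

# Example query: which reads have not yet been placed in a contig?
unassembled = conn.execute(
    "SELECT read_id FROM reads WHERE contig_id IS NULL"
).fetchall()
print(unassembled)
```

Defining the analysis columns up front, even while they are still empty, reflects the slide's advice that adding fields later is harder.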

53
Computer tools for sequencing
  • A wide variety of different software tools have
    been created to aid DNA sequencing projects.
  • Each genome project lab has built its own custom
    software
  • UNIX
  • Based on a particular workflow design
  • PHRED, PHRAP, and Consed
  • Many packages for the individual investigator -
    included in most comprehensive molecular
    biology products: MacVector, LaserGene, DNA,
    etc.
  • I will focus on the assembly tools in GCG.

54
The GCG Fragment Assembly System
  • GCG has a complete set of programs that allow
    data entry, and assembly of overlapping
    nucleotide sequence fragments into one contig
  • SEQED: a single sequence editor
  • GELSTART: creates fragment assembly projects
  • GELENTER: adds sequences (reads) to an assembly
    project; input of new sequences from keyboard,
    digitizer, or import of existing text files
  • GELMERGE: assembles individual sequences into
    contigs; can automatically remove vector
    sequences
  • GELASSEMBLE: multiple sequence editor for viewing
    and editing contigs; allows manual alignment of
    fragments, insertion/deletion of gaps, and changing
    of individual bases
  • GELDISASSEMBLE: breaks up contigs into individual
    sequences within a project
  • GELVIEW: displays contigs as a schematic display
    of overlapping fragments

55
(No Transcript)
56
SeqLab has a Chromatogram viewer
57
Other Chromatogram Viewers
  • Applied Biosystems has a free viewer/editor
    program for sequence chromatograms
  • It is called EditView and it is a Macintosh only
    program (does not work in System 9.1 and newer)
  • http://cancer-seqbase.uchicago.edu/documents/EditView.hqx
  • There are a couple of viewers for Windows
    machines
  • ABIView is free from David H. Klatte
  • http://bioinformatics.weizmann.ac.il/software/abiview/abiinfo.html
  • Chromas is $50 shareware from Conor McCarthy,
    Technelysium Pty Ltd in Australia:
    http://www.technelysium.com.au/chromas.html

58
The Genome Sequencing Era
59
Complex Genomes Jan. 2003
  • Chordates
  • Human
  • Mouse
  • Rat
  • Pufferfish
  • Sea squirt (Ciona)
  • Arthropods
  • D. melanogaster
  • D. simulans
  • A. gambiae
  • Higher plants
  • Arabidopsis
  • Rice
  • Fungi
  • Aspergillus terreus

60
Coming soon
  • In progress
  • purple sea urchin
  • zebrafish
  • NHGRI's Priority Organisms
  • Chicken
  • Cow
  • Dog
  • Chimpanzee
  • Honeybee
  • Tetrahymena
  • Oxytricha
  • Several fungi
  • Over 100 bacterial genomes

61
Some Books on the Human Genome Project
  • The Common Thread: A Story of Science, Politics,
    Ethics and the Human Genome, by John Sulston and
    Georgina Ferry
  • The Gene Masters: How a New Breed of Scientific
    Entrepreneurs Raced for the Biggest Prize in
    Biology, by Ingrid Wickelgren
  • The Genome War: How Craig Venter Tried to Capture
    the Code of Life and Save the World, by James
    Shreeve

62
Controversy and Issues
  • Does human DNA sequence information belong to
    everyone?
  • Should publication require the release of all
    data?
  • Did Celera use public information to complete the
    human sequence?
  • Should a gene or life form be patented?
  • Should personal genetic information be protected
    from public release?