1
High Throughput Genomic DNA Sequencing and
Bioinformatics
2
The Human Genome Project

The Human genome is now officially sequenced.
That was a big job.
How did they do it?
Is there anything that a knowledge of
bioinformatics tells us that we should watch out
for in the human genome sequence?

3
What is DNA Sequencing?

A DNA sequence is the order of the bases on one
strand.
By convention, we order the DNA sequence from 5'
to 3', from left to right.
Often, only one strand of the DNA sequence is
written, but usually both strands have been
sequenced as a check.
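
A minimal sketch (not from the original slides) of how the second strand follows from the first, and how a read of the opposite strand can serve as a check. The function names are made up for illustration.

```python
# Base-pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def strands_agree(forward_read, reverse_read):
    """True if a read of the other strand confirms the forward read."""
    return reverse_complement(reverse_read) == forward_read.upper()


print(reverse_complement("ATGC"))
print(strands_agree("ATGC", "GCAT"))
```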

4
DNA Sequencing was Awarded the Nobel Prize

Walter Gilbert and Fred Sanger were awarded the
1980 Nobel Prize in Chemistry for the development
of two different methods of DNA sequencing.
http://www.nobel.se/chemistry/laureates/1980/
(Oh yes, and Paul Berg for recombinant DNA - a big
year!)

5
Two Methods of DNA Sequencing

Maxam-Gilbert Method, in which a DNA sequence
is end-labeled with P-32 phosphate and
chemically cleaved to leave a signature pattern
of bands.
Sanger Method, in which a DNA sequence is
annealed to an oligonucleotide primer, which is
then extended by DNA polymerase using a mixture
of dNTP and ddNTP (chain terminating) substrates.
This is the main method used now.

6
Sanger Method is a Form of DNA Synthesis

DNA to be sequenced acts as a template for the
enzymatic synthesis of new DNA strand starting at
a defined primer.
Polymerases used are Pol I type polymerases.
Incorporation of a dideoxynucleotide blocks
further synthesis of the new DNA strand.

7
Remember the Rules of In Vivo DNA Replication
9
How the Reaction Works

If the DNA is double stranded, the reaction is
started by heating until the two strands of DNA
separate.
Lower the temperature and the primer sticks to
its intended location by H bonds.
DNA polymerase starts elongating the primer.
If allowed to go to completion, a new strand of
DNA would be the result.

10
How the Reaction Works

If we start with a billion identical pieces of
template DNA, we'll get a billion new copies of
one of its strands.
We run the reactions, however, in the presence of
a dideoxyribonucleotide.
This is just like a regular nucleotide, except it
has no 3' hydroxyl group - once it's added to the
end of a DNA strand, there's no way to continue
elongating it.

13
Original Sanger Sequencing

A mixture of dNTPs and a single ddNTP is used in
the reaction tubes.
We can start with 4 different reaction tubes,
each with all four dNTPs (dATP, dGTP, dTTP, dCTP)
and ONLY one of the four ddNTPs (ddATP, ddCTP,
ddGTP, or ddTTP).
The key is MOST of the nucleotides are regular
ones, and just a fraction of them are
dideoxynucleotides.

14
An Example of a T tube

MOST of the time when a 'T' is required to make
the new strand, the enzyme will get a good one
and it continues to elongate.
MOST of the time after adding a T, the enzyme
will go ahead and add more nucleotides.
However, about 1% of the time, the enzyme will
get a dideoxy-T, and that strand can never again
be elongated.
It eventually breaks away from the enzyme,
leaving a dead-end DNA that can't be further
extended.

15
Original Sanger Sequencing

Sooner or later ALL of the copies will get
terminated by a T.
But each time the enzyme makes a new strand, the
place it gets stopped will be random.
In millions of starts, there will be strands
stopping at every possible T along the way.

16
Specific Primers Start the Sequence

ALL of the strands we make started at one exact
position.
ALL of them end with a T. There are billions of
them ... many millions at each possible T
position.
To find out where all the T's are in our newly
synthesized strand, all we have to do is find out
the sizes of all the terminated products!
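
A toy simulation of the idea above, assuming ideal chemistry and a 20-base primer (both assumptions, not from the slides). Each "tube" yields one product length for every position of its base, and sorting all product lengths from smallest to largest reads the sequence back.

```python
def terminated_lengths(new_strand, dd_base, primer_len=20):
    """Sizes of all products that end in the given dideoxy base.

    new_strand is the sequence synthesized after the primer, 5' to 3'.
    """
    return [primer_len + i + 1
            for i, base in enumerate(new_strand)
            if base == dd_base]


def read_gel(new_strand, primer_len=20):
    """Run four 'tubes' and read the sequence from fragment sizes."""
    lanes = {b: terminated_lengths(new_strand, b, primer_len) for b in "ACGT"}
    # Smallest fragment = first base after the primer, and so on upward.
    by_size = sorted((length, base)
                     for base, lengths in lanes.items()
                     for length in lengths)
    return "".join(base for _, base in by_size)


print(terminated_lengths("GATTACA", "T"))
print(read_gel("GATTACA"))
```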

18
Non-Radioactive DNA Labels

Add a chemical tag to each ddNTP that can emit a
fluorescent color when excited by a laser.
We can add a different dye to each ddNTP, so each
base fluoresces at a different wavelength (color).
Run the reactions in only one tube, not 4 tubes!
This is easier and faster. A big contribution to
high throughput sequencing.

19
Automated DNA Sequencing

We don't even have to 'read' the sequence from
the gel - the computer does that for us!
This is a plot of the colors detected in one
'lane' of a gel (one sample), scanned from
smallest fragments to largest.
The computer even interprets the colors by
printing the nucleotide sequence across the top
of the plot.
This is just a fragment of the entire file, which
would span around 700 or so nucleotides of
accurate sequence.

20
Automated DNA Sequence Readouts
22
The Biology of DNA Sequencing

Virtually all DNA sequencing (both automated and
manual) relies on the Sanger method:
DNA replication with dideoxy chain termination,
followed by separation of the resulting molecules
by polyacrylamide gel electrophoresis.
The DNA fragment to be sequenced must first be
cloned into a vector (plasmid or lambda).
Then the cloned DNA must be copied in a test tube
(in vitro) by a DNA polymerase enzyme to obtain
a sufficient quantity to be sequenced.

24
Sample DNA Sequence from ABI sequencer
25

Automated sequencing machines,
particularly those made by PE Applied
Biosystems, use 4 colors of dye, so they can read
all 4 bases at once.

26
Challenges of DNA Sequencing

One technician with an automated DNA sequencer
can produce over 20 kb of raw sequence data per
day.
The real challenge of DNA sequencing is in the
analysis of the data.

27
J. Craig Venter

Proposed a whole-genome shotgun sequencing method
to NIH in 1991. Proposal rejected.
Set up The Institute for Genomic Research (TIGR)
in 1992 (private, non-profit)
TIGR published the first complete genome sequence
in 1995 (Haemophilus influenzae)
Formed Celera Genomics in 1998 to sequence the
human genome in three years (private, for-profit)
"The Sequence of the Human Genome" was published
in Science, February 2001
Venter departed Celera in 2002

28
Human Genome Project Sequencing Strategy

Clone-based physical mapping
Digest genome and make Bacterial Artificial
Chromosomes (BACs, 150,000 bp each)
Digest BACs to create fingerprints
Organize BACs to form contigs
Select BAC clones for sequencing
Shear BACs and shotgun clone
Sequence clones and assemble overlaps

31
Celera Sequencing Strategy

Whole-genome shotgun sequencing of five
individuals with 5-fold coverage
Computer assembles overlapping sequences to form
contigs
Contigs are assembled into scaffolds
Scaffolds are mapped to the genome by two or more
Sequence Tagged Site (STS) markers
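
A back-of-the-envelope sketch of what 5-fold coverage means. The genome size (3 billion bp) and read length (500 bp) are round-number assumptions, and the Poisson estimate of unsampled bases is the standard Lander-Waterman approximation, not something stated on the slide.

```python
import math


def fold_coverage(num_reads, read_len, genome_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_len / genome_size


def reads_needed(coverage, read_len, genome_size):
    """Number of reads required to reach a target coverage."""
    return math.ceil(coverage * genome_size / read_len)


def fraction_unsequenced(coverage):
    """Poisson approximation: chance that a base is hit by zero reads."""
    return math.exp(-coverage)


GENOME = 3_000_000_000   # assumed human genome size in bp
READ = 500               # assumed average read length in bp

print(reads_needed(5, READ, GENOME))
print(round(fraction_unsequenced(5), 4))
```

Under these assumptions, random reads alone still leave a small fraction of bases unsampled at 5-fold coverage, which is one reason gaps between contigs are expected.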

33
Technology Breakthroughs

Development of Expressed Sequence Tag (EST)
method to discover and map human genes
Development of Bacterial Artificial Chromosomes
(BACs) to clone large DNA fragments
Development of an automated high-throughput
capillary DNA sequencer in 1998 (Applied
Biosystems ABI PRISM 3700 DNA Analyzer)
Development of powerful computers and software to
analyze sequence data

34
Genome Questions

Has every base in our genome been sequenced?
What is the total number of genes and where are
they located?
How many genes have an unknown function?
What percent of our DNA encodes genes and what is
the remainder?
Do we share DNA sequences with other organisms?
How much sequence variation is there between
individuals?

35
Genome Sequencing: HTG, GSS, (WGS)

[Flow diagram: whole BAC insert (or genome) ->
shredding -> cloning and isolating -> sequencing ->
GSS division or trace archive -> assembly ->
draft sequence (HTG division)]
36
GSS Division: Genome Survey Sequences

Genomic equivalent of ESTs
BAC and other first pass surveys
BAC end sequences
Whole Genome Shotgun (some)
RAPDs and other anonymous loci

[Figure labels: SP6 end, T7 end]
37
Working Draft Sequence
38
Technology Limitations

Sequences can only be determined in approximately
400-800 base pair sections known as reads.
This is due to both the biochemistry of the DNA
polymerase enzyme and the resolution of
polyacrylamide gel electrophoresis.
Most genes contain many thousands of bp and many
modern sequencing projects are intended to
produce complete sequences of large genomic
regions (millions of bp)

39
Assembly of Contigs

As a result, all sequencing projects must involve
the division of the target DNA into a set of
overlapping 500 bp fragments.
Then these fragments are assembled into complete
sequences (contigs).
Contig = contiguous sequenced region
Assembly of overlapping fragments is a
computational problem
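
To make the computational problem concrete, here is a toy greedy assembler. It is only a sketch under strong assumptions (error-free reads, all from the same strand, exact overlaps, no repeats); real assemblers must handle all of those. The function names and the minimum overlap of 3 are illustrative choices.

```python
def overlap_len(left, right, min_len=3):
    """Longest suffix of `left` that exactly matches a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two fragments with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing overlaps any more: contigs stay separate
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```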

41
Contig Assembly Problems

1) The 500 bp reads of sequence data have errors
of both incorrectly determined bases and
insertions/deletions.
2) The error rate is highest at the beginning and
ends of the reads - precisely the regions that
must be overlapped.
3) Some sequence from cloning vectors is often
included at the ends of sequence reads.
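
One common way to cope with problem 2 is to trim low-quality bases from both ends of each read before assembly. The sketch below assumes each base comes with a Phred-style quality score, Q = -10 log10(probability of error). The threshold of 20 is an arbitrary example, not a value from the slides.

```python
def error_probability(q):
    """Convert a Phred-style quality score into an error probability."""
    return 10 ** (-q / 10)


def trim_low_quality(seq, quals, min_q=20):
    """Drop bases from both ends until a base with quality >= min_q is found."""
    start = 0
    while start < len(seq) and quals[start] < min_q:
        start += 1
    end = len(seq)
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end]


read = "ACGTACGT"
quals = [5, 10, 30, 30, 30, 30, 8, 4]   # made-up scores, poor at both ends
print(trim_low_quality(read, quals))
```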

42
Sequence Assembly Algorithms

Different from similarity searching
Look for ungapped overlaps at the ends of fragments
(method of Wilbur and Lipman, SIAM J. Appl.
Math. 44: 557-567, 1984)
High degree of identity over a short region
Want to exclude chance matches, but not be thrown
off by sequencing errors
Vector removal uses a similar approach, but less
stringent:
it should recognize small regions of identity and
tolerate more mismatches
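
A sketch of an ungapped end-overlap test that tolerates a few sequencing errors. This is a brute-force scan written for illustration, not the Wilbur and Lipman algorithm, and the thresholds (minimum length, allowed mismatch rate) are assumed values chosen only to show the stringent versus less stringent settings.

```python
def end_overlap(left, right, min_len=20, max_mismatch_rate=0.05):
    """Longest ungapped overlap between the end of `left` and start of `right`.

    Returns (overlap_length, mismatches), or None if nothing qualifies.
    """
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        tail, head = left[-n:], right[:n]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * n:
            return n, mismatches
    return None


left = "GGGG" + "ACGTTGCAAT"
clean = "ACGTTGCAAT" + "CCCC"
noisy = "ACGTTGCTAT" + "CCCC"     # one sequencing error in the overlap

# Stringent setting: the noisy read is rejected.
print(end_overlap(left, noisy, min_len=8, max_mismatch_rate=0.0))
# Less stringent setting: the same overlap is accepted with one mismatch.
print(end_overlap(left, noisy, min_len=8, max_mismatch_rate=0.1))
```

A longer minimum length excludes chance matches, while a nonzero mismatch rate keeps real overlaps from being lost to sequencing errors.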

43
Celera Innovation: Clone End Tracking

Create 3 libraries with 2, 10, and 50 kb inserts
Use information from clone ends: distance and
orientation
Can span some gaps between contigs and determine
the size of gaps
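
A simplified sketch of how paired clone ends can size a gap. It assumes the insert size of each clone is known exactly and that each end read has already been placed on a contig; the dictionary keys and coordinates are hypothetical names used only for this example.

```python
from statistics import mean


def gap_from_clone(insert_size, left_contig_len, left_pos, right_pos):
    """Gap implied by one clone whose ends land on two different contigs.

    left_pos:  0-based position in the left contig where the clone starts
    right_pos: number of bases of the right contig covered by the clone
    """
    bases_in_left = left_contig_len - left_pos
    return insert_size - bases_in_left - right_pos


def orientation_ok(left_strand, right_strand):
    """The two end reads of one clone must point toward each other."""
    return left_strand == "+" and right_strand == "-"


def estimate_gap(links):
    """Average the estimates from all consistently oriented clones."""
    estimates = [
        gap_from_clone(link["insert"], link["left_len"],
                       link["left_pos"], link["right_pos"])
        for link in links
        if orientation_ok(link["left_strand"], link["right_strand"])
    ]
    return mean(estimates) if estimates else None


links = [
    {"insert": 10_000, "left_len": 8_000, "left_pos": 5_000,
     "right_pos": 2_500, "left_strand": "+", "right_strand": "-"},
    {"insert": 50_000, "left_len": 8_000, "left_pos": 0,
     "right_pos": 37_300, "left_strand": "+", "right_strand": "-"},
    {"insert": 2_000, "left_len": 8_000, "left_pos": 7_000,
     "right_pos": 500, "left_strand": "+", "right_strand": "+"},  # wrong orientation
]
print(estimate_gap(links))
```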

44
Overlap at ends, not internal
45
Software determines strategy

Based on their faith in the speed and
reliability of sequence analysis/assembly
software, researchers have generally taken one of
three different approaches to planning sequencing
projects:
Ordered sub-cloning
Primer walking
Shotgun sequencing

46
Ordered cloning

People who don't trust software generally put a
lot of time into dividing large pieces of DNA
into small ordered overlapping fragments
This strategy requires much more initial cloning
work in the laboratory, but it minimizes the
number of actual sequencing reads required to
complete a project.
It is easy to assemble the reads since it is
known how they should fit together to form the
final contig.

47
Primer Walking

Make a new primer from the end of each new
sequence read
It requires very fast and accurate analysis of
sequence reads since each step uses information
from the previous read
Skips sub-cloning step entirely since all
sequencing reactions can be done on one large
clone
Expensive to make a lot of custom primers,
but the price of primer synthesis keeps dropping
and there is an economy of scale
Assembly problems are minimized since both the
order and the amount of overlap of reads are known
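
A rough sketch of the primer-picking step, assuming a 20-base primer, that the last 50 bases of a read are too error-prone to trust, and a simple GC-content filter. All three values are illustrative assumptions. The melting temperature uses the Wallace rule of thumb (2 degrees C per A/T, 4 degrees C per G/C), which is only a crude estimate for short oligos.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rule-of-thumb melting temperature in degrees C for a short oligo."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def next_primer(read, primer_len=20, skip_end=50, gc_range=(0.4, 0.6)):
    """Pick the primer closest to the trusted end of the read.

    Returns (start_position, primer_sequence), or None if no window passes.
    """
    read = read.upper()
    last_start = len(read) - skip_end - primer_len
    for start in range(last_start, -1, -1):
        candidate = read[start:start + primer_len]
        if gc_range[0] <= gc_fraction(candidate) <= gc_range[1]:
            return start, candidate
    return None


read = "AT" * 40 + "ACGTACGTACGTACGTACGT" + "A" * 50   # made-up read
print(next_primer(read))
```

Because the primer comes from near the end of the previous read, each new read overlaps the last one by a known amount, which is why assembly is straightforward in this strategy.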

48
Shotgun Sequencing

Shotgun sequencing takes maximum advantage of the
speed and low cost of automated sequencing
It relies totally on software to assemble a jumble
of essentially random sequence reads into a
coherent and accurate contig
TIGR demonstrated proof of concept on the
genomes of Haemophilus influenzae, Methanococcus
jannaschii, and Mycoplasma genitalium
Celera Genomics demonstrated the ability to
shotgun sequence the entire human genome (?)

49
Human Genome Assembly

The HGP vs. Celera race to sequence the entire
human genome was a classic battle of different
strategies
The HGP used an ordered cloning approach
Breaking the genome into mapped BAC clones, then
shotgun sequencing the BACs
Celera used a modified shotgun method
Random clones of various sizes (size selected
libraries)
Plus relative mapping of clone ends (they must be
located in the assembly at the correct distance
and orientation)
Created custom software to handle the assembly
Celera did make use of the scaffold built by
the HGP

50
Other Large Sequencing Projects

Phylogenetic identification/analysis
medical studies of bacteria
environmental samples
EST sequencing - differential expression
cDNA studies
alternate splicing
full length transcripts
Genotyping
score known alleles
identify new mutations

51
Automation

The "pipeline" approach
Vector removal
Assembly of identical and/or overlapping
fragments
Identify genes
Lookup on genome if fully sequenced organism
Or genome contigs for partially sequenced
organisms
BLAST search of GenBank for similar genes
Lookup in specialized database of "predicted
genes"
e.g., Ensembl
Project specific analysis
differentials between sets
Phylogenetics

52
DATABASE!!

What these projects all share is a need to keep
track of a lot of data
Hundreds to thousands of sequences
Many fields of information about each one
Organism, library, plate ID for each clone
the sequence itself
cluster/contig membership
best BLAST hit (accession, e-value, alignment)
genome position
Can't keep track just using folders and text
files on your hard drive
Design the database to include all possible
fields
(it's a lot harder to add info later)
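
One possible starting point for such a database, sketched with Python's built-in sqlite3 module. The table and column names are made up to mirror the fields listed above; a real project would choose its own schema and database system.

```python
import sqlite3

# Hypothetical schema covering the fields named on the slide.
SCHEMA = """
CREATE TABLE clone (
    clone_id   INTEGER PRIMARY KEY,
    organism   TEXT NOT NULL,
    library    TEXT NOT NULL,
    plate_id   TEXT NOT NULL,
    sequence   TEXT,
    contig_id  TEXT,
    genome_pos TEXT
);
CREATE TABLE blast_hit (
    clone_id  INTEGER REFERENCES clone(clone_id),
    accession TEXT,
    e_value   REAL,
    alignment TEXT
);
"""


def open_project_db(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = open_project_db()
conn.execute(
    "INSERT INTO clone (organism, library, plate_id, sequence) "
    "VALUES (?, ?, ?, ?)",
    ("Homo sapiens", "lib1", "P001", "ACGT"),   # placeholder values
)
print(conn.execute("SELECT organism, plate_id FROM clone").fetchall())
```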

53
Computer tools for sequencing

A wide variety of different software tools have
been created to aid DNA sequencing projects.
Each genome project lab has built its own custom
software
UNIX
Based on a particular workflow design
PHRED, PHRAP, and Consed
Many packages for the individual investigator are
included in most comprehensive molecular
biology products: MacVector, LaserGene, DNA,
etc.
I will focus on the assembly tools in GCG.

54
The GCG Fragment Assembly System

GCG has a complete set of programs that allow
data entry, and assembly of overlapping
nucleotide sequence fragments into one contig
SEQED: a single sequence editor
GELSTART: creates fragment assembly projects
GELENTER: adds sequences (reads) to an assembly
project; input of new sequences from keyboard,
digitizer, or import of existing text files
GELMERGE: assembles individual sequences into
contigs; can automatically remove vector
sequences
GELASSEMBLE: multiple sequence editor for viewing
and editing contigs; allows manual alignment of
fragments, insertion/deletion of gaps, and
changing of individual bases
GELDISASSEMBLE: breaks up contigs into individual
sequences within a project
GELVIEW: displays contigs as a schematic display
of overlapping fragments

56
SeqLab has a Chromatogram viewer
57
Other Chromatogram Viewers

Applied Biosystems has a free viewer/editor
program for sequence chromatograms
It is called EditView and it is a Macintosh-only
program (does not work in System 9.1 and newer)
http://cancer-seqbase.uchicago.edu/documents/EditView.hqx
There are a couple of viewers for Windows
machines
ABIView is free from David H. Klatte
http://bioinformatics.weizmann.ac.il/software/abiview/abiinfo.html
Chromas is $50 shareware from Conor McCarthy,
Technelysium Pty Ltd in Australia
http://www.technelysium.com.au/chromas.html

58
The Genome Sequencing Era
59
Complex Genomes Jan. 2003

Chordates
Human
Mouse
Rat
Pufferfish
Sea squirt (Ciona)
Arthropods
D. melanogaster
D. simulans
A. gambiae

Higher plants
Arabidopsis
Rice
Fungi
Aspergillus terreus

60
Coming soon

In progress
purple sea urchin
zebrafish
NHGRI's Priority Organisms
Chicken
Cow
Dog
Chimpanzee
Honeybee
Tetrahymena
Oxytricha
Several fungi
Over 100 bacterial genomes

61
Some Books on the Human Genome Project

The Common Thread: A Story of Science, Politics,
Ethics and the Human Genome by John Sulston and
Georgina Ferry
The Gene Masters: How a New Breed of Scientific
Entrepreneurs Raced for the Biggest Prize in
Biology by Ingrid Wickelgren
The Genome War: How Craig Venter Tried to Capture
the Code of Life and Save the World by James
Shreeve

62
Controversy and Issues