Large-scale genome projects

Slides: 48
Provided by: renataa

Transcript and Presenter's Notes

1
Large-scale genome projects
  • Sequencing DNA molecules in the Mb size range
  • All strategies employ the same underlying
    principles
  • Random Shotgun sequencing

2
Genomic DNA
Shearing/Sonication
Subclone and Sequence
Shotgun reads
Assembly
Contigs
Finishing read
Finishing
Complete sequence
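
The random shotgun idea in the workflow above can be illustrated with a toy simulation. This is a minimal sketch, not part of the original slides: the function names, the uniform placement of reads and the example sizes are all assumptions. It places fixed-length reads at random on a target of known length and compares the covered fraction with the standard expectation 1 - exp(-c), where c is the coverage depth.

```python
import math
import random

def simulate_shotgun(genome_len, read_len, n_reads, seed=0):
    """Return the fraction of positions covered by at least one random read."""
    rng = random.Random(seed)
    covered = [False] * genome_len
    for _ in range(n_reads):
        # Assumption: read starts are uniform along the target molecule.
        start = rng.randrange(genome_len - read_len + 1)
        for pos in range(start, start + read_len):
            covered[pos] = True
    return sum(covered) / genome_len

def expected_covered(genome_len, read_len, n_reads):
    """Expected covered fraction 1 - exp(-c), with depth c = N * L / G."""
    depth = n_reads * read_len / genome_len
    return 1.0 - math.exp(-depth)
```

For example, `simulate_shotgun(40_000, 500, 640)` models a hypothetical 40 kb clone sequenced to roughly 8-fold depth. In this toy model the ends of the molecule are under-sampled, so the simulated value can sit slightly below the expectation.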
3
Nucleotide Database Growth
4
EMBL breakdown by organism
5
EMBL Release 65
6
Progress on Large Sequencing Projects
7
Strategies for sequencing
  • How big can you go?
  • Large-insert clones
    • cosmids 30-40 kb
    • BACs/PACs 50-100 kb
  • Whole chromosomes
  • Whole genomes

8
Genome size and sequencing strategies
[Figure: organisms placed along a genome size axis (log Mb, 0 to 4) and matched to sequencing strategies]
  • Organisms: H. sapiens (3000 Mb), D. melanogaster (170 Mb), C. elegans (100 Mb), P. falciparum (30 Mb), S. cerevisiae (14 Mb), E. coli (4 Mb)
  • Strategies: Whole Genome Shotgun (WGS), Clone-by-clone, Whole Chromosome Shotgun (WCS), Whole Genome Shotgun (WGS) with clone skims
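
To show where each organism falls on the log-scaled axis, the sizes listed on this slide can be converted directly. A small sketch; the dictionary and function names are illustrative, and the sizes are the ones quoted on the slide.

```python
import math

# Genome sizes in Mb, as listed on the slide.
GENOME_SIZES_MB = {
    "H. sapiens": 3000,
    "D. melanogaster": 170,
    "C. elegans": 100,
    "P. falciparum": 30,
    "S. cerevisiae": 14,
    "E. coli": 4,
}

def log_genome_sizes(sizes_mb):
    """Map each organism to log10 of its genome size in Mb."""
    return {name: math.log10(mb) for name, mb in sizes_mb.items()}
```

All six values fall between 0 and 4, which matches the range of the axis.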
9
Genomic DNA
Shearing/Sonication
Subclone and Sequence
Shotgun reads
Assembly
Contigs
Finishing read
Finishing
Complete sequence
10
Strategies for sequencing
  • Size and GC composition of genome
  • Volume of data
  • Ease of cloning
  • Ease of sequencing
  • Genome complexity
  • dispersed repetitive sequence
  • telomeres, centromeres
  • Politics/Funding

11
Strategies: Clone by clone
  • Simple (0.5-2 K reads)
  • Few problems with repeats
  • Relatively simple informatics
  • Scalability
  • Quality of physical map
  • Fingerprint / STS maps
  • End sequencing

12
Strategies: Whole Chromosome Shotgun (WCS)
  • Requires chromosome isolation
  • Moderate complexity (10s K reads)
  • Problems with repeats
  • Complex informatics
  • Inefficient in isolation
  • Quality of physical map
  • Skims of mapped clones

13
Strategies: Whole Genome Shotgun (WGS)
  • Moderate to high complexity (10s-100s K reads)
  • Problems with repeats
  • Complex informatics
  • Quality of physical map
  • Fingerprint map
  • STS markers
  • End-sequences
  • Skims of mapped clones
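
The read counts quoted for the three strategies follow from simple coverage arithmetic. The sketch below assumes a 500 bp average read length and 8-fold coverage; neither figure is stated in the slides, and the function name is illustrative.

```python
import math

def reads_needed(target_bp, coverage=8, read_len=500):
    """Number of reads needed to reach a given average coverage depth.

    coverage and read_len are assumed defaults, not values from the slides.
    """
    return math.ceil(target_bp * coverage / read_len)
```

Under these assumptions a 40 kb cosmid needs about 640 reads, a 1 Mb chromosome about 16,000, and a 4 Mb bacterial genome about 64,000, in line with the 0.5-2 K, 10s K and 10s-100s K ranges above.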

14
Sequencing my genome
Politics
Production
Finishing
Annotation
TIME
MONEY
15
What do you get?
DATA!!, DATA!!, and more DATA!!
  • Sequence
  • incomplete vs. complete
  • First-pass annotation
  • Gene discovery
  • Full annotation
  • A starting point for research

16
Genome annotation is central to functional genomics
17
(No Transcript)
18
(No Transcript)
19
Sequencing
  • Library construction
  • Colony picking
  • DNA preparation
  • Sequencing reactions
  • Electrophoresis
  • Tracking/Base calling

20
Libraries
  • Essentially Sub-cloning
  • Generation of small insert libraries in a well
    characterised vector.
  • Ease of propagation
  • Ease of DNA purification
  • e.g. pUC18, M13

21
Libraries - testing
  • Simple concepts
  • Insert/Vector ratio
  • Real data
  • Insert size
  • Sequence
  • Simple analysis

22
Sequence generation
  • Pick colonies
  • Template preparation
  • Sequence reactions
  • Standard terminator chemistry
  • pUC libraries sequenced with forward and reverse
    primers

23
Sequence generation
  • Electrophoresis of products
  • Old style - slab gels, 32, 64 or 96 lanes
  • New style - capillary gels, 96 lanes
  • Transfer of gel image to UNIX
  • Sequencing machines use a slave Mac/PC
  • Move data to centralised storage area for
    processing

24
Gel image processing
  • Light-to-Dye estimation
  • Lane tracking
  • Lane editing
  • Trace extraction
  • Trace standardisation
  • Mobility correction
  • Background subtraction

25
Pre-processing
  • Base calling using Phred
  • modifies SCF file
  • Quality clipping
  • Vector clipping
  • Sequencing vector
  • Cloning vector
  • Screen for contaminants
  • Feature mark up (repeats/transposons)
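
Quality clipping can be sketched from the Phred quality definition, Q = -10 log10(P), where P is the estimated error probability of a base call. The example below keeps the highest-scoring segment of a read, scoring each base as (cutoff - P), in the style of Mott trimming. The function names and the 0.05 cutoff are assumptions, and this is not necessarily the clipping used in the pipeline the slides describe.

```python
def error_probability(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)

def quality_clip(quals, cutoff=0.05):
    """Return (start, end) of the best-scoring segment; end is exclusive.

    Each base scores cutoff - P_error, so good bases add to the running
    sum and poor bases subtract from it. Returns (0, 0) if nothing passes.
    """
    best_sum, best_range = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(quals):
        run_sum += cutoff - error_probability(q)
        if run_sum <= 0:
            # Segment is no longer worth extending; restart after this base.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best_range = run_sum, (run_start, i + 1)
    return best_range
```

For a read with qualities `[5, 5, 30, 30, 30, 30, 5, 5]` the retained segment is bases 2 to 5 (0-based), trimming the poor-quality ends.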

26
(No Transcript)
27
Finishing
  • Assembly: the process of turning raw single-pass
    reads into contiguous consensus sequence
  • Closure: the process of ordering and merging
    consensus sequences into a single contiguous
    sequence
  • Finished: sequenced on both strands using
    multiple clones. In the absence of multiple
    clones, the clone must be sequenced with multiple
    chemistries. The overall error rate is estimated
    at less than 1 error per 10 kb
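
The finishing target can be put in per-base terms: 1 error per 10 kb is an error probability of 10^-4 per base, which corresponds to a Phred-scaled quality of 40. A short sketch of that arithmetic, with illustrative function names:

```python
import math

def phred_equivalent(errors_per_base):
    """Phred-scaled quality for a per-base error rate: Q = -10 * log10(P)."""
    return -10 * math.log10(errors_per_base)

def max_expected_errors(length_bp, errors_per_base=1e-4):
    """Upper bound on expected errors at the finished-sequence error rate."""
    return length_bp * errors_per_base
```

At this rate a finished 4 Mb genome would be expected to contain fewer than about 400 errors.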

28
Genome Assembly
  • Pre-assembly
  • Assembly
  • Automated appraisal
  • Manual review

29
Pre-Assembly
  • Convert to CAF format
  • flatfile text format
  • choice of assembler
  • choice of post-assembly modules
  • choice of assembly editor

www.sanger.ac.uk/Software/CAF
30
Assembly
  • Assemble using Phrap
  • Read FASTA and quality scores from CAF file
  • Merge existing Phrap .ace file as necessary
  • Adjust clipping
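
To make the idea of assembly concrete, here is a toy greedy overlap assembler. It is a teaching sketch only and does not reflect how Phrap works: it ignores quality values, sequencing errors, reverse-strand reads and repeats, and all names are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
```

With the reads `["ATGGCGT", "GCGTACC", "TACCTGA"]` this yields a single contig, `ATGGCGTACCTGA`. Reads with no sufficient overlap remain as separate contigs, which is the situation gap closure has to resolve.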

31
Assembly appraisal
  • auto-edit
  • removes 70% of read discrepancies
  • Remove cloning vector
  • Mark up sequence features
  • finish
  • Identify low-quality regions
  • Cover using re-runs and long-runs
  • Compare with current databases
  • plate contamination
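
Identifying low-quality regions can be as simple as scanning the consensus quality values for runs below a threshold. A minimal sketch follows; the threshold of 20 and the function name are assumptions, and the automated tools named above will apply their own criteria.

```python
def low_quality_regions(quals, threshold=20, min_len=1):
    """Return (start, end) runs where quality is below threshold; end exclusive."""
    regions = []
    start = None
    for i, q in enumerate(quals):
        if q < threshold:
            if start is None:
                start = i  # a low-quality run begins here
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the consensus.
    if start is not None and len(quals) - start >= min_len:
        regions.append((start, len(quals)))
    return regions
```

Each reported region is a candidate for a re-run or long-run read.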

32
Manual Assembly appraisal
  • Use a sequence editor (GAP/consed)
  • Tools to identify Internal joins
  • Tools to identify and import data from
    overlapping projects
  • Tools to check failed or mis-assembled reads for
    inclusion in project

33
Manual editing
  • Sanger uses a 100% edit strategy
  • Where additional data is required
  • Check clipping
  • Additional sequencing
  • Template / Primer / Chemistry
  • Assemble new data into project
  • GAP4 Auto-assemble
  • Repeat whole process

34
Manual Quality Checks
  • Force annotation tag consistency
  • All unedited data is re-assembled using Phrap
  • All high-quality discrepancies are reviewed
  • Confirm restriction digest (clones)
  • Check for inverted repeats
  • Manually check
  • Areas of high-density edits
  • Areas with no supporting unedited data
  • Areas of low read coverage
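
Confirming a restriction digest amounts to comparing fragment sizes predicted from the assembled sequence with those observed on a gel. The sketch below predicts fragment sizes for one enzyme. It treats the sequence as linear, which is an assumption (cloned inserts sit in circular vectors), and the function name is illustrative. EcoRI recognises GAATTC and cuts after the first G, hence the offset of 1 in the example.

```python
def digest_fragment_sizes(seq, site, cut_offset):
    """Sizes of the fragments produced by cutting a linear sequence.

    site is the recognition sequence; cut_offset is the cut position
    within the site, counted from its first base.
    """
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]
```

For example, `digest_fragment_sizes(consensus, "GAATTC", 1)` gives the predicted EcoRI fragments, where `consensus` stands for the assembled clone sequence. A mismatch with the gel pattern suggests a mis-assembly.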

35
Gap closure
  • Read pairs
  • PCR reactions (long-range / combinatorial)
  • Small-insert libraries
  • Transposon-insertion libraries

36
Gap closure - contig ordering
  • Read pair consistency
  • STS mapping
  • Physical mapping
  • Genetic mapping
  • Optical mapping
  • Large-insert clones
    • skims
    • end-sequencing
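
Read pair consistency can be used to suggest which contigs are neighbours: if the two reads from one subclone land in different contigs, those contigs are probably adjacent. A minimal sketch, assuming each read pair has already been reduced to the names of the contigs its two reads fall in; the names and the two-link minimum are hypothetical.

```python
from collections import Counter

def contig_links(read_pairs, min_links=2):
    """Count read pairs bridging two different contigs.

    read_pairs is an iterable of (contig_of_forward, contig_of_reverse).
    Returns {(contig_a, contig_b): count} for pairs seen at least min_links times.
    """
    counts = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a != contig_b:
            counts[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_links}
```

Requiring more than one supporting pair guards against chimeric clones and mis-tracked lanes. Orientation and spacing would need the read positions as well, which this sketch ignores.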

37
(No Transcript)
38
Annotation
  • DNA features (repeats/similarities)
  • Gene finding
  • Peptide features
  • Initial role assignment
  • Others- regulatory regions

39
Annotation of eukaryotic genomes
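[Diagram: gene expression pathway annotated with prediction methods]
  • Pathway: Genomic DNA, transcription, unprocessed RNA, RNA processing, mature mRNA (with poly-A tail), translation, nascent polypeptide, folding, active enzyme, function (Reactant A to Product B)
  • Annotation approaches shown: ab initio gene prediction, comparative gene prediction, functional identification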
40
Genome analysis overview C.elegans
41
DNA features
  • Similarity features
  • mapping repeats
  • simple tandem and inverted
  • repeat families
  • mapping DNA similarities
  • EST/mRNAs in eukaryotes
  • Duplications
  • RNAs
  • mapping peptide similarities
  • protein similarities
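
Mapping simple tandem repeats can be sketched with a regular expression that looks for a short unit followed by further copies of itself. The unit length limits, the minimum copy number and the function name are assumptions. Dedicated repeat finders also handle imperfect copies, which this example does not.

```python
import re

def find_tandem_repeats(seq, unit_min=1, unit_max=6, min_copies=3):
    """Return (start, end, unit) for perfect tandem repeats; end is exclusive.

    min_copies must be at least 2.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (unit_min, unit_max, min_copies - 1)
    )
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq)]
```

For the sequence `GGCACACACATT` this reports one repeat, the unit `CA` spanning positions 2 to 10.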

42
Gene finding
  • ORF finding (simple but messy)
  • ab initio prediction
  • Measures of codon bias
  • Simple statistical frequencies
  • Comparative prediction
  • Using similarity data
  • Using cross-species similarities
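
ORF finding is the simplest of these approaches and shows why it is described as messy: every ATG followed by an in-frame stop is reported, whether or not it is a real gene. A minimal sketch using the standard stop codons; the function names and the minimum length are assumptions, and coordinates on the minus strand refer to the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_codons):
    """Yield (start, end) for ATG...stop ORFs in the three forward frames.

    end is exclusive and includes the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None

def find_orfs(seq, min_codons=30):
    """Return (strand, start, end) for ORFs on both strands."""
    seq = seq.upper()
    found = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    found += [("-", s, e)
              for s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return found
```

Raising `min_codons` reduces spurious hits at the cost of missing short genes. It does nothing for spliced eukaryotic genes, which is where the ab initio and comparative methods come in.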

43
Peptide features
  • Peptide features
  • low-complexity regions
  • trans-membrane regions
  • structural information (coiled-coil)
  • Similarities and alignments
  • Protein families (InterPro/COGs)
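
Low-complexity regions can be approximated by measuring Shannon entropy in a sliding window along the peptide: windows dominated by one or two residues score low. This is a rough sketch rather than the method of any particular tool, and the window size, threshold and function names are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy, in bits, of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def low_complexity_windows(peptide, window=12, max_entropy=2.2):
    """Return start positions of windows whose entropy is at or below max_entropy."""
    return [
        i
        for i in range(len(peptide) - window + 1)
        if shannon_entropy(peptide[i:i + window]) <= max_entropy
    ]
```

A run such as twelve consecutive glutamines has an entropy of 0 bits, while a window of twelve different residues reaches log2(12), about 3.58 bits.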

44
Initial role assignment
  • Simple attempt to describe the functional
    identity of a peptide
  • Uses data from
  • peptide similarities
  • protein families
  • Vital for data mining
  • Large number of predicted genes remain
    hypothetical or unknown

45
Other regulatory features
  • Ribosomal binding sites
  • Promoter regions

46
(No Transcript)
47
Data Release
  • DNA release
  • Unfinished
  • Finished
  • Nucleotide databases
  • GenBank/EMBL/DDBJ
  • Peptide databases
  • SWISS-PROT/TrEMBL/GenPept
  • Others