Compartmentalized Shotgun Assembly

Slides: 34
Provided by: Jeff575


1
Compartmentalized Shotgun Assembly
CSA: Two stated motivations?
2
Matcher: matched
  • matched Celera reads with PFP bactigs,
  • 20.76 million Celera reads matched (76%),
  • 0.62 million had a mate pair that matched,
  • 2.97 million Celera reads were unique and
    un-screened,
  • 1.189 Gbp of unique DNA sequence, at 5.11X yields
    a predicted 240 Mbp of unique Celera sequence.
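
A quick check of the last bullet, as a minimal sketch: dividing the read bases by the fold coverage gives the amount of underlying sequence. The function name is made up for illustration. Straight division gives roughly 233 Mbp, close to the rounded 240 Mbp on the slide; the slide does not say how its figure was adjusted.

```python
def unique_sequence_mbp(read_bases_gbp: float, fold_coverage: float) -> float:
    """Estimate the underlying sequence (Mbp) from total read bases and coverage."""
    return read_bases_gbp * 1000.0 / fold_coverage


# Figures taken from the slide: 1.189 Gbp of reads at 5.11X coverage.
estimate = unique_sequence_mbp(1.189, 5.11)
print(f"{estimate:.0f} Mbp")
```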

3
Combining Assembler: assembles Celera and PFP
sequence for a transient assembly
  • first, Celera reads are checked for
    over-collapsed regions,
  • sequences with mate pairs that match the region
    are kept,
  • more mate-pair matches means a higher-value
    assembly,
  • then Celera reads are combined with PFP reads,
  • a greedy program recognizes the highest-value
    assemblies first in order to build contigged
    sequence,
  • then Stones is used to fill the gaps.
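
The "highest value first" idea can be sketched as a generic greedy selection. This is an illustration only, not the Combining Assembler itself: the `Candidate` type, the use of the mate-pair count as the value, and the no-overlap rule are all assumptions, since the slide does not say how conflicts are scored or resolved.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical candidate assembly placed on a BAC region."""
    name: str
    start: int              # inclusive start on the region
    end: int                # exclusive end on the region
    mate_pair_matches: int  # assumed measure of "value"


def greedy_select(candidates):
    """Accept candidates in order of decreasing value, skipping any overlap."""
    chosen = []
    ranked = sorted(candidates, key=lambda c: c.mate_pair_matches, reverse=True)
    for cand in ranked:
        # Keep the candidate only if it is disjoint from everything accepted so far.
        if all(cand.end <= c.start or cand.start >= c.end for c in chosen):
            chosen.append(cand)
    return sorted(chosen, key=lambda c: c.start)


candidates = [
    Candidate("A", 0, 100, 5),
    Candidate("B", 50, 150, 9),
    Candidate("C", 150, 300, 2),
    Candidate("D", 90, 160, 1),
]
selected = [c.name for c in greedy_select(candidates)]
print(selected)
```

Here B wins first because it has the most mate-pair matches, A and D are dropped because they overlap B, and C is kept.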

4
Results: PFP vs. CSA
  • The GenBank (PFP) data for the Phase 1 and 2 BACs
    yielded an average of 19.8 bactigs per BAC, of
    average size 8099 bp,
  • Application of the Combining Assembler resulted
    in individual Celera/BAC assemblies being put
    together into an average of 1.83 scaffolds
    (median of 1 scaffold) per BAC region consisting
    of an average of 8.57 contigs of average size
    18,973 bp.

p. 1313, 1st column, last paragraph
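
To put the two bullets side by side, a small sketch of the ratios using only the numbers quoted on this slide (variable names are made up for illustration):

```python
# Figures quoted on the slide.
pfp_bactigs_per_bac = 19.8
pfp_bactig_size_bp = 8099
csa_contigs_per_bac = 8.57
csa_contig_size_bp = 18973

# How many times larger the average piece is, and how many times fewer pieces there are.
size_gain = csa_contig_size_bp / pfp_bactig_size_bp
piece_reduction = pfp_bactigs_per_bac / csa_contigs_per_bac

print(f"average piece size: {size_gain:.2f}x larger")
print(f"pieces per BAC: {piece_reduction:.2f}x fewer")
```

Both ratios come out a little above 2.3, so per BAC region the combined assembly has roughly half as many pieces, each a bit more than twice as long.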
5
Compartmentalized Shotgun Assembly
6
Celera Unique Scaffolds: WGA
  • The 5.89 million Celera fragments not matching
    the GenBank data were assembled with the
    whole-genome assembler.
  • The Celera assembly resulted in a set of
    scaffolds totaling 442 Mbp in span and consisting
    of 326 Mbp of sequence. More than 20% of the
    scaffolds were > 5 kbp long, and these averaged
    63% sequence and 27% gaps with a total of 302 Mbp
    of sequence.

7
Compartmentalized Shotgun Assembly
8
Tiler: tiles
  • scaffolds into larger components using:
  • mate-end pairs,
  • BAC-end pairs,
  • STS,
  • Heuristic: a rule of thumb, simplification, or
    educated guess that reduces or limits the search
    for solutions in domains that are difficult and
    poorly understood. Unlike algorithms, heuristics
    do not guarantee optimal (or even feasible)
    solutions and are often used with no theoretical
    guarantee.

9
Compartmentalized Shotgun Assembly
  • 3,845 Components
  • shredded, WGA


10
93%
  • > 100 kbp Scaffolds
  • 92% sequence, 8% gaps,
  • 105,264 gaps, 1,935 scaffolds,
  • 1.3 Mbp scaffold size, 23,242 bp contig size.
  • > 49% of gaps < 500 bp,
  • > 62% of gaps < 1 kb,
  • No gap larger than 100 kbp.
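
Gap statistics like these can be derived from a plain list of gap sizes. A minimal sketch follows; the function name and the example gap sizes are invented for illustration and are not data from the assembly.

```python
def summarize_gaps(gap_sizes_bp):
    """Return the fraction of gaps under 500 bp and 1 kb, and the largest gap."""
    n = len(gap_sizes_bp)
    return {
        "frac_lt_500bp": sum(g < 500 for g in gap_sizes_bp) / n,
        "frac_lt_1kb": sum(g < 1000 for g in gap_sizes_bp) / n,
        "largest_bp": max(gap_sizes_bp),
    }


# Made-up gap sizes in bp, only to exercise the function.
example_gaps = [120, 450, 800, 2500, 99000]
summary = summarize_gaps(example_gaps)
print(summary)
```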

11
How do you compare assemblies, and why?
12
WGA vs. CSA
  • This gives some measure of consistent coverage
  • 1.982 Gbp (95.00%) of the WGA is covered by the
    CSA,
  • 2.169 Gbp (87.69%) of the CSA is covered by the
    WGA.
  • Only 31 scaffolds were unique to an assembly,
  • 295 kb (0.012%) CSA inconsistent with WGA,
  • 2.108 Mb (0.11%) WGA inconsistent with CSA,
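
One way to arrive at a "covered by" percentage is to merge the matched intervals on one assembly and divide by its length. The sketch below assumes matches are given as half-open (start, end) coordinates; the function name and the toy coordinates are illustrative, and the slides do not describe how the actual comparison was computed.

```python
def covered_fraction(matches, assembly_length):
    """Fraction of an assembly covered by possibly overlapping match intervals."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(matches):
        if current_end is None or start > current_end:
            # A new disjoint block begins: bank the previous one first.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / assembly_length


# Toy example: two overlapping matches and one separate match on 100 bases.
fraction = covered_fraction([(0, 40), (30, 60), (80, 100)], 100)
print(f"{fraction:.0%} covered")
```

Running it in both directions (WGA against CSA, then CSA against WGA) gives two different percentages, which is why the slide reports both.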

Overall, CSA slightly better than WGA. Why? How
does the CSA compare with the Clone-by-Clone
approach?
13
Map First then sequence
Sequence First then map
14
Mapping Scaffolder: GM99 and fingerprint maps
15
Mapping Scaffolder: GM99 and fingerprint maps
16
Tab. 4
17
Assembly and Validation Analysis: did it really
work?
  • Completeness of euchromatic sequence in the
    assembly,
  • estimate the size and number of gaps (Table 3),

18
Assembly and Validation Analysis: did it really
work?
  • Completeness of euchromatic sequence in the
    assembly,
  • estimate the size and number of gaps (Table 3),
  • compare to finished sequences of chromosomes 21
    and 22,
  • 3.4 Mb gaps, 75% of gaps are repeats,
  • match with STS data (ePCR, BLAST),
  • 93.4% of those tested were found assembled, 5.5%
    in chaff (98.9% total),
  • Correctness
  • Mate-Pair analysis.

19
Mate Pair Analysis
  • Valid: correct orientation and correct distance
    (within 3 SD)

2.7% were found to be invalid.
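
The validity rule can be written as a small predicate. This is a sketch under stated assumptions: the function name, its parameters, and the example library (mean 2,000 bp, SD 200 bp) are hypothetical, and the rule is read as "within 3 standard deviations of the library mean".

```python
def is_valid_mate_pair(orientation_ok, distance_bp, lib_mean_bp, lib_sd_bp, n_sd=3.0):
    """A mate pair is valid if correctly oriented and within n_sd SDs of the library mean."""
    within_distance = abs(distance_bp - lib_mean_bp) <= n_sd * lib_sd_bp
    return orientation_ok and within_distance


# Hypothetical 2 kb library: mean 2000 bp, SD 200 bp, so 1400 to 2600 bp is accepted.
print(is_valid_mate_pair(True, 2500, 2000, 200))   # inside the window
print(is_valid_mate_pair(True, 2700, 2000, 200))   # too far apart
print(is_valid_mate_pair(False, 2000, 2000, 200))  # wrong orientation
```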
20
CSA vs. PFP
What does this show?
21
Chromosome 21
Yellow: Same Orientation
Red: Out of Order, Orientation
22
Chromosome 8
23
(No Transcript)
24
  • What's the take-home message?

25
PFP
CSA
Fig. 7, key
26
Fig. 7
27
Gene Prediction and Annotation: Why's it So Hard
to Find Genes?
  • Exons/Introns,
  • Alternative Splicing/Termination,
  • Alternate transcription start/stop sites,
  • Tandem Repeats, Pseudogenes, etc.
  • We don't really understand all there is to know
    about gene and genome structure,
  • etc.

28
Gene Number Predictions: before PFP, WGA or CSA
  • Textbooks: 100,000
  • Upgraded to 142,634? EST data
  • counts that fall far short
  • EST data --> 35,000
  • 35,000 genes based on the density of Chromosome
    22
  • 28,000 - 34,000: humans vs. pufferfish

29
Automated Gene Annotation: OTTO
  • Tell me how it works.
  • How was it validated, including Table 7.
  • if necessary, use the Online Primer and other
    NCBI resources to broaden your understanding,
  • cDNAs, ESTs, RefSeq, Protein Sequence Databases,
    BLAST, etc. are described in appropriate detail
    on the Web.

30
Questions?
31
(No Transcript)
32
Repeat Resolver: ...most of the remaining gaps
were due to repeats.
  • Rocks: use low Discriminator Value contig sets
    to fill gaps; find two or more mate pairs with
    unambiguous matches in the scaffold near the gap
    (2 kb, 10 kb or 50 kb), (1 in 10^7),
  • Stones: find mate pair matches 2 kb, 10 kb, and
    50 kb from the gap, place the mate in the gap,
    check to see if it's consistent with other
    placed sequences.