A hierarchical approach to building contig scaffolds

1
A hierarchical approach to building contig
scaffolds
  • Mihai Pop
  • Dan Kosack
  • Steven L. Salzberg
  • Genome Research 14(1), pp. 149-159, 2004.

2
Sequencing pipeline
  • Random sequencing: unrelated reads, 500-700 base-pairs
  • Assembly: unrelated contigs, 5,000-10,000 base-pairs
  • Scaffolding: unrelated scaffolds, 30,000-50,000 base-pairs
  • Finishing/gap closure: completed genomes, millions to billions of
    base-pairs

3
Scaffolding
  • Given a set of non-overlapping contigs, order and
    orient them along a chromosome

4
Clone-mates
[Figure: a clone and its insert, with the F and R reads marked]

5
Scaffolder output
[Figure labels: physical gaps, sequencing gaps]

6
Problems with the data
  • Incorrect sizing of inserts: inserts are cut from a gel, so sizing
    is subjective; error increases with size
  • Chimeras (ends belong to different inserts): biological reasons
    (esp. for large inserts) and sample tracking (human error)
  • Software must handle a certain error rate.

7
Theoretical abstraction
  • Given a set of entities (reads/contigs) and
    constraints between them (overlaps/mate pairs),
    provide a linear/circular embedding that
    preserves most constraints.

8
Graph representation
  • Nodes: contigs
  • Directed edges: constraints on the relative
    placement of contigs (relative order and
    relative orientation)
  • Embedding: order (coordinate along chromosome)
    and orientation (strand sampled)
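
A minimal Python sketch of this representation. The names `Link`, `Placement`, and `satisfied`, and the explicit `tolerance` field, are illustrative assumptions and do not come from the talk. It shows how one edge constraint can be checked against a proposed embedding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Directed edge (hypothetical model): `src` is placed before `dst`."""
    src: str
    dst: str
    same_orientation: bool  # False: the two contigs lie on opposite strands
    distance: int           # expected offset between contig origins (bp)
    tolerance: int          # allowed deviation from `distance` (bp)


@dataclass(frozen=True)
class Placement:
    """Embedding of one contig: coordinate along the chromosome and strand."""
    position: int
    forward: bool


def satisfied(link, embedding):
    """Return True if `embedding` respects the order, orientation and
    distance implied by `link`."""
    a, b = embedding[link.src], embedding[link.dst]
    orientation_ok = (a.forward == b.forward) == link.same_orientation
    order_ok = a.position < b.position
    distance_ok = abs((b.position - a.position) - link.distance) <= link.tolerance
    return orientation_ok and order_ok and distance_ok
```

An embedding "preserves most constraints" when `satisfied` holds for most links in the graph.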

9
Challenges
  • Orientation: node coloring problem
    (forward/reverse)
  • Feasibility: no cycles with an odd number of
    reversal edges (blue edges)
  • Optimality: remove the minimum number of edges such
    that a solution exists (NP-hard)
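
The feasibility condition can be tested with a breadth-first two-coloring, sketched below in Python. The function name and the `(u, v, reversal)` edge format are assumptions made for illustration. The sketch only decides feasibility; it does not attempt the NP-hard step of choosing a minimum set of edges to remove.

```python
from collections import defaultdict, deque


def orient_contigs(contigs, edges):
    """edges: iterable of (u, v, reversal); reversal=True means u and v
    must receive opposite orientations.
    Returns {contig: True for forward / False for reverse}, or None when
    some cycle contains an odd number of reversal edges."""
    adj = defaultdict(list)
    for u, v, reversal in edges:
        adj[u].append((v, reversal))
        adj[v].append((u, reversal))

    orientation = {}
    for start in contigs:
        if start in orientation:
            continue
        orientation[start] = True  # arbitrary choice per connected component
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, reversal in adj[u]:
                expected = orientation[u] != reversal  # flip on reversal edges
                if v not in orientation:
                    orientation[v] = expected
                    queue.append(v)
                elif orientation[v] != expected:
                    return None  # odd-reversal cycle: no consistent coloring
    return orientation
```

For example, a triangle with exactly one reversal edge is infeasible, while a triangle with two reversal edges can be oriented.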

10
Challenges
  • Ordering: generate a linear embedding
  • Feasibility: lengths of parallel DAG paths are
    consistent
  • Optimality: remove the minimum number of edges such
    that the DAG is feasible (NP-hard)
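
One way to test the path-length condition is to propagate coordinates outward from one contig and check that every edge agrees with them, as in this Python sketch. The function name, the `(u, v, distance)` edge format, and the single global `tolerance` are assumptions; real link data would carry a per-link uncertainty. As with orientation, only feasibility is checked, not the NP-hard edge-removal problem.

```python
from collections import defaultdict, deque


def place_contigs(contigs, edges, tolerance=0):
    """edges: iterable of (u, v, distance), meaning the origin of v lies
    `distance` bp after the origin of u (orientations assumed already fixed).
    Returns {contig: coordinate}, relative to the first contig seen in each
    connected component, or None if two paths imply different offsets."""
    adj = defaultdict(list)
    for u, v, d in edges:
        adj[u].append((v, d))    # walking along the edge adds the distance
        adj[v].append((u, -d))   # walking against it subtracts the distance

    position = {}
    for start in contigs:
        if start in position:
            continue
        position[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, d in adj[u]:
                if v not in position:
                    position[v] = position[u] + d
                    queue.append(v)
                elif abs(position[v] - (position[u] + d)) > tolerance:
                    return None  # parallel paths disagree on the offset
    return position
```

With edges a→b (1000), b→c (2000) and a→c (3000) the two paths from a to c agree; changing the last distance to 5000 makes the layout infeasible.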

11
The real world
  • Use of scaffolds in analysis: longest unambiguous sub-graphs
  • Use of scaffolds in finishing: present all reliable relationships
    between contigs
  • Sources of error: mis-assemblies, sizing errors (increasing with
    library size), chimeras

12
Ambiguous scaffold

13
Hierarchical scaffolding
  • For each contig pair, consolidate all linking
    data into a single relationship (2 correct
    links required)
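
A hedged Python sketch of this consolidation step. The tuple layout, the use of the median as the consolidated distance, and the rule of keeping the best-supported orientation are assumptions for illustration, not details given in the talk. Contig pairs are assumed to be listed in a canonical order.

```python
from collections import defaultdict
from statistics import median


def bundle_links(links, min_links=2):
    """links: iterable of (contig_a, contig_b, same_orientation, distance).
    Groups links by contig pair and implied orientation, discards groups with
    fewer than `min_links` members, and keeps the best-supported group per pair."""
    groups = defaultdict(list)
    for a, b, same_orientation, distance in links:
        groups[(a, b, same_orientation)].append(distance)

    bundles = {}
    for (a, b, same_orientation), distances in groups.items():
        if len(distances) < min_links:
            continue  # a lone link may be a chimera or a tracking error
        best = bundles.get((a, b))
        if best is None or len(distances) > best["support"]:
            bundles[(a, b)] = {
                "same_orientation": same_orientation,
                "distance": median(distances),
                "support": len(distances),
            }
    return bundles
```

Requiring two agreeing links means a single chimeric clone cannot create a relationship on its own.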

14
Hierarchical scaffolding
  • Use most reliable links to build scaffolds
  • Repeatedly build super-scaffolds based on less
    reliable linking data
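
The idea of adding links level by level can be sketched in Python as follows. This is a simplified stand-in, not the Bambus implementation: it uses a union-find structure with orientation parity (a technique chosen here for brevity), tracks orientation only, and ignores distances. All names are hypothetical.

```python
class Scaffolds:
    """Union-find that records each contig's orientation relative to the
    root contig of its scaffold."""

    def __init__(self, contigs):
        self.parent = {c: c for c in contigs}
        self.flip = {c: False for c in contigs}  # reversed relative to parent?

    def find(self, c):
        """Return (root, True if c is reversed relative to the root)."""
        flipped = False
        while self.parent[c] != c:
            flipped ^= self.flip[c]
            c = self.parent[c]
        return c, flipped

    def add_link(self, a, b, reversal):
        """Try to apply a link; return False if it contradicts the
        orientations already fixed by earlier (more reliable) links."""
        root_a, flip_a = self.find(a)
        root_b, flip_b = self.find(b)
        if root_a == root_b:
            return (flip_a ^ flip_b) == reversal
        self.parent[root_b] = root_a  # merge the two scaffolds
        self.flip[root_b] = flip_a ^ flip_b ^ reversal
        return True


def hierarchical_scaffold(contigs, link_groups):
    """link_groups: lists of (a, b, reversal), most reliable group first."""
    scaffolds = Scaffolds(contigs)
    used, rejected = [], []
    for level, links in enumerate(link_groups):
        for a, b, reversal in links:
            target = used if scaffolds.add_link(a, b, reversal) else rejected
            target.append((level, a, b))
    return scaffolds, used, rejected
```

Because reliable links are applied first, a conflicting low-priority link is rejected instead of disturbing a scaffold that is already built, while a low-priority link joining two separate scaffolds extends them into a super-scaffold.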

15
Rationale
[Figure labels: problem complexity, problem size (nodes), error rate, hierarchical step]

16
Linking information
  • Overlaps
  • Mate-pair links
  • Similarity links
  • Physical markers
  • Gene synteny

[Figure labels: reference genome, physical map]

17
BAMBOO (BAMBUS)
Best-effort Attempt, Multiple Branches allowed, Order, Orient

18
Inputs
  • Set of contigs: names and lengths
  • Groups of contig links: groups correspond to the quality of the links
  • Each link: relative distance between contig origins and
    relative orientation of the contigs
  • Priorities for each group: specify the order in
    which links are considered

19
Outputs
  • XML representation of the layout: contig orientations,
    contig positions (x-coordinate of contig origin), and the
    links used to construct the layout
  • Graphical display of the layout: uses the Graphviz package
    from AT&T
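
As an illustration of the graphical output, the Python sketch below writes a layout as Graphviz DOT text. It is not the tool's actual output module: the function name, the input shapes, and the label format are assumptions. Only standard DOT syntax is used.

```python
def layout_to_dot(layout, links, name="scaffold"):
    """layout: {contig: (position, forward)}; links: iterable of (a, b).
    Returns DOT source with contigs listed in order of position."""
    lines = [f"digraph {name} {{", "  rankdir=LR;"]
    for contig, (position, forward) in sorted(layout.items(), key=lambda kv: kv[1][0]):
        strand = "+" if forward else "-"
        lines.append(f'  "{contig}" [label="{contig} ({strand}) @ {position}"];')
    for a, b in links:
        lines.append(f'  "{a}" -> "{b}";')
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` program, for example `dot -Tpng scaffold.dot -o scaffold.png`.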

20
1.0 release
  • XML input not yet supported
  • All scaffolds placed in the same output file
  • Only Linux executable released
  • Hacker friendly

21
Current release 2.33
  • XML input and more general output module
  • Collection of input modules from common assembly
    formats
  • Better handling of priority data
  • Repeat masking features
  • More platforms supported and source code released
    as open source
  • http://amos.sourceforge.net

22
Future enhancements
  • Option to generate un-ambiguous (non-branching)
    scaffolds
  • Better layout algorithms
  • Specialized drawing tools
  • Interactive browser
  • Represent/handle multiple haplotypes

23
Acknowledgements
  • Dan Kosack
  • Steven Salzberg
  • Martin Shumway
  • Hean Koo
  • Luke Tallon
  • Jessica Vamathevan