CSE182-L10

Slides: 40

Provided by: vineet50

Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

1
CSE182-L10

LW statistics/Assembly

2
Whole Genome Shotgun

Break up the entire genome into pieces
Sequence ends, and assemble using a computer
LW statistics and repeats argue against the success
of such an approach

Alternative: build a roadmap of the genome, with
physical clones mapped for each region. Sequence
each of the clones, and put them together.
3
Questions

Algorithmic: How do you put the genome back
together from the pieces?
Statistical: How many pieces do you need to
sequence, etc.?
The answer to the statistical questions had
already been given in the context of mapping, by
Lander and Waterman.

4
Lander Waterman Statistics

The fragments are falling randomly on the genome
Overlapping fragments form islands of contiguous
sequence.
Ideally, we want one island for each chromosome.
How many fragments should we sequence?

[Figure: fragments of length L falling on a genome of length G]
5
Lander Waterman Statistics
[Figure: fragments of length L on a genome of length G]
6
LW statistics questions

As the coverage c increases, more and more areas
of the genome are likely to be covered. Ideally,
you want to see 1 island.

Q1 What is the expected number of islands?
Ans: $N e^{-c\sigma}$
The number increases at first, and gradually
decreases.
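
The rise and fall of the island count can be checked numerically. This is a minimal sketch, assuming $\sigma = 1 - T/L$ (as on the later "Problem 2" slide) and $N = cG/L$; the function name and the default values for `G`, `L` and `T` are illustrative choices, not part of the lecture.

```python
import math

def expected_islands(c, G=3e9, L=500, T=50):
    """Expected number of islands, N * exp(-c * sigma).

    Assumptions: N = c * G / L fragments and sigma = 1 - T / L,
    where T is the overlap needed to detect that two fragments overlap.
    """
    sigma = 1 - T / L
    n_fragments = c * G / L
    return n_fragments * math.exp(-c * sigma)

if __name__ == "__main__":
    # The count grows while coverage is low, peaks near c = 1 / sigma,
    # and then falls as islands merge.
    for c in (0.5, 1, 2, 5, 10):
        print(f"c = {c:>4}: {expected_islands(c):,.0f} islands")
```

Under these assumptions the count is proportional to $c\,e^{-c\sigma}$, which peaks at $c = 1/\sigma$.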

7
Analysis Expected Number Islands

Computing Expected islands.
Let $X_i = 1$ if an island ends at position $i$, $X_i = 0$
otherwise.
Number of islands $= \sum_i X_i$
Expected islands $= E(\sum_i X_i) = \sum_i E(X_i)$

8
Prob. of an island ending at i
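[Figure: a clone of length L ending at position i; T is the overlap needed to detect that two clones overlap]

$E(X_i)$ = Prob(island ends at position $i$)
= Prob(a clone began at position $i-L+1$
AND no clone began in the next $L-T$ positions)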
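
A sketch of how this probability leads to the island count $N e^{-c\sigma}$, assuming each genome position independently starts a clone with probability $N/G$, and writing $c = LN/G$ and $\sigma = 1 - T/L$. The only approximation is $(1-p)^n \approx e^{-pn}$ for small $p$:

```latex
\begin{align*}
E(X_i) &= \frac{N}{G}\left(1-\frac{N}{G}\right)^{L-T}
        \approx \frac{N}{G}\,e^{-\frac{N}{G}(L-T)}
        = \frac{N}{G}\,e^{-c\sigma},\\
\sum_{i=1}^{G} E(X_i) &\approx G\cdot\frac{N}{G}\,e^{-c\sigma}
        = N e^{-c\sigma}.
\end{align*}
```

The exponent simplifies because $\frac{N}{G}(L-T) = \frac{LN}{G}\left(1-\frac{T}{L}\right) = c\sigma$.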

9
LW statistics

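Pr[Island contains exactly j clones]?
Consider an island that has already begun. With
probability $e^{-c\sigma}$, it will never be continued.
Therefore
Pr[Island contains exactly j clones] $= (1-e^{-c\sigma})^{j-1}\, e^{-c\sigma}$

Expected # of j-clone islands $= N e^{-c\sigma} \cdot (1-e^{-c\sigma})^{j-1}\, e^{-c\sigma} = N e^{-2c\sigma} (1-e^{-c\sigma})^{j-1}$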

10
Expected # of clones in an island

Expected # of clones in an island $= N / (N e^{-c\sigma}) = e^{c\sigma}$

Q: How? Why do we care?
Often, at the beginning of a genome project, we
do not know the length of the genome. This
equation helps us determine the length.
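
One way to read "this equation helps us determine the length" is sketched below. It assumes the Lander-Waterman result that the expected number of clones per island is $e^{c\sigma}$, with $c = LN/G$ and $\sigma = 1 - T/L$. The function name and parameters are hypothetical: observe the mean number of clones per island, solve for $c$, then solve for $G$.

```python
import math

def estimate_genome_length(n_clones, clone_len, min_overlap, mean_clones_per_island):
    """Estimate G from the observed mean number of clones per island.

    Assumes E[clones per island] = exp(c * sigma), with c = L * N / G and
    sigma = 1 - T / L, so c = ln(mean) / sigma and G = N * L / c.
    """
    sigma = 1 - min_overlap / clone_len
    coverage = math.log(mean_clones_per_island) / sigma
    return n_clones * clone_len / coverage
```

For example, with `n_clones=6e7`, `clone_len=500`, `min_overlap=50` and an observed mean of `math.exp(9)` clones per island, the estimated coverage is 10 and the estimated length is 3 × 10^9 bp.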
11
Expected length of an island
12
Whole Genome Sequencing Assembly
13
Whole Genome Shotgun

Break up the entire genome into pieces
Sequence ends, and assemble using a computer
LW statistics and repeats argue against the success
of such an approach

14
Assembly Basics

Three main components
Overlap
Layout
Consensus

15
Overlap

Given a pair of fragments s1 and s2, do they
belong together?

How would you compute such a match?

16
Overlap

$S[i,j]$ = optimum score of an alignment of
$s_1[1..i]$ against a suffix of $s_2[1..j]$

[Figure: dynamic-programming table indexed by i and j]

The best prefix-suffix alignment is given by

$\max_i S[i,n]$, where $n = |s_2|$
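
The recurrence behind this table can be sketched as a small dynamic program. Here `S[i][j]` is the best score of aligning the prefix `s1[:i]` against some suffix of `s2[:j]`, so row 0 is all zeros (the alignment may start anywhere in `s2`). The scoring values (`match`, `mismatch`, `gap`) and the function name are illustrative assumptions, not taken from the lecture.

```python
def best_prefix_suffix_overlap(s1, s2, match=1, mismatch=-1, gap=-1):
    """Return (score, i) maximizing S[i][n]: prefix s1[:i] vs. a suffix of s2."""
    m, n = len(s1), len(s2)
    # S[i][j]: best score of aligning s1[:i] with some suffix of s2[:j].
    # Row 0 stays 0: the empty prefix aligns with the empty suffix for free.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap  # s1[:i] against the empty string: all gaps
        for j in range(1, n + 1):
            diag = S[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    # The overlap must reach the end of s2, so only column n is considered.
    best_i = max(range(m + 1), key=lambda i: S[i][n])
    return S[best_i][n], best_i
```

With the default scores, `best_prefix_suffix_overlap("ACGTTT", "GGGACG")` returns `(3, 3)`: the prefix `ACG` of the first string matches the suffix `ACG` of the second.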

17
Overlap Detection

Compute the best prefix-suffix alignments between
each pair of fragments.
Keep the high-scoring ones as evidence of true
overlap.
What is the problem?

18
Overlap detection problem

Consider the number of fragments. The LW
statistics say that we need good coverage ($c = 8$ to
$10$) to get most of the base-pairs.
$G = 3000$ Mb, $L = 500$
Coverage $c = LN/G = 10$
$N = 10 \cdot 3 \times 10^9 / 500 = 6 \times 10^7$
Number of comparisons needed $= N^2 = 3.6 \times 10^{15}$
Not good! (Only a small fraction are true
overlaps)
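
The arithmetic above can be reproduced in a few lines. This is a small sketch with a hypothetical function name; it counts all N × N pairs, as the slide does.

```python
def reads_and_comparisons(genome_len, read_len, coverage):
    """Return (N, N^2): reads needed for the given coverage, and all-pairs comparisons."""
    n_reads = coverage * genome_len // read_len  # from c = L * N / G
    return n_reads, n_reads * n_reads

n, comparisons = reads_and_comparisons(3_000_000_000, 500, 10)
print(f"N = {n:.1e}, comparisons = {comparisons:.1e}")
```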

19
k-mer based overlap (Pigeonhole principle again)

Consider a 25bp sequence.
Expected number of occurrences in the genome
$= 3 \times 10^9 \cdot 4^{-25} \approx 2.7 \times 10^{-6}$
A 25-bp sequence is (almost certainly) unique in the genome!
Two overlapping sequences should share a 25-mer
Two non-overlapping sequences should not!
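
The uniqueness estimate can be checked directly. This sketch assumes a genome of independent, uniformly distributed bases, which real genomes (with their repeats) do not satisfy; the function name is illustrative.

```python
def expected_kmer_occurrences(genome_len, k):
    """Expected occurrences of one fixed k-mer in a uniformly random genome."""
    return genome_len / 4 ** k

for k in (12, 16, 20, 25):
    print(k, expected_kmer_occurrences(3_000_000_000, k))
```

The expectation drops below 1 at about k = 16 for a 3000 Mb genome, so k = 25 leaves a wide margin.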

20
Sorting k-mers

Build a list of k-mers that appear in the
sequences and their reverse complements
Create a record with 4 entries
K-mer
Sequence number
Position in the sequence
Reverse complementation flag
Sort a vector of these according to k-mer
How many records per k-mer are expected?
If number of records exceeds threshold, discard
(why?)
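
A minimal sketch of the record-and-sort idea follows. Each record is a tuple `(k-mer, sequence number, position, reverse-complement flag)`; sorting puts identical k-mers next to each other, so reads sharing a k-mer become candidate overlaps. The function names and the `max_records` threshold are assumptions. Discarding over-represented k-mers is shown here on the presumption that they mostly come from repeats and would generate many spurious pairs.

```python
from itertools import combinations, groupby

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def kmer_records(reads, k):
    """Sorted list of (kmer, read_id, position, is_reverse) records."""
    records = []
    for read_id, read in enumerate(reads):
        for is_reverse, seq in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                records.append((seq[pos:pos + k], read_id, pos, is_reverse))
    records.sort()  # tuples sort by k-mer first
    return records

def candidate_pairs(reads, k, max_records):
    """Pairs of read ids that share at least one non-discarded k-mer."""
    pairs = set()
    for _, group in groupby(kmer_records(reads, k), key=lambda r: r[0]):
        group = list(group)
        if len(group) > max_records:
            continue  # too frequent: presumed repeat, discard
        read_ids = sorted({record[1] for record in group})
        pairs.update(combinations(read_ids, 2))
    return pairs
```

For example, `candidate_pairs(["ACGTACGGA", "CGGATTTGC", "TTTTTTTTT"], 4, 10)` returns `{(0, 1)}`, because only the first two reads share a 4-mer (`CGGA`).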

[Figure: sorted table of records with columns K-mer, S.id, Pos.]
21
Alignment module

Coalesce k-mer hits into longer, gap-free partial
alignments.
These extended k-mer hits are saved.
For each pair of sequences, form a directed
graph.
For each maximal path in the graph, construct an
alignment.
Refine alignment via banded DP
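
The first step, coalescing k-mer hits into gap-free partial alignments, can be sketched as follows. Two hits lie in the same gap-free alignment when they sit on the same diagonal (`pos_a - pos_b`) and their k-mers overlap or abut. The input format and function name are assumptions; the graph construction and banded DP steps are not shown.

```python
from collections import defaultdict

def coalesce_hits(hits, k):
    """Merge k-mer hits (pos_a, pos_b) into gap-free segments.

    Returns a sorted list of (start_a, start_b, length) segments.
    """
    by_diagonal = defaultdict(list)
    for pos_a, pos_b in hits:
        by_diagonal[pos_a - pos_b].append(pos_a)

    segments = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        seg_start = prev = starts[0]
        for pos in starts[1:]:
            if pos > prev + k:  # uncovered bases between hits: close the segment
                segments.append((seg_start, seg_start - diagonal, prev + k - seg_start))
                seg_start = pos
            prev = pos
        segments.append((seg_start, seg_start - diagonal, prev + k - seg_start))
    return sorted(segments)
```

With `k = 4`, the hits `(5, 0)`, `(6, 1)`, `(7, 2)` merge into a single segment `(5, 0, 6)`, while a hit at `(20, 15)` on the same diagonal stays separate.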

22
Problem 2: Size

Islands might simply be too small in length
$\sigma = 1 - T/L = 1 - 50/500 = 0.9$, $c = 8$.
# Islands $= N e^{-c\sigma} \approx 45$K
Size of an island $\approx 54$K
Not enough to make it an acceptable assembly!
PLUS, there is the problem of Repeats, Chimerism,
etc.

23
Solution 2: Clones can have mate-pairs

Recall that we sequence about 1000bp of the end
of a clone
If we sequenced both ends, we get extra
information, particularly if we know the length
of the original clone.

24
Mate Pairs

Mate-pairs allow you to merge islands (contigs)
into super-contigs

25
Super-contigs are quite large

Make clones of truly predictable length. Ex: 3
sets can be used: 2Kb, 10Kb and 50Kb. The
variance in these lengths should be small.
Use the mate-pairs to order and orient the
contigs, and make super-contigs.

26
Whole genome shotgun

Input:
Shotgun sequence fragments (reads)
Mate pairs
Output:
A single sequence created by consensus of
overlapping reads
First generation of assemblers did not include
mate-pairs (Phrap, CAP..)
Second generation: CA, Arachne, Euler
We will discuss Arachne, a freely available
sequence assembler (2nd generation)

27
Problem 3: Repeats
28
Repeats & Chimerisms

40-50% of the human genome is made up of
repetitive elements.
Repeats can cause great problems in the assembly!
Chimerism causes a clone to be from two different
parts of the genome. This can again give a completely
wrong assembly.

29
Repeat detection

Lander Waterman strikes again!
The expected number of clones in a Repeat
containing island is MUCH larger than in a
non-repeat containing island (contig).
Thus, every contig can be marked as Unique, or
non-unique. In the first step, throw away the
non-unique islands.

30
Detecting Repeat Contigs 1: Read Density

Compute the log-odds ratio of two hypotheses:
H1: The contig is from a unique region of the
genome.
H2: The contig is from a region that is repeated at
least twice.
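
One possible form of this log-odds ratio is sketched below, under assumptions the slide does not state: reads arrive as a Poisson process, a unique contig of length ℓ expects λ = ℓN/G reads, and the repeat hypothesis is simplified to exactly two copies (expecting 2λ reads). The ratio then reduces to λ − k ln 2 for k observed reads. The function name is illustrative.

```python
import math

def unique_vs_repeat_log_odds(num_reads, contig_len, total_reads, genome_len):
    """ln[ P(k reads | unique) / P(k reads | two-copy repeat) ] under a Poisson model.

    With lam = contig_len * total_reads / genome_len:
      ln[ e^-lam * lam^k / (e^-2lam * (2*lam)^k) ] = lam - k * ln 2
    Positive values favor a unique contig, negative values favor a repeat.
    """
    lam = contig_len * total_reads / genome_len
    return lam - num_reads * math.log(2)
```

For a 10 Kb contig with N = 6 × 10^7 and G = 3 × 10^9, λ = 200: a contig holding about 200 reads scores positive (unique), while one holding about 400 reads scores negative (repeat).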

31
Detecting Chimeric reads

Chimeric reads: reads that contain sequence from
two genomic locations.
Good overlap: G(a,b) if a and b overlap with a high
score.
Transitive overlap: T(a,c) if G(a,b) and G(b,c).
Find a point x across which only transitive
overlaps occur. x is a point of chimerism.

32
Contig assembly

Reads are merged into contigs up to repeat
boundaries.
If (a,b) and (a,c) overlap, (b,c) should overlap as
well. Also,
shift(a,c) = shift(a,b) + shift(b,c)
Most of the contigs are unique pieces of the
genome, and end at some Repeat boundary.
Some contigs might be entirely within repeats.
These must be detected
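
The shift condition can be checked mechanically. In this sketch `shift[(a, b)]` is taken to be the offset of read b's start relative to read a's start. The dictionary layout, the function name, and the `tolerance` parameter (to absorb small indel-induced differences) are all assumptions.

```python
from itertools import permutations

def inconsistent_triples(shift, tolerance=0):
    """Return triples (a, b, c) where shift(a,c) != shift(a,b) + shift(b,c)."""
    reads = sorted({read for pair in shift for read in pair})
    bad = []
    for a, b, c in permutations(reads, 3):
        if (a, b) in shift and (b, c) in shift and (a, c) in shift:
            if abs(shift[(a, c)] - (shift[(a, b)] + shift[(b, c)])) > tolerance:
                bad.append((a, b, c))
    return bad
```

A triple reported here is a hint that one of the three overlaps is spurious, for example because it is induced by a repeat.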

33
Creating Super Contigs
34
Supercontig assembly

Supercontigs are built incrementally
Initially, each contig is a supercontig.
In each round, a pair of super-contigs is merged,
until no more merges can be performed.
Create a Priority Queue with a score for every
pair of mergeable supercontigs.
Score has two terms
A reward for multiple mate-pair links
A penalty for distance between the links.

35
Supercontig merging

Remove the top scoring pair (S1,S2) from the
priority queue.
Merge (S1,S2) to form supercontig T.
Remove all pairs in Q containing S1 or S2
Find all supercontigs W that share mate-pair
links with T and insert (T,W) into the priority
queue.
Detect Repeated Supercontigs and remove
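
The merging loop can be sketched with a heap as the priority queue. This is a simplified illustration, not Arachne's implementation. The score (number of links minus a penalty on the mean gap), the `min_links` and `gap_penalty` parameters, and the input format are assumptions. Stale queue entries are skipped on pop instead of being removed eagerly, and ordering, orientation, and repeat detection are omitted.

```python
import heapq
from collections import defaultdict

def merge_supercontigs(contigs, mate_links, min_links=2, gap_penalty=0.001):
    """Greedy supercontig merging.

    contigs: iterable of contig ids (strings).
    mate_links: list of (contig_a, contig_b, estimated_gap_bp).
    Returns the supercontigs as sorted lists of contig ids.
    """
    members = {c: [c] for c in contigs}  # supercontig id -> member contigs
    links = defaultdict(list)            # frozenset({s, t}) -> gap estimates
    for a, b, gap in mate_links:
        if a != b:
            links[frozenset((a, b))].append(gap)

    heap = []

    def push(pair):
        gaps = links[pair]
        if len(gaps) >= min_links:
            # Reward multiple links, penalize distance between them.
            score = len(gaps) - gap_penalty * sum(gaps) / len(gaps)
            s, t = sorted(pair)
            heapq.heappush(heap, (-score, s, t))

    for pair in list(links):
        push(pair)

    next_id = 0
    while heap:
        _, s1, s2 = heapq.heappop(heap)
        if s1 not in members or s2 not in members:
            continue  # stale entry: one side was already merged away
        merged = f"super{next_id}"
        next_id += 1
        members[merged] = members.pop(s1) + members.pop(s2)
        # Re-point links of S1 and S2 at the merged supercontig T.
        for pair in [p for p in links if s1 in p or s2 in p]:
            gaps = links.pop(pair)
            others = pair - {s1, s2}
            if others:
                (w,) = others
                links[frozenset((merged, w))].extend(gaps)
        for pair in [p for p in links if merged in p]:
            push(pair)
    return sorted(sorted(group) for group in members.values())
```

With contigs `c1` to `c4`, two links each for `c1`-`c2` and `c2`-`c3`, and a single link for `c3`-`c4`, the result is `[["c1", "c2", "c3"], ["c4"]]`: the single link does not meet `min_links`.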

36
Repeat Supercontigs

If the distance between two super-contigs is not
correct, they are marked as Repeated
If transitivity is not maintained, then there is
a Repeat

37
Filling gaps in Supercontigs
38
Consensus Derivation

Consensus sequence is created by converting
pairwise read alignments into multiple-read
alignments
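
A deliberately simple illustration of consensus calling: a per-column majority vote. It assumes the layout has already assigned each read an offset and that the reads are gap-free in layout coordinates, which a real multiple-read alignment would not require. The function name and input format are hypothetical.

```python
from collections import Counter

def consensus(placed_reads):
    """Majority-vote consensus from (offset, sequence) pairs.

    Columns covered by no read are reported as 'N'.
    """
    length = max(offset + len(seq) for offset, seq in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    return "".join(
        column.most_common(1)[0][0] if column else "N" for column in columns
    )
```

For example, `consensus([(0, "ACGTAC"), (2, "GTTCGG"), (4, "ACGGTT")])` returns `"ACGTACGGTT"`: the `T` error in the second read at layout position 4 is outvoted by the two reads that have `A` there.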

39
Summary

Whole genome shotgun is now routine
Human, Mouse, Rat, Dog, Chimpanzee..
Many Prokaryotes (One can be sequenced in a day)
Plant genomes: Arabidopsis, Rice
Model organisms: Worm, Fly, Yeast
A lot is not known about genome structure,
organization and function.
Comparative genomics offers low hanging fruit
