MAVID: Constrained Ancestral Alignment of Multiple Sequences

Slides: 39
Provided by: czo9
Transcript and Presenter's Notes

1
MAVID: Constrained Ancestral Alignment of
Multiple Sequences
  • Authors: Nicholas Bray and Lior Pachter

2
Outline
  • AVID
  • MAVID
  • Progressive alignment
  • Constraints
  • Tree Building
  • Experimental Results

3
AVID: A Global Alignment Program
  • Fast
  • Memory efficient
  • Practical for alignments of large genomic
    regions
  • Sensitive in finding homologous regions
  • Specific: avoids false-positive matches

4
Algorithm
  • Repeat Masking (Optional)
  • Finding Matches Using Suffix Trees
  • Anchor Selection
  • Recursion

5
[Flowchart of the AVID algorithm: repeat masking → match finding → anchor selection → enough anchors? If yes, split the sequences using the anchors and recurse; if no, base-pair alignment]
6
Repeat Masking (Optional)
  • RepeatMasker
    (http://ftp.genome.washington.edu/RM/RepeatMasker.html)
  • Repeat matches
  • Clean matches

7
Finding Matches Using Suffix Trees
8
Finding Matches Using Suffix Trees
  • Maximal repeated substring (match): a repeated
    substring such that no longer substring
    containing it is repeated in the string
  • Maximal matches between two sequences: pairs of
    matching subsequences whose flanking bases are
    mismatches
  • Transform
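
As a rough illustration of the definition above, the following Python sketch lists the maximal exact matches between two short sequences by brute force. It is not the suffix-tree method named on the slide, which is what makes the search practical on genomic-scale input; the function name `maximal_matches` and the `min_len` parameter are assumptions made for this example.

```python
def maximal_matches(s, t, min_len=1):
    """Return (i, j, length) for every maximal exact match between s and t.

    A match is maximal when it cannot be extended to the left or right,
    i.e. its flanking bases are mismatches (or a sequence end is reached).
    Brute force; intended for illustration on short strings only.
    """
    matches = []
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue
            # Skip start points that could still be extended to the left.
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue
            # Extend to the right for as long as the bases agree.
            length = 0
            while (i + length < len(s) and j + length < len(t)
                   and s[i + length] == t[j + length]):
                length += 1
            if length >= min_len:
                matches.append((i, j, length))
    return matches


print(maximal_matches("GACGTA", "TACGTC", min_len=3))
```

One common way to reduce the two-sequence problem to the single-string one (possibly what the "Transform" bullet refers to, though the slide does not say) is to concatenate the two sequences with a separator character and look for repeats that span both halves.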

9
[Figure labels: maximal repeated substring; maximal matches between two sequences; transform]
10
Anchor Selection
  • Eliminate noisy matches (those less than half the
    length of the longest match)
  • The remaining matches are ordered by priority:
    long clean > short clean > long repeat > short
    repeat
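
The filtering and ordering rule on this slide can be sketched in a few lines of Python. The slide does not say where the boundary between "long" and "short" lies, so `long_threshold` is a hypothetical parameter, and the dictionary layout of a match is likewise an assumption for this example.

```python
def order_candidates(matches, long_threshold):
    """Filter and rank candidate matches.

    matches: list of dicts with keys 'length' (int) and 'clean' (bool).
    long_threshold: hypothetical cutoff separating long from short matches.
    """
    if not matches:
        return []
    longest = max(m["length"] for m in matches)
    # Drop noisy matches: those shorter than half the longest match.
    kept = [m for m in matches if 2 * m["length"] >= longest]

    def rank(m):
        is_long = m["length"] >= long_threshold
        # 0: long clean, 1: short clean, 2: long repeat, 3: short repeat
        return (0 if m["clean"] else 2) + (0 if is_long else 1)

    # sorted() is stable, so ties keep their input order.
    return sorted(kept, key=rank)
```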

11
Anchor Selection
  • A variant of the Smith-Waterman algorithm (no
    overlapping)
  • Gap score: 0
  • Mismatch score: 8
  • Match score: [figure not preserved; surviving
    label: 10 bp]
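
One way to picture the non-overlapping selection step is as a chaining problem: pick the highest-scoring set of matches that occur in the same order in both sequences and do not overlap. The sketch below uses a simple quadratic dynamic program with no gap cost. It is a stand-in for illustration, not AVID's actual Smith-Waterman variant, and scoring each match by its length is an assumption.

```python
def chain_anchors(matches):
    """Pick a best-scoring collinear, non-overlapping chain of matches.

    matches: list of (i, j, length) with length >= 1, where i and j are
    start positions in the first and second sequence.
    """
    ms = sorted(matches)
    if not ms:
        return []
    best = [m[2] for m in ms]   # best chain score ending at each match
    prev = [-1] * len(ms)       # back-pointers for traceback
    for k, (ik, jk, lk) in enumerate(ms):
        for p in range(k):
            ip, jp, lp = ms[p]
            # Match p must end before match k starts in both sequences.
            if ip + lp <= ik and jp + lp <= jk and best[p] + lk > best[k]:
                best[k] = best[p] + lk
                prev[k] = p
    k = max(range(len(ms)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(ms[k])
        k = prev[k]
    return chain[::-1]
```
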
12
Recursion
13
Condition
  • There are still significant matches
      • Anchor set > 50% of the sequence length:
        recursion
      • Otherwise: Needleman-Wunsch algorithm
  • No significant matches
      • Short sequence (< 4 kb): Needleman-Wunsch
        algorithm
      • Long sequence: trivial alignment (gap)
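
The decision logic on this slide can be written as a small function. This is a sketch of the slide's rules only; the function name, argument names, and return labels are made up for the example, and 4 kb is taken as 4000 bases.

```python
def next_step(has_significant_matches, anchor_bases, seq_len, short_limit=4000):
    """Decide how to continue on one pair of subsequences.

    anchor_bases: number of bases covered by the selected anchor set.
    seq_len: length of the sequence being aligned.
    """
    if has_significant_matches:
        if anchor_bases > 0.5 * seq_len:
            return "recurse"            # split at the anchors and repeat
        return "needleman-wunsch"
    if seq_len < short_limit:
        return "needleman-wunsch"       # short enough for exact alignment
    return "trivial-gap"                # long and unmatched: align as a gap
```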

14
MAVID
  • Rapidly aligning multiple large genomic regions
  • Incorporating biologically meaningful heuristics
  • Sound alignment strategies

15
Method
  • Core: progressive ancestral alignment, which
    incorporates preprocessed constraints
  • Terminology
      • Match: a similar (not necessarily identical)
        region between two sequences
      • Constraint: a required order of positions in
        the alignment

16
Standard progressive alignment
  • Compute the distance matrix by aligning all pairs
    of sequences
  • Build a phylogenetic tree (guide tree) from the
    distance matrix
      • Cluster
      • Midpoint method
  • Progressively align the sequences according to
    the branching order in the guide tree
      • Aligning two alignments
      • An alignment is viewed as a sequence
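
To make the guide-tree step concrete, here is a minimal sketch that builds a tree from a distance matrix by average-linkage clustering (UPGMA). The slide only says "Cluster", so the choice of UPGMA is an assumption, and midpoint rooting is not shown. The tree is returned as nested tuples, and progressive alignment would then visit it from the leaves upward.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree by average-linkage clustering.

    names: list of sequence labels.
    dist: dict mapping every unordered pair (a, b) of labels to a distance.
    Returns the tree as nested 2-tuples.
    """
    sizes = {name: 1 for name in names}              # cluster -> leaf count
    d = {frozenset(pair): v for pair, v in dist.items()}
    while len(sizes) > 1:
        # Join the two closest clusters.
        x, y = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (x, y)
        nx, ny = sizes.pop(x), sizes.pop(y)
        # Distance to the new cluster is the size-weighted average.
        for z in sizes:
            d[frozenset((merged, z))] = (
                nx * d[frozenset((x, z))] + ny * d[frozenset((y, z))]
            ) / (nx + ny)
        sizes[merged] = nx + ny
    return next(iter(sizes))
```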

17
Method
18
Key difference
  • Instead of aligning alignments, we first infer
    the ancestral sequences of the alignments using
    maximum-likelihood estimation within a
    probabilistic evolutionary model
  • Maximum-likelihood estimation: a popular
    statistical method for making inferences about
    the parameters of the probability distribution
    underlying a given data set
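
As a toy illustration of maximum-likelihood ancestral inference, the sketch below picks the most likely ancestral base for a single alignment column with two descendants. The slides do not specify MAVID's evolutionary model, so the Jukes-Cantor (JC69) substitution model is used here purely as an assumption, with branch lengths measured in expected substitutions per site.

```python
import math


def jc69_prob(same, t):
    """JC69 probability of observing the same (or one specific different)
    base after a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def ml_ancestral_base(x, y, t_x, t_y, alphabet="ACGT"):
    """Most likely ancestral base given child bases x, y and branch lengths.

    The prior over ancestral bases is uniform, so it does not affect
    the argmax and is omitted.
    """
    def likelihood(a):
        return jc69_prob(a == x, t_x) * jc69_prob(a == y, t_y)

    return max(alphabet, key=likelihood)
```

When the two children disagree, the base on the shorter branch wins, which matches the intuition that the closer descendant is more informative about the ancestor.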

19
Key difference
  • The ancestral sequences are then aligned with
    AVID
  • The scores of the Smith-Waterman step are
    assigned according to the branch lengths of the
    two alignments
  • The alignment of the ancestral sequences is then
    used to glue the two alignments together. Gaps
    in the ancestral sequences lead to gaps in the
    multiple alignment
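
The gluing step can be sketched as follows. This example assumes that each ancestral sequence has exactly one residue per column of the alignment it was inferred from, which is a simplification made here and not something the slides state; the function name `glue_alignments` is also made up.

```python
def glue_alignments(aln_a, aln_b, anc_a, anc_b, gap="-"):
    """Merge two alignments using the pairwise alignment of their ancestors.

    aln_a, aln_b: lists of equal-length rows (strings).
    anc_a, anc_b: the aligned ancestral sequences (same length, with gaps).
    """
    if len(anc_a) != len(anc_b):
        raise ValueError("aligned ancestral sequences must have equal length")
    out_a = [[] for _ in aln_a]
    out_b = [[] for _ in aln_b]
    col_a = col_b = 0
    for ca, cb in zip(anc_a, anc_b):
        # A gap in an ancestor becomes a gap column for all its descendants.
        for row, out in zip(aln_a, out_a):
            out.append(gap if ca == gap else row[col_a])
        for row, out in zip(aln_b, out_b):
            out.append(gap if cb == gap else row[col_b])
        if ca != gap:
            col_a += 1
        if cb != gap:
            col_b += 1
    return ["".join(r) for r in out_a + out_b]


print(glue_alignments(["ACGT", "A-GT"], ["ACT"], "ACGT", "AC-T"))
```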

20
[Figure: ancestral sequences A and B, inferred from alignments A and B, are aligned with AVID]
21
AVID with preprocessed data
  • Gene predictions using GENSCAN
  • Protein alignments using BLAT
  • Finding exon matches without using a suffix tree
  • In addition, the exon matches can be used to
    shape the final multiple alignment

22
MAVID(Constraints, Tree building, and
Experimental results)
  • Speaker ???
  • 2005/12/07

23
Constraints(1/3)
  • Notation: a_i ≤ b_j
  • This means that position i in sequence a must
    appear before position j in sequence b in the
    multiple sequence alignment.

24
Constraints(2/3)
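[Figure: sequences a, c, and b, with position a_i marked in a, positions c_x and c_y in c, and position b_j in b]
If a_i ≤ c_x, c_y ≤ b_j, and x ≤ y, then
a_i ≤ c_x ≤ c_y ≤ b_j, and so a_i ≤ b_j
by transitivity.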
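
The transitivity rule can be expressed directly in code. In this sketch a constraint is a tuple (sequence, position, sequence, position), read as "the first position must not come after the second"; this tuple layout and the function name are assumptions made for the example.

```python
def infer_constraint(first, second):
    """Combine two constraints through a shared middle sequence.

    first:  (a, i, c, x), meaning position i of a precedes position x of c.
    second: (c, y, b, j), meaning position y of c precedes position j of b.
    Returns (a, i, b, j) if it is implied by transitivity, else None.
    """
    a, i, c_first, x = first
    c_second, y, b, j = second
    # The chain only closes if both refer to the same sequence c and
    # position x does not come after position y within it.
    if c_first == c_second and x <= y:
        return (a, i, b, j)
    return None
```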
25
Constraints(3/3)
  • The above information can be used in the
    alignment of the ancestral sequences by requiring
    potential anchors between the sequences to
    satisfy the constraints.

26
Prime Constraints(1/4)
  • Consider every triplet of sequences (a, b, c)
    with a in u, b in v, and c not in x.
  • Every triplet can provide potential constraints
    for the alignment.
  • If there are n sequences, there are O(n^3) such
    triplets.

[Figure: guide-tree node x with subtrees u and v. Caption: Too many constraints!]
27
Prime Constraints(2/4)
  • Actually, we don't need to find all possible
    constraints, many of which will be redundant.
  • Instead, we wish to find a set of prime
    constraints: a set in which no constraint is
    implied by the others.
  • Such a set can be inferred from the homology map.

28
Illustration
29
Prime Constraints(3/4)
  • If there are m sets of orthologous exons, then at
    node x there can be at most O(m) prime
    constraints.
  • The sets of all prime constraints can be found in
    O(mk^2) time, where k is the number of leaves
    below x.

30
Prime Constraints(4/4)
  • Matches between the ancestral sequences that are
    inconsistent with this set of constraints can be
    filtered out in time O(N log m), where N is the
    total number of matches.
  • For typical values of m and k, the time taken
    computing and utilizing the constraints is
    negligible.
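
One possible way to reach an O(N log m) filter is to sort the constraints once and answer each match with a binary search. The sketch below handles only constraints of one direction (a position in the first ancestral sequence must not come after a position in the second) and treats each match as a single pair of positions; both simplifications, and all names, are assumptions for illustration, not the method from the talk.

```python
from bisect import bisect_left
from math import inf


def filter_matches(matches, constraints):
    """Keep the matches that violate no constraint.

    matches: list of (s, t), pairing position s in U with position t in V.
    constraints: list of (p, q), meaning position p in U must not come
    after position q in V in the final alignment.
    A match (s, t) violates (p, q) when s <= p and t > q.
    """
    cons = sorted(constraints)
    ps = [p for p, _ in cons]
    # suffix_min[k] = smallest q among constraints k, k+1, ...
    suffix_min = [inf] * (len(cons) + 1)
    for k in range(len(cons) - 1, -1, -1):
        suffix_min[k] = min(cons[k][1], suffix_min[k + 1])
    kept = []
    for s, t in matches:
        k = bisect_left(ps, s)          # first constraint with p >= s
        if t <= suffix_min[k]:
            kept.append((s, t))
    return kept
```

Sorting costs O(m log m) once, and each of the N matches is then checked with a single binary search.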

31
Tree Building(1/3)
  • Most multiple alignment programs require pairwise
    alignments of all the sequences to build an
    initial guide tree (a quadratic number of
    sequence alignments).
  • We utilize an iterative method to obtain a guide
    tree using only a linear number of alignments.

32
Tree Building(2/3)
  • The initial guide tree is selected randomly from
    the set of complete binary trees.
  • The sequences are aligned using this random tree,
    and then a phylogenetic tree is inferred from the
    resulting multiple alignment.
  • The above process is iterated until the alignment
    and tree are satisfactory.
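
The iteration described on this slide might be organized as in the sketch below. The alignment and tree-inference steps are passed in as callables because the slides do not give their details; `align_with_tree`, `infer_tree`, and the fixed iteration count are all hypothetical. The random tree is built by repeatedly joining two randomly chosen subtrees, which is one simple way to draw a random binary tree and may not match the distribution used in the talk.

```python
import random


def random_binary_tree(leaves, rng=random):
    """Join randomly chosen subtrees until one nested-tuple tree remains.

    Expects at least one leaf.
    """
    nodes = list(leaves)
    while len(nodes) > 1:
        a = nodes.pop(rng.randrange(len(nodes)))
        b = nodes.pop(rng.randrange(len(nodes)))
        nodes.append((a, b))
    return nodes[0]


def refine_guide_tree(seqs, align_with_tree, infer_tree, iterations=3, rng=random):
    """Alternate between aligning along a tree and re-inferring the tree.

    align_with_tree(seqs, tree) -> alignment   (hypothetical callable)
    infer_tree(alignment) -> tree              (hypothetical callable)
    """
    tree = random_binary_tree(seqs, rng)
    alignment = None
    for _ in range(iterations):
        alignment = align_with_tree(seqs, tree)
        tree = infer_tree(alignment)
    return alignment, tree
```

In practice the loop would stop on a convergence test (for example, when the inferred tree no longer changes) instead of after a fixed number of rounds.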

33
Tree Building(3/3)
  • Instead of computing all pairwise alignments,
    only O(nk) alignments are necessary to perform n
    iterations with k sequences.
  • We found that for typical alignment problems,
    only a small number of iterations were necessary.
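
A quick count shows where the saving comes from. A binary guide tree over k sequences has k - 1 internal nodes, so, assuming one pairwise alignment per internal node, each iteration costs k - 1 alignments. The iteration count of 3 below is an arbitrary value chosen for the example.

```python
def all_pairs_alignments(k):
    """Pairwise alignments needed to fill a full distance matrix."""
    return k * (k - 1) // 2


def iterative_alignments(k, n):
    """Alignments for n passes over a binary guide tree with k leaves."""
    return n * (k - 1)


k = 21  # number of sequences in the second experiment
print(all_pairs_alignments(k), iterative_alignments(k, n=3))
```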

34
Experimental Results 1
  • A human, mouse, and rat whole-genome multiple
    alignment.
  • A homology map for the genomes was built by C.
    Dewey, and was used to generate gene anchors and
    constraints.
  • Chromosome 20 was chosen because it aligns almost
    completely with mouse chromosome 2.

35
Experimental Results 1 (cont.)
Coverage of human chromosome 20 RefSeq exons by
the MAVID alignments. Of a total of 3927 exons,
only six were not in the homology map. A total of
53.5% of the exons were covered by precomputed
exon anchors in either mouse or rat. The
remaining exons are mostly aligned by MAVID,
resulting in 93.6% of the exons covered by
alignment in either mouse or rat.
36
Experimental Results 2
  • Alignment of 21 Organisms
  • We aligned 1.8 Mb of human sequence together with
    the homologous regions from 20 other organisms,
    for a total of 23 Mb of sequence.
  • Baboon, cat, chicken, chimp, cow, dog, dunnart,
    fugu, hedgehog, horse, lemur, macaque, mouse,
    opossum, pig, platypus, rabbit, rat, tetraodon,
    and zebrafish.

37
Experimental Results 2(cont.)
  • The MAVID alignments were compared with MLAGAN,
    version 1.1 (Brudno et al. 2003).
  • MLAGAN is the only other program we know of that
    is able to align the 21 sequences in a reasonable
    period of time.

38
Experimental Results 2(cont.)
  • MAVID and MLAGAN both aligned sequences
    correctly.
  • MAVID took 40 min, while MLAGAN took roughly 6 h.