Applications of Exhaustive Search - PowerPoint PPT Presentation

1 / 35
About This Presentation
Title:

Applications of Exhaustive Search

Description:

Like the detective in the Edgar Allan Poe story, we can imagine be confronted ... This is similar to the Edgar Allan Poe story, except the pirate was drunk when ... – PowerPoint PPT presentation

Number of Views:94
Avg rating:3.0/5.0
Slides: 36
Provided by: randyz
Category:

less

Transcript and Presenter's Notes

Title: Applications of Exhaustive Search


1
Lecture 4
  • Applications of Exhaustive Search
  • Jones Pevzner, Secs. 4.1-4.6

2
Restriction Mapping Problem
  • Restriction endonucleases cut DNA at particular
    recognition sequences
  • Before the evolution of DNA sequencing methods,
    restriction digests provided limited sequence
    information about a sample of DNA.
  • By varying experimental conditions, you can
    perform a partial or a complete digest - maximal
    information is gained by performing a partial
    digest.

3
Formalizing the problem
0
2
4
10
7
End- point
End- point
Restriction sites
Multiset - formed of all pairwise distances
between sites
For the example above,
4
Inverse problem
  • Given the multiset, can we predict the
    distribution of restriction sites along the
    sample of DNA?
  • This maps to the Turnpike problem of Computer
    Science - given the set of all distances between
    exits on the turnpike, can you determine the
    distribution of exits along the highway (i.e. the
    roadmap)?

5
Solutions are not always unique
  • Add a set increment to all the site locations,
    and you get the same digest - symbolizedAlso
    the reflectionEquivalent restriction site
    maps (in the sense of leading to the same partial
    digest) are said to be homometric.

6
A quick way to generate two homometric site sets
Given sets U and V, then
are guaranteed to be homometric
(see example on p. 87)
7
A brute-force (exhaustive) algorithm
Given a multiset L of m distances, and n sites
(where m n(n-1)/2 BRUTEFORCEPDP( L, n )M lt-
maximal element in L for (every set of n-2
integers, 0ltx2ltltxn-1ltM and xi in L) X lt-
0, x2,,xn-1,M Form DeltaX from X if(
DeltaX L ) return X output no solution
8
Time complexity of BRUTEFORCEPDP( L, n )
Assume that every pass through the loop takes
about the same time Then the exec time scales as
According to Jones Pevzner, this goes as
O(n2n-4). (I have not checked that!) Is this a
tractable algorithm?
9
A practical algorithm for PDP
Due to Skiena (1990). Work through the example
from text Given the multiset L
2,2,3,3,4,5,6,7,8,10, find the vector of sites
X. First, the width of the fragment is clearly
10 WIDTH 10. X (0, 10). L 2,2,3,3,4,5,6,7,8
What is the next biggest fragment? Its 8. The
next site must be at 8 or 2 (WIDTH - 8). At this
stage the problem is still symmetric, so lets
choose X(0,2,10). My new site adds fragments 2,
8 - remove them. L 2,3,3,4,5,6,7
10
Skiena PDP example (cont)
Where we are X(0,2,10). L 2,3,3,4,5,6,7 The
next biggest fragment is 7. Possible positions
for the next site are 7, and WIDTH - 7 3. Try
3 New fragments generated against my current set
of sites will include one of size 1 thats not
in L, so I reject this choice. Try 7 New
fragments generated are size 7, 5, 3. This choice
is OK X (0, 2, 7, 10), L 2,3,4,6 Next
biggest fragment is 6 possible positions are 6,
and WIDTH - 6 4. Can we immediately reject the
former?
11
Skiena PDP example (cont)
Again, we currently have X (0, 2, 7, 10), L
2,3,4,6 Try site at position 4 New fragments
generated are of sizes 4, 2, 3, 6. All of these
are in L, so now we have X (0, 2, 4, 7, 10), L
L is empty, so were finished!
12
Pseudocode for the Skiena algorithm
This is implemented as a recursive algorithm.Why
might you expect that??
PARTIALDIGEST(L) WIDTH lt- max element in
L DELETE(WIDTH,L) X lt- (0, WIDTH) PLACE(L, X)
PLACE on next page
13
PLACE() subroutine
PLACE(L,X) if L empty output X return y lt-
max element in L if DELTA(y,X) contained in
L ADD(y,X) DELETE(DELTA(y,X),
L) PLACE(L,X) DELETE(y,X) ADD(DELTA(y,X),
L) if DELTA(WIDTH-y,X) contained in
L ADD(WIDTH-y,X) DELETE(DELTA(WIDTH-y,X),L)
PLACE(L,X)return
14
Time complexity for Skiena
Complex to analyze - depends on how often
alternatives have to be considered. Best case -
quadratic Worst case - exponential
15
Finding motifs
  • Like the detective in the Edgar Allan Poe story,
    we can imagine be confronted with a text that has
    been encoded using a strange character set.
    Following the detective, we might expect that
    combinations of characters that occur more
    frequently than expected by chance might be
    meaningful.
  • The simplest approach is to simply keep count of
    all words of some selected size L that appear in
    the text. We call these
  • words, or
  • L-mers, or
  • L-tuples

16
Example - counting words
SEQUENCE ACTGTCAAAAGTCATCAAAATCG ACTG - 1 CTGT -
1 TGTC - 1 GTCA - 2 TCAA - 2 CAAA - 2 AAAA -
2 AAAG - 1 AAGT - 1 AGTC - 1 TCAT - 1 CATC -
1 ATCA - 1 AAAT - 1 AATC - 1 ATCG - 1 How did I
generate this? How would you do this in a C
program?
17
Its not that easy!
  • Motifs in DNA sequences are usually NOT strictly
    conserved! You cant just count words. This is
    similar to the Edgar Allan Poe story, except the
    pirate was drunk when he encoded the message, and
    didnt spell very well to begin with.
  • Consider the table of NF-kappaB binding sites
    (Table 4.2)

18
How to define motifs/profiles
Best done via an alignment. We imagine a set of
aligned sequences from which we can extract a
window that is L nucleotides wide. If there are
t sequences, then we have a profile matrix that
has t rows and L columns. We can count how many
As, Ts, Gs and Cs appear in each column, which
provides a measure of the variability along the
profile. It also lets us define a consensus
sequence.
19
NF-kappaB Example
etc T T T G
G G A G T C C C T C G G T G A T
T T T C T A G G G G A A G A C C
---------------------------------- A2 3 1 0
0 1 16 3 1 2 4 1 T12 3 1 0 4 1 0 9
15 11 5 6 G1 0 16 18 14 16 1 1 1 0 0
0 C3 12 0 0 0 0 1 5 1 5 9 11
CONSENSUS T C G G G G A T T T C C
20
A way to visualize consensus sequence - sequence
logo
T. D. Schneider and R. M. Stephens, Sequence
Logos A New Way to Display Consensus Sequences
Nucl. Acids Res. 18 6097-6100, 1990.
21
How do we make the alignments?
  • We dont know the motifs in advance so, we need
    to construct ALL alignments of length L.
  • Assume we have t sequences to align We can
    construct a vector S of t integers that specifies
    a starting position in each sequence for the
    L-mer to be extracted from that sequence. The
    L-mers are then assembled to make a profile
    matrix.
  • By generating all possible vectors S, we make all
    possible profiles of width L.

22
Now weve made all the profiles how do we rank
them?
  • Interesting profiles depart from uniformity
    somewhere along their length if the consensus is
    weak, the profile might represent random
    variation.
  • One way to quantitate the strength of a profile
    is via a profile score.

23
Computing a Profile Score
  • Let P(S) be the profile matrix corresponding to
    positions vector S, and MP(s)(j) be the maximum
    character count in the jth column of the profile.
    Then The maximum score possible is tL. The
    worst is tL/4 (equal mix of characters at each
    position in the profile.

24
Motif-Finding Problem
  • Given a set of DNA sequences, find a set of
    L-mers, one from each sequence, that maximimzes
    the consensus score.
  • Clearly, this can be done by exhaustive search!

25
Another angle on the problem
  • Compute the Hamming distance dH between two
    strings of equal length as the number of
    positions that differ between the two
    strings dH(ATTGTC,ACTCTC) 2
  • Given a profile defined by S, we define the total
    Hamming profile distance between an L-mer v and
    the aligned profilewhere si is the ith
    sequence in the profile.

26
Total Hamming Distance
  • For a given L-mer v, and a set of t sequences,
    the Total Hamming Distance is the minimum Hamming
    Profile distance for v, taken over all possible
    profiles (choices for the vector S).
    Symbolically,

27
The Median String Problem
  • Given a set of DNA sequences, find a string v of
    length L which has minimum Total Hamming
    Distance. This is the median string.
  • Presents a double minimization problem -
  • For a given string v, find the motif with minimal
    Hamming Profile Distance
  • Minimize the minimal Hamming Profile Distance
    over all choices for v.

28
The Motif-Finding and Median String Problems are
Equivalent!
First, we observe that the consensus string for a
given profile IS ALSO the string with minimum
Hamming Profile distance.Second, we observe
that the Hamming profile distance and the profile
score are linked. If w is the consensus sequence
and v is an arbitrary L-mer, So, maximizing
the profiles score is the same as minimizing the
Total Hamming Distance!
29
Control Flow,Functions Program Structures
  • Kernighan Ritchie
  • Chapts. 3 4

30
Control elements
  • Statement blocks
  • Curly braces are used to group statements into
    units they become syntactically equivalent to a
    single statement.
  • Blocks are used with if-else, while, for and
    other control statements

31
if-else constructs
  • Please follow the advice of the text and ALWAYS
    enclose clauses in curly braces neglecting to do
    this is a frequent source of hard-to-find errors.

32
switch statement
  • This statement is preferred for selecting from a
    large number of options.
  • Be sure to understand how execution falls
    through after a matching case block is executed,
    and the necessity of using break to escape this
    behavior.

33
for and while loops
  • Note that these constructs are very flexible
    often a matter of personal preference which to
    use. Be sure to understand
  • The three fields of the for statement
  • The roles of break and continue
  • Possibilities of using commas to separate
    statements (NOT recommended in general!)

34
goto
  • In general, dont.
  • That said, in some situations you need to do
    something silly to get around it. It should not
    be thought of as forbidden, just not recommended.

35
Functions
  • Functions allow programs to be broken up into
    functional pieces they may take arguments and
    may return values.
  • Arguments must be supplied in a specified order.
  • The prototype of a function declares the list of
    expected arguments and also the return type of
    the function.
  • If a prototype is declared outside of any
    function in a file, it is globally available to
    all the functions that appear after it.
  • Prototypes for library functions are declared in
    header files, like stdio.h.
Write a Comment
User Comments (0)
About PowerShow.com