Search algorithms (Algorismes de cerca)

Slides: 28
Provided by: lcl53
Learn more at: https://www.cs.upc.edu
1
Search algorithms
Search algorithms: problem definition
(text, pattern)
depends on what we know at the outset
  • Exact search
  • Only the text -> Structure the text (suffix
    tree)
  • Only the pattern(s) -> Structure the
    pattern(s)
  • 1 pattern -> The algorithm depends on the length and
    ?
  • k patterns -> The algorithm depends on the number k,
    the length and ?
  • Extensions
  • Regular expressions
  • Approximate search

depends on the length of the pattern
  • Probabilistic search

2
2.2 Pairwise alignment
Given two DNA sequences $A = a_1 a_2 \dots a_n$
and $B = b_1 b_2 \dots b_m$ over the alphabet $\{a, c, t, g\}$, we
say that $A'$ and $B'$ over $\{a, c, t, g, -\}$ are aligned
iff
  • $A'$ and $B'$ become $A$ and $B$ if the gaps ($-$) are
    removed.
  • $|A'| = |B'|$
  • For all $i$, it is not possible that $a'_i = b'_i = -$.
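The three conditions can be checked mechanically. The sketch below is a minimal illustration, not part of the slides; it reads the conditions as "removing gaps recovers the originals", "equal length", and "no column made of two gaps". The function name `is_alignment` is hypothetical.

```python
GAP = "-"


def is_alignment(a_aligned: str, b_aligned: str, a: str, b: str) -> bool:
    """Check whether two gapped strings form an alignment of a and b."""
    # 1) Removing the gaps must give back the original sequences.
    if a_aligned.replace(GAP, "") != a or b_aligned.replace(GAP, "") != b:
        return False
    # 2) Both gapped sequences must have the same length.
    if len(a_aligned) != len(b_aligned):
        return False
    # 3) No column may consist of two gaps.
    return all(not (x == GAP and y == GAP) for x, y in zip(a_aligned, b_aligned))


print(is_alignment("--AC", "CTAC", "AC", "CTAC"))  # True
print(is_alignment("A-C", "C-T", "AC", "CT"))      # False: the middle column is gap/gap
```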

MALIG (an example)
How many alignments of two sequences exist?
Which is the best alignment?
3
2.2 Number of alignments
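Given two DNA sequences $A = a_1 a_2 \dots a_n$
and $B = b_1 b_2 \dots b_m$, the number of alignments, written here as $N$, satisfies
$N(a_1 \dots a_n, b_1 \dots b_m) =$
  $N(a_1 \dots a_{n-1}, b_1 \dots b_m)$ (those that end with $(a_n, -)$)
  $+\, N(a_1 \dots a_n, b_1 \dots b_{m-1})$ (those that end with $(-, b_m)$)
  $+\, N(a_1 \dots a_{n-1}, b_1 \dots b_{m-1})$ (those that end with $(a_n, b_m)$)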
7
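2.2 Number of alignments
Given two DNA sequences $A = a_1 a_2 \dots a_n$
and $B = b_1 b_2 \dots b_m$, the number of alignments, written here as $N$, satisfies
$N(a_1 \dots a_n, b_1 \dots b_m) = N(a_1 \dots a_{n-1}, b_1 \dots b_m) + N(a_1 \dots a_n, b_1 \dots b_{m-1}) + N(a_1 \dots a_{n-1}, b_1 \dots b_{m-1})$,
where the three terms count the alignments that end with $(a_n, -)$, $(-, b_m)$ and $(a_n, b_m)$ respectively.
Values for prefixes of length 0 to 3 (rows: length of the prefix of A; columns: length of the prefix of B):

         0    1    2    3
    0    1    1    1    1
    1    1    3    5    7
    2    1    5   13   25
    3    1    7   25   63

But what is the asymptotic value?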
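The table can be reproduced directly from the recurrence. The following is a minimal sketch, assuming exactly one alignment exists when either sequence is empty (the row and column of 1s on the slide). The name `count_alignments` is hypothetical.

```python
def count_alignments(n: int, m: int) -> int:
    """Number of alignments of two sequences of lengths n and m."""
    # table[i][j] = number of alignments of prefixes of lengths i and j.
    # Row 0 and column 0 stay at 1: only gaps can be used there.
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (
                table[i - 1][j]        # alignments ending with (a_i, -)
                + table[i][j - 1]      # alignments ending with (-, b_j)
                + table[i - 1][j - 1]  # alignments ending with (a_i, b_j)
            )
    return table[n][m]


for n in range(4):
    print([count_alignments(n, m) for m in range(4)])
# [1, 1, 1, 1]
# [1, 3, 5, 7]
# [1, 5, 13, 25]
# [1, 7, 25, 63]
```

Only the lengths matter for the count, which is why the function takes `n` and `m` rather than the sequences themselves.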
8
2.2 Asymptotic value
As the number of alignments
$N(a_1 a_2 \dots a_n, b_1 b_2 \dots b_n)$ ...
and
$n! \approx n^n e^{-n}$ (Stirling approximation)
then
$N(a_1 a_2 \dots a_n, b_1 b_2 \dots b_n) > 2^{2n}$
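The intermediate expression on this slide did not survive in the transcript. One possible route to an estimate of this size, offered as an assumption and not as the slide's own argument, is to count only the alignments built from gap columns: interleaving $n$ columns of the form $(a_i, -)$ with $n$ columns of the form $(-, b_j)$ can be done in $\binom{2n}{n}$ ways. The symbol $N(n,n)$ below is shorthand for the number of alignments of two sequences of length $n$.

```latex
N(n,n) \;\ge\; \binom{2n}{n} \;=\; \frac{(2n)!}{(n!)^{2}}
\;\approx\; \frac{(2n)^{2n}\, e^{-2n}}{\left(n^{n} e^{-n}\right)^{2}} \;=\; 2^{2n}
```

The crude form of Stirling's approximation used here drops the factor $\sqrt{2\pi n}$, so the result should be read as an order-of-magnitude estimate for large $n$. The count for $n = 3$ is 63, just under $2^{6} = 64$.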
9
2.2 Best alignment
How can an alignment be scored?
catcactactgacgactatcgtagcgcggctatacatctacgccaa-ctac-t-gtgtagatcgccgg
c-tgactgc--acgactatcgt-attgcggctacacactacgcacaactactgtatgtcgc-cgg----


Then we assign a score to each case (match, mismatch, gap), for example
1, -1, -2.
How can the best alignment be found?
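Scoring a given alignment is a column-by-column sum. The sketch below assumes the three example values mean match = 1, mismatch = -1 and gap = -2; that assignment is an interpretation of the slide, and the names are hypothetical.

```python
MATCH, MISMATCH, GAP_PENALTY = 1, -1, -2  # assumed reading of the example values


def score_alignment(a_aligned: str, b_aligned: str) -> int:
    """Sum the column scores of two gapped sequences of equal length."""
    if len(a_aligned) != len(b_aligned):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a_aligned, b_aligned):
        if x == "-" or y == "-":
            total += GAP_PENALTY  # one of the two symbols is a gap
        elif x == y:
            total += MATCH
        else:
            total += MISMATCH
    return total


print(score_alignment("--AC", "CTAC"))  # two gaps, two matches: -2 - 2 + 1 + 1 = -2
```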
10
2.2 Edit distance and alignment of strings
The best alignment of two strings
is related to the edit distance, first
discussed in 1966...
The most efficient algorithm was proposed in
1968 and in 1970,
using the technique called dynamic programming.
11
2.2 Best alignment
[Matrix figure: columns C T A C T A C T A C G T; rows A C T G A]
13
2.2 Best alignment
[Matrix figure: columns C T A C T A C T A C G T; rows A C T G A]
The cell contains the score of the best
alignment of AC and
CTACT.
14
2.2 Best alignment
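[Matrix figure: columns C T A C T A C T A C G T; rows A C T G A. The first row starts 0, -2, -4, -6, -8, ...; the first column continues with -2 (A), -4 (C), -6 (T), ...]
BA(AC, CTAC): the best alignment of AC and CTAC
$s(AC, CTAC) = \max \{\, s(AC, CTA) - 2,\; s(A, CTA) + 1,\; s(A, CTAC) - 2 \,\}$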
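The cell-by-cell rule can be sketched as a short dynamic-programming routine. This is a minimal illustration under the assumed scores (match 1, mismatch -1, gap -2) and not the alggen tool's implementation. The name `global_score` is hypothetical.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best global alignment of a and b (linear gap model)."""
    n, m = len(a), len(b)
    # s[i][j] = best score for aligning the prefixes a[:i] and b[:j].
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap  # a[:i] against the empty sequence
    for j in range(1, m + 1):
        s[0][j] = j * gap  # the empty sequence against b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            s[i][j] = max(diag, s[i - 1][j] + gap, s[i][j - 1] + gap)
    return s[n][m]


print(global_score("AC", "CTAC"))  # -2, reached through s(A, CTA) + 1 = -3 + 1
```

Each cell depends only on its left, upper and upper-left neighbours, which is what keeps the cost proportional to the product of the two lengths.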
15
Best alignment
Given the maximum score, how can the best
alignment be found?
  • Quadratic cost in space and time
  • Sequences up to 10,000 bp in length

Download alggen tool
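One common way to recover an alignment from the filled matrix is to walk back from the last cell, checking at each step which neighbour produced the stored value. The sketch below assumes the same scores (1, -1, -2) and keeps the whole matrix in memory, which is where the quadratic space cost comes from. The name `global_align` and the tie-breaking order are choices made for illustration.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, a_aligned, b_aligned) for one best global alignment."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + sub, s[i - 1][j] + gap, s[i][j - 1] + gap)

    # Traceback: start at the bottom-right cell and move to whichever
    # neighbour explains the current value (diagonal first on ties).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("AC", "CTAC"))  # (-2, '--AC', 'CTAC')
```

Several alignments may share the maximum score; this sketch returns only one of them.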
16
2.2 Some slides revisited
  • We have developed the theory according to the
    following principles:
  • 1) Both sequences have a similar length
    (global).
  • 2) The gap model is linear.

If there are k consecutive gaps, the penalty
is $k \cdot (-2)$.
17
2.2 Semiglobal pairwise alignment
  • Assume that we have sequences of different
    lengths, S1 and S2.

It is meaningless to introduce gaps until both
sequences have similar length.
The most probable alignment should be
[Figure: alignment of S1 and S2 with initial gaps and final gaps]
How can these alignments be found?
18
2.2 Semiglobal pairwise alignment
  • Note that

[Matrix figure: columns C T A C T A C T A C G T; rows A C T; initial gaps and final gaps marked]
19
2.2 Semiglobal pairwise alignment
Given a cell
[Matrix figure: columns C T A C T A C T A C G T; rows A C T; first row filled with 0]
The cell contains the score of the best
alignment of CTA with the empty sequence.
20
2.2 Semiglobal pairwise alignment
[Matrix figure: columns C T A C T A C T A C G T; rows A C T; first row filled with 0]
The contribution of the initial gaps is
disregarded, so the first row is filled with zeros.
But what happens with the final gaps?
21
2.2 Semiglobal pairwise alignment
[Matrix figure: columns C T A C T A C T A C G T; rows A C T; first row filled with 0; values 1, 2, 3 shown in rows A, C, T]
How does the algorithm search for the best
alignment?
Practice with the alggen tool.
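A minimal sketch of one way to handle both ends follows. The zero first row comes from the slides; treating the final gaps as free by taking the maximum over the last row is a common convention assumed here, since the slides pose the question without spelling out the answer. Scores (1, -1, -2) and the name `semiglobal_score` are assumptions as well.

```python
def semiglobal_score(short: str, long: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best score of `short` aligned inside `long`, ignoring end gaps in `short`."""
    n, m = len(short), len(long)
    # Row 0 stays at 0: gaps placed before `short` starts are not charged.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap  # gaps inserted in `long` are still charged
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if short[i - 1] == long[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + sub, s[i - 1][j] + gap, s[i][j - 1] + gap)
    # Gaps after `short` ends are not charged either, so the answer is
    # the best value anywhere in the last row.
    return max(s[n])


print(semiglobal_score("ACT", "CTACTACTACGT"))  # 3: ACT occurs exactly in the longer sequence
```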
22
2.2 Affine-gap model score
Given the following alignments,
which have the same score:

a g t a c c c c g t a g
a g t - c c - - g t a -

a g t a c c c c g t a g
a g t - c - c - g t a -

a g t a c c c c g t a g
a g t - c - - c g t a -

a g t a c c c c g t a g
a g t - - c c - g t a -

a g t a c c c c g t a g
a g t - - - c c g t a -

a g t a c c c c g t a g
a g t - - c - c g t a -

Which is the most reliable case from a
biological point of view?
23
2.2 Affine-gap model score
Then, how can we distinguish between consecutive
gaps and separated gaps?

a g t a c c c c g t a g
a g t - - - c c g t a -

a g t a c c c c g t a g
a g t - - c - c g t a -

Then, the penalty of k consecutive gaps becomes
$OG + (k-1)\,EG$, which is an
affine-gap function (OG: gap-opening penalty, EG: gap-extension penalty).
How is the best alignment found?
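The effect of the affine penalty on the two alignments above can be checked with a small scoring routine. The values OG = -4 and EG = -1 are hypothetical, chosen only for illustration; the slides do not give them. The name `affine_score` is also hypothetical.

```python
def affine_score(a_aligned: str, b_aligned: str, match: int = 1, mismatch: int = -1,
                 gap_open: int = -4, gap_extend: int = -1) -> int:
    """Score an alignment where a run of k gaps costs gap_open + (k-1)*gap_extend."""
    total = 0
    prev_gap_row = None  # which sequence carried a gap in the previous column
    for x, y in zip(a_aligned, b_aligned):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            # Continuing a run in the same sequence is an extension;
            # anything else opens a new gap.
            total += gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            total += match if x == y else mismatch
            prev_gap_row = None
    return total


top = "agtaccccgtag"
print(affine_score(top, "agt---ccgta-"))  # -2: gap runs of length 3 and 1
print(affine_score(top, "agt--c-cgta-"))  # -5: gap runs of length 2, 1 and 1
```

Under the linear model both alignments score the same; with these assumed values the affine model prefers the one with fewer, longer gap runs.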
24
2.2 Affine-gap model score
[Matrix figure: columns C T A C T A C T A C G T; rows A C T G A; cells connected by arrows]
The smaller arrows mark the introduction of an
opening gap. The larger arrows mark the
introduction of an extension gap.
But from which cell do the larger arrows
originate?
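A common formulation, often attributed to Gotoh, answers this by keeping three scores per cell, so that an extension always continues from the adjacent cell's score for an alignment already ending in the same kind of gap. The sketch below is an illustration under assumed values (match 1, mismatch -1, OG = -4, EG = -1), not necessarily the method the slides go on to present. The names `M`, `X`, `Y` and `affine_global_score` are hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                        gap_open: int = -4, gap_extend: int = -1):
    """Best global alignment score when k consecutive gaps cost OG + (k-1)*EG."""
    n, m = len(a), len(b)
    # M: best score ending with a[i-1] aligned to b[j-1]
    # X: best score ending with a[i-1] aligned to a gap
    # Y: best score ending with a gap aligned to b[j-1]
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Opening comes from a cell that did not end in this kind of gap;
            # extension comes from the same matrix, one cell back.
            X[i][j] = max(M[i - 1][j] + gap_open, Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open, X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("agtaccccgtag", "agtccgta"))  # -2
```

The cost remains quadratic; only the constant factor grows, because three tables are filled in place of one.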
25
2.2 Affine-gap model score
[Matrix figure: columns C T A C T A C T A C G T; rows A C T G A]
Access to ClustalW: http://www.ebi.ac.uk/clustalw
26
2.2 Local alignment
Given two sequences, we can consider the
alignments of all their substrings:
how can the
best of them be found?
Two questions arise:
  • How can the alignments be compared?
  • How can the best one be selected?
27
2.2 Local alignment
Given a path,
imagine the graph of the scores along it: can the best
subalignments be detected?

It suffices to compare the value of each cell
with zero!
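Comparing each cell with zero can be sketched as follows. This is the scheme commonly known as the Smith-Waterman algorithm; the slides do not name it, and the scores (1, -1, -2) and the name `local_score` are assumptions for illustration.

```python
def local_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best alignment between any substring of a and any substring of b."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at 0: an alignment may start anywhere.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets a subalignment restart whenever the
            # running score would drop below zero.
            s[i][j] = max(0, s[i - 1][j - 1] + sub, s[i - 1][j] + gap, s[i][j - 1] + gap)
            best = max(best, s[i][j])  # an alignment may also end anywhere
    return best


print(local_score("GGACTGG", "TTACTTT"))  # 3: the shared substring ACT
```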