Text Comparison of Genetic Sequences

1
Text Comparison of Genetic Sequences
  • Shiri Azenkot
  • Pomona College
  • DIMACS REU 2004

2
Comparing Two Strings
  • Definition: A string is a sequence of consecutive
    characters.
  • Examples:
    • hello world
    • 0123456
    • DNA sequences
    • text file

3
Comparing Two Strings
  • If X and Y are strings, how similar are they?
  • Edit distance, d(X, Y): the smallest number of
    operations needed to make X look like Y.
  • Allowed operations:
    • Insert a character
    • Delete a character
    • Replace a character
  • Running time: O(mn) with a dynamic programming
    algorithm, where m and n are the lengths of X and Y
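
The dynamic programming approach mentioned above can be sketched as follows. This is a minimal illustration of the standard insert/delete/replace edit distance, not code from the presentation; the function name edit_distance is an assumption chosen for this example.

```python
def edit_distance(x: str, y: str) -> int:
    """Smallest number of single-character inserts, deletes and
    replacements needed to turn x into y (hypothetical helper name)."""
    m, n = len(x), len(y)
    # prev[j] holds the distance between the first i-1 characters of x
    # and the first j characters of y.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # delete x[i-1]
                curr[j - 1] + 1,     # insert y[j-1]
                prev[j - 1] + cost,  # replace, or keep a matching character
            )
        prev = curr
    return prev[n]


print(edit_distance("abcdef", "defabc"))  # prints 6
```

The two nested loops fill an (m+1) by (n+1) table one row at a time, which is where the O(mn) running time comes from. Only two rows are kept in memory here.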

4-10
Comparing Two Strings
  • If X and Y are strings, how similar are they?
  • Edit distance, d(X, Y): the smallest number of
    operations needed to make X look like Y.
  • Example: X = abcdef, Y = defabc, d(X, Y) = ? operations
  • Transforming X into Y one operation at a time:
    • 1 operation: X = bcdef (delete a)
    • 2 operations: X = cdef (delete b)
    • 3 operations: X = def (delete c)
    • 4 operations: X = defa (insert a)
    • 5 operations: X = defab (insert b)
    • 6 operations: X = defabc (insert c)
  • d(X, Y) = 6 operations
  • Does this seem too high?

11
Edit Distance with Moves
  • d(X, Y): the smallest number of operations needed to
    make X look like Y.
  • New operation: move a substring

X = abcdef, Y = defabc, d(X, Y) = 1
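
To make the new operation concrete, here is a small sketch of a substring move. The helper name move_substring and its index convention are assumptions made for this illustration; the slides do not define an interface.

```python
def move_substring(s: str, start: int, end: int, dest: int) -> str:
    """Cut s[start:end] out of s and reinsert it at index dest of the
    remaining string. Counts as a single edit operation under
    'edit distance with moves' (hypothetical helper)."""
    block = s[start:end]
    rest = s[:start] + s[end:]
    return rest[:dest] + block + rest[dest:]


# Moving "abc" to the end of "abcdef" takes one operation.
print(move_substring("abcdef", 0, 3, 3))  # prints defabc
```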
12
Edit Distance with Moves
  • d(X, Y): the smallest number of operations needed to
    make X look like Y.
  • New operation: move a substring
  • Some applications:
    • Computational biology (DNA sequences)
    • Text editing
    • Webpage updating

13
Edit Distance with Moves
  • Edit Sensitive Parsing (ESP) Algorithm:
    • Step 1: Parse each string into a 2-3 tree
    • Step 2: Compare nodes (substrings) of the trees to
      compute an approximation of the edit distance
  • Computing d(X, Y) exactly when moves are allowed is
    NP-hard
  • The algorithm approximates d(X, Y) deterministically
  • Running time: O(n log n)

14
Edit Distance with Moves: Algorithm
  • Parse each string into a 2-3 tree
  • Every node represents a substring
  • X = bagcabagehead

15
Edit Distance with Moves: Algorithm
  • Parse each string into a 2-3 tree
  • Every node represents a substring
  • Y = cabageheadbag
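
The idea that every node of a 2-3 tree stands for a substring can be illustrated with a deliberately naive sketch. This is not ESP: the grouping rule below (pairs from left to right, with a final triple when the count is odd) and the names group_2_3 and parse_levels are assumptions made purely for illustration.

```python
def group_2_3(nodes):
    """Merge adjacent nodes into blocks of two, ending with one block of
    three when the number of nodes is odd (naive stand-in, not ESP)."""
    blocks = []
    i = 0
    while i < len(nodes):
        remaining = len(nodes) - i
        size = 3 if remaining == 3 else 2
        blocks.append("".join(nodes[i:i + size]))
        i += size
    return blocks


def parse_levels(s):
    """Return the levels of the hierarchy, from single characters up to
    the root. Every node is stored as the substring it represents."""
    levels = [list(s)]
    while len(levels[-1]) > 1:
        levels.append(group_2_3(levels[-1]))
    return levels


for level in parse_levels("bagcabagehead"):
    print(level)
```

Because this rule depends only on position, moving a substring changes almost every block. ESP instead chooses block boundaries from the local content of the string, so a moved substring is parsed mostly the same way wherever it lands.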

16
Edit Distance with Moves: Algorithm
  • Compare nodes (substrings) of the trees to
    compute edit distance approximation
  • 2.1 Find frequencies of occurrence of each
    substring.
  • X = bagcabagehead

[Figure: parse tree of X, with leaves b a g c a b a g e h e a d]
17
Edit Distance with Moves: Algorithm
  • Compare nodes (substrings) of the trees to
    compute edit distance approximation
  • 2.1 Find frequencies of occurrence of each
    substring.
  • Y = cabageheadbag

[Figure: parse tree of Y, with the frequency of each node in parentheses]
caba (1)   gehea (1)   dbag (1)
ca (1)   ba (1)   geh (1)   ea (1)   db (1)   ag (1)
c a b a g e h e a d b a g
18
Edit Distance with Moves: Algorithm
  • Compare nodes (substrings) of the trees to
    compute edit distance approximation
  • 2.1 Find frequencies of occurrence of each
    substring.
  • 2.2 Subtract characteristic vectors to get an
    approximation for d(X, Y)

[Figure: node frequencies of X minus node frequencies of Y]
X: bagca (1)   bagehead (1)
   bag (1)   ca (1)   ba (1)   geh (1)   ead (1)
minus
Y: caba (1)   gehea (1)   dbag (1)
   ca (1)   ba (1)   geh (1)   ea (1)   db (1)   ag (1)
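
Steps 2.1 and 2.2 can be sketched as counting how often each node substring occurs and then summing the absolute differences of the two frequency vectors. The node lists below are copied from the two levels shown on the slide; the function name l1_difference is an assumption, and the result covers only those listed nodes, so it should not be read as the final approximation from the presentation.

```python
from collections import Counter


def l1_difference(nodes_x, nodes_y):
    """Sum of absolute differences between the substring frequency
    vectors of two parse trees (hypothetical helper)."""
    freq_x, freq_y = Counter(nodes_x), Counter(nodes_y)
    # A Counter returns 0 for substrings that never occur.
    return sum(abs(freq_x[k] - freq_y[k]) for k in freq_x.keys() | freq_y.keys())


# Internal nodes listed on the slide for X = bagcabagehead
x_nodes = ["bagca", "bagehead", "bag", "ca", "ba", "geh", "ead"]
# Internal nodes listed on the slide for Y = cabageheadbag
y_nodes = ["caba", "gehea", "dbag", "ca", "ba", "geh", "ea", "db", "ag"]

print(l1_difference(x_nodes, y_nodes))
```

Nodes that appear in both trees (ca, ba and geh here) cancel out, so only substrings that differ between the two parses contribute to the total.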
19
Edit Distance with Moves: Algorithm
  • Compare nodes (substrings) of the trees to
    compute edit distance approximation
  • 2.1 Find frequencies of occurrence of each
    substring.
  • 2.2 Subtract characteristic vectors to get an
    approximation for d(X, Y)

Actual edit distance with moves?
d(bagcabagehead, cabageheadbag) = 1 (move the substring bag to the end)
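
For strings this short, the value of 1 can be checked by brute force: enumerate every string reachable with a single substring move and test whether Y is among them. This is only a sanity check under assumed helper names (single_move_results, one_move_apart), not part of the ESP algorithm, and it does not scale to long sequences.

```python
def single_move_results(s):
    """All strings obtainable from s by cutting out one substring and
    reinserting it at any position (hypothetical helper)."""
    results = set()
    n = len(s)
    for start in range(n):
        for end in range(start + 1, n + 1):
            block = s[start:end]
            rest = s[:start] + s[end:]
            for dest in range(len(rest) + 1):
                results.add(rest[:dest] + block + rest[dest:])
    return results


def one_move_apart(x, y):
    """True when x differs from y and one substring move turns x into y."""
    return x != y and y in single_move_results(x)


print(one_move_apart("bagcabagehead", "cabageheadbag"))  # prints True
```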
20
Edit Distance with Moves
  • Goals for this project:
    • Implement this algorithm
    • Test the algorithm on DNA sequences
  • Questions to think about:
    • How accurate is the approximation?
    • How applicable is this technique to comparing
      large biological sequences?
    • This algorithm finds repeating structures within
      the sequences when comparing them. Do these
      structures have significance?
    • Do such structures exist in real sequences?

21
Acknowledgements
  • Mentor: Graham Cormode, DIMACS Postdoc
  • DIMACS REU 2004
  • References:
    • Benedetto, D., Caglioti, E., Loreto, V., Language
      Trees and Zipping. Physical Review Letters, 2002.
    • Cormode, G., Muthukrishnan, S., The String Edit
      Distance Matching Problem with Moves.