An Efficient Index Structure for String Databases
1
An Efficient Index Structure for String Databases
  • Tamer Kahveci
  • Ambuj K. Singh
  • Presented By
  • Atul Ugalmugale/Nikita Rasam

2
  • The problem: quickly find substrings in a large
    database that are similar to a given query
    string, using a small index structure.
  • In some applications we store, search and analyze
    long sequences of discrete characters, which we
    call strings
  • There is a frequent need to find similarities
    between genetic data, web data and event
    sequences.

3
  • Applications
  • Information Retrieval: a typical application of
    information retrieval is text searching: given a
    large collection of documents and some text
    keywords, we want to find the documents which
    contain these keywords.
  • Searching keywords on the net: usually by
    "mtallica" we mean "metallica".

4
  • Computational Biology: the problem is similar in
    computational biology; here we have a long DNA
    sequence and we want to find subsequences in it
    that approximately match a query sequence.
  • ATGCATACGATCGATT
  • TGCAATGGCTTAGCTA
  • Animal species from the same family are bound to
    have more similar DNAs.

5
  • Video data can be viewed as an event sequence if
    some pre-specified set of events is detected and
    stored as a sequence. Searching for similar event
    subsequences can be used to find related video
    segments.

6
  • String search algorithms proposed so far are
    in-memory algorithms.
  • Scan the whole database for each query.
  • Size of the string database grows faster than the
    available memory capacity, and extensive memory
    requirements make the search techniques
    impractical.
  • Suffer from disk I/Os when the database is too
    large
  • Performance deteriorates for long query patterns

7
  • Similarity Metrics
  • The difference between two strings s1 and s2 is
    generally defined as the minimum number of edit
    operations needed to transform s1 into s2, called
    the edit distance, ED(s1, s2).
  • Edit operations
  • Insert
  • Delete
  • Replace

8
  • Suppose we have two strings x and y,
  • e.g. x = "kitten", y = "sitting",
  • and we want to transform x into y.
  • A closer look:
  • k i t t e n
  • s i t t i n g
  • 1st step: kitten → sitten (Replace)
  • 2nd step: sitten → sittin (Replace)
  • 3rd step: sittin → sitting (Insert)
  • What is the edit distance between "survey" and
    "surgery"?
  • s u r v e y → s u r g e y : replace (1)
    → s u r g e r y : insert (1)
  • Edit distance = 2

9
  • In the general version of edit distance,
    different operations may have different costs, or
    the costs depend on the characters involved.
  • For example, replacement could be more expensive
    than insertion, or replacing "a" with "o" could
    be less expensive than replacing "a" with "k".
  • This is called the weighted edit distance.

10
  • Global Alignment
  • The global alignment (or similarity) of s1 and s2
    is defined as the maximum valued alignment of s1
    and s2.
  • Given two strings S1 and S2, their global
    alignment is obtained by inserting spaces into S1
    or S2 (including at the ends) so that they are of
    the same length, and then writing them one
    against the other.
  • Example: S1 = qacdbd, S2 = qawdb
  • q a c _ d b d
  • q a _ w d b _
  • Edits and alignments are dual:
  • A sequence of edits can be converted into a
    global alignment.
  • An alignment can be converted into a sequence of
    edits.

11
  • Local Alignment
  • Given two strings X and Y find two substrings x
    and y from X and Y, respectively, such that their
    alignment score (in the global sense) is maximum
    over all pairs of such substrings. (empty
    substrings are allowed)
  • S(x, y) = +2 if x = y
  •          = -2 if x ≠ y
  •          = -1 if x = _ or y = _

X = pqraxabcstvq, Y = yxaxbacsll
x = axabcs, y = axbacs
a x a b _ c s
a x _ b a c s
+2 +2 -1 +2 -1 +2 +2 = 8
12
String Matching Problem
  • Whole Matching
  • finding the edit distance ED(q,s) between a data
    string s and a query string q.
  • Substring Matching
  • Consider all substrings s[i..j] of s which are
    close to the query string.
  • Two Types of Queries
  • Range search seeks all the substrings of S which
    are within an edit distance of r to a given query
    q (r-range query).
  • K-nearest neighbor search seeks the K closest
    substrings of S to q.

13
Challenges in solving the substring matching
problem
  • Finding the edit distance is very costly in
    terms of both time and space.
  • The strings in the database may be very long.
  • The database size for most applications grows
    exponentially.
  • New approach to overcome challenges
  • Define a lower bound distance for substring
    searching
  • Improve this lower bound by using the idea of
    wavelet transformation
  • Use the MRS index structure based on the
    aforementioned distance formulations

14
A dynamic programming algorithm for computing the
edit distance
  • Problem: find the edit distance between strings x
    and y.
  • Create a (|x|+1) × (|y|+1) matrix C, where C[i,j]
    represents the minimum number of operations to
    match x[1..i] with y[1..j]. The matrix is
    constructed as follows:
  • C[i,0] = i
  • C[0,j] = j
  • C[i,j] = min( C[i-1,j-1] + cost,   (replace)
  •               C[i,j-1] + 1,        (insert)
  •               C[i-1,j] + 1 )       (delete)
  • cost = 0 if x[i] = y[j], else 1
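The recurrence above can be sketched directly in Python (a minimal illustration; the function and variable names are ours, not from the slides):

```python
# Dynamic-programming edit distance, following the recurrence
# on this slide: C[i][j] = min edits to turn x[:i] into y[:j].
def edit_distance(x: str, y: str) -> int:
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i                      # delete all of x[:i]
    for j in range(n + 1):
        C[0][j] = j                      # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            C[i][j] = min(C[i - 1][j - 1] + cost,  # replace (or match)
                          C[i][j - 1] + 1,         # insert
                          C[i - 1][j] + 1)         # delete
    return C[m][n]

print(edit_distance("kitten", "sitting"))  # 3, as on slide 8
print(edit_distance("survey", "surgery"))  # 2
```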

15
How do we perform substring search?
  • The same dynamic programming algorithm can be
    used to find the most similar substrings of a
    query string q.
  • The difference is that we set C[0,j] = 0 for all
    j, since any text position could be the potential
    start of a match.
  • If the similarity distance bound is k, we report
    all positions j where C[m,j] ≤ k (m is the last
    row, m = |q|).
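A sketch of this variant (names are ours), keeping only two DP rows; it reports the end positions of matching substrings, as in the 'mtallica'/'metallica' example from slide 3:

```python
# Substring-search variant of the edit-distance DP: row 0 is all
# zeros (C[0][j] = 0), so a match may start anywhere in text s.
def substring_match_ends(q: str, s: str, k: int) -> list[int]:
    m, n = len(q), len(s)
    prev = [0] * (n + 1)              # C[0][j] = 0 for all j
    for i in range(1, m + 1):
        curr = [i] + [0] * n          # C[i][0] = i
        for j in range(1, n + 1):
            cost = 0 if q[i - 1] == s[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # replace/match
                          curr[j - 1] + 1,     # insert
                          prev[j] + 1)         # delete
        prev = curr
    # report positions j where C[m][j] <= k
    return [j for j in range(1, n + 1) if prev[j] <= k]

print(substring_match_ends("metallica", "xxmtallicaxx", 1))  # [10]
```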

16
Frequency Vector
  • Let s be a string from the alphabet Σ = {α1, ...,
    ασ}. Let ni be the number of occurrences of the
    character αi in s for 1 ≤ i ≤ σ; then the
  • frequency vector f(s) = [n1, ..., nσ].
  • Example:
  • s = AATGATAG
  • f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
  • Let s be a string from the alphabet Σ = {α1, ...,
    ασ} and let f(s) = [v1, ..., vσ] be the frequency
    vector of s; then Σ(i=1..σ) vi = |s|.
  • An edit operation on s has one of the following
    effects on f(s), for 1 ≤ i, j ≤ σ and i ≠ j:
  • vi = vi + 1
  • vi = vi - 1
  • vi = vi + 1 and vj = vj - 1
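A frequency vector is straightforward to compute; this minimal sketch uses the DNA alphabet in the order A, C, G, T from the slide's example (names are ours):

```python
from collections import Counter

ALPHABET = "ACGT"  # alphabet ordering from the slide's example

# f(s): number of occurrences of each alphabet character in s.
def frequency_vector(s: str) -> list[int]:
    counts = Counter(s)                      # missing keys count as 0
    return [counts[ch] for ch in ALPHABET]

print(frequency_vector("AATGATAG"))          # [4, 0, 2, 2]
print(sum(frequency_vector("AATGATAG")))     # 8 = |s|
```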

17
Effect of Edit Operations on Frequency Vector
  • Delete decreases an entry by 1.
  • Insert increases an entry by 1.
  • Replace = Insert + Delete.
  • Example:
  • s = AATGATAG, f(s) = [4, 0, 2, 2]
  • (del. G): s = AAT.ATAG, f(s) = [4, 0, 1, 2]
  • (ins. C): s = AACTATAG, f(s) = [4, 1, 1, 2]
  • (A → C): s = ACCTATAG, f(s) = [3, 2, 1, 2]

18
Frequency Distance
  • Let u and v be integer points in σ-dimensional
    space. The frequency distance FD1(u, v) between u
    and v is defined as the minimum number of steps
    needed to go from u to v (or, equivalently, from
    v to u) by moving to a neighbor point at each
    step.
  • Let s1 and s2 be two strings from the alphabet
    Σ = {α1, ..., ασ}; then
  • FD1(f(s1), f(s2)) ≤ ED(s1, s2)

19
An Approximation to ED Frequency Distance (FD1)
  • s = AATGATAG, f(s) = [4, 0, 2, 2]
  • q = ACTTAGC, f(q) = [2, 2, 1, 2]
  • pos = (4-2) + (2-1) = 3
  • neg = (2-0) = 2
  • FD1(f(s), f(q)) = 3
  • ED(q, s) = 4
  • FD1(f(s1), f(s2)) = max{pos, neg}
  • FD1(f(s1), f(s2)) ≤ ED(s1, s2)

20
Frequency Distance Calculation
/* u and v are σ-dimensional integer points */
Algorithm FD1(u, v):
  posDistance = negDistance = 0
  for i = 1 to σ:
    if u[i] > v[i] then posDistance += u[i] - v[i]
    else negDistance += v[i] - u[i]
  • FD1(u, v) = max{posDistance, negDistance}
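The algorithm above translates directly into Python (a minimal sketch; names are ours):

```python
# FD1(u, v): frequency distance between two sigma-dimensional
# integer points, following the algorithm on this slide.
def fd1(u: list[int], v: list[int]) -> int:
    pos_distance = neg_distance = 0
    for ui, vi in zip(u, v):
        if ui > vi:
            pos_distance += ui - vi   # surplus of u over v
        else:
            neg_distance += vi - ui   # deficit of u under v
    return max(pos_distance, neg_distance)

# Slide 19's example: f(s) = [4, 0, 2, 2], f(q) = [2, 2, 1, 2]
print(fd1([4, 0, 2, 2], [2, 2, 1, 2]))  # 3
```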

21
Wavelet Vector Computation
Let s = c1c2...cn be a string from the alphabet
Σ = {α1, ..., ασ}; then the kth level wavelet
transformation ψk(s), 0 ≤ k ≤ log2 n, of s is
defined as ψk(s) = [vk,1, ..., vk,n/2^k], where
vk,i = (Ak,i, Bk,i) and
  • Ak,i = f(ci) if k = 0
  •        Ak-1,2i + Ak-1,2i+1 if k > 0
  • Bk,i = 0 if k = 0
  •        Ak-1,2i - Ak-1,2i+1 if k > 0
22
Using Local Information Wavelet Decomposition of
Strings
  • s = AATGATAC, f(s) = [4, 1, 1, 2]
  • s = AATG ATAC = s1 s2
  • f(s1) = [2, 0, 1, 1]
  • f(s2) = [2, 1, 0, 1]
  • ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2]
  • ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]
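The first-level decomposition above can be sketched as follows (a minimal illustration assuming |s| is even; names are ours):

```python
from collections import Counter

ALPHABET = "ACGT"

def frequency_vector(s: str) -> list[int]:
    counts = Counter(s)
    return [counts[ch] for ch in ALPHABET]

# First-level wavelet coefficients: the sum and difference of the
# two halves' frequency vectors, as in the AATGATAC example.
def wavelet_level1(s: str) -> tuple[list[int], list[int]]:
    half = len(s) // 2
    f1 = frequency_vector(s[:half])
    f2 = frequency_vector(s[half:])
    total = [a + b for a, b in zip(f1, f2)]  # equals f(s)
    diff = [a - b for a, b in zip(f1, f2)]   # captures local information
    return total, diff

print(wavelet_level1("AATGATAC"))  # ([4, 1, 1, 2], [0, -1, 1, 0])
```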

23
Wavelet Decomposition of a String General Idea
  • Ai,j = f(s[j·2^i .. (j+1)·2^i - 1])
  • Bi,j = Ai-1,2j - Ai-1,2j+1

(figure: the first and second wavelet coefficients of ψ(s))
24
Wavelet Transformation Example
  • s = TCAC, n = |s| = 4
  • ψ0(s) = [v0,0, v0,1, v0,2, v0,3]
  •       = [(A0,0, B0,0), (A0,1, B0,1), (A0,2, B0,2),
    (A0,3, B0,3)]
  •       = [(f(T), 0), (f(C), 0), (f(A), 0),
    (f(C), 0)]
  •       = [([0,0,1], 0), ([0,1,0], 0), ([1,0,0], 0),
    ([0,1,0], 0)]
  • ψ1(s) = [([0,1,1], [0,-1,1]), ([1,1,0], [1,-1,0])]
  • ψ2(s) = [([1,2,1], [-1,0,1])]
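The full recursive transformation can be sketched as below (a minimal illustration; names are ours, and the three-letter alphabet ordering A, C, T matches this slide's example; |s| is assumed to be a power of two):

```python
from collections import Counter

ALPHABET = "ACT"  # alphabet of the TCAC example, in order A, C, T

def f(ch: str) -> list[int]:
    counts = Counter(ch)
    return [counts[a] for a in ALPHABET]

# All wavelet levels of s: level k holds pairs (A[k,i], B[k,i]),
# with A[k,i] = A[k-1,2i] + A[k-1,2i+1] and
#      B[k,i] = A[k-1,2i] - A[k-1,2i+1].
def wavelet(s: str) -> list:
    A = [f(ch) for ch in s]                 # A[0,i] = f(c_i)
    B = [[0] * len(ALPHABET) for _ in s]    # B[0,i] = 0
    levels = [list(zip(A, B))]
    while len(A) > 1:
        new_A, new_B = [], []
        for i in range(len(A) // 2):
            new_A.append([x + y for x, y in zip(A[2 * i], A[2 * i + 1])])
            new_B.append([x - y for x, y in zip(A[2 * i], A[2 * i + 1])])
        A, B = new_A, new_B
        levels.append(list(zip(A, B)))
    return levels

print(wavelet("TCAC")[2][0])  # ([1, 2, 1], [-1, 0, 1]), i.e. psi2(s)
```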

(figure: first and second wavelet coefficients)
25
Wavelet Distance Calculation
26
Maximum Frequency Distance Calculation
  • FD(s1, s2) =
  •   max{ FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2)) }
  • FD1 is the Frequency Distance
  • FD2 is the Wavelet Distance

27
MRS-Index Structure Creation
(figure, animation across slides 27-31: a window of length
w = 2^a slides along string s1; after c slides, where c is the
box capacity, the windows' frequency vectors are enclosed in
one MBR)
32
MRS-Index Structure Creation
(figure: the MBRs for window size w = 2^a are stored in the
index row Ta,1 for string s1)
33
Using Different Resolutions
(figure: index rows Ta,1 and Ta+1,1, built with window sizes
w = 2^a and w = 2^(a+1), give multiple resolutions)
34
MRS-Index Structure
35
MRS-index properties
  • Relative MBR volume (Precision) decreases when
  • c increases.
  • w decreases.
  • MBRs are highly clustered.

(figure: box volume vs. box capacity)
36
Frequency Distance to an MBR
Let q be a query string of length 2^i. Given an MBR B, we
define FD(q, B) = min over s in B of FD(q, s).
37
Range Search Algorithm
38
Range Queries
1. Partition the query string into subqueries at
the various resolutions available in our index.
2. Perform a partial range query for each subquery
on the corresponding row of the index structure,
and refine the results.
3. Disk pages corresponding to the last result set
are read, and postprocessing is done to eliminate
false retrievals.
(figure: index rows for strings s1, s2, ..., sd at window
sizes w = 2^4, 2^5, 2^6, 2^7; the query q is partitioned into
subqueries q1, q2, q3)
39
K-Nearest Neighbor Algorithm
40
k-Nearest Neighbor Query
k = 3
42
k-Nearest Neighbor Query [KSF96, SK98]
k = 3
43
k-Nearest Neighbor Query
k = 3
r = edit distance to the 3rd closest substring
44
Experimental Settings
  • w = 128, 256, 512, 1024.
  • Human chromosomes from NCBI
    (www.ncbi.nlm.nih.gov):
  • chr02, chr18, chr21, chr22.
  • Plotted results are from the chr18 dataset.
  • Queries are selected randomly from the data set,
    with 512 ≤ |q| ≤ 10000.
  • An NFA-based technique [BYN99] is implemented for
    comparison.

45
Experimental Results 1: Effect of Box Capacity
(10-NN)
  • The cost of the MRS-index increases as the box
    capacity increases.
  • The cost of the MRS-index is much lower than that
    of the NFA technique for all these box
    capacities.
  • Although using two wavelet coefficients slightly
    improves the performance for the same box
    capacity, the size of the index structure is
    doubled. For the same amount of memory, the
    single-coefficient version performs better.

46
Experimental Results 2: Effect of Window Size
(10-NN)
  • The MRS-index structure outperforms the NFA
    technique for all the window sizes.
  • The performance of the MRS index structure itself
    improves as the window size increases.

47
Experimental Results 3: k-NN Queries
  • Although the performance of the MRS-index
    structure drops for large values of k, it still
    performs better than the NFA technique.
  • Achieved speedups of up to 45x for 10 nearest
    neighbors. The speedup for 200 nearest neighbors
    is 3x.
  • As the number of nearest neighbors increases, the
    performance of the MRS-index structure approaches
    that of the NFA technique.

48
Experimental Results 4: Range Queries
  • The MRS-index structure performed up to 12 times
    faster than the NFA technique. The performance of
    the MRS-index structure improved when the queries
    were selected from different data strings. This
    is because DNA strings have a high
    self-similarity.
  • The performance of the MRS index structure
    deteriorates as the error rate increases. This is
    because the size of the candidate set increases
    as the error rate increases.

49
Discussion
  • In-memory (the index size is 1-2% of the database
    size).
  • Lossless search.
  • 3 to 45 times faster than the NFA technique for
    k-NN queries.
  • 2 to 12 times faster than the NFA technique for
    range queries.
  • Can be used to speed up any previously defined
    technique.

50
THANK YOU