An Efficient Index Structure for String Databases
1
An Efficient Index Structure for String Databases
  • Tamer Kahveci
  • Ambuj K. Singh
  • Presented By
  • Atul Ugalmugale/Nikita Rasam

2
  • The problem: quickly find substrings in a large
    database that are similar to a given query
    string, using a small index structure.
  • In some applications we store, search and analyze
    long sequences of discrete characters, which we
    call strings
  • There is a frequent need to find similarities
    between genetic data, web data and event
    sequences.

3
  • Applications
  • Information Retrieval: a typical application of
    information retrieval is text searching: given a
    large collection of documents and some text
    keywords, we want to find the documents which
    contain these keywords.
  • Searching keywords on the net: usually by
    "mtallica" we mean "metallica".

4
  • Computational Biology: the problem is similar in
    computational biology; here we have a long DNA
    sequence and we want to find subsequences in it
    that approximately match a query sequence.
  • ATGCATACGATCGATT
  • TGCAATGGCTTAGCTA
  • Animal species from the same family are bound to
    have more similar DNAs.

5
  • Video data can be viewed as an event sequence if
    some pre-specified set of events is detected and
    stored as a sequence. Searching for similar event
    subsequences can be used to find related video
    segments.

6
  • String search algorithms proposed so far are
    in-memory algorithms.
  • Scan the whole database for each query.
  • Size of the string database grows faster than the
    available memory capacity, and extensive memory
    requirements make the search techniques
    impractical.
  • Suffer from disk I/Os when the database is too
    large
  • Performance deteriorates for long query patterns

7
  • Similarity Metrics
  • The difference between two strings s1 and s2 is
    generally defined as the minimum number of edit
    operations needed to transform s1 into s2, called
    the edit distance, ED(s1, s2).
  • Edit operations
  • Insert
  • Delete
  • Replace

8
  • Suppose we have two strings x and y,
  • e.g. x = "kitten", y = "sitting",
  • and we want to transform x into y.
  • A closer look:
  • k i t t e n
  • s i t t i n g
  • 1st step: kitten → sitten (Replace)
  • 2nd step: sitten → sittin (Replace)
  • 3rd step: sittin → sitting (Insert)
  • What is the edit distance between "survey" and
    "surgery"?
  • s u r v e y → s u r g e y : replace (1)
    → s u r g e r y : insert (1)
  • Edit distance = 2

9
  • In the general version of edit distance,
    different operations may have different costs, or
    the costs depend on the characters involved.
  • For example, replacement could be more expensive
    than insertion, or replacing "a" with "o" could
    be less expensive than replacing "a" with "k".
  • This is called the weighted edit distance.

10
  • Global Alignment
  • The global alignment (or similarity) of s1 and s2
    is defined as the maximum valued alignment of s1
    and s2.
  • Given two strings S1 and S2, their global
    alignment is obtained by inserting spaces into S1
    or S2 (including at the ends) so that they are of
    the same length, and then writing them one
    against the other.
  • Example: S1 = qacdbd, S2 = qawdb
  • q a c _ d b d
  • q a _ w d b _
  • Edits and alignments are dual:
  • A sequence of edits can be converted into a
    global alignment.
  • An alignment can be converted into a sequence of
    edits.

11
  • Local Alignment
  • Given two strings X and Y find two substrings x
    and y from X and Y, respectively, such that their
    alignment score (in the global sense) is maximum
    over all pairs of such substrings. (empty
    substrings are allowed)
  • S(x, y) = +2 if x = y
  •          = -2 if x ≠ y
  •          = -1 if x = _ or y = _

X = pqraxabcstvq, Y = yxaxbacsll
x = axabcs, y = axbacs
a x a b _ c s
a x _ b a c s
+2 +2 -1 +2 -1 +2 +2 = 8
12
String Matching Problem
  • Whole Matching
  • finding the edit distance ED(q,s) between a data
    string s and a query string q.
  • Substring Matching
  • Consider all substrings s[i..j] of s which are
    close to the query string.
  • Two Types of Queries
  • Range search seeks all the substrings of S which
    are within an edit distance of r to a given query
    q (r-range query).
  • K-nearest neighbor search seeks the K closest
    substrings of S to q.

13
Challenges in solving the substring matching
problem
  • Finding the edit distance is very costly in
    terms of both time and space.
  • The strings in the database may be very long.
  • The database size for most applications grows
    exponentially.
  • New approach to overcome challenges
  • Define a lower bound distance for substring
    searching
  • Improve this lower bound by using the idea of
    wavelet transformation
  • Use the MRS index structure based on the
    aforementioned distance formulations

14
A dynamic programming algorithm for computing the
edit distance
  • Problem: find the edit distance between strings x
    and y.
  • Create a (|x|+1) × (|y|+1) matrix C, where C[i,j]
    represents the minimum number of operations to
    match x[1..i] with y[1..j]. The matrix is
    constructed as follows:
  • C[i,0] = i
  • C[0,j] = j
  • C[i,j] = min( C[i-1,j-1] + cost,   (replace)
  •               C[i,j-1] + 1,        (insert)
  •               C[i-1,j] + 1 )       (delete)
  • cost = 0 if x[i] = y[j], else 1
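The recurrence above can be sketched directly in Python (a minimal illustration; the function and variable names are ours, not from the slides):

```python
# Dynamic-programming edit distance, following the recurrence
# on this slide: C[i][j] = min edits to turn x[:i] into y[:j].
def edit_distance(x: str, y: str) -> int:
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i                      # delete all of x[:i]
    for j in range(n + 1):
        C[0][j] = j                      # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            C[i][j] = min(C[i - 1][j - 1] + cost,  # replace (or match)
                          C[i][j - 1] + 1,         # insert
                          C[i - 1][j] + 1)         # delete
    return C[m][n]

print(edit_distance("kitten", "sitting"))  # 3, as on slide 8
print(edit_distance("survey", "surgery"))  # 2
```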

15
How do we perform substring search?
  • The same dynamic programming algorithm can be
    used to find the most similar substrings of a
    query string q.
  • The difference is that we set C[0,j] = 0 for all
    j, since any text position could be the potential
    start of a match.
  • If the similarity distance bound is k, we report
    all positions j where C[m,j] ≤ k (m is the last
    row, m = |q|).
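A sketch of this variant (names are ours), keeping only two DP rows; it reports the end positions of matching substrings, as in the 'mtallica'/'metallica' example from slide 3:

```python
# Substring-search variant of the edit-distance DP: row 0 is all
# zeros (C[0][j] = 0), so a match may start anywhere in text s.
def substring_match_ends(q: str, s: str, k: int) -> list[int]:
    m, n = len(q), len(s)
    prev = [0] * (n + 1)              # C[0][j] = 0 for all j
    for i in range(1, m + 1):
        curr = [i] + [0] * n          # C[i][0] = i
        for j in range(1, n + 1):
            cost = 0 if q[i - 1] == s[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # replace/match
                          curr[j - 1] + 1,     # insert
                          prev[j] + 1)         # delete
        prev = curr
    # report positions j where C[m][j] <= k
    return [j for j in range(1, n + 1) if prev[j] <= k]

print(substring_match_ends("metallica", "xxmtallicaxx", 1))  # [10]
```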

16
Frequency Vector
  • Let s be a string from the alphabet Σ = {α1, ...,
    ασ}. Let ni be the number of occurrences of the
    character αi in s for 1 ≤ i ≤ σ; then the
  • frequency vector f(s) = [n1, ..., nσ].
  • Example:
  • s = AATGATAG
  • f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
  • Let s be a string from the alphabet Σ = {α1, ...,
    ασ} and let f(s) = [v1, ..., vσ] be the frequency
    vector of s; then Σ(i=1..σ) vi = |s|.
  • An edit operation on s has one of the following
    effects on f(s), for 1 ≤ i, j ≤ σ and i ≠ j:
  • vi = vi + 1
  • vi = vi - 1
  • vi = vi + 1 and vj = vj - 1
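A frequency vector is straightforward to compute; this minimal sketch uses the DNA alphabet in the order A, C, G, T from the slide's example (names are ours):

```python
from collections import Counter

ALPHABET = "ACGT"  # alphabet ordering from the slide's example

# f(s): number of occurrences of each alphabet character in s.
def frequency_vector(s: str) -> list[int]:
    counts = Counter(s)                      # missing keys count as 0
    return [counts[ch] for ch in ALPHABET]

print(frequency_vector("AATGATAG"))          # [4, 0, 2, 2]
print(sum(frequency_vector("AATGATAG")))     # 8 = |s|
```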

17
Effect of Edit Operations on Frequency Vector
  • Delete decreases an entry by 1.
  • Insert increases an entry by 1.
  • Replace = Insert + Delete.
  • Example:
  • s = AATGATAG, f(s) = [4, 0, 2, 2]
  • (del. G): s = AAT.ATAG, f(s) = [4, 0, 1, 2]
  • (ins. C): s = AACTATAG, f(s) = [4, 1, 1, 2]
  • (A → C): s = ACCTATAG, f(s) = [3, 2, 1, 2]

18
Frequency Distance
  • Let u and v be integer points in σ-dimensional
    space. The frequency distance FD1(u, v) between u
    and v is defined as the minimum number of steps
    needed to go from u to v (or, equivalently, from
    v to u) by moving to a neighbor point at each
    step.
  • Let s1 and s2 be two strings from the alphabet
    Σ = {α1, ..., ασ}; then
  • FD1(f(s1), f(s2)) ≤ ED(s1, s2)

19
An Approximation to ED Frequency Distance (FD1)
  • s = AATGATAG, f(s) = [4, 0, 2, 2]
  • q = ACTTAGC, f(q) = [2, 2, 1, 2]
  • pos = (4-2) + (2-1) = 3
  • neg = (2-0) = 2
  • FD1(f(s), f(q)) = 3
  • ED(q, s) = 4
  • FD1(f(s1), f(s2)) = max{pos, neg}
  • FD1(f(s1), f(s2)) ≤ ED(s1, s2)

20
Frequency Distance Calculation
/* u and v are σ-dimensional integer points */
Algorithm FD1(u, v):
  posDistance = negDistance = 0
  for i = 1 to σ:
    if u[i] > v[i] then posDistance += u[i] - v[i]
    else negDistance += v[i] - u[i]
  • FD1(u, v) = max{posDistance, negDistance}
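The algorithm above translates directly into Python (a minimal sketch; names are ours):

```python
# FD1(u, v): frequency distance between two sigma-dimensional
# integer points, following the algorithm on this slide.
def fd1(u: list[int], v: list[int]) -> int:
    pos_distance = neg_distance = 0
    for ui, vi in zip(u, v):
        if ui > vi:
            pos_distance += ui - vi   # surplus of u over v
        else:
            neg_distance += vi - ui   # deficit of u under v
    return max(pos_distance, neg_distance)

# Slide 19's example: f(s) = [4, 0, 2, 2], f(q) = [2, 2, 1, 2]
print(fd1([4, 0, 2, 2], [2, 2, 1, 2]))  # 3
```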

21
Wavelet Vector Computation
Let s = c1c2...cn be a string from the alphabet
Σ = {α1, ..., ασ}; then the kth level wavelet
transformation ψk(s), 0 ≤ k ≤ log2 n, of s is
defined as ψk(s) = [vk,1, ..., vk,n/2^k], where
vk,i = (Ak,i, Bk,i) and
  • Ak,i = f(ci) if k = 0
  •        Ak-1,2i + Ak-1,2i+1 if k > 0
  • Bk,i = 0 if k = 0
  •        Ak-1,2i - Ak-1,2i+1 if k > 0
22
Using Local Information Wavelet Decomposition of
Strings
  • s = AATGATAC, f(s) = [4, 1, 1, 2]
  • s = AATG ATAC = s1 s2
  • f(s1) = [2, 0, 1, 1]
  • f(s2) = [2, 1, 0, 1]
  • ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2]
  • ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]
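The first-level decomposition above can be sketched as follows (a minimal illustration assuming |s| is even; names are ours):

```python
from collections import Counter

ALPHABET = "ACGT"

def frequency_vector(s: str) -> list[int]:
    counts = Counter(s)
    return [counts[ch] for ch in ALPHABET]

# First-level wavelet coefficients: the sum and difference of the
# two halves' frequency vectors, as in the AATGATAC example.
def wavelet_level1(s: str) -> tuple[list[int], list[int]]:
    half = len(s) // 2
    f1 = frequency_vector(s[:half])
    f2 = frequency_vector(s[half:])
    total = [a + b for a, b in zip(f1, f2)]  # equals f(s)
    diff = [a - b for a, b in zip(f1, f2)]   # captures local information
    return total, diff

print(wavelet_level1("AATGATAC"))  # ([4, 1, 1, 2], [0, -1, 1, 0])
```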

23
Wavelet Decomposition of a String General Idea
  • Ai,j = f(s[j·2^i .. (j+1)·2^i - 1])
  • Bi,j = Ai-1,2j - Ai-1,2j+1

(figure: the first and second wavelet coefficients of ψ(s))
24
Wavelet Transformation Example
  • s = TCAC, n = |s| = 4
  • ψ0(s) = [v0,0, v0,1, v0,2, v0,3]
  •       = [(A0,0, B0,0), (A0,1, B0,1), (A0,2, B0,2),
    (A0,3, B0,3)]
  •       = [(f(T), 0), (f(C), 0), (f(A), 0),
    (f(C), 0)]
  •       = [([0,0,1], 0), ([0,1,0], 0), ([1,0,0], 0),
    ([0,1,0], 0)]
  • ψ1(s) = [([0,1,1], [0,-1,1]), ([1,1,0], [1,-1,0])]
  • ψ2(s) = [([1,2,1], [-1,0,1])]
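The full recursive transformation can be sketched as below (a minimal illustration; names are ours, and the three-letter alphabet ordering A, C, T matches this slide's example; |s| is assumed to be a power of two):

```python
from collections import Counter

ALPHABET = "ACT"  # alphabet of the TCAC example, in order A, C, T

def f(ch: str) -> list[int]:
    counts = Counter(ch)
    return [counts[a] for a in ALPHABET]

# All wavelet levels of s: level k holds pairs (A[k,i], B[k,i]),
# with A[k,i] = A[k-1,2i] + A[k-1,2i+1] and
#      B[k,i] = A[k-1,2i] - A[k-1,2i+1].
def wavelet(s: str) -> list:
    A = [f(ch) for ch in s]                 # A[0,i] = f(c_i)
    B = [[0] * len(ALPHABET) for _ in s]    # B[0,i] = 0
    levels = [list(zip(A, B))]
    while len(A) > 1:
        new_A, new_B = [], []
        for i in range(len(A) // 2):
            new_A.append([x + y for x, y in zip(A[2 * i], A[2 * i + 1])])
            new_B.append([x - y for x, y in zip(A[2 * i], A[2 * i + 1])])
        A, B = new_A, new_B
        levels.append(list(zip(A, B)))
    return levels

print(wavelet("TCAC")[2][0])  # ([1, 2, 1], [-1, 0, 1]), i.e. psi2(s)
```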

(figure: first and second wavelet coefficients)
25
Wavelet Distance Calculation
26
Maximum Frequency Distance Calculation
  • FD(s1, s2) =
  •   max{ FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2)) }
  • FD1 is the Frequency Distance
  • FD2 is the Wavelet Distance

27
MRS-Index Structure Creation
(figure, animation across slides 27-31: a window of length
w = 2^a slides along string s1; after c slides, where c is the
box capacity, the windows' frequency vectors are enclosed in
one MBR)
32
MRS-Index Structure Creation
(figure: the MBRs for window size w = 2^a are stored in the
index row Ta,1 for string s1)
33
Using Different Resolutions
(figure: index rows Ta,1 and Ta+1,1, built with window sizes
w = 2^a and w = 2^(a+1), give multiple resolutions)
34
MRS-Index Structure
35
MRS-index properties
  • Relative MBR volume (Precision) decreases when
  • c increases.
  • w decreases.
  • MBRs are highly clustered.

(figure: box volume vs. box capacity)
36
Frequency Distance to an MBR
Let q be a query string of length 2^i. Given an MBR B, we
define FD(q, B) = min over s in B of FD(q, s).
37
Range Search Algorithm
38
Range Queries
1. Partition the query string into subqueries at
the various resolutions available in our index.
2. Perform a partial range query for each subquery
on the corresponding row of the index structure,
and refine the results.
3. Disk pages corresponding to the last result set
are read, and postprocessing is done to eliminate
false retrievals.
(figure: index rows for strings s1, s2, ..., sd at window
sizes w = 2^4, 2^5, 2^6, 2^7; the query q is partitioned into
subqueries q1, q2, q3)
39
K-Nearest Neighbor Algorithm
40
k-Nearest Neighbor Query
k = 3
42
k-Nearest Neighbor Query [KSF96, SK98]
k = 3
43
k-Nearest Neighbor Query
k = 3
r = edit distance to the 3rd closest substring
44
Experimental Settings
  • w = 128, 256, 512, 1024.
  • Human chromosomes from NCBI
    (www.ncbi.nlm.nih.gov):
  • chr02, chr18, chr21, chr22.
  • Plotted results are from the chr18 dataset.
  • Queries are selected randomly from the data set,
    with 512 ≤ |q| ≤ 10000.
  • An NFA-based technique [BYN99] is implemented for
    comparison.

45
Experimental Results 1: Effect of Box Capacity
(10-NN)
  • The cost of the MRS-index increases as the box
    capacity increases.
  • The cost of the MRS-index is much lower than that
    of the NFA technique for all these box
    capacities.
  • Although using two wavelet coefficients slightly
    improves the performance for the same box
    capacity, the size of the index structure is
    doubled. For the same amount of memory, the
    single-coefficient version performs better.

46
Experimental Results 2: Effect of Window Size
(10-NN)
  • The MRS-index structure outperforms the NFA
    technique for all the window sizes.
  • The performance of the MRS index structure itself
    improves as the window size increases.

47
Experimental Results 3: k-NN Queries
  • Although the performance of the MRS-index
    structure drops for large values of k, it still
    performs better than the NFA technique.
  • Achieved speedups of up to 45x for 10 nearest
    neighbors. The speedup for 200 nearest neighbors
    is 3x.
  • As the number of nearest neighbors increases, the
    performance of the MRS-index structure approaches
    that of the NFA technique.

48
Experimental Results 4: Range Queries
  • The MRS-index structure performed up to 12 times
    faster than the NFA technique. The performance of
    the MRS-index structure improved when the queries
    were selected from different data strings. This
    is because DNA strings have a high
    self-similarity.
  • The performance of the MRS index structure
    deteriorates as the error rate increases. This is
    because the size of the candidate set increases
    as the error rate increases.

49
Discussion
  • In-memory (the index size is 1-2% of the database
    size).
  • Lossless search.
  • 3 to 45 times faster than the NFA technique for
    k-NN queries.
  • 2 to 12 times faster than the NFA technique for
    range queries.
  • Can be used to speed up any previously defined
    technique.

50
THANK YOU