Index Structures for String Databases

1
Index Structures for String Databases
  • Alexandra Martinez
  • Computational Molecular Biology
  • CISE, University of Florida
  • Spring 2004

2
Outline
  • Problem Definition
  • Motivation
  • Background
  • Proposed Solution
  • Future Work

3
Problem Definition
  • Substring searching in databases
  • Ex: Similarity of two DNA strings
  • Functional relationships

Query Q:    T C G A T T A C A G T G A A T
Database S: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
5
Motivation for Indexing
  • Very large databases, exponential growth
  • Ex: GenBank (NCBI) doubles every 15 months.
  • Most string search algorithms are in-memory
    algorithms
  • Scan the whole database for each query.
  • Suffer from disk I/O when db is too large.
  • For index-based techniques, size of index
    structure is larger than size of db
  • Index size exceeds memory size -> resides on disk
  • Performance deteriorates for long queries.
  • Need efficient external memory algorithms!

7
Background: Edit Distance
  • Edit Operations
  • Insert, Delete, Replace
  • Edit Distance between s1 and s2
  • Minimum number of operations to transform s1 to
    s2.
  • ED(ABC, ABDA) = 2
  • Time and space complexity is O(mn) using dynamic
    programming, with m = |s1| and n = |s2|.
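
The dynamic program behind the O(mn) bound can be sketched as follows. This is a minimal illustration with unit cost for every operation; the function name `edit_distance` is an assumption for this example and does not come from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insert, delete, and replace operations turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between the prefixes s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of the prefix
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of the prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete s1[i-1]
                dp[i][j - 1] + 1,         # insert s2[j-1]
                dp[i - 1][j - 1] + cost,  # match or replace
            )
    return dp[m][n]


print(edit_distance("ABC", "ABDA"))  # 2
```

The table has (m+1)(n+1) cells and each cell is filled in constant time, which gives the O(mn) time and space stated above.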

8
Background: Alignments
  • Alignment
  • Matches chars of s1 and s2 in increasing order of
    position.
  • Each char pair pk is assigned a score s(pk).
  • Alignment value = Σk s(pk)
  • Example (R = replace, I = insert, D = delete):
      A C T - - T A G C
        R   I I       D
      A A T G A T A G -
  • Global Alignment
  • Maximum alignment value of s1 and s2.
  • Local Alignment
  • Highest alignment value of all the substrings of
    s1 and s2.
  • Ex: BLAST

9
Background: Query Types
  • String database S = {s1, s2, ..., sd}
  • Range Queries
  • Seek all substrings of S that are within an edit
    distance of r (range) to the input query q.
  • Nearest Neighbor Queries (kNN)
  • Seek the k closest substrings of S to the input
    query q.
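
As a point of comparison, a range query can be answered without any index by scanning every substring. The sketch below is only a brute-force baseline under assumed names (`range_query`, `edit_distance`); it is the kind of full scan that the index is meant to avoid, not the method proposed in this talk.

```python
def edit_distance(s1, s2):
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def range_query(db, q, r):
    """Return (string index, start, end) for every substring within distance r of q."""
    hits = []
    for idx, s in enumerate(db):
        for start in range(len(s)):
            for end in range(start + 1, len(s) + 1):
                # The length difference alone is a lower bound on the edit distance.
                if abs((end - start) - len(q)) > r:
                    continue
                if edit_distance(s[start:end], q) <= r:
                    hits.append((idx, start, end))
    return hits


print(range_query(["GCATTCG"], "CAT", 0))  # [(0, 1, 4)]
```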

10
Background: Wavelets
  • Wavelet transform provides a time-frequency
    representation of a signal.
  • WT developed as alternative to STFT (Short Time
    Fourier Transform).
  • WT solves resolution problem
  • Narrow window => good time resolution, poor
    frequency resolution.
  • Wide window => good frequency resolution, poor
    time resolution (w -> ∞ => FT)
  • Complexity of WT is O(N).

11
Background: Wavelets (2)
[Figure: four panels: original time signal; STFT with small window; STFT with large window; wavelet transform]
12
Outline
  • Problem Definition
  • Motivation
  • Background
  • Proposed Solution
  • General Approach
  • Frequency Vector Distance
  • Wavelet Transform Distance
  • MRS Index Structure
  • Range and Nearest Neighbor Queries
  • Future Work

13
General Approach
  • Map the substrings of the db into an integer
    space.
  • Frequency Vector
  • Vector of Wavelet Coefficients
  • Define a distance function in this integer space,
    which is lower bound of the actual edit distance.
  • Cluster the vectors of consecutive substrings
    into Minimum Bounding Rectangles (MBRs).
  • Obtain an array of MBRs for different resolutions
    -> grid.

15
Frequency Vector
  • s: string from alphabet Σ = {α1, ..., ασ}
  • ni: number of occurrences of αi in s (1 ≤ i ≤ σ)
  • Define the frequency vector of s as
    f(s) = [n1, ..., nσ]
  • Example
  • s = AATGATAG
  • f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
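
A frequency vector can be computed in one pass over the string. The sketch below assumes the DNA alphabet in the order A, C, G, T used in the example; the names `ALPHABET` and `frequency_vector` are illustrative only.

```python
ALPHABET = "ACGT"  # assumed alphabet, in the order used by the example


def frequency_vector(s: str, alphabet: str = ALPHABET) -> list:
    """Return [n_1, ..., n_sigma]: how often each alphabet character occurs in s."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    counts = [0] * len(alphabet)
    for ch in s:
        counts[index[ch]] += 1  # raises KeyError for a character outside the alphabet
    return counts


print(frequency_vector("AATGATAG"))  # [4, 0, 2, 2]
```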

16
Effect of Edit Ops on the Frequency Vector
  • Delete: decreases an entry by 1
  • Insert: increases an entry by 1
  • Replace = Insert + Delete
  • Example (entries ordered A, C, G, T)
  • s = AATGATAG => f(s) = [4, 0, 2, 2]
  • Del G: s = AAT.ATAG => f(s) = [4, 0, 1, 2]
  • Ins C: s = AACTATAG => f(s) = [4, 1, 1, 2]
  • A->C: s = ACCTATAG => f(s) = [3, 2, 1, 2]

17
Frequency Distance (FD1): A Lower Bound on the ED
  • Define FD1(u, v) as the minimum number of steps
    in order to go from u to v (or vice versa) by
    moving to a neighbor point at each step.
  • Two points u and v in σ-dim space are neighbors if
    one of them can be obtained from the other by a
    single edit operation.

18
Frequency Distance: Example
  • s = AATGATAG => f(s) = [4, 0, 2, 2]
  • t = ACTTAGC => f(t) = [2, 2, 1, 2]
  • pos = (4-2) + (2-1) = 3
  • neg = (2-0) = 2
  • FD1(f(s), f(t)) = 3
  • ED(s, t) = 4
  • FD1( f(s), f(t) ) = max{pos, neg}
  • FD1( f(s), f(t) ) ≤ ED(s, t)
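
The max{pos, neg} rule follows from the effect of each edit operation on the vector: a replace lowers one entry and raises another, so it can reduce pos and neg by one each, while an insert or delete reduces only one of them. A minimal sketch, assuming frequency vectors are plain lists of equal length (the name `fd1` is illustrative):

```python
def fd1(u, v):
    """Frequency distance between two frequency vectors of the same dimension."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)  # total surplus of u over v
    neg = sum(b - a for a, b in zip(u, v) if b > a)  # total deficit of u against v
    # Each replace can cancel one unit of surplus and one unit of deficit at once;
    # whatever is left over needs one insert or delete per unit.
    return max(pos, neg)


print(fd1([4, 0, 2, 2], [2, 2, 1, 2]))  # 3
```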

20
Wavelet Transformation: Example
  • s = TCAC, n = |s| = 4
  • ψ_0(s) = [v_{0,0}, v_{0,1}, v_{0,2}, v_{0,3}]
  •   = [(A_{0,0}, B_{0,0}), (A_{0,1}, B_{0,1}),
    (A_{0,2}, B_{0,2}), (A_{0,3}, B_{0,3})]
  •   = [(f(T), 0), (f(C), 0), (f(A), 0), (f(C), 0)]
  •   = [([0,0,0,1], 0), ([0,1,0,0], 0),
    ([1,0,0,0], 0), ([0,1,0,0], 0)]
  • ψ_1(s) = [([0,1,0,1], [0,-1,0,1]),
    ([1,1,0,0], [1,-1,0,0])]
  • ψ_2(s) = ([1,2,0,1], [-1,0,0,1])

First wavelet coefficient: [1,2,0,1]
Second wavelet coefficient: [-1,0,0,1]
21
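Wavelet Transformation: String Decomposition
  • A_{k,i} = A_{k-1,2i} + A_{k-1,2i+1},  1 ≤ k ≤ log2 n
  • B_{k,i} = A_{k-1,2i} - A_{k-1,2i+1},  0 ≤ i ≤ (n/2^k) - 1
  • Base case (k = 0): A_{0,i} = f(i-th char of s),
    B_{0,i} = 0

[Figure: decomposition of ψ(s) over position i and level k, labeling the first and second wavelet coefficients]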
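
The decomposition can be sketched level by level. The code below assumes the alphabet order A, C, G, T and a string whose length is a power of two; the names `one_hot` and `wavelet_levels` are illustrative and not taken from the slides.

```python
ALPHABET = "ACGT"  # assumed alphabet order


def one_hot(ch):
    """Frequency vector of a single character."""
    return [1 if ch == a else 0 for a in ALPHABET]


def wavelet_levels(s):
    """Return [psi_0, psi_1, ..., psi_log2(n)], each a list of (A, B) pairs."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("length of s must be a power of two")
    zero = [0] * len(ALPHABET)
    level = [(one_hot(ch), zero[:]) for ch in s]  # level 0: A = f(char), B = 0
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            left, right = level[i][0], level[i + 1][0]
            a = [x + y for x, y in zip(left, right)]  # A_{k,i}: sum of the two children
            b = [x - y for x, y in zip(left, right)]  # B_{k,i}: difference of the two children
            nxt.append((a, b))
        level = nxt
        levels.append(level)
    return levels


levels = wavelet_levels("TCAC")
print(levels[2])  # [([1, 2, 0, 1], [-1, 0, 0, 1])]
```

Each level halves the number of pairs, so the total work is linear in n for a fixed alphabet size.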
22
Wavelet Distance (FD2): A Lower Bound on the ED
  • Maximum Frequency Distance:
    FD(s1, s2) = max{ FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2)) }

24-29
MRS Index Creation
[Figure sequence. Labels: string s1; window size w = 2^a; transform; MBR; slide c times (c = box capacity); MBRs containing wavelet coefficients of substrings of s1; T_{a,1}: tree of MBRs for a resolution of w = 2^a over s1]
30
Using Different Resolutions
[Figure: over string s1, T_{a,1} built with window size w = 2^a and T_{a+1,1} built with window size w = 2^(a+1)]
31
MRS Index Structure
[Figure: grid of indexes; columns are the database strings s_j, 1 ≤ j ≤ d; rows are the resolution levels i, a ≤ i ≤ b]
T_{i,j}: index for the j-th string and window size 2^i
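
The construction above can be sketched roughly as follows. To stay short, this example makes several simplifying assumptions that are not in the slides: it uses frequency vectors in place of wavelet coefficients, keeps each T_{i,j} as a flat list of boxes rather than a tree, and uses the hypothetical names `build_mbrs` and `build_mrs_index`.

```python
ALPHABET = "ACGT"  # assumed alphabet order


def frequency_vector(s):
    return [s.count(ch) for ch in ALPHABET]


def build_mbrs(s, w, c):
    """Slide a window of length w over s and box every c consecutive vectors."""
    vectors = [frequency_vector(s[i:i + w]) for i in range(len(s) - w + 1)]
    mbrs = []
    for start in range(0, len(vectors), c):
        group = vectors[start:start + c]
        low = [min(col) for col in zip(*group)]   # lower corner of the box
        high = [max(col) for col in zip(*group)]  # upper corner of the box
        mbrs.append((start, low, high))  # start = offset of the first substring in the box
    return mbrs


def build_mrs_index(db, a, b, c):
    """Grid keyed by (i, j): boxes for the j-th string at window size 2^i."""
    return {
        (i, j): build_mbrs(s, 2 ** i, c)
        for j, s in enumerate(db, 1)
        for i in range(a, b + 1)
    }


index = build_mrs_index(["AATGATAG"], 1, 2, 3)
print(index[(2, 1)])
```

A string shorter than the window simply yields an empty list for that resolution.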
33
Range Queries
1. Partition the query string into subqueries at
various resolutions available in our index.
2. Perform a partial range query for each
subquery on the corresponding row of the index
structure, and refine ε.
3. Disk pages corresponding to last result set
are read, and postprocessing is done to eliminate
false retrievals.
[Figure: index grid with columns s1, s2, ..., sd and rows w = 2^4, 2^5, 2^6, 2^7; query q is split into subqueries q1, q2, q3]
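
Step 1 can be illustrated with a greedy split of the query into pieces whose lengths match the available window sizes. The longest-first strategy, the handling of a leftover tail, and the name `partition_query` are assumptions made for this sketch; the slides do not specify them.

```python
def partition_query(q, a, b):
    """Split q left to right into pieces of length 2^i (a <= i <= b), longest first."""
    pieces, pos = [], 0
    for i in range(b, a - 1, -1):
        size = 2 ** i
        while len(q) - pos >= size:
            pieces.append((i, q[pos:pos + size]))  # (resolution level, subquery)
            pos += size
    return pieces, q[pos:]  # the leftover tail is shorter than 2^a


pieces, leftover = partition_query("A" * 208, 4, 7)
print([level for level, _ in pieces])  # [7, 6, 4]
```

With resolutions 2^4 to 2^7, a query of length 208 = 128 + 64 + 16 is covered by three subqueries, each searched on the row of the index with the matching window size.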
34-35
k-Nearest Neighbor Queries: Phase 1
k = 3
B = set of k closest MBRs to query string q.
r = kth smallest edit distance of strings in B to
q.
36-37
k-Nearest Neighbor Queries: Phase 2
k = 3
Perform a range query using r as the query radius.
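
The two phases can be sketched on a simplified setting. This example assumes whole strings as candidates instead of substrings grouped into MBRs, and uses FD1 on frequency vectors as the lower bound; `knn`, `fd1`, and the other names are illustrative only.

```python
ALPHABET = "ACGT"  # assumed alphabet order


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def frequency_vector(s):
    return [s.count(ch) for ch in ALPHABET]


def fd1(u, v):
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def knn(db, q, k):
    """Return the k (distance, string) pairs of db closest to q in edit distance."""
    if not db or k <= 0:
        return []
    fq = frequency_vector(q)
    bounds = sorted((fd1(frequency_vector(s), fq), s) for s in db)
    # Phase 1: exact distances for the k candidates with the smallest lower bound.
    first = [(edit_distance(s, q), s) for _, s in bounds[:k]]
    r = max(d for d, _ in first)  # k-th smallest exact distance among these candidates
    # Phase 2: range query with radius r; the lower bound prunes the rest.
    results = list(first)
    for lb, s in bounds[k:]:
        if lb > r:
            break  # bounds are sorted, so all remaining strings are farther than r
        d = edit_distance(s, q)
        if d <= r:
            results.append((d, s))
    return sorted(results)[:k]


print(knn(["ACGT", "AAAA", "ACGA", "TTTT", "ACCT"], "ACGT", 2))
# [(0, 'ACGT'), (1, 'ACCT')]
```

Phase 1 guarantees that at least k strings lie within distance r, so the range query in phase 2 cannot miss a true nearest neighbor; the lower bound only saves exact distance computations.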
39
Future Work
  • Adapt the MRS-Index to work as an external
    index over tuples of a database.
  • Evaluate and compare the performance of the two
    distance functions, FD1 and FD2.
  • Test with protein sequences rather than DNA
    sequences.

40
References
  • T. Kahveci, A. K. Singh. Efficient Index
    Structures for String Databases. VLDB 2001,
    pages 351-360.
  • O. Camoglu, T. Kahveci, A. K. Singh. PSI:
    Indexing Protein Structures for Fast Similarity
    Search. Bioinformatics, 11, pages 1-3, 2003.
  • O. Camoglu, T. Kahveci, A. K. Singh. PSI:
    Indexing Protein Structures for Fast Similarity
    Search. 2003.
  • R. Polikar. The Wavelet Tutorial.
    http://users.rowan.edu/polikar/WAVELETS/