Title: VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
1VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
- Chen Li, Bin Wang, and Xiaochun Yang
Northeastern University, China
2Approximate string selections
Keanu Reeves
Samuel Jackson
Schwarzenegger
Samuel Jackson
Schwarrzenger
- Query errors
  - Limited knowledge about data
  - Typos
  - Limited input devices (e.g., cell phones)
- Data errors
  - Typos
  - Web data
  - OCR
- Applications
  - Spellchecking
  - Query relaxation
3Approximate string joins
R
S
infromix
mcrosoft
informix
microsoft
- Edit distance
- Jaccard
- Cosine
Record linkage
4Goal
- Reducing index size (memory)
- Reducing running time
5q-grams of strings
u n i v e r s a l
2-grams
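The decomposition of a string into q-grams can be sketched in Python as follows. This is a minimal illustration rather than code from the talk: the function name `qgrams` is hypothetical, and no padding characters are added at the string ends (an assumption that matches the shtick example used on the later slides).

```python
def qgrams(s, q):
    """Return the positional q-grams of s as (gram, start position) pairs."""
    # A string of length n has n - q + 1 overlapping q-grams.
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


grams = qgrams("universal", 2)
print([g for g, _ in grams])
# prints ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```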
6q-gram inverted lists
7Searching using inverted lists
- Query: shtick, ED(shtick, ?) ≤ 1
- 2-grams of the query: sh ht ti ic ck
- Figure: inverted lists of the grams ti, ic, ck
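The search on this slide can be sketched as a count filter over inverted lists followed by verification. The sketch below makes several assumptions: the five data strings are hypothetical, all function names are illustrative, and the count threshold (|query| - q + 1) - k × q is the fixed-length lower bound stated later in the deck. Candidates that share enough grams with the query are verified with a standard dynamic-programming edit distance.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each gram to a list of (string id, occurrences in that string)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram, count in Counter(qgrams(s, q)).items():
            index[gram].append((sid, count))
    return index


def edit_distance(a, b):
    """Standard dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(query, k, strings, index, q):
    """Return the strings within edit distance k of the query."""
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        # The count filter cannot prune anything: verify every string.
        candidates = range(len(strings))
    else:
        common = Counter()
        for gram, qcount in Counter(qgrams(query, q)).items():
            for sid, scount in index.get(gram, []):
                common[sid] += min(qcount, scount)  # multiset intersection
        candidates = sorted(sid for sid, c in common.items() if c >= threshold)
    return [strings[sid] for sid in candidates
            if edit_distance(query, strings[sid]) <= k]


# Hypothetical collection, chosen only for illustration.
strings = ["stick", "shtik", "stuck", "thick", "shtick"]
index = build_index(strings, 2)
print(search("shtick", 1, strings, index, 2))
# prints ['stick', 'shtik', 'shtick']
```

With q = 2 and k = 1 the threshold is 5 - 2 = 3, so stuck (1 common gram) and thick (2 common grams) are discarded before any edit distance is computed.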
82-grams → 3-grams?
- Query: shtick, ED(shtick, ?) ≤ 1
- 3-grams of the query: sht hti tic ick
- Figure: inverted lists of the grams tic, ick
- # of common grams ≥ 1
9Outline
- Motivation
- VGRAM
- Main idea
- Decomposing strings to grams
- Choosing good grams
- Effect of edit operations on grams
- Adopting vgram in existing algorithms
- Experiments
10Observation 1: dilemma of choosing q
- Increasing q causes:
  - Longer grams → shorter lists
  - Smaller # of common grams of similar strings
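Both effects can be seen on a toy collection. The sketch below is illustrative only: the strings are hypothetical, the helper names are not from the talk, and the bound (|s| - q + 1) - k × q is the fixed-length lower bound given later in the deck. It prints, for each q, the average inverted-list length and the minimum number of grams the query shtick must share with any answer within edit distance 1.

```python
from collections import defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def average_list_length(strings, q):
    """Average number of string ids per inverted list."""
    lists = defaultdict(set)
    for sid, s in enumerate(strings):
        for gram in qgrams(s, q):
            lists[gram].add(sid)
    return sum(len(ids) for ids in lists.values()) / len(lists)


def count_bound(length, q, k):
    """Lower bound on the # of common grams for fixed-length q-grams."""
    return (length - q + 1) - k * q


# Hypothetical collection, chosen only for illustration.
strings = ["stick", "shtik", "stuck", "thick", "shtick"]
for q in (2, 3, 4):
    print(q, round(average_list_length(strings, q), 2),
          count_bound(len("shtick"), q, 1))
```

The bound drops from 3 (q = 2) to 1 (q = 3) and becomes negative for q = 4, at which point the count filter no longer prunes any candidate.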
11Observation 2: skewed distributions of gram frequencies
- DBLP: 276,699 article titles
- Popular 5-grams: ation (>114K times), tions, ystem, catio
12VGRAM: Main idea
- Grams with variable lengths (between qmin and qmax)
  - zebra
    - ze(123)
  - corrasion
    - co(5213), cor(859), corr(171)
- Advantages
  - Reduce index size
  - Reduce running time
  - Adoptable by many algorithms
13Challenges
- Generating variable-length grams?
- Constructing a high-quality gram dictionary?
- Relationship between string similarity and their gram-set similarity?
- Adopting VGRAM in existing algorithms?
14Challenge 1: String → variable-length grams?
u n i v e r s a l
u n i v e r s a l
15Representing gram dictionary as a trie
ni ivr sal uni vers
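One way to decompose a string with such a dictionary is a greedy left-to-right scan: at each position take the longest dictionary gram that starts there, fall back to the qmin-gram if none matches, and skip any gram that lies entirely inside a gram already produced. The slides do not spell out the procedure, so the sketch below should be read as an assumption-laden illustration; the function name `vgen` and the values qmin = 2, qmax = 4 are choices made here. A set stands in for the trie to keep the example short.

```python
def vgen(s, dictionary, qmin, qmax):
    """Greedy decomposition of s into positional variable-length grams."""
    grams = []
    max_end = 0  # exclusive end of the furthest-reaching gram produced so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback when no dictionary gram starts at p
        for q in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + q] in dictionary:
                gram = s[p:p + q]  # longest dictionary gram starting at p
                break
        end = p + len(gram)
        # Every earlier gram starts before p, so this gram is contained in
        # an earlier one exactly when it does not extend past max_end.
        if end > max_end:
            grams.append((gram, p))
            max_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", dictionary, 2, 4))
# prints [('uni', 0), ('iv', 2), ('vers', 3), ('sal', 6)]
```

Positions are 0-based. The gram ni at position 1 is skipped because it lies inside uni, and ivr never matches because the string contains ive, so the qmin-gram iv is used instead.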
16Challenge 2: Constructing gram dictionary
Step 1: Collecting frequencies of grams with length in [qmin, qmax]
st → 0, 1, 3; sti → 0, 1; stu → 3; stic → 0, 1; stuc → 3
Gram trie with frequencies
17Step 2: selecting grams
- Pruning trie using a frequency threshold T (e.g., 2)
18Step 2: selecting grams (cont)
Threshold T = 2
19Final gram dictionary
[2,4]-grams
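The two steps can be sketched as follows. This is a deliberately simplified stand-in for the pruning shown on the slides, not the authors' full procedure: a gram is kept as soon as its frequency is at most T (or it reaches length qmax), and only grams more frequent than T are replaced by their one-character extensions. The four strings are hypothetical, chosen so that st, sti, stu, stic and stuc occur in the same strings as on the Step 1 slide; a `Counter` replaces the trie.

```python
from collections import Counter


def gram_frequencies(strings, qmin, qmax):
    """Step 1: count every gram with length in [qmin, qmax]."""
    freq = Counter()
    for s in strings:
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                freq[s[i:i + q]] += 1
    return freq


def select_grams(freq, qmin, qmax, threshold):
    """Step 2 (simplified): extend a gram only while it is too frequent."""
    selected = set()

    def visit(gram):
        # A trie would give the extensions directly; a scan keeps this short.
        extensions = [g for g in freq
                      if len(g) == len(gram) + 1 and g.startswith(gram)]
        if freq[gram] <= threshold or len(gram) == qmax or not extensions:
            selected.add(gram)
            return
        for ext in extensions:
            visit(ext)

    for gram in [g for g in freq if len(g) == qmin]:
        visit(gram)
    return selected


# Hypothetical strings with ids 0..3.
strings = ["stick", "stich", "such", "stuck"]
freq = gram_frequencies(strings, 2, 4)
print(freq["st"], freq["sti"], freq["stu"])  # prints 3 2 1
print(sorted(select_grams(freq, 2, 4, 2)))
# prints ['ch', 'ck', 'ic', 'sti', 'stu', 'su', 'ti', 'tu', 'uc']
```

With T = 2, the gram st (frequency 3) is too frequent and is replaced by sti and stu, while every other 2-gram is rare enough to be kept as is.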
20Outline
- Motivation
- VGRAM
- Main idea
- Decomposing strings to grams
- Choosing good grams
- → Effect of edit operations on grams
- Adopting vgram in existing algorithms
- Experiments
21Challenge 3: Effect of edit operations on grams
Fixed length q
u n i v e r s a l
- k operations could affect k × q grams
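22Deletion affects variable-length grams
Figure: a deletion at position i. Grams near the deletion, within [i - qmax + 1, i + qmax - 1], are affected; grams to the left and to the right of this range are not affected.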
23Grams affected by a deletion
Figure: a deletion at position i of u n i v e r s a l. Which grams within [i - qmax + 1, i + qmax - 1] are affected?
24Grams affected by a deletion (cont)
Figure: a deletion at position i, with the range [i - qmax + 1, i + qmax - 1]. Panels: trie of grams, trie of reversed grams.
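25Number of grams affected by each operation
Figure: deletion/substitution counts (at characters) and insertion counts (at gaps, shown as _)
Counts shown on the slide: 0, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 0
_ u _ n _ i _ v _ e _ r _ s _ a _ l _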
26Max # of grams affected by k operations
Vector of s: <2,4,6,8,9>
With 2 edit operations, at most 4 grams can be affected
- Called NAG vector (# of affected grams)
- Precomputed and stored
27Summary of VGRAM index
28Challenge 4: adopting VGRAM
- Easily adoptable by many algorithms
- Basic interfaces
  - String s → grams
  - Strings s1, s2 such that ed(s1,s2) ≤ k → min # of their common grams
29Lower bound on # of common grams
Fixed length (q)
u n i v e r s a l
- If ed(s1,s2) ≤ k, then their # of common grams ≥ (|s1| - q + 1) - k × q
Variable lengths: # of grams of s1 - NAG(s1,k)
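The two bounds can be written side by side. In this sketch the function names are hypothetical and the NAG vector is passed in as a precomputed list, with entry k - 1 holding the maximum number of grams affected by k operations. The value NAG(shtick, 1) = 2 used in the example is an assumption, chosen because it reproduces the lower bound of 1 shown for the grams sh, ht, tick on the next slide.

```python
def fixed_length_bound(s, q, k):
    """Lower bound on common q-grams: (|s| - q + 1) - k * q."""
    return (len(s) - q + 1) - k * q


def vgram_bound(num_grams, nag_vector, k):
    """Lower bound on common variable-length grams: # of grams - NAG(s, k).

    nag_vector[k - 1] is the max # of grams affected by k edit operations.
    """
    return num_grams - nag_vector[k - 1]


# Fixed-length 2-grams of shtick: sh ht ti ic ck (5 grams).
print(fixed_length_bound("shtick", 2, 1))  # prints 3

# Variable-length grams of shtick: sh ht tick (3 grams).
assumed_nag = [2]  # assumed NAG vector for shtick, covering k = 1 only
print(vgram_bound(3, assumed_nag, 1))  # prints 1
```

A bound of zero or less means the count filter cannot prune any candidate for that k.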
30Example algorithm using inverted lists
- Query: shtick, ED(shtick, ?) ≤ 1
- 2-grams: lower bound = 3
- 2-4 grams: sh ht tick; lower bound = 1
- Figure: inverted list of the gram tick
31PartEnum + VGRAM
- PartEnum, fixed q-grams
  - ed(s1,s2) ≤ k
  - → hamming(grams(s1),grams(s2)) ≤ k × q
- VGRAM
  - ed(s1,s2) ≤ k
  - → hamming(VG(s1),VG(s2)) ≤ NAG(s1,k) + NAG(s2,k)
32PartEnum + VGRAM (naïve)
R
S
Bm(S) = max(NAG(s,k))
Bm(R) = max(NAG(r,k))
- Both use the same gram dictionary.
- Use Bm(R) + Bm(S) as the new hamming bound.
33PartEnum + VGRAM (optimization)
R
S
R1 with Bm(R1)
R2 with Bm(R2)
Bm(S) = max(NAG(s,k))
R3 with Bm(R3)
- Group R based on the NAG(r,k) values
- Join(R1,S) using Bm(R1) + Bm(S)
- Similarly, Join(R2,S), Join(R3,S)
- Local bounds tighter → better signatures generated
- Grouping S also possible.
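The difference between the naïve and the grouped bounds can be sketched with made-up numbers. Everything below is an illustration under assumptions: the NAG(r,k) and NAG(s,k) values are hypothetical inputs, the function names are not from the talk, and splitting R into equal-sized groups after sorting by NAG value is just one possible grouping (the slides only say that R is grouped by its NAG values).

```python
def global_bound(nag_r, nag_s):
    """Naive hamming bound: Bm(R) + Bm(S)."""
    return max(nag_r) + max(nag_s)


def grouped_bounds(nag_r, nag_s, num_groups):
    """Sort R by NAG(r, k), split it into groups, and bound each group.

    Returns a list of (ids of the strings in the group, Bm(Ri) + Bm(S)).
    """
    bm_s = max(nag_s)
    order = sorted(range(len(nag_r)), key=lambda i: nag_r[i])
    size = -(-len(order) // num_groups)  # ceiling division
    groups = [order[i:i + size] for i in range(0, len(order), size)]
    return [(group, max(nag_r[i] for i in group) + bm_s) for group in groups]


# Hypothetical NAG(r, k) and NAG(s, k) values for one fixed k.
nag_r = [2, 6, 3, 2, 5, 4]
nag_s = [3, 4]
print(global_bound(nag_r, nag_s))  # prints 10
for group, bound in grouped_bounds(nag_r, nag_s, 3):
    print(group, bound)
# prints [0, 3] 6, then [2, 5] 8, then [4, 1] 10
```

Only the group containing the string with the largest NAG value needs the global bound; the other groups are joined with S under smaller hamming bounds.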
34Outline
- Motivation
- VGRAM
- Main idea
- Decomposing strings to grams
- Choosing good grams
- Effect of edit operations on grams
- Adopting vgram in existing algorithms
- Experiments
35Data sets
- Data set 1: Texas Real Estate Commission.
  - 151K person names, average length 33.
- Data set 2: English dictionary from the Aspell spellchecker for Cygwin.
  - 149,165 words, average length 8.
- Data set 3: DBLP Bibliography.
  - 277K titles, average length 62.
36VGRAM overhead (index size)
Dataset 3: DBLP titles
37VGRAM overhead (construction time)
Dataset 3: DBLP titles
38Benefits over fixed-length grams (index)
Dataset 1: Person names
39Benefits over fixed-length grams (running time)
Dataset 1: Person names
40Effect of qmax
Dataset 1: Person names
41Effect of frequency threshold T
Dataset 1: Person names
42Improving algorithm ProbeCount
Dataset 1: Person names
43Improving algorithm ProbeCluster
Dataset 1: Person names
44Improving algorithm PartEnum
Dataset 1: Person names
45Discussions
- Dynamic maintenance
- Edit distance variants
- Approximate substring queries
- Block moves
- Using VGRAM in DBMS
46Conclusions
- VGRAM: using grams of
  - variable length
  - high quality
- Adoptable in existing algorithms
  - Reduce index size
  - Reduce running time