VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams

Provided by: chenlibinw

Learn more at: https://www.ics.uci.edu

1
VGRAM: Improving Performance of Approximate
Queries on String Collections Using
Variable-Length Grams

Chen Li, Bin Wang, and Xiaochun Yang

Northeastern University, China
2
Approximate string selections
Collection: Keanu Reeves, Samuel Jackson, Schwarzenegger, Samuel Jackson
Query: Schwarrzenger

Query errors:
Limited knowledge about data
Typos
Limited input devices (e.g., cell phones)
Data errors:
Typos
Web data
OCR

Applications
Spellchecking
Query relaxation

3
Approximate string joins
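[Figure: join of tables R and S, matching the misspelled entries infromix and mcrosoft with informix and microsoft]

Similarity functions: Edit distance, Jaccard, Cosine

Application: Record linkage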
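
As a small illustration of the first similarity function listed on this slide, here is a minimal sketch of edit distance (Levenshtein distance with unit-cost insertion, deletion, and substitution). The function name `edit_distance` is an assumption made for this example, not something taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    # prev[j] = distance between the first i-1 chars of a and first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# The misspelled join values from the slide:
d1 = edit_distance("mcrosoft", "microsoft")   # one missing character
d2 = edit_distance("infromix", "informix")    # two swapped characters
```

Under this unit-cost definition a swap of two adjacent characters counts as two operations, which is why the threshold k matters when joining R and S.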
4
Goal

Reducing index size (memory)
Reducing running time

5
q-grams of strings
String: u n i v e r s a l
2-grams: un, ni, iv, ve, er, rs, sa, al
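
A minimal sketch of q-gram extraction, assuming no padding characters are added at the string boundaries (so a string of length n yields n - q + 1 grams). The helper name `qgrams` is hypothetical.

```python
def qgrams(s: str, q: int) -> list[str]:
    """Return all substrings of length q, in order of their start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


grams = qgrams("universal", 2)
# 9 characters and q = 2 give 9 - 2 + 1 = 8 grams
```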
6
q-gram inverted lists
7
Searching using inverted lists

Query: shtick, ED(shtick, ?) ≤ 1

2-grams of the query: sh ht ti ic ck
[Figure: inverted lists of the grams ti, ic, ck]
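
The search on this slide can be sketched as a count filter over inverted lists followed by verification. This is an illustrative sketch, not the exact algorithm from the talk: the string collection is made up, all function names are hypothetical, and the threshold used is the standard fixed-length bound (|s| - q + 1) - k × q on the number of common grams.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_index(strings, q):
    """Map each gram to the ids of the strings that contain it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index[g].append(sid)
    return index


def search(query, k, strings, index, q):
    """Return all strings within edit distance k of the query."""
    bound = (len(query) - q + 1) - k * q
    if bound <= 0:
        # The filter cannot prune anything; fall back to a full scan.
        candidates = range(len(strings))
    else:
        hits = Counter()
        # Counting every gram occurrence of the query may overestimate the
        # number of common grams, which is safe: candidates are verified.
        for g in qgrams(query, q):
            for sid in index.get(g, ()):
                hits[sid] += 1
        candidates = [sid for sid, c in hits.items() if c >= bound]
    return sorted(strings[sid] for sid in candidates
                  if edit_distance(query, strings[sid]) <= k)


strings = ["stick", "stich", "such", "stuck"]   # hypothetical collection
index = build_index(strings, 2)
matches = search("shtick", 1, strings, index, 2)
```

With q = 2 and k = 1 the query has 5 grams and the bound is 3, so only strings sharing at least three of the grams sh, ht, ti, ic, ck survive the filter.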
8
2-grams → 3-grams?

Query: shtick, ED(shtick, ?) ≤ 1

3-grams of the query: sht hti tic ick
[Figure: inverted lists of the grams tic, ick]
# of common grams ≥ 1
9
Outline

Motivation
VGRAM
Main idea
Decomposing strings to grams
Choosing good grams
Effect of edit operations on grams
Adopting vgram in existing algorithms
Experiments

10
Observation 1: dilemma of choosing q

Increasing q causes:
Longer grams → Shorter lists
Smaller # of common grams of similar strings

11
Observation 2: skewed distributions of gram frequencies

DBLP: 276,699 article titles
Popular 5-grams: ation (>114K times), tions, ystem, catio

12
VGRAM Main idea

Grams with variable lengths (between qmin and qmax)
zebra: ze(123)
corrasion: co(5213), cor(859), corr(171)
Advantages:
Reducing index size
Reducing running time
Adoptable by many algorithms

13
Challenges

Generating variable-length grams?
Constructing a high-quality gram dictionary?
Relationship between string similarity and their
gram-set similarity?
Adopting VGRAM in existing algorithms?

14
Challenge 1: String → Variable-length grams?

Fixed-length 2-grams

u n i v e r s a l

Variable-length grams

u n i v e r s a l
15
Representing gram dictionary as a trie
Gram dictionary: ni, ivr, sal, uni, vers
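
One plausible reading of how a string is decomposed with such a dictionary is a greedy longest-match scan: at each position take the longest dictionary gram starting there (falling back to the qmin-gram if none matches), and skip any gram that is fully contained in an earlier one. The sketch below follows that reading. It uses a Python set in place of the trie, 0-based positions, and a hypothetical function name `vgen`; qmin = 2 and qmax = 4 are assumed for the example.

```python
def vgen(s, dictionary, qmin, qmax):
    """Decompose s into positional variable-length grams (start, gram)."""
    grams = []
    covered_end = 0  # exclusive end of the furthest-reaching gram kept so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # default when no dictionary gram starts at p
        # Try the longest possible gram first.
        for q in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + q] in dictionary:
                gram = s[p:p + q]
                break
        end = p + len(gram)
        # Earlier grams all start before p, so this gram is contained in one
        # of them exactly when it does not reach past covered_end.
        if end > covered_end:
            grams.append((p, gram))
            covered_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
result = vgen("universal", dictionary, qmin=2, qmax=4)
# result: [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

Note that "ni" is in the dictionary but does not appear in the result, because it lies inside the earlier gram "uni", and "iv" appears even though only "ivr" is in the dictionary, because the qmin-gram is used as a fallback.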
16
Challenge 2: Constructing gram dictionary
Step 1: Collecting frequencies of grams with length in [qmin, qmax]
st → 0, 1, 3; sti → 0, 1; stu → 3; stic → 0, 1; stuc → 3
[Figure: gram trie with frequencies]
17
Step 2: selecting grams

Pruning trie using a frequency threshold T (e.g., 2)

18
Step 2: selecting grams (cont)
Threshold T = 2
19
Final gram dictionary
[2,4]-grams
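
The two steps above can be sketched roughly as follows. Step 1 counts every gram with length in [qmin, qmax]. Step 2 here uses a deliberately simplified pruning rule, assumed for illustration only: a gram is extended to longer grams only while its frequency exceeds T, and the grams where extension stops form the dictionary. The actual trie-pruning procedure in the talk may differ in detail. The four strings are a hypothetical collection chosen to be consistent with the lists shown on slide 16, and the function names are made up.

```python
from collections import Counter


def gram_frequencies(strings, qmin, qmax):
    """Step 1: count occurrences of every gram with length in [qmin, qmax]."""
    freq = Counter()
    for s in strings:
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                freq[s[i:i + q]] += 1
    return freq


def select_grams(freq, qmin, qmax, threshold):
    """Step 2 (simplified): stop extending a gram once it is rare enough."""
    selected = set()
    frontier = [g for g in freq if len(g) == qmin]
    while frontier:
        g = frontier.pop()
        extensions = [h for h in freq
                      if len(h) == len(g) + 1 and h.startswith(g)]
        if freq[g] <= threshold or len(g) == qmax or not extensions:
            selected.add(g)             # keep g and prune everything below it
        else:
            frontier.extend(extensions)  # g is too frequent; look at longer grams
    return selected


strings = ["stick", "stich", "such", "stuck"]  # hypothetical collection
freq = gram_frequencies(strings, qmin=2, qmax=4)
dictionary = select_grams(freq, qmin=2, qmax=4, threshold=2)
```

With T = 2, the gram "st" occurs three times and is therefore replaced by its longer extensions "sti" and "stu", while grams such as "ck" that occur at most twice are kept at length 2.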
20
Outline

Motivation
VGRAM
Main idea
Decomposing strings to grams
Choosing good grams
→ Effect of edit operations on grams
Adopting vgram in existing algorithms
Experiments

21
Challenge 3: Effect of edit operations on grams
Fixed length q
u n i v e r s a l

k operations could affect k × q grams

22
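Deletion affects variable-length grams
[Figure: a deletion at position i, with positions i - qmax + 1 and i + qmax - 1 marked. Grams lying entirely outside this window are labeled "Not affected"; grams inside it are labeled "Affected".]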
23
Grams affected by a deletion
[Figure: a deletion at position i in u n i v e r s a l, with positions i - qmax + 1 and i + qmax - 1 marked; grams inside the window are labeled "Affected?"]
24
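Grams affected by a deletion (cont)
[Figure: the same window around the deletion at position i, from i - qmax + 1 to i + qmax - 1, labeled "Affected?", together with a trie of grams and a trie of reversed grams]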
25
# of grams affected by each operation
Deletion/substitution (at characters), insertion (at gaps, shown as _)
0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0
_ u _ n _ i _ v _ e _ r _ s _ a _ l _
26
Max # of grams affected by k operations
Vector of s: <2,4,6,8,9>
With 2 edit operations, at most 4 grams can be affected

Called NAG vector (# of affected grams)
Precomputed and stored

27
Summary of VGRAM index
28
Challenge 4: adopting VGRAM

Easily adoptable by many algorithms
Basic interfaces:
String s → grams
Strings s1, s2 such that ed(s1,s2) ≤ k → min # of their common grams

29
Lower bound on # of common grams
Fixed length (q)
u n i v e r s a l

If ed(s1,s2) ≤ k, then their # of common grams ≥ (|s1| - q + 1) - k × q

Variable lengths: (# of grams of s1) - NAG(s1,k)
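
The two lower bounds can be written as small helper functions. This is a sketch with hypothetical function names; the NAG vector is assumed to be supplied as a precomputed list whose entry k - 1 holds NAG(s, k), as on slide 26, and negative results are clamped to 0 since a negative bound carries no information.

```python
def lower_bound_fixed(length, q, k):
    """Fixed-length q-grams: (|s| - q + 1) - k * q, clamped at 0."""
    return max(0, (length - q + 1) - k * q)


def lower_bound_vgram(num_grams, nag, k):
    """Variable-length grams: (# of grams of s) - NAG(s, k), clamped at 0."""
    if k == 0:
        return num_grams  # no edits, so every gram must be shared
    return max(0, num_grams - nag[k - 1])


# "shtick" with 2-grams and k = 1: (6 - 2 + 1) - 1 * 2 = 3
b2 = lower_bound_fixed(len("shtick"), 2, 1)

# "shtick" with 3-grams and k = 1: (6 - 3 + 1) - 1 * 3 = 1
b3 = lower_bound_fixed(len("shtick"), 3, 1)

# "universal" decomposed into 4 variable-length grams, using the
# NAG vector <2,4,6,8,9> from slide 26 and k = 1: 4 - 2 = 2
bv = lower_bound_vgram(4, [2, 4, 6, 8, 9], 1)
```

The clamp matters in practice: with k = 2 the variable-length bound for "universal" is already 0, which means the count filter alone cannot prune candidates for that string.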
30
Example algorithm using inverted lists

Query: shtick, ED(shtick, ?) ≤ 1

2-grams: sh ht ti ic ck (lower bound = 3)
[2,4]-grams: sh ht tick (lower bound = 1)
31
PartEnum + VGRAM

PartEnum, fixed q-grams:
ed(s1,s2) ≤ k
→ hamming(grams(s1),grams(s2)) ≤ k × q
VGRAM:
ed(s1,s2) ≤ k
→ hamming(VG(s1),VG(s2)) ≤ NAG(s1,k) + NAG(s2,k)

32
PartEnum + VGRAM (naïve)
[Figure: tables R and S]
Bm(S) = max(NAG(s,k))
Bm(R) = max(NAG(r,k))

Both are using the same gram dictionary.
Use Bm(R) + Bm(S) as the new hamming bound.

33
PartEnum + VGRAM (optimization)
[Figure: table R split into groups R1, R2, R3, and table S]
R1 with Bm(R1)
R2 with Bm(R2)
R3 with Bm(R3)
Bm(S) = max(NAG(s,k))

Group R based on the NAG(r,k) values
Join(R1,S) using Bm(R1) + Bm(S)
Similarly, Join(R2,S), Join(R3,S)
Local bounds tighter → better signatures generated
Grouping S also possible.
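
The difference between the naive bound and the grouped bounds can be sketched as below. The slide does not say how the groups are formed, so this example assumes the simplest choice, one group per distinct NAG(r,k) value. The NAG values are made-up numbers supplied as dictionaries, and the function names are hypothetical.

```python
from collections import defaultdict


def naive_bound(nag_r, nag_s):
    """Single hamming bound for the whole join: Bm(R) + Bm(S)."""
    return max(nag_r.values()) + max(nag_s.values())


def grouped_bounds(nag_r, nag_s):
    """One hamming bound per group of R, grouping by NAG(r, k) value."""
    bm_s = max(nag_s.values())
    groups = defaultdict(list)
    for r, n in nag_r.items():
        groups[n].append(r)
    # Within a group every string has the same NAG value n, so Bm(Ri) = n.
    return {n: (sorted(members), n + bm_s) for n, members in groups.items()}


# Hypothetical precomputed NAG(., k) values for a fixed k
nag_r = {"r1": 2, "r2": 2, "r3": 4}
nag_s = {"s1": 3, "s2": 2}

whole = naive_bound(nag_r, nag_s)        # 4 + 3
per_group = grouped_bounds(nag_r, nag_s)  # group 2 gets 2 + 3, group 4 gets 4 + 3
```

In this toy input, two of the three strings in R are joined with a bound of 5 instead of 7, which is the tightening the slide refers to.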

34
Outline

Motivation
VGRAM
Main idea
Decomposing strings to grams
Choosing good grams
Effect of edit operations on grams
Adopting vgram in existing algorithms
Experiments

35
Data sets

Data set 1: Texas Real Estate Commission.
151K person names, average length 33.
Data set 2: English dictionary from the Aspell spellchecker for Cygwin.
149,165 words, average length 8.
Data set 3: DBLP Bibliography.
277K titles, average length 62.

36
VGRAM overhead (index size)
Dataset 3: DBLP titles
37
VGRAM overhead (construction time)
Dataset 3: DBLP titles
38
Benefits over fixed-length grams (index)
Dataset 1: Person names
39
Benefits over fixed-length grams (running time)
Dataset 1: Person names
40
Effect of qmax
Dataset 1: Person names
41
Effect of frequency threshold T
Dataset 1: Person names
42
Improving algorithm ProbeCount

Dataset 1: Person names
43
Improving algorithm ProbeCluster

Dataset 1: Person names
44
Improving algorithm PartEnum

Dataset 1: Person names
45
Discussions