VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams

Provided by: chenlibinw
Learn more at: https://www.ics.uci.edu
1
VGRAM: Improving Performance of Approximate
Queries on String Collections Using
Variable-Length Grams
  • Chen Li, Bin Wang, and Xiaochun Yang

Northeastern University, China
2
Approximate string selections
Keanu Reeves
Samuel Jackson
Schwarzenegger
Samuel Jackson

Schwarrzenger
  • Query errors
    • Limited knowledge about data
    • Typos
    • Limited input devices (cell phones)
  • Data errors
    • Typos
    • Web data
    • OCR
  • Applications
    • Spellchecking
    • Query relaxation

3
Approximate string joins
R
S
infromix

mcrosoft

informix
microsoft

  • Edit distance
  • Jaccard
  • Cosine

Record linkage
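Edit distance is the measure used in the rest of the talk. As a minimal sketch (the function name `edit_distance` is ours, not the authors'), the standard Levenshtein dynamic program can be written as:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

With the pairs on this slide, `edit_distance("infromix", "informix")` is 2 (the swapped letters cost two substitutions) and `edit_distance("mcrosoft", "microsoft")` is 1.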
4
Goal
  • Reducing index size (memory)
  • Reducing running time

5
q-grams of strings
u n i v e r s a l
2-grams
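A minimal sketch of fixed-length gram extraction, assuming plain sliding-window q-grams without padding characters (the helper name `qgrams` is hypothetical):

```python
def qgrams(s: str, q: int) -> list:
    """Return the positional q-grams of s as (start, gram) pairs, 0-based."""
    return [(i, s[i:i + q]) for i in range(len(s) - q + 1)]

# The 2-grams of the string on this slide.
grams = [g for _, g in qgrams("universal", 2)]
```

A string of length n has n - q + 1 such grams, so "universal" yields 8 grams for q = 2: un, ni, iv, ve, er, rs, sa, al.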
6
q-gram inverted lists
7
Searching using inverted lists
  • Query: shtick, ED(shtick, ?) ≤ 1

2-grams of the query: sh ht ti ic ck
Inverted lists shown: ic, ck, ti
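The search on this slide can be sketched as a count filter over inverted lists. The sketch assumes the bound given later in the deck (slide 29): a string within edit distance k of the query shares at least (|query| - q + 1) - k × q grams with it. The four-string collection and all function names are hypothetical; surviving candidates would still need to be verified with a real edit-distance computation.

```python
from collections import Counter, defaultdict

def qgrams(s, q):
    """All q-grams of s (sliding window, no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q):
    """Inverted lists: gram -> {string id: occurrences of the gram}."""
    index = defaultdict(dict)
    for sid, s in enumerate(strings):
        for gram, count in Counter(qgrams(s, q)).items():
            index[gram][sid] = count
    return index

def candidates(query, k, q, index, n_strings):
    """Ids of strings sharing enough grams with the query to possibly
    be within edit distance k (candidates only, not verified)."""
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        # The filter cannot prune anything; every string is a candidate.
        return list(range(n_strings))
    common = Counter()
    for gram, count in Counter(qgrams(query, q)).items():
        for sid, c in index.get(gram, {}).items():
            common[sid] += min(count, c)  # multiset intersection size
    return sorted(sid for sid, n in common.items() if n >= threshold)

strings = ["stick", "stich", "such", "stuck"]  # hypothetical collection
cands_2 = candidates("shtick", 1, 2, build_index(strings, 2), len(strings))
cands_3 = candidates("shtick", 1, 3, build_index(strings, 3), len(strings))
```

On this toy collection the 2-gram filter (threshold 3) keeps only "stick", while the 3-gram filter (threshold 1) also lets "stich" through, which illustrates the weaker pruning of longer grams.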
8
2-grams → 3-grams?
  • Query: shtick, ED(shtick, ?) ≤ 1

3-grams of the query: sht hti tic ick
Inverted lists shown: ick, tic
# of common grams ≥ 1
9
Outline
  • Motivation
  • VGRAM
  • Main idea
  • Decomposing strings to grams
  • Choosing good grams
  • Effect of edit operations on grams
  • Adopting vgram in existing algorithms
  • Experiments

10
Observation 1: dilemma of choosing q
  • Increasing q causes
    • Longer grams → shorter lists
    • Smaller # of common grams of similar strings

11
Observation 2: skewed distributions of gram
frequencies
  • DBLP: 276,699 article titles
  • Popular 5-grams: ation (>114K times), tions,
    ystem, catio

12
VGRAM: Main idea
  • Grams with variable lengths (between qmin and
    qmax)
    • zebra: ze(123)
    • corrasion: co(5213), cor(859), corr(171)
  • Advantages
    • Reduce index size
    • Reduce running time
    • Adoptable by many algorithms

13
Challenges
  • Generating variable-length grams?
  • Constructing a high-quality gram dictionary?
  • Relationship between string similarity and their
    gram-set similarity?
  • Adopting VGRAM in existing algorithms?

14
Challenge 1: String → variable-length grams?
  • Fixed-length 2-grams

u n i v e r s a l
  • Variable-length grams

u n i v e r s a l
15
Representing gram dictionary as a trie
Grams in the dictionary: ni, ivr, sal, uni, vers
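The slides do not spell out the decomposition procedure, so the following is a sketch under an assumed greedy rule: at each position take the longest dictionary gram starting there (falling back to the qmin-gram when none matches), and skip any gram that lies entirely inside a gram already emitted. The function name `vgrams` and the use of a Python set in place of a trie are our simplifications.

```python
def vgrams(s, dictionary, qmin, qmax):
    """Decompose s into variable-length grams as (start, gram) pairs, 0-based."""
    grams = []
    for p in range(len(s) - qmin + 1):
        # Default to the qmin-gram, then try longer dictionary grams first.
        gram = s[p:p + qmin]
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        end = p + len(gram)
        # Skip grams positionally contained in an earlier, longer gram.
        if any(start <= p and end <= start + len(g) for start, g in grams):
            continue
        grams.append((p, gram))
    return grams

dictionary = {"ni", "ivr", "sal", "uni", "vers"}  # grams listed on this slide
result = vgrams("universal", dictionary, qmin=2, qmax=4)
```

Under these assumptions "universal" decomposes into four grams (uni, iv, vers, sal) instead of eight 2-grams. A trie, as on the slide, would make the longest-match lookup a single downward walk rather than repeated set probes.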
16
Challenge 2: Constructing gram dictionary
Step 1: Collecting frequencies of grams with
length in [qmin, qmax]
st → 0, 1, 3; sti → 0, 1; stu → 3; stic → 0, 1; stuc → 3
Gram trie with frequencies
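Step 1 can be sketched with a plain dictionary standing in for the trie. The four strings below are a hypothetical collection chosen to be consistent with the lists on this slide, and `gram_occurrences` is an illustrative name; a real implementation would more likely store counts in trie nodes.

```python
from collections import defaultdict

def gram_occurrences(strings, qmin, qmax):
    """Map every gram with length in [qmin, qmax] to the ids of the
    strings it occurs in (one entry per occurrence)."""
    occ = defaultdict(list)
    for sid, s in enumerate(strings):
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                occ[s[i:i + q]].append(sid)
    return occ

strings = ["stick", "stich", "such", "stuck"]  # hypothetical string ids 0..3
occ = gram_occurrences(strings, qmin=2, qmax=4)
freq = {gram: len(ids) for gram, ids in occ.items()}
```

Here `occ["st"]` is `[0, 1, 3]` and `occ["stuc"]` is `[3]`, matching the slide; `freq` holds the counts that Step 2 compares against the threshold T.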
17
Step 2: selecting grams
  • Pruning trie using a frequency threshold T (e.g.,
    2)

18
Step 2: selecting grams (cont)
Threshold T = 2
19
Final gram dictionary
[2,4]-grams
20
Outline
  • Motivation
  • VGRAM
  • Main idea
  • Decomposing strings to grams
  • Choosing good grams
  • → Effect of edit operations on grams
  • Adopting vgram in existing algorithms
  • Experiments

21
Challenge 3: Edit operations' effect on grams
Fixed length q
u n i v e r s a l
  • k operations could affect k × q grams

22
Deletion affects variable-length grams
Deletion at position i
Affected: grams within [i - qmax + 1, i + qmax - 1]
Not affected: grams entirely outside this window (on either side)
23
Grams affected by a deletion
Deletion at position i
Window: [i - qmax + 1, i + qmax - 1]
Example string: u n i v e r s a l
Are the grams inside the window affected?
24
Grams affected by a deletion (cont)
Deletion at position i
Window: [i - qmax + 1, i + qmax - 1]
Are the grams inside the window affected?
Trie of grams
Trie of reversed grams
25
# of grams affected by each operation
(letter = deletion/substitution of that character; _ = insertion at that gap)
Position:    _  u  _  n  _  i  _  v  _  e  _  r  _  s  _  a  _  l  _
# affected:  0  1  1  1  1  2  1  2  2  2  1  1  1  2  1  1  1  1  0
26
Max # of grams affected by k operations
Vector of s: <2,4,6,8,9>
With 2 edit operations, at most 4 grams can be
affected
  • Called NAG vector (# of affected grams)
  • Precomputed and stored

27
Summary of VGRAM index
28
Challenge 4: adopting VGRAM
  • Easily adoptable by many algorithms
  • Basic interfaces
    • String s → grams
    • Strings s1, s2 such that ed(s1,s2) ≤ k → min # of
      their common grams

29
Lower bound on # of common grams
Fixed length (q)
u n i v e r s a l
  • If ed(s1,s2) ≤ k, then their # of common grams
    ≥ (|s1| - q + 1) - k × q

Variable lengths: # of grams of s1 - NAG(s1,k)
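Both bounds on this slide are simple arithmetic once the NAG vector is available. The sketch below assumes the NAG vector is a precomputed list whose (k-1)-th entry is NAG(s,k); the function names and the clamp at zero are our additions, and the one-entry NAG vector used for "shtick" is a hypothetical value chosen to agree with the lower bound shown on the next slide.

```python
def lower_bound_fixed(s, q, k):
    """Fixed-length grams: (|s| - q + 1) - k * q, clamped at 0."""
    return max(0, (len(s) - q + 1) - k * q)

def lower_bound_vgram(num_grams, nag, k):
    """Variable-length grams: (# of grams of s) - NAG(s, k), clamped at 0.
    nag[k - 1] is assumed to hold the precomputed NAG(s, k)."""
    if k == 0:
        return num_grams
    return max(0, num_grams - nag[k - 1])

fixed_2 = lower_bound_fixed("shtick", q=2, k=1)    # 5 grams, 2 may be lost
fixed_3 = lower_bound_fixed("shtick", q=3, k=1)    # 4 grams, 3 may be lost
vgram_1 = lower_bound_vgram(3, nag=[2], k=1)       # sh, ht, tick; assumed NAG = 2
```

A bound of 0 means the count filter cannot prune for that k, which is why the clamp is harmless.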
30
Example algorithm using inverted lists
  • Query: shtick, ED(shtick, ?) ≤ 1

2-4 grams of the query: sh ht tick
Inverted list shown: tick
Lower bound with 2-grams: 3
Lower bound with 2-4 grams: 1
31
PartEnum + VGRAM
  • PartEnum, fixed q-grams
    • ed(s1,s2) ≤ k
    • → hamming(grams(s1),grams(s2)) ≤ k × q
  • VGRAM
    • ed(s1,s2) ≤ k
    • → hamming(VG(s1),VG(s2)) ≤ NAG(s1,k) +
      NAG(s2,k)

32
PartEnum + VGRAM (naïve)
R
S
Bm(S) = max(NAG(s,k))
Bm(R) = max(NAG(r,k))
  • Both are using the same gram dictionary.
  • Use Bm(R) + Bm(S) as the new hamming bound.

33
PartEnum + VGRAM (optimization)
R
S
R1 with Bm(R1)
R2 with Bm(R2)
Bm(S) = max(NAG(s,k))
R3 with Bm(R3)
  • Group R based on the NAG(r,k) values
  • Join(R1,S) using Bm(R1) + Bm(S)
  • Similarly, Join(R2,S), Join(R3,S)
  • Local bounds tighter → better signatures
    generated
  • Grouping S also possible.
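The grouping idea can be sketched as bookkeeping over NAG values. This sketch assumes one group per distinct NAG(r,k) value (the slide shows three groups but does not say how they are formed), and the NAG numbers below are made up for illustration.

```python
from collections import defaultdict

def naive_bound(nag_r, nag_s):
    """One global hamming bound: Bm(R) + Bm(S)."""
    return max(nag_r) + max(nag_s)

def grouped_bounds(nag_r, nag_s):
    """Group the strings of R by NAG(r, k) and return, per group,
    (local hamming bound Bm(Ri) + Bm(S), ids of the strings in Ri)."""
    bm_s = max(nag_s)
    groups = defaultdict(list)
    for rid, nag in enumerate(nag_r):
        groups[nag].append(rid)
    return [(nag + bm_s, ids) for nag, ids in sorted(groups.items())]

nag_r = [2, 4, 2, 3]  # hypothetical NAG(r, k) for four strings of R
nag_s = [3, 2]        # hypothetical NAG(s, k) for two strings of S
global_bound = naive_bound(nag_r, nag_s)
local_bounds = grouped_bounds(nag_r, nag_s)
```

With these values the naive join uses a bound of 7 for every pair, whereas the grouped joins use 5, 6, and 7, so only the group with the largest NAG value pays the loosest bound.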

34
Outline
  • Motivation
  • VGRAM
  • Main idea
  • Decomposing strings to grams
  • Choosing good grams
  • Effect of edit operations on grams
  • Adopting vgram in existing algorithms
  • Experiments

35
Data sets
  • Data set 1: Texas Real Estate Commission.
    • 151K person names, average length 33.
  • Data set 2: English dictionary from the Aspell
    spellchecker for Cygwin.
    • 149,165 words, average length 8.
  • Data set 3: DBLP Bibliography.
    • 277K titles, average length 62.

36
VGRAM overhead (index size)
Dataset 3: DBLP titles
37
VGRAM overhead (construction time)
Dataset 3: DBLP titles
38
Benefits over fixed-length grams (index)
Dataset 1: Person names
39
Benefits over fixed-length grams (running time)
Dataset 1: Person names
40
Effect of qmax
Dataset 1: Person names
41
Effect of frequency threshold T
Dataset 1: Person names
42
Improving algorithm ProbeCount

Dataset 1: Person names
43
Improving algorithm ProbeCluster

Dataset 1: Person names
44
Improving algorithm PartEnum

Dataset 1: Person names
45
Discussions
  • Dynamic maintenance
  • Edit distance variants
  • Approximate substring queries
  • Block moves
  • Using VGRAM in DBMS

46
Conclusions
  • VGRAM: using grams of
    • variable length
    • high quality
  • Adoptable in existing algorithms
  • Reduce index size
  • Reduce running time