1
Indexing Mixed Types for Approximate Retrieval
  • Liang Jin, UC Irvine
  • Nick Koudas, University of Toronto
  • Chen Li, UC Irvine
  • Anthony K.H. Tung, National University of
    Singapore

VLDB 2005. Liang Jin and Chen Li: supported by
NSF CAREER Award IIS-0238586
2
Queries with Mixed-Type Predicates
SELECT * FROM Movies WHERE star SIMILARTO
'Schwarrzenger' AND |year - 1980| < 5
  • SIMILARTO
  • a domain-specific function
  • returns a similarity value between two strings
  • Example: edit distance ed(Tom Hanks,
    Ton Hank) = 2
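
A minimal sketch of the edit distance used in the example above, assuming the standard Levenshtein definition (unit-cost insert, delete, and substitute). The function name `edit_distance` is illustrative and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Substitute 'm' -> 'n' and delete the trailing 's'.
print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```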

3
Why fuzzy predicates?
  • Errors in queries
  • User doesn't remember a string exactly
  • User types a wrong string

4
Problem Formulation
Given: A query with fuzzy predicates on strings
and range predicates on numeric attributes on a
single relation
Goal: Answer the query efficiently
SELECT * FROM Movies WHERE star SIMILARTO
'Schwarrzenger' AND |year - 1980| < 5
5
Rest of the talk
  • Motivation: supporting queries with mixed-type
    predicates
  • Our approach: MAT tree
  • Construction and maintenance of MAT tree
  • Experiments

6
Assumptions
  • One fuzzy string predicate (edit distance)
  • One numeric predicate

Query: (Qs, ds, Qn, dn)
SELECT * FROM Movies WHERE star SIMILARTO
'Schwarrzenger' AND |year - 1980| < 5
(Schwarrzenger, 2, 1980, 5)
7
Intuition of MAT (Mixed-attribute-type) Tree
  • 2 > 1 + 1
  • One integrated indexing structure is better than
    two independent indexing structures on two
    attributes
  • Indexing numeric attributes: B-tree or R-tree
  • Indexing strings as a tree to support fuzzy
    predicates?

MAT tree
8
Answering a query (Qs, ds, Qn, dn)
  • Traverse the MAT-tree top-down
  • At each node, prune by checking:
  • If [Qn - dn, Qn + dn] overlaps with the node's numeric
    range.
  • If minEditDistance(Qs, Tn) < ds.
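
The traversal above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the `leaf`/`inner` helpers, and the toy records are hypothetical; each node stores a plain set of strings as a stand-in for its compressed trie Tn; and both thresholds are treated as inclusive (distance <= ds, |n - Qn| <= dn), which may differ from the strict comparison written on the slide.

```python
from dataclasses import dataclass, field


def edit_distance(a, b):
    """Levenshtein distance (unit-cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


@dataclass
class Node:
    lo: int          # smallest numeric value below this node
    hi: int          # largest numeric value below this node
    strings: set     # stand-in for the node's compressed trie Tn
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)  # (string, number) at a leaf


def leaf(records):
    nums = [n for _, n in records]
    return Node(min(nums), max(nums), {s for s, _ in records},
                records=list(records))


def inner(children):
    return Node(min(c.lo for c in children), max(c.hi for c in children),
                set().union(*(c.strings for c in children)),
                children=list(children))


def search(node, qs, ds, qn, dn):
    """Return the records below node that satisfy both predicates."""
    # Numeric pruning: skip the subtree if the ranges do not overlap.
    if node.hi < qn - dn or node.lo > qn + dn:
        return []
    # String pruning: skip the subtree if even its closest string is too far.
    if min(edit_distance(qs, s) for s in node.strings) > ds:
        return []
    if not node.children:
        return [(s, n) for s, n in node.records
                if abs(n - qn) <= dn and edit_distance(qs, s) <= ds]
    results = []
    for child in node.children:
        results.extend(search(child, qs, ds, qn, dn))
    return results


# Toy records; the numbers are made up for illustration.
root = inner([leaf([("Tom Hanks", 1978), ("Tom Hanks", 1990)]),
              leaf([("Samuel Jackson", 1981)])])
print(search(root, "Ton Hank", 2, 1980, 5))  # [('Tom Hanks', 1978)]
```

In this sketch the second leaf is pruned by the string check alone, and the 1990 record is rejected only at the leaf, since its node's numeric range still overlaps the query range.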

9
Challenge
  • How to represent strings to fit into a limited
    space and support fuzzy-predicate pruning

Limited space (disk based)
10
Existing Approaches to Indexing Strings as Trees
  • M-tree
  • Edit distance metric space
  • Q-tree
  • Utilize the q-gram property of strings.
  • See our paper for details

11
Representing strings as a trie
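
As a rough illustration of the idea named in this slide title, the sketch below stores a set of strings in an uncompressed trie, where strings with a common prefix share a path from the root. The `TrieNode` class and the helper names are hypothetical and are not taken from the talk.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.terminal = False  # True if a stored string ends at this node


def build_trie(strings):
    """Insert every string character by character, sharing common prefixes."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root


def contains(root, s):
    """Exact lookup: follow the path for s and check the terminal flag."""
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.terminal


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())


trie = build_trie(["Tom", "Ton", "Tim"])
print(contains(trie, "Ton"), contains(trie, "To"))  # True False
print(count_nodes(trie))  # 7: root, T, o, m, n, i, m
```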
12
Compressing a trie
compression
  • Select k representative nodes (centers).
  • Each center is in the format of
    <alphabet, height>.
  • A compressed trie represents more strings

13
Minimum edit distance between a string and a trie
  • minEditDistance(Qs, Tn)?
  • Convert a trie to an automaton.
  • Compute the min distance between a string and an
    automaton [Myers and Miller, 1989]
  • Early termination possible
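
To make the early-termination idea concrete, here is a sketch that uses the standard row-by-row dynamic program over an uncompressed trie. This is a swapped-in, simpler technique, not the Myers and Miller automaton algorithm cited above, and it does not handle compressed nodes. `TrieNode`, `build_trie`, and the `threshold` parameter are assumptions made for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.terminal = False


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root


def min_edit_distance(query, root, threshold):
    """Smallest edit distance between query and any string in the trie.

    Returns None if every stored string is farther than threshold.
    """
    best = None
    # row[j] = edit distance between the path to the node and query[:j].
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if node.terminal and row[-1] <= threshold:
            best = row[-1] if best is None else min(best, row[-1])
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j, qc in enumerate(query, start=1):
                new_row.append(min(new_row[j - 1] + 1,      # skip qc
                                   row[j] + 1,              # skip ch
                                   row[j - 1] + (qc != ch)))  # match/substitute
            # Early termination: row minima never decrease with depth, so a
            # subtree whose row minimum exceeds the threshold cannot match.
            if min(new_row) <= threshold:
                stack.append((child, new_row))
    return best


trie = build_trie(["Tom Hanks", "Samuel Jackson"])
print(min_edit_distance("Ton Hank", trie, 2))  # 2
print(min_edit_distance("Ton Hank", trie, 1))  # None
```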

14
Compressed trie → Automaton
  • Each node is a state.
  • Each edge becomes a transition between two
    states.
  • For compressed node <S, L>, expand it to L
    levels. At each level, all characters in S become
    single states and are connected to a common tail
    ε.

Convert a compressed node <{a,b,c}, 2> into
automaton nodes.
15
Outline
  • Motivation: supporting queries with mixed-type
    predicates
  • Our approach: MAT tree
  • Construction and maintenance of MAT tree
  • Experiments

16
Constructing MAT-tree
  • Option 1: insert records one by one.
  • Option 2:
  • bulk-load records
  • construct the MAT-tree bottom-up

17
Compressing a trie
  • Important
  • Accurately represent strings in a limited space.
  • Minimize information loss.
  • Maintain the pruning power during a traversal.
  • Three methods
  • (1) Reducing the number of accepted strings
  • (2) Keeping accepted strings clustered
  • (3) Combining (1) and (2)

18
Method (1): Reducing the number of accepted strings
  • Intuition
  • reducing this number makes the compressed trie more
    accurate
  • Goodness function: number of accepted strings
  • Algorithm: Randomized
  • Randomly select k initial centers
  • Randomly select one of the centers
  • Randomly select an unselected node
  • Swap them if it can improve the goodness function
  • Repeat for a certain number of iterations
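
The steps of the Randomized algorithm can be sketched as a generic swap-based local search. The sketch assumes a caller-supplied `goodness` function where lower is better (for example, the number of accepted strings), distinct node identifiers, and a fixed `seed` for repeatability; these names and the toy goodness function in the usage lines are hypothetical, and computing the real goodness of a compressed trie is not shown.

```python
import random


def randomized_centers(nodes, k, goodness, iterations, seed=0):
    """Pick k centers by random swaps that improve goodness (lower is better)."""
    rng = random.Random(seed)
    nodes = list(nodes)
    centers = rng.sample(nodes, k)            # k random initial centers
    others = [n for n in nodes if n not in centers]
    best = goodness(centers)
    for _ in range(iterations):
        if not others:
            break
        i = rng.randrange(len(centers))       # a random center
        j = rng.randrange(len(others))        # a random unselected node
        centers[i], others[j] = others[j], centers[i]
        score = goodness(centers)
        if score < best:
            best = score                      # keep the improving swap
        else:
            centers[i], others[j] = others[j], centers[i]  # undo the swap
    return centers


# Toy usage: nodes are integers and goodness is their sum.
chosen = randomized_centers(range(10), 3, sum, iterations=5000)
print(sorted(chosen))
```

Because a swap is kept only when it strictly improves the score, the goodness of the returned centers is never worse than that of the initial random selection.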

19
Method (2): Keeping accepted strings clustered
  • Intuition
  • keeping the accepted strings similar to the
    original ones by letting them share a common
    prefix.
  • Place k centers as close to the root as possible.
  • Algorithm: BreadthFirst

20
Method (3): Combining (1) and (2)
  • Intuition
  • minimize the number of accepted strings, and at
    the same time maintain their similarity to the
    originals.
  • Algorithm: BottomUp
  • Keep shrinking the trie bottom-up until we have k
    nodes
  • Compress a node that minimizes the number of additional
    strings

21
Dynamic maintenance
  • Insertion of (s, n)
  • Search the index for (s, n). If it's not in the
    index, identify the correct leaf node.
  • If no overflow
  • update the MBR of the leaf node and its
    ancestors recursively if necessary.
  • If overflow
  • Split the leaf node and
  • Construct two compressed tries
  • Cascade the split to the ancestors if necessary.
  • Deletion and update are handled similarly

22
Outline
  • Motivation: supporting queries with mixed-type
    predicates
  • Our approach: MAT tree
  • Construction and maintenance of MAT tree
  • Experiments

23
Setting
  • Data
  • IMDB: 100K movie star records (Name and YOB).
  • Customers: 50K records (Name and YOB).
  • Test bed
  • PC: 2.4 GHz P4, 1.2 GB memory, Windows XP
  • Visual C++ compiler
  • Similar results on both data sets; results are reported for IMDB.

24
Implemented approaches
  • B-tree
  • Q-tree
  • B-tree + Q-tree
  • BQ-tree
  • BM-tree
  • Sequential scan
  • BBQ-tree?

25
2 > 1 + 1
An integrated indexing structure is better than
two separate indexing structures
ds = 3, dn = 4
26
Scalability
27
Effect of numeric threshold dn
28
Effect of string threshold ds
29
Dynamic maintenance: time
30
Dynamic maintenance: MAT quality
31
Number of centers
  • Increasing the number of centers may not reduce the running
    time: pruning power versus computational cost
  • For BottomUp and BreadthFirst (compared to
    Randomized)
  • Centers are close to the root, so early
    termination is more likely

32
Conclusion
  • MAT-tree: an efficient indexing structure for
    queries with mixed-type predicates
  • Can be efficiently constructed and maintained
  • Future work: develop a uniform framework to
    support different kinds of similarity functions

The Flamingo Project: http://www.ics.uci.edu/flamingo/
Q&A?