About This Presentation
Title:

Liang Jin, UC Irvine

Slides: 37
Provided by: liang92
Learn more at: http://vldb.idi.ntnu.no
Transcript and Presenter's Notes

1
Indexing Mixed Types for Approximate Retrieval
  • Liang Jin, UC Irvine
  • Nick Koudas, University of Toronto
  • Chen Li, UC Irvine
  • Anthony K.H. Tung, National University of
    Singapore

Liang Jin and Chen Li were supported by NSF CAREER
Award IIS-0238586.
2
Queries with Mixed-Type Predicates
SELECT * FROM Movies WHERE star SIMILARTO
'Schwarrzenger' AND |year - 1980| < 5
  • SIMILARTO
  • a domain-specific function
  • returns a similarity value between two strings
  • Example: edit distance, ed("Tom Hanks",
    "Ton Hank") = 2
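The edit distance used in the example can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch; the function name edit_distance is an illustrative choice and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Substitute m -> n and delete the trailing s: two edits.
distance = edit_distance("Tom Hanks", "Ton Hank")
print(distance)
```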

3
Why fuzzy predicates?
  • Errors in queries
  • User doesn't remember a string exactly
  • User types a wrong string

4
Problem Formulation
Given: A query with fuzzy predicates on strings
and range predicates on numeric attributes on a
single relation
Goal: Answer the query efficiently
SELECT * FROM Movies WHERE star SIMILARTO
'Schwarrzenger' AND |year - 1980| < 5
5
Rest of the talk
  • Motivation: supporting queries with mixed-type
    predicates
  • Our approach: MAT tree
  • Construction and maintenance of MAT tree
  • Experiments

6
Assumptions
  • One fuzzy string predicate (edit distance)
  • One numeric predicate

Query: (Qs, ds, Qn, dn)
SELECT * FROM Movies WHERE star SIMILARTO
'Schwarrzenger' AND |year - 1980| < 5
(Schwarrzenger, 2, 1980, 5)
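As a point of reference, a query tuple (Qs, ds, Qn, dn) can be evaluated without any index by checking both predicates on every record. The sketch below is only an illustration: the names MixedQuery, matches, and sequential_scan are hypothetical, the records are toy data, and the strict comparisons follow the query as transcribed (switch to <= if the thresholds are meant to be inclusive).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MixedQuery:
    qs: str  # Qs: query string
    ds: int  # ds: edit-distance threshold
    qn: int  # Qn: query number
    dn: int  # dn: numeric threshold


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def matches(record, q: MixedQuery) -> bool:
    """Check the numeric predicate first because it is cheaper."""
    name, year = record
    return abs(year - q.qn) < q.dn and edit_distance(name, q.qs) < q.ds


def sequential_scan(records, q: MixedQuery):
    """Baseline: test every record against both predicates."""
    return [r for r in records if matches(r, q)]


# Toy records (name, number); not taken from the data sets in the talk.
records = [("Tom Hanks", 1956), ("Tom Cruise", 1962), ("Ton Hank", 1990)]
hits = sequential_scan(records, MixedQuery("Ton Hank", 3, 1958, 5))
print(hits)
```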
7
Intuition of MAT (Mixed-attribute-type) Tree
  • 2 > 1 + 1
  • One integrated indexing structure is better than
    two independent indexing structures on two
    attributes
  • Indexing numeric attributes: B-tree or R-tree
  • Indexing strings as a tree to support fuzzy
    predicates?

MAT tree
8
Answering a query (Qs, ds, Qn, dn)
  • Traverse the MAT-tree top-down
  • At each node, prune by checking:
  • whether [Qn - dn, Qn + dn] overlaps with the
    node's numeric range
  • whether minEditDistance(Qs, Tn) < ds, where Tn
    is the node's trie
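The traversal can be sketched as a recursive search that skips a subtree as soon as either check fails. This is an illustration under stated assumptions, not the authors' implementation: the Node class and the function names are hypothetical, and each node stores a plain set of strings as a stand-in for the compressed trie Tn, so min_edit_distance here is an exact minimum rather than a bound computed from a compressed structure.

```python
from dataclasses import dataclass, field


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


@dataclass
class Node:
    lo: int             # smallest numeric value in the subtree
    hi: int             # largest numeric value in the subtree
    strings: frozenset  # stand-in for the compressed trie Tn (assumption)
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)  # filled only in leaves


def min_edit_distance(qs, strings):
    """Smallest edit distance from qs to any string under the node."""
    return min(edit_distance(qs, s) for s in strings)


def search(node, qs, ds, qn, dn):
    # Prune: the query interval misses the node's numeric range.
    if node.hi < qn - dn or node.lo > qn + dn:
        return []
    # Prune: no string below this node can be close enough.
    if min_edit_distance(qs, node.strings) >= ds:
        return []
    if not node.children:  # leaf: verify each record
        return [(s, n) for s, n in node.records
                if abs(n - qn) < dn and edit_distance(s, qs) < ds]
    hits = []
    for child in node.children:
        hits.extend(search(child, qs, ds, qn, dn))
    return hits


# Toy two-leaf tree; the second leaf is pruned on the numeric range.
leaf1 = Node(1956, 1962, frozenset({"Tom Hanks", "Tom Cruise"}),
             records=[("Tom Hanks", 1956), ("Tom Cruise", 1962)])
leaf2 = Node(1990, 1996, frozenset({"Emma Watson", "Tom Holland"}),
             records=[("Emma Watson", 1990), ("Tom Holland", 1996)])
root = Node(1956, 1996, leaf1.strings | leaf2.strings,
            children=[leaf1, leaf2])
result = search(root, "Ton Hank", 3, 1958, 5)
print(result)
```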

9
Challenge
  • How to represent strings so that they fit into a
    limited space and still support fuzzy-predicate
    pruning

Limited space (disk based)
10
Existing Approaches to Indexing Strings as Trees
  • M-tree
  • Edit distance: metric space
  • Q-tree
  • Utilize the q-gram property of strings.
  • See our paper for details

11
Representing strings as a trie
12
Compressing a trie
  • Select k representative nodes (centers).
  • Each center is in the format
    <alphabet, height>.
  • A compressed trie represents more strings than
    the original trie.
13
Minimum edit distance between a string and a trie
  • minEditDistance(Qs, Tn)?
  • Convert a trie to an automaton.
  • Compute the minimum distance between a string
    and an automaton [Myers and Miller, 1989]
  • Early termination possible
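One common way to compute this minimum on an uncompressed trie is to carry one row of the edit-distance table down each trie edge and stop descending once every entry in the row exceeds the threshold. The sketch below uses that row-by-row technique, which is a simpler substitute for the automaton-based algorithm cited on the slide; TrieNode, build_trie, and the bound parameter are illustrative names, and bound is treated as inclusive here.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a stored string ends here


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def min_edit_distance(query, root, bound):
    """Return the smallest edit distance between query and any string in
    the trie, or None if every string is farther away than bound."""
    best = None
    first_row = list(range(len(query) + 1))  # row for the empty prefix
    stack = [(root, first_row)]
    while stack:
        node, row = stack.pop()
        if node.is_end and row[-1] <= bound:
            if best is None or row[-1] < best:
                best = row[-1]
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,  # insertion
                                   row[j] + 1,          # deletion
                                   row[j - 1] + cost))  # match/substitute
            # Early termination: row minima never decrease further down,
            # so a subtree whose row is entirely above bound is skipped.
            if min(new_row) <= bound:
                stack.append((child, new_row))
    return best


trie = build_trie(["Tom Hanks", "Tom Cruise", "Tim Allen"])
within_two = min_edit_distance("Ton Hank", trie, 2)
within_one = min_edit_distance("Ton Hank", trie, 1)
print(within_two, within_one)
```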

14
Compressed trie → Automaton
  • Each node is a state.
  • Each edge becomes a transition between two
    states.
  • For a compressed node <S, L>, expand it to L
    levels. At each level, all characters in S become
    single states and are connected to a common tail
    e.

Convert a compressed node <{a,b,c}, 2> into
automaton nodes.
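To see why a compressed node stands for more strings than the subtree it replaces, it can help to enumerate what the expanded automaton would accept. The sketch below assumes that every level can exit through the common tail, so a node <S, L> accepts every string over S of length 1 to L; that reading, and the name expand, are assumptions rather than something the slides state.

```python
from itertools import product


def expand(alphabet, height):
    """Yield every string accepted by a compressed node <alphabet, height>,
    assuming lengths 1..height over the alphabet are all accepted."""
    for length in range(1, height + 1):
        for combo in product(sorted(alphabet), repeat=length):
            yield "".join(combo)


# The node <{a,b,c}, 2>: 3 strings of length 1 plus 9 of length 2.
accepted = list(expand({"a", "b", "c"}, 2))
print(len(accepted), accepted)
```

Under this reading, a subtree that originally held only a handful of strings over {a, b, c} is replaced by a node accepting all twelve, which is the information loss the compression methods try to limit.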
15
Outline
  • Motivation: supporting queries with mixed-type
    predicates
  • Our approach: MAT tree
  • Construction and maintenance of MAT tree
  • Experiments

16
Constructing MAT-tree
  • Option 1: insert records one by one.
  • Option 2: bulk-load records and construct the
    MAT-tree bottom-up.

17
Compressing a trie
  • Important:
  • Accurately represent strings in a limited space.
  • Minimize information loss.
  • Maintain the pruning power during a traversal.
  • Three methods:
  • (1) Reducing the number of accepted strings
  • (2) Keeping accepted strings clustered
  • (3) Combining (1) and (2)

18
Method (1): Reducing the number of accepted strings
  • Intuition: reducing this number makes the
    compressed trie more accurate
  • Goodness function: the number of accepted
    strings
  • Algorithm: Randomized
  • Randomly select k initial centers
  • Randomly select one of the centers
  • Randomly select an unselected node
  • Swap them if doing so improves the goodness
    function
  • Repeat for a certain number of iterations
19
Method (2): Keeping accepted strings clustered
  • Intuition: keep the accepted strings similar to
    the original ones by letting them share a common
    prefix.
  • Place k centers as close to the root as possible.
  • Algorithm: BreadthFirst
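Placing centers as close to the root as possible corresponds to taking nodes in breadth-first order. The sketch below only shows that selection step on a nested-dictionary trie and identifies each node by its prefix; how the remaining nodes are folded into the chosen centers is not covered, and the function names are hypothetical.

```python
from collections import deque


def build_trie(words):
    """Trie as nested dictionaries: character -> subtrie."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def breadth_first_centers(trie, k):
    """Return the prefixes of the first k non-root nodes in
    breadth-first order, i.e. the k nodes closest to the root."""
    centers = []
    queue = deque([("", trie)])
    while queue and len(centers) < k:
        prefix, node = queue.popleft()
        if prefix:  # the root itself is not counted as a center here
            centers.append(prefix)
        for ch in sorted(node):
            queue.append((prefix + ch, node[ch]))
    return centers


picked = breadth_first_centers(build_trie(["ab", "ac", "b"]), 3)
print(picked)
```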

20
Method (3): Combining (1) and (2)
  • Intuition: minimize the number of accepted
    strings, and at the same time maintain their
    similarity to the originals.
  • Algorithm: BottomUp
  • Keep shrinking the trie bottom-up until we have
    k nodes
  • At each step, compress the node that minimizes
    the number of additional strings

21
Dynamic maintenance
  • Insertion of (s, n)
  • Search the index for (s, n). If it is not in the
    index, identify the correct leaf node.
  • If there is no overflow:
  • update the MBR of the leaf node and of its
    ancestors recursively if necessary.
  • If there is an overflow:
  • Split the leaf node
  • Construct two compressed tries
  • Cascade the split to the ancestors if necessary.
  • Deletion and update are handled similarly
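The leaf-level part of an insertion can be sketched as follows. This is a simplified illustration: leaves are plain dictionaries, a set of strings stands in for the compressed trie, the split policy (sort on the numeric attribute and cut at the median) is an assumption, and cascading the change to ancestor nodes is left out.

```python
def make_leaf(records):
    """Build a leaf summary: numeric range plus the set of strings
    (a stand-in for the leaf's compressed trie)."""
    numbers = [n for _, n in records]
    return {"lo": min(numbers), "hi": max(numbers),
            "strings": {s for s, _ in records},
            "records": list(records)}


def insert(leaf, record, capacity):
    """Insert (s, n) into a leaf. Returns a list with one leaf, or with
    two leaves if the insertion caused an overflow and a split."""
    if record in leaf["records"]:
        return [leaf]                      # already indexed
    records = leaf["records"] + [record]
    if len(records) <= capacity:
        return [make_leaf(records)]        # refresh range and strings
    # Overflow: split on the numeric attribute (assumed policy) and
    # rebuild the summaries of both halves.
    records.sort(key=lambda r: r[1])
    mid = len(records) // 2
    return [make_leaf(records[:mid]), make_leaf(records[mid:])]


leaf = make_leaf([("Tom Hanks", 1956), ("Tom Cruise", 1962)])
after_first = insert(leaf, ("Tim Allen", 1953), capacity=3)
after_second = insert(after_first[0], ("Emma Watson", 1990), capacity=3)
print(len(after_first), len(after_second))
```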

22
Outline
  • Motivation: supporting queries with mixed-type
    predicates
  • Our approach: MAT tree
  • Construction and maintenance of MAT tree
  • Experiments

23
Setting
  • Data
  • IMDB: 100K movie star records (name and year of
    birth, YOB)
  • Customers: 50K records (name and YOB)
  • Test bed
  • PC: 2.4 GHz P4, 1.2 GB memory, Windows XP
  • Visual C++ compiler
  • Results are similar on both data sets; we report
    results for IMDB.

24
Implemented approaches
  • B-tree
  • Q-tree
  • B-tree + Q-tree
  • BQ-tree
  • BM-tree
  • Sequential scan
  • BBQ-tree?

25
2 > 1 + 1
An integrated indexing structure is better than
two separate indexing structures
ds = 3, dn = 4
26
Scalability
27
Effect of numeric threshold dn
28
Effect of string threshold ds
29
Dynamic maintenance: time
30
Dynamic maintenance: MAT quality
31
Number of centers
  • Increasing the number of centers may not reduce
    the running time: pruning power versus
    computational cost
  • For BottomUp and BreadthFirst (compared to
    Randomized), centers are close to the root, and
    thus early termination is more likely

32
Conclusion
  • MAT-tree: an efficient indexing structure for
    queries with mixed-type predicates
  • Can be efficiently constructed and maintained
  • Future work: develop a uniform framework to
    support different kinds of similarity functions

The Flamingo Project: http://www.ics.uci.edu/flamingo/
Q&A?
33
Backup Slides
34
Constructing MAT-tree
  • Option 1: inserting records one by one.
  • Option 2: bulk-loading data records and
    constructing the MAT-tree in a bottom-up fashion.
  • Records are sorted based on one attribute.
  • Fill pages with records until full.
  • Calculate the numeric range and the compressed
    trie for each leaf node.
  • Merge leaf nodes into internal nodes recursively
    according to the desired fanout, until a single
    root is formed.
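The bulk-loading steps above can be sketched as a bottom-up build. This is a simplified illustration under stated assumptions: records are sorted on the numeric attribute (the slides only say "one attribute"), nodes are plain dictionaries, and a set of strings stands in for each node's compressed trie; bulk_load, page_size, and fanout are illustrative names.

```python
def bulk_load(records, page_size, fanout):
    """Build a tree bottom-up and return its root node."""
    if not records:
        raise ValueError("need at least one record")
    if page_size < 1 or fanout < 2:
        raise ValueError("page_size must be >= 1 and fanout >= 2")

    # 1. Sort on one attribute (here: the numeric one, by assumption).
    ordered = sorted(records, key=lambda r: r[1])

    # 2-3. Fill pages and compute each leaf's range and string summary.
    level = []
    for i in range(0, len(ordered), page_size):
        page = ordered[i:i + page_size]
        level.append({"lo": min(n for _, n in page),
                      "hi": max(n for _, n in page),
                      "strings": {s for s, _ in page},
                      "records": page, "children": []})

    # 4. Merge groups of `fanout` nodes until a single root remains.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append({"lo": min(c["lo"] for c in group),
                            "hi": max(c["hi"] for c in group),
                            "strings": set().union(
                                *(c["strings"] for c in group)),
                            "records": [], "children": group})
        level = parents
    return level[0]


# Five toy records, two per page, fanout two: 3 leaves -> 2 nodes -> root.
root = bulk_load([("a", 5), ("b", 1), ("c", 3), ("d", 4), ("e", 2)],
                 page_size=2, fanout=2)
print(root["lo"], root["hi"], len(root["children"]))
```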

35
Example Customer Service Call Center
Customer calls in
Serve the customer
Issue a fuzzy query: Name LIKE 'Tom Hanks' AND
YOB CLOSE TO 1958
In this example, the underlying system should be
able to support fuzzy queries on both the string
and numeric attributes!
Return result
36
Scalability test (I/O)