1
Indexing Mixed Types for Approximate Retrieval
  • Liang Jin, UC Irvine
  • Nick Koudas, University of Toronto
  • Chen Li, UC Irvine
  • Anthony K.H. Tung, National University of Singapore

VLDB 2005. Liang Jin and Chen Li supported by
NSF CAREER Award IIS-0238586.
2
Queries with Mixed-Type Predicates
SELECT * FROM Movies WHERE star SIMILARTO
'Schwarrzenger' AND year ≈ 1980
  • SIMILARTO
    • a domain-specific function
    • returns a similarity value between two strings
    • Example: edit distance ed("Tom Hanks",
      "Ton Hank") = 2
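
The edit distance in the example can be checked with a short sketch. This is the textbook Levenshtein dynamic program, not code from the talk; the function name is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The two edits are the substitution m → n and the deletion of the final s.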

    3
    Why fuzzy predicates?
    • Errors in queries
    • User doesn't remember a string exactly
    • User types a wrong string

    4
    Problem Formulation
    Given: a query with fuzzy predicates on strings
    and range predicates on numeric attributes on a
    single relation
    Goal: answer the query efficiently
    SELECT * FROM Movies WHERE star SIMILARTO
    'Schwarrzenger' AND year ≈ 1980
    5
    Rest of the talk
    • Motivation: supporting queries with mixed-type
      predicates
    • Our approach: MAT-tree
    • Construction and maintenance of MAT-tree
    • Experiments

    6
    Assumptions
    • One fuzzy string predicate (edit distance)
    • One numeric predicate

    Query (Qs, ds, Qn, dn): query string Qs, string threshold ds,
    query number Qn, numeric threshold dn
    SELECT * FROM Movies WHERE star SIMILARTO
    'Schwarrzenger' AND year ≈ 1980
    (Schwarrzenger, 2, 1980, 5)
    7
    Intuition of MAT (Mixed-attribute-type) Tree
    • 2 1 1
    • One integrated indexing structure is better than
      two independent indexing structures on two
      attributes
    • Indexing numeric attributes: B-tree or R-tree
    • Indexing strings as a tree to support fuzzy
      predicates?

    MAT tree
    8
    Answering a query (Qs, ds, Qn, dn)
    • Traverse the MAT-tree top-down
    • At each node, descend only if both checks pass
      (otherwise prune the subtree)
      • [Qn - dn, Qn + dn] overlaps with the node's numeric
        range
      • minEditDistance(Qs, Tn) ≤ ds
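
A minimal sketch of this top-down traversal, under stated assumptions: the `Node` class, its fields, and the sample records are made up for illustration, and the string check is assumed to compare minEditDistance against the threshold ds. Each node here keeps a plain set of strings as a stand-in for its compressed trie Tn, so the string check is exact rather than the lower bound a real MAT-tree node would compute.

```python
from dataclasses import dataclass, field

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

@dataclass
class Node:
    lo: float            # smallest number stored below this node
    hi: float            # largest number stored below this node
    strings: set         # stand-in for the compressed trie Tn (assumption)
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)  # (string, number), leaves only

def min_edit_distance(qs: str, node: Node) -> int:
    # Exact minimum over the stored strings; a real node would derive
    # a lower bound from its compressed trie instead.
    return min(edit_distance(qs, s) for s in node.strings)

def search(node, qs, ds, qn, dn):
    # Numeric pruning: [qn - dn, qn + dn] must overlap [lo, hi].
    if node.hi < qn - dn or node.lo > qn + dn:
        return []
    # String pruning: some string below must be within ds of qs.
    if min_edit_distance(qs, node) > ds:
        return []
    if not node.children:  # leaf: verify each record exactly
        return [(s, n) for s, n in node.records
                if abs(n - qn) <= dn and edit_distance(qs, s) <= ds]
    hits = []
    for child in node.children:
        hits.extend(search(child, qs, ds, qn, dn))
    return hits

# Made-up records, for illustration only.
leaf_a = Node(1956, 1977, {"Tom Hanks", "Tom Hardy"},
              records=[("Tom Hanks", 1956), ("Tom Hardy", 1977)])
leaf_b = Node(1961, 1961, {"Meg Ryan"}, records=[("Meg Ryan", 1961)])
root = Node(1956, 1977, {"Tom Hanks", "Tom Hardy", "Meg Ryan"},
            children=[leaf_a, leaf_b])
print(search(root, "Ton Hank", 2, 1955, 3))  # [('Tom Hanks', 1956)]
```

In this run `leaf_b` is pruned by the numeric check alone, so its strings are never compared.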

    9
    Challenge
    • How to represent strings so that they fit into a limited
      space and still support fuzzy-predicate pruning

    Limited space (disk based)
    10
    Existing Approaches to Indexing Strings as Trees
    • M-tree
      • Edit distance forms a metric space
    • Q-tree
      • Utilizes the q-gram property of strings
    • See our paper for details

    11
    Representing strings as a trie
    12
    Compressing a trie
    • Select k representative nodes (centers).
    • Each center is in the format of
      .
    • A compressed trie represents more strings

    13
    Minimum edit distance between a string and a trie
    • minEditDistance(Qs, Tn)?
    • Convert a trie to an automaton.
    • Compute the min distance between a string and an
      automaton [Myers and Miller, 1989]
    • Early termination possible
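
A sketch of the idea on an uncompressed trie. This is not the Myers and Miller automaton algorithm cited on the slide; it swaps in the simpler, well-known technique of carrying one Levenshtein row per trie node. `TrieNode`, `build_trie`, and the sample names are illustrative. Early termination comes from the fact that the minimum of a row never decreases further down the trie, so a branch whose row minimum already reaches the best distance found can be skipped.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_end = False   # True if a stored string ends here

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root

def min_edit_distance(query, root):
    """Smallest edit distance between query and any string in the trie."""
    best = float("inf")
    first_row = list(range(len(query) + 1))  # row for the empty prefix
    if root.is_end:
        best = first_row[-1]
    stack = [(root, first_row)]
    while stack:
        node, row = stack.pop()
        if min(row) >= best:      # early termination for this branch
            continue
        for ch, child in node.children.items():
            new = [row[0] + 1]
            for j in range(1, len(query) + 1):
                new.append(min(new[j - 1] + 1,   # skip a query character
                               row[j] + 1,       # skip the trie character
                               row[j - 1] + (query[j - 1] != ch)))
            if child.is_end:
                best = min(best, new[-1])
            if min(new) < best:   # descendants can still improve on best
                stack.append((child, new))
    return best

trie = build_trie(["Tom Hanks", "Tom Hardy", "Meg Ryan"])
print(min_edit_distance("Ton Hank", trie))  # 2
```

With a query threshold ds, the same pruning test can compare the row minimum against ds instead of the best distance so far.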

    14
    Compressed trie → Automaton
    • Each node is a state.
    • Each edge becomes a transition between two
      states.
    • For a compressed node, expand it to L
      levels. At each level, all characters in S become
      single states and are connected to a common tail
      e.

    Convert a compressed node into
    automaton nodes.
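
A rough sketch of this expansion, assuming a compressed node is summarized by a character set S and a level count L. The helper names, the state naming, and the use of an ε-edge (label `None`) to reach the tail are assumptions made for illustration; the slide does not spell them out.

```python
def summarize(suffixes):
    """Collapse the suffixes under a node into (S, L): the set of
    characters that occur and the length of the longest suffix."""
    S = set("".join(suffixes))
    L = max(map(len, suffixes), default=0)
    return S, L

def expand(S, L):
    """Expand (S, L) into L levels with one state per character of S.
    Every level state gets an ε-edge (label None) to the common tail."""
    nfa = {"start": [(ch, (1, ch)) for ch in S] if L > 0 else [],
           "tail": []}
    for level in range(1, L + 1):
        for ch in S:
            edges = [(None, "tail")]
            if level < L:
                edges += [(nxt, (level + 1, nxt)) for nxt in S]
            nfa[(level, ch)] = edges
    return nfa

def accepts(nfa, s):
    """Simulate the automaton on s; accept if the tail is reachable."""
    current = {"start"}
    for ch in s:
        current = {dst for st in current
                   for label, dst in nfa[st] if label == ch}
    return any(label is None and dst == "tail"
               for st in current for label, dst in nfa[st])

S, L = summarize(["nks", "rdy"])   # e.g. suffixes below the prefix "Tom Ha"
nfa = expand(S, L)
print(accepts(nfa, "nks"), accepts(nfa, "nky"), accepts(nfa, "nksy"))
# True True False
```

The string "nky" is accepted although it was never stored: the compressed form represents more strings than the original subtree, which is the information loss the compression methods try to limit.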
    15
    Outline
    • Motivation: supporting queries with mixed-type
      predicates
    • Our approach: MAT-tree
    • Construction and maintenance of MAT-tree
    • Experiments

    16
    Constructing MAT-tree
    • Option 1: insert records one by one.
    • Option 2:
      • bulk-load records
      • construct the MAT-tree bottom-up

    17
    Compressing a trie
    • Important
      • Accurately represent strings in a limited space.
      • Minimize information loss.
      • Maintain the pruning power during a traversal.
    • Three methods
      • (1) Reducing the number of accepted strings
      • (2) Keeping accepted strings clustered
      • (3) Combining (1) and (2)

    18
    Method (1): Reducing the number of accepted strings
    • Intuition
      • reducing this number makes the compressed trie more
        accurate
    • Goodness function: the number of accepted strings
    • Algorithm: Randomized
      • Randomly select k initial centers
      • Randomly select one of the centers
      • Randomly select an unselected node
      • Swap them if doing so improves the goodness function
      • Repeat for a certain number of iterations
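
The Randomized steps amount to a swap-based local search, which might be sketched as follows. The goodness function is passed in as a parameter and assumed to be "lower is better" (for instance, the number of accepted strings); the function name, the iteration count, and the toy cost table are placeholders, not values from the talk.

```python
import random

def randomized_centers(nodes, k, goodness, iterations=1000, seed=0):
    """Pick k centers from nodes by random swaps that lower goodness."""
    rng = random.Random(seed)
    centers = rng.sample(nodes, k)        # k random initial centers
    best = goodness(centers)
    for _ in range(iterations):
        unselected = [n for n in nodes if n not in centers]
        if not unselected:
            break
        i = rng.randrange(k)              # one of the current centers
        candidate = rng.choice(unselected)
        trial = centers[:i] + [candidate] + centers[i + 1:]
        score = goodness(trial)
        if score < best:                  # keep the swap only if it helps
            centers, best = trial, score
    return centers, best

# Toy goodness for illustration: each node has a made-up cost.
cost = {name: c for c, name in enumerate("abcdefgh")}
chosen, score = randomized_centers(list(cost), 3,
                                   lambda cs: sum(cost[c] for c in cs))
print(sorted(chosen), score)
```

Because a swap is accepted only when it improves the score, the result is never worse than the random starting set, but it may be a local optimum.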

    19
    Method (2): Keeping accepted strings clustered
    • Intuition
      • keep the accepted strings similar to the
        original ones by letting them share a common
        prefix.
      • Place k centers as close to the root as possible.
    • Algorithm: BreadthFirst

    20
    Method (3): Combining (1) and (2)
    • Intuition
      • minimize the number of accepted strings, and at
        the same time maintain their similarity to the
        originals.
    • Algorithm: BottomUp
      • Keep shrinking the trie bottom-up until we have k
        nodes
      • Compress a node that minimizes the number of additional
        strings

    21
    Dynamic maintenance
    • Insertion (s, n)
      • Search the index for (s, n). If it's not in the
        index, identify the correct leaf node.
      • If no overflow
        • update the MBR of the leaf node and its
          ancestors recursively if necessary.
      • If overflow
        • Split the leaf node and
        • Construct two compressed tries
        • Cascade the split to the ancestors if necessary.
    • Deletion and Update are handled similarly
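
The no-overflow case can be illustrated for the numeric part of the bounding information only. This is a sketch under assumptions: the `Node` class and `widen_ranges` are hypothetical names, and the string part (the compressed trie) and the overflow split are left out.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.lo = float("inf")    # empty range until something is inserted
        self.hi = float("-inf")

def widen_ranges(leaf, n):
    """Extend the numeric range of leaf and of its parents up to the
    root so that they cover n, stopping at the first node that already
    covers it (every node above it then covers n as well)."""
    node = leaf
    while node is not None and not (node.lo <= n <= node.hi):
        node.lo = min(node.lo, n)
        node.hi = max(node.hi, n)
        node = node.parent

root = Node()
leaf_a, leaf_b = Node(root), Node(root)
for year in (1960, 1950):
    widen_ranges(leaf_a, year)
widen_ranges(leaf_b, 1980)
print((leaf_a.lo, leaf_a.hi), (root.lo, root.hi))  # (1950, 1960) (1950, 1980)
```

The early stop relies on the invariant that a parent's range always contains the ranges of its children.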

    22
    Outline
    • Motivation: supporting queries with mixed-type
      predicates
    • Our approach: MAT-tree
    • Construction and maintenance of MAT-tree
    • Experiments

    23
    Setting
    • Data
      • IMDB: 100K movie star records (name and YOB).
      • Customers: 50K records (name and YOB)
    • Test bed
      • PC: 2.4 GHz P4, 1.2 GB memory, Windows XP
      • Visual C++ compiler
    • Results are similar on both data sets; results are reported for IMDB.

    24
    Implemented approaches
    • B-tree
    • Q-tree
    • B-tree + Q-tree
    • BQ-tree
    • BM-tree
    • Sequential scan
    • BBQ-tree?

    25
    2 1 1
    An integrated indexing structure is better than
    two separate indexing structures
    ds = 3, dn = 4
    26
    Scalability
    27
    Effect of numeric threshold dn
    28
    Effect of string threshold ds
    29
    Dynamic maintenance: time
    30
    Dynamic maintenance: MAT quality
    31
    Number of centers
    • Increasing the number of centers may not reduce the running
      time: pruning power versus computational cost
    • For BottomUp and BreadthFirst (compared to
      Randomized)
      • Centers are close to the root, thus early termination
        is more likely

    32
    Conclusion
    • MAT-tree: an efficient indexing structure for
      queries with mixed-type predicates
    • Can be efficiently constructed and maintained
    • Future work: develop a uniform framework to
      support different kinds of similarity functions

    The Flamingo Project: http://www.ics.uci.edu/flamingo/
    Q&A?