A(k)-index: Exploiting Local Similarity to Index Paths in Graph Data

1
A(k)-index: Exploiting Local Similarity to Index
Paths in Graph Data
  • Raghav Kaushik (UW)
  • Pradeep Shenoy (UWash)
  • Philip Bohannon (Bell Labs)
  • Ehud Gudes (BGU)

2
Outline
  • Problem statement
  • Prior work and limitations
  • Background
  • A(k)-index
  • Query Evaluation
  • Preliminary experiments
  • Update
  • Conclusions

3
Data Model
  • Rooted, node-labeled graph with a unique root; the root
    has a unique label
  • Nodes - objects
  • Arcs - object-subobject relationship
  • In XML context
  • Index tag structure
  • No distinction between elements and attributes
  • No distinction between tree and idref arcs
  • Order ignored

4
Problem Statement
  • Practical indexing schemes for large graph data
    (like XML data) (100K - 1M nodes)
  • Size: 10% of database size
  • Efficient construction and update
  • Tunable to a workload
  • Queries of the form R x, where R is a regular
    path expression
  • Schemaless data

5
Flavor of Approach
  • Different from traditional value indices
  • Structural summaries for indexing paths
  • Both data and index are rooted graphs
  • Example: Dataguide

6
Index Graph
  • Structural summary
  • Associate a set of data nodes with each index
    node, called its extent
  • Preserve data paths in index graph

7
Example index graph
[Figure: a data graph with nodes 0 to 6 beside its index graph, whose index nodes have extents 0, 1, 2, {3,4} and {5,6}]

8
Index Graph (contd)
  • Can be constructed from any partition
  • Node for every equivalence class C
  • Edge from C to C' if there exists an edge v → v'
    with v in C and v' in C'
  • Preserves data paths, no false drops
  • Our structures are all index graphs
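
The construction above can be sketched in a few lines. This is only an illustration, not the authors' code: the function name `build_index_graph`, the edge list and the class ids "A" to "E" are hypothetical.

```python
from collections import defaultdict

def build_index_graph(edges, block_of):
    """Build an index graph from a partition of the data nodes.

    edges    -- iterable of (v, w) data edges
    block_of -- dict mapping each data node to its equivalence class id
    Returns (extents, index_edges).
    """
    # Extent of an index node = the data nodes in its equivalence class.
    extents = defaultdict(set)
    for node, block in block_of.items():
        extents[block].add(node)
    # One index edge C -> C' for every data edge v -> v' with v in C, v' in C'.
    index_edges = {(block_of[v], block_of[w]) for v, w in edges}
    return dict(extents), index_edges

# Hypothetical 7-node data graph; nodes 3,4 and 5,6 share a class.
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
block_of = {0: "A", 1: "B", 2: "C", 3: "D", 4: "D", 5: "E", 6: "E"}
extents, index_edges = build_index_graph(edges, block_of)
```

Every data edge is mapped to an index edge, so every data path has an image in the index graph.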

9
Prior Schemes
  • Dataguide [Goldman, Widom 1997]
  • Deterministic automaton corresponding to data
    graph
  • Each set of data nodes that can be distinguished
    by a path query is summarized by a single node in
    the index
  • Can be exponential in size!
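
A minimal sketch of the "deterministic automaton" view, assuming a subset construction over a node-labeled graph. The function name `build_dataguide` and the sample graph are hypothetical. Each index node is the set of data nodes reached by one label path, which is why the number of index nodes can blow up.

```python
def build_dataguide(labels, children, root):
    """Subset construction: each guide node is a frozenset of data nodes.

    labels   -- dict node -> label
    children -- dict node -> iterable of child nodes
    Returns (guide_nodes, guide_edges), where guide_edges maps
    (source_set, label) -> target_set.
    """
    start = frozenset([root])
    guide_nodes = {start}
    guide_edges = {}
    todo = [start]
    while todo:
        source = todo.pop()
        # Group the children of all nodes in this target set by label.
        by_label = {}
        for v in source:
            for c in children.get(v, ()):
                by_label.setdefault(labels[c], set()).add(c)
        for label, targets in by_label.items():
            target = frozenset(targets)
            guide_edges[(source, label)] = target
            if target not in guide_nodes:
                guide_nodes.add(target)
                todo.append(target)
    return guide_nodes, guide_edges

# Hypothetical data graph: ROOT -> a, a; each a -> b.
labels = {0: "ROOT", 1: "a", 2: "a", 3: "b", 4: "b"}
children = {0: [1, 2], 1: [3], 2: [4]}
guide_nodes, guide_edges = build_dataguide(labels, children, 0)
```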

10
Prior Schemes (contd)
  • 1-index [Milo, Suciu 1999]
  • NFA rather than DFA (smaller)
  • split graph nodes into equivalence classes based
    on incoming paths from the root
  • Computing the best split is PSPACE-complete
  • Go for refinements (approximations)
  • similarity
  • bisimilarity

11
Limitations of Prior Work
  • Size
  • Dataguide sizes subject to exponential blow-up
  • 1-index size can be big too!
  • Update
  • No known update algorithm for 1-index
  • Designed to answer queries involving arbitrarily
    complex paths, but...
  • such paths may never show up in queries

12
Local Similarity
[Figure: example data graph rooted at ROOT, with nodes labeled metro, cultural, neighborhoods, business, museum, hotel, nhd., nearby, attr. and cult.]

13
Main Contributions
  • New family of approximate index structures
  • Applicable to
  • Approximate Schema
  • Statistics
  • Query evaluation using approximate indexes
  • Preliminary performance study
  • Update algorithms

14
Approximate Indexes
  • Motivation
  • Smaller
  • More efficient query processing
  • Limited update cost - maintain local information
  • Approximate dataguide [Goldman et al.]
  • path merging, object matching, etc
  • no formal basis (but different goal)
  • no study of effect on query processing

15
Outline
  • Problem statement
  • Prior work and limitations
  • Background
  • A(k)-index
  • Query Evaluation
  • Preliminary experiments
  • Update
  • Conclusions

16
Graph Bisimulation
  • A bisimulation is a symmetric relation R between
    nodes
  • If A1 R A2 then
  • A1 and A2 have the same labels
  • and ...

17
Graph Bisimulation (contd)
and vice-versa!
18
Bisimilarity
  • Two nodes a and b are bisimilar if they are
    related in some bisimulation
  • 1-index is index graph constructed from
    bisimulation partition
  • The simulation partition is defined similarly

19
Bisimulation on example
[Figure: bisimulation classes on the example data graph (ROOT, metro, cultural, neighborhoods, business, museum, hotel, nhd., nearby, attr., cult.)]

20
k-bisimulation
  • Nodes A1 and A2 are 0-bisimilar iff they have the same label
  • A1 and A2 are k-bisimilar iff
  • they are (k-1)-bisimilar, and
  • for every edge (B1, A1) there exists an edge (B2, A2)
    such that B1 and B2 are (k-1)-bisimilar, and vice versa
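
The inductive definition suggests a simple round-by-round refinement, sketched below. This is an illustrative sketch, not the authors' construction algorithm; the function name `k_bisimulation` and the sample graph are hypothetical.

```python
def k_bisimulation(labels, edges, k):
    """Return a dict node -> class id for k-bisimilarity.

    labels -- dict node -> label
    edges  -- iterable of (parent, child) data edges
    """
    parents = {v: set() for v in labels}
    for parent, child in edges:
        parents[child].add(parent)

    def renumber(signature):
        # Map each distinct signature to a small integer class id.
        ids = {}
        return {v: ids.setdefault(signature[v], len(ids)) for v in signature}

    # Round 0: nodes are 0-bisimilar iff they have the same label.
    block = renumber(labels)
    for _ in range(k):
        # Two nodes stay together iff they were together in the previous
        # round and their parents cover the same previous-round classes.
        signature = {
            v: (block[v], frozenset(block[p] for p in parents[v]))
            for v in labels
        }
        block = renumber(signature)
    return block

# Hypothetical graph: the two "c" nodes agree on paths of length 1 only.
labels = {0: "ROOT", 1: "x", 2: "y", 3: "a", 4: "a", 5: "c", 6: "c"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
a0 = k_bisimulation(labels, edges, 0)
a1 = k_bisimulation(labels, edges, 1)
a2 = k_bisimulation(labels, edges, 2)
```

In this sample, nodes 5 and 6 are 1-bisimilar because both have a parent labeled "a", but they are not 2-bisimilar because their grandparents are labeled "x" and "y".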

21
Example for k-bisimulation
22
A(2) for example
[Figure: the A(2)-index for the example data graph (ROOT, metro, cultural, neighborhoods, business, museum, hotel, nhd., nearby, attr., cult.)]

23
Properties
  • If a and b are bisimilar
  • the set of incoming paths into them is the same
  • If a and b are k-similar or k-bisimilar
  • the set of incoming paths of length ≤ k is the same
  • If k-bisim = (k+1)-bisim then k-bisim = bisim
  • Size is certainly no larger than that of the bisimulation

24
Query Evaluation
  • Only queries studied are regular path queries of
    the form R x
  • Query Evaluation Approach
  • Create automaton for regexp query
  • Run automaton on the index graph
  • Result is union of extents belonging to index
    nodes accepted by automaton
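
A sketch of this evaluation loop, under some assumptions that are not in the slides: the automaton is given directly as an NFA (no regexp parsing), it reads the label of every node on the path including the root, and the names `evaluate_on_index`, `delta` and the sample index graph are hypothetical.

```python
from collections import deque

def evaluate_on_index(index_labels, index_edges, extents, root, nfa):
    """Run an NFA over the index graph; return the union of accepted extents.

    nfa -- (delta, start, accepting), where delta maps
           (state, label) -> set of next states
    """
    delta, start, accepting = nfa
    children = {}
    for a, b in index_edges:
        children.setdefault(a, set()).add(b)

    result = set()
    # The automaton first consumes the label of the root itself.
    frontier = deque((root, s) for s in delta.get((start, index_labels[root]), ()))
    seen = set(frontier)
    while frontier:
        node, state = frontier.popleft()
        if state in accepting:
            result |= extents[node]
        for child in children.get(node, ()):
            for nxt in delta.get((state, index_labels[child]), ()):
                if (child, nxt) not in seen:
                    seen.add((child, nxt))
                    frontier.append((child, nxt))
    return result

# Hypothetical index graph and the query ROOT.metro.hotel.
index_labels = {"r": "ROOT", "m": "metro", "h": "hotel", "b": "business"}
index_edges = [("r", "m"), ("m", "h"), ("m", "b")]
extents = {"r": {0}, "m": {1}, "h": {3, 4}, "b": {2}}
delta = {(0, "ROOT"): {1}, (1, "metro"): {2}, (2, "hotel"): {3}}
answer = evaluate_on_index(index_labels, index_edges, extents, "r", (delta, 0, {3}))
```

The search visits (index node, automaton state) pairs, so its cost depends on the size of the index graph rather than the data graph.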

25
Example Query Evaluation
[Figure: automaton graph and index graph]

26
Approximate Indexes
  • Caveat: false positives are possible
  • Approach: verify each node on the data graph by
    running the reverse automaton
  • Prohibitive cost?
  • Then why use approximate indexes?
  • In fact, frequently more efficient than the data
    graph or a precise index
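
The verification step can be sketched as a backward search from a candidate node toward the root. This is an illustration under assumptions: the NFA reads the label of every node on the path including the root, and the names `verify`, `delta` and the sample data graph are hypothetical.

```python
def verify(node, labels, parents, root, nfa):
    """Check on the data graph whether some root-to-node path is accepted.

    nfa -- (delta, start, accepting), where delta maps
           (state, label) -> set of next states
    """
    delta, start, accepting = nfa
    # Reverse transitions: (target_state, label) -> set of source states.
    rev = {}
    for (src, label), targets in delta.items():
        for t in targets:
            rev.setdefault((t, label), set()).add(src)

    # A pair (v, q) means: the automaton is in state q after reading v's label.
    stack = [(node, q) for q in accepting]
    seen = set(stack)
    while stack:
        v, q = stack.pop()
        for p in rev.get((q, labels[v]), ()):
            if v == root and p == start:
                return True
            for u in parents.get(v, ()):
                if (u, p) not in seen:
                    seen.add((u, p))
                    stack.append((u, p))
    return False

# Hypothetical data graph: an index that groups nodes by label alone puts
# both hotels in one extent, so the query ROOT.metro.hotel returns node 4
# as a false positive.
labels = {0: "ROOT", 1: "metro", 2: "business", 3: "hotel", 4: "hotel"}
parents = {1: [0], 2: [0], 3: [1], 4: [2]}
delta = {(0, "ROOT"): {1}, (1, "metro"): {2}, (2, "hotel"): {3}}
nfa = (delta, 0, {3})
verified = {v for v in (3, 4) if verify(v, labels, parents, 0, nfa)}
```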

27
Improving Validation
  • First cut: keep track of accepting-path-length
  • for accepted nodes with path length ≤ k, verification is not required
  • Second step: share traversals among verification
    calls
  • mark node-state pairs on a successful
    verification path as accept
  • similar marking for failed paths

28
Improving Validation (contd)
  • Third step: avoid needless verification
  • Example: for _.R queries, no need to verify all
    the way up to the root
  • Generalize the above!

29
Outline
  • Problem statement
  • Prior work and limitations
  • Background
  • A(k)-index
  • Query Evaluation
  • Preliminary experiments
  • Update
  • Conclusions

30
Preliminary Experiments
  • Data used: Internet Movie Database
    (http://www.imdb.com)
  • 250,000 movies and TV shows
  • 460,000 actors, etc.
  • XML version: 1 GB
  • We used subsets of this database ranging from 200
    to 2000 movies
  • Whole database: future work!

31
Preliminary Experiments
  • Second source: Open Directory Project
    (http://www.dmoz.org)
  • Entire source available in RDF format
  • Subsets (entire subtree under a topic, say
    shopping)

32
Storage Model
  • Results independent of any particular storage
    model
  • In-memory rooted graph
  • Performance metrics are abstract
  • Cost: total number of nodes visited (graph +
    index)

33
Bisimulation Sizes
IMDB nodes: 190,000; ODP nodes: 143,000
34
Query Evaluation Plans
  1. Forward eval
  2. Backward eval (assume a label index)
35
Short Queries - IMDB
36
Long Queries - IMDB
37
Queries beginning with _
38
Queries containing _
39
Approximate Answers
40
A(k)-index Update
  • Edge added from u to v
  • A(0)-index - no change except possible addition
    of edge
  • A(1)-index - index node containing v may change
  • determined by the set of labels of v's parents
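
A minimal sketch of the A(1) case, assuming that a node's A(1) class is fully described by its own label together with the set of its parents' labels. The names `a1_class` and `add_edge` and the sample data are hypothetical, and maintenance of the index edges is left out.

```python
def a1_class(v, labels, parents):
    """A(1) class key: own label plus the set of parent labels."""
    return (labels[v], frozenset(labels[p] for p in parents[v]))

def add_edge(u, v, labels, parents, extents):
    """Add data edge u -> v; move v to its new A(1) extent if its key changed."""
    old = a1_class(v, labels, parents)
    parents[v].add(u)
    new = a1_class(v, labels, parents)
    if new != old:
        extents[old].discard(v)
        if not extents[old]:
            del extents[old]  # drop an index node whose extent became empty
        extents.setdefault(new, set()).add(v)
    return old, new

# Hypothetical data: both hotels start in the same A(1) extent.
labels = {0: "ROOT", 1: "metro", 2: "business", 3: "hotel", 4: "hotel"}
parents = {0: set(), 1: {0}, 2: {0}, 3: {1}, 4: {1}}
extents = {}
for v in labels:
    extents.setdefault(a1_class(v, labels, parents), set()).add(v)

old, new = add_edge(2, 4, labels, parents, extents)
```

Only v is re-examined here, which matches the locality that the slide describes for A(1).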

41
A(k)-index Update (contd)
  • A(k)-index
  • only nodes to be considered are those at distance
  • Maintain tree of splits
  • Work iteratively
  • find new A(1) position of v
  • find new A(2) positions of v and its children

42
Updating the 1-index
  • One way is generalization of A(k) update
  • R - any binary relation on the nodes that is
  • reflexive
  • transitively closed.
  • A refinement of R is any subset that is
  • reflexive
  • transitively closed

43
Refinement
  • B - bisimulation relation
  • B' - any refinement of B
  • B(G) - index graph built using B
  • B'(G) - index graph built using B'

44
Theorem
  • Theorem: B(B'(G)) = B(G)
  • Intuition
  • Similar nodes behave similarly
  • So, fuse them together!

45
Lazy Update
  • Basic Idea
  • G → G', and meanwhile B(G) → B(G')
  • Instead, relax the graph B(G') to B'(G')
  • How?
  • A stable partitioning of G' is either B(G') or
    its refinement.
  • Propagate the graph update on B(G) by splitting nodes
    until stable.

46
Lazy Update Performance
47
Conclusions
  • Novel approximate index structures and validation
    techniques
  • Experiments demonstrate that the k-bisimulation index is
  • Efficiently constructed
  • Effective for query answering

48
Future Work
  • Handle more query types
  • Branching queries
  • Queries with selection
  • Annotating A(k) with statistics for query
    optimization
  • Storage
  • Application of update algorithms to triggers