A(k)-index: Exploiting Local Similarity to Index Paths in Graph Data

1
A(k)-index: Exploiting Local Similarity to Index Paths in Graph Data

Raghav Kaushik (UW)
Pradeep Shenoy (UWash)
Philip Bohannon (Bell Labs)
Ehud Gudes (BGU)

2
Outline

Problem statement
Prior work and limitations
Background
A(k)-index
Query Evaluation
Preliminary experiments
Update
Conclusions

3
Data Model

Rooted, node-labeled graph with a unique root
The root has a unique label
Nodes - objects
Arcs - object-subobject relationship
In XML context
Index tag structure
No distinction between elements and attributes
No distinction between tree and idref arcs
Order ignored

4
Problem Statement

Practical indexing schemes for large graph data
(like XML data) (100K - 1M nodes)
Size: 10% of database size
Efficient construction and update
Tunable to a workload
Queries of the form R x, where R is a regular
path expression
Schemaless data

5
Flavor of Approach

Different from traditional value indices
Structural summaries for indexing paths
Both data and index are rooted graphs
Example: Dataguide

6
Index Graph

Structural summary
Associate a set of data nodes with each index
node, called its extent
Preserve data paths in index graph

7
Example index graph
[Figure: a data graph with nodes 0-6 and its index graph, whose extents include {3,4} and {5,6}]
8
Index Graph (contd)

Can be constructed from any partition
Node for every equivalence class C
Edge from C to C' if there exists an edge v -> v'
with v in C and v' in C'
Preserves data paths, no false drops
Our structures are all index graphs
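
The construction above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `build_index_graph`, the 7-node data graph, and the partition into blocks "A" to "E" are all hypothetical.

```python
from collections import defaultdict

def build_index_graph(edges, block_of):
    """Build an index graph from a partition of the data nodes.

    edges    : iterable of (v, w) data-graph edges
    block_of : dict mapping each data node to its equivalence class
    Returns (extents, index_edges).
    """
    # One index node per equivalence class; its extent is the class itself.
    extents = defaultdict(set)
    for node, block in block_of.items():
        extents[block].add(node)
    # Edge C -> C' whenever some data edge v -> v' has v in C and v' in C'.
    index_edges = {(block_of[v], block_of[w]) for v, w in edges}
    return dict(extents), index_edges

# Hypothetical data graph and partition (assumed for illustration only).
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
block_of = {0: "A", 1: "B", 2: "C", 3: "D", 4: "D", 5: "E", 6: "E"}
extents, index_edges = build_index_graph(edges, block_of)
```

Every data path maps to an index path because each data edge contributes its image edge, which is why no answers are lost whatever partition is chosen.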

9
Prior Schemes

Dataguide [Goldman, Widom 1997]
Deterministic automaton corresponding to data
graph
Each set of data nodes that can be distinguished
by a path query is summarized by a single node in
the index
Can be exponential in size!
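
A rough sketch of the determinization idea, assuming a node-labeled graph stored as plain dictionaries. The function name `build_dataguide` and the small example graph are hypothetical; this is a generic subset construction, not the original Dataguide code.

```python
from collections import deque

def build_dataguide(children, label, root):
    """Subset construction over a node-labeled data graph.

    Each dataguide state is a frozenset of data nodes reached by the
    same label path from the root.
    """
    start = frozenset([root])
    states = {start}
    guide_edges = {}  # (source state, label) -> target state
    queue = deque([start])
    while queue:
        current = queue.popleft()
        # Group the children of all nodes in this state by label.
        by_label = {}
        for v in current:
            for w in children.get(v, ()):
                by_label.setdefault(label[w], set()).add(w)
        for lab, targets in by_label.items():
            target = frozenset(targets)
            guide_edges[(current, lab)] = target
            if target not in states:
                states.add(target)
                queue.append(target)
    return states, guide_edges

# Hypothetical example graph (assumed for illustration only).
children = {0: [1, 2], 1: [3], 2: [4]}
label = {0: "ROOT", 1: "a", 2: "a", 3: "b", 4: "b"}
states, guide_edges = build_dataguide(children, label, 0)
```

Because states are sets of data nodes, a graph with n nodes can in the worst case yield a number of states exponential in n, which is the blow-up noted above.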

10
Prior Schemes (contd)

1-index [Milo, Suciu 1999]
NFA rather than DFA (smaller)
Split graph nodes into equivalence classes based
on incoming paths from the root
Computing the best split is PSPACE-complete
Go for refinements (approximations):
similarity
bisimilarity

11
Limitations of Prior Work

Size
Dataguide sizes subject to exponential blow-up
1-index size can be big too!
Update
No known update algorithm for 1-index
Designed to answer queries involving arbitrarily
complex paths, but...
such paths may never show up in queries

12
Local Similarity
[Figure: example data graph rooted at ROOT, with nodes labeled metro, cultural, neighborhoods, business, museum (x2), hotel, nhd. (x2), nearby, attr. (x2), cult. (x2)]
13
Main Contributions

New family of approximate index structures
Applicable to
Approximate Schema
Statistics
Query evaluation using approximate indexes
Preliminary performance study
Update algorithms

14
Approximate Indexes

Motivation
Smaller
More efficient query processing
Limited update cost - maintain local information
Approximate dataguide [Goldman et al.]
path merging, object matching, etc
no formal basis (but different goal)
no study of effect on query processing

15
Outline

Problem statement
Prior work and limitations
Background
A(k)-index
Query Evaluation
Preliminary experiments
Update
Conclusions

16
Graph Bisimulation

A bisimulation is a symmetric relation R between
nodes
If A1 R A2 then
A1 and A2 have the same labels
and ...

17
Graph Bisimulation (contd)
and vice-versa!
18
Bisimilarity

Two nodes a and b are bisimilar if they are
related in some bisimulation
The 1-index is the index graph constructed from the
bisimulation partition
The simulation partition is handled similarly

19
Bisimulation on example
[Figure: bisimulation partition on the example data graph rooted at ROOT, with nodes labeled metro, cultural, neighborhoods, business, museum (x2), hotel, nhd. (x2), nearby, attr. (x2), cult. (x2)]
20
k-bisimulation

Nodes A1 and A2 are 0-bisimilar iff they have the same label
A1 and A2 are k-bisimilar iff
they are (k-1)-bisimilar, and
for every edge (B1, A1) there exists an edge (B2, A2) such that B1 and B2 are
(k-1)-bisimilar, and vice versa
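
The recursive definition suggests computing the partition by k rounds of refinement. The sketch below is one straightforward way to do that, not necessarily the construction algorithm used for the A(k)-index; the names `k_bisimulation` and `as_partition` and the example graph are hypothetical.

```python
def k_bisimulation(parents, label, k):
    """Return a dict node -> block id for the k-bisimulation partition.

    parents : dict node -> list of parent nodes
    label   : dict node -> label
    """
    # Round 0: nodes are 0-bisimilar iff they carry the same label.
    block = dict(label)
    for _ in range(k):
        # Two nodes stay together iff they were together in the previous
        # round and their parents cover the same set of previous blocks.
        signature = {
            v: (block[v], frozenset(block[p] for p in parents.get(v, ())))
            for v in label
        }
        ids = {}
        block = {v: ids.setdefault(signature[v], len(ids)) for v in label}
    return block

def as_partition(block):
    """Convert node -> block id into a set of frozensets of nodes."""
    groups = {}
    for v, b in block.items():
        groups.setdefault(b, set()).add(v)
    return {frozenset(g) for g in groups.values()}

# Hypothetical example graph (assumed for illustration only).
parents = {1: [0], 2: [0], 3: [1], 4: [2], 5: [3], 6: [4]}
label = {0: "ROOT", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
p0 = as_partition(k_bisimulation(parents, label, 0))
p1 = as_partition(k_bisimulation(parents, label, 1))
p2 = as_partition(k_bisimulation(parents, label, 2))
```

In this example the two "c" nodes separate at k = 1 because their parents have different labels, while the two "d" nodes only separate at k = 2.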

21
Example for k-bisimulation
22
A(2) for example
[Figure: A(2)-index partition on the example data graph rooted at ROOT, with nodes labeled metro, cultural, neighborhoods, business, museum (x2), hotel, nhd. (x2), nearby, attr. (x2), cult. (x2)]
23
Properties

If a and b are bisimilar:
the set of incoming paths into them is the same
If a and b are k-similar or k-bisimilar:
the set of incoming paths of length at most k into them is the same
If k-bisim = (k+1)-bisim, then k-bisim = bisim
Size is certainly no larger than that of the bisimulation partition

24
Query Evaluation

Only queries studied are regular path queries of
the form R x
Query Evaluation Approach
Create automaton for regexp query
Run automaton on the index graph
Result is union of extents belonging to index
nodes accepted by automaton
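
The approach can be sketched as a traversal of the product of the automaton and the index graph. The sketch assumes the automaton is given directly as an NFA transition table and that it consumes the label of every node on the path except the root; the function name, the index graph, and the query a.c are hypothetical.

```python
from collections import deque

def evaluate_on_index(index_children, index_label, extent, root,
                      delta, start, accepting):
    """Run an NFA over the index graph and union the accepted extents.

    delta : dict (state, label) -> set of next states
    """
    seen = {(root, start)}
    queue = deque(seen)
    result = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            result |= extent[node]
        for child in index_children.get(node, ()):
            for nxt in delta.get((state, index_label[child]), ()):
                if (child, nxt) not in seen:
                    seen.add((child, nxt))
                    queue.append((child, nxt))
    return result

# Hypothetical index graph: one index node per label (assumed example).
index_children = {"R": ["A", "B"], "A": ["C"], "B": ["C"], "C": ["D"]}
index_label = {"R": "ROOT", "A": "a", "B": "b", "C": "c", "D": "d"}
extent = {"R": {0}, "A": {1}, "B": {2}, "C": {3, 4}, "D": {5, 6}}
# NFA for the path a.c below the root: 0 -a-> 1 -c-> 2.
delta = {(0, "a"): {1}, (1, "c"): {2}}
answer = evaluate_on_index(index_children, index_label, extent, "R",
                           delta, start=0, accepting={2})
```

Here the whole extent {3, 4} is returned because the index node labeled c merges two data nodes; whether both truly match depends on the data graph and on how fine the index partition is.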

25
Example Query Evaluation
[Figure: automaton graph and index graph]
26
Approximate Indexes

Caveat: false positives are possible
Approach: verify each node on the data graph by
running the reverse automaton
Prohibitive cost?
Then why use approximate indices?
In fact, frequently more efficient than the data
graph or a precise index
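
A minimal sketch of the verification step, assuming the query is given as an NFA transition table whose transitions consume the label of every node on the path except the root. The function name `verify`, the data graph, and the candidate set are hypothetical.

```python
def verify(node, parents, label, root, delta, start, accepting):
    """Check a candidate by running the reversed NFA from it toward the root.

    delta : dict (state, label) -> set of next states (forward direction)
    """
    # Reverse transitions: (target state, label) -> set of source states.
    reverse = {}
    for (src, lab), targets in delta.items():
        for tgt in targets:
            reverse.setdefault((tgt, lab), set()).add(src)
    stack = [(node, q) for q in accepting]
    seen = set(stack)
    while stack:
        v, q = stack.pop()
        if v == root and q == start:
            return True  # some root-to-node path is accepted
        # Undo the transition that consumed label[v], then step to a parent.
        for p in reverse.get((q, label[v]), ()):
            for u in parents.get(v, ()):
                if (u, p) not in seen:
                    seen.add((u, p))
                    stack.append((u, p))
    return False

# Hypothetical data graph and candidate set (assumed for illustration only).
parents = {1: [0], 2: [0], 3: [1], 4: [2], 5: [3], 6: [4]}
label = {0: "ROOT", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
delta = {(0, "a"): {1}, (1, "c"): {2}}  # NFA for the path a.c
candidates = {3, 4}                     # e.g. returned by a coarse index
verified = {x for x in candidates
            if verify(x, parents, label, 0, delta, start=0, accepting={2})}
```

Node 4 is rejected because its only parent is labeled b, so it is a false positive of the coarse candidate set.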

27
Improving Validation

First cut: keep track of accepting-path-length
For accepted nodes with path length at most k, verification is not required
Second step: share traversals among verification
calls
Mark node-state pairs on a successful
verification path as accept
Similar marking for failed paths

28
Improving Validation (contd)

Third step: avoid needless verification
Example: for _.R queries, no need to verify all
the way up to the root
Generalize the above!

29
Outline

Problem statement
Prior work and limitations
Background
A(k)-index
Query Evaluation
Preliminary experiments
Update
Conclusions

30
Preliminary Experiments

Data used: Internet Movie Database
(http://www.imdb.com)
250,000 movies and TV shows
460,000 actors, etc.
XML version: 1GB
We used subsets of this database ranging from 200
to 2000 movies
Whole database -- future work!

31
Preliminary Experiments

Second source: Open Directory Project
(http://www.dmoz.org)
Entire source available in RDF format
Subsets (entire subtree under a topic, say
shopping)

32
Storage Model

Results independent of any particular storage
model
In-memory rooted graph
Performance metrics are abstract
Cost: total number of nodes visited (graph +
index)

33
Bisimulation Sizes
IMDB: 190,000 nodes
ODP: 143,000 nodes
34
Query Evaluation Plans
1. Forward eval
2. Backward eval (assume a label index)
35
Short Queries - IMDB
36
Long Queries - IMDB
37
Queries beginning with _
38
Queries containing _
39
Approximate Answers
40
A(k)-index Update