Title: Comparing path-based and vertically-partitioned RDF databases
1Comparing path-based and vertically-partitioned
RDF databases
- Preetha Lakshmi, Chris Mueller
- 12/10/2007
- CSCI 8715
- Shashi Shekhar
2Outline
- Motivation
- Background and related work
- Problem statement
- Our contributions
- Assumptions
- Experimental process
- Results
- Conclusions
3Motivation
- Semantic Web
- libraries
- scientific databases
- industry
- social networks
- Computer-to-computer communication
4RDF Schema
Schema
Instance
5RDF Schema
RDF Triples: <subject, property, object>
<www.picasso.net, first, Pablo>
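As a minimal sketch of the triple structure above, the snippet below models a triple as a Python named tuple. The `Triple` class and the `objects_of` helper are hypothetical names introduced for illustration; only the first triple comes from the slide, and the second is an assumed value.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: <subject, property, object>."""
    subject: str
    prop: str
    obj: str


triples = [
    Triple("www.picasso.net", "first", "Pablo"),   # from the slide
    Triple("www.picasso.net", "last", "Picasso"),  # assumed for illustration
]


def objects_of(subject: str, prop: str) -> list:
    """Return every object stored for the given subject and property."""
    return [t.obj for t in triples if t.subject == subject and t.prop == prop]


print(objects_of("www.picasso.net", "first"))
```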
6Related Work
- Triple store
- Property tables
- Class property tables
- Dynamic table model
- Vertically partitioned tables (Abadi et al., 2007)
- Path-based approach (Matono et al., 2005)
Require more self joins, normal joins, NULL value storage
7Vertical Partitioning
- A table is created for each property
First
Subject  Object
'r1'     'Picasso'
'r4'     'August'
Last
Subject  Object
'r1'     'Picasso'
'r4'     'Rodin'
Paints
Subject  Object
'r1'     'r2'
'r1'     'r3'
... etc.
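The per-property layout can be sketched with an in-memory SQLite database (an assumption made for portability; the experiments in this talk used Oracle 10g). The `paints` rows follow the slide; the name and `title` rows are illustrative values, not the experimental dataset.

```python
import sqlite3
from collections import defaultdict

# Toy triples. The 'title' property and its values are assumed for illustration.
triples = [
    ("r1", "first", "Pablo"),
    ("r1", "last", "Picasso"),
    ("r1", "paints", "r2"),
    ("r1", "paints", "r3"),
    ("r2", "title", "Guernica"),
    ("r3", "title", "The Old Guitarist"),
]

conn = sqlite3.connect(":memory:")

# Group the triples by property, then create one two-column table per property.
by_property = defaultdict(list)
for subject, prop, obj in triples:
    by_property[prop].append((subject, obj))

for prop, rows in by_property.items():
    # Property names come from the toy data above, so quoting them as
    # identifiers is safe here; real data would need validation.
    conn.execute(f'CREATE TABLE "{prop}" (subject TEXT, object TEXT)')
    conn.executemany(f'INSERT INTO "{prop}" VALUES (?, ?)', rows)

# "Titles of all paintings by Picasso": one join per edge on the path.
titles = [row[0] for row in conn.execute(
    'SELECT t.object FROM "last" AS l '
    'JOIN "paints" AS p ON p.subject = l.subject '
    'JOIN "title" AS t ON t.subject = p.object '
    "WHERE l.object = 'Picasso' ORDER BY t.object"
)]
print(titles)
```

Each additional edge in a path adds one more join in this layout, which is the cost the comparison in this talk measures.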
8Path-based Model
- Path signatures relate to instance data
Path
pathid  pathexp
1       ''
2       'first'
3       'last'
4       'paints'
5       'title<paints'
6       'sculpts'
7       'title<sculpts'
Resource
name       pathid  root
'r1'       1       'r1'
'r2'       4       'r1'
'r3'       4       'r1'
'r4'       1       'r4'
'Picasso'  2       'r1'
'Pablo'    3       'r1'
'August'   2       'r4'
'Rodin'    3       'r4'
...
Our enhancement
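A rough sketch of how the two path-based tables answer a path query, again using in-memory SQLite rather than Oracle. The `path` rows and the first four `resource` rows follow the slide; the two title rows and the `names_on_path` helper are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE path (pathid INTEGER PRIMARY KEY, pathexp TEXT)")
conn.execute("CREATE TABLE resource (name TEXT, pathid INTEGER, root TEXT)")

conn.executemany("INSERT INTO path VALUES (?, ?)", [
    (1, ""), (2, "first"), (3, "last"), (4, "paints"),
    (5, "title<paints"), (6, "sculpts"), (7, "title<sculpts"),
])
conn.executemany("INSERT INTO resource VALUES (?, ?, ?)", [
    ("r1", 1, "r1"), ("r2", 4, "r1"), ("r3", 4, "r1"), ("r4", 1, "r4"),
    ("Guernica", 5, "r1"),     # assumed title row
    ("The Thinker", 7, "r4"),  # assumed title row
])


def names_on_path(pathexp, root=None):
    """Return resources reached via pathexp, optionally restricted to one root."""
    sql = ("SELECT r.name FROM resource AS r "
           "JOIN path AS p ON p.pathid = r.pathid WHERE p.pathexp = ?")
    params = [pathexp]
    if root is not None:
        sql += " AND r.root = ?"
        params.append(root)
    sql += " ORDER BY r.name"
    return [row[0] for row in conn.execute(sql, params)]


print(names_on_path("title<paints"))       # titles of paintings by anyone
print(names_on_path("paints", root="r1"))  # nodes painted by r1
```

Under these assumptions the path query needs a single join regardless of path length, because the path signature is matched as a string.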
9Problem Statement
- Given
- A set of RDF triples
- Vertical partitioning storage model
- Path-based storage model
- Find: Query plans for the various categories of queries under these two storage schemes.
- Objective: To determine query types that perform comparatively better or worse in the two storage models
- Why is this challenging?
- Need for efficient storage of structured data
- Different application domains use RDF, so generic storage schemes should support a diverse workload.
10Contributions
- Identification of benchmark queries
- schema, instance, path, and aggregate queries
- Enhancement to the path-based schema that addresses different types of workloads
- Comparison of path-based model and vertical partitioning
- Analysis of cyclic queries
11Query Types
[Figure: taxonomy of query types. Labels: Non-path, Path, Schema vs Instance, Aggregate, List, Cycle, Connection, Diameter, Constraints (intermediate node, terminal node), Relationship]
- Schema queries
- find all types of artists
- list all property names
- list nodes with 2 or more descendants.
- find the transitive sub-classes of a class 'sculpture'
- list properties with 2 or more descendants
- Instance queries
- find the titles of all paintings by Picasso
- select all nodes within one edge-length of r4
- list all the properties of node r4
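To illustrate why an instance query such as "list all the properties of node r4" touches every table under vertical partitioning, here is a hedged sketch using in-memory SQLite. The table contents are illustrative, and `properties_of` is a hypothetical helper, not part of the original experiments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per property; row values are illustrative.
tables = {
    "first": [("r1", "Pablo"), ("r4", "August")],
    "last": [("r1", "Picasso"), ("r4", "Rodin")],
    "paints": [("r1", "r2"), ("r1", "r3")],
}
for name, rows in tables.items():
    conn.execute(f'CREATE TABLE "{name}" (subject TEXT, object TEXT)')
    conn.executemany(f'INSERT INTO "{name}" VALUES (?, ?)', rows)


def properties_of(node):
    """Return (property, object) pairs for node by probing every property table."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    # The property name is not stored in any row, so it is added as a literal
    # column in each branch of the UNION ALL.
    union = " UNION ALL ".join(
        f'SELECT \'{n}\' AS property, object FROM "{n}" WHERE subject = ?'
        for n in names)
    return sorted(conn.execute(union, [node] * len(names)).fetchall())


print(properties_of("r4"))
```

The number of UNION branches grows with the number of distinct properties, since the schema itself is not stored explicitly in this layout.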
12Query Types
- Path queries
- find the title of any painting painted by anyone
- display all the titles of work done by artists
- find the names of all the sculptors
- ...with constraint on intermediate node
- find an artist's name where the artifact is a painting
- ...with terminal node constraints
- display all the titles of work done by Picasso
13Query Types
- Path queries
- connection queries
- list all the properties of node r4
- is there a connection between 'Picasso' and 'Guernica'?
- diameter queries
- select all nodes in the graph within one edge-length of r4
- non-simple path queries
- detect loops in the dataset starting at 'Picasso'
- detect loops in the whole dataset
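The two loop-detection queries can be sketched outside the database as a depth-first search over the triples. This is a generic graph technique, not the Oracle-based approach used in the experiments; the cyclic triples and the property names `influenced` and `admires` are hypothetical.

```python
from collections import defaultdict


def build_graph(triples):
    """Map each subject to the list of objects it points to."""
    graph = defaultdict(list)
    for subject, _prop, obj in triples:
        graph[subject].append(obj)
    return graph


def has_cycle_from(graph, start):
    """Return True if some cycle is reachable from start (three-colour DFS)."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GREY  # node is on the current DFS path
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is an ancestor, so a loop exists
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK  # fully explored, no loop through this node
        return False

    return visit(start)


def has_cycle(graph):
    """Whole-dataset check: try every subject as a start node."""
    return any(has_cycle_from(graph, node) for node in list(graph))


acyclic = [("r1", "paints", "r2"), ("r1", "paints", "r3")]
cyclic = acyclic + [("r2", "influenced", "r4"), ("r4", "admires", "r1")]

print(has_cycle_from(build_graph(cyclic), "r1"))
print(has_cycle(build_graph(acyclic)))
```

Restarting the search from every node keeps the sketch short but repeats work; a single pass that shares the colour map across start nodes would avoid that.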
14Query Types
- Aggregate queries
- find all nodes with 2 or more properties
- list all subjects that have two instances of a single property
- Relationship queries
- find any relationship between r1 and r4
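The two aggregate queries above might look as follows against vertically partitioned tables in SQLite. The data is illustrative, both helper functions are hypothetical, and "two instances" is interpreted here as "two or more", which is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per property; row values are illustrative.
tables = {
    "first": [("r1", "Pablo"), ("r4", "August")],
    "last": [("r1", "Picasso"), ("r4", "Rodin")],
    "paints": [("r1", "r2"), ("r1", "r3")],
}
for name, rows in tables.items():
    conn.execute(f'CREATE TABLE "{name}" (subject TEXT, object TEXT)')
    conn.executemany(f'INSERT INTO "{name}" VALUES (?, ?)', rows)


def subjects_with_min_properties(min_props=2):
    """Subjects that appear in at least min_props distinct property tables."""
    union = " UNION ALL ".join(
        f'SELECT DISTINCT subject FROM "{n}"' for n in tables)
    sql = (f"SELECT subject FROM ({union}) "
           "GROUP BY subject HAVING COUNT(*) >= ? ORDER BY subject")
    return [row[0] for row in conn.execute(sql, (min_props,))]


def subjects_with_repeated_property(min_count=2):
    """(subject, property) pairs where the subject has min_count or more values."""
    result = []
    for name in tables:
        rows = conn.execute(
            f'SELECT subject FROM "{name}" '
            "GROUP BY subject HAVING COUNT(*) >= ? ORDER BY subject",
            (min_count,))
        result.extend((subject, name) for (subject,) in rows)
    return result


print(subjects_with_min_properties())
print(subjects_with_repeated_property())
```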
15Assumptions
- Using a small dataset, with the assumption that the number of joins and the efficiency of the queries will not change significantly with larger datasets
- No explicit storage of the RDF schema in the vertically-partitioned scheme (application independent)
- INSERT, UPDATE, DELETE are insignificant compared to SELECT
- Key nodes in the path-based model are well-defined
- In practice, key nodes would be generated dynamically after user load analysis
16Experimental Process
- Set up both schemes in Oracle 10g for the RDF graph shown earlier
- Materialized path lengths in path-based scheme
- Generated query plans
- Analyzed queries based on the validation parameters
- Cycle queries joins are not supported
- Validation parameters
- Nodes
- Edges
- Number of joins
- Number of tables
- CPU cost
- Storage bytes
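As a rough stand-in for the plan-generation step, the sketch below asks SQLite for a query plan and derives two of the validation parameters from the query text. SQLite's `EXPLAIN QUERY PLAN` output is not comparable to Oracle 10g plans or its CPU cost figures; the table names and the text-based join count are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("last", "paints", "title"):
    conn.execute(f'CREATE TABLE "{name}" (subject TEXT, object TEXT)')

query = (
    'SELECT t.object FROM "last" AS l '
    'JOIN "paints" AS p ON p.subject = l.subject '
    'JOIN "title" AS t ON t.subject = p.object '
    "WHERE l.object = 'Picasso'"
)

# Crude counts taken from the query text itself (fine for this flat query,
# not a general SQL parser).
num_joins = query.count(" JOIN ")
num_tables = num_joins + 1

# The last column of each plan row holds the human-readable step description.
plan_steps = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

print("joins:", num_joins, "tables:", num_tables)
for step in plan_steps:
    print(step)
```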
17Dataset used for experiment
18Experimental Results
- For CPU cost and bytes (storage), the entry in the table indicates which scheme used fewer CPU cycles or occupied less space. In cases where both required an identical or similar amount of computation or storage, we indicate this with "same".
- Queries which cannot be answered are indicated by --.
19Conclusions and Observations
- Vertical Partitioning performs well for
- Short path length, terminal node constraints.
- Offers storage benefits for instance queries without path expressions.
- Enhanced Path Based model performs well for
- Schema queries, path queries, cycle queries
- Queries which the original path-based model could not address and the enhanced model could answer
- Connection queries and diameter queries
- Path queries with intermediate node constraints
20Conclusion (Cont'd)
- Both schemes show the same performance on instance queries without path expressions.
- Neither scheme addresses relationship queries
- Interesting results for cycle queries
- specifying the start node gives worse performance than leaving the start node unspecified
- specifying the start node uses Oracle Filter.
21Future Work
- Test large and diverse datasets
- Test vertical partitioning with a column-oriented database like MonetDB
- Pruning strategies for cycle queries
- Impose join indexes
- Find approaches to answer relationship queries
- Storage classification based on the application
domain
22Thank You
Please see http://www.cs.umn.edu/cmueller/cs8715
for a copy of the report that accompanies this
presentation, including a full bibliography.