Comparing path-based and vertically-partitioned RDF databases

1
Comparing path-based and vertically-partitioned
RDF databases
  • Preetha Lakshmi & Chris Mueller
  • 12/10/2007
  • CSCI 8715
  • Shashi Shekhar

2
Outline
  • Motivation
  • Background and related work
  • Problem statement
  • Our contributions
  • Assumptions
  • Experimental process
  • Results
  • Conclusions

3
Motivation
  • Semantic Web
  • libraries
  • scientific databases
  • industry
  • social networks
  • Computer-to-computer communication

4
RDF Schema
[Figure: example RDF graph with Schema and Instance layers]
5
RDF Schema
RDF triples: <subject, property, object>
Example: <www.picasso.net, first, Pablo>
6
Related Work
  • Triple store
  • Property tables
  • Class property tables
  • Dynamic table model
  • Vertically partitioned tables (Abadi et al., 2007)
  • Path-based approach (Matono et al., 2005)

Require more self-joins, normal joins, and NULL value storage
7
Vertical Partitioning
  • A table is created for each property

First
Subject   Object
'r1'      'Picasso'
'r4'      'August'

Last
Subject   Object
'r1'      'Picasso'
'r4'      'Rodin'

Paints
Subject   Object
'r1'      'r2'
'r1'      'r3'

... etc.
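
A minimal sketch of the per-property split described on this slide, written in Python. The triples reuse the First, Last, and Paints rows from the slide; the function name `partition_by_property` is hypothetical and does not come from the presentation.

```python
from collections import defaultdict

# Triples taken from the First, Last, and Paints tables on this slide.
TRIPLES = [
    ("r1", "first", "Picasso"),
    ("r4", "first", "August"),
    ("r1", "last", "Picasso"),
    ("r4", "last", "Rodin"),
    ("r1", "paints", "r2"),
    ("r1", "paints", "r3"),
]


def partition_by_property(triples):
    """Group (subject, property, object) triples into one
    two-column (subject, object) table per property."""
    tables = defaultdict(list)
    for subject, prop, obj in triples:
        tables[prop].append((subject, obj))
    return dict(tables)


tables = partition_by_property(TRIPLES)
print(sorted(tables))    # ['first', 'last', 'paints']
print(tables["paints"])  # [('r1', 'r2'), ('r1', 'r3')]
```

Each property becomes its own narrow table, so a query that touches a single property reads only that table, while a query that follows several properties has to join one table per property.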
8
Path-based Model
  • Path signatures relate to instance data

Path
pathid   pathexp
1        ''
2        'first'
3        'last'
4        'paints'
5        'title<paints'
6        'sculpts'
7        'title<sculpts'

Resource
name        pathid   root
'r1'        1        'r1'
'r2'        4        'r1'
'r3'        4        'r1'
'r4'        1        'r4'
'Picasso'   2        'r1'
'Pablo'     3        'r1'
'August'    2        'r4'
'Rodin'     3        'r4'
...
Our enhancement
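
One way such Path and Resource tables could be derived from triples is sketched below in Python. This is an illustration under stated assumptions, not the authors' implementation: the function `build_path_tables` is hypothetical, the sample triples (including the 'title' edge and the value 'Guernica') are made up for the example, and the `property<parent-path` notation is assumed to read "reached via the path on the right".

```python
from collections import defaultdict, deque

# Hypothetical sample triples; the 'title' edge and 'Guernica' are
# assumptions added for illustration.
TRIPLES = [
    ("r1", "first", "Pablo"),
    ("r1", "last", "Picasso"),
    ("r1", "paints", "r2"),
    ("r2", "title", "Guernica"),
]


def build_path_tables(triples, roots):
    """Walk outward from each root (key node) and return
    (path_ids, resources): a pathexp -> pathid mapping and
    rows of (name, pathid, root)."""
    edges = defaultdict(list)
    for subject, prop, obj in triples:
        edges[subject].append((prop, obj))

    path_ids = {"": 1}  # the empty expression identifies a root
    resources = []
    for root in roots:
        queue = deque([(root, "", frozenset([root]))])
        while queue:
            node, pathexp, ancestors = queue.popleft()
            pathid = path_ids.setdefault(pathexp, len(path_ids) + 1)
            resources.append((node, pathid, root))
            for prop, child in edges.get(node, []):
                if child in ancestors:
                    continue  # skip edges that would close a loop
                child_exp = prop if not pathexp else f"{prop}<{pathexp}"
                queue.append((child, child_exp, ancestors | {child}))
    return path_ids, resources


path_ids, resources = build_path_tables(TRIPLES, roots=["r1"])
print(path_ids["title<paints"])  # 5
print(resources[-1])             # ('Guernica', 5, 'r1')
```

Because every resource row carries the id of its full path expression, a path query can be answered by matching on `pathexp` rather than by chaining one join per edge.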
9
Problem Statement
  • Given:
  • A set of RDF triples
  • Vertical partitioning storage model
  • Path-based storage model
  • Find: Query plans for the various categories of
    queries under these two storage schemes.
  • Objective: To determine which query types perform
    comparatively better or worse in the two storage
    models
  • Why is this challenging?
  • Need for efficient storage of structured data
  • Different application domains use RDF, so generic
    storage schemes should support a diverse
    workload.

10
Contributions
  • Identification of benchmark queries
  • schema, instance, path, and aggregate queries
  • Enhancement to the path-based schema that
    addresses different types of workloads
  • Comparison of path-based model and vertical
    partitioning
  • Analysis of cyclic queries

11
Query Types
[Figure: taxonomy of query types. Labels: Non-path, Path, Schema vs Instance, Aggregate, List, Cycle, Connection, Diameter, Constraints (intermediate node, terminal node), Relationship]
  • Schema queries
  • find all types of artists
  • list all property names
  • list nodes with 2 or more descendants.
  • find the transitive sub-classes of a class
    'sculpture'
  • list properties with 2 or more descendants
  • Instance queries
  • find the titles of all paintings by Picasso
  • select all nodes within one edge-length of r4
  • list all the properties of node r4
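
As an illustration of how an instance query such as "find the titles of all paintings by Picasso" might look under vertical partitioning, here is a small sketch using Python's built-in `sqlite3` module. It is not the Oracle 10g setup used in the experiment; the table names (`p_last`, `p_paints`, `p_title`) and the title rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE p_last   (subject TEXT, object TEXT);
    CREATE TABLE p_paints (subject TEXT, object TEXT);
    CREATE TABLE p_title  (subject TEXT, object TEXT);
    INSERT INTO p_last   VALUES ('r1', 'Picasso'), ('r4', 'Rodin');
    INSERT INTO p_paints VALUES ('r1', 'r2'), ('r1', 'r3');
    -- Hypothetical title rows, added for illustration.
    INSERT INTO p_title  VALUES ('r2', 'Guernica'), ('r3', 'Painting 3');
""")

# One join per property on the path: last -> paints -> title.
TITLES_BY_PICASSO = """
    SELECT t.object
    FROM p_last AS l
    JOIN p_paints AS p ON p.subject = l.subject
    JOIN p_title  AS t ON t.subject = p.object
    WHERE l.object = 'Picasso'
    ORDER BY t.object
"""
titles = [row[0] for row in conn.execute(TITLES_BY_PICASSO)]
print(titles)  # ['Guernica', 'Painting 3']
```

The query touches three property tables and needs two joins, which is the kind of count the comparison tracks as "number of joins" and "number of tables".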

12
Query Types
  • Path queries
  • find the title of any painting painted by anyone
  • display all the titles of work done by artists
  • find the names of all the sculptors
  • ...with constraint on intermediate node
  • find an artist's name where the artifact is a
    painting
  • ...with terminal node constraints
  • display all the titles of work done by Picasso
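
A path query such as "find the title of any painting painted by anyone" could be answered in the path-based model by matching on the stored path expression. The sketch below uses Python's `sqlite3` module rather than Oracle 10g; the helper `names_on_path` is hypothetical, the 'Guernica' row is an assumed example, and filtering on the `root` column is only one possible way to express a constraint on the starting node.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE path     (pathid INTEGER PRIMARY KEY, pathexp TEXT);
    CREATE TABLE resource (name TEXT, pathid INTEGER, root TEXT);
    INSERT INTO path VALUES (1, ''), (2, 'first'), (3, 'last'),
                            (4, 'paints'), (5, 'title<paints');
    INSERT INTO resource VALUES ('r1', 1, 'r1'), ('r2', 4, 'r1'),
                                ('r3', 4, 'r1'), ('Picasso', 2, 'r1');
    -- Hypothetical row, added for illustration.
    INSERT INTO resource VALUES ('Guernica', 5, 'r1');
""")


def names_on_path(pathexp, root=None):
    """Return resource names stored under a path expression,
    optionally restricted to one root (key node)."""
    sql = ("SELECT r.name FROM resource AS r "
           "JOIN path AS p ON p.pathid = r.pathid "
           "WHERE p.pathexp = ?")
    params = [pathexp]
    if root is not None:
        sql += " AND r.root = ?"
        params.append(root)
    return [row[0] for row in conn.execute(sql + " ORDER BY r.name", params)]


print(names_on_path("title<paints"))             # ['Guernica']
print(names_on_path("paints"))                   # ['r2', 'r3']
print(names_on_path("title<paints", root="r4"))  # []
```

However long the path is, the lookup uses a single join between the two tables, whereas the vertically partitioned layout needs one join per edge on the path.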

13
Query Types
  • Path queries
  • connection queries
  • list all the properties of node r4
  • is there a connection between 'Picasso' and
    'Guernica'?
  • diameter queries
  • select all nodes in the graph within one
    edge-length of r4
  • non-simple path queries
  • detect loops in the dataset starting at 'Picasso'
  • detect loops in the whole dataset
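
Independent of either storage scheme, the connection and loop queries above can be pictured as graph traversals over the triples. The Python sketch below is only an illustration: the function names and sample triples are hypothetical, the connection check treats edges as undirected (an assumption), and the loop check follows edges in their stated direction.

```python
from collections import defaultdict, deque

# Hypothetical sample triples, for illustration only.
TRIPLES = [
    ("r1", "first", "Pablo"),
    ("r1", "last", "Picasso"),
    ("r1", "paints", "r2"),
    ("r2", "title", "Guernica"),
]


def connected(triples, a, b):
    """Breadth-first search over the triples, ignoring edge direction."""
    neighbours = defaultdict(set)
    for subject, _, obj in triples:
        neighbours[subject].add(obj)
        neighbours[obj].add(subject)
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return True
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False


def has_loop_from(triples, start):
    """Depth-first search along directed edges; report whether any
    path starting at `start` revisits a node already on that path."""
    out = defaultdict(list)
    for subject, _, obj in triples:
        out[subject].append(obj)

    def visit(node, on_path):
        if node in on_path:
            return True
        on_path.add(node)
        found = any(visit(nxt, on_path) for nxt in out[node])
        on_path.remove(node)
        return found

    return visit(start, set())


print(connected(TRIPLES, "Picasso", "Guernica"))  # True
print(has_loop_from(TRIPLES, "r1"))               # False
```

Adding a back edge such as `("r2", "painted_by", "r1")` (again hypothetical) makes `has_loop_from` return True for 'r1'.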

14
Query Types
  • Aggregate queries
  • find all nodes with 2 or more properties
  • list all subjects that have two instances of a
    single property
  • Relationship queries
  • find any relationship between r1 and r4
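
The two aggregate queries listed on this slide amount to grouping and counting. A rough Python sketch follows, using the First, Last, and Paints rows from the vertical partitioning slide; the function names are hypothetical, and "2 or more properties" is interpreted here as two or more distinct property names per subject, which is an assumption.

```python
from collections import Counter

# Triples taken from the First, Last, and Paints tables shown earlier.
TRIPLES = [
    ("r1", "first", "Picasso"),
    ("r4", "first", "August"),
    ("r1", "last", "Picasso"),
    ("r4", "last", "Rodin"),
    ("r1", "paints", "r2"),
    ("r1", "paints", "r3"),
]


def subjects_with_min_properties(triples, minimum=2):
    """Subjects that have at least `minimum` distinct property names."""
    distinct = {(subject, prop) for subject, prop, _ in triples}
    counts = Counter(subject for subject, _ in distinct)
    return sorted(s for s, n in counts.items() if n >= minimum)


def repeated_properties(triples):
    """(subject, property) pairs that occur two or more times."""
    counts = Counter((subject, prop) for subject, prop, _ in triples)
    return sorted(pair for pair, n in counts.items() if n >= 2)


print(subjects_with_min_properties(TRIPLES))  # ['r1', 'r4']
print(repeated_properties(TRIPLES))           # [('r1', 'paints')]
```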

15
Assumptions
  • A small dataset is used, on the assumption that
    the number of joins and the efficiency of the
    queries will not change significantly with larger
    datasets
  • No explicit storage of the RDF schema in the
    vertically-partitioned scheme (application
    independent)
  • INSERT, UPDATE, and DELETE are insignificant
    compared to SELECT
  • Key nodes in the path-based model are
    well-defined
  • In practice, key nodes would be generated
    dynamically after user load analysis

16
Experimental Process
  • Set up both schemes in Oracle 10g for the RDF
    graph shown earlier
  • Materialized path lengths in path-based scheme
  • Generated query plans
  • Analyzed queries based on the validation
    parameters
  • For cycle queries, joins are not supported
  • Validation parameters
  • Nodes
  • Edges
  • Number of joins
  • Number of tables
  • CPU cost
  • Storage bytes

17
Dataset used for experiment
18
Experimental Results
  • For CPU cost and bytes (storage), the entry in
    the table indicates which scheme used fewer CPU
    cycles or occupied less space. In cases where
    both required an identical or similar amount of
    computation or storage, we indicate this with
    "same".
  • Queries that cannot be answered are indicated by
    "--".

19
Conclusions & Observations
  • Vertical partitioning performs well for:
  • Short path lengths, terminal node constraints
  • Offers storage benefits for instance queries
    without path expressions
  • Enhanced path-based model performs well for:
  • Schema queries, path queries, cycle queries
  • Queries that the original path-based model could
    not address but the enhanced model can answer:
  • Connection queries and diameter queries
  • Path queries with intermediate node constraints

20
Conclusions (Cont'd)
  • Both schemes show the same performance on
    instance queries without path expressions.
  • Neither scheme addresses relationship
    queries
  • Interesting results for cycle queries:
  • specifying the start node gives worse performance
    than leaving the start node unspecified
  • when the start node is specified, the Oracle query
    plan uses a FILTER operation.

21
Future Work
  • Test large and diverse datasets
  • Test vertical partitioning with a
    column-orientated database like MonetDB
  • Pruning strategies for cycle queries
  • Impose join indexes
  • Find approaches to answer relationship queries
  • Storage classification based on the application
    domain

22
Thank You
  • Questions?

Please see http://www.cs.umn.edu/cmueller/cs8715
for a copy of the report that accompanies this
presentation, including a full bibliography