Alisdair Owens - PowerPoint PPT Presentation

1
Clustered Triple Storage For Jena
  • Alisdair Owens

Supervised by:
HP Labs Bristol: Andy Seaborne
Southampton: mc schraefel, Nick Gibbins
2
Intro
  • Requirement to store large amounts of RDF data
  • Semantic Web
  • Bioinformatics data
  • Scaling RDF stores is difficult
  • Flexibility of RDF does not lend itself to
    efficient storage schemas
  • How can we scale up RDF storage?

3
Clustering!
  • Apply the power of multiple machines to the
    problem.
  • Standard approach in traditional Database
    Management Systems (DBMS)

4
Outline
  • Traditional RDF stores and the Jena Tuple
    Database
  • Clustering Databases: Concepts and Techniques
  • Clustered TDB: Design and Prototype Evaluation
  • Questions

5
Storing RDF Triples
  • RDF can be expressed as a table of triples
  • Indexes allow simple selection by attribute
  • SPO
  • POS
  • OSP
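
As a rough illustration of the three index orderings, the following minimal in-memory sketch keeps one nested map per ordering. The class and method names are hypothetical, and a real store would use on-disk B-trees rather than Python dictionaries.

```python
from collections import defaultdict


class TripleIndexes:
    """Toy in-memory triple table with SPO, POS and OSP indexes."""

    def __init__(self):
        # Each index maps first key -> second key -> set of third values.
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        # Every triple is written once per index ordering.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def objects(self, s, p):
        """Answer the pattern (s, p, ?) from the SPO index."""
        return set(self.spo.get(s, {}).get(p, set()))

    def subjects(self, p, o):
        """Answer the pattern (?, p, o) from the POS index."""
        return set(self.pos.get(p, {}).get(o, set()))

    def predicates(self, o, s):
        """Answer the pattern (s, ?, o) from the OSP index."""
        return set(self.osp.get(o, {}).get(s, set()))
```

Each lookup uses the index whose leading keys match the bound positions of the pattern, which is why three orderings are enough for simple selection by attribute.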

6
Typical Triple Store Architecture
7
Typical Triple Store
  • Advantages
      • Simple.
      • Does not require the storage of large, variable
        length strings in indexes.
      • No lookups required to convert RDF terms to
        hashes.
  • Disadvantages
      • Expensive reads: need to traverse index of node
        table for every unique URI/literal returned.
      • Expensive writes: need to traverse index of node
        table and all B-trees for every write.
      • Large hash values necessary to prevent collisions.

8
TDB Architecture
9
TDB
  • Advantages
      • Fast writes.
      • Reduces costs for NodeID to Node conversion.
      • NodeIDs can be smaller than hashes, reducing the
        size of SPO, POS, and OSP indexes.
  • Disadvantages
      • Potentially greater space required due to
        additional Node/NodeID map.
      • Converting URIs/literals to IDs potentially costs
        disk seeks, depending on how much we can keep in
        memory.
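
The trade-off above can be sketched with a toy node table in which a NodeID is simply the position of the term in an append-only log, and a separate map goes from term to NodeID. This is an assumption-laden illustration, not TDB's actual file format: the class name, the 4-byte length prefix, and the in-memory bytearray standing in for a disk file are all hypothetical.

```python
class NodeTable:
    """Toy append-only node table: a NodeID is the byte offset of the term."""

    def __init__(self):
        self._log = bytearray()  # stands in for the append-only node file
        self._ids = {}           # Node -> NodeID map (the extra structure)

    def node_id(self, term: str) -> int:
        """Return the NodeID for a term, appending the term if it is new."""
        existing = self._ids.get(term)
        if existing is not None:
            return existing
        data = term.encode("utf-8")
        offset = len(self._log)
        # Record layout (assumed): 4-byte big-endian length, then the bytes.
        self._log += len(data).to_bytes(4, "big") + data
        self._ids[term] = offset
        return offset

    def node(self, node_id: int) -> str:
        """NodeID -> Node needs no index lookup: jump straight to the offset."""
        length = int.from_bytes(self._log[node_id:node_id + 4], "big")
        start = node_id + 4
        return self._log[start:start + length].decode("utf-8")
```

Writes only append to the log, and reading a term back is a direct seek. The cost is the additional term-to-ID map, which is the part that may need disk seeks when it does not fit in memory.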

10
Distributing Databases
  • Apply the power of multiple machines to the
    problem
  • Desired Improvements
  • Speedup
  • Scaleup
  • Throughput Scaleup
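
The slides do not define these measures. Assuming the usual definitions from the parallel database literature, speedup and scaleup can be written as simple ratios of elapsed times; the function names and the timings in the usage note are illustrative only.

```python
def speedup(small_system_time: float, big_system_time: float) -> float:
    """Same problem on a small and a large system.

    Linear speedup means N times the hardware gives a ratio of N.
    """
    return small_system_time / big_system_time


def scaleup(small_time_small_problem: float, big_time_big_problem: float) -> float:
    """N-times larger problem on an N-times larger system.

    Linear scaleup means the ratio stays at 1.0.
    """
    return small_time_small_problem / big_time_big_problem
```

For example, if a load takes 120 s on one machine and 30 s on four, `speedup(120.0, 30.0)` gives 4.0, which would be linear for that made-up case.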

11
Enabling Parallelism
  • Partitioning
  • Pipelining
  • Concurrent Users and Subqueries

12
Barriers to Parallelisation
  • Skew
  • Startup
  • Interference

13
Clustered TDB Objectives
  • Provide a clustered triple store for Jena.
  • Focus on systems of fewer than 100 machines.
  • Support for useful and scalable read/write
    performance.
  • Design with support for redundancy in mind.

14
Questions
  • How do we:
      • Distribute information redundantly across the
        cluster?
      • Allow rebalancing of data to deal with hot spots,
        data changes, machine additions and removals?
      • Preserve TDB's performance characteristics?
      • Scale in a near linear fashion?
      • Optimise distributed queries?

15
Points of Interest
  • How do we distribute each table/index?
  • TDB's NodeIDs reference a disk location.
  • How can these be extended to refer to a unique
    location on the network?
  • Can we accomplish this while still allowing for
    redundancy and redistribution?
  • Can we preserve append-only writes on the node
    table?
  • How do we optimise queries?

16
Overall System Structure
17
Balancing
  • To enable easy rebalancing, we pretend that our
    100-machine cluster consists of (say) 2000
    processing nodes. These are called virtual
    processing nodes, or vnodes.
  • Vnodes 0-19 forward to machine 0, 20-39 to
    machine 1, etc.
  • To rebalance, simply change the vnodes that each
    machine is responsible for, and move data files
    accordingly.
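
The vnode scheme above can be sketched as a plain lookup table from vnode number to machine number. The function names are hypothetical, and moving the data files is left out.

```python
NUM_MACHINES = 100
NUM_VNODES = 2000


def initial_assignment(num_vnodes=NUM_VNODES, num_machines=NUM_MACHINES):
    """Assign contiguous blocks of vnodes to machines.

    Assumes num_vnodes is a multiple of num_machines, as in the
    2000 / 100 example (20 vnodes per machine).
    """
    per_machine = num_vnodes // num_machines
    return [vnode // per_machine for vnode in range(num_vnodes)]


def move_vnode(assignment, vnode, new_machine):
    """Rebalance by repointing one vnode; returns the previous owner.

    The caller would also have to ship that vnode's data files.
    """
    old_machine = assignment[vnode]
    assignment[vnode] = new_machine
    return old_machine
```

Because data is addressed by vnode rather than by machine, rebalancing changes only this table and the location of the files, not the identifiers stored in the indexes.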

18
NodeIDs
  • NodeIDs are unique 64-bit integers.
  • Each NodeID encodes the vnode and disk position
    of the ID/node mapping.
  • Compressible.
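
The exact bit layout is not given in this transcript. As a sketch only, assuming 16 bits for the vnode and 48 bits for the disk position (both widths are assumptions, not the presented design), packing and unpacking could look like this:

```python
VNODE_BITS = 16    # assumed width, not taken from the slides
OFFSET_BITS = 48   # assumed width; 16 + 48 = 64 bits in total


def encode_node_id(vnode: int, offset: int) -> int:
    """Pack a vnode number and a disk position into one 64-bit integer."""
    if not 0 <= vnode < (1 << VNODE_BITS):
        raise ValueError("vnode out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (vnode << OFFSET_BITS) | offset


def decode_node_id(node_id: int) -> tuple:
    """Recover (vnode, offset) from a packed NodeID."""
    return node_id >> OFFSET_BITS, node_id & ((1 << OFFSET_BITS) - 1)
```

With a layout of this kind, any machine can tell from the ID alone which vnode holds the node, and the vnode table then says which machine currently serves it.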

19
Distributing the Node/NodeID Map
  • Distribute the table based on hash value.
  • High likelihood of even distribution.
  • Inherent knowledge of data location.
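
A minimal sketch of hash-based placement for the Node/NodeID map follows. MD5 is used here only because it gives a stable, well-mixed value across processes; the hash function actually used is not stated in the slides, and `vnode_for_term` is a hypothetical name.

```python
import hashlib

NUM_VNODES = 2000  # figure taken from the balancing example


def vnode_for_term(term: str, num_vnodes: int = NUM_VNODES) -> int:
    """Pick the vnode that stores the Node -> NodeID entry for a term."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_vnodes
```

Any machine can recompute the same vnode from the term itself, which is the "inherent knowledge of data location" noted above.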

20
Distributing the Node Table
  • Distributed round robin (or by another scheme).
  • The destination machine assigns an ID to the node
    and transmits it back to the distributing node.
  • IDs encode their own location, so no need for
    more complex distribution.

21
Distributing Triple Indexes
  • Each triple is distributed three times, based on
    a hash of S, of P, and of O.
  • Each machine stores triples distributed on S in
    its SPO index, P in POS, O in OSP.
  • Saves space, keeps indexes shorter.
  • Inconsistent content in each index on a single
    node.
  • Queries on individual machines make no sense.
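
The placement rule above can be sketched as follows, with triples represented by integer NodeIDs. The hash function, the vnode count, and the function names are assumptions made for illustration.

```python
import hashlib


def _bucket(node_id: int, num_vnodes: int) -> int:
    """Stable hash of a 64-bit NodeID onto a vnode number."""
    digest = hashlib.md5(node_id.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % num_vnodes


def placements(s: int, p: int, o: int, num_vnodes: int = 2000) -> dict:
    """Return the vnode that holds this triple in each of the three indexes."""
    return {
        "SPO": _bucket(s, num_vnodes),  # placed by subject
        "POS": _bucket(p, num_vnodes),  # placed by predicate
        "OSP": _bucket(o, num_vnodes),  # placed by object
    }
```

All triples sharing a subject end up in the same vnode's SPO index, but that vnode's POS and OSP indexes generally hold different triples. This is why a query cannot be answered meaningfully by one machine in isolation.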

22
Evaluation: Load Times
  • Scales in linear fashion with number of
    processing nodes
  • Requires evaluation over larger number of machines

23
Evaluation: Query Time
[Charts: query time with 1 user and with 5 users]
24
Evaluation: Query Time
[Charts: query time with 1 user and with 5 users]
25
Conclusions
  • Initial evaluations positive
  • Excellent load scaling
  • Query performance as expected

26
Ongoing discussion topics
  • Statistics and Query Optimisation
  • Distributing Operations
  • Questions?

27
Take Away
  • RDF stores have significant performance issues
  • Distribution across multiple systems can
    significantly improve performance.
  • Observations on distributing databases can be
    applied to a wide range of distributed systems