Scalable SPARQL Querying of Large RDF Graphs

About This Presentation

Title:

Scalable SPARQL Querying of Large RDF Graphs

Description:

Title: 1 Author: Leon Description: TR Template of SOARingLab Last modified by: Jxulie Created Date: 10/5/2005 1:11:43 AM Document presentation format – PowerPoint PPT presentation

Slides: 43

Provided by: Leon110

Transcript and Presenter's Notes

Title: Scalable SPARQL Querying of Large RDF Graphs

1
Scalable SPARQL Querying of Large RDF Graphs

Xu Bo
2012.06.11

In PVLDB, 4(11), 2011
2
Outline

About Presenter
Semantic Web
Previous Work
New Problem
SYSTEM ARCHITECTURE
EXPERIMENTS
CONCLUSIONS AND FUTURE WORK

3
About Presenter

Daniel Abadi
Associate Professor of Computer
Science at Yale University
Research
Column-Oriented Database Systems
Petascale Parallel Database Systems (HadoopDB)
Semantic Web Data Management

4
Semantic Web

The vision of Semantic Web is to build a "web of
data" that enables machines to understand the
semantics of information on the Web

5
Google Knowledge Graph
6
Key Technology

HTML
XML

7
The Disadvantage of XML

David Billington is a lecturer of Discrete
Mathematics.
There is no standard way of assigning meaning to
tag nesting

8
The Disadvantage of XPath

Suppose we want to collect all academic staff
members. A path expression in XPath might be
//academicStaffMember
XML is semantically unsatisfactory

9
RDF

Resource Description Framework
Identifies things using Web identifiers (Uniform Resource
Identifiers, URIs) and describes resources in terms of
properties and property values

10
RDF as Triples and a Graph
11
SPARQL

RDF query language
A basic graph pattern
Answering SPARQL can be seen as finding subgraphs
in the RDF data that match the graph pattern

12
Example for Star Pattern

Find the names of the strikers that play for FC
Barcelona.
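
The star-pattern question above can be sketched as a small subgraph-matching routine. This is only an illustration, not the system from the talk: the toy triples, the predicate names (position, playsFor, name) and the helper functions are all assumptions made for the example.

```python
# Toy RDF graph as (subject, predicate, object) tuples.
# All identifiers and predicate names are hypothetical.
TRIPLES = [
    ("player1", "position", "striker"),
    ("player1", "playsFor", "FC_Barcelona"),
    ("player1", "name", "Player One"),
    ("player2", "position", "goalkeeper"),
    ("player2", "playsFor", "FC_Barcelona"),
    ("player2", "name", "Player Two"),
    ("player3", "position", "striker"),
    ("player3", "playsFor", "Other_FC"),
    ("player3", "name", "Player Three"),
]


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new_binding


def evaluate(patterns, triples):
    """Evaluate a basic graph pattern by joining one triple pattern at a time."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                new_binding = match_pattern(pattern, triple, binding)
                if new_binding is not None:
                    extended.append(new_binding)
        bindings = extended
    return bindings


# Star pattern: every triple pattern shares the subject variable ?player.
STAR_QUERY = [
    ("?player", "position", "striker"),
    ("?player", "playsFor", "FC_Barcelona"),
    ("?player", "name", "?name"),
]

names = [b["?name"] for b in evaluate(STAR_QUERY, TRIPLES)]
print(names)
```

Because all three patterns share ?player as subject, the query graph is a star centered on that variable.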

13
Another Example

Find football players playing for clubs in a
populous region where they were born.

15
Previous Work

RDF In RDBMSs
Property Tables
Vertically Partitioned Approach

16
RDF In RDBMSs

Get the title of the book(s) Joe Fox wrote in 2001
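
A minimal sketch of this query over a single three-column triples table, using Python's built-in sqlite3 module. The table layout, the property names (author, copyright, title) and the sample rows are assumptions for illustration; the point is that each triple pattern costs one self-join on the triples table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")

# Hypothetical sample data.
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("book1", "author", "Joe Fox"),
        ("book1", "copyright", "2001"),
        ("book1", "title", "Book A"),
        ("book2", "author", "Joe Fox"),
        ("book2", "copyright", "1999"),
        ("book2", "title", "Book B"),
    ],
)

# One alias of the triples table per triple pattern, joined on the subject.
rows = conn.execute(
    """
    SELECT t3.obj
    FROM triples AS t1
    JOIN triples AS t2 ON t1.subj = t2.subj
    JOIN triples AS t3 ON t1.subj = t3.subj
    WHERE t1.prop = 'author'    AND t1.obj = 'Joe Fox'
      AND t2.prop = 'copyright' AND t2.obj = '2001'
      AND t3.prop = 'title'
    """
).fetchall()

titles = [row[0] for row in rows]
print(titles)
```

Property tables and vertical partitioning are two ways of reducing the number or cost of these self-joins.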

17
Property Tables
18
Vertically Partitioned Approach
19
New Problem

Single-node RDF management systems are abundant
Sesame
Jena
RDF-3X
3store
Clustered RDF management is much less
explored; it is the focus of this talk

20
SYSTEM ARCHITECTURE
21
Graph Partitioning

Hash vs. Graph partitioning
Hash: only efficient for star patterns
Graph: takes advantage of the graph model
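
A rough sketch of hash partitioning by subject, to show why it suits star patterns only. The function names, the node count and the sample triples are assumptions; zlib.crc32 is used simply because it gives the same hash value on every run.

```python
import zlib
from collections import defaultdict


def node_for(subject, num_nodes):
    """Map a subject to a node with a hash that is stable across runs."""
    return zlib.crc32(subject.encode("utf-8")) % num_nodes


def hash_partition(triples, num_nodes):
    """Place every triple on the node chosen by hashing its subject."""
    parts = defaultdict(list)
    for s, p, o in triples:
        parts[node_for(s, num_nodes)].append((s, p, o))
    return parts


# Hypothetical sample data.
triples = [
    ("player1", "playsFor", "club1"),
    ("player1", "born", "region1"),
    ("club1", "region", "region1"),
]
parts = hash_partition(triples, 4)

# All triples with subject player1 land on one node, so a star pattern
# centered on player1 can be answered locally.
star_nodes = {node_for(s, 4) for s, _, _ in triples if s == "player1"}
print(star_nodes)
```

A path such as player1 -> club1 -> region1 has no such guarantee: club1 is hashed independently of player1, so following the path may require moving data between nodes.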

22
Graph Partitioning

Edge vs. Vertex partitioning
Edge: natural but inefficient for query execution
Vertex: superior for common graph patterns

23
Vertex Partitioning

Preprocess
remove triples whose predicate is rdf:type
METIS partitioner

24
Triple Placement

Minimizing data shuffling/exchange
Allowing data overlap
N-hop guarantee
The extent of data overlap
If a vertex is assigned to a machine, any vertex
that is within n-hop of this vertex is also
stored in this machine
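
One way to picture placement with overlap is the sketch below. It assumes a vertex-to-partition assignment is already given (for example from a graph partitioner) and follows edges in the forward direction only; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def place_triples(triples, assignment, n):
    """For each partition, store every triple reachable within n forward
    hops of one of the vertices assigned to it (its core vertices)."""
    out_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((s, p, o))

    cores = defaultdict(set)
    for vertex, part in assignment.items():
        cores[part].add(vertex)

    placement = {}
    for part, core in cores.items():
        stored, frontier = set(), set(core)
        for _ in range(n):
            next_frontier = set()
            for v in frontier:
                for triple in out_edges.get(v, []):
                    stored.add(triple)
                    next_frontier.add(triple[2])  # continue from the object
            frontier = next_frontier
        placement[part] = stored
    return placement


# Hypothetical chain a -> b -> c -> d split across two partitions.
triples = [("a", "p", "b"), ("b", "q", "c"), ("c", "r", "d")]
assignment = {"a": 0, "b": 0, "c": 1, "d": 1}

one_hop = place_triples(triples, assignment, 1)
two_hop = place_triples(triples, assignment, 2)
print(sum(len(t) for t in one_hop.values()))  # triples stored with n = 1
print(sum(len(t) for t in two_hop.values()))  # triples stored with n = 2
```

In this toy case the 1-hop placement stores each triple once, while the 2-hop placement stores (c, r, d) on both partitions: a larger n buys more local query processing at the price of more replicated data.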

25
DIRECTED N-HOP GUARANTEE
26
A potential problem

triples (s, p, o) and (o, p', o')
co-located under the directed 2-hop guarantee
triples (s, p, o) and (s', p', o)
co-location is not guaranteed
object-connected triples are not unusual
solution: the undirected n-hop guarantee
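
The difference between the two guarantees can be sketched on two object-connected triples. The helper below is an assumption made for illustration: it collects the triples within n hops of a start vertex, optionally also following edges from object to subject.

```python
def within_n_hops(triples, start, n, directed):
    """Triples within n hops of `start`. In undirected mode, edges are
    also followed backwards, from object to subject."""
    stored, frontier = set(), {start}
    for _ in range(n):
        next_frontier = set()
        for s, p, o in triples:
            if s in frontier:
                stored.add((s, p, o))
                next_frontier.add(o)
            if not directed and o in frontier:
                stored.add((s, p, o))
                next_frontier.add(s)
        frontier = next_frontier
    return stored


# Two triples that share only their object (hypothetical data).
object_connected = [("s1", "p1", "o"), ("s2", "p2", "o")]

directed_result = within_n_hops(object_connected, "s1", 2, directed=True)
undirected_result = within_n_hops(object_connected, "s1", 2, directed=False)
print(directed_result)
print(undirected_result)
```

Starting from s1, the directed walk stops at o because o has no outgoing edge, so (s2, p2, o) is missed; the undirected walk steps back from o to s2 and picks it up.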

27
Triple Placement Algorithm
28
Query Processing

Queries are executed in RDF-stores and/or Hadoop
Query execution is more efficient in RDF-stores
than in Hadoop
Pushing as much of the processing as possible
into RDF-stores
Minimizing the number of Hadoop jobs
The larger the hop guarantee, the more work is
done in RDF-stores

29
To Communicate, or not to Communicate

Given a query and n-hop guarantee, is
communication (Hadoop job) between nodes needed?
Choose the center of the query graph
Calculate the distance from the center to the
furthest edge
If distance > n, communication is needed;
otherwise it is not

30
Determining whether a Query is PWOC

PWOC Query
parallelizable without communication
DoFE
distance of farthest edge
the vertex in a graph with the smallest DoFE will
be the most central in a graph
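
A minimal sketch of the DoFE computation and the resulting PWOC test. It assumes an undirected n-hop guarantee and a connected query graph. The edge list is an assumed reconstruction of the query from slide 13 (footballers playing for clubs in a populous region where they were born), since the original figure is not in the transcript; the function names are hypothetical.

```python
from collections import defaultdict, deque


def dofe(edges, vertex):
    """Distance of farthest edge from `vertex`, with edges treated as
    undirected. The distance to an edge is the distance to its nearer
    endpoint plus one. Assumes the query graph is connected."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    dist = {vertex: 0}
    queue = deque([vertex])
    while queue:  # breadth-first search from `vertex`
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return max(min(dist[u], dist[v]) + 1 for u, v in edges)


def is_pwoc(edges, n):
    """Pick the vertex with the smallest DoFE as the core; the query is
    PWOC when that DoFE does not exceed the hop guarantee n."""
    vertices = sorted({x for edge in edges for x in edge})
    core = min(vertices, key=lambda v: dofe(edges, v))
    return dofe(edges, core) <= n, core


# Assumed query graph (type, playsFor, born, region, population edges).
QUERY = [
    ("?player", "footballer"),
    ("?player", "?club"),
    ("?player", "?region"),
    ("?club", "?region"),
    ("?region", "?pop"),
]

for v in ["footballer", "?pop", "?region", "?player", "?club"]:
    print(v, dofe(QUERY, v))
print(is_pwoc(QUERY, 2)[0], is_pwoc(QUERY, 1)[0])
```

On this assumed graph the DoFEs come out as 3, 3, 2, 2 and 2, the values quoted on slide 35, so the query is PWOC under a 2-hop guarantee but not under a 1-hop guarantee.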

31
The algorithm
32
the issue of duplicate results

naive approach
remove duplicates after the query has completed
owner-computes model
add triples (v, <isOwned>, "Yes") to a
partition for each vertex v it owns
For each query issued to the RDF-stores, add an
additional pattern (core, <isOwned>, "Yes")
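
The owner-computes idea can be sketched with two overlapping partitions. Everything here is an illustrative assumption: the predicate "knows", the partition contents and the helper names. Each partition carries ownership marker triples for its own vertices, and the query keeps a match only when its core vertex is owned locally.

```python
IS_OWNED = "<isOwned>"


def with_ownership(partition_triples, owned_vertices):
    """Add one (v, <isOwned>, "Yes") triple per vertex the partition owns."""
    markers = [(v, IS_OWNED, "Yes") for v in sorted(owned_vertices)]
    return list(partition_triples) + markers


def query_knows(partition, owned_only):
    """Match (?x, knows, ?y); with owned_only, also require
    (?x, <isOwned>, "Yes") so that only the owner reports the match."""
    owned = {s for s, p, o in partition if p == IS_OWNED and o == "Yes"}
    return [
        (s, o)
        for s, p, o in partition
        if p == "knows" and (not owned_only or s in owned)
    ]


# Partition 0 owns a and, because of data overlap, also stores (b, knows, c).
part0 = with_ownership([("a", "knows", "b"), ("b", "knows", "c")], {"a"})
# Partition 1 owns b and c.
part1 = with_ownership([("b", "knows", "c")], {"b", "c"})

naive = query_knows(part0, False) + query_knows(part1, False)
deduplicated = query_knows(part0, True) + query_knows(part1, True)
print(naive)
print(deduplicated)
```

The naive union reports (b, c) twice and needs a separate pass to remove duplicates, whereas the ownership pattern filters the replicated match at the source.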

33
A query is not PWOC

decompose the query into PWOC subqueries
use Hadoop jobs to join the results of the PWOC
subqueries
The number of Hadoop jobs required to complete
the query increases as the number of subqueries
increases

34
minimal number of subqueries

reduces to the problem of finding minimal edge
partitioning of a graph into subgraphs of bounded
diameter
brute-force
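
A brute-force sketch of the decomposition, under stated assumptions: the guarantee is undirected, a valid subquery must be connected, and the edge list is the same assumed reconstruction of the slide 13 query. It tries every partition of the query edges, fewest blocks first, and returns the first one whose blocks are all PWOC for the given n. All function names are hypothetical.

```python
from collections import defaultdict, deque


def min_dofe(edges):
    """Smallest DoFE over all vertices of an edge set, or infinity if
    the edge set is not connected."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    best = float("inf")
    for start in list(adj):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        if len(dist) < len(adj):
            return float("inf")  # disconnected subquery
        best = min(best, max(min(dist[u], dist[v]) + 1 for u, v in edges))
    return best


def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        yield [[first]] + partition


def decompose(edges, n):
    """Return an edge partition with the fewest blocks such that every
    block is PWOC under an n-hop guarantee (exponential in len(edges))."""
    for partition in sorted(set_partitions(list(edges)), key=len):
        if all(min_dofe(block) <= n for block in partition):
            return partition
    return None


QUERY = [
    ("?player", "footballer"),
    ("?player", "?club"),
    ("?player", "?region"),
    ("?club", "?region"),
    ("?region", "?pop"),
]

print(len(decompose(QUERY, 2)))  # number of subqueries with a 2-hop guarantee
print(len(decompose(QUERY, 1)))  # number of subqueries with a 1-hop guarantee
```

Exhaustive search is only practical because query graphs are small; the number of edge partitions grows very quickly with the number of triple patterns.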

35
Example
DoFEs for manager, footballClub, Barcelona and
club are 2, 2, 2 and 1
DoFEs for footballer, pop, region, player and
club are 3, 3, 2, 2 and 2
36
Decompose Example
37
EXPERIMENTS