Title: Efficient SPARQL Query Processing in MapReduce through Data Partitioning and Indexing
1Efficient SPARQL Query Processing in MapReduce
through Data Partitioning and Indexing
Nie Zhi niezhixuesen_at_163.com
2Outline
- Introduction
- Related work
- SPARQL Query Processing in MapReduce
- Experiments
- Conclusion
4RDF
- Resource Description Framework
- subject-predicate-object expressions (S-P-O)
Example RDF graph (namespace http://www.mpii.de/yago/resource/), one S-P-O triple per line:
- Albert Einstein (S), isCalled (P), Albert Einstein (O)
- Albert Einstein, isCalled, ????????
- Albert Einstein, wasBornIn, Ulm
- Albert Einstein, hasWonPrize, Nobel Prize in Physics
5SPARQL Query Language for RDF
Query:
PREFIX source: <http://www.mpii.de/yago/resource/>
SELECT ?name ?where WHERE {
  ?who source:hasWonPrize "Nobel Prize in Physics" .
  ?who source:isCalled ?name .
  ?who source:wasBornIn ?where
}
Data: the example graph (namespace http://www.mpii.de/yago/resource/), in which Albert Einstein isCalled Albert Einstein and ????????, wasBornIn Ulm, and hasWonPrize Nobel Prize in Physics.
Result:
name | where
Albert Einstein | Ulm
???????? | Ulm
6RDF knowledge base
- Semantic Web, Web 2.0
- Extract Knowledge from the Web
- YAGO
- DBpedia
- Freebase
- Billion Triple Challenge
7RDF knowledge base
- 295 data sets, 31 billion RDF triples, 504 million RDF links (September 2011)
8Challenge and Opportunity
- Challenge
- RDF data is growing rapidly; researchers are working with billions of triples.
- Relational databases have limited scalability.
- Opportunity
- Google: GFS, MapReduce, BigTable
- Hadoop: an implementation of the MapReduce framework, together with HDFS
- Achievements: Yahoo!, Amazon and others
- We need to build on these recent achievements for handling massive-scale Web data on clusters
9MapReduce: word count
- file1: the weather is good
- file2: today is good
- file3: good weather is good
- Map(k1, v1) -> list(k2, v2)
- Reduce(k2, list(v2)) -> list(k3, v3)
Map output
- Worker 1: (the, 1), (weather, 1), (is, 1), (good, 1)
- Worker 2: (today, 1), (is, 1), (good, 1)
- Worker 3: (good, 1), (weather, 1), (is, 1), (good, 1)
Reduce input
- Worker 1: (the, 1)
- Worker 2: (is, 1), (is, 1), (is, 1)
- Worker 3: (weather, 1), (weather, 1)
- Worker 4: (today, 1)
- Worker 5: (good, 1), (good, 1), (good, 1), (good, 1)
Reduce output
- Worker 1: (the, 1)
- Worker 2: (is, 3)
- Worker 3: (weather, 2)
- Worker 4: (today, 1)
- Worker 5: (good, 4)
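The word count flow can be imitated on a single machine. The following is a minimal Python sketch, not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` are chosen for illustration, and the grouping step that the framework performs between map and reduce is written out explicitly.

```python
from collections import defaultdict

# The three input files from the slide.
files = {
    "file1": "the weather is good",
    "file2": "today is good",
    "file3": "good weather is good",
}

def map_phase(text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce(k2, list(v2)) -> list(k3, v3): sum the counts of one word."""
    return (key, sum(values))

map_output = [pair for text in files.values() for pair in map_phase(text)]
reduce_input = shuffle(map_output)
word_counts = dict(reduce_phase(k, v) for k, v in reduce_input.items())
print(word_counts)
```

In a real cluster each map call and each reduce call may run on a different worker; here they run one after another in a single process.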
10Outline
- Introduction
- Related work
- SPARQL Query Processing in MapReduce
- Experiments
- Conclusion
11Solution 1
- Directly map the SPARQL query into a sequence of MapReduce jobs
- Pro.
- Scalable
- Con.
- A burden on the user in terms of usage and maintenance
- No support for complex queries
- No index
- Does not consider the characteristics of RDF data
12Solution 2
- Map the SPARQL query to Pig -> MapReduce jobs
- Pro.
- Scalable
- Supports complex queries
- Con.
- No index
- Does not consider the characteristics of RDF data
13Outline
- Introduction
- Related work
- SPARQL Query Processing in MapReduce
- Experiments
- Conclusion
14Architecture overview
Layers of the architecture, from top to bottom:
- SPARQL Translator (BGP, Union, Filter, Optional) and RDF 2 JSON Loader
- JAQL Query Language (Transform, Filter, Join, Sort, Group, Built-in Functions)
- Optimizer
- JSON Data Model
- Map-Reduce Runtime
- HDFS
- Cluster Deployment and Management
15JSON
- JSON (JavaScript Object Notation) is a lightweight data-interchange format
- It is based on a subset of the JavaScript programming language
- JSON is built on two structures
- A collection of name/value (key/value) pairs
- An ordered list of values (array)
16RDF to JSON
RDF triple | JSON format
Albert Einstein isCalled Albert Einstein | {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"}
Albert Einstein isCalled ???????? | {"s": "Albert Einstein", "p": "isCalled", "o": "????????"}
Albert Einstein wasBornIn Ulm | {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"}
Albert Einstein wasBornOnDate 1879-03-14 | {"s": "Albert Einstein", "p": "wasBornOnDate", "o": "1879-03-14"}
Albert Einstein hasWonPrize Nobel Prize in Physics | {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}
Albert Einstein diedOnDate 1955-04-18 | {"s": "Albert Einstein", "p": "diedOnDate", "o": "1955-04-18"}
- JSON is built on two structures
- name/value (key/value) pairs, e.g. "s": "Albert Einstein"
- list of values (array), e.g. [{"s": "Albert Einstein", ...}, ...]
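The RDF-to-JSON mapping can be sketched in a few lines of Python. This is only an illustration of the record layout with the keys s, p and o; the helper name `triple_to_record` is an assumption and is not taken from the loader described in the slides.

```python
import json

# A few of the example triples as (subject, predicate, object) tuples.
triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
]

def triple_to_record(triple):
    """Turn one triple into a JSON object (a set of name/value pairs)."""
    s, p, o = triple
    return {"s": s, "p": p, "o": o}

# The whole data set becomes a JSON array (an ordered list of values).
records = [triple_to_record(t) for t in triples]
json_text = json.dumps(records, indent=2)
print(json_text)
```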
17JAQL
- JAQL is an open-source language for querying JSON (JavaScript Object Notation) data
- It provides a general parallel data processing platform on Hadoop
- Developed by IBM
18Basic Idea
- SPARQL can be supported on Hadoop by translating queries into JAQL operators
- Filter
- Transform
- Join
- Group
- Sort
- Built-in functions: merge(d1, d2), regex(), etc.
19SPARQL to JAQL Transformation
SPARQL query:
PREFIX source: <http://www.mpii.de/yago/resource/>
SELECT ?name ?where WHERE {
  ?who source:hasWonPrize "Nobel Prize in Physics" .
  ?who source:isCalled ?name .
  ?who source:wasBornIn ?where .
}
JAQL query:
// read files from HDFS by predicate name
$1 = read(hdfs('source:hasWonPrize'))
     -> filter $.o == "Nobel Prize in Physics"   // select
     -> transform {$.s};                         // project
$2 = read(hdfs('source:isCalled')) -> transform {$.s, $.o};
$3 = read(hdfs('source:wasBornIn')) -> transform {$.s, $.o};
// multi-join
join $1, $2, $3 where $1.s == $2.s and $2.s == $3.s
     into {name: $2.o, where: $3.o};             // project to ?name ?where
Figure: execution plan with the inputs 1, 2 and 3, the join result 4, and MapReduce job1 to job4.
Sample record: {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"}
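The three reads and the join of the JAQL query can be traced on small in-memory data. The sketch below is plain Python, not JAQL, and runs in one process instead of as MapReduce jobs; `FILES`, `read`, `index_by_subject` and `star_join` are hypothetical names introduced only for this illustration.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the per-predicate files stored in HDFS.
FILES = {
    "hasWonPrize": [
        {"s": "Albert Einstein", "o": "Nobel Prize in Physics"},
        {"s": "Charles K. Kao", "o": "Nobel Prize in Physics"},
        {"s": "Faye Wong", "o": "MTV Video Music Awards"},
    ],
    "isCalled": [{"s": "Albert Einstein", "o": "Albert Einstein"}],
    "wasBornIn": [
        {"s": "Albert Einstein", "o": "Ulm"},
        {"s": "Charles K. Kao", "o": "Shanghai"},
        {"s": "Faye Wong", "o": "Beijing"},
    ],
}

def read(predicate):
    """Stand-in for reading one predicate file: return all of its records."""
    return FILES[predicate]

# Step 1: filter (select) on the object, then transform (project) to the subject.
q1 = [{"s": r["s"]} for r in read("hasWonPrize")
      if r["o"] == "Nobel Prize in Physics"]
# Steps 2 and 3: transform keeps subject and object.
q2 = [{"s": r["s"], "o": r["o"]} for r in read("isCalled")]
q3 = [{"s": r["s"], "o": r["o"]} for r in read("wasBornIn")]

def index_by_subject(records):
    """Group records by subject so the join does not rescan the lists."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["s"]].append(r)
    return grouped

def star_join(q1, q2, q3):
    """Equi-join on the subject, then project to name and where."""
    by_s2, by_s3 = index_by_subject(q2), index_by_subject(q3)
    return [
        {"name": r2["o"], "where": r3["o"]}
        for r1 in q1
        for r2 in by_s2.get(r1["s"], [])
        for r3 in by_s3.get(r1["s"], [])
    ]

result = star_join(q1, q2, q3)
print(result)
```

Charles K. Kao passes the filter but has no isCalled record in this sample, so the inner join drops him and only Albert Einstein remains.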
20Data storage
- In the Hadoop framework, a file is the smallest unit of input to a MapReduce job and is read from disk
- One straightforward strategy is to store all the data in one file
- Every read operation must then scan the entire data set
- Data Partitioning Strategy
21Data Partitioning Strategy
- Horizontal partitioning
- Vertical partitioning
- Clustered property partitioning
22 Horizontal partitioning with JSON
- For example
Albert Einstein isCalled Albert Einstein
Albert Einstein isCalled ????????
Albert Einstein wasBornIn Ulm
Albert Einstein wasBornOnDate 1879-03-14
Albert Einstein hasWonPrize Nobel Prize in Physics
Albert Einstein diedOnDate 1955-04-18
Charles K. Kao hasWonPrize Nobel Prize in Physics
Charles K. Kao wasBornIn Shanghai
Faye Wong hasWonPrize MTV Video Music Awards
Faye Wong wasBornIn Beijing
- Store in HDFS
File 1, file name Hash(Subject1):
{"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"}, {"s": "Albert Einstein", "p": "isCalled", "o": "????????"}, {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"}, {"s": "Albert Einstein", "p": "wasBornOnDate", "o": "1879-03-14"}, {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}, {"s": "Albert Einstein", "p": "diedOnDate", "o": "1955-04-18"}
File 2, file name Hash(Subject2):
{"s": "Charles K. Kao", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}, {"s": "Charles K. Kao", "p": "wasBornIn", "o": "Shanghai"}
File 3, file name Hash(Subject3):
{"s": "Faye Wong", "p": "hasWonPrize", "o": "MTV Video Music Awards"}, {"s": "Faye Wong", "p": "wasBornIn", "o": "Beijing"}
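Horizontal partitioning can be sketched as grouping records by a hash of the subject. The code below is an illustration under assumptions: the slides do not say which hash function is used, so MD5 from the Python standard library stands in for Hash(Subject), and the files are plain in-memory lists rather than HDFS files.

```python
import hashlib
from collections import defaultdict

triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]

def subject_file(subject):
    """File name derived from a stable hash of the subject (MD5 is an assumption)."""
    return hashlib.md5(subject.encode("utf-8")).hexdigest()

def partition_horizontally(triples):
    """Put every triple into the file named after the hash of its subject."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[subject_file(s)].append({"s": s, "p": p, "o": o})
    return dict(files)

horizontal_files = partition_horizontally(triples)
for name, records in horizontal_files.items():
    print(name, len(records))
```

A query with a known subject then needs to read only one file, while a query with an unknown subject still has to read all of them.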
23 Vertical Partitioning with JSON
- For example
Albert Einstein isCalled Albert Einstein
Albert Einstein isCalled ????????
Albert Einstein wasBornIn Ulm
Albert Einstein wasBornOnDate 1879-03-14
Albert Einstein hasWonPrize Nobel Prize in Physics
Albert Einstein diedOnDate 1955-04-18
Charles K. Kao hasWonPrize Nobel Prize in Physics
Charles K. Kao wasBornIn Shanghai
Faye Wong hasWonPrize MTV Video Music Awards
Faye Wong wasBornIn Beijing
- Store in HDFS
File 1, file name isCalled:
{"s": "Albert Einstein", "o": "Albert Einstein"}, {"s": "Albert Einstein", "o": "????????"}
File 2, file name wasBornIn:
{"s": "Albert Einstein", "o": "Ulm"}, {"s": "Charles K. Kao", "o": "Shanghai"}, {"s": "Faye Wong", "o": "Beijing"}
File 3, file name wasBornOnDate:
{"s": "Albert Einstein", "o": "1879-03-14"}
File 4, file name hasWonPrize:
{"s": "Albert Einstein", "o": "Nobel Prize in Physics"}, {"s": "Charles K. Kao", "o": "Nobel Prize in Physics"}, {"s": "Faye Wong", "o": "MTV Video Music Awards"}
File 5, file name diedOnDate:
{"s": "Albert Einstein", "o": "1955-04-18"}
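Vertical partitioning groups the triples by predicate, so the predicate moves into the file name and each record keeps only s and o. A minimal Python sketch, with in-memory lists standing in for HDFS files and a hypothetical function name:

```python
from collections import defaultdict

triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]

def partition_vertically(triples):
    """One file per predicate; the predicate itself is not stored in the records."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[p].append({"s": s, "o": o})
    return dict(files)

vertical_files = partition_vertically(triples)
for name in sorted(vertical_files):
    print(name, vertical_files[name])
```

A triple pattern with a known predicate, which is the common case in the example queries, then reads exactly one file.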
24 Clustered property partitioning with JSON
- For example
Albert Einstein isCalled Albert Einstein
Albert Einstein isCalled ????????
Albert Einstein wasBornIn Ulm
Albert Einstein wasBornOnDate 1879-03-14
Albert Einstein hasWonPrize Nobel Prize in Physics
Albert Einstein diedOnDate 1955-04-18
Charles K. Kao hasWonPrize Nobel Prize in Physics
Charles K. Kao wasBornIn Shanghai
Faye Wong hasWonPrize MTV Video Music Awards
Faye Wong wasBornIn Beijing
- Store in HDFS
File 1, file name cluster1:
{"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"}, {"s": "Albert Einstein", "p": "isCalled", "o": "????????"}, {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"}, {"s": "Albert Einstein", "p": "wasBornOnDate", "o": "1879-03-14"}, {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}, {"s": "Albert Einstein", "p": "diedOnDate", "o": "1955-04-18"}
File 2, file name cluster2:
{"s": "Charles K. Kao", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}, {"s": "Charles K. Kao", "p": "wasBornIn", "o": "Shanghai"}, {"s": "Faye Wong", "p": "hasWonPrize", "o": "MTV Video Music Awards"}, {"s": "Faye Wong", "p": "wasBornIn", "o": "Beijing"}
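The slides do not state how the clusters are computed. One simple rule that reproduces the example is to put subjects with an identical set of properties into the same cluster. The Python sketch below uses that rule purely as an assumption; a real system would more likely group similar, not only identical, property sets.

```python
from collections import defaultdict

triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]

def partition_by_property_cluster(triples):
    """Assumed rule: subjects with the same property set share one cluster file."""
    # First pass: collect the property set (signature) of every subject.
    properties = defaultdict(set)
    for s, p, _ in triples:
        properties[s].add(p)
    # Second pass: number the signatures in order of appearance and fill the files.
    cluster_names = {}
    files = defaultdict(list)
    for s, p, o in triples:
        signature = frozenset(properties[s])
        if signature not in cluster_names:
            cluster_names[signature] = "cluster%d" % (len(cluster_names) + 1)
        files[cluster_names[signature]].append({"s": s, "p": p, "o": o})
    return dict(files)

cluster_files = partition_by_property_cluster(triples)
for name in sorted(cluster_files):
    print(name, sorted({r["s"] for r in cluster_files[name]}))
```

With this rule Charles K. Kao and Faye Wong share cluster2 because both have exactly the properties hasWonPrize and wasBornIn.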
25Partition Index: Vertical Partitioning
File 1, file name isCalled:
{"s": "Albert Einstein", "o": "Albert Einstein"}, {"s": "Albert Einstein", "o": "????????"}
File 2, file name wasBornIn:
{"s": "Albert Einstein", "o": "Ulm"}, {"s": "Charles K. Kao", "o": "Shanghai"}, {"s": "Faye Wong", "o": "Beijing"}
File 3, file name wasBornOnDate:
{"s": "Albert Einstein", "o": "1879-03-14"}
File 4, file name hasWonPrize:
{"s": "Albert Einstein", "o": "Nobel Prize in Physics"}, {"s": "Charles K. Kao", "o": "Nobel Prize in Physics"}, {"s": "Faye Wong", "o": "MTV Video Music Awards"}
File 5, file name diedOnDate:
{"s": "Albert Einstein", "o": "1955-04-18"}
Inverted index on s (s -> file list):
Albert Einstein -> isCalled, wasBornIn, wasBornOnDate, hasWonPrize, diedOnDate
Inverted index on o (o -> file list):
Albert Einstein -> isCalled
...
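An inverted index of this kind maps a subject or object value to the list of partition files that contain it. The following Python sketch builds both indexes over vertically partitioned data; `build_inverted_index` is a hypothetical helper, and the files are in-memory lists rather than HDFS files.

```python
from collections import defaultdict

# Vertically partitioned example data: file name (predicate) -> records.
vertical_files = {
    "isCalled": [{"s": "Albert Einstein", "o": "Albert Einstein"}],
    "wasBornIn": [
        {"s": "Albert Einstein", "o": "Ulm"},
        {"s": "Charles K. Kao", "o": "Shanghai"},
        {"s": "Faye Wong", "o": "Beijing"},
    ],
    "wasBornOnDate": [{"s": "Albert Einstein", "o": "1879-03-14"}],
    "hasWonPrize": [
        {"s": "Albert Einstein", "o": "Nobel Prize in Physics"},
        {"s": "Charles K. Kao", "o": "Nobel Prize in Physics"},
        {"s": "Faye Wong", "o": "MTV Video Music Awards"},
    ],
    "diedOnDate": [{"s": "Albert Einstein", "o": "1955-04-18"}],
}

def build_inverted_index(files, field):
    """Map every value of `field` ('s' or 'o') to the files that contain it."""
    index = defaultdict(set)
    for file_name, records in files.items():
        for record in records:
            index[record[field]].add(file_name)
    return {value: sorted(names) for value, names in index.items()}

subject_index = build_inverted_index(vertical_files, "s")
object_index = build_inverted_index(vertical_files, "o")

# A pattern with a known object only needs the files listed for that object.
print(object_index["Nobel Prize in Physics"])
print(subject_index["Charles K. Kao"])
```

The lookup replaces a scan of all files with a read of the listed files only, which is the purpose of the partition index.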
26Partition Index: Horizontal partitioning
File 1, file name Hash(Subject1):
{"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"}, {"s": "Albert Einstein", "p": "isCalled", "o": "????????"}, {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"}, {"s": "Albert Einstein", "p": "wasBornOnDate", "o": "1879-03-14"}, {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}, {"s": "Albert Einstein", "p": "diedOnDate", "o": "1955-04-18"}
File 2, file name Hash(Subject2):
{"s": "Charles K. Kao", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}, {"s": "Charles K. Kao", "p": "wasBornIn", "o": "Shanghai"}
File 3, file name Hash(Subject3):
{"s": "Faye Wong", "p": "hasWonPrize", "o": "MTV Video Music Awards"}, {"s": "Faye Wong", "p": "wasBornIn", "o": "Beijing"}
Inverted index on p (p -> file list):
isCalled -> Hash(Subject1)
Inverted index on o (o -> file list):
Nobel Prize in Physics -> Hash(Subject1), Hash(Subject2)
27Partition Index: Clustered property partitioning
File 1, file name cluster1:
{"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"}, {"s": "Albert Einstein", "p": "isCalled", "o": "????????"}, {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"}, {"s": "Albert Einstein", "p": "wasBornOnDate", "o": "1879-03-14"}, {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}, {"s": "Albert Einstein", "p": "diedOnDate", "o": "1955-04-18"}
File 2, file name cluster2:
{"s": "Charles K. Kao", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}, {"s": "Charles K. Kao", "p": "wasBornIn", "o": "Shanghai"}, {"s": "Faye Wong", "p": "hasWonPrize", "o": "MTV Video Music Awards"}, {"s": "Faye Wong", "p": "wasBornIn", "o": "Beijing"}
Inverted index on s (s -> file list):
Albert Einstein -> cluster1
Charles K. Kao -> cluster2
Faye Wong -> cluster2
Inverted index on p (p -> file list):
isCalled -> cluster1
Inverted index on o (o -> file list):
Albert Einstein -> cluster1
28Outline
- Introduction
- Related work
- SPARQL Query Processing in MapReduce
- Experiments
- Conclusion
29Experiments
- Dataset: Billion Triples Challenge 2010 (BTC10)
- 3.2B <s, p, o, q> quads, 624 GB; the resulting dataset has 1,426,823,976 unique triples
- Hadoop 0.20.2, Ubuntu 10.04, Linux 2.6.32-24-server 64-bit
- 30 nodes: one node is the master and the others are slaves
- 47 GB memory, 4.3 TB disk space and 24 processors (Intel(R) Xeon(R) CPU E5645 @ 2.40GHz)
- dfs.replication is 2
- JAQL version 0.5.1
- Java 1.6
30Experiments
Fig. Distribution of data
31Experiments
Fig. Execution time of each query
32Outline
- Introduction
- Related work
- SPARQL Query Processing in MapReduce
- Experiments
- Conclusion
33Conclusion
- Solution for SPARQL queries in MapReduce
- Transforming the queries to JAQL operators running on Hadoop
- Transformation of SPARQL to JAQL
- Filter, Transform, Join
- Data Partitioning Strategy
- Horizontal partitioning
- Vertical partitioning
- Clustered property partitioning
- Experiments show the performance
- Clustered property partitioning has the best performance
- Horizontal partitioning is the worst one
34Scalability
- RDBMS
- Waits and deadlocks increase nonlinearly with the size of the transactions and concurrency
- Scale-up (vertical scaling): commercial RDBMSes are very expensive
- Schema: structured data
- MapReduce
- Linear scaling, high throughput
- Scale-out (horizontal scaling)
- Schema-free: unstructured data
35RDBMS vs. MapReduce
Table. RDBMS compared to MapReduce
          | Traditional RDBMS         | MapReduce
Data size | Gigabytes                 | Petabytes
Access    | Interactive and batch     | Batch
Updates   | Read and write many times | Write once, read many times
Structure | Static schema             | Dynamic schema
Integrity | High                      | Low
Scaling   | Nonlinear                 | Linear
36Limits of Hadoop
- The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines
- The MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance
37The Next Generation of Apache Hadoop MapReduce
- Divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components
- ResourceManager and ApplicationMaster
- Reliability
- Availability
- Scalability: beyond 10,000 machines
- Backward (and forward) compatibility
- Evolution: customers control upgrades
- Predictable latency
- Cluster utilization
38Conclusion
- Hadoop(MapReduce)
- Pro.
- Scalable
- High throughput
- Con.
- Expense of latency
- No index
- No more than 4000 nodes
- SPARQL on Cloud
- Pro.
- Scalable
- High throughput
- Con.
- Expense of latency
- Complex query: JAQL
- Join operation
40SPARQL query
- Q1: select ?X ?Y where { ?X rdfs:label "Albert Einstein" . ?X smc:page ?Y . ?X rdf:type smc:Subject . }
- Q2: select ?x ?y ?z where { dbsc:Ulm rdf:type ?x . ?x rdfs:label ?y . ?x rdfs:comment ?z . }
- Q3: select ?who ?Y ?date1 ?Z ?date2 ?prize where { ?who source:bornIn ?Y . ?who source:bornOnDate ?date1 . ?who source:diedIn ?Z . ?who source:diedOnDate ?date2 . ?who source:hasWonPrize ?prize . }
- Q4: select ?x ?author ?title where { ?x purl:hasAuthor ?author . ?x purl:hasBooktitle "ISWC 2009" . ?x purl:hasTitle ?title . }
- Q5: select distinct ?name ?lat ?long ?pop where { ?a property:name ?name . ?a property:regoin dbsc:Nord-Pas-de-Calais . ?a pos:lat ?lat . ?a pos:long ?long . ?a property:population ?pop . }
41SPARQL query
- Q6: select ?bn ?b ?p where { ?a property:name ?bn . ?a property:dateOfBirth ?b . ?a property:placeOfBirth ?p . }
- Q7: select ?Y ?type ?prize where { source:Albert_Einstein source:bornIn ?Y . source:Albert_Einstein rdf:type ?type . source:Albert_Einstein source:hasWonPrize ?prize . }
- Q8: select ?a ?type ?pub where { ?a rdf:type ?type . ?a semweb:publisher ?pub . ?a semweb:periodical_title "Theory of Computing Systems" . }
- Q9: select distinct ?a ?lat ?long ?pop where { ?a geoontology:name "Chevilly" . ?a geoontology:inCountry geocountries:FR . ?a pos:lat ?lat . ?a pos:long ?long . ?a geoontology:population ?pop . }
- Q10: select distinct ?l ?long ?lat where { ?a property:placeOfBirth ?l . ?l pos:lat ?lat . ?l pos:long ?long . }
42SPARQL query
- Q3 and Q10 are star join queries with popular predicates and unspecified objects
- Q1, Q4, Q5, Q6, Q8 and Q9 are also star joins, but with one or more known objects
- Q2 is a chain query
- In Q7 the subject is a known constant