Efficient SPARQL Query Processing in MapReduce through Data Partitioning and Indexing

Author: luwei. Last modified by: nz. Created: 4/15/2009 1:21:00 PM. Presentation format: 4:3. Company: ruc.

Slides: 43
Provided by: luw75
Learn more at: http://iir.ruc.edu.cn
Transcript and Presenter's Notes



1
Efficient SPARQL Query Processing in MapReduce
through Data Partitioning and Indexing
Nie Zhi niezhixuesen_at_163.com
2
Outline
  • Introduction
  • Related work
  • SPARQL Query Processing in MapReduce
  • Experiments
  • Conclusion

4
RDF
  • Resource Description Framework
  • subject-predicate-object expressions (S-P-O)

Figure: example RDF graph under http://www.mpii.de/yago/resource/. The subject (S) Albert Einstein is connected by the predicates (P) isCalled, wasBornIn, and hasWonPrize to the objects (O) Albert Einstein, ????????, Ulm, and Nobel Prize in Physics.
5
SPARQL Query Language for RDF
Query:
    PREFIX source: <http://www.mpii.de/yago/resource/>
    SELECT ?name ?where
    WHERE {
      ?who source:hasWonPrize "Nobel Prize in Physics" .
      ?who source:isCalled ?name .
      ?who source:wasBornIn ?where
    }

Figure: the query matched against the RDF graph of Albert Einstein under http://www.mpii.de/yago/resource/ (predicates isCalled, wasBornIn, hasWonPrize).

Result:
    name            | where
    Albert Einstein | Ulm
    ????????        | Ulm
6
RDF knowledge base
  • Semantic Web, Web 2.0
  • Extract Knowledge from the Web
  • YAGO
  • DBpedia
  • Freebase
  • Billion Triple Challenge

7
RDF knowledge base
295 data sets, 31 billion RDF triples, 504 million
RDF links (September 2011)
8
Challenge and Opportunity
  • Challenge
  • RDF data is growing rapidly; researchers are
    working with billions of triples.
  • Relational databases have limited scalability.
  • Opportunity
  • Google: GFS, MapReduce, BigTable
  • Hadoop: an implementation of the MapReduce framework,
    plus HDFS
  • Achievements: Yahoo!, Amazon, ??, ??, ?? ...
  • We need to build on these recent achievements for
    handling massive-scale Web data on clusters

9
MapReduce: word count
  • file1: the weather is good
  • file2: today is good
  • file3: good weather is good
  • Map(k1, v1) → list(k2, v2)
  • Reduce(k2, list(v2)) → list(k3, v3)

Map output
  • Worker 1: (the 1), (weather 1), (is 1), (good 1)
  • Worker 2: (today 1), (is 1), (good 1)
  • Worker 3: (good 1), (weather 1), (is 1), (good 1)

Reduce input
  • Worker 1: (the 1)
  • Worker 2: (is 1), (is 1), (is 1)
  • Worker 3: (weather 1), (weather 1)
  • Worker 4: (today 1)
  • Worker 5: (good 1), (good 1), (good 1), (good 1)

Reduce output
  • Worker 1: (the 1)
  • Worker 2: (is 3)
  • Worker 3: (weather 2)
  • Worker 4: (today 1)
  • Worker 5: (good 4)
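
The word-count flow on this slide can be imitated in a few lines of single-process Python. This is only a sketch of the map, shuffle, and reduce steps; the function names are illustrative and nothing here uses Hadoop itself.

```python
from collections import defaultdict

# Input files from the slide (contents only).
files = {
    "file1": "the weather is good",
    "file2": "today is good",
    "file3": "good weather is good",
}

def map_phase(text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce(k2, list(v2)) -> list(k3, v3): sum the counts for one word."""
    return (key, sum(values))

map_output = [pair for text in files.values() for pair in map_phase(text)]
reduce_input = shuffle(map_output)
counts = dict(reduce_phase(k, v) for k, v in reduce_input.items())
print(counts)
```

In a real cluster each reduce key group would be handled by some worker; here the grouping dictionary stands in for that assignment.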

10
Outline
  • Introduction
  • Related work
  • SPARQL Query Processing in MapReduce
  • Experiments
  • Conclusion

11
Solution 1
  • Directly map SPARQL onto a sequence of
    MapReduce jobs
  • Pro.
  • Scalable
  • Con.
  • A burden on the user in terms of usage and
    maintenance
  • No support for complex queries
  • No index
  • Ignores the characteristics of RDF data

12
Solution 2
  • Map SPARQL to Pig -> MapReduce jobs
  • Pro.
  • Scalable
  • Supports complex queries
  • Con.
  • No index
  • Ignores the characteristics of RDF data

13
Outline
  • Introduction
  • Related work
  • SPARQL Query Processing in MapReduce
  • Experiments
  • Conclusion

14
Architecture overview
Figure: architecture stack, top to bottom.
  • SPARQL Translator (BGP, Union, Filter, Optional) and RDF 2 JSON Loader
  • JAQL Query Language (Transform, Filter, Join, Sort, Group, Built-in Functions), Optimizer, JSON Data Model
  • Map-Reduce Runtime
  • HDFS
  • Cluster Deployment and Management
15
JSON
  • JSON (JavaScript Object Notation) is a
    lightweight data-interchange format
  • It is based on a subset of the JavaScript
    Programming Language
  • JSON is built on two structures
  • A collection of name/value (Key/value) pairs
  • An ordered list of values (array)

16
RDF to JSON
RDF triple | JSON format
Albert Einstein isCalled Albert Einstein | {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"}
Albert Einstein isCalled ???????? | {"s": "Albert Einstein", "p": "isCalled", "o": "????????"}
Albert Einstein wasBornIn Ulm | {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"}
Albert Einstein wasBornOnDate 1879-03-14 | {"s": "Albert Einstein", "p": "wasBornOnDate", "o": "1879-03-14"}
Albert Einstein hasWonPrize Nobel Prize in Physics | {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}
Albert Einstein diedOnDate 1955-04-18 | {"s": "Albert Einstein", "p": "diedOnDate", "o": "1955-04-18"}
  • JSON is built on two structures
  • name/value (key/value) pairs: {"s": "Albert Einstein"}
  • list of values (array): [{"s": "Albert Einstein", ...}, ...]
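
As a rough illustration of the RDF-to-JSON mapping, the sketch below turns (subject, predicate, object) tuples into records with the keys s, p, and o. The helper name `triple_to_record` is hypothetical; the slides do not show the loader's code.

```python
import json

# A few of the example triples as (subject, predicate, object) tuples.
triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
]

def triple_to_record(triple):
    """Hypothetical helper: one triple becomes one name/value object."""
    s, p, o = triple
    return {"s": s, "p": p, "o": o}

# The whole data set is an array (ordered list) of such objects.
records = [triple_to_record(t) for t in triples]
json_text = json.dumps(records, indent=2)
print(json_text)
```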

17
JAQL
  • JAQL is an open-source language for querying JSON
    (JavaScript Object Notation) data.
  • It provides a general parallel data processing
    platform on Hadoop
  • Developed by IBM

18
Basic Idea
  • SPARQL can be supported on Hadoop by translating
    queries into JAQL operators

  • JAQL operators: Filter, Transform, Join, Group, Sort
  • Built-in functions: merge(d1, d2), regex(), etc.
19
SPARQL to JAQL Transformation
SPARQL query:
    PREFIX source: <http://www.mpii.de/yago/resource/>
    SELECT ?name ?where
    WHERE {
      ?who source:hasWonPrize "Nobel Prize in Physics" .
      ?who source:isCalled ?name .
      ?who source:wasBornIn ?where .
    }
JAQL query:
    // read files from HDFS by predicate name
    (1) read(hdfs('source:hasWonPrize'))
          -> filter $.o == "Nobel Prize in Physics"   // select
          -> transform {$.s}                          // project
    (2) read(hdfs('source:isCalled')) -> transform {$.s, $.o}
    (3) read(hdfs('source:wasBornIn')) -> transform {$.s, $.o}
    // multi-join
    join (1), (2), (3)
      where (1).s == (2).s and (2).s == (3).s
      into {name: (2).o, where: (3).o}                // project to ?name ?where
Figure: execution plan in which steps 1, 2, 3 and the final step 4 run as MapReduce job1 through job4.
Example record: {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"}
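
The filter, transform, and join steps of this translation can be mimicked in plain Python to show what each operator contributes. This is a single-process sketch, not JAQL: the `store` dictionary stands in for predicate-named HDFS files and `read` is a hypothetical helper.

```python
# Predicate-named "files", each holding {s, o} records.
store = {
    "hasWonPrize": [
        {"s": "Albert Einstein", "o": "Nobel Prize in Physics"},
        {"s": "Charles K. Kao", "o": "Nobel Prize in Physics"},
        {"s": "Faye Wong", "o": "MTV Video Music Awards"},
    ],
    "isCalled": [{"s": "Albert Einstein", "o": "Albert Einstein"}],
    "wasBornIn": [
        {"s": "Albert Einstein", "o": "Ulm"},
        {"s": "Charles K. Kao", "o": "Shanghai"},
        {"s": "Faye Wong", "o": "Beijing"},
    ],
}

def read(predicate):
    """Hypothetical stand-in for reading one predicate file."""
    return store[predicate]

# Step 1: filter (select) on the object, then transform (project) to s.
winners = {r["s"] for r in read("hasWonPrize")
           if r["o"] == "Nobel Prize in Physics"}

# Steps 2 and 3: project each file to (s, o) pairs.
names = [(r["s"], r["o"]) for r in read("isCalled")]
places = [(r["s"], r["o"]) for r in read("wasBornIn")]

# Multi-way join on the shared subject, projected to ?name and ?where.
result = [
    {"name": name, "where": place}
    for s2, name in names
    for s3, place in places
    if s2 in winners and s2 == s3
]
print(result)
```

Charles K. Kao satisfies the prize filter but has no isCalled record in this sample, so the join drops him.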

20
Data storage
  • In the Hadoop framework, a file is the smallest
    unit of input that a MapReduce job reads from disk.
  • One straightforward strategy is to
    store all the data in one file
  • Every read operation must then scan the entire data set
  • Hence the need for a data partitioning strategy

21
Data Partitioning Strategy
  • Horizontal partitioning
  • Vertical partitioning
  • Clustered property partitioning

22
Horizontal partitioning with JSON
  • For example
    Albert Einstein isCalled Albert Einstein
    Albert Einstein isCalled ????????
    Albert Einstein wasBornIn Ulm
    Albert Einstein wasBornOnDate 1879-03-14
    Albert Einstein hasWonPrize Nobel Prize in Physics
    Albert Einstein diedOnDate 1955-04-18
    Charles K. Kao hasWonPrize Nobel Prize in Physics
    Charles K. Kao wasBornIn Shanghai
    Faye Wong hasWonPrize MTV Video Music Awards
    Faye Wong wasBornIn Beijing
  • Store in HDFS

File 1, file name Hash(Subject1):
    {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"}
    {"s": "Albert Einstein", "p": "isCalled", "o": "????????"}
    {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"}
    {"s": "Albert Einstein", "p": "wasBornOnDate", "o": "1879-03-14"}
    {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}
    {"s": "Albert Einstein", "p": "diedOnDate", "o": "1955-04-18"}
File 2, file name Hash(Subject2):
    {"s": "Charles K. Kao", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}
    {"s": "Charles K. Kao", "p": "wasBornIn", "o": "Shanghai"}
File 3, file name Hash(Subject3):
    {"s": "Faye Wong", "p": "hasWonPrize", "o": "MTV Video Music Awards"}
    {"s": "Faye Wong", "p": "wasBornIn", "o": "Beijing"}
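
A minimal sketch of horizontal partitioning follows: every triple goes to the file named after a hash of its subject. The slides only say Hash(Subject); the use of MD5 in `file_name` is an assumption made here so that the names are stable across runs.

```python
import hashlib
from collections import defaultdict

triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]

def file_name(subject):
    """Assumed stand-in for Hash(Subject); the slides do not name the hash."""
    return hashlib.md5(subject.encode("utf-8")).hexdigest()

def horizontal_partition(triples):
    """Route each triple to the file of its subject, keeping s, p, and o."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[file_name(s)].append({"s": s, "p": p, "o": o})
    return dict(files)

partitions = horizontal_partition(triples)
for name, records in partitions.items():
    print(name, len(records))
```

All triples of one subject land in the same file, so a star-shaped lookup on a known subject touches a single file.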
23
Vertical Partitioning with JSON
  • For example
    Albert Einstein isCalled Albert Einstein
    Albert Einstein isCalled ????????
    Albert Einstein wasBornIn Ulm
    Albert Einstein wasBornOnDate 1879-03-14
    Albert Einstein hasWonPrize Nobel Prize in Physics
    Albert Einstein diedOnDate 1955-04-18
    Charles K. Kao hasWonPrize Nobel Prize in Physics
    Charles K. Kao wasBornIn Shanghai
    Faye Wong hasWonPrize MTV Video Music Awards
    Faye Wong wasBornIn Beijing
  • Store in HDFS

File 1, file name isCalled:
    {"s": "Albert Einstein", "o": "Albert Einstein"}
    {"s": "Albert Einstein", "o": "????????"}
File 2, file name wasBornIn:
    {"s": "Albert Einstein", "o": "Ulm"}
    {"s": "Charles K. Kao", "o": "Shanghai"}
    {"s": "Faye Wong", "o": "Beijing"}
File 3, file name wasBornOnDate:
    {"s": "Albert Einstein", "o": "1879-03-14"}
File 4, file name hasWonPrize:
    {"s": "Albert Einstein", "o": "Nobel Prize in Physics"}
    {"s": "Charles K. Kao", "o": "Nobel Prize in Physics"}
    {"s": "Faye Wong", "o": "MTV Video Music Awards"}
File 5, file name diedOnDate:
    {"s": "Albert Einstein", "o": "1955-04-18"}
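
Vertical partitioning can be sketched the same way: one file per predicate, with the predicate dropped from the stored records because the file name already carries it. The function name is illustrative only.

```python
from collections import defaultdict

triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]

def vertical_partition(triples):
    """One file per predicate; each record keeps only s and o."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[p].append({"s": s, "o": o})
    return dict(files)

files = vertical_partition(triples)
for name in sorted(files):
    print(name, files[name])
```

A triple pattern with a known predicate then reads exactly one file, which is why the JAQL translation reads files by predicate name.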
24
Clustered property partitioning with JSON
  • For example
    Albert Einstein isCalled Albert Einstein
    Albert Einstein isCalled ????????
    Albert Einstein wasBornIn Ulm
    Albert Einstein wasBornOnDate 1879-03-14
    Albert Einstein hasWonPrize Nobel Prize in Physics
    Albert Einstein diedOnDate 1955-04-18
    Charles K. Kao hasWonPrize Nobel Prize in Physics
    Charles K. Kao wasBornIn Shanghai
    Faye Wong hasWonPrize MTV Video Music Awards
    Faye Wong wasBornIn Beijing
  • Store in HDFS

File 1, file name cluster1:
    {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"}
    {"s": "Albert Einstein", "p": "isCalled", "o": "????????"}
    {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"}
    {"s": "Albert Einstein", "p": "wasBornOnDate", "o": "1879-03-14"}
    {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}
    {"s": "Albert Einstein", "p": "diedOnDate", "o": "1955-04-18"}
File 2, file name cluster2:
    {"s": "Charles K. Kao", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}
    {"s": "Charles K. Kao", "p": "wasBornIn", "o": "Shanghai"}
    {"s": "Faye Wong", "p": "hasWonPrize", "o": "MTV Video Music Awards"}
    {"s": "Faye Wong", "p": "wasBornIn", "o": "Beijing"}
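
The slides do not state how clusters are formed. One possible reading, which happens to reproduce the example above, is to group subjects that use exactly the same set of predicates. The sketch below implements only that assumption; the real clustering criterion may differ.

```python
from collections import defaultdict

triples = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]

def clustered_partition(triples):
    """Assumed criterion: subjects with identical predicate sets share a file."""
    # Collect the predicate set ("signature") of every subject.
    signature = defaultdict(set)
    for s, p, _ in triples:
        signature[s].add(p)
    # One cluster per distinct signature, numbered in first-seen order.
    cluster_of = {}
    files = defaultdict(list)
    for s, p, o in triples:
        key = frozenset(signature[s])
        name = cluster_of.setdefault(key, f"cluster{len(cluster_of) + 1}")
        files[name].append({"s": s, "p": p, "o": o})
    return dict(files)

clusters = clustered_partition(triples)
for name, records in clusters.items():
    print(name, sorted({r["s"] for r in records}))
```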
25
Partition Index: Vertical Partitioning
File 1, file name isCalled:
    {"s": "Albert Einstein", "o": "Albert Einstein"}
    {"s": "Albert Einstein", "o": "????????"}
File 2, file name wasBornIn:
    {"s": "Albert Einstein", "o": "Ulm"}
    {"s": "Charles K. Kao", "o": "Shanghai"}
    {"s": "Faye Wong", "o": "Beijing"}
File 3, file name wasBornOnDate:
    {"s": "Albert Einstein", "o": "1879-03-14"}
File 4, file name hasWonPrize:
    {"s": "Albert Einstein", "o": "Nobel Prize in Physics"}
    {"s": "Charles K. Kao", "o": "Nobel Prize in Physics"}
    {"s": "Faye Wong", "o": "MTV Video Music Awards"}
File 5, file name diedOnDate:
    {"s": "Albert Einstein", "o": "1955-04-18"}

Inverted index on s
    s               | File list
    Albert Einstein | isCalled, wasBornIn, wasBornOnDate, hasWonPrize, diedOnDate

Inverted index on o
    o               | File list
    Albert Einstein | isCalled
    ...             | ...
26
Partition Index: Horizontal Partitioning
File 1, file name Hash(Subject1):
    {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"}
    {"s": "Albert Einstein", "p": "isCalled", "o": "????????"}
    {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"}
    {"s": "Albert Einstein", "p": "wasBornOnDate", "o": "1879-03-14"}
    {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}
    {"s": "Albert Einstein", "p": "diedOnDate", "o": "1955-04-18"}
File 2, file name Hash(Subject2):
    {"s": "Charles K. Kao", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}
    {"s": "Charles K. Kao", "p": "wasBornIn", "o": "Shanghai"}
File 3, file name Hash(Subject3):
    {"s": "Faye Wong", "p": "hasWonPrize", "o": "MTV Video Music Awards"}
    {"s": "Faye Wong", "p": "wasBornIn", "o": "Beijing"}

Inverted index on p
    p        | File list
    isCalled | Hash(Subject1)

Inverted index on o
    o                      | File list
    Nobel Prize in Physics | Hash(Subject1), Hash(Subject2)
27
Partition Index: Clustered Property Partitioning
File 1, file name cluster1:
    {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"}
    {"s": "Albert Einstein", "p": "isCalled", "o": "????????"}
    {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"}
    {"s": "Albert Einstein", "p": "wasBornOnDate", "o": "1879-03-14"}
    {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}
    {"s": "Albert Einstein", "p": "diedOnDate", "o": "1955-04-18"}
File 2, file name cluster2:
    {"s": "Charles K. Kao", "p": "hasWonPrize", "o": "Nobel Prize in Physics"}
    {"s": "Charles K. Kao", "p": "wasBornIn", "o": "Shanghai"}
    {"s": "Faye Wong", "p": "hasWonPrize", "o": "MTV Video Music Awards"}
    {"s": "Faye Wong", "p": "wasBornIn", "o": "Beijing"}

Inverted index on s
    s               | File list
    Albert Einstein | cluster1
    Charles K. Kao  | cluster2
    Faye Wong       | cluster2

Inverted index on p
    p        | File list
    isCalled | cluster1

Inverted index on o
    o               | File list
    Albert Einstein | cluster1
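
The partition indexes on these slides map a subject, predicate, or object value to the list of files that contain it, so a query only has to read those files. The sketch below builds such inverted indexes over a small clustered layout and intersects the file lists for the bound parts of a triple pattern. The names `build_index` and `files_to_scan` are hypothetical, and the intersection step is an assumption about how the indexes would be combined.

```python
from collections import defaultdict

# Clustered layout (abridged): file name -> records.
files = {
    "cluster1": [
        {"s": "Albert Einstein", "p": "isCalled", "o": "Albert Einstein"},
        {"s": "Albert Einstein", "p": "wasBornIn", "o": "Ulm"},
        {"s": "Albert Einstein", "p": "hasWonPrize", "o": "Nobel Prize in Physics"},
    ],
    "cluster2": [
        {"s": "Charles K. Kao", "p": "hasWonPrize", "o": "Nobel Prize in Physics"},
        {"s": "Charles K. Kao", "p": "wasBornIn", "o": "Shanghai"},
        {"s": "Faye Wong", "p": "hasWonPrize", "o": "MTV Video Music Awards"},
        {"s": "Faye Wong", "p": "wasBornIn", "o": "Beijing"},
    ],
}

def build_index(files, field):
    """Inverted index: value of `field` -> set of file names containing it."""
    index = defaultdict(set)
    for name, records in files.items():
        for record in records:
            index[record[field]].add(name)
    return index

index = {field: build_index(files, field) for field in ("s", "p", "o")}

def files_to_scan(pattern):
    """pattern maps bound fields to constants, e.g. {"p": "isCalled"}."""
    candidates = set(files)
    for field, value in pattern.items():
        candidates &= index[field].get(value, set())
    return sorted(candidates)

print(files_to_scan({"p": "isCalled"}))
print(files_to_scan({"p": "hasWonPrize", "o": "Nobel Prize in Physics"}))
```

A pattern with no bound field falls back to scanning every file, which is the situation the index is meant to avoid.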

28
Outline
  • Introduction
  • Related work
  • SPARQL Query Processing in MapReduce
  • Experiments
  • Conclusion

29
Experiments
  • Dataset: Billion Triples Challenge 2010 (BTC10)
  • 3.2B <s, p, o, q> quads, 624 GB; the resulting
    data set has 1,426,823,976 unique triples
  • Hadoop 0.20.2; Ubuntu 10.04; Linux 2.6.32-24-server,
    64-bit
  • 30 nodes: one node is the master, and the others are
    slaves
  • 47 GB memory, 4.3 TB disk space, and 24 processors
    (Intel(R) Xeon(R) CPU E5645 @ 2.40GHz)
  • dfs.replication is 2
  • JAQL 0.5.1
  • Java 1.6

30
Experiments
Fig. Distribution of data
31
Experiments
Fig. Execution time of each query
32
Outline
  • Introduction
  • Related work
  • SPARQL Query Processing in MapReduce
  • Experiments
  • Conclusion

33
Conclusion
  • Solution for SPARQL queries in MapReduce
  • Transforming the queries to JAQL operators
    running on Hadoop.
  • Transformation of SPARQL to JAQL
  • Filter, Transform, Join
  • Data Partitioning Strategy
  • Horizontal partitioning
  • Vertical partitioning
  • Clustered property partitioning
  • Experiments show the performance
  • Clustered property partitioning has the best
    performance
  • Horizontal partitioning performs worst

34
Scalability
  • RDBMS
  • Waits and deadlocks increase nonlinearly
    with transaction size and
    concurrency.
  • Scale-up (vertical scaling): commercial RDBMSes are
    very expensive
  • Schema: structured data
  • MapReduce
  • Linear, high throughput
  • Scale-out (horizontal scaling)
  • Schema-free: unstructured data

35
RDBMS vs. MapReduce
Table. RDBMS compared to MapReduce
          | Traditional RDBMS         | MapReduce
Data size | Gigabytes                 | Petabytes
Access    | Interactive and batch     | Batch
Updates   | Read and write many times | Write once, read many times
Structure | Static schema             | Dynamic schema
Integrity | High                      | Low
Scaling   | Nonlinear                 | Linear
36
Limits of Hadoop
  • The Apache Hadoop MapReduce framework has hit a
    scalability limit around 4,000 machines
  • The MapReduce JobTracker needs a drastic overhaul
    to address several deficiencies in its
    scalability, memory consumption, threading-model,
    reliability and performance

37
The Next Generation of Apache Hadoop MapReduce
  • Divide the two major functions of the JobTracker,
    resource management and job scheduling/monitoring,
    into separate components.
  • ResourceManager and ApplicationMaster

Reliability
Availability
Scalability beyond 10,000 machines
Backward (and Forward) Compatibility
Evolution for customers to control upgrades
Predictable Latency
Cluster utilization
38
Conclusion
  • Hadoop(MapReduce)
  • Pro.
  • Scalable
  • High throughput
  • Con.
  • Expense of latency
  • No index
  • No more than 4000 nodes
  • SPARQL on Cloud
  • Pro.
  • Scalable
  • High throughput
  • Con.
  • Expense of latency
  • Complex query: JAQL
  • Join operation

39
  • Thanks!


40
SPARQL queries
  • Q1: select ?X ?Y where { ?X rdfs:label "Albert Einstein". ?X smc:page ?Y. ?X rdf:type smc:Subject. }
  • Q2: select ?x ?y ?z where { dbsc:Ulm rdf:type ?x. ?x rdfs:label ?y. ?x rdfs:comment ?z. }
  • Q3: select ?who ?Y ?date1 ?Z ?date2 ?prize where { ?who source:bornIn ?Y. ?who source:bornOnDate ?date1. ?who source:diedIn ?Z. ?who source:diedOnDate ?date2. ?who source:hasWonPrize ?prize. }
  • Q4: select ?x ?author ?title where { ?x purl:hasAuthor ?author. ?x purl:hasBooktitle "ISWC 2009". ?x purl:hasTitle ?title. }
  • Q5: select distinct ?name ?lat ?long ?pop where { ?a property:name ?name. ?a property:regoin dbsc:Nord-Pas-de-Calais. ?a pos:lat ?lat. ?a pos:long ?long. ?a property:population ?pop. }
41
SPARQL queries
  • Q6: select ?bn ?b ?p where { ?a property:name ?bn. ?a property:dateOfBirth ?b. ?a property:placeOfBirth ?p. }
  • Q7: select ?Y ?type ?prize where { source:Albert_Einstein source:bornIn ?Y. source:Albert_Einstein rdf:type ?type. source:Albert_Einstein source:hasWonPrize ?prize. }
  • Q8: select ?a ?type ?pub where { ?a rdf:type ?type. ?a semweb:publisher ?pub. ?a semweb:periodical_title "Theory of Computing Systems". }
  • Q9: select distinct ?a ?lat ?long ?pop where { ?a geoontology:name "Chevilly". ?a geoontology:inCountry geocountries:FR. ?a pos:lat ?lat. ?a pos:long ?long. ?a geoontology:population ?pop. }
  • Q10: select distinct ?l ?long ?lat where { ?a property:placeOfBirth ?l. ?l pos:lat ?lat. ?l pos:long ?long. }

42
SPARQL queries
  • Q3 and Q10 are star-join queries with popular
    predicates and unspecified objects.
  • Q1, Q4, Q5, Q6, Q8, and Q9 are also star joins, but
    with one or more known objects.
  • Q2 is a chain query.
  • In Q7 the subject is a fixed value rather than a variable.