Storing RDF Data in Hadoop And Retrieval

Date added: 18 May 2018
Slides: 18
Provided by: russoue
Learn more at: http://www.utdallas.edu


1
Storing RDF Data in Hadoop And Retrieval
  • Pankil Doshi
  • Asif Mohammed
  • Mohammad Farhan Husain
  • Dr. Latifur Khan
  • Dr. Bhavani Thuraisingham

2
Goal
  • To build efficient storage using Hadoop for
    petabytes of data
  • To build an efficient query mechanism
  • Possible outcomes
    • Open-source framework for RDF
    • Integration with Jena

3
Possible Approaches
  • Store RDF data in HDFS and query through
    MapReduce programming
    • Our current approach
  • Store RDF data in HDFS and process queries outside
    of Hadoop
    • Done in the BIOMANTA [1] project; no details are
      available, however
  • HBase
    • Currently being worked on by another team in
      the Semantic Web lab

4
Dataset And Queries
  • LUBM [2]
    • Dataset generator
    • 14 benchmark queries
    • Generates data for imaginary universities
    • Used for query execution performance comparison
      by many researchers

5
Our Clusters
  • 4-node cluster in Semantic Web lab
  • 10-node cluster in SAIAL lab
    • 4 GB main memory
    • Intel Pentium IV 3.0 GHz processor
    • 640 GB hard drive
  • OpenCirrus HP Labs test bed
    • Sponsor: Andy Seaborne, HP Labs

6
Tasks Completed/In Progress
  • Set up Hadoop cluster
  • Generate, preprocess, and insert data
  • Devise an algorithm to produce MapReduce code for a
    SPARQL query
  • Code for the 14 queries
  • Cascading the output of one job to another job as
    input without using the hard disk

7
Two Storage Approaches
  • Multiple File Approach
    • Dumping files as generated by the LUBM generator,
      possibly merging some
    • Each line of a file contains subject, predicate, and
      object
  • Predicate Based Approach
    • Dividing files based on predicate
    • File name will be the predicate name
    • Each line then contains only subject and object
    • On average there are about 20 different types
      of predicates

Common preprocessing: adding prefixes, e.g.
http://www.University10Department5.... becomes U10D5.

8
Example Of Predicate Based File Division
Input triples:
    D0U0Graduate20 ub:type lehigh:GraduateStudent
    D0U0Graduate20 ub:memberOf lehigh:University0
File name: type
    D0U0Graduate20 lehigh:GraduateStudent
File name: memberOf
    D0U0Graduate20 lehigh:University0
File name: type_GraduateStudent
    D0U0Graduate20
File name: memberOf_University
    D0U0Graduate20 lehigh:University0
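The file division in this example can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names are hypothetical, and the way the object's class is found (stripping trailing digits, so that University0 gives University) is an assumption made to keep the sketch self-contained; the slides do not say how the class is determined.

```python
from collections import defaultdict


def local_name(term):
    """Drop a namespace prefix such as 'ub:' or 'lehigh:' from a term."""
    return term.split(":", 1)[-1]


def class_of(term):
    """ASSUMPTION: guess the class by stripping trailing digits from the name."""
    return local_name(term).rstrip("0123456789")


def split_by_predicate(triples):
    """Group (subject, predicate, object) triples into per-predicate 'files'."""
    files = defaultdict(list)
    for subject, predicate, obj in triples:
        pred = local_name(predicate)
        # Plain predicate file: only subject and object are stored.
        files[pred].append((subject, obj))
        if pred == "type":
            # The class is in the file name, so only the subject is stored.
            files[f"type_{local_name(obj)}"].append((subject,))
        else:
            # The object's class is appended to the file name.
            files[f"{pred}_{class_of(obj)}"].append((subject, obj))
    return dict(files)


if __name__ == "__main__":
    sample = [
        ("D0U0Graduate20", "ub:type", "lehigh:GraduateStudent"),
        ("D0U0Graduate20", "ub:memberOf", "lehigh:University0"),
    ]
    for name, rows in split_by_predicate(sample).items():
        print(name, rows)
```

Keeping the class in the file name means a job that only needs graduate students can read type_GraduateStudent alone instead of scanning every type triple.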
9
Sample Query
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
    SELECT ?X WHERE {
      ?X rdf:type ub:Publication .
      ?X ub:publicationAuthor D0U0AssistantProfessor0
    }
  • Map function
    • Look at which file (key) the data (value) comes
      from and filter it according to the conditions. For
      example:
      • If the data is from file type_Publication, output
        the pair
      • If the data is from file publicationAuthor_, look
        for D0U0AssistantProfessor0 as object
  • Reduce function
    • Look for all the required values according to the
      conditions and output the key as the result
    • Ex.: Keep those results having both
      ub:Publication and D0U0AssistantProfessor0
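The map and reduce steps above can be simulated locally in plain Python, without Hadoop. This is a minimal sketch under stated assumptions: the slides do not spell out which pair the mapper emits, so the (subject, tag) pairs below are a guess, and run_job, the file name publicationAuthor_AssistantProfessor, and the publication identifiers in the usage example are hypothetical.

```python
from collections import defaultdict

TARGET_AUTHOR = "D0U0AssistantProfessor0"


def map_fn(filename, line):
    """Emit (subject, tag) pairs; the tag records which condition matched."""
    fields = line.split()
    if filename == "type_Publication":
        # Every subject in this file is a publication.
        yield fields[0], "Publication"
    elif filename.startswith("publicationAuthor_") and fields[1] == TARGET_AUTHOR:
        # Keep only rows whose object is the requested author.
        yield fields[0], TARGET_AUTHOR


def reduce_fn(key, values):
    """Output the key only if both conditions were seen for it."""
    if {"Publication", TARGET_AUTHOR} <= set(values):
        yield key


def run_job(records):
    """Local stand-in for the shuffle phase: group map output by key."""
    groups = defaultdict(list)
    for filename, line in records:
        for key, value in map_fn(filename, line):
            groups[key].append(value)
    return sorted(r for key, values in groups.items() for r in reduce_fn(key, values))


if __name__ == "__main__":
    records = [
        ("type_Publication", "D0U0Publication0"),
        ("type_Publication", "D0U0Publication1"),
        ("publicationAuthor_AssistantProfessor", "D0U0Publication0 D0U0AssistantProfessor0"),
        ("publicationAuthor_AssistantProfessor", "D0U0Publication1 D0U0AssistantProfessor1"),
    ]
    print(run_job(records))
```

The join on ?X happens implicitly: because the subject is the map output key, all evidence about one subject reaches the same reducer call.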

10
Algorithm
    SELECT ?X, ?Y WHERE {
      ?X rdf:type ub:Chair .
      ?Y rdf:type ub:Department .
      ?X ub:worksFor ?Y .
      ?Y ub:subOrganizationOf <http://www.University0.edu>
    }

The triple patterns are numbered 1 to 4 in the order listed.

E = 4 (total joins)

Variable   Nodes     Joins
X          1, 3      1-3
Y          2, 3, 4   2-3, 3-4, 4-2

  • Job 1 map output key
    • Y: 2, 3, 4 (3 joins)
  • Job 1 joins: 3
  • 1 join left, so one more job is needed
11
Algorithm (contd.)
Variable   Nodes   Joins
X          A, B    A-B

  • Job 2 map output key
    • X: A, B (1 join)
  • Job 2 joins: 1
  • No joins left, so no more jobs are needed
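One way to read the two slides above is as a greedy plan: in each job, pick the variable that covers the most joins as the map output key, let each node take part in only one join per job, merge the joined nodes, and repeat until no variable is shared. The sketch below implements that reading. It is an interpretation, not the authors' algorithm: plan_jobs and the "2+3+4" labelling of merged nodes are hypothetical (the slides call the merged node and the remaining node A and B).

```python
def plan_jobs(nodes):
    """Greedy job planning.

    nodes maps a node label to the set of variables it contains.
    Returns one list per job of (variable, joined node labels, join count).
    """
    nodes = {label: set(variables) for label, variables in nodes.items()}
    jobs = []
    while True:
        # Find the variables shared by at least two nodes.
        by_var = {}
        for label, variables in nodes.items():
            for var in variables:
                by_var.setdefault(var, []).append(label)
        joinable = {v: labels for v, labels in by_var.items() if len(labels) > 1}
        if not joinable:
            return jobs

        # Prefer the variable covering the most nodes; a node may be
        # keyed by only one variable within a single job.
        used, job = set(), []
        for var in sorted(joinable, key=lambda v: (-len(joinable[v]), v)):
            labels = joinable[var]
            if used.isdisjoint(labels):
                n = len(labels)
                job.append((var, sorted(labels), n * (n - 1) // 2))
                used.update(labels)

        # Merge the joined nodes into one node for the next job.
        for var, labels, _ in job:
            merged = set()
            for label in labels:
                merged |= nodes.pop(label)
            nodes["+".join(labels)] = merged
        jobs.append(job)


if __name__ == "__main__":
    query = {"1": {"X"}, "2": {"Y"}, "3": {"X", "Y"}, "4": {"Y"}}
    for number, job in enumerate(plan_jobs(query), start=1):
        print("Job", number, job)
```

For the example query this yields a first job keyed on Y over nodes 2, 3, 4 (3 joins) and a second job keyed on X (1 join), matching the counts on the slides.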

12
Some Query Results
Horizontal axis: number of triples. Vertical axis:
time in milliseconds.
13
Query Preprocessing
  • Original query 2:
      ?X rdf:type ub:GraduateStudent .
      ?Y rdf:type ub:University .
      ?Z rdf:type ub:Department .
      ?X ub:memberOf ?Z .
      ?Z ub:subOrganizationOf ?Y .
      ?X ub:undergraduateDegreeFrom ?Y
  • Rewritten:
      ?X rdf:type ub:GraduateStudent .
      ?X ub:memberOf_Department ?Z .
      ?Z ub:subOrganizationOf_University ?Y .
      ?X ub:undergraduateDegreeFrom_University ?Y
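The rewriting shown for query 2 appears to follow a simple rule: when a typed variable occurs as the object of another pattern, its class is appended to that pattern's predicate and the separate rdf:type pattern is dropped. The sketch below encodes that inferred rule; it is an assumption drawn from this single example, and rewrite is a hypothetical name.

```python
RDF_TYPE = "rdf:type"


def rewrite(patterns):
    """Fold rdf:type patterns of object variables into predicate names.

    patterns is a list of (subject, predicate, object) strings.
    """
    # Class declared for each variable, e.g. '?Z' -> 'Department'.
    var_class = {
        s: o.split(":", 1)[-1]
        for s, p, o in patterns
        if p == RDF_TYPE and s.startswith("?")
    }
    # Variables that occur in object position of a non-type pattern.
    object_vars = {o for s, p, o in patterns if p != RDF_TYPE and o.startswith("?")}

    rewritten = []
    for s, p, o in patterns:
        if p == RDF_TYPE:
            if s in object_vars:
                continue  # the class now lives in a predicate file name
            rewritten.append((s, p, o))
        elif o in var_class:
            rewritten.append((s, f"{p}_{var_class[o]}", o))
        else:
            rewritten.append((s, p, o))
    return rewritten


if __name__ == "__main__":
    query2 = [
        ("?X", "rdf:type", "ub:GraduateStudent"),
        ("?Y", "rdf:type", "ub:University"),
        ("?Z", "rdf:type", "ub:Department"),
        ("?X", "ub:memberOf", "?Z"),
        ("?Z", "ub:subOrganizationOf", "?Y"),
        ("?X", "ub:undergraduateDegreeFrom", "?Y"),
    ]
    for pattern in rewrite(query2):
        print(" ".join(pattern))
```

Each rewritten predicate names a smaller file, so the query reads fewer triples and needs two fewer joins.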

14
Parallel Experiment with Pig
  • Script for query 2:
      /* Load statements */
      GS = LOAD 'type_GraduateStudent' AS (gs_subject:chararray);
      MO = LOAD 'memberOf_Department' AS
          (mo_subject:chararray, mo_object:chararray);
      SOF = LOAD 'subOrganizationOf_University' AS
          (sof_subject:chararray, sof_object:chararray);
      UDF = LOAD 'undergraduateDegreeFrom_University' AS
          (udf_subject:chararray, udf_object:chararray);
      /* Joins */
      MO_UDF_GS = JOIN GS BY gs_subject, UDF BY udf_subject,
          MO BY mo_subject PARALLEL 8;
      MO_UDF_GS = FOREACH MO_UDF_GS GENERATE
          mo_subject, udf_object, mo_object;
      MO_UDF_GS_SOF = JOIN SOF BY (sof_subject, sof_object),
          MO_UDF_GS BY (mo_object, udf_object);
      MO_UDF_GS_SOF = FOREACH MO_UDF_GS_SOF GENERATE
          mo_subject, udf_object, mo_object;
      /* Store query answer */
      STORE MO_UDF_GS_SOF INTO 'Query2' USING PigStorage('\t');

15
Parallel Experiment with Pig
  • 2 jobs created for query 2
  • For 330 million triples, Pig answers in 20 minutes
  • The direct MapReduce approach takes 10 minutes

16
Future Work
  • Run all 14 queries for 100 million, 200 million, ..., 1
    billion triples and compare with the Jena In-Memory, RDB,
    SDB, and TDB models
  • Cascading the output of one job to another job as
    input without using the hard disk
  • Generic MapReduce code
  • Proof of the algorithm
  • Modification of the algorithm for queries with
    optional triple patterns
  • Indexing, summary statistics

17
References
  • [1] BIOMANTA: http://www.biomanta.org/
  • [2] LUBM: http://swat.cse.lehigh.edu/projects/lubm/