Labeling and Indexing Schemes and Algorithms for the Semantic Web - PowerPoint PPT Presentation

1
Labeling and Indexing Schemes and Algorithms for
the Semantic Web
[1] Stuckenschmidt, H., Vdovjak, R., Houben, G., Broekstra, J. Index Structures and Algorithms for Querying Distributed RDF Repositories
[2] Christophides, V., Plexousakis, D., Scholl, M., Tourtounis, S. On Labeling Schemes for the Semantic Web
Presentation for CSCI 8350 By Scott Patterson
2
The Problem
  • The Semantic Web is distributed by nature;
    therefore:
  • Information may be duplicated at different
    locations
  • The information that is duplicated may be
    expressed in different ways
  • The information desired from a query may be
    stored in fragments

3
The Problem
  • Suppose you would like to search for titles of
    articles written by employees of organizations
    that have projects in the area of RDF
  • Realistically this information is not stored in
    one place. Perhaps there are three RDF stores
    (in this case databases) containing different and
    similar information.

4
The Problem
  • Suppose the first db contains information on
    articles, titles, authors, and their affiliations
  • The second contains information about industrial
    projects, topics and organizations
  • The third contains all the relevant information
    but is a research portal

5
The Problem
  • How can the information we desire be extracted
    and joined in such a way that:
  • We have not lost any information
  • We have not duplicated any information
  • and it is done in a computationally efficient
    manner

6
The Architecture
  • Extend an existing RDF storage and retrieval
    system, Sesame
  • Queries are passed from the query engine to SAIL
    (an RDF API) which abstracts from the repository
  • In a distributed environment, the repositories
    may be implemented in different ways
  • Hence the introduction of another RDF API between
    SAIL and the db

7
The Architecture
  • Now that the heterogeneity of the environment is
    abstracted we are back to the problem
  • How to locate the relevant information, retrieve
    it, and put it together for an answer to our
    query
  • For this another new component is added between
    Sesame and SAIL

8
The Architecture
  • This component is the Mediator SAIL
  • The job of the Mediator is to determine where to
    send which parts of the query and how to optimize
    the query plan

[1]
9
The Solution
  • The solution to this problem has two parts
  • First, to use join indices
  • Second, find an algorithm that can efficiently
    optimize a query plan that consists of joins

10
Join Indices
  • Create additional database tables that contain
    the result of a join over a specific property
  • Rather than computing a join, a system can simply
    access the index
  • The property here is a path

11
Join Indices
  • Because the information that makes up a path may
    be distributed across different stores, it is
    necessary to use an index structure that contains
    information about sub-paths.
  • A source index is used here to determine where
    instances of a path are stored.
  • This determines where to forward the query

12
Join Indices
  • The join index hierarchy is an adaptation of join
    indices
  • The root of the hierarchy is an index table for
    elements in the path p0, p1, ..., pn-1 of
    length n
  • The next level has two paths of length n-1
  • The last level has n paths of length 1

13
Join Indices
  • The hierarchy contains information about every
    possible sub-path
  • These sub-paths may later be combined to answer
    the query

14
Join Indices
  • From our running example we have
  • p0..3 (author, affiliation, carriesOut, topic)
  • p0..2 (author, affiliation, carriesOut),
    p1..3 (affiliation, carriesOut, topic)
  • p0..1 (author, affiliation),
    p1..2 (affiliation, carriesOut),
    p2..3 (carriesOut, topic)
  • p0 (author),
    p1 (affiliation),
    p2 (carriesOut),
    p3 (topic)
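
The levels listed above can be generated mechanically. The following is a minimal sketch, assuming a path is simply a list of property names; the function name `join_index_hierarchy` is made up for illustration and does not come from [1].

```python
def join_index_hierarchy(path):
    """Return the hierarchy levels, longest sub-paths first.

    Level 0 holds the full path, the last level holds the n
    single-property paths. Only contiguous sub-paths are produced.
    """
    n = len(path)
    levels = []
    for length in range(n, 0, -1):
        # A path of length n has n - length + 1 sub-paths of this length.
        level = [tuple(path[i:i + length]) for i in range(n - length + 1)]
        levels.append(level)
    return levels


path = ["author", "affiliation", "carriesOut", "topic"]
levels = join_index_hierarchy(path)
for level in levels:
    print(level)
```

Each tuple would correspond to one index table in the hierarchy; the sketch only enumerates the sub-paths and does not store any join results.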

[1]
15
Remember The Problem
  • Suppose the first db contains information on
    articles, titles, authors, and their affiliations
  • The second contains information about industrial
    projects, topics and organizations
  • The third contains all the relevant information
    but is a research portal

16
Join Indices
  • The time complexity of using the join index
    hierarchy is
  • O(s · n^2), where s is the number of sources and
    n is the length of the path, which is polynomial
    time
  • The length of the path is a significant factor
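
One way to see why the path length enters quadratically (an observation added here, not a derivation given in the slides): a path of length n has n - k + 1 contiguous sub-paths of length k, so the hierarchy contains

```latex
\[
  \sum_{k=1}^{n} (n - k + 1) \;=\; \frac{n(n+1)}{2} \;=\; O(n^2)
\]
```

sub-paths in total. For the running example with n = 4 this gives 1 + 2 + 3 + 4 = 10 sub-paths. Assuming each sub-path is looked up once per source, that is at most s · n(n+1)/2 lookups, which is consistent with the O(s · n^2) bound.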

17
Answering Algorithm
  • The algorithm must do several things
  • Determine all possible combinations of sub-paths
  • For each combination determine the source
    containing the results for the sub-paths
  • Join the results into a single result for the
    complete path

18
Answering Algorithm
  • The algorithm must guarantee that all possible
    combinations of sub-paths have been investigated
  • To do this, a tree-recursion algorithm is used
  • Splitting a complete path into all possible
    combinations of sub-paths and then joining the
    results is not computationally reasonable

19
Answering Algorithm
  • The solution is to use the tree-recursion
    algorithm along with source information from the
    index hierarchy.
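
A minimal sketch of how tree recursion and source information could work together, assuming the source index is a dictionary from sub-path to the set of sources holding instances of it. The names `decompositions`, `db1` and `db2` and the index contents are illustrative assumptions, not taken from [1]; the third store from the running example is left out so that the pruning is visible.

```python
def decompositions(path, source_index):
    """Yield every split of `path` into consecutive sub-paths
    for which the source index knows at least one source."""
    if not path:
        yield []
        return
    for k in range(1, len(path) + 1):
        head = tuple(path[:k])
        sources = source_index.get(head)
        if not sources:
            # No source stores this sub-path: prune the whole branch.
            continue
        for rest in decompositions(path[k:], source_index):
            yield [(head, sorted(sources))] + rest


# Hypothetical source index: db1 knows authors and affiliations,
# db2 knows projects and topics.
source_index = {
    ("author",): {"db1"},
    ("affiliation",): {"db1"},
    ("author", "affiliation"): {"db1"},
    ("carriesOut",): {"db2"},
    ("topic",): {"db2"},
    ("carriesOut", "topic"): {"db2"},
}

path = ["author", "affiliation", "carriesOut", "topic"]
plans = list(decompositions(path, source_index))
for plan in plans:
    print(plan)
```

Splits that contain a sub-path such as (affiliation, carriesOut), which no source in this hypothetical index stores, are never expanded, so the recursion visits only the combinations that can actually be answered.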

[1]
20
Answering Algorithm
  • This is an algorithm for a distributed system, so
    communications costs must be taken into account.
  • Because the system runs over an IP network, the
    communication costs will contribute significantly
    to the overall processing costs

21
Answering Algorithm
  • Data is joined by the Mediator, therefore
    minimization of the data that is transferred is
    important.
  • There may be dependencies which allow those joins
    which do not contribute to the result to be
    pruned.
  • Human interaction is necessary.

22
Answering Algorithm
  • Joins need to be ordered in such a way that the
    overall response time is minimized.
  • This is an NP-hard problem.
  • Evaluating all possible combinations of joins is
    not feasible in practice
  • Therefore a Good Enough algorithm is used

23
Answering Algorithm
  • The objective of the algorithm now is to avoid a
    bad query plan and not to find the optimal query
    plan.
  • In most cases the optimal plan only improves the
    solution marginally.
  • In order to achieve this goal, experience from
    the database community is used

24
Answering Algorithm
  • A two phase strategy is applied
  • Iterative Improvement (II)
  • Simulated Annealing (SA)
  • The II algorithm
  • Randomly generates several solutions
  • These are used as starting points in the
    traversal
  • The traversal is done by applying a series of
    random moves from a predefined set

25
Answering Algorithm
  • The cost is evaluated for each move
  • The best solution is kept in memory
  • In the SA each sub-optimal solution is explored
    further
  • Like the II, random moves are performed
  • Lower cost moves are accepted by the system
  • Unlike the II, higher cost moves can also be
    accepted

26
Answering Algorithm
  • Acceptance of a higher cost move depends on the
    temperature of the system and the cost
    difference
  • Initially the system is hot and easily accepts
    moves which yield a higher cost; as it cools,
    such moves are accepted less often
  • This solution generates a Good Enough query
    plan and guarantees completeness of the result
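
The two-phase strategy can be sketched on a toy join-ordering problem. Everything specific below is an assumption made for illustration: the cost model (sum of intermediate result sizes with a fixed selectivity), the single move type (swap two positions), the cooling schedule, and the relation names and sizes. The slides do not give the move set or cost function used in [1].

```python
import math
import random

SELECTIVITY = 0.01  # assumed constant join selectivity (illustration only)


def plan_cost(order, sizes):
    """Toy cost: sum of intermediate result sizes, joining left to right."""
    intermediate = sizes[order[0]]
    cost = 0.0
    for name in order[1:]:
        intermediate = intermediate * sizes[name] * SELECTIVITY
        cost += intermediate
    return cost


def random_move(order, rng):
    """The only move in this sketch: swap two positions in the join order."""
    i, j = rng.sample(range(len(order)), 2)
    moved = list(order)
    moved[i], moved[j] = moved[j], moved[i]
    return moved


def iterative_improvement(sizes, rng, starts=5, steps=50):
    """Phase 1: several random starts, accept only cheaper neighbours."""
    best = None
    for _ in range(starts):
        current = list(sizes)
        rng.shuffle(current)
        for _ in range(steps):
            candidate = random_move(current, rng)
            if plan_cost(candidate, sizes) < plan_cost(current, sizes):
                current = candidate
        if best is None or plan_cost(current, sizes) < plan_cost(best, sizes):
            best = current
    return best


def simulated_annealing(start, sizes, rng, cooling=0.95, steps=200):
    """Phase 2: may also accept costlier moves while the system is hot."""
    current = best = list(start)
    temperature = max(plan_cost(start, sizes), 1e-9)
    for _ in range(steps):
        candidate = random_move(current, rng)
        delta = plan_cost(candidate, sizes) - plan_cost(current, sizes)
        # Uphill moves pass with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
        if plan_cost(current, sizes) < plan_cost(best, sizes):
            best = current
        temperature *= cooling
    return best


rng = random.Random(0)
sizes = {"articles": 5000, "authors": 2000, "organizations": 300, "projects": 800}
ii_plan = iterative_improvement(sizes, rng)
final_plan = simulated_annealing(ii_plan, sizes, rng)
print(final_plan, plan_cost(final_plan, sizes))
```

Because the annealing phase remembers the best plan seen so far and starts from the II result, the final plan is never worse than the plan from the first phase under this cost model.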

27
Labeling
  • How can we label the information to increase the
    efficiency of subsumption queries?
  • The data may be stored in a database, but we
    should try to visualize it as a tree or graph
  • Labeling scheme should have a minimal complexity

28
Labeling
[2]
29
Labeling
  • Bit-Vector
  • Label is a vector of n bits, where n is the
    number of nodes
  • A 1 bit in some position is used to uniquely
    identify nodes
  • A node inherits bits from its ancestors
  • This allows subsumption checking in constant time
  • The construction of the labels is linear in the
    number of nodes
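
A minimal sketch of bit-vector labeling, assuming the hierarchy is given as a dictionary from each node to its list of parents. Python integers stand in for the bit vectors, and the class names are invented for the example rather than taken from [2].

```python
def bit_vector_labels(parents):
    """Label each node with its own bit plus all bits of its ancestors."""
    own_bit = {node: 1 << i for i, node in enumerate(parents)}
    labels = {}

    def label(node):
        if node not in labels:
            value = own_bit[node]
            for parent in parents[node]:
                value |= label(parent)  # inherit the ancestors' bits
            labels[node] = value
        return labels[node]

    for node in parents:
        label(node)
    return labels


def subsumes(labels, ancestor, descendant):
    """True if every bit of `ancestor` is also set in `descendant`."""
    return (labels[ancestor] & labels[descendant]) == labels[ancestor]


# Hypothetical class hierarchy: node -> list of parents.
parents = {
    "Thing": [],
    "Publication": ["Thing"],
    "Organization": ["Thing"],
    "Article": ["Publication"],
}
labels = bit_vector_labels(parents)
print(subsumes(labels, "Publication", "Article"))
print(subsumes(labels, "Organization", "Article"))
```

The check is one bitwise AND and one comparison. Note that in this sketch a node also subsumes itself, since its label trivially contains its own bits.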

30
Labeling
[2]
31
Labeling
  • Prefix
  • A node is labeled with the parent node's
    identification
  • This allows subsumption checking in constant time
  • The nearest common ancestor (NCA) can also be
    determined in constant time
  • Labels can be created in linear time
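
A minimal sketch of prefix labeling, assuming a tree given as a dictionary from node to its ordered list of children. Labels are tuples of child positions (a Dewey-style encoding chosen for this example; [2] may use a different encoding), so a node's label always starts with its parent's label.

```python
def prefix_labels(children, root):
    """Give each node its parent's label extended by its child position."""
    labels = {root: ()}
    stack = [root]
    while stack:
        node = stack.pop()
        for position, child in enumerate(children.get(node, [])):
            labels[child] = labels[node] + (position,)
            stack.append(child)
    return labels


def subsumes(labels, ancestor, descendant):
    """True if the ancestor's label is a prefix of the descendant's."""
    prefix = labels[ancestor]
    return labels[descendant][:len(prefix)] == prefix


def nca_label(labels, u, v):
    """Label of the nearest common ancestor: the longest common prefix."""
    common = []
    for a, b in zip(labels[u], labels[v]):
        if a != b:
            break
        common.append(a)
    return tuple(common)


# Hypothetical class tree: node -> ordered list of children.
children = {
    "Thing": ["Publication", "Organization"],
    "Publication": ["Article", "Book"],
}
labels = prefix_labels(children, "Thing")
print(labels)
print(nca_label(labels, "Article", "Book") == labels["Publication"])
```

In this sketch both checks compare labels element by element, so their cost grows with the label length, which is bounded by the depth of the tree.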

32
Labeling
[2]
33
Labeling
  • Interval
  • A node is labeled with an interval consisting of
    its preorder and postorder number or some
    variation
  • For node u: start(u) = pre(u), end(u) = post(u)
  • An ancestor node of u is before u in preorder and
    after in postorder
  • v is an ancestor of u iff
    pre(v) < pre(u) and post(v) > post(u)
  • Subsumption checking takes constant time
  • Label construction is linear in the number of
    nodes (variations may be polynomial)
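
A minimal sketch of interval labeling on a tree, assuming the plain preorder/postorder variant described above. The tree and the function names are made up for illustration.

```python
def interval_labels(children, root):
    """Label each node with (preorder number, postorder number)."""
    pre, post = {}, {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre[node] = counters["pre"]      # numbered on the way down
        counters["pre"] += 1
        for child in children.get(node, []):
            visit(child)
        post[node] = counters["post"]    # numbered on the way back up
        counters["post"] += 1

    visit(root)
    return {node: (pre[node], post[node]) for node in pre}


def is_ancestor(labels, v, u):
    """v is an ancestor of u iff pre(v) < pre(u) and post(v) > post(u)."""
    return labels[v][0] < labels[u][0] and labels[v][1] > labels[u][1]


# Hypothetical class tree: node -> ordered list of children.
children = {
    "Thing": ["Publication", "Organization"],
    "Publication": ["Article", "Book"],
}
labels = interval_labels(children, "Thing")
print(labels)
print(is_ancestor(labels, "Publication", "Article"))
```

One depth-first traversal assigns all labels, and the ancestor test is two integer comparisons regardless of the size of the tree.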

34
Labeling
[2]
35
Questions
  • ???????