Managing Semi-Structured Data - PowerPoint PPT Presentation

Provided by: davidw8

Transcript and Presenter's Notes

1
Managing Semi-Structured Data
2
Is the web a database?
3
Rules? What Rules?
The web changed the digital information rules.
  • Easy to create web information
  • Cannot all be stored in relational databases
  • Cannot be queried in traditional ways

4
Semi-structured Data
  • Fully structured data: databases, the hidden web
  • Fully unstructured data: ordinary text
  • Semi-structured data: the grey area in between
  • No good solutions: no good software, tools, or
    methodologies to manipulate semi-structured
    data
  • Researchers don't even agree on the shape of
    the problem, much less on good approaches to
    solving it.

5
Nature of the Problem
  • Information embedded in text
  • Keyword search insufficient to answer queries
  • Natural language processing also insufficient
  • Lack of agreement on vocabularies and schemas
  • Reaching schema agreements among different
    communities is one of the most expensive steps in
    software design.
  • We need to be able to process information
    without requiring a priori schema and
    vocabulary agreements among participants.

6
Example: eBay
  • Impossible for developers to define an a
    priori schema for the information.
  • Information stored in raw text and searched
    using only keywords, significantly limiting its
    usability.
  • Some standard entities (e.g., buyer, date, ask,
    bid) are captured, but the meat of the
    information, the item descriptions, has a rich
    and evolving structure that isn't captured.
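
To make the keyword-search limitation concrete, here is a minimal sketch in Python. The listings, the field names, and the regular expression are all invented for illustration and are not taken from eBay. It shows a keyword query missing relevant items and then answers a simple comparison query once one attribute has been pulled out of the description text.

```python
import re

# Hypothetical listings: a fixed field (bid) plus a free-text description.
LISTINGS = [
    {"id": 1, "bid": 250.0, "description": "Laptop, 16GB RAM, 512 GB SSD, lightly used"},
    {"id": 2, "bid": 120.0, "description": "Older laptop with 4 GB RAM and a new battery"},
    {"id": 3, "bid": 300.0, "description": "Notebook computer, memory: 8GB, no charger"},
]


def keyword_search(listings, keyword):
    """Return ids of listings whose description contains the keyword."""
    needle = keyword.lower()
    return [item["id"] for item in listings if needle in item["description"].lower()]


def ram_gb(description):
    """Best-effort extraction of a RAM size in GB; None if nothing is found."""
    # Assumed pattern: '<n> GB RAM' or 'memory: <n> GB'. Real text varies far more.
    match = re.search(r"(\d+)\s*GB\s+RAM|memory:?\s*(\d+)\s*GB", description, re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1) or match.group(2))


def at_least_ram(listings, minimum_gb):
    """Answer a comparison query that plain keyword matching cannot express."""
    result = []
    for item in listings:
        size = ram_gb(item["description"])
        if size is not None and size >= minimum_gb:
            result.append(item["id"])
    return result
```

With this data, `keyword_search(LISTINGS, "8 GB")` finds nothing, and `keyword_search(LISTINGS, "laptop")` misses the listing that says "notebook", while `at_least_ram(LISTINGS, 8)` returns listings 1 and 3. The extraction rule is itself a tiny, brittle schema, which is exactly the tension the slides describe.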

7
Why Schemas?
  • Schemas assign meaning to the data and allow
    automatic data search, comparison, and
    processing.
  • Hierarchy of meaning:
  • Raw text: strings (values)
  • Data: attribute-value pairs
  • Information: data in a conceptual framework
  • Knowledge: information with a degree of certainty
    or community agreement
  • Meaning: knowledge that is relevant or activates
  • We have to learn to use and exploit schemas as
    helpers, but not rely on their existence or allow
    them to be constraining factors.
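
One way to read "schemas as helpers, not constraints" is sketched below in Python. The text format (`key: value; key: value`), the function names, and the optional schema are assumptions made for this example. Raw text is first lifted to attribute-value pairs, and a schema is then applied only where it fits, without rejecting anything it does not know.

```python
def parse_pairs(raw_text):
    """Turn 'key: value; key: value' raw text into attribute-value pairs."""
    pairs = {}
    for chunk in raw_text.split(";"):
        if ":" not in chunk:
            continue  # skip fragments that carry no attribute name
        key, value = chunk.split(":", 1)
        pairs[key.strip().lower()] = value.strip()
    return pairs


# Hypothetical optional schema: attribute name -> converter.
OPTIONAL_SCHEMA = {"year": int, "price": float}


def apply_schema(pairs, schema):
    """Use the schema where it applies; keep everything else untouched."""
    typed = {}
    for key, value in pairs.items():
        converter = schema.get(key)
        if converter is None:
            typed[key] = value  # unknown attribute: kept, not rejected
            continue
        try:
            typed[key] = converter(value)
        except ValueError:
            typed[key] = value  # value does not fit: fall back to raw text
    return typed
```

For the input `"Make: Nikon; Year: 1998; Price: 45.50; Lens mount: F"`, the year and price become numbers, while `make` and `lens mount` pass through as plain strings. A value such as `"circa 1990"` for `year` also survives as text instead of causing a failure.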

8
Schema-Agnostic Tools
Possible Places to Start
  • Information retrieval (sophisticated search
    engines?)
  • Find (maybe?) but not answer
  • No DB-like query logic, updates, transactions
  • XML
  • XML data can exist with or without schemas;
    schemas can be defined before or after
  • Mixed text/data content
  • Languages for query (XQuery) and transformation
    (XSLT)
  • OWL and RDF
  • RDF: subject-predicate-object triples
  • OWL: ontological descriptions, usually over RDF
    triples
  • Classification and inferencing
  • Semantic annotation and tagging
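
Two of these starting points can be sketched briefly in Python. This is an illustration only: it uses the standard library's `xml.etree.ElementTree` rather than XQuery or XSLT, and plain tuples rather than an RDF or OWL toolkit. The sample document, the triples, and the helper names are invented. The first part queries mixed text/data XML with no schema declared; the second matches subject-predicate-object patterns.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with mixed text/data content and no declared schema.
DOC = """<listing>
  <item>Vintage <brand>Nikon</brand> camera from <year>1998</year>, works well.</item>
  <item>Film scanner, <year>2004</year>, sold as is.</item>
</listing>"""


def years_in(xml_text):
    """Collect every <year> value, wherever it appears in the tree."""
    root = ET.fromstring(xml_text)
    return [int(node.text) for node in root.iter("year")]


def full_text(xml_text):
    """Recover the plain text of each <item>, ignoring the embedded markup."""
    root = ET.fromstring(xml_text)
    return ["".join(item.itertext()) for item in root.findall("item")]


# Hypothetical subject-predicate-object triples.
TRIPLES = [
    ("item1", "brand", "Nikon"),
    ("item1", "year", 1998),
    ("item2", "year", 2004),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]
```

Here `years_in(DOC)` yields `[1998, 2004]` even though only some items carry a `<brand>` element, and `match(TRIPLES, predicate="year")` returns both year triples. Neither call needed a schema to be agreed on beforehand.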

9
Are We Stuck?
What's Next?
  • Better information-authoring tools (annotation
    assistance)
  • Information extraction (automatic annotation)
  • Creation and reuse of standard schemas and
    vocabularies (ontology generation)
  • Mapping schemas to each other (schema mapping)
  • Automatic data linking (data linking and merging)
  • Automatic processing of semi-structured data
    (free-form queries)

Florescu (Embley)
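
As a rough illustration of schema mapping followed by data linking and merging, consider the Python sketch below. The two source vocabularies, the shared attribute names, and the linking rule (same brand and same name, ignoring case and spacing) are hypothetical. In practice the mappings and the linking rule are the hard part, and they are assumed here rather than discovered automatically.

```python
# Hypothetical attribute mappings from two source vocabularies to shared names.
MAPPINGS = {
    "shop_a": {"title": "name", "cost": "price", "maker": "brand"},
    "shop_b": {"product": "name", "price_usd": "price", "brand": "brand"},
}


def to_shared(record, source):
    """Rename known attributes; keep unmapped ones under their original name."""
    mapping = MAPPINGS[source]
    return {mapping.get(key, key): value for key, value in record.items()}


def link_key(record):
    """Assumed linking rule: same brand and name, ignoring case and spacing."""
    brand = str(record.get("brand", "")).strip().lower()
    name = " ".join(str(record.get("name", "")).lower().split())
    return (brand, name)


def link_and_merge(records):
    """Group records by link key and merge each group into one record."""
    merged = {}
    for record in records:
        target = merged.setdefault(link_key(record), {})
        for key, value in record.items():
            target.setdefault(key, value)  # first value seen wins
    return list(merged.values())
```

Mapping `{"title": "F100 Camera", "cost": 45.5, "maker": "Nikon"}` from `shop_a` and `{"product": "f100  camera", "brand": "nikon", "weight_g": 785}` from `shop_b` and then calling `link_and_merge` gives a single record that carries the price from the first source and the weight from the second.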
10
Dataspace System
What's beyond a database system?
  • Supports data and applications in a wide variety
    of formats all within a dataspace.
  • Offers an integrated means of searching,
    querying, updating, and administering the
    dataspace.
  • Has varying levels of service (e.g. best-effort
    or approximate answers)
  • Includes tools to create tighter integration of
    the data, as necessary.

Franklin, Halevy, Maier
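
The idea of varying levels of service can be sketched as a toy search over sources in different formats. This Python example is not an implementation of any actual dataspace system; the three sources, the `search` function, and the "exact" and "approximate" labels are assumptions chosen to show one query returning answers of differing quality.

```python
import csv
import io

# Hypothetical sources in three formats, all living in one "dataspace".
STRUCTURED = [{"name": "F100 camera", "brand": "Nikon", "price": 45.5}]
CSV_TEXT = "name,brand\nCoolscan scanner,Nikon\nSpotmatic camera,Pentax\n"
NOTES = ["Selling my old Nikon lens, pickup only.", "Tripod for sale."]


def search(term):
    """Best-effort search: return (source, confidence, item) for each hit."""
    needle = term.lower()
    hits = []
    for row in STRUCTURED:
        # Exact field match in structured data: highest confidence.
        if any(str(value).lower() == needle for value in row.values()):
            hits.append(("structured", "exact", row))
    for row in csv.DictReader(io.StringIO(CSV_TEXT)):
        if any(value.lower() == needle for value in row.values()):
            hits.append(("csv", "exact", dict(row)))
    for note in NOTES:
        # Free text only supports substring matching: an approximate answer.
        if needle in note.lower():
            hits.append(("text", "approximate", note))
    return hits
```

Calling `search("Nikon")` returns one hit from each source, with the free-text hit labelled approximate. Tighter integration, for example extracting a brand attribute from the notes, could later upgrade that answer without changing how the query is asked.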
11
We are still at day one.
We need to find a compromise to the tension
between the advantages of having schemas, in
terms of better understanding and automatically
processing the data, and disadvantages imposed by
schemas, in terms of inflexibility and lack of
evolution.
Florescu