Data Warehousing/Mining Comp 150 DW Semistructured Data - PowerPoint PPT Presentation

About This Presentation
Title:

Data Warehousing/Mining Comp 150 DW Semistructured Data

Description:

Where in the database is the string 'Casablanca' to be found? ... 'Find whether 'Allen' acted in 'Casablanca' Need regular expresions to constrain paths ... – PowerPoint PPT presentation

Number of Views:122
Avg rating:3.0/5.0
Slides: 34
Provided by: csTu6
Learn more at: https://www.cs.tufts.edu
Category:

less

Transcript and Presenter's Notes

Title: Data Warehousing/Mining Comp 150 DW Semistructured Data


1
Data Warehousing/MiningComp 150 DW
Semistructured Data
  • Instructor Dan Hebert

2
Semistructured Data
  • Everything that has no rigid schema
  • Schema is contained within the data
    (self-describing), OR
  • No separate schema, OR
  • Schema exists but places only loose constraints
    on data
  • Emerged as an important topic for a variety of
    reasons
  • Many data sources like WWW which we would like to
    treat as databases but cannot for the lack of
    schema
  • Desirable to have an extremely flexible format
    for data exchange between disparate databases
  • May want to view structured data as
    semistructured data for the purpose of browsing

3
Motivation
  • Some data really is unstructured/semistructured
  • World Wide Web,
  • Data exchange formats
  • Some exotic database management systems, e.g.,
    ACeDB, popular with biologists
  • Data integration
  • Browsing

4
Motivation - World Wide Web
  • Why do we want to treat the Web as a database?
  • To maintain integrity
  • To query based on structure (as opposed to
    content)
  • To introduce some organization.
  • But the Web has no structure. The best we can say
    is that it is an enormous graph.

5
Motivation - Data Formats
  • Much (probably most) of the worlds data is in
    data formats
  • These are formats defined for the interchange and
    archiving of data
  • Data formats vary in generality. ASN.1 and XDR
    are quite general
  • Scientific data formats tend to be fixed
    schemas
  • The textual representation given by data formats
    is sometimes not immediately translatable into a
    standard relational/object-oriented representation

6
Motivation - Data Integration
  • Goal is to integrate all types of information,
    including unstructured information
  • Irregular, missing information, structure not
    fully known, dynamic schema evolution, etc.
  • Traditional data models and languages not well
    suited
  • Cannot accommodate heterogeneous data sets
    (different types and structures), etc.
  • Difficult to build software that will easily
    convert between two disparate models
  • OEM (Object Exchange Model)
  • Semistructured data model from TSIMMIS project at
    Stanford
  • Internal data structure for exchange of data
    between DBMSs
  • Used by other systems e.g., Windows 95 registry,
    Lotus Notes

7
Motivation - Browsing
  • To query a database one needs to understand the
    schema.
  • However schemas have opaque terminology and the
    user may want to start by querying the data with
    little or no knowledge of the schema.
  • Where in the database is the string Casablanca
    to be found?
  • Are there integers in the database greater than
    216 ?
  • What objects in the database have an attribute
    name that starts with act?
  • While extensions to relational query languages
    have been proposed for such queries, there is no
    generic technique for interpreting them.

8
The Model
  • Represent data as some kind of graph-like or
    tree-like model
  • Cycles are allowed but usually refer to them as
    trees
  • Several different approaches with minor
    differences (easy to convert)
  • Data on labels or edges, nodes carry information
    or not
  • Straightforward to encode relational and
    object-oriented databases
  • Issue object identity

9
Querying Semistructured Data
  • There are (at least) three approaches to this
    problem
  • Add arbitrary features to SQL or to your favorite
    query language
  • Find some principled approach to programs that
    are based on the type of the data
  • Represent the graph (or whatever the structure
    is) as appropriate predicates and use some
    variety of datalog on that structure

10
The Extend SQL Approach
  • In fact it is an attempt to extend the philosophy
    of OQL and comprehension syntax to these new
    structures
  • It is the approach taken in the design of UnQL
    and also of Lorel
  • Looks very similar to OQL (path expressions)

11
Example
  • select Entry.Movie.Title
  • from DB
  • where Entry.Movie.Director...

12
Syntax Issues
  • Need (path) variables to tie paths and edges
    together
  • Paths of arbitrary length
  • Find all strings in db
  • Find whether Allen acted in Casablanca
  • Need regular expresions to constrain paths
  • Rich set of overloadings for operators to deal
    with comparisons of objects with values and of
    values with sets

13
Underlying Computational Strategy
  • Model graph as a relational database and use
    relational query language.
  • Database large relation (node-id, label, node-id)
  • Used by Stanford group in LORE/LOREL
  • Complications
  • Labels are from heterogeneous set of types, need
    more than one relation
  • Additional relations if info to be stored in
    nodes
  • Various navigation issues

14
Semistructured Data - Case StudyObject Exchange
Model
15
OEM Features
  • Common model for heterogeneous information
    exchange, self-describing
  • Each object

OID
Label
Type
Value
  • OID unique identifier or NULL
  • Label character string descriptor
  • Type atomic data type or set
  • Value atomic value or set of object references
  • Help pages for labels
  • Query language OEM-QL

16
Representing Semistructured Data Using OEM
Label
ltcollection, b1, a1, ...gt b1 ltbook, t, agt
t lttitle, Database and ...gt a
ltauthor, n, pgt n ltname, Jeff Ullmangt p
ltpicture, /gifs/ullman.gifgt a1 ltarticle, v,
w, xgt v ltauthor, Gio Wiederholdgt w lttitle,
Mediators in the gt x ltjournal, IEEE
Computergt
Set Value
Memory Addresses
Atomic Value
...
17
An OEM Query Language OEM-QL
  • Logic-based language for OEM
  • Match object patterns, generate variable
    bindings, construct new OEM objects from existing
    ones
  • Get articles published in IEEE Computer
  • P -
  • Pltarticles ltjournal IEEE Computergtgt
  • Get titles of books by Jeff Ullman
  • ltanswer_title Tgt -
  • ltbook ltauthor Jeff Ullmangt lttitle Tgtgt

18
Semistructured Data - Case StudyWWW Extraction
19
Problem
  • Lots of valuable information on the Web
  • irregular structure
  • highly dynamic
  • Embedded in HTML
  • Limited query facilities

20
Data Extraction Tool
  • Flexible, easy to use
  • Accommodate virtually any HTML source
  • Interface with existing system, e.g., data
    warehouse, user interface for querying

Query
Data Warehouse
World Wide Web
Extractor
WH Integrator
Specification
21
Approach
  • Extract Web data into OEM format
  • Query using OEM-QL
  • Python-based, configurable parser
  • Declarative description of HTML source
  • location of data on page
  • how to package data into OEM
  • Regular expression-like syntax
  • Human intelligence rather than A.I.

22
Extractor Specification
Consists of commands of the form
23
HTML Source File
ltHTMLgt ltHEADgt . . . ltTABLEgt ltTRgt ltTHgtltIgt
header 1 lt/Igtlt/THgt ltTHgtltIgt header 2
lt/Igtlt/THgt ltTHgtltIgt header 3 lt/Igtlt/THgt lt/TRgt
ltTRgt ltTDgt text 1 lt/TDgt ltTDgtltA
HREFhttp//www.stuff/gt text 2 lt/Agtlt/TDgt
ltTDgt text 3 lt/TDgt lt/TRgt . .
. lt/TABLEgt . . . lt/BODYgt lt/HTMLgt
24
Specification File
root, get('http//www.example.test/'),
, __tempvar1, root,
lttablegtlt/tablegt , __tempvar2, split
(__tempvar1,lt/trgt), , rows,
__tempvar21-1, , header1,header2_url
,header2,header3, rows,
lttdgtlt/tdgtltahrefgtlt/agtlttdgtlt/tdgt
25
Result OEM Object
26
Basic SyntaxVariable
  • variable(lpt)
  • optional parameters for specification of
    corresponding OEM object
  • l label name
  • t type
  • p parent object
  • _variable
  • temporary data structure, does not appear as OEM
    object

27
Basic Syntax Source
  • split(variable,token)
  • creates a list with multiple elements using token
    as the element separator
  • get(URL)
  • obtain contents of HTML file at address URL

28
Basic Syntax Patterns
  • token1 token2
  • match and store current input (between tokens)
  • token1 token2
  • match, dont store current input (between tokens)

29
Syntactic Sugar
  • Functions for extracting commonly used HTML
    constructs
  • extract_table(variable),pattern
  • split_table_row(variable)
  • split_table_column(variable)
  • extract_list(variable),pattern
  • split_list(variables)

30
Advanced Features
  • Customization of output
  • structure, label names, data type, ...
  • Extraction across multiple HTML pages
  • Graceful recovery from parse errors
  • resume parsing using next input from source
  • Multiple patterns in single command
  • follow different parse tree depending on
    structure in source

31
Sample Extraction Scenario
. . .
32
Extracted OEM Data
OEM-QL query ltcity C lthigh Hgt lt low Lgtgt -
lttemperature ltcity_temp ltcountry
Germanygt ltcity Cgt lthigh_today Hgt ltlow_today
Lgtgtgt
33
Evaluation
  • Better than
  • writing programs
  • YACC, PERL, etc.
  • A.I.
  • Can do better
  • GUI tool to simplify the generation of extractor
    specification
  • Machine learning or data mining techniques to
    automatically infer structure...
Write a Comment
User Comments (0)
About PowerShow.com