Web Data Management

Slides: 123

Provided by: skm8

Learn more at: https://web.mst.edu

1
Web Data Management

Sanjay Kumar Madria
Department of Computer Science
University of Missouri-Rolla
madrias_at_umr.edu

2
WWW

Huge, widely distributed, heterogeneous
collection of semi-structured multimedia
documents in the form of web pages connected via
hyperlinks.

3
World Wide Web

Web is fast growing
More business organizations putting information
in the Web
Business on the highway
Myriad of raw data to be processed for information

4
As the WWW grows, the more chaotic it becomes

Web is fast growing, distributed,
non-administered global information resource
WWW allows access to text, image, video, sound
and graphic data
More business organizations creating web servers
More chaotic environment to locate information of
interest
Lost in hyperspace syndrome

5
Characteristics of WWW

WWW is a set of directed graphs
Data in the WWW has a heterogeneous nature,
self-describing and schema less
Unstructured information, deeply nested
No central authority to manage information
Dynamic versus static information
Web information discoveries - search engines

6
Web is Growing!

In 1994, WWW grew by 1758 !!
June 1993 - 130
June 1994 - 1265
Dec. 1994 - 11,576
April 1995 - 15,768
July 1995 - 23,000
2000 - !!!!!

7
COM domains are increasing!

As of July 1995, 6.64 million host computers on
the Internet
1.74 million are com domains
1.41 million are edu domains
0.30 million are net
0.27 million are gov
0.22 million are mil
0.20 million are org

8
Top web countries

1. Canada (1) 80
2. US (4) 140
3. Ireland (3) 110
4. Iceland (2) 68
5. UK (14) 336
6. Malta (5) 155
7. Australia (6) 133
8. Singapore (11) 207
9. New Zealand (7) 101
10. Sweden (9) 101
11. Israel (12) 112
12. Cyprus (8) 72
13. Hong Kong (15) 148
14. Norway (10) 64
15. Switzerland (13) 75
16. Denmark (16) 105

9
How users find web sites

Indexes and search engines 75
UseNet newsgroups 44
Cool lists 27
New lists 24
Listservers 23
Print ads 21
Word-of-mouth and e-mail 17
Linked web advertisement 4

10
Limitations of Search Engines

Do not exploit hyperlinks
Search is limited to string matching
Queries are evaluated on archived data rather
than up-to-date data; there is no indexing on current data
Low accuracy
Replicated results
No further manipulation possible

11
Limitations of Search Engines

ERROR 404!
No efficient document management
Query results cannot be further manipulated
No efficient means for knowledge discovery

12
More PROBLEMS

specifying/understanding what information is
wanted
the high degree of variability of accessible
information
the variability in conceptual vocabulary or
ontology used to describe information
complexity of querying unstructured data

complexity of querying structured data
uncontrolled nature of web-based information
content
determining which information sources to
search/query

Search Engine Capabilities
Selection of language
Keywords with disjunction, adjacency, presence,
absence, ...
Word stemming (Hotbot)
Similarity search (Excite)
Natural language (LycosPro)
Restrict by modification date (Hotbot) or range
of dates (AltaVista)
Restrict result types (e.g., must include images)
(Hotbot)
Restrict by geographical source (content or
domain) (Hotbot)
Restrict within various structured regions of a
document (titles or URLs) (LycosPro) (summary,
first heading, title, URL) (Opentext)

15
SEARCH RETRIEVAL

Search Engines

Search engine     % of web covered
Hotbot            34
AltaVista         28
Northern Light    20
Excite            14
Infoseek          10
Lycos             3

Using several search engines is better than
using only one
Source: Lawrence, S., and Giles, C.L., Searching
the World Wide Web, Science 280, pp. 98-100,
1998.

16
Key Objectives

Design a suitable data model to represent web
information
Development of web algebra and query language,
query optimization
Maintenance of Web data - view maintenance
Development of knowledge discovery and web mining
tools
Web warehouse
Data integration , secondary storages, indexes

17
Web Data Representation

HTML - Hypertext Markup Language
fixed grammar, no regular expressions
Simple representation of data
good for simple data
difficult to extract information
SGML - Standard Generalized Markup
Language - good for publishing deeply structured
documents
XML - Extensible Markup Language - a subset of SGML

18
Terminology

HTML - Hypertext Mark-up Language
HTTP - Hypertext Transfer Protocol
URL - Uniform Resource Locator
example - <URL> = <protocol>://<host>/<path>/<filename>#<location> where
<protocol> is http, ftp, gopher
<host> is an internet address
<location> is a textual label in the file.

Links are specified as
<A HREF="Destination URL">Anchor Text</A>
Destination URL is the URL of the destination
document and Anchor Text is the text that appears
as an anchor when displayed.
Example
<A HREF="http://www.ntu.edu.sg/">Nanyang
Technological University</A>
Absolute and relative URLs
<A HREF="AtlanticStates/NYStats.html">New
York</A> is a relative address
<A HREF="http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html">NCSA's Beginner's Guide
to HTML</A> is an absolute address
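
The difference between relative and absolute addresses can be sketched with Python's standard urllib.parse module. The base page URL below is a hypothetical one chosen for illustration; only the two href values come from the slide.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical URL of the page that contains the links (an assumption).
base = "http://www.example.org/census/index.html"

# A relative address is resolved against the URL of the containing page.
resolved = urljoin(base, "AtlanticStates/NYStats.html")

# An absolute address already names protocol and host, so the base is ignored.
absolute = urljoin(
    base, "http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html"
)

# Split a URL into protocol, host, path and location (the label after '#').
parts = urlsplit("http://www.example.org/census/index.html#ref2")

print(resolved)
print(absolute)
print(parts.scheme, parts.netloc, parts.path, parts.fragment)
```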

20
World Wide Web

Prevalent, persistent and informative

HTML documents (soon, XML) created by humans or
applications.

Accessed day in and day out by humans and
applications.

Persistent HTML documents!!!

Can database technology help?
21
Current Research Projects

Web Query System
W3QS, WebSQL, AKIRA, NetQL, RAW,
WebLog, Araneus
Semistructured Data Management
LOREL, UnQL, WebOQL, Florid
Website Management System
STRUDEL, Araneus
Web Warehouse
WHOWEDA

22
Main Tasks

Modeling and Querying the Web
view web as directed graph
content and link based queries
example - find the pages that contain the word
clinton and have a link from a page containing
the word monica.

Information Extraction and integration
Wrapper - program that extracts a structured
representation of the data (a set of tuples) from
HTML pages.
Mediator - software that integrates data by
accessing multiple sources through a uniform interface
Web Site Construction and Restructuring
creating sites
modeling the structure of web sites
restructuring data
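
A minimal sketch of a combined content and link query over the web viewed as a directed graph. The three-page web and the function name pages_linked_from are hypothetical; this is not the syntax of any of the query systems listed above.

```python
# Hypothetical three-page web: URL -> (page text, outgoing link URLs).
web = {
    "p1": ("monica in the news", ["p2", "p3"]),
    "p2": ("clinton statement", []),
    "p3": ("weather report", ["p2"]),
}


def pages_linked_from(web, source_word, target_word):
    """Return pages containing target_word that are linked from
    some page containing source_word."""
    hits = set()
    for url, (text, links) in web.items():
        if source_word not in text.split():
            continue  # content condition on the source page fails
        for dest in links:
            dest_text, _ = web.get(dest, ("", []))
            if target_word in dest_text.split():
                hits.add(dest)  # content condition on the destination holds
    return sorted(hits)


result = pages_linked_from(web, "monica", "clinton")
print(result)
```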

24
MEDIATOR ARCHITECTURE
User Interface
Mediator (Query/Search/ Retrieval/Result)
Wrapper
Wrapper
. . .
25
What to Model

Structure of Web sites
Internal structure of web pages
Contents of web sites in finer granularities

26
Data Representation of Web Data

Graph Data Models
Semistructured Data Models (also graph based)

27
Graph Data Model

Labeled graph data model where node represents
web pages and arcs represent links between pages.
Labels on arcs can be viewed as attribute names.
Regular path expression queries
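
A regular path expression query can be sketched as a regular expression matched against the sequence of arc labels along a path. The graph, the labels, and the depth bound below are assumptions made for illustration.

```python
import re

# Hypothetical labeled graph: node -> list of (arc label, destination node).
edges = {
    "index": [("chapter", "ch1"), ("references", "refs")],
    "ch1": [("cites", "refs")],
    "refs": [("article", "paper1")],
    "paper1": [],
}


def reachable(start, path_regex, max_depth=4):
    """Return nodes reachable from start along a path whose dot-joined
    labels match path_regex; max_depth bounds the search on cyclic graphs."""
    pattern = re.compile(path_regex)
    found, frontier = set(), [(start, [])]
    for _ in range(max_depth):
        next_frontier = []
        for node, labels in frontier:
            for label, dest in edges.get(node, []):
                path = labels + [label]
                if pattern.fullmatch(".".join(path)):
                    found.add(dest)
                next_frontier.append((dest, path))
        frontier = next_frontier
    return sorted(found)


# Articles reached either through a chapter citation or directly
# through the references page.
articles = reachable("index", r"(chapter\.cites|references)\.article")
print(articles)
```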

28
Semistructured Data Models

Irregular data structure, no fixed schema known
and may be implicit in the data
Schema may be large and may change frequently
Schema is descriptive rather than prescriptive:
it describes the current state of the data, but
violations of the schema are still tolerated

Data is not strongly typed: for different objects,
the values of the same attribute may be of
differing types (heterogeneous sources)
No restriction on the set of arcs that emanate
from a given node in a graph or on the types of
the values of attributes
Ability to query the schema: arc variables
get bound to labels on arcs, rather than to nodes in
the graph

30
Graph based Query Languages

Use graph to model databases
Support regular path expressions and graph
construction in queries.
Examples
Graph Log for hypertext queries
graph query language for OO

31
Query Languages for Semi-Structured data

Use labeled graphs
Query the schema of data
Ability to accommodate irregularities in the
data, such as missing links etc.
Examples: Lorel (Stanford), UnQL (AT&T), StruQL
(AT&T)

32
Comparison of Query Systems
33
Types of Query Languages

First Generation
Second generation

34
First Generation Query Languages

Combine the content-based queries of search
engines with structure-based queries
Combine conditions on text pattern in documents
with graph pattern describing link structures
Examples - W3QL (TECHNION, Israel)
WebSQL (Toronto), WebLOG (Concordia)

35
Second generation languages

Called web data manipulation languages
Web pages as atomic objects with properties that
they contain or do not contain certain text
patterns and they point to other objects
Useful for data wrapping, transformation, and
restructuring
Useful for web site transformation and
restructuring

36
How they Differ?

Provide access to the structure of web objects
they manipulate - return structure
Model internal structures of web documents as
well as the external links that connect them
Support references to model hyperlinks and some
support to ordered collections of records for
more natural data representation
Ability to create new complex structures as a
result of a query

37
Examples

Web OQL
STRUQL
Florid

38
W3QS (WWW Query System) at Technion - Israel

Content queries
Structural Queries
Interfacing with user written programs and
standard UNIX utilities
Uses existing WWW indexes and search Services
Provides view update facility

39
W3QS

Accessible via any WWW browsers
API can be used by programs running anywhere in
the Internet
Support queries on the web structure by
specifying starting page, a search domain and
depth of links.
File content analysis tools and filling up of
forms automatically

40
File Types

Strict Inner Structure files such as Unix
environment files - Semantics of the data is
clearly linked to the syntax
Semi-structured files - text files containing
formatting codes such as Latex or HTML files-
possible to use formatting codes to analyze their
semantic content
Raw Files - no relation between meaning of file
and its inner structure

41
Content Queries

Queries based on the content of a single node of
hypertext
SQLCOND is used to evaluate boolean expressions
Example - Node.format = Latex and Node.author =
Sanjay

42
Structure Queries

Queries on the information conveyed by the hypertext
organization itself.
The result is a set of nodes and links from the
hypertext structure that satisfy a given graph
pattern (a graph whose nodes and edges are annotated
with conditions).
Components are pattern definition, search engines
and form completion

43
Structure Query
Node2.author = Sanjay
Link1 = revdoc
Node1.title = Good article
Answer: URL http//../myarticles.html
URL http///.tex <Title> Good articles</Title>
\author sanjay <A HREF=//..revdoc>
44
Search for an article

Select cp n2/ result
from n1, l2, n2
where n1 in importantindexes.url
Fill n1.form as IN importantindexes.fil with
Keyword = sanjay SQLCOND (n2.format = Latex)
and (n2.author = sanjay)

45
Query to search hypertext pattern

Return all the articles cited in the first
chapter of the book. Each chapter includes
several pointers to the bibliography, for example
<A HREF="http://cs/references.html#ref2">
Relativity</A> means the link Relativity leads to
the label ref2 in the references.html file.
In the references.html file the labeled link
looks like <A HREF="./relative.tex" name="ref2">
relativity, sanjay</A>; this link points to
relative.tex

Select cp art/ result from Ind,
l1,chap,l2,ref,l3 art where SQLCOND (ind.url
http//) And (chap.url /.chapter-1.html/) AND
l2.HREF /.\13.Name/)
USING BFS.

47
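[Figure: the index page http//cs.tech/bookindex.html (INDEX: Chapter 1, Chapter 2, References) links via l1 to http///Chapter-1.html, which contains the pointers ref 1, ref 2, ref 3; these lead to the references page (ref 1, ref 2, ref 3), which links via l3 to the article http//relative.tex.]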
48
WebSQL-University of Toronto

Model web as relational database
Use two relations Document and Anchor
Document relation has one tuple for each document
in the web and the anchor relation has one tuple
for each anchor in each document
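
A rough sketch of how the two relations could be populated from one page using Python's standard html.parser. The attribute names (url, title, base, href, label) follow the queries on the later slides where they appear; the sample page, the class PageScanner, and the function scan are hypothetical.

```python
from html.parser import HTMLParser
from typing import NamedTuple


class Document(NamedTuple):
    url: str
    title: str


class Anchor(NamedTuple):
    base: str   # URL of the document containing the link
    href: str   # destination URL as written in the page
    label: str  # anchor text


class PageScanner(HTMLParser):
    """Collects the title and all anchors of a single HTML page."""

    def __init__(self, base):
        super().__init__()
        self.base = base
        self.title = ""
        self.anchors = []
        self._in_title = False
        self._href = None
        self._label = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._label = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._href is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            label = "".join(self._label).strip()
            self.anchors.append(Anchor(self.base, self._href, label))
            self._href = None


def scan(url, html):
    """Return one Document tuple and the list of Anchor tuples for a page."""
    scanner = PageScanner(url)
    scanner.feed(html)
    scanner.close()
    return Document(url, scanner.title.strip()), scanner.anchors


# Hypothetical page used only for illustration.
html = (
    "<html><head><title>Papers</title></head><body>"
    '<a href="a.ps.Z">Paper A</a> <a href="b.html">Notes</a>'
    "</body></html>"
)
doc, anchors = scan("http://www.example.org/papers.html", html)

# Selection over the Anchor relation: labels of links to compressed PostScript.
ps_labels = [a.label for a in anchors if ".ps.Z" in a.href]
print(doc, ps_labels)
```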

49
WebSQL

SQL-like query language for extracting
information from the web.
Capable of systematic processing of either all
the links in a page, all the pages that can be
reached from a given URL through paths that match
a pattern, or a combination of both.
Provides transparent access to index servers

50
Document
51
Anchor
52

Give the URLs of pairs of documents that have the same title
and mention the same keyword(s)
Select d1.url, d2.url from
document d1 such that d1 MENTIONS keyword1 and
document d2 such that d2 MENTIONS keyword1
where d1.title = d2.title
and NOT (d1.url = d2.url)

53
Find labels of all hyperlinks to PostScript files
SELECT a.label
FROM Anchor a SUCH THAT base = "http://www.SomeDoc.html"
WHERE a.href CONTAINS ".ps.Z"

54
Documents about Databases
SELECT d.url, d.title
FROM Document d SUCH THAT "http://www.OtherDoc.html" ->|=> d
WHERE d.title CONTAINS "databases"
Note: -> is a path of length one within the same server;
=> is a path of length one to a different server

55
Retrieve all the documents on the same server
that are pointed to from the document whose URL
is given

Select d.url, d.title from
Document d SUCH THAT
"http://www. Cs.in" -> d

56
Find all broken links in a page

SELECT a.href
FROM Anchor a SUCH THAT base = "http://the.document.to.test"
WHERE protocol(a.href) = "http" AND doc(a.href) = null

57
Web OQL (University of Toronto)

Provides a framework that supports a large class
of data
Restructuring operations.
Simple semistructured data model for documents
and record-based data
OQL-like syntax and regular expressions
Serves as a two-way bridge between databases and
the Web.

58
DATA MODEL

Hypertrees are ordered arc-labeled trees with two
types of arcs: internal and external
Internal arcs represent structured objects
External arcs represent references (hyperlinks)
among objects.
Records as labels on the arcs
Sets of related hypertrees as Webs

59
ARCHITECTURE

Wrappers map all data sources to trees
The mapping can be done all at once or on
demand

60
Example

Extract from cspapers (the paper database) the title and
URL of the full version of papers by Smith
select y.title, y.URL
from x in cspapers, y in x
where y.authors ~ "Smith"

61
Web Creation

Create a new page for each research Group (using
the group name as URL). Each page contains the
publications of the corresponding group.
Select x as x.group from x in cspapers
General form: select q1 as s1, q2 as s2, ..., qm as sm
where the qi are queries and each si is either a
string query or the keyword schema. The as clause
creates URLs s1, ..., sm, which are assigned to each new
page resulting from each query.

62
ARANEUS

Data Model called ADM for Web Documents - nested
web objects, page schemas
Several languages for wrapping, querying,
creating and updating web sites - object algebra
Methods and Techniques for Web Site Design and
Implementation
Presentation in SIGMOD99
Software is available at their home site

Wrappers - map logical access to attribute values
in a page at the ADM level to physical access to
text in the HTML source using EDITOR
ULIXES - SQL-like query language
PENELOPE - manipulation language
Site integration, semantic heterogeneities
Materialized views
http://poincare.dia.uniroma3.it:8080/Araneous

64
Lore - motivation

The data may be irregular and thus not conform to
a rigid schema.
Relational data model has null values, and OO
models have inheritance and complex objects. Both
have difficulties in designing schemas to
incorporate irregular data.
It may be difficult to decide in advance on a
single, correct schema. The structure of the data
may evolve rapidly, data elements may change
types, or data not conforming to the previous
structure may be added.

Thus, there is a need for management of
semi-structured data!
Lore system manages semi-structured data. The
data managed by Lore is not confined to a schema
and it may be irregular or incomplete.
OEM is Lore's data model. OEM - Object
Exchange Model - is a graph-based, self-describing
object instance model where nodes are objects,
edges are labeled with attribute names, and leaf
nodes have atomic values
Lore is a lightweight object repository and Lorel
is Lore's query language.

66
Object Exchange Model - OEM

Motivation - information exchange and extraction

Why a new data model? It is not a new model.

Each value exchanged is given an explicit label.
Object <temp-in-Fahrenheit, integer, 80> -
temp-in-Fahrenheit is the label. Each object is
self-describing, with a label, type and value.
<set-of-temps, set, {cmpnt1, cmpnt2}>
cmpnt1 is <temp-in-Fahrenheit, integer, 80>
cmpnt2 is <temp-in-Celsius, integer, 20>

67
Labels

Labels play two roles
identifying an object (component)
identifying the meaning of an object (component)

<person-record, set, {cmpnt1, cmpnt2, cmpnt3}>
cmpnt1 is <person-name, string, Fred>
cmpnt2 is <office-num-in-bldg-5, integer, 333>
cmpnt3 is <department, string, toy>

Person-name both identifies cmpnt1 and conveys its
meaning.

In relational data this corresponds to .

68
Labels - Issues

What does the label mean?
Database of labels
Ontology of labels - within each source

Labels are relative (more specific) to the source
of the data object.
Similar labels from different sources need to be
resolved.

Labels provide the flexibility in representing
object structure

69
Self-describing data models

Have been in existence for a long time. Why the
additional interest now?

Use the nature of self-describing data model
for information exchange, and to extend the model
to include object nesting.
To provide an appropriate object request language
(query facility)

70
OEM - Specification

Each object in OEM has the following structure

Label: A variable-length character string describing
what the object represents.
Type: The data type of the object's value. Each
type is either an atom type or the type set.
Value: A variable-length value of the object.
Object-ID: A unique variable-length identifier
for the object, or null.
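
The four-part object structure can be sketched as a small Python record. The class name OEMObject and the helpers atom and oem_set are hypothetical; the sample values are the temperature objects from the earlier OEM slide.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class OEMObject:
    label: str                    # what the object represents
    type: str                     # an atom type such as "integer", or "set"
    value: Union[int, str, list]  # atomic value, or list of OEMObject for a set
    oid: Optional[str] = None     # unique identifier, or None for null


def atom(label, type_, value):
    """Build an atomic object."""
    return OEMObject(label, type_, value)


def oem_set(label, *components):
    """Build a set object whose value is its list of component objects."""
    return OEMObject(label, "set", list(components))


temps = oem_set(
    "set-of-temps",
    atom("temp-in-Fahrenheit", "integer", 80),
    atom("temp-in-Celsius", "integer", 20),
)
labels = [component.label for component in temps.value]
print(labels)
```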

71
OEM - Summary

OEM is an information exchange model. It does not
specify how objects are stored at source.

OEM does specify how objects are received at a
client, but after objects are received they can
be stored in any way the client likes.

Each source has a distinguished object with
lexical identifier root.

Note the schema-less nature of OEM is
particularly useful when a client does not know
in advance the labels or structure of OEM objects.

Example

<biblio, set, {doc1, doc2, ..., docn}>
  doc1 is <doc, set, {auths1, topic1, call-no1}>
    auths1 is <auth-set, set, {auth11}>
      auth11 is <auth-ln, string, Ullman>
    topic1 is <topic, string, Databases>
    call-no1 is <internal-call-no, integer, 25>
  doc2 is <doc, set, {auths2, topic2, call-no2}>
    auths2 is <auth-set, set, {auth21, auth22, auth23}>
      auth21 is <auth-ln, string, Aho>
      auth22 is <auth-ln, string, Hopcroft>
      auth23 is <auth-ln, string, Ullman>
    topic2 is <topic, string, Algorithms>
    call-no2 is <dewey-decimal, string, BR273>
  docn is <doc, set, {authsn, topicn, call-non}>
    authsn is <auth, string, Crichton>
    topicn is <topic, string, Dinosaurs>
    call-non is <fictional-call-no, integer, 95>
biblio is the root object.

73
OEM - QL

SELECT Fetch-expression
FROM Object
WHERE Condition
The result of this query is itself an object,
with the special label answer
<answer, set, {obj1, obj2, ..., objn}>
Each returned obji is a component of the object
specified in the From clause of the query, where
the component is located by the Fetch-expression
and satisfies the Condition.

74
Path

The notion of path is used in both
Fetch-Expression in the Select clause and the
condition in the Where clause.
Path describes traversals through an object using
subobject structure and labels.
Example: biblio.doc.auth
Paths are used in the Fetch-Expression to specify
which components are returned in the answer
object.
Paths are used in the condition to qualify the
fetched objects or other (related) components in
the same object structure.

75
Queries - Simple

Retrieve the topic of each document for which
Ullman is one of the authors
SELECT biblio.doc.topic
FROM root
WHERE biblio.doc.auth-set.auth-ln = Ullman
Intuitively, the query's where clause finds all
paths through the subobject structure with the
sequence of labels biblio, doc, auth-set, auth-ln
such that the object at the end of the path has
value Ullman.
<answer, set, {obj1, obj2}>
obj1 is <topic, string, Databases>
obj2 is <topic, string, Algorithms>
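
The evaluation of this query can be sketched in a few lines of Python. Objects are modeled as (label, type, value) triples and the helper follow is hypothetical; this illustrates path traversal only and is not the OEM-QL implementation.

```python
# Each OEM object is a (label, type, value) triple; a set's value is a list.
def obj(label, type_, value):
    return (label, type_, value)


biblio = obj("biblio", "set", [
    obj("doc", "set", [
        obj("auth-set", "set", [obj("auth-ln", "string", "Ullman")]),
        obj("topic", "string", "Databases"),
        obj("internal-call-no", "integer", 25),
    ]),
    obj("doc", "set", [
        obj("auth-set", "set", [
            obj("auth-ln", "string", "Aho"),
            obj("auth-ln", "string", "Hopcroft"),
            obj("auth-ln", "string", "Ullman"),
        ]),
        obj("topic", "string", "Algorithms"),
        obj("dewey-decimal", "string", "BR273"),
    ]),
    obj("doc", "set", [
        obj("auth", "string", "Crichton"),
        obj("topic", "string", "Dinosaurs"),
        obj("fictional-call-no", "integer", 95),
    ]),
])


def follow(objects, labels):
    """Return all objects reached from `objects` by following the
    given sequence of subobject labels."""
    for label in labels:
        objects = [child
                   for o in objects if o[1] == "set"
                   for child in o[2] if child[0] == label]
    return objects


# SELECT biblio.doc.topic FROM root
# WHERE biblio.doc.auth-set.auth-ln = Ullman
answer = []
for doc in follow([biblio], ["doc"]):
    authors = follow([doc], ["auth-set", "auth-ln"])
    if any(a[2] == "Ullman" for a in authors):
        answer.extend(follow([doc], ["topic"]))

print(obj("answer", "set", answer))
```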

76
Queries - wild-cards

Retrieve all documents with internal call number
SELECT biblio.?.topic
FROM root
WHERE biblio.?.internal-call-no
The ? label matches any label. For this query,
the doc labels can be replaced by any other
strings and the query would produce the same result.
By convention, two occurrences of ? in the same
query must match the same label unless variables
are used.
<answer, set, {obj1}>
obj1 is <topic, string, Databases>

77
Queries - wild-paths

Retrieve all documents with internal call number
SELECT *.topic
FROM root
WHERE *.internal-call-no
The symbol * matches any path of length one or
more. The use of * followed by a single label is
a convenient and common way to locate objects
with a certain label in a complex structure.
Similar to ?, two occurrences of * in the same
query must match the same sequence of labels,
unless variables are used.
<answer, set, {obj1}>
obj1 is <topic, string, Databases>

78
Queries - variables

Retrieve each document for which both
Hopcroft and Aho are co-authors
SELECT biblio.doc
FROM root
WHERE biblio.doc.auth-set.auth-ln(a1) = Aho and
biblio.doc.auth-set.auth-ln(a2) = Hopcroft
Here, the query finds all the paths with
structure biblio, doc, auth-set, and with two
distinct path completions with label auth-ln and
values Aho and Hopcroft
<answer, set, {obj1}>
obj1 is the complete doc2

79
An OEM Database
[Figure: OEM graph rooted at DBGroup (object 1), with Member and Project edges from the root; Name, Age, Office and Project edges from the members; Building and Room edges from structured offices. Atomic values shown: Clark, Smith, 46, Gates 252, Lore, Tsimmis, Jones, 28, Building CIS with Room 411, Building CIS with Room 252.]
80
Lorel Queries - Simple Path Expression

Retrieve the offices of members with age greater
than 30 years
Query: SELECT DBGroup.Member.Office
       WHERE DBGroup.Member.Age > 30
Result: Office "Gates 252"
        Office
          Building "CIS"
          Room 411

81
Queries - General Path Expression

Query: SELECT DBGroup.Member.Name
       WHERE DBGroup.Member.Office(.Room%|.Cubicle)?
       like "%252"
Result: Name "Jones"
        Name "Smith"
Room% matches all labels starting with Room, like
Room68. | stands for disjunction. ? indicates
that the label pattern is optional. like "%252"
specifies that the data value should end with the
string 252.

82
Queries - SubQueries
Retrieve Lore project members who work on other
projects
Query: SELECT M.Name,
         (SELECT M.Project.Title
          WHERE M.Project.Title != "Lore")
       FROM DBGroup.Member M
       WHERE M.Project.Title = "Lore"
Result: Member
          Name "Jones"
          Title "Tsimmis"
83
Lore - Summary

Lore does facilitate queries and updates on
semi-structured databases
There has been more work done on optimization
using data guides (VLDB 97).
The system is up and running: http://WWW-DB.Stanford.EDU/lore/demo/
How is this related to the WWW?
XML-QL and related work provide the answer.

84
Extraction and Integration

OEM and subsequent LORE(L) can be used for
extracting information from multiple information
sources.
OEM helps navigate through unknown objects by
SELECT ?
FROM root
Thus help browsing and schema discovery
Efficient implementations are possible using
partial fetch mechanism.
Push and Pull information delivery systems are
possible.
How is this different from WebIR?

85
STRUDEL

Web Site Management System
web Site from multiple sources
StruQL - based on OEM, graphs, regular
expressions, result as graph
Example - return all the PostScript papers from
homepages
where homepages(p), p -> paper -> q,
ispostscript(q) collect postscriptpages(p)
General form: where C1, ..., Ck create N1, ..., Nn link L1, ..., Lp
collect G1, ..., Gq

86
Complex Constructors
Supported by Strudel, a Website Management System
with StruQL as query language
where Biblio(X),
      X -> paper -> P, P -> author -> A,
      P -> title -> T, P -> year -> Y
create Root(), HomePage(A), YearPage(A,Y), PubPage(P)
link Root() -> person -> HomePage(A),
     HomePage(A) -> yearentry -> YearPage(A,Y),
     YearPage(A,Y) -> publication -> PubPage(P),
     PubPage(P) -> author -> HomePage(A),
     PubPage(P) -> title -> T
87
WebDB

View WWW as multimedia documents in the form of
web pages
WQL supports selection, aggregation, sorting,
summary, grouping
projection on title , URL, keywords, tables,
forms, images etc.

88
Some More Results

UnQL - ATT
AKIRA- Pennstate
NoDose - SIGMOD98

89
HTML to XML

HTML documents
Emerging Web Standards - XML
XML good for data interchange across platforms
enterprise wide
conversion HTML to XML - IBM, Microsoft

90
XML - Motivation

In HTML, both the tag semantics and tags are
fixed. There is limited and strict interpretation
of tags.
HTML is widely successful in disseminating
documents across internet.
Though data can be disseminated through HTML, its
extraction is painful, and laborious.
EDI has been a predominate mode of exchanging
data among businesses. But it has very rigid
format that requires highly customized
applications.

91
XML - Introduction

XML aims to combine the ease of authoring HTML
documents with the ease of data exchange that is
possible with EDI.
Tags are used to mark up documents.
XML is a meta-language for describing markup
languages.
XML provides a facility to define tags and
structural relationships between them.
No pre-defined tag set implies no preconceived
semantics; the semantics of an XML document are
defined by the applications that process it or by
style sheets (XSL).

92
XML - Goals

Straightforward to use over internet
Support wide variety of applications, authoring,
browsing, content analysis, etc.
Easy to write programs that process XML documents
and validate them.
XML documents must be human-legible and
reasonably clear.
Design of XML shall be formal and concise -
expressed as EBNF (extended Backus Naur Form) -
amenable to modern compiler tools and techniques.

93
XML-features

Some structure - not rigid
Extensibility - User defined tags
nested elements
validation - documents may specify their own
grammar
DTD (Document Type Definition) - schema exists
with data as tag names
Application - EDI - extraction, conversion,
transformation, integration
can be modeled using DOM

94
More terminology

RDF - Resource Description Framework - a method
to describe metadata for XML documents
XSL - Extensible Stylesheet Language - language
for transforming and formatting XML.
Transformation Language - XSLT, XPath, XPointer

95
Example-HTML

Print - Sanjay Madria
Web Warehouse Tutorial, ADBIS99
HTML
<H2> Sanjay Madria </H2>
<I> Web Warehouse Tutorial, ADBIS99</I>
Very difficult to understand, structure is
hidden, describes only appearance

96
XML

<Ref>
<Speaker> <Firstname> Sanjay</Firstname>
<Lastname> Madria</Lastname>
</Speaker>
<Title> Web Warehouse Tutorial</Title>
<Conference> ADBIS99</Conference>
<empty/>
</Ref>
another format
<Firstname Value="Sanjay"/>

97
XML Data

<book>
<title> database systems</title>
<author> John <lastname> Korth</lastname></author>
<price currency="USD"> 5.87</price>
</book>
DTD
<!ELEMENT book (title, author, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA | lastname)*>

<tr> <td width="20" valign="top"> Firma
Karl-Heinz Rosowski </td>
<td width="20" valign="top"> Maikstraße 14 </td>
<td width="20" valign="top"> 22041 Hamburg </td>
<td width="20" valign="top"> 721 99 64 </td>
<td width="20" valign="top"> 21110111 </td>
</tr>

HTML Version

<?xml version="1.0"?>
<Addresses>
<Address id="12359">
<Name>Firma Karl-Heinz Rosowski</Name>
<Street>Maikstraße 14</Street>
<ZIP>22041</ZIP>
<City>Hamburg</City>
<Tel>721 99 64</Tel>
<Fax>21110111</Fax> <Email/>
</Address>
</Addresses>

XML Version
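
Because the XML version names each field, a program can pull the values out by tag name instead of by table position. A minimal sketch using Python's standard xml.etree.ElementTree; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<Addresses>
  <Address id="12359">
    <Name>Firma Karl-Heinz Rosowski</Name>
    <Street>Maikstraße 14</Street>
    <ZIP>22041</ZIP>
    <City>Hamburg</City>
    <Tel>721 99 64</Tel>
    <Fax>21110111</Fax> <Email/>
  </Address>
</Addresses>"""

root = ET.fromstring(xml_text)

# One dictionary per Address: the id attribute plus one entry per child
# element, keyed by tag name. Empty elements such as <Email/> map to "".
records = [
    {"id": address.get("id"),
     **{child.tag: (child.text or "") for child in address}}
    for address in root.findall("Address")
]
print(records[0]["City"], records[0]["ZIP"])
```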
99
XML - Document - Continued

<?xml version="1.0"?> is the XML declaration.
Elements: the most common form of markup, <element>
</element>. For example <name>Jack Lemon </name>
Attributes: name-value pairs that occur
inside start-tags after the element name. For
example <Address id="12359"> attaches value
12359 to attribute id of the Address element.
Entity References: used to handle special characters
of XML, such as &lt; for <, in XML documents.

100

Comments: <!-- this is a comment -->
CDATA Sections: a CDATA (string of characters)
section instructs the parser to ignore most
markup characters. For example, for source code,
<![CDATA[ *p = &q; b = (i <= 3); ]]>. Between
<![CDATA[ and ]]>, all character data is passed to the
application without interpretation.

101
XML - DTD - Element Type Declarations

Element type declarations identify the names of
elements and the nature of their content. A
typical element type declaration looks like
<!ELEMENT Address (Name, Street, ZIP?, City,
Tel, Fax*, Email?)>
Address is the element name, and (Name, Street,
ZIP?, City, Tel, Fax*, Email?) is the content
model. Every address must contain Name, Street,
City and Tel. ZIP and Email are optional, whereas
there can be zero or more Fax numbers.
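
A content model of this flat kind behaves like a regular expression over the sequence of child element names. The sketch below makes that explicit, writing the zero-or-more Fax as Fax*; it handles only comma-separated names with an optional ?, * or + suffix, and the function names are hypothetical. It is not a DTD validator.

```python
import re


def content_model_to_regex(model):
    """Translate a flat sequence content model, for example
    '(Name, Street, ZIP?, City, Tel, Fax*, Email?)', into a regex over
    a string of child element names, each followed by one space."""
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        if item[-1] in "?*+":
            name, suffix = item[:-1], item[-1]
        else:
            name, suffix = item, ""
        parts.append("(?:%s )%s" % (re.escape(name), suffix))
    return re.compile("".join(parts))


def is_valid(children, pattern):
    """Check a list of child element names against the compiled model."""
    return pattern.fullmatch("".join(c + " " for c in children)) is not None


address = content_model_to_regex(
    "(Name, Street, ZIP?, City, Tel, Fax*, Email?)"
)

print(is_valid(["Name", "Street", "City", "Tel"], address))
print(is_valid(["Name", "Street", "ZIP", "City", "Tel", "Fax", "Fax"], address))
print(is_valid(["Name", "Street", "City"], address))
```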

102

The declarations for Name, Street, ZIP, etc. must
also be given. For example
<!ELEMENT Name (#PCDATA)>
Attribute List Declarations identify which
elements may have attributes, what values the
attributes may hold, and what value is the default.
Attribute values appear only within start-tags
and empty-element tags.
<Address id="12359">

103
XML - Summary

HTML describes presentation
XML describes content
XML vs. HTML
users define new tags
arbitrary nesting
validation is possible

104
XML and Semi Structural Data Model

XML data is fundamentally different than
relational and object oriented data.
XML is not rigidly structured.
In relational and OO data model every data
instance has a schema which is separate and
independent of the data.
XML data is self describing and can naturally
model irregularities that cannot be modeled by
relational or OO data model.

105

For example, data items may have missing elements
or multiple occurrences of the same element;
elements may have atomic values in some data
items and structured values in others; and
collections of elements can have heterogeneous
structure.
Even XML data that has an associated DTD is
self-describing (the schema is always stored
with the data) and, except for very restricted
forms of DTDs, may have all the irregularities
described above.
XML is an instance of semistructured data.

106
XML-QL

Regular path expression
pattern matching
uses edge-labeled graphs
extract data from existing XML documents and
construct new XML documents
support for ordered and unordered views on XML
document
simple and declarative

107
XML-QL

The simplest XML-QL queries extract data from an
XML document. Consider the following DTD
<!ELEMENT book (author+, title, publisher)>
<!ATTLIST book year CDATA>
<!ELEMENT article (author+, title, year?,
(shortversion|longversion))>
<!ATTLIST article type CDATA>
<!ELEMENT publisher (name, address)>
<!ELEMENT author (firstname?, lastname)>

108
XML-QL Example Data
<bib>
<book year="1995">
<title> An Introduction to DB Systems </title>
<author> <lastname> Date </lastname></author>
<publisher> <name> Addison-Wesley</name> </publisher>
</book>
<book year="1995">
<title> Foundations for OR Databases </title>
<author> <lastname> Date </lastname></author>
<author> <lastname> Darwen </lastname></author>
<publisher><name> Addison-Wesley</name> </publisher>
</book>
</bib>
109
Matching Data Using Patterns

XML-QL uses element patterns to match data in an
XML document.
Find all authors of books whose publisher is
Addison-Wesley in the XML document www.a.b.c/bib.xml
WHERE <book>
        <publisher><name>Addison-Wesley</name></publisher>
        <title> $t </title>
        <author> $a </author>
      </book> IN "www.a.b.c/bib.xml"
CONSTRUCT $a
The query matches every <book> element in the XML
document that has at least one <title> element, one
<author> element, and one <publisher> element
whose <name> is Addison-Wesley. For each such
match it binds $t and $a to every title and
author pair.
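
The matching behaviour described above can be imitated outside XML-QL. The following is a minimal Python sketch, assuming the bibliography data from the example slide is available as a string; the function name authors_published_by is hypothetical, and the standard xml.etree.ElementTree module stands in for an XML-QL engine, so it only approximates the binding semantics.

```python
import xml.etree.ElementTree as ET

# Sample data modeled on the bibliography from the example slide.
BIB_XML = """
<bib>
  <book year="1995">
    <title>An Introduction to DB Systems</title>
    <author><lastname>Date</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
  <book year="1995">
    <title>Foundations for OR Databases</title>
    <author><lastname>Date</lastname></author>
    <author><lastname>Darwen</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
</bib>
"""


def authors_published_by(root, publisher_name):
    """Return one <author> element per (title, author) binding (hypothetical helper)."""
    results = []
    for book in root.iter("book"):
        names = [(n.text or "").strip() for n in book.findall("publisher/name")]
        if publisher_name not in names:
            continue  # the <publisher><name> pattern did not match
        for _title in book.findall("title"):      # plays the role of $t
            for author in book.findall("author"):  # plays the role of $a
                results.append(author)             # CONSTRUCT $a
    return results


root = ET.fromstring(BIB_XML)
matched = authors_published_by(root, "Addison-Wesley")
last_names = [a.findtext("lastname").strip() for a in matched]
print(last_names)  # ['Date', 'Date', 'Darwen']
```

The nested loops make the "every title and author pair" behaviour explicit: a book with two authors contributes two bindings.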

110
XML-QL Constructing XML Data

Often we would like to format the result.
Find all authors and titles of books whose
publisher is Addison-Wesley in the XML document
www.a.b.c/bib.xml
WHERE <book>
        <publisher><name>Addison-Wesley</></>
        <title> $t </title>
        <author> $a </author>
      </book> IN "www.a.b.c/bib.xml"
CONSTRUCT <result>
            <author> $a </>
            <title> $t </>
          </>

111
Constructing XML Data -cont.
Result of the query
<result>
  <author> <lastname> Date </lastname> </author>
  <title> An Introduction to DB Systems </title>
</result>
<result>
  <author> <lastname> Date </lastname> </author>
  <title> Foundations for OR Databases </title>
</result>
<result>
  <author> <lastname> Darwen </lastname> </author>
  <title> Foundations for OR Databases </title>
</result>
One result for each author, duplicating title
information.
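
The construction step can be sketched in Python as follows. This is an illustration under assumptions, not an XML-QL implementation: the function construct_results and the wrapping <results> element are hypothetical, and the data is the two-book bibliography from the example slide.

```python
import copy
import xml.etree.ElementTree as ET

BIB_XML = """
<bib>
  <book year="1995">
    <title>An Introduction to DB Systems</title>
    <author><lastname>Date</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
  <book year="1995">
    <title>Foundations for OR Databases</title>
    <author><lastname>Date</lastname></author>
    <author><lastname>Darwen</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
</bib>
"""


def construct_results(root, publisher_name):
    """Build one <result> per (author, title) binding (hypothetical helper)."""
    out = ET.Element("results")  # wrapper element, an assumption of this sketch
    for book in root.iter("book"):
        name = book.findtext("publisher/name", default="").strip()
        if name != publisher_name:
            continue
        for title in book.findall("title"):
            for author in book.findall("author"):
                result = ET.SubElement(out, "result")
                # Copy the bound elements so the source tree is left untouched.
                result.append(copy.deepcopy(author))
                result.append(copy.deepcopy(title))
    return out


results = construct_results(ET.fromstring(BIB_XML), "Addison-Wesley")
pairs = [
    (r.findtext("author/lastname").strip(), r.findtext("title").strip())
    for r in results
]
print(pairs)
```

Each binding yields its own <result>, which is why the title of a two-author book appears twice.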
112
XML-QL Nested Queries.
WHERE <book>
        <title> $t </>
        <publisher><name>Addison-Wesley</></>
      </> CONTENT_AS $p IN "www.a.b.c/bib.xml"
CONSTRUCT <result>
            <title> $t </>
            WHERE <author> $a </> IN $p
            CONSTRUCT <author> $a </>
          </>
Result of the query
<result>
  <title> An Introduction to DB Systems </title>
  <author> <lastname> Date </lastname> </author>
</result>
<result>
  <title> Foundations for OR Databases </title>
  <author> <lastname> Date </lastname> </author>
  <author> <lastname> Darwen </lastname> </author>
</result>
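
The effect of nesting, one result per book with all of its authors grouped under the title, can be sketched in Python. The function group_authors_by_title is a hypothetical name, and the sketch assumes the same two-book bibliography used in the earlier examples.

```python
import copy
import xml.etree.ElementTree as ET

BIB_XML = """
<bib>
  <book year="1995">
    <title>An Introduction to DB Systems</title>
    <author><lastname>Date</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
  <book year="1995">
    <title>Foundations for OR Databases</title>
    <author><lastname>Date</lastname></author>
    <author><lastname>Darwen</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
</bib>
"""


def group_authors_by_title(root, publisher_name):
    """Build one <result> per matching title, holding all authors of that book."""
    results = []
    for book in root.iter("book"):  # the book content plays the role of $p
        name = book.findtext("publisher/name", default="").strip()
        if name != publisher_name:
            continue
        for title in book.findall("title"):
            result = ET.Element("result")
            result.append(copy.deepcopy(title))
            # Inner query: WHERE <author> $a </> IN $p CONSTRUCT <author> $a </>
            for author in book.findall("author"):
                result.append(copy.deepcopy(author))
            results.append(result)
    return results


grouped = [
    (r.findtext("title").strip(),
     [a.findtext("lastname").strip() for a in r.findall("author")])
    for r in group_authors_by_title(ET.fromstring(BIB_XML), "Addison-Wesley")
]
print(grouped)
```

Compared with the flat query, the title is no longer duplicated for each author.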
113
XML-QL Join Queries
XML-QL queries can express joins by matching two
or more elements that contain the same value.
Find all articles that have at least one author
who has written a book since 1995.
WHERE <article>
        <author>
          <firstname> $f </>  // firstname $f
          <lastname> $l </>   // lastname $l
        </>
      </> CONTENT_AS $a IN "www.a.b.c/bib.xml",
      <book year=$y>
        <author>
          <firstname> $f </>  // join on same firstname $f
          <lastname> $l </>   // join on same lastname $l
        </>
      </> IN "www.a.b.c/bib.xml",
      $y > 1995
CONSTRUCT <article> $a </>
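
A value-based join of this kind can be sketched in Python as a set lookup on (firstname, lastname) pairs. The sample data, names, and function names below are hypothetical and chosen only to show one article that joins and one that does not. Unlike XML-QL, which yields one result per binding, this sketch returns each matching article once.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: only John Smith has a book with year > 1995.
BIB_XML = """
<bib>
  <article type="journal">
    <author><firstname>John</firstname><lastname>Smith</lastname></author>
    <title>Article A</title>
  </article>
  <article type="journal">
    <author><firstname>Jane</firstname><lastname>Doe</lastname></author>
    <title>Article B</title>
  </article>
  <book year="1998">
    <author><firstname>John</firstname><lastname>Smith</lastname></author>
    <title>Book X</title>
  </book>
  <book year="1990">
    <author><firstname>Jane</firstname><lastname>Doe</lastname></author>
    <title>Book Y</title>
  </book>
</bib>
"""


def author_key(author):
    """Return (firstname, lastname), or None if either element is missing."""
    first = author.findtext("firstname")
    last = author.findtext("lastname")
    if first is None or last is None:
        return None  # the pattern requires both elements to be present
    return (first.strip(), last.strip())


def articles_by_recent_book_authors(root, after_year):
    """Articles with at least one author who also wrote a book with year > after_year."""
    recent_authors = set()
    for book in root.iter("book"):
        year = book.get("year", "")
        if not year.isdigit() or int(year) <= after_year:
            continue  # corresponds to the condition $y > 1995
        for author in book.findall("author"):
            key = author_key(author)
            if key is not None:
                recent_authors.add(key)
    return [
        article for article in root.iter("article")
        if any(author_key(a) in recent_authors for a in article.findall("author"))
    ]


joined = articles_by_recent_book_authors(ET.fromstring(BIB_XML), 1995)
titles = [a.findtext("title") for a in joined]
print(titles)  # ['Article A']
```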
114
XML-QL Data Model for XML

An XML graph is a graph G in which each node is
represented by a unique string called an object
identifier (OID), G's edges are labeled with
element tags, G's nodes are labeled with sets of
attribute-value pairs, G's leaves are labeled
with one string value, and G has a distinguished
node called the root.

115
XML-QL Data Model for XML

The model allows several edges between the same
two nodes, with the following restrictions:
between any two nodes there can be at most one
edge with a given label
a node cannot have two leaf children with the
same label and same string value
XML graphs are not only derived from XML
documents, but are also generated by queries.
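
The data model and its two restrictions can be sketched as a small Python class. This is an illustrative structure only, assuming OIDs are plain strings; the class name XMLGraph, its method names, and the sample OIDs are all hypothetical.

```python
class XMLGraph:
    """Edge-labeled graph: nodes keyed by OID, edges labeled with element tags."""

    def __init__(self, root_oid):
        self.root = root_oid                # distinguished root node
        self.attributes = {root_oid: {}}    # OID -> attribute-value pairs
        self.leaf_values = {}               # OID -> string value (leaves only)
        self.edges = set()                  # (source OID, label, target OID)

    def add_node(self, oid, attributes=None, value=None):
        if oid in self.attributes:
            raise ValueError(f"duplicate OID: {oid}")
        self.attributes[oid] = dict(attributes or {})
        if value is not None:
            self.leaf_values[oid] = value

    def add_edge(self, source, label, target):
        if source not in self.attributes or target not in self.attributes:
            raise KeyError("both endpoints must be existing nodes")
        # Restriction 1: at most one edge with a given label between two nodes.
        if (source, label, target) in self.edges:
            raise ValueError("duplicate labeled edge between the same nodes")
        # Restriction 2: no two leaf children with the same label and value.
        if target in self.leaf_values:
            for s, l, t in self.edges:
                if (s == source and l == label
                        and self.leaf_values.get(t) == self.leaf_values[target]):
                    raise ValueError("duplicate leaf child with same label and value")
        self.edges.add((source, label, target))


def is_rejected(graph, source, label, target):
    """Return True if the graph refuses the edge (helper for the demo below)."""
    try:
        graph.add_edge(source, label, target)
    except ValueError:
        return True
    return False


g = XMLGraph("o1")
g.add_node("o2", {"year": "1995"})
g.add_node("o3", value="Date")
g.add_node("o4", value="Date")
g.add_edge("o1", "book", "o2")
g.add_edge("o1", "entry", "o2")  # allowed: same nodes, different label
g.add_edge("o2", "lastname", "o3")
duplicate_edge_rejected = is_rejected(g, "o1", "book", "o2")
duplicate_leaf_rejected = is_rejected(g, "o2", "lastname", "o4")
print(duplicate_edge_rejected, duplicate_leaf_rejected, len(g.edges))
```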

116
XML- Element Identity, Ids, and IDREFS

For element sharing XML reserves an attribute of
type ID which allows a unique key to be
associated with an element.
An attribute of type IDREF allows an element to
refer to another element with the designated key,
and one of the type IDREFS may refer to multiple
elements.

117

<!ATTLIST person ID ID #REQUIRED>
<!ATTLIST article author IDREFS #IMPLIED>
<person ID="o123">
  <firstname>John</firstname>
  <lastname>Smith</lastname>
</person>
<person ID="o234">
  . . .
</person>
<article author="o123 o234">
  <title> ... </title>
  <year> 1995 </year>
</article>

118
XML- Element Identity, Ids, and IDREFS
119
The following query produces all lastname, title
pairs by joining the article element's author
attribute (of type IDREFS) with the person
element's ID attribute value.
WHERE <article author=$i>
        <title> </> ELEMENT_AS $t
      </>,
      <person ID=$i>
        <lastname> </> ELEMENT_AS $l
      </>
CONSTRUCT <result> $t $l </>
The idiom <title></> ELEMENT_AS $t binds $t to a
<title> element with arbitrary contents. The
element expression <title/> matches a <title>
element with empty contents.
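
The ID/IDREFS join can be sketched in Python by indexing person elements on their ID attribute and splitting the whitespace-separated IDREFS value. The second person's name, the article title, the <db> wrapper, and the function name lastname_title_pairs are hypothetical fill-ins for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical data in the shape of the person/article example.
DATA_XML = """
<db>
  <person ID="o123">
    <firstname>John</firstname>
    <lastname>Smith</lastname>
  </person>
  <person ID="o234">
    <firstname>Ann</firstname>
    <lastname>Lee</lastname>
  </person>
  <article author="o123 o234">
    <title>Some Title</title>
    <year>1995</year>
  </article>
</db>
"""


def lastname_title_pairs(root):
    """Join each article's author IDREFS with person IDs."""
    people_by_id = {p.get("ID"): p for p in root.iter("person")}
    pairs = []
    for article in root.iter("article"):
        # An IDREFS value is a whitespace-separated list of IDs.
        for ref in article.get("author", "").split():
            person = people_by_id.get(ref)
            if person is None:
                continue  # dangling reference: no join partner
            for title in article.findall("title"):
                for lastname in person.findall("lastname"):
                    pairs.append(((lastname.text or "").strip(),
                                  (title.text or "").strip()))
    return pairs


pairs = lastname_title_pairs(ET.fromstring(DATA_XML))
print(pairs)  # [('Smith', 'Some Title'), ('Lee', 'Some Title')]
```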
120
XML-QL- Advanced Examples
Tag Variables
Regular Path Expressions
Transforming XML Data (from one DTD to another)
Integrating Data from different XML sources
Embedding queries in data
For XML-QL, check http://www.w3.org/TR/NOTE-xml-ql
121
Summary

A lot of work has gone into web data models and
query languages, even before you can blink an eye
Some problems are addressed
Semi-structured
semi-structured data model based query languages
schema inference from the semi-structured data model
efficient processing of queries on
semi-structured data
efficient indexing and storage structures
integration with XML
Traditional
WebSQL/WebOQL
Web Warehousing
Which way will you go?

122
Further issues

Distributed query processing
Continuous result processing with push/pull
result replenishment
Labels, labels everywhere, and with XML even more
labels: how are the semantics of queries across
multiple information sources handled?
IR gives too many relevant/irrelevant results
Query Processing requires some schema knowledge
that is difficult to handle across multiple
sources
Can these two be bridged? Cooperative solutions.
Next: agents, agents everywhere. What are they
doing? Will it work, or will it be a fad?