Web Data Management - PowerPoint PPT Presentation

About This Presentation

Title:

Web Data Management

Slides: 67

Provided by: sanjay70

Learn more at: https://web.mst.edu

Transcript and Presenter's Notes

1
Class Number: CS 412
Web Data MGMT and XML
Instructor: Sanjay Madria
Lesson Title - Introduction
2

The link for the Real Player live stream is
http://movie.umr.edu/ramgen/encoder/liveCS412F03.rm
The link to view the archived Real Player lecture at 28 and 56 kbps is
http://movie.umr.edu/ramgen/CoursesF02/CS412F03/CS412Lec082803kbs2856.rm
(The lecture date section 082803 will change for each produced class)
The link to view the archived Real Player lecture at 200 kbps is
http://movie.umr.edu/ramgen/CoursesF03/CS412F03/CS412Lec082803kbs200.rm
For example, to watch the lecture for 15th Sept using Real Player, change the date so that the file name is
CS412Lec091503kbs200.rm

3
Web Data Management and XML

Sanjay Kumar Madria
Department of Computer Science
University of Missouri-Rolla
madrias_at_umr.edu

4
WWW

Huge, widely distributed, heterogeneous
collection of semi-structured multimedia
documents in the form of web pages connected via
hyperlinks.

5
World Wide Web

Web is fast growing
More business organizations putting information
on the Web
Business on the highway
Myriad of raw data to be processed for information

6
The more the WWW grows, the more chaotic it becomes

Web is fast growing, distributed,
non-administered global information resource
WWW allows access to text, image, video, sound
and graphic data
More business organizations creating web servers
More chaotic environment to locate information of
interest
Lost in hyperspace syndrome

7
Characteristics of WWW

WWW is a set of directed graphs
Data in the WWW is heterogeneous,
self-describing and schemaless
Unstructured information, deeply nested
No central authority to manage information
Dynamic versus static information
Web information discoveries - search engines

8
Web is Growing!

In 1994, the WWW grew by 1758%!
June 1993 - 130
June 1994 - 1265
Dec. 1994 - 11,576
April 1995 - 15,768
July 1995 - 23,000
2000 - !!!!!

9
COM domains are increasing!

As of July 1995, 6.64 million host computers on
the Internet
1.74 million are com domains
1.41 million are edu domains
0.30 million are net
0.27 million are gov
0.22 million are mil
0.20 million are org

10
The number of Internet hosts exceeded...

1,000 in 1984
10,000 in 1987
100,000 in 1989
1,000,000 in 1992
10,000,000 in 1996
100,000,000 in 2000

11
Top web countries

1. Canada (1) 80
2. US (4) 140
3. Ireland (3) 110
4. Iceland (2) 68
5. UK (14) 336
6. Malta (5) 155
7. Australia (6) 133
8. Singapore (11) 207
9. New Zealand (7) 101
10. Sweden (9) 101
11. Israel (12) 112
12. Cyprus (8) 72
13. Hong Kong (15) 148
14. Norway (10) 64
15. Switzerland (13) 75
16. Denmark (16) 105

12
How users find web sites

Indexes and search engines: 75%
UseNet newsgroups: 44%
Cool lists: 27%
New lists: 24%
Listservers: 23%
Print ads: 21%
Word-of-mouth and e-mail: 17%
Linked web advertisement: 4%

13
Limitations of Search Engines

Do not exploit hyperlinks
Search is limited to string matching
Queries are evaluated on archived data rather
than up-to-date data; there is no indexing on current data
Low accuracy
Replicated results
No further manipulation possible

14
Limitations of Search Engines

ERROR 404!
No efficient document management
Query results cannot be further manipulated
No efficient means for knowledge discovery

15
More PROBLEMS

Specifying/understanding what information is
wanted
High degree of variability of accessible
information
Variability in conceptual vocabulary or
ontology used to describe information
Complexity of querying unstructured data

Complexity of querying structured data
Uncontrolled nature of web-based information
content
Determining which information sources to
search/query

17
Search Engine Capabilities
Selection of language
Keywords with disjunction, adjacency, presence,
absence, ...
Word stemming (Hotbot)
Similarity search (Excite)
Natural language (LycosPro)
Restrict by modification date (Hotbot) or range
of dates (Alta Vista)
Restrict result types (e.g., must include images)
(Hotbot)
Restrict by geographical source (content or
domain) (Hotbot)
Restrict within various structured regions of a
document: titles or URLs (Lycos Pro); summary,
first heading, title, URL (Opentext)
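
The keyword capabilities listed above (presence, absence, disjunction) can be sketched in a few lines. This is only an illustration over an in-memory collection; the function names `tokenize` and `matches`, the parameter names, and the sample documents are assumptions made for the example and do not reflect how any of the named engines is implemented.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(text, must=(), must_not=(), any_of=()):
    """Return True if text contains every `must` word, no `must_not` word,
    and (when `any_of` is given) at least one `any_of` word."""
    words = set(tokenize(text))
    if not all(w in words for w in must):
        return False                      # presence condition failed
    if any(w in words for w in must_not):
        return False                      # absence condition failed
    if any_of and not any(w in words for w in any_of):
        return False                      # disjunction condition failed
    return True

# Hypothetical documents, used only to exercise the matcher.
docs = {
    "d1": "Best wines in France and Italy",
    "d2": "Linux user groups in Iceland",
    "d3": "French cheese and wines",
}
hits = [d for d, t in docs.items() if matches(t, must=["wines"], must_not=["italy"])]
```

Here `hits` keeps only the documents that mention "wines" without mentioning "italy". Adjacency and stemming would need extra machinery that this sketch leaves out.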

18
SEARCH RETRIEVAL

Search Engines

Search engine: share of the web covered
Hotbot: 34%
AltaVista: 28%
Northern Light: 20%
Excite: 14%
Infoseek: 10%
Lycos: 3%

Using several search engines is better than
using only one
Source: Lawrence, S., and Giles, C.L., Searching
the World Wide Web, Science 280, pp. 98-100,
1998.

19
Schemes to locate information

Supervised links between sites: ask at the reference desk
Classification of documents: search in the catalog
Automated searching: wander around the library

20
The most popular search engines

Year 2000
AltaVista
Yahoo
HotBot

Year 2001
Google
NorthernLight
AltaVista

21
Boolean search in AltaVista
22
Specifying field content in HotBot
23
Natural language interface in AskJeeves
24
Three examples of search strategies

Rank web pages based on popularity
Rank web pages based on word frequency
Match query to an expert database
All the major search engines use a mixed
strategy in ranking web pages and responding to
queries

25
Rank based on word frequency

Library analogue: keyword search
Basic factors in HotBot ranking of pages:
words in the title
keyword meta tags
word frequency in the document
document length
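
A minimal sketch of ranking on the four factors listed above (title words, keyword meta tags, word frequency, document length). The weights `w_title`, `w_meta` and `w_body`, and the way the factors are combined, are assumptions made for illustration; the slides do not give HotBot's actual formula.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, title, meta_keywords, body,
          w_title=3.0, w_meta=2.0, w_body=1.0):
    """Score one page for a query. The weights are illustrative assumptions."""
    title_words = set(tokenize(title))
    meta_words = set(tokenize(" ".join(meta_keywords)))
    body_words = tokenize(body)
    counts = Counter(body_words)
    length = max(len(body_words), 1)      # avoid division by zero
    total = 0.0
    for term in tokenize(query):
        if term in title_words:
            total += w_title              # word appears in the title
        if term in meta_words:
            total += w_meta               # word appears in keyword meta tags
        # word frequency in the document, normalised by document length
        total += w_body * counts[term] / length
    return total

# Two hypothetical pages scored against the same query.
page_a = score("wine", "French wine guide", ["wine"], "wine wine cheese")
page_b = score("wine", "Cheese", [], "wine cheese cheese cheese")
```

Dividing the term count by the document length keeps long pages from winning purely because they contain more words.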

26
Alternative word frequency measures

Excite uses a thesaurus to search for what you
want, rather than what you ask for
AltaVista allows you to look for words that occur
within a set distance of each other
NorthernLight weighs results by search term
sequence, from left to right
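
The proximity idea mentioned for AltaVista (words within a set distance of each other) might be sketched as follows. The function name `near` and the default distance of 10 words are assumptions for the example, not AltaVista's actual operator or threshold.

```python
import re

def near(text, word_a, word_b, max_distance=10):
    """True if word_a and word_b occur within max_distance words of each other."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    # Compare every occurrence of one word with every occurrence of the other.
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

sentence = "Linux is popular among students in Iceland"
close_enough = near(sentence, "linux", "iceland", max_distance=6)
too_far = near(sentence, "linux", "iceland", max_distance=3)
```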

27
Rank based on popularity

Library analogue: citation index
The Google strategy for ranking pages
Rank is based on the number of links to a page
Pages with a high rank have a lot of other web
pages that link to them
The formula is on the Google help page?
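
As a rough illustration of "rank is based on the number of links to a page", the sketch below simply counts distinct in-links. This is a deliberate simplification and not Google's actual formula; the function name `rank_by_inlinks` and the sample link data are hypothetical.

```python
from collections import Counter

def rank_by_inlinks(links):
    """links: iterable of (source_page, target_page) pairs.
    Returns all pages sorted by the number of distinct pages linking to them."""
    # Ignore duplicate links and self-links so a page cannot inflate its own count.
    distinct = {(s, t) for s, t in links if s != t}
    counts = Counter(t for _, t in distinct)
    pages = {p for pair in distinct for p in pair}
    # Highest in-link count first; ties broken alphabetically for determinism.
    return sorted(pages, key=lambda p: (-counts[p], p))

links = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c"), ("d", "a"), ("a", "c")]
ranking = rank_by_inlinks(links)
```

In this toy graph page "c" has three distinct in-links and page "a" has two, so they rank first and second.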

28
More on popularity ranking

The Google philosophy is also applied by others,
such as NorthernLight
HotBot measures the popularity of a page by how
frequently users have clicked on it in past
search results

29
Expert databases: Yahoo!

An expert database contains predefined responses
to common queries
A simple approach is a subject directory, e.g. in
Yahoo!, which contains a selection of links for
each topic
The selection is small, but can be useful
Library analogue: trustworthy references

30
Expert databases: AskJeeves

AskJeeves has predefined responses to various
types of common queries
These prepared answers are augmented by a
meta-search, which searches other search engines (SEs)
Library analogue: reference desk
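
A meta-search can be pictured as sending one query to several engines and merging the answers. The sketch below is an assumption-laden toy: the engines are stand-in functions returning fixed lists, the URLs are placeholders, and ordering by "how many engines returned the URL" is one possible merge policy, not the one AskJeeves used.

```python
def meta_search(query, engines):
    """engines: dict mapping an engine name to a function(query) -> ranked URL list.
    Returns merged URLs; those returned by more engines come first."""
    votes = {}
    first_seen = {}
    for name, search in engines.items():
        # dict.fromkeys drops duplicate URLs within one engine's result list.
        for url in dict.fromkeys(search(query)):
            votes[url] = votes.get(url, 0) + 1
            first_seen.setdefault(url, len(first_seen))
    # Ties are broken by the order in which URLs were first encountered.
    return sorted(votes, key=lambda u: (-votes[u], first_seen[u]))

# Hypothetical engines with canned results.
engines = {
    "engine_a": lambda q: ["http://example.org/1", "http://example.org/2"],
    "engine_b": lambda q: ["http://example.org/2", "http://example.org/3"],
}
merged = meta_search("best wines in France", engines)
```

Merging by URL also removes results that several engines replicate.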

31
Best wines in France: AskJeeves
32
Best wines in France: HotBot
33
Best wines in France: Google
34
Linux in Iceland: Google
35
Linux in Iceland: HotBot
36
Linux in Iceland: AskJeeves
37
Web Data Management is the Key
38
Key Objectives

Design a suitable data model to represent web
information
Development of web algebra and query language,
query optimization
Maintenance of Web data - View Maintenance
Development of knowledge discovery and web mining
tools
Web warehouse
Web data integration, secondary storage,
indexes

39
Limitations of the Web Today

Applications cannot consume HTML
HTML wrapper technology is brittle
Companies merge, need interoperability fast

40
Paradigm Shift

New Web standard: XML
XML generated by applications and consumed by
applications
Data exchange
Across platforms: enterprise interoperability
Across enterprises
Web: from documents to data

41
Database challenges

Query optimization and processing
Views and transformations
Data warehousing and data integration
Mediators and query rewriting
Secondary storages
indexes

42
DBMS needs a paradigm shift too

Web data differs from database data
self-describing, schemaless
structure changes without notice
heterogeneous, deeply nested, irregular
documents and data mixed
Designed by document experts, not db experts
Need web data mgmt

43
Web Data Representation

HTML - Hypertext Markup Language
fixed grammar, no regular expressions
Simple representation of data
good for simple data and intended for human
consumption
difficult to extract information
SGML - Standard Generalized Markup
Language - good for publishing deeply structured
documents
XML - Extensible Markup Language - a subset of SGML

44
Terminology

HTML - Hypertext Mark-up Language
HTTP - Hypertext Transfer Protocol
URL - Uniform Resource Locator
example - <URL> = <protocol>://<host>/<path>/<filename>#<location> where
<protocol> is http, ftp, gopher
<host> is the internet address
<location> is a textual label in the file.

Links are specified as
<A HREF="Destination URL">Anchor Text</A>
Destination URL is the URL of the destination
document and Anchor Text is the text that appears
as an anchor when displayed.
Example
<A HREF="http://www.ntu.edu.sg/">Nanyang Technological University</A>
Absolute and relative URLs
<A HREF="AtlanticStates/NYStats.html">New York</A> is relative
<A HREF="http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html">NCSA's Beginner's Guide to HTML</A> is an absolute address
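
The URL parts and the absolute/relative distinction above can be tried out with Python's standard library. The sketch collects anchors from a small HTML fragment and resolves each link against a base URL. The base URL, the class name `AnchorCollector`, and the `index.html#contact` path are assumptions invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class AnchorCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from <A HREF=...> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # urljoin leaves absolute URLs alone and resolves relative ones.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

base = "http://www.example.org/census/index.html"  # hypothetical containing page
page = ('<A HREF="AtlanticStates/NYStats.html">New York</A> '
        '<A HREF="http://www.ntu.edu.sg/">Nanyang Technological University</A>')
collector = AnchorCollector(base)
collector.feed(page)

# Splitting a URL into protocol, host, path and location (fragment);
# the path and fragment here are made up for illustration.
parts = urlparse("http://www.ntu.edu.sg/index.html#contact")
```

After `feed`, `collector.links` holds both links as absolute URLs, and `parts.scheme`, `parts.netloc`, `parts.path` and `parts.fragment` correspond to the protocol, host, path and location described above.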

46
World Wide Web

Prevalent, persistent and informative

HTML documents (soon, XML) created by humans or
applications.

Accessed day in and day out by humans and
applications.

Persistent HTML documents!!!

Can database technology help?
47
Current Research Projects

Web Query System
W3QS, WebSQL, AKIRA, NetQL, RAW,
WebLog, Araneus
Semistructured Data Management
LOREL, UnQL, WebOQL, Florid
Website Management System
STRUDEL, Araneus
Web Warehouse
WHOWEDA, Xylem.com

48
Main Tasks

Modeling and Querying the Web
view web as directed graph
content and link based queries
example - find the pages that contain the word
"clinton" and have a link from a page containing
the word "monica".
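
The example query combines a content condition with a link condition. Treating the web as a directed graph held in memory, it could be sketched as below; the page identifiers, page texts and the function name `pages_linked_from` are hypothetical, and real web query systems evaluate such queries quite differently.

```python
def pages_linked_from(pages, links, source_word, target_word):
    """pages: dict url -> text; links: set of (source_url, target_url) pairs.
    Return the target pages that contain target_word and are linked from
    a page containing source_word."""
    def contains(url, word):
        return word.lower() in pages.get(url, "").lower().split()

    return sorted({t for s, t in links
                   if contains(s, source_word) and contains(t, target_word)})

# A toy web of four pages and three hyperlinks.
pages = {
    "p1": "monica interview",
    "p2": "clinton speech",
    "p3": "clinton budget",
    "p4": "weather report",
}
links = {("p1", "p2"), ("p4", "p3"), ("p1", "p4")}
result = pages_linked_from(pages, links, "monica", "clinton")
```

Page "p3" mentions the target word but is excluded because the page that links to it does not contain the source word.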

Information Extraction and integration
Wrapper - a program that extracts a structured
representation of the data (a set of tuples) from
HTML pages.
Mediator - data integration software that
accesses multiple sources through a uniform interface
Web Site Construction and Restructuring
creating sites
modeling the structure of web sites
restructuring data
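
To make the wrapper idea concrete, here is a minimal sketch that turns the rows of an HTML table into tuples using only the standard library. The class name `TableWrapper`, the table layout and the movie data are invented for the example. A real wrapper depends on the exact markup of its source page, which is why wrappers break when the page layout changes.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Turn each <tr> of an HTML table into a tuple of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:   # ignore whitespace between tags
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None

# Hypothetical page fragment from a movie listing.
html = """
<table>
  <tr><th>Title</th><th>Director</th></tr>
  <tr><td>Movie A</td><td>Director X</td></tr>
  <tr><td>Movie B</td><td>Director Y</td></tr>
</table>
"""
wrapper = TableWrapper()
wrapper.feed(html)
header, *tuples = wrapper.rows
```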

50
What to Model

Structure of Web sites
Internal structure of web pages
Contents of web sites in finer granularities

51
Data Representation of Web Data

Graph Data Models
Semistructured Data Models (also graph based)

52
Graph Data Model

Labeled graph data model where nodes represent
web pages and arcs represent links between pages.
Labels on arcs can be viewed as attribute names.
Regular path expression queries
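
A small sketch of a labeled graph and a restricted form of regular path expression over arc labels. The pattern syntax is an assumption made for this example, not the syntax of any of the query languages named in these slides: "x" matches one arc labeled x, "_" matches any single arc, and "x*" matches zero or more arcs labeled x. The graph contents are hypothetical.

```python
def eval_path(graph, start, pattern):
    """graph: dict node -> list of (label, target) arcs.
    pattern: list of steps ("x", "_" or "x*", as described in the lead-in).
    Returns the set of nodes reachable from start along a matching path."""
    frontier = {start}
    for step in pattern:
        if step.endswith("*"):
            label = step[:-1]
            reached = set(frontier)          # zero repetitions are allowed
            todo = list(frontier)
            while todo:                      # closure over arcs with this label
                node = todo.pop()
                for arc_label, target in graph.get(node, []):
                    if (label == "_" or arc_label == label) and target not in reached:
                        reached.add(target)
                        todo.append(target)
            frontier = reached
        else:
            # Follow exactly one arc whose label matches the step.
            frontier = {target
                        for node in frontier
                        for arc_label, target in graph.get(node, [])
                        if step == "_" or arc_label == step}
    return frontier

# Pages are nodes; arc labels play the role of attribute names.
graph = {
    "home": [("dept", "cs"), ("dept", "math")],
    "cs": [("next", "cs2"), ("faculty", "prof_a")],
    "cs2": [("next", "cs3")],
    "cs3": [("faculty", "prof_b")],
    "math": [],
}
found = eval_path(graph, "home", ["dept", "next*", "faculty"])
```

The `next*` step lets the query reach faculty pages however many "next" links separate them from the department page.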

53
Semistructured Data Models

Irregular data structure; no fixed schema is known
in advance, and the schema may be implicit in the data
Schema may be large and may change frequently
Schema is descriptive rather than prescriptive:
it describes the current state of the data, but
violations of the schema are still tolerated

Data is not strongly typed: for different objects,
the values of the same attribute may be of
differing types (heterogeneous sources)
No restriction on the set of arcs that emanate
from a given node in a graph or on the types of
the values of attributes
Ability to query the schema: arc variables
get bound to labels on arcs, rather than to nodes in
the graph
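
The properties above can be seen on a tiny example. Nested Python dictionaries and lists stand in for a semistructured graph, with dictionary keys acting as arc labels. The data, the helper name `arcs`, and the two queries are assumptions for illustration only.

```python
def arcs(obj, path=()):
    """Yield (path_of_labels, value) for every labeled arc in a nested object."""
    if isinstance(obj, dict):
        for label, child in obj.items():
            yield path + (label,), child
            yield from arcs(child, path + (label,))
    elif isinstance(obj, list):
        for child in obj:                 # list members share the parent's path
            yield from arcs(child, path)

# Two movie objects with irregular structure: the second has no cast,
# has a nested director, and stores `year` as a string instead of an integer.
movies = [
    {"title": "Movie A", "year": 1955, "cast": ["Actor X", "Actor Y"]},
    {"title": "Movie B", "year": "unknown", "director": {"name": "Director Z"}},
]

# Descriptive schema: record which value types actually occur under each label.
types_by_label = {}
for path, value in arcs(movies):
    types_by_label.setdefault(path[-1], set()).add(type(value).__name__)

# Querying the schema: which arc labels lead to the value "Director Z"?
labels = sorted(path[-1] for path, value in arcs(movies) if value == "Director Z")
```

The first query derives the schema from the data instead of enforcing one, and the second binds a variable to an arc label, not to a node.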

55
Graph based Query Languages

Use graph to model databases
Support regular path expressions and graph
construction in queries.
Examples:
GraphLog for hypertext queries
graph query language for OO

56
Query Languages for Semi-Structured data

Use labeled graphs
Query the schema of data
Ability to accommodate irregularities in the
data, such as missing links etc.
Examples: Lorel (Stanford), UnQL (AT&T), STRUQL
(AT&T)

57
Comparison of Query Systems
58
Types of Query Languages

First Generation
Second generation

59
First Generation Query Languages

Combine the content-based queries of search
engines with structure-based queries
Combine conditions on text patterns in documents
with graph patterns describing link structures
Examples - W3QL (Technion, Israel),
WebSQL (Toronto), WebLOG (Concordia)

60
Second generation languages

Called web data manipulation languages
First-generation languages treat web pages as atomic
objects with two properties: they contain or do not
contain certain text patterns, and they point to other
objects; second-generation languages go further
Useful for data wrapping, transformation, and
restructuring
Useful for web site transformation and
restructuring
Access to the internal structure of web pages: for
example, extracting a set of tuples from the web
pages of a movie database requires parsing the pages
and selectively accessing certain subtrees in the
parse tree

61
How Do They Differ?

Provide access to the structure of web objects
they manipulate - return structure
Model internal structures of web documents as
well as the external links that connect them
Support references to model hyperlinks, and some
support ordered collections and records for
more natural data representation
Ability to create new complex structures as a
result of a query

62
Examples

WebOQL
STRUQL
Florid

63
Information Integration

To answer queries that may require extracting and
combining data from multiple web sources
Example - Movie database: data about movies,
their star casts, directors, schedules, etc.
Query: give me the playing times and reviews of
movies starring Frank Sinatra that are playing tonight in
Paris

64
Approaches

Web warehouse: data from multiple web sources is
loaded into a warehouse, and all queries are applied
to the warehouse data
Advantage - performance improvement
Disadvantage - the warehouse needs to be updated when
data sources change
Virtual warehouse: data remain in the web
sources, and queries are decomposed at run time into
queries to the sources
Data is not replicated and is fresh
Due to the autonomy of web sources, query optimization
and execution methodology may differ and
performance may be affected
Good when the number of sources is large, data
changes frequently, and there is little control over web
sources
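
The difference between the two approaches can be sketched with in-memory stand-ins. The classes `Source`, `Warehouse` and `VirtualWarehouse` and the sample rows are hypothetical; a real system would fetch from remote sites, and the virtual approach would pay network cost on every query.

```python
class Source:
    """A stand-in for an autonomous web source (hypothetical)."""
    def __init__(self, rows):
        self.rows = list(rows)

    def query(self, predicate):
        return [r for r in self.rows if predicate(r)]

class Warehouse:
    """Materialized approach: copy the data once, answer queries locally."""
    def __init__(self, sources):
        self.sources = sources
        self.rows = []
        self.refresh()

    def refresh(self):
        # Reload everything; must be repeated whenever a source changes.
        self.rows = [r for s in self.sources for r in s.query(lambda r: True)]

    def query(self, predicate):
        return [r for r in self.rows if predicate(r)]

class VirtualWarehouse:
    """Virtual approach: forward the query to every source at run time."""
    def __init__(self, sources):
        self.sources = sources

    def query(self, predicate):
        return [r for s in self.sources for r in s.query(predicate)]

cinema = Source([{"movie": "Movie A", "city": "Paris"}])
warehouse = Warehouse([cinema])
virtual = VirtualWarehouse([cinema])

cinema.rows.append({"movie": "Movie B", "city": "Paris"})  # the source changes

def in_paris(row):
    return row["city"] == "Paris"

stale = warehouse.query(in_paris)   # answered from the local copy
fresh = virtual.query(in_paris)     # answered by the source itself
```

Until `warehouse.refresh()` is called, the warehouse keeps answering from its old copy, while the virtual warehouse sees the new row immediately.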

65
Virtual approach versus DBMS

In the virtual approach, the system does not communicate
directly with a storage manager; instead it
communicates with wrappers
Second, the user does not pose queries directly in
the schema in which data is stored, so the user is free
from knowing the structure
The user poses queries to a mediated schema: virtual
relations (not stored anywhere) designed for a
particular application

66
Steps in data integration

Specification of mediated schema and
reformulation: the mediated schema is the set of
collection and attribute names needed to
formulate queries
The data integration system translates the query on
the mediated schema into queries to the data sources
Completeness of data in web sources
Differing query processing capabilities
Query optimization: selecting a minimal set of
sources and minimal queries
Wrapper construction
Matching objects across sources
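
The steps above can be sketched end to end on a toy mediated schema `movie(title, star, time)`. Everything here is an assumption for illustration: the two source names, their field names, the mappings, and the choice to match objects across sources by identical title. Real systems must cope with incomplete sources, differing query capabilities and much harder object matching.

```python
# Each hypothetical source keeps its own field names; `mapping` relates
# mediated-schema attributes to the source's local attributes.
SOURCES = {
    "listings_site": {
        "rows": [{"film": "Movie A", "showtime": "20:00"}],
        "mapping": {"title": "film", "time": "showtime"},
    },
    "cast_site": {
        "rows": [{"name": "Movie A", "actor": "Actor X"}],
        "mapping": {"title": "name", "star": "actor"},
    },
}

def relevant_sources(attributes):
    """Select only sources that can supply at least one requested attribute."""
    return [name for name, src in SOURCES.items()
            if set(attributes) & set(src["mapping"])]

def wrap(source_name):
    """Wrapper: translate source rows into mediated-schema attribute names."""
    src = SOURCES[source_name]
    return [{med: row[local] for med, local in src["mapping"].items()}
            for row in src["rows"]]

def answer(attributes):
    """Reformulate a mediated-schema query into per-source retrievals and
    match objects across sources on the shared `title` attribute."""
    merged = {}
    for name in relevant_sources(attributes):
        for row in wrap(name):
            merged.setdefault(row["title"], {}).update(row)
    return [{a: rec.get(a) for a in attributes} for rec in merged.values()]

result = answer(["title", "star", "time"])
```

The user asks only in terms of `title`, `star` and `time`, and never sees the field names `film`, `showtime`, `name` or `actor` used by the sources.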