Title: Web Data Management
1Class Number CS 412
Web Data MGMT and XML
Instructor Sanjay Madria
Lesson Title - Introduction
2- The link for the Real Player live stream is: http://movie.umr.edu/ramgen/encoder/liveCS412F03.rm
- The link to view the archived Real Player lecture at 28 and 56 kbps is: http://movie.umr.edu/ramgen/CoursesF02/CS412F03/CS412Lec082803kbs2856.rm
- (The lecture date section 082803 will change for each produced class)
- The link to view the archived Real Player lecture at 200 kbps is: http://movie.umr.edu/ramgen/CoursesF03/CS412F03/CS412Lec082803kbs200.rm
- For example, to watch the lecture for 15th Sept using Real Player, change the date in the file name to CS412Lec091503kbs200.rm
3Web Data Management and XML
- Sanjay Kumar Madria
- Department of Computer Science
- University of Missouri-Rolla
- madrias_at_umr.edu
4WWW
- Huge, widely distributed, heterogeneous
collection of semi-structured multimedia
documents in the form of web pages connected via
hyperlinks.
5World Wide Web
- Web is fast growing
- More business organizations putting information on the Web
- Business on the highway
- Myriad of raw data to be processed for information
6As WWW grows, the more chaotic it becomes
- Web is a fast-growing, distributed, non-administered global information resource
- WWW allows access to text, image, video, sound and graphic data
- More business organizations creating web servers
- More chaotic environment to locate information of interest
- Lost in hyperspace syndrome
7Characteristics of WWW
- WWW is a set of directed graphs
- Data in the WWW is heterogeneous, self-describing and schema-less
- Unstructured information, deeply nested
- No central authority to manage information
- Dynamic versus static information
- Web information discovery - search engines
8Web is Growing!
- In 1994, WWW grew by 1758% !!
- June 1993 - 130
- June 1994 - 1265
- Dec. 1994 - 11,576
- April 1995 - 15,768
- July 1995 - 23,000
- 2000 - !!!!!
9COM domains are increasing!
- As of July 1995, 6.64 million host computers on the Internet
- 1.74 million are com domains
- 1.41 million are edu domains
- 0.30 million are net
- 0.27 million are gov
- 0.22 million are mil
- 0.20 million are org
10The number of Internet hosts exceeded...
- 1,000 in 1984
- 10,000 in 1987
- 100,000 in 1989
- 1,000,000 in 1992
- 10,000,000 in 1996
- 100,000,000 in 2000
11Top web countries
- 1. Canada (1) 80
- 2. US (4) 140
- 3. Ireland (3) 110
- 4. Iceland (2) 68
- 5. UK (14) 336
- 6. Malta (5) 155
- 7. Australia (6) 133
- 8. Singapore (11) 207
- 9. New Zealand (7) 101
- 10. Sweden (9) 101
- 11. Israel (12) 112
- 12. Cyprus (8) 72
- 13. Hong Kong (15) 148
- 14. Norway (10) 64
- 15. Switzerland (13) 75
- 16. Denmark (16) 105
12How users find web sites
- Indexes and search engines 75
- UseNet newsgroups 44
- Cool lists 27
- New lists 24
- Listservers 23
- Print ads 21
- Word-of-mouth and e-mail 17
- Linked web advertisement 4
13Limitations of Search Engines
- Do not exploit hyperlinks
- Search is limited to string matching
- Queries are evaluated on archived data rather than up-to-date data; no indexing on current data
- Low accuracy
- Replicated results
- No further manipulation possible
14Limitations of Search Engines
- ERROR 404!
- No efficient document management
- Query results cannot be further manipulated
- No efficient means for knowledge discovery
15More PROBLEMS
- Specifying/understanding what information is wanted
- High degree of variability of accessible information
- Variability in conceptual vocabulary or ontology used to describe information
- Complexity of querying unstructured data
16- Complexity of querying structured data
- Uncontrolled nature of web-based information content
- Determining which information sources to search/query
17- Search Engine Capabilities
- Selection of language
- Keywords with disjunction, adjacency, presence, absence, ...
- Word stemming (Hotbot)
- Similarity search (Excite)
- Natural language (LycosPro)
- Restrict by modification date (Hotbot) or range of dates (Alta Vista)
- Restrict result types (e.g., must include images) (Hotbot)
- Restrict by geographical source (content or domain) (Hotbot)
- Restrict within various structured regions of a document (titles or URLs) (Lycos Pro) (summary, first heading, title, URL) (Opentext)
18SEARCH RETRIEVAL
Search engine     Web covered (%)
Hotbot            34
AltaVista         28
Northern Light    20
Excite            14
Infoseek          10
Lycos             3
- Using several search engines is better than using only one
- Source: Lawrence, S., and Giles, C.L., Searching the World Wide Web, Science 280, pp. 98-100, 1998.
19Schemes to locate information
- Supervised links between sites
- ask at the reference desk
- Classification of documents
- search in the catalog
- Automated searching
- wander around the library
20The most popular search engines
- Year 2000
- AltaVista
- Yahoo
- HotBot
- Year 2001
- Google
- NorthernLight
- AltaVista
21Boolean search in AltaVista
22Specifying field content in HotBot
23Natural language interface in AskJeeves
24Three examples of search strategies
- Rank web pages based on popularity
- Rank web pages based on word frequency
- Match query to an expert database
- All the major search engines use a mixed
strategy in ranking web pages and responding to
queries
25Rank based on word frequency
- Library analogue: keyword search
- Basic factors in HotBot ranking of pages
- words in the title
- keyword meta tags
- word frequency in the document
- document length
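The four factors above can be combined into a single score. The following is only a minimal sketch: the weights, the `score` function and the sample pages are hypothetical, chosen for illustration, and are not HotBot's actual formula.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(page, query, title_weight=3.0, meta_weight=2.0):
    """Score one page against a query using title words, keyword meta
    tags, word frequency and document length. Weights are hypothetical."""
    title = tokenize(page["title"])
    meta = [k.lower() for k in page["keywords"]]
    body = tokenize(page["body"])
    total = 0.0
    for term in tokenize(query):
        total += title_weight * title.count(term)   # words in the title
        total += meta_weight * meta.count(term)     # keyword meta tags
        # word frequency, normalised by document length
        total += body.count(term) / max(len(body), 1)
    return total

# Hypothetical pages, for illustration only.
pages = [
    {"url": "a.html", "title": "French wines", "keywords": ["wine", "france"],
     "body": "a guide to wines of france and the best wines by region"},
    {"url": "b.html", "title": "Travel notes", "keywords": ["travel"],
     "body": "we drank wines in paris"},
]
ranked = sorted(pages, key=lambda p: score(p, "wines"), reverse=True)
print([p["url"] for p in ranked])
```

Dividing the body count by the document length keeps a long page from outranking a short one merely because it repeats the term more often.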
26Alternative word frequency measures
- Excite uses a thesaurus to search for what you want, rather than what you ask for
- AltaVista allows you to look for words that occur within a set distance of each other
- NorthernLight weights results by search term sequence, from left to right
27Rank based on popularity
- Library analogue: citation index
- The Google strategy for ranking pages
- Rank is based on the number of links to a page
- Pages with a high rank have many other web pages that link to them
- The formula is on the Google help page ?
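As a rough illustration of ranking by the number of links to a page, the sketch below simply counts in-links. It is plain in-link counting under an assumed link table, not Google's published formula; the function name and the sample pages are hypothetical.

```python
from collections import Counter

def rank_by_inlinks(links):
    """Rank pages by how many distinct other pages link to them.

    `links` maps each page to the list of pages it links to.
    """
    counts = Counter()
    for source, targets in links.items():
        counts[source] += 0              # make sure pages with no in-links appear
        for target in set(targets):      # count each linking page once
            if target != source:         # ignore self-links
                counts[target] += 1
    # highest count first; ties broken alphabetically
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical link table, for illustration only.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html", "c.html"],
}
ranking = rank_by_inlinks(links)
print(ranking)
```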
28More on popularity ranking
- The Google philosophy is also applied by others, such as NorthernLight
- HotBot measures the popularity of a page by how frequently users have clicked on it in past search results
29Expert databases Yahoo!
- An expert database contains predefined responses to common queries
- A simple approach is a subject directory, e.g. in Yahoo!, which contains a selection of links for each topic
- The selection is small, but can be useful
- Library analogue: trustworthy references
30Expert databases AskJeeves
- AskJeeves has predefined responses to various types of common queries
- These prepared answers are augmented by a meta-search, which searches other search engines
- Library analogue: reference desk
31Best wines in France AskJeeves
32Best wines in France HotBot
33Best wines in France Google
34Linux in Iceland Google
35Linux in Iceland HotBot
36Linux in Iceland AskJeeves
37Web Data Management is the Key
38Key Objectives
- Design a suitable data model to represent web information
- Development of web algebra and query language, query optimization
- Maintenance of Web data - View Maintenance
- Development of knowledge discovery and web mining tools
- Web warehouse
- Web data integration, secondary storage, indexes
39Limitations of the Web Today
- Applications cannot consume HTML
- HTML wrapper technology is brittle
- Companies merge, need interoperability fast
40Paradigm Shift
- New Web standard: XML
- XML generated by applications and consumed by applications
- Data exchange
- Across platforms: enterprise interoperability
- Across enterprises
- Web: from documents to data
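A small sketch of XML being generated by one application and consumed by another, using Python's standard `xml.etree.ElementTree` module. The `order` record and its field names are hypothetical, chosen only to illustrate the exchange.

```python
import xml.etree.ElementTree as ET

# Producer: an application serialises a (hypothetical) record as XML.
order = ET.Element("order", id="42")
ET.SubElement(order, "item").text = "widget"
ET.SubElement(order, "quantity").text = "3"
message = ET.tostring(order, encoding="unicode")

# Consumer: another application parses the message and reads fields by name,
# with no HTML scraping involved.
parsed = ET.fromstring(message)
item = parsed.findtext("item")
quantity = int(parsed.findtext("quantity"))
print(parsed.get("id"), item, quantity)
```

Because the fields are named in the markup itself, the consumer does not depend on page layout, which is the contrast with HTML wrappers drawn on the previous slide.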
41Database challenges
- Query optimization and processing
- Views and transformations
- Data warehousing and data integration
- Mediators and query rewriting
- Secondary storages
- indexes
42DBMS needs paradigm shift too
- Web data differs from database data
- self describing, schema less
- structure changes without notice
- heterogeneous, deeply nested, irregular
- documents and data mixed
- Designed by document experts, not db experts
- Need web data management
43Web Data Representation
- HTML - Hypertext Markup Language
- fixed grammar, no regular expressions
- Simple representation of data
- good for simple data and intended for human consumption
- difficult to extract information
- SGML - Standard Generalized Markup Language
- good for publishing deeply structured documents
- XML - Extensible Markup Language - a subset of SGML
44Terminology
- HTML - Hypertext Mark-up Language
- HTTP - Hypertext Transfer Protocol
- URL - Uniform Resource Locator
- example - <URL> = <protocol>://<host>/<path>/<filename>#<location> where
- <protocol> is http, ftp, gopher
- <host> is the internet address
- <location> is a textual label in the file.
45 - Links are specified as
- <A HREF="Destination URL">Anchor Text</A>
- Destination URL is the URL of the destination document and Anchor Text is the text that appears as an anchor when displayed.
- Example
- <A HREF="http://www.ntu.edu.sg/">Nanyang Technological University</A>
- Absolute and relative URLs
- <A HREF="AtlanticStates/NYStats.html">New York</A> is relative
- <A HREF="http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html">NCSA's Beginner's Guide to HTML</A> is an absolute address
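The URL parts and the absolute/relative distinction can be checked with Python's standard `urllib.parse` module. This is a sketch only: the `#links` label and the base URL `www.example.edu` are assumptions added for illustration and do not come from the slides.

```python
from urllib.parse import urlsplit, urljoin

# Split an absolute URL into protocol, host, path/filename and location.
# The "#links" label is a hypothetical addition.
parts = urlsplit("http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html#links")
print(parts.scheme)    # protocol
print(parts.netloc)    # host
print(parts.path)      # path and filename
print(parts.fragment)  # location (textual label in the file)

# A relative URL is resolved against the URL of the page that contains it.
base = "http://www.example.edu/Regions/index.html"  # hypothetical base page
relative = urljoin(base, "AtlanticStates/NYStats.html")

# An absolute URL is unaffected by the base.
absolute = urljoin(base, "http://www.ntu.edu.sg/")
print(relative, absolute)
```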
46World Wide Web
- Prevalent, persistent and informative
- HTML documents (soon, XML) created by humans or
applications.
- Accessed day in and day out by humans and
applications.
- Persistent HTML documents!!!
Can database technology help?
47Current Research Projects
- Web Query System
- W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, Araneus
- Semistructured Data Management
- LOREL, UnQL, WebOQL, Florid
- Website Management System
- STRUDEL, Araneus
- Web Warehouse
- WHOWEDA, Xylem.com
48Main Tasks
- Modeling and Querying the Web
- view web as directed graph
- content and link based queries
- example - find the pages that contain the word clinton and have a link from a page containing the word monica.
49 - Information Extraction and integration
- Wrapper - program that extracts a structured representation of the data (a set of tuples) from HTML pages.
- Mediator - data-integration software that accesses multiple sources through a uniform interface
- Web Site Construction and Restructuring
- creating sites
- modeling the structure of web sites
- restructuring data
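To make the wrapper idea above concrete, here is a minimal sketch that turns the rows of an HTML table into tuples with Python's standard `html.parser`. The `TableWrapper` class and the sample page are hypothetical; a real wrapper would be tailored to one site's layout.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Collect the cell text of every <tr> as one tuple."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None

# Hypothetical movie page, for illustration only.
html = """
<table>
  <tr><td>Casablanca</td><td>Michael Curtiz</td><td>1942</td></tr>
  <tr><td>Vertigo</td><td>Alfred Hitchcock</td><td>1958</td></tr>
</table>
"""
wrapper = TableWrapper()
wrapper.feed(html)
movies = wrapper.rows
print(movies)
```

The extraction depends entirely on the page keeping its table layout; if the site switches to another markup, the wrapper silently returns nothing.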
50What to Model
- Structure of Web sites
- Internal structure of web pages
- Contents of web sites in finer granularities
51Data Representation of Web Data
- Graph Data Models
- Semistructured Data Models (also graph based)
52Graph Data Model
- Labeled graph data model where nodes represent web pages and arcs represent links between pages.
- Labels on arcs can be viewed as attribute names.
- Regular path expression queries
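A small sketch of the labeled graph model and a regular path expression query over it. The site, its arc labels and the `path_query` function are hypothetical; path length is bounded here as a simple way to cope with cycles, which is an assumption of this sketch rather than how any particular system works.

```python
import re

# Hypothetical site: each arc is (source page, label, target page).
arcs = [
    ("home", "dept", "cs"),
    ("home", "dept", "math"),
    ("cs", "course", "cs412"),
    ("cs412", "next", "lecture1"),
    ("lecture1", "next", "lecture2"),
    ("math", "course", "math101"),
]

def path_query(arcs, start, pattern, max_length=6):
    """Return the nodes reachable from `start` along a path whose arc
    labels, joined with '.', fully match the regular expression `pattern`."""
    regex = re.compile(pattern)
    results = set()
    frontier = [(start, [])]          # (current node, labels seen so far)
    for _ in range(max_length):       # bound the path length
        next_frontier = []
        for node, labels in frontier:
            for source, label, target in arcs:
                if source == node:
                    path = labels + [label]
                    if regex.fullmatch(".".join(path)):
                        results.add(target)
                    next_frontier.append((target, path))
        frontier = next_frontier
    return results

# Pages reached by a dept arc, a course arc, then one or more next arcs.
lectures = path_query(arcs, "home", r"dept\.course(\.next)+")
courses = path_query(arcs, "home", r"dept\.course")
print(sorted(lectures), sorted(courses))
```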
53Semistructured Data Models
- Irregular data structure, no fixed schema known; the schema may be implicit in the data
- Schema may be large and may change frequently
- Schema is descriptive rather than prescriptive: it describes the current state of the data, but violations of the schema are still tolerated
54 - Data is not strongly typed: for different objects, the values of the same attribute may be of differing types (heterogeneous sources)
- No restriction on the set of arcs that emanate from a given node in a graph or on the types of the values of attributes
- Ability to query the schema: arc variables get bound to labels on arcs, rather than to nodes in the graph
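The properties above can be seen in a few nested records. In this sketch the records and the `labels` helper are hypothetical: the same attribute has different types in different objects, one object carries an arc the other lacks, and the schema is recovered by querying the labels rather than being declared up front.

```python
# Hypothetical records from two sources: same attributes, differing types,
# and no requirement that every object has the same arcs.
movies = [
    {"title": "Vertigo", "year": 1958, "cast": ["James Stewart", "Kim Novak"]},
    {"title": "Casablanca", "year": "1942", "cast": "Humphrey Bogart",
     "review": {"source": "newspaper", "stars": 5}},
]

def labels(obj, prefix=""):
    """Collect every label path found in a nested object (a schema query)."""
    found = set()
    if isinstance(obj, dict):
        for label, value in obj.items():
            path = f"{prefix}.{label}" if prefix else label
            found.add(path)
            found |= labels(value, path)
    elif isinstance(obj, list):
        for item in obj:
            found |= labels(item, prefix)
    return found

schema = sorted(labels(movies))
year_types = {type(m["year"]).__name__ for m in movies}
print(schema)
print(year_types)
```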
55Graph based Query Languages
- Use graph to model databases
- Support regular path expressions and graph construction in queries.
- Examples
- GraphLog for hypertext queries
- graph query language for OO
56Query Languages for Semi-Structured data
- Use labeled graphs
- Query the schema of data
- Ability to accommodate irregularities in the data, such as missing links etc.
- Examples: Lorel (Stanford), UnQL (AT&T), STRUQL (AT&T)
57Comparison of Query Systems
58Types of Query Languages
- First Generation
- Second generation
59First Generation Query Languages
- Combine the content-based queries of search engines with structure-based queries
- Combine conditions on text patterns in documents with graph patterns describing link structures
- Examples - W3QL (TECHNION, Israel)
- WebSQL (Toronto), WebLOG (Concordia)
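A query that combines a content condition with a link condition can be sketched in a few lines. This is not the syntax of W3QL, WebSQL or WebLOG; the crawl data and the `linked_pages` function are hypothetical and only illustrate the kind of question these languages answer.

```python
# Hypothetical crawl: page text and outgoing links.
pages = {
    "p1": {"text": "notes on database systems", "links": ["p2", "p3"]},
    "p2": {"text": "an XML tutorial", "links": []},
    "p3": {"text": "cooking recipes", "links": ["p2"]},
    "p4": {"text": "XML schema reference", "links": []},
}

def contains(page, word):
    """Content condition: the page text contains the word."""
    return word.lower() in page["text"].lower().split()

def linked_pages(pages, source_word, target_word):
    """Pages containing `target_word` that are linked from a page
    containing `source_word` (content condition plus link condition)."""
    hits = set()
    for page in pages.values():
        if contains(page, source_word):
            for target in page["links"]:
                target_page = pages.get(target)
                if target_page is not None and contains(target_page, target_word):
                    hits.add(target)
    return hits

result = linked_pages(pages, "database", "xml")
print(result)
```

Page p4 also mentions XML but is not returned, because no page containing "database" links to it; a keyword-only search engine could not express that restriction.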
60Second generation languages
- Called web data manipulation languages
- Web pages as atomic objects with properties: they contain or do not contain certain text patterns, and they point to other objects
- Useful for data wrapping, transformation, and restructuring
- Useful for web site transformation and restructuring
- Access to the internal structure of web pages; for example, extracting a set of tuples from the web pages of a movie database requires parsing the pages and selectively accessing certain subtrees in the parse tree
61How they Differ?
- Provide access to the structure of the web objects they manipulate - return structure
- Model internal structures of web documents as well as the external links that connect them
- Support references to model hyperlinks, and some support ordered collections of records for more natural data representation
- Ability to create new complex structures as a result of a query
62Examples
63Information Integration
- To answer queries that may require extracting and combining data from multiple web sources
- Example - Movie database: data about movies, their star casts, directors, schedules etc.
- Give me the playing time and a review of movies starring Frank Sinatra, playing tonight in Paris
64Approaches
- Web warehouse: data from multiple web sources is loaded into a warehouse, and all queries are applied to the warehouse data
- Advantage - Performance improvement
- Disadvantage - Warehouse needs to be updated when data sources change
- Virtual warehouse: data remains in the web sources, and queries are decomposed at run time into queries to the sources
- Data is not replicated and is fresh
- Due to the autonomy of web sources, query optimization and execution methodology may differ and performance may be affected
- Good when the number of sources is large, data changes frequently, and there is little control over web sources
65Virtual approach versus DBMS
- In the virtual approach, the system does not communicate directly with a storage manager; instead it communicates with wrappers
- Second, the user does not pose queries directly in the schema in which data is stored, so the user is free from knowing the structure
- The user poses queries to a mediated schema: virtual relations (not stored anywhere) designed for a particular application
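The virtual approach can be sketched as a mediator that answers a query on a mediated relation by calling one wrapper per source and joining the results. Everything here is hypothetical: the two in-memory "sources", their field names, and the mediated relation (title, time, review) stand in for real web sites.

```python
# Two hypothetical sources with different native schemas.
SHOWTIMES = [
    {"film": "Vertigo", "city": "Paris", "time": "20:30"},
    {"film": "Vertigo", "city": "Rome", "time": "21:00"},
]
REVIEWS = [{"movie_title": "Vertigo", "text": "A classic."}]

def showtime_wrapper(city):
    """Wrapper for source 1: expose it as (title, time) tuples."""
    return [(r["film"], r["time"]) for r in SHOWTIMES if r["city"] == city]

def review_wrapper(title):
    """Wrapper for source 2: expose it as a list of review texts."""
    return [r["text"] for r in REVIEWS if r["movie_title"] == title]

def playing_with_reviews(city):
    """Query on the mediated relation (title, time, review).

    Nothing is stored for this relation: the query is decomposed at run
    time into one call per source, and the results are joined on title.
    """
    answer = []
    for title, time in showtime_wrapper(city):
        for review in review_wrapper(title):
            answer.append((title, time, review))
    return answer

rows = playing_with_reviews("Paris")
print(rows)
```

The caller only sees the mediated relation; the differing field names `film` and `movie_title` stay hidden inside the wrappers.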
66Steps in data integration
- Specification of mediated schema and reformulation
- Mediated schema is the set of collection and attribute names needed to formulate queries
- Data integration system translates the query on the mediated schema into a query to the data sources
- Completeness of data in web sources
- Differing query processing capabilities
- Query optimization: selecting a minimal set of sources and minimal queries
- Wrapper construction
- Matching objects across sources