Title: Web Warehousing : Design and Issues
1Web Warehousing Design and Issues
- Sanjay Kumar Madria
- Department of Computer Science
- Purdue University
- West Lafayette, IN 47907
- skm_at_cs.purdue.edu
2www.is.a.mess
3World Wide Web
- Web is fast growing
- More business organizations putting information in the Web
- Business on the highway
- Myriad of raw data to be processed for information
4WWW
- collection of multimedia documents in the form of
web pages connected via hyperlinks.
5As WWW grows, more chaotic it becomes
- Web is fast growing, distributed, non-administered global information resource
- WWW allows access to text, image, video, sound and graphic data
- more business organizations creating web servers
- more chaotic environment to locate information of interest
- lost in hyperspace syndrome
6Characteristics of WWW
- WWW is a set of directed graphs
- data in the WWW has a heterogeneous nature
- unstructured versus structured information
- no central authority to manage information
- Dynamic versus static information
- Web information discoveries - search engines
7Web is Growing!
- In 1994, WWW grew by 1758 !!
- June 1993 - 130
- June 1994 - 1265
- Dec. 1994 - 11,576
- April 1995 - 15,768
- July 1995 - 23,000
- 2000 - !!!!!
8COM domains are increasing!
- As of July 1995, 6.64 million host computers on the Internet
- 1.74 million are com domains
- 1.41 million are edu domains
- 0.30 million are net
- 0.27 million are gov
- 0.22 million are mil
- 0.20 million are org
9Top web countries
- 1. Canada (1) 80
- 2. US (4) 140
- 3. Ireland (3) 110
- 4. Iceland (2) 68
- 5. UK (14) 336
- 6. Malta (5) 155
- 7. Australia (6) 133
- 8. Singapore (11) 207
- 9. New Zealand (7) 101
- 10. Sweden (9) 101
- 11. Israel (12) 112
- 12. Cyprus (8) 72
- 13. Hong Kong (15) 148
- 14. Norway (10) 64
- 15. Switzerland (13) 75
- 16. Denmark (16) 105
10How users find web sites
- Indexes and search engines 75
- UseNet newsgroups 44
- Cool lists 27
- New lists 24
- Listservers 23
- Print ads 21
- Word-of-mouth and e-mail 17
- Linked web advertisement 4
11Limitations of Search Engines
- Do not exploit hyperlinks
- search is limited to string matching
- Queries are evaluated on archived data rather than up-to-date data; no indexing on current data
- low accuracy
- replicated results
- no further manipulation possible
12Limitations of Search Engines
- ERROR 404!
- No efficient document management
- Query results cannot be further manipulated
- No efficient means for knowledge discovery
13Key Objectives
- Design a suitable data model to represent web information
- development of web algebra and query language
- Maintenance of Web data
- Development of knowledge discovery and web mining tools
- Web warehouse
14Current Research Projects
- Web Query System
- W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog
- Semistructured Data
- LOREL, UnQL, WebOQL
- Website Management System
- STRUDEL
15Main Tasks
- Modeling and Querying the Web
- view web as directed graph
- content and link based queries
- example - find the pages that contain the word clinton and have a link from a page containing the word monica.
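A content and link based query of this kind can be sketched in a few lines of Python. The toy web below (URLs, page text and links) is entirely hypothetical, and the function name `pages_linked_from` is an illustration, not part of any system named in these slides.

```python
# Hypothetical toy web: page URL -> (page text, list of outgoing link targets).
web = {
    "http://a.example/": ("news about monica", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("statement by clinton", []),
    "http://c.example/": ("weather report", ["http://b.example/"]),
    "http://d.example/": ("clinton biography", []),
}

def pages_linked_from(web, source_word, target_word):
    """Return pages containing target_word that are linked from a page containing source_word."""
    hits = set()
    for url, (text, links) in web.items():
        if source_word in text.split():          # content condition on the source page
            for target in links:                 # link condition: follow each hyperlink
                target_text = web.get(target, ("", []))[0]
                if target_word in target_text.split():
                    hits.add(target)
    return sorted(hits)

result = pages_linked_from(web, "monica", "clinton")
print(result)
```

Note that http://d.example/ contains the target word but is not returned, because no page containing the source word links to it.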
16 - Information Extraction and integration
- wrapper - program to extract a structured representation of the data (a set of tuples) from HTML pages.
- Mediator - integration of data
- Web Site Construction and Restructuring
- creating sites
- modeling the structure of web sites
- restructuring data
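As an illustration of the wrapper idea above, here is a minimal sketch that turns the rows of an HTML table into tuples using Python's standard `html.parser` module. The class name `TableWrapper` and the sample page are assumptions made for this example; real wrappers are usually written per site.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Collects the cell text of each <tr> in an HTML page as one tuple."""

    def __init__(self):
        super().__init__()
        self.rows = []        # extracted tuples, one per table row
        self._row = None      # cells of the row currently being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None

# Hypothetical page content, used only to exercise the wrapper.
page = ("<table><tr><td>AIDS</td><td>Elisa Test</td></tr>"
        "<tr><td>Cancer</td><td>Biopsy</td></tr></table>")
wrapper = TableWrapper()
wrapper.feed(page)
tuples = wrapper.rows
print(tuples)
```

A mediator would then integrate the tuples produced by several such wrappers; that step is not shown here.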
17What to Model
- Structure of Web sites
- internal structure of web pages
- contents of web sites in finer granularities
18Data Representation of Web Data
- Graph Data Models
- Semistructured Data Models (also graph based)
19Graph Data Model
- labeled graph data model where nodes represent web pages and arcs represent links between pages.
- The labels on arcs can be viewed as attribute names.
- Regular path expression queries
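A regular path expression query over such a labeled graph can be sketched as follows. This is a simplified illustration: the graph, its node names and the `match_paths` helper are hypothetical, and the path expression is approximated by a Python regular expression applied to the slash-joined sequence of arc labels, with a depth bound to keep the search finite.

```python
import re

# Hypothetical labeled graph: node -> list of (arc label, target node).
graph = {
    "home": [("diseases", "list"), ("drugs", "druglist")],
    "list": [("disease", "aids")],
    "aids": [("symptoms", "aids_symptoms"), ("treatment", "aids_treatment")],
    "druglist": [("drug", "indavir")],
}

def match_paths(graph, start, pattern, max_depth=4):
    """Return nodes reachable from start along a path whose label sequence matches pattern."""
    regex = re.compile(pattern)
    found = set()
    stack = [(start, [])]                 # (current node, labels seen so far)
    while stack:
        node, labels = stack.pop()
        if labels and regex.fullmatch("/".join(labels)):
            found.add(node)
        if len(labels) < max_depth:       # depth bound also guards against cycles
            for label, target in graph.get(node, []):
                stack.append((target, labels + [label]))
    return sorted(found)

reached = match_paths(graph, "home", r"diseases/disease/(symptoms|treatment)")
print(reached)
```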
20Semistructured Data Models
- Irregular data structure, no fixed schema known and may be implicit in the data
- schema may be large and may change frequently
- the schema is descriptive rather than prescriptive: it describes the current state of the data, but violations of the schema are still tolerated
21 - data is not strongly typed: for different objects, the values of the same attribute may be of differing types (heterogeneous sources)
- no restriction on the set of arcs that emanate from a given node in a graph or on the types of the values of attributes
- ability to query the schema: arc variables which get bound to labels on arcs, rather than nodes in the graph
22Graph based Query Languages
- Use graph to model databases
- support regular path expressions and graph construction in queries.
- Examples
- GraphLog for hypertext queries
- graph query language for OO
23Query Languages for Semi-Structured data
- Use labeled graphs
- query the schema of data
- ability to accommodate irregularities in the data, such as missing links etc.
- Examples: Lorel (Stanford), UnQL (AT&T), StruQL (AT&T)
24Comparison of Query Systems
25WebSQL-University of Toronto
- Model web as relational database
- Use two relations Document and Anchor
- Document relation has one tuple for each document
in the web and the anchor relation has one tuple
for each anchor in each document
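The two-relation view can be illustrated with an in-memory SQLite database. This sketch uses ordinary SQL through Python's `sqlite3` module, not WebSQL syntax; the column names (`url`, `title`, `base`, `href`, `label`) and the page `diseases.html` are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Document (url TEXT PRIMARY KEY, title TEXT);  -- one tuple per document
CREATE TABLE Anchor (base TEXT, href TEXT, label TEXT);    -- one tuple per anchor
""")
conn.executemany("INSERT INTO Document VALUES (?, ?)", [
    ("http://www.panacea.org/", "Panacea"),
    ("http://www.panacea.org/diseases.html", "List of Diseases"),  # hypothetical page
])
conn.executemany("INSERT INTO Anchor VALUES (?, ?, ?)", [
    ("http://www.panacea.org/", "http://www.panacea.org/diseases.html", "Diseases"),
])

# Titles of the documents linked from a given page, with the anchor label used.
rows = conn.execute("""
SELECT d.title, a.label
FROM Anchor a JOIN Document d ON d.url = a.href
WHERE a.base = ?
""", ("http://www.panacea.org/",)).fetchall()
print(rows)
```

The real Web cannot be materialized as two tables, which is why a system built on this model has to evaluate such queries by navigating from known URLs or by consulting index servers.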
26WebSQL
- SQL-like query language for extracting information from the web.
- capable of systematic processing of either all the links in a page, all the pages that can be reached from a given URL through paths that match a pattern, or a combination of both.
- provides transparent access to index servers
27Web OQL (University of Toronto)
- provides a framework that supports a large class of data restructuring operations.
- Simple semistructured data model for documents and record-based data
- OQL-like syntax and regular expressions
- serves as a two-way bridge between databases and
the Web.
28WebDB
- View WWW as multimedia documents in the form of web pages
- WQL supports selection, aggregation, sorting, summary, grouping
- projection on title, URL, keywords, tables, forms, images etc.
29Presentation Overview
- WHOWEDA - warehouse of web data
- Research objectives
- Current research
- Web Data Model (WICM)
- Web Algebra
- Future work
30If you build it, they will come
- More chaotic
- Increasingly difficult to locate information.
- Related data are scattered in a piecemeal fashion
- Data, data everywhere, but how to find it?
31How does it affect the corporate world?
- Lack of credibility of data
- Different sites with different data
- Same site different data
- Historical information is not available
- Previous versions of web data
- How does web data change with time
- Summarization over time
- Data to information
- Reduction in productivity
- Analysis is manual
32Web Warehouse: Its Business Value
- Local data warehouse is inadequate
- Web is hot as a commercial medium
- Current size
- Future growth prospects
- Exceedingly attractive demographics
33WHOWEDA Research Objectives
- Build a web warehouse
- Web information access
- Web information manipulation
- Efficient visualization of web information
- Maintenance of web data
- Web data mining
- Overcome existing limitations
- Provide effective mechanisms to manipulate web
information
34WHOWEDA - What?
- WareHouse Of Web Data
- Subject-oriented
- Integrated
- Temporal
- Granularity - Lower, higher
- Some summary
- Not updatable
- Alternative information sources
35WHOWEDA! www.cais.ntu.edu.sg:8000/whoweda
- A WareHouse Of WEb DAta
- Web Information Coupling Model (WICM)
- Web Objects
- Web Schema
- Web Information Coupling Algebra
- Web Information Maintenance
- Web Mining and Knowledge discovery
36[Figure: architecture diagram. Labels: User; WWW; Warehouse Concept Mart; Web Querying Analysis Component; Web Information Coupling System; Web Information Maintenance System; Web Information Mining System; Web Warehouse; Web Mart (four instances)]
37[Figure: component diagram. Labels: User; WWW; Web Query Display; Warehouse Concept Mart; Web Warehouse; Pre processing; Global Web Manipulation; Global Web Coupling; Global Ranking; Local Web Manipulation; Local Web Coupling; Local Ranking; Data Visualization; Schema Tightness; Schema Search; Schema Match; Web Union; Web Select; Web Intersection; Web Project; Web Join]
38[Figure. Labels: WWW; Warehouse Concept Mart; Global Web Coupling; Webtable (Jan); Webtable (Feb); Webtable (Mar); Webtable (Apr)]
39[Figure. Labels: Webtable (Jan); Webtable (Feb); Webtable (Mar); Webtable (Apr); Lower-level Granularity; Web Information Manipulation Operators; Higher level Granularity; Summarized data]
40Web Objects
- Node - url, title, format, size, date, text
- Link - source-url, target-url, label, link-type
- Web tuple
- Web table
- Web schema
- Web database
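The web objects listed above map naturally onto simple record types. The following is only a sketch of one possible representation: the field names follow the attributes on this slide, while the field types, the defaults, the `WebTuple` and `WebTable` layouts, and the sample values are assumptions.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""

@dataclass(frozen=True)
class Link:
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""

@dataclass
class WebTuple:
    nodes: dict  # node variable name -> Node
    links: dict  # link variable name -> Link

@dataclass
class WebTable:
    name: str
    tuples: list = field(default_factory=list)

# Hypothetical instance: one web tuple with two nodes and the link between them.
root = Node(url="http://www.panacea.org/", title="Panacea")
page = Node(url="http://www.panacea.org/diseases.html", title="List of Diseases")
link = Link(source_url=root.url, target_url=page.url, label="Diseases")
table = WebTable(name="Diseases")
table.tuples.append(WebTuple(nodes={"x": root, "y": page}, links={"e": link}))
print(len(table.tuples), table.tuples[0].links["e"].label)
```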
41Web Schema
- Metadata in the warehouse
- Structural summary of web table
- Coupling of related information begins with a query graph
- Query graph -> Web schema
42Web Information Coupling System
- A database system to couple related web information
- Web data model
- Web objects
- Web schema
- Web algebra
43Meta-data in WHOWEDA
- Web schema
- Schema tree
- Information extracted from each web document or node
- URL, size, keywords, links, title, language, multi-media details, version history
44Web Algebra
- Formal foundation of data representation and manipulation in a web warehouse
- Web operators
- Information access operator
- Information manipulation operators
- Web schema operators
- Data visualization operators
45 - Directly querying the WWW is an expensive and repetitive affair
- information is already materialized in different web tables in the web database.
- a means to gather this similar information by additional manipulation of the materialized web tables.
46Global Coupling - Information Access
- To integrate data from the Web
- To create historical data
- To couple related information from the WWW satisfying a web schema
- Operator to create web tables
- From web with no schema to web table with web schema
47Global Coupling
- Match portions of the web that satisfy the web schema
- Input is a query graph
- Output is a web table
- Example
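The slide's own example did not survive in this transcript. As a stand-in, the sketch below shows the general shape of the operation: a query graph goes in, a web table comes out. Everything in it is hypothetical, including the tiny web, the chain-shaped query graph, the representation of predicates as Python functions, and the name `global_couple`; it is not the WHOWEDA implementation.

```python
# Hypothetical web: url -> title, and links as (source url, label, target url).
titles = {"u0": "Panacea", "u1": "List of Diseases", "u2": "AIDS", "u3": "Drug list"}
links = [("u0", "Diseases", "u1"), ("u1", "AIDS", "u2"), ("u0", "Drugs", "u3")]

# Query graph: a chain x -e-> y -f-> z; each node predicate is a plain function.
node_preds = {
    "x": lambda url: url == "u0",
    "y": lambda url: "Diseases" in titles[url],
    "z": lambda url: True,
}
link_vars = [("x", "e", "y"), ("y", "f", "z")]

def global_couple(links, node_preds, link_vars):
    """Return a web table: one dict per match, binding node and link variables."""
    table = [{}]
    for src_var, link_var, dst_var in link_vars:
        extended = []
        for row in table:
            for src, label, dst in links:
                # A variable that is already bound must keep its binding.
                if row.get(src_var, src) != src or row.get(dst_var, dst) != dst:
                    continue
                if node_preds[src_var](src) and node_preds[dst_var](dst):
                    extended.append({**row, src_var: src, link_var: label, dst_var: dst})
        table = extended
    return table

web_table = global_couple(links, node_preds, link_vars)
print(web_table)
```

Each dictionary in the result plays the role of a web tuple, and the list as a whole plays the role of the web table that satisfies the schema.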
48Information Manipulation
- Used for analysis of web data in the warehouse
- Web select
- Web project
- Local web coupling
- Web join
- Web cartesian product
- Web union
- Web intersect
49Join Processing in Web Databases
50Web Join
- Concatenate tuples based on identical nodes or documents
- Input is two web tables and their schemas
- Output is a joined table
- Types
- Pi-web join, sigma-web join, outer joins, web composition, semi web join
51Web Join
- Analyse web tables storing temporal data
- Used for combining related data from various web tables
- Mechanism to provide summarized information
- Mechanism to find alternative web document in case of Document Not Found error
- Example
52Example 1
- Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at http://www.panacea.org/
- Web table Diseases
53[Figure for Example 1: graph starting at http://www.panacea.org/. Variables: x, y, z, w, p, q, e, f, g. Labels: Issues; List of Diseases; Symptoms list; Symptoms; Evaluation; Treatment list; Treatment]
54[Figure for Example 1: instance graph starting at http://www.panacea.org/. Identifiers: x0, y1, z1, w1, p2, q1, e1, f1, g1. Labels: Issues; List of Diseases; AIDS; Symptoms list; Symptoms; Evaluation; Elisa Test; Treatment list; Treatment]
55Example 2
- Produce a list of drugs, and their uses and side effects starting from the web site at http://www.panacea.org/
- Web table Drugs
57[Figure for Example 2: instance graph starting at http://www.panacea.org/. Identifiers: a0, b1, c1, d1, k1, r1, s1. Labels: Issues; List of Diseases; AIDS; Drug list; Indavir; Use; Uses of Indavir; Side effects; Side effects of Indavir]
58Web Join Operator
- Information manipulation operator
- Manipulate information residing in a web database to derive additional information
- Harness useful, composite information from two web tables
- Capitalize on the reuse of retrieved data from the WWW in order to reduce execution time of queries
59Joinable Nodes
- Node variables participating in the web join process
- Expressed as a pair
- Each node in the pair should have identical URLs
60Web Join
- Combine two web tables by concatenating a web tuple of one web table with a web tuple of the other web table whenever there exist joinable nodes
- Joinable nodes are identified from the schemas of the two web tables
- URLs of the joinable nodes are identical
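The description above can be sketched as a nested loop over the two web tables. In this illustration a web tuple is reduced to a dictionary from node variable to URL, the pages under http://www.panacea.org/ are hypothetical, and `web_join` is an assumed name; the sketch ignores links and the construction of the joined schema.

```python
def web_join(table1, table2, joinable):
    """Concatenate tuples of table1 and table2 whose joinable nodes share a URL.

    joinable is a list of (variable in table1, variable in table2) pairs.
    """
    right_vars = {b for _, b in joinable}
    joined = []
    for t1 in table1:
        for t2 in table2:
            if all(t1[a] == t2[b] for a, b in joinable):
                merged = dict(t1)
                # The joinable node is kept once, under its table1 variable.
                merged.update({k: v for k, v in t2.items() if k not in right_vars})
                joined.append(merged)
    return joined

base = "http://www.panacea.org/"
diseases = [{"x": base, "z": base + "aids.html", "w": base + "elisa.html"}]
drugs = [
    {"a": base, "b": base + "aids.html", "d": base + "indavir.html"},
    {"a": base, "b": base + "cancer.html", "d": base + "other.html"},
]
joined = web_join(diseases, drugs, [("z", "b")])
print(joined)
```

Only the first Drugs tuple is concatenated with the Diseases tuple, because only its node b has the same URL as node z.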
61Flavors of Web Join
- Natural Web Join
- Theta Web Join (web join followed by web select)
- Examples
- Further flavors
- Single-node join
- Multi-node join
62[Figure: graph starting at http://www.panacea.org/. Variables: x, y, z, w, p, q, e, f, g, b, c, d, k, r, s. Labels: Issues; List of Diseases; Symptoms list; Symptoms; Evaluation; Treatment list; Treatment; Drug list; Use; Uses; Side effects]
63[Figure: instance graph starting at http://www.panacea.org/. Identifiers: x0, y1, z1, w1, p2, q1, e1, f1, g1, b1, c1, d1, k1, r1, s1. Labels: AIDS; Symptoms of AIDS; AIDS treatment; Evaluation; Elisa Test; Indavir; Uses of Indavir; Side effects of Indavir]
64Join Existence
- Given two web tables, we determine if these two web tables are joinable
- Inspect the schemas of the web tables
- Satisfy joinability conditions based on
- node predicates
- link predicates
- node and link predicates
- locus of a node relative to a joinable node
65Join Construction
- To construct a joined schema, we construct
- node set
- link set
- connectivity set
- predicate set
- Construction of joined table
- Concatenating the web tuples of the two input
tables over the joinable nodes
66Web Select
- Extracts web tuples from web tables satisfying certain conditions on node and link variables and on connectivities
- Input is a select schema
- Output is a web table satisfying the select schema
67Web Couple: Coupling web information
68Why web coupling?
- Related information in the web is supplied by different information providers.
- Web documents containing similar information can reside in different web tables in the Web Database.
69Why web coupling?
- The web couple operator gives us the capability
to manipulate these web tables to harness useful
related information.
70Web Couple Operator
- Web couple operator is a composite operator.
- combination of Web Cartesian Product followed by Web Select.
- Web cartesian product followed by a web select is a frequently used operation.
- motivates us to create a separate composite operator to handle this.
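The composition described above can be written down directly. In this sketch a web tuple is a dictionary from node variable to a URL placeholder, the two input tables are assumed to use disjoint variable names, and the function names and sample data are hypothetical.

```python
from itertools import product

def web_cartesian_product(table1, table2):
    """Pair every tuple of table1 with every tuple of table2."""
    return [{**t1, **t2} for t1, t2 in product(table1, table2)]

def web_select(table, condition):
    """Keep only the tuples that satisfy the condition."""
    return [t for t in table if condition(t)]

def web_couple(table1, table2, condition):
    """Composite operator: cartesian product followed by select."""
    return web_select(web_cartesian_product(table1, table2), condition)

diseases = [{"z": "u_aids"}, {"z": "u_cancer"}]
drugs = [{"c": "u_aids"}, {"c": "u_flu"}]

all_pairs = web_cartesian_product(diseases, drugs)
coupled = web_couple(diseases, drugs, lambda t: t["z"] == t["c"])
print(len(all_pairs), coupled)
```

A dedicated composite operator also leaves room for an implementation that avoids materializing the full product, although this sketch does not attempt that.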
71 - Val(p) is the operand of the op(p).
72Definitions
- Coupling nodes: we define coupling nodes as node variables participating in the web coupling.
- We express the coupling nodes of two web schemas as a pair, e.g. (c, z), since they cannot exist as a single node variable.
73Definitions
- One coupling node variable can be in more than one pair. That is, the pairs in a set of coupling node pairs need not be disjoint.
- The attribute of the coupling node as defined in the predicate of the node is called the coupling attribute.
- The predicate is called the coupling predicate.
74Web Coupling
75Types of web coupling
- Single node coupling: web coupling in which only one node variable in each schema is involved.
- Multinode coupling: web coupling in which more than one node variable in each schema participates.
76Types of web coupling
- System driven web coupling: in this case the system decides which node variables are to be coupled (coupling nodes). If not even one pair of coupling nodes can be identified, then the web tables cannot be coupled.
77Types of web coupling
- User driven web coupling: in this case the user decides which node variables are to be coupled (coupling nodes).
- Coupling is performed only on those user specified node variable(s).
78Types of web coupling
- Attribute driven web coupling: in this case the user specifies the coupling attributes.
- Coupling is performed only on those user specified coupling attribute(s).
79Types of web coupling
- Value driven web coupling: in this case the user specifies the values of the attributes of the nodes on which coupling should be performed.
- Coupling is performed only on those user specified attribute values.
80Levels of web coupling
- Schema level web coupling.
- Tuple level web coupling.
81Schema level web coupling
- We inspect the schemas to decide whether the two web tables can be coupled.
- If coupling conditions cannot be identified then the two web tables cannot be coupled.
- We do not inspect the web tuples in the web table.
82Schema level web coupling
- Let n and m be the number of web tuples of the
two input web tables. Then the coupled web table
based on schema level web coupling will always
have nm web tuples.
83Tuple level web coupling
- We inspect the web tuples of the two input web tables to identify nodes with similar information.
- The number of web tuples in the coupled web table is < nm
84Why two levels?
- A schema does not capture all the information of the web documents in a web table, so it is not always possible to identify a coupling condition by inspecting the schemas.
- It is possible to find coupling nodes which are not defined in the schemas.
85Why two levels?
- Tuple level coupling gives us a means to correlate web documents containing similar information from the web tables (that cannot be identified from their schemas) at the expense of additional processing.
86Conditions for web coupling
- URLs with the same directory name, such as /computer/, may contain similar information.
- Paths with /cgi-bin/ are not considered.
- Include all conditions for web join.
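The first two conditions above can be approximated with Python's standard URL handling. This sketch assumes that "same directory name" means the two URL paths share at least one directory component, which is only one possible reading of the slide; the function names and example URLs are hypothetical, and the web join conditions are not covered.

```python
import posixpath
from urllib.parse import urlsplit

def directories(url):
    """Return the set of directory names on the URL path (file name excluded)."""
    path = urlsplit(url).path
    return {part for part in posixpath.dirname(path).split("/") if part}

def may_couple(url1, url2):
    """True if the URLs share a directory name and neither path goes through cgi-bin."""
    dirs1, dirs2 = directories(url1), directories(url2)
    if "cgi-bin" in dirs1 | dirs2:
        return False
    return bool(dirs1 & dirs2)

print(may_couple("http://a.example/computer/hw.html", "http://b.example/computer/sw.html"))
print(may_couple("http://a.example/cgi-bin/computer/q", "http://b.example/computer/sw.html"))
```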
87Construction of coupled schema (schema level)
- When at least one pair of coupling nodes is identical (same URL).
- When none of the pairs is identical.
88Web Bags
- Existence of identical web tuples.
- Created due to web project operation.
- Structure based mining
- Used for discovering
- Visible nodes
- Luminous nodes
- Luminous paths
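How a web project operation gives rise to identical web tuples can be seen in a small sketch. Web tuples are again reduced to dictionaries from node variable to a URL placeholder; the data, the choice of projected variables and the helper name `web_project` are hypothetical.

```python
from collections import Counter

def web_project(table, keep):
    """Keep only the listed variables in each web tuple; duplicates are retained."""
    return [tuple((var, t[var]) for var in keep) for t in table]

table = [
    {"x": "u0", "y": "u1", "z": "u_desc"},
    {"x": "u0", "y": "u2", "z": "u_desc"},
    {"x": "u0", "y": "u3", "z": "u_skin"},
]

bag = web_project(table, ["x", "z"])   # two of the three projected tuples coincide
counts = Counter(bag)                  # multiplicity of each distinct tuple
distinct = list(counts)                # the table after removing identical tuples
print(counts)
```

The multiplicities kept in the bag are the raw material for the structure based measures listed on this slide; removing duplicates would discard them.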
89Definitions
- Visibility of a web document or node D in a web table W measures the number of different web documents in W that have links to D
- Luminosity - the reverse of visibility: the number of other distinct documents that are linked from D
- Luminous paths - a set of inter-linked nodes which occurs a number of times in a web table
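The first two definitions translate directly into counts over the links found in a web table. In this sketch the links are given as hypothetical (source, target) pairs and self-links are ignored, which is an assumption; luminous paths are not covered.

```python
def visibility(links, doc):
    """Number of different documents that have a link to doc."""
    return len({src for src, dst in links if dst == doc and src != doc})

def luminosity(links, doc):
    """Number of other distinct documents that are linked from doc."""
    return len({dst for src, dst in links if src == doc and dst != doc})

# Hypothetical links collected from the tuples of one web table.
links = [("a", "d"), ("b", "d"), ("a", "d"), ("d", "c")]

print(visibility(links, "d"), luminosity(links, "d"))
```

The repeated link from a to d is counted once, since both measures count distinct documents rather than link occurrences.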
90Web Schema
[Figure: graph starting at http://www.panacea.org/. Variables: x, y, z, e, f. Labels: Diseases; Cancer]
91[Figure: Web Table. Several web tuples starting at http://www.panacea.org/, with identifiers x0, y0, e0, f0 and z1, z2 or z4; labels Diseases and Cancer; recurring URL http://www.cancer.org/desc.html]
92Projected schema
93Cancer
Web Table after eliminating x and y
94Projected schema
[Figure: graph starting at http://www.panacea.org/. Variables: x, y, z, e. Labels: Diseases; Cancer]
95[Figure: Web Bag. Identifiers: x0, y0, z4. Labels: Diseases; Cancer. URLs: http://www.panacea.org/; http://www.cancer.org/desc.html (repeated); http://www.disease.com/cancer/skin.htm; http://www.jhu.edu/medical/research/cancer.htm]
96After removal of identical tuples
http//www.cancer.org/desc.html
97[Figure. Identifier: z1. Label: Cancer. URLs: http://www.cancer.org/desc.html (repeated); http://www.disease.com/cancer/skin.htm; http://www.jhu.edu/medical/research/cancer.htm]
98http//www.cancer.org/desc.html
99Visible Nodes
[Figure. Label: Cancer. Identifiers: z1 (twice), z2, z4. URLs: http://www.cancer.org/desc.html (twice); http://www.disease.com/cancer/skin.htm; http://www.jhu.edu/medical/research/cancer.htm]
100Luminous Paths
101More Operators . . .
- Web schema operators
- Schema tightness operator, Schema match operator, Schema search operator
- Data visualization operators
- Ranking operators (Global and Local), Web Nest, Web Un-nest, Web Coalesce, Web Expand, Web Pack, Web Unpack, Web Sort
102Partitioning of web tables
- Partitioning web tables
- restructured easily
- indexed easily
- monitored easily
- reorganized easily
- By
- time
- schema tree structure
- keywords
103Warehouse Concept Mart (WCMart)
- Subject oriented
- Concept generation.
- Manually -> Autonomous.
- Used for
- Ranking tuples
- Global web coupling
- Content based mining
104Data Mining in Web Warehouse
- Scalability of data
- Text mining
- Mining information from multiple web tables
- Interactive web mining
- Discover rules
- Web Bag
- Warehouse Concept Mart
105Summary
- Current status
- Mechanism for accessing and manipulating web information in WHOWEDA
- Implementing various web operators
- Future research
- What types of information can be summarized?
- What types of knowledge can be mined?
- Refine web warehouse architecture
- www.cais.ntu.edu.sg:8000/whoweda