Web Warehousing : Design and Issues - PowerPoint PPT Presentation


Title: Web Warehousing : Design and Issues


1
Web Warehousing Design and Issues
  • Sanjay Kumar Madria
  • Department of Computer Science
  • Purdue University
  • West Lafayette, IN 47907
  • skm_at_cs.purdue.edu

2
www.is.a.mess
3
World Wide Web
  • Web is fast growing
  • More business organizations putting information
    in the Web
  • Business on the highway
  • Myriad of raw data to be processed for information

4
WWW
  • collection of multimedia documents in the form of
    web pages connected via hyperlinks.

5
As the WWW grows, the more chaotic it becomes
  • Web is fast growing, distributed,
    non-administered global information resource
  • WWW allows access to text, image, video, sound
    and graphic data
  • more business organizations creating web servers
  • more chaotic environment to locate information of
    interest
  • lost in hyperspace syndrome

6
Characteristics of WWW
  • WWW is a set of directed graphs
  • data in the WWW has a heterogeneous nature
  • unstructured versus structured information
  • no central authority to manage information
  • Dynamic versus static information
  • Web information discoveries - search engines

7
Web is Growing!
  • In 1994, WWW grew by 1758 !!
  • June 1993 - 130
  • June 1994 - 1265
  • Dec. 1994 - 11,576
  • April 1995 - 15,768
  • July 1995 - 23,000
  • 2000 - !!!!!

8
COM domains are increasing!
  • As of July 1995, 6.64 million host computers on
    the Internet
  • 1.74 million are com domains
  • 1.41 million are edu domains
  • 0.30 million are net
  • 0.27 million are gov
  • 0.22 million are mil
  • 0.20 million are org

9
Top web countries
  • 1. Canada (1) 80
  • 2. US (4) 140
  • 3. Ireland (3) 110
  • 4. Iceland (2) 68
  • 5. UK (14) 336
  • 6. Malta (5) 155
  • 7. Australia (6) 133
  • 8. Singapore (11) 207
  • 9. New Zealand (7) 101
  • 10. Sweden (9) 101
  • 11. Israel (12) 112
  • 12. Cyprus (8) 72
  • 13. Hong Kong (15) 148
  • 14. Norway (10) 64
  • 15. Switzerland (13) 75
  • 16. Denmark (16) 105

10
How users find web sites
  • Indexes and search engines 75
  • UseNet newsgroups 44
  • Cool lists 27
  • New lists 24
  • Listservers 23
  • Print ads 21
  • Word-of-mouth and e-mail 17
  • Linked web advertisement 4

11
Limitations of Search Engines
  • Do not exploit hyperlinks
  • search is limited to string matching
  • Queries are evaluated on archived data rather
    than up-to-date data; no indexing on current data
  • low accuracy
  • replicated results
  • no further manipulation possible

12
Limitations of Search Engines
  • ERROR 404!
  • No efficient document management
  • Query results cannot be further manipulated
  • No efficient means for knowledge discovery

13
Key Objectives
  • Design a suitable data model to represent web
    information
  • development of web algebra and query language
  • Maintenance of Web data
  • Development of knowledge discovery and web mining
    tools
  • Web warehouse

14
Current Research Projects
  • Web Query System
  • W3QS, WebSQL, AKIRA, NetQL, RAW,
  • WebLog
  • Semistructured Data
  • LOREL, UnQL, WebOQL
  • Website Management System
  • STRUDEL

15
Main Tasks
  • Modeling and Querying the Web
  • view web as directed graph
  • content and link based queries
  • example - find the pages that contain the word
    clinton and have a link from a page containing
    the word monica.
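
A combined content and link query of this kind can be sketched over a toy in-memory graph. This is only an illustration: the page names, texts, and the helper linked_pages are invented here and do not correspond to any of the query systems named in these slides.

```python
# Toy web graph: page URL -> (page text, outgoing links). All data is invented.
pages = {
    "a.html": ("monica appears in this story", ["b.html", "c.html"]),
    "b.html": ("clinton gave a speech", []),
    "c.html": ("weather report", ["b.html"]),
    "d.html": ("clinton and budget news", []),
}


def linked_pages(pages, source_word, target_word):
    """Pages containing target_word that are linked from a page containing source_word."""
    hits = set()
    for url, (text, links) in pages.items():
        if source_word not in text.split():
            continue  # content predicate on the source page failed
        for target in links:  # follow each hyperlink (link predicate)
            target_text = pages.get(target, ("", []))[0]
            if target_word in target_text.split():
                hits.add(target)
    return sorted(hits)


result = linked_pages(pages, "monica", "clinton")
print(result)
```

Note that d.html mentions clinton but is not returned, because no page containing monica links to it; the link condition is what a plain keyword search cannot express.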

16
  • Information Extraction and integration
  • wrapper - program to extract a structured
    representation of the data (a set of tuples) from
    HTML pages.
  • Mediator - integration of data
  • Web Site Construction and Restructuring
  • creating sites
  • modeling the structure of web sites
  • restructuring data
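
The wrapper idea above can be illustrated with a minimal sketch built on Python's standard html.parser module. The class name RowWrapper and the sample HTML are assumptions made for this example; a real wrapper would be written against the layout of a specific site.

```python
from html.parser import HTMLParser


class RowWrapper(HTMLParser):
    """Hypothetical wrapper: collect the cell text of each <tr> as one tuple."""

    def __init__(self):
        super().__init__()
        self.rows = []         # extracted tuples
        self._row = None       # cells of the row being read
        self._in_cell = False  # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Invented sample page; only the table rows matter to the wrapper.
html = """
<table>
  <tr><td>Indavir</td><td>AIDS</td></tr>
  <tr><td>Aspirin</td><td>Headache</td></tr>
</table>
"""
wrapper = RowWrapper()
wrapper.feed(html)
tuples = wrapper.rows
print(tuples)
```

A mediator would then integrate tuples produced by several such wrappers into one view.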

17
What to Model
  • Structure of Web sites
  • internal structure of web pages
  • contents of web sites in finer granularities

18
Data Representation of Web Data
  • Graph Data Models
  • Semistructured Data Models (also graph based)

19
Graph Data Model
  • labeled graph data model where nodes represent
    web pages and arcs represent links between pages.
  • The labels on arcs can be viewed as attribute
    names.
  • Regular path expression queries
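
As a rough sketch of a regular path expression query, the arc labels along a path can be joined with "/" and matched against a regular expression. The graph, the labels, and the function reachable are invented for illustration, and the bounded path enumeration is a simplification, not how any particular system evaluates such queries.

```python
import re

# Labeled arcs (source node, label, target node). All names are invented.
edges = [
    ("home", "dept", "cs"),
    ("cs", "people", "faculty"),
    ("faculty", "person", "alice"),
    ("home", "news", "press"),
    ("cs", "courses", "db101"),
]


def reachable(edges, start, pattern, max_len=4):
    """Nodes reachable from start by a path whose '/'-joined labels fully match pattern."""
    out = {}
    for src, label, dst in edges:
        out.setdefault(src, []).append((label, dst))
    regex = re.compile(pattern)
    found = set()
    stack = [(start, [])]
    while stack:
        node, labels = stack.pop()
        if labels and regex.fullmatch("/".join(labels)):
            found.add(node)
        if len(labels) < max_len:  # the bound keeps cyclic graphs finite
            for label, dst in out.get(node, []):
                stack.append((dst, labels + [label]))
    return sorted(found)


# "dept", then any number of arcs, then "person".
people = reachable(edges, "home", r"dept/(\w+/)*person")
# "dept" followed by exactly one more arc with any label.
one_hop = reachable(edges, "home", r"dept/\w+")
print(people, one_hop)
```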

20
Semistructured Data Models
  • Irregular data structure; no fixed schema is known in
    advance, and the schema may be implicit in the data
  • schema may be large and may change frequently
  • the schema is descriptive rather than
    prescriptive: it describes the current state of the data,
    but violations of the schema are still tolerated

21
  • data is not strongly typed: for different objects,
    the values of the same attribute may be of
    differing types (heterogeneous sources)
  • no restriction on the set of arcs that emanate
    from a given node in a graph or on the types of
    the values of attributes
  • ability to query the schema: arc variables
    get bound to labels on arcs, rather than to nodes in
    the graph

22
Graph based Query Languages
  • Use graph to model databases
  • support regular path expressions and graph
    construction in queries.
  • Examples
  • GraphLog for hypertext queries
  • graph query language for OO

23
Query Languages for Semi-Structured data
  • Use labeled graphs
  • query the schema of data
  • ability to accommodate irregularities in the
    data, such as missing links etc.
  • Examples: Lorel (Stanford), UnQL (AT&T), StruQL
    (AT&T)

24
Comparison of Query Systems
25
WebSQL (University of Toronto)
  • Model web as relational database
  • Use two relations Document and Anchor
  • Document relation has one tuple for each document
    in the web and the anchor relation has one tuple
    for each anchor in each document
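
The two-relation view can be mimicked with ordinary SQL in an in-memory SQLite database. This sketch is not WebSQL syntax, and the column names (url, title, base, href, label) and the example.org rows are assumptions chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (url TEXT PRIMARY KEY, title TEXT);  -- one row per document
CREATE TABLE anchor (base TEXT, href TEXT, label TEXT);     -- one row per anchor
""")
conn.executemany("INSERT INTO document VALUES (?, ?)", [
    ("http://example.org/", "Home"),
    ("http://example.org/a.html", "Page A"),
    ("http://example.org/b.html", "Page B"),
])
conn.executemany("INSERT INTO anchor VALUES (?, ?, ?)", [
    ("http://example.org/", "http://example.org/a.html", "first"),
    ("http://example.org/", "http://example.org/b.html", "second"),
    ("http://example.org/a.html", "http://example.org/b.html", "next"),
])

# Titles of the documents that the home page links to, with the anchor label.
rows = conn.execute("""
    SELECT d.title, a.label
    FROM anchor AS a JOIN document AS d ON d.url = a.href
    WHERE a.base = ?
    ORDER BY d.title
""", ("http://example.org/",)).fetchall()
print(rows)
```

Unlike this toy, the relations over the real web are virtual and cannot be enumerated in full, which is why a query has to start from known URLs or from an index server.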

26
WebSQL
  • SQL-like query language for extracting
    information from the web.
  • capable of systematic processing of either all
    the links in a page, all the pages that can be
    reached from a given URL through paths that match
    a pattern, or a combination of both.
  • provides transparent access to index servers

27
WebOQL (University of Toronto)
  • provides a framework that supports a large class
    of data restructuring operations.
  • Simple semistructured data model for documents
    and record-based data
  • OQL-like syntax and regular expressions
  • serves as a two-way bridge between databases and
    the Web.

28
WebDB
  • View WWW as multimedia documents in the form of
    web pages
  • WQL supports selection, aggregation, sorting,
    summary, grouping
  • projection on title , URL, keywords, tables,
    forms, images etc.

29
Presentation Overview
  • WHOWEDA - warehouse of web data
  • Research objectives
  • Current research
  • Web Data Model (WICM)
  • Web Algebra
  • Future work

30
If you build it, they will come
  • More chaotic
  • Increasingly difficult to locate information.
  • Related data are scattered in a piecemeal
    fashion
  • Data, data everywhere... but how to find it?

31
How does it affect the corporate world?
  • Lack of credibility of data
  • Different sites with different data
  • Same site different data
  • Historical information is not available
  • Previous versions of web data
  • How does web data change with time
  • Summarization over time
  • Data to information
  • Reduction in productivity
  • Analysis is manual

32
Web Warehouse: Its Business Value
  • Local data warehouse is inadequate
  • Web is hot as a commercial medium
  • Current size
  • Future growth prospects
  • Exceedingly attractive demographics

33
WHOWEDA Research Objectives
  • Build a web warehouse
  • Web information access
  • Web information manipulation
  • Efficient visualization of web information
  • Maintenance of web data
  • Web data mining
  • Overcome existing limitations
  • Provide effective mechanisms to manipulate web
    information

34
WHOWEDA - What?
  • WareHouse Of Web Data
  • Subject - oriented
  • Integrated
  • Temporal
  • Granularity - Lower, higher
  • Some summary
  • Not updatable
  • Alternative information sources

35
WHOWEDA! www.cais.ntu.edu.sg:8000/whoweda
  • A WareHouse Of WEb DAta
  • Web Information Coupling Model (WICM)
  • Web Objects
  • Web Schema
  • Web Information Coupling Algebra
  • Web Information Maintenance
  • Web Mining and Knowledge discovery

36
(Figure: architecture diagram. Labels: User; WWW; Warehouse Concept Mart; Web Querying Analysis Component; Web Information Coupling System; Web Information Maintenance System; Web Information Mining System; Web Warehouse; four Web Marts)
37
(Figure: operator diagram. Labels: User; WWW; Web Query Display; Warehouse Concept Mart; Pre processing; Global Web Manipulation; Global Web Coupling; Global Ranking; Local Web Manipulation; Local Web Coupling; Local Ranking; Web Warehouse; Data Visualization; Schema Tightness; Schema Search; Schema Match; Web Union; Web Select; Web Intersection; Web Project; Web Join)
38
(Figure: Labels: WWW; Warehouse Concept Mart; Global Web Coupling; Webtable (Jan); Webtable (Feb); Webtable (Mar); Webtable (Apr))
39
(Figure: Labels: Webtable (Jan); Webtable (Feb); Webtable (Mar); Webtable (Apr); Lower-level Granularity; Web Information Manipulation Operators; Higher level Granularity; Summarized data)
40
Web Objects
  • Node - url, title, format, size, date, text
  • Link - source-url, target-url, label, link-type
  • Web tuple
  • Web table
  • Web schema
  • Web database
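
The web objects listed above could be modeled along the following lines. The attribute names follow the slide; the class layout, the defaults, and the sample URLs other than the panacea.org home page are assumptions for illustration, not the WHOWEDA implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Node:
    """A web document with the node attributes named on the slide."""
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink with the link attributes named on the slide."""
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""


@dataclass
class WebTuple:
    """A set of inter-linked documents: some nodes plus the links among them."""
    nodes: List[Node] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)


@dataclass
class WebTable:
    """A named collection of web tuples."""
    name: str
    tuples: List[WebTuple] = field(default_factory=list)


home = Node(url="http://www.panacea.org/", title="Panacea")
diseases = Node(url="http://www.panacea.org/diseases.html", title="List of Diseases")  # hypothetical URL
link = Link(source_url=home.url, target_url=diseases.url, label="List of Diseases")
table = WebTable(name="Diseases", tuples=[WebTuple(nodes=[home, diseases], links=[link])])
print(len(table.tuples), table.tuples[0].links[0].label)
```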

41
Web Schema
  • Metadata in the warehouse
  • Structural summary of web table
  • Coupling of related information begins with a
    query graph
  • Query graph -> Web schema

42
Web Information Coupling System
  • A database system to couple related web
    information
  • Web data model
  • Web objects
  • Web schema
  • Web algebra

43
Meta-data in WHOWEDA
  • Web schema
  • Schema tree
  • Information extracted from each web document or
    node
  • URL, size, keywords, links, title, language,
    multi-media details, version history

44
Web Algebra
  • Formal foundation of data representation and
    manipulation in a web warehouse
  • Web operators
  • Information access operator
  • Information manipulation operators
  • Web schema operators
  • Data visualization operators

45
  • Directly querying the WWW is an expensive and
    repetitive affair
  • information is already materialized in different
    web tables in the web database.
  • we need a means to gather this similar information by
    additional manipulation of the materialized web
    tables.

46
Global Coupling - Information Access
  • To integrate data from the Web
  • To create historical data
  • To couple related information from the WWW
    satisfying a web schema
  • Operator to create web tables
  • From web with no schema to web table with web
    schema

47
Global Coupling
  • Match portions of the web that satisfy the web
    schema
  • Input is a query graph
  • Output is a web table
  • Example
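
The slide's own example is not preserved in this transcript. As a stand-in, the following sketch matches a very small query graph (a start node x, one link e, one target node y) against a toy web and returns the matches as a web table. The dictionary layout, the predicate keys, the page URLs below the home page, and the function global_couple are all assumptions.

```python
# Toy web: url -> page text and labeled outgoing links. All data is invented.
web = {
    "http://www.panacea.org/": {
        "text": "health issues",
        "links": [
            ("List of Diseases", "http://www.panacea.org/diseases"),
            ("Drug list", "http://www.panacea.org/drugs"),
        ],
    },
    "http://www.panacea.org/diseases": {"text": "AIDS cancer", "links": []},
    "http://www.panacea.org/drugs": {"text": "Indavir", "links": []},
}

# Query graph x -e-> y: x is bound to a fixed URL, the label of e must contain a keyword.
query = {"x_url": "http://www.panacea.org/", "e_label_contains": "Diseases"}


def global_couple(web, query):
    """Return a web table with one tuple (x, e, y) per match of the query graph."""
    table = []
    x = query["x_url"]
    for label, target in web.get(x, {}).get("links", []):
        if query["e_label_contains"] in label and target in web:
            table.append({"x": x, "e": label, "y": target})
    return table


web_table = global_couple(web, query)
print(web_table)
```

Running the same query graph at different times would yield one web table per run, which is one way to read the monthly web tables shown earlier in the deck.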

48
Information Manipulation
  • Used for analysis of web data in the warehouse
  • Web select
  • Web project
  • Local web coupling
  • Web join
  • Web cartesian product
  • Web union
  • Web intersect

49
Join Processing in Web Databases

50
Web Join
  • Concatenate tuples based on identical nodes or
    documents
  • Inputs are two web tables and their schemas
  • Output is a joined table
  • Types
  • Pi-web join, sigma-web join, outer joins, web
    composition, semi web join

51
Web Join
  • Analyse web tables storing temporal data
  • Used for combining related data from various web
    tables
  • Mechanism to provide summarized information
  • Mechanism to find alternative web document in
    case of Document Not Found error
  • Example

52
Example 1
  • Produce a list of diseases with their symptoms,
    evaluation procedures and treatment starting from
    the web site at http://www.panacea.org/
  • Web table Diseases

53
(Figure: schema graph for the web table Diseases. Labels: http://www.panacea.org/; Issues; List of Diseases; Symptoms list; Symptoms; Treatment list; Treatment; Evaluation; variables e, f, g, p, q, w, x, y, z)
54
(Figure: a web tuple of the web table Diseases. Labels: http://www.panacea.org/; Issues; List of Diseases; AIDS; Symptoms list; Symptoms; Treatment list; Treatment; Evaluation; Elisa Test; instances e1, f1, g1, p2, q1, w1, x0, y1, z1)
55
Example 2
  • Produce a list of drugs, and their uses and side
    effects starting from the web site at
    http://www.panacea.org/
  • Web table Drugs

56
(No Transcript)
57
(Figure: a web tuple of the web table Drugs. Labels: http://www.panacea.org/; Issues; List of Diseases; AIDS; Drug list; Indavir; Use; Uses of Indavir; Side effects; Side effects of Indavir; instances a0, b1, c1, d1, k1, r1, s1)
58
Web Join Operator
  • Information manipulation operator
  • Manipulate information residing in a web database
    to derive additional information
  • Harness useful, composite information from two
    web tables
  • Capitalize on the reuse of retrieved data from
    the WWW in order to reduce execution time of
    queries

59
Joinable Nodes
  • Node variables participating in the web join
    process
  • Expressed as a pair
  • Each node in the pair should have identical URLs

60
Web Join
  • Combine two web tables by concatenating a web
    tuple of one web table with a web tuple of other
    web table whenever there exist joinable nodes
  • Joinable nodes are identified from the schemas of
    the two web tables
  • URLs of the joinable nodes are identical
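
A minimal sketch of this concatenation step, assuming each web tuple is reduced to a mapping from node variable to URL. The variable names echo the Diseases and Drugs examples, but the URLs below the home page and the choice of (z, b) as the joinable pair are invented.

```python
# Each web tuple maps a node variable to the URL it is bound to (invented data).
diseases = [
    {"x": "http://www.panacea.org/", "z": "http://www.panacea.org/aids.html"},
    {"x": "http://www.panacea.org/", "z": "http://www.panacea.org/cancer.html"},
]
drugs = [
    {
        "a": "http://www.panacea.org/",
        "b": "http://www.panacea.org/aids.html",
        "d": "http://www.panacea.org/indavir.html",
    },
]


def web_join(left, right, joinable):
    """Concatenate tuples whose joinable node pairs are bound to identical URLs."""
    joined = []
    for lt in left:
        for rt in right:
            if all(lt[lv] == rt[rv] for lv, rv in joinable):
                joined.append({**lt, **rt})  # variable names are assumed disjoint
    return joined


single = web_join(diseases, drugs, [("z", "b")])             # single-node join
multi = web_join(diseases, drugs, [("x", "a"), ("z", "b")])  # multi-node join
print(len(single), len(multi))
```

In the slides the joinable pairs are derived from the two schemas; here they are simply passed in by hand.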

61
Flavors of Web Join
  • Natural Web Join
  • Theta Web Join (web join followed by web select)
  • Examples
  • Further flavors
  • Single-node join
  • Multi-node join

62
(Figure: the schemas of Diseases and Drugs joined at http://www.panacea.org/. Labels: Issues; List of Diseases; Symptoms list; Symptoms; Treatment list; Treatment; Evaluation; Drug list; Use; Uses; Side effects; variables b, c, d, e, f, g, k, p, q, r, s, w, x, y, z)
63
(Figure: a joined web tuple. Labels: http://www.panacea.org/; AIDS; Symptoms of AIDS; AIDS treatment; Evaluation; Elisa Test; Indavir; Uses of Indavir; Side effects of Indavir; instances b1, c1, d1, e1, f1, g1, k1, p2, q1, r1, s1, w1, x0, y1, z1)
64
Join Existence
  • Given two web tables, we determine if these two
    web tables are joinable
  • Inspect the schemas of the web tables
  • Satisfy joinability conditions based on
  • node predicates
  • link predicates
  • node and link predicates
  • locus of a node relative to a joinable node

65
Join Construction
  • To construct a joined schema, we construct
  • node set
  • link set
  • connectivity set
  • predicate set
  • Construction of joined table
  • Concatenating the web tuples of the two input
    tables over the joinable nodes

66
Web Select
  • Extracts web tuples from web tables satisfying
    certain conditions on node and link variables and
    on connectivities
  • Input is a select schema
  • Output is a web table satisfying the select schema
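
In its simplest form, web select is a filter over web tuples. In this sketch the select schema is reduced to a dictionary of per-variable predicates; that reduction and the sample tuples are assumptions, and conditions on link variables or connectivities are left out.

```python
# Web tuples with node variables bound to page titles (invented data).
web_table = [
    {"x": "http://www.panacea.org/", "y": "AIDS", "z": "Symptoms of AIDS"},
    {"x": "http://www.panacea.org/", "y": "Cancer", "z": "Symptoms of Cancer"},
]


def web_select(table, predicates):
    """Keep the tuples in which every named variable satisfies its predicate."""
    return [
        t for t in table
        if all(var in t and pred(t[var]) for var, pred in predicates.items())
    ]


selected = web_select(web_table, {"y": lambda title: title == "AIDS"})
print(selected)
```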

67
Web Couple: Coupling web information

68
Why web coupling?
  • Related information in the web is supplied by
    different information providers.
  • Web documents containing similar information can
    reside in different web tables in Web Database.

69
Why web coupling?
  • The web couple operator gives us the capability
    to manipulate these web tables to harness useful
    related information.

70
Web Couple Operator
  • Web couple operator is a composite operator.
  • combination of Web Cartesian Product followed by
    Web Select.
  • Web cartesian product followed by a web select is
    a frequently used operation.
  • motivates us to create a separate composite
    operator to handle this.
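
The composition can be written out directly. The helper names and the sample tuples below are invented; the point is only that the couple operator is the Cartesian product followed by a selection.

```python
from itertools import product

# Node variables bound to page titles; all data is invented for illustration.
diseases = [{"y": "AIDS"}, {"y": "Cancer"}]
drugs = [{"b": "AIDS"}, {"b": "Cancer"}, {"b": "Malaria"}]


def web_cartesian_product(left, right):
    """Pair every tuple of left with every tuple of right (n * m results)."""
    return [{**lt, **rt} for lt, rt in product(left, right)]


def web_select(table, condition):
    """Keep the tuples that satisfy the condition."""
    return [t for t in table if condition(t)]


def web_couple(left, right, condition):
    """Composite operator: Cartesian product followed by select."""
    return web_select(web_cartesian_product(left, right), condition)


all_pairs = web_cartesian_product(diseases, drugs)
coupled = web_couple(diseases, drugs, lambda t: t["y"] == t["b"])
print(len(all_pairs), coupled)
```

A dedicated operator also leaves room to avoid materializing the full product, although this sketch does not attempt that.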

71
  • Val(p) is the operand of the op(p).

72
Definitions
  • Coupling nodes: We define coupling nodes as node
    variables participating in the web coupling.
  • We express the coupling nodes of two web schemas
    as a pair, e.g. (c, z), since they cannot exist as
    a single node variable.

73
Definitions
  • One coupling node variable can be in more than
    one pair. That is, the pairs of coupling nodes
    need not be disjoint.
  • The attribute of the coupling node as defined in
    the predicate of the node is called coupling
    attribute.
  • The predicate is called the coupling predicate.

74
Web Coupling
75
Types of web coupling
  • Single node coupling: Web coupling in which only one
    node variable in each schema is involved.
  • Multinode coupling: When more than one node
    variable in each schema participates in the web
    coupling.

76
Types of web coupling
  • System driven web coupling: In this case the
    system decides which node variables are to
    be coupled (coupling nodes). If not even one pair of
    coupling nodes can be identified, then the web
    tables cannot be coupled.

77
Types of web coupling
  • User driven web coupling: In this case the user
    decides which node variables are to be
    coupled (coupling nodes).
  • Coupling is performed only on those user
    specified node variable(s).

78
Types of web coupling
  • Attribute driven web coupling: In this case the
    user specifies the coupling attributes.
  • Coupling is performed only on those user
    specified coupling attribute(s).

79
Types of web coupling
  • Value driven web coupling: In this case the user
    specifies the values of the attributes of the
    nodes on which coupling should be performed.
  • Coupling is performed only on those user
    specified attribute values.

80
Levels of web coupling
  • Schema level web coupling.
  • Tuple level web coupling.

81
Schema level web coupling
  • We inspect the schemas to decide whether the two
    web tables can be coupled.
  • If coupling conditions cannot be identified then
    the two web tables cannot be coupled.
  • We do not inspect the web tuples in the web table.

82
Schema level web coupling
  • Let n and m be the number of web tuples of the
    two input web tables. Then the coupled web table
    based on schema level web coupling will always
    have n × m web tuples.

83
Tuple level web coupling
  • We inspect the web tuples of the two input web
    tables to identify nodes with similar
    information.
  • The number of web tuples in the coupled web table
    < n × m

84
Why two levels?
  • A schema does not capture all the information of
    the web documents in a web table, so it is not always
    possible to identify coupling conditions by
    inspecting the schemas.
  • it is possible to find coupling nodes
    which are not defined in the schemas.

85
Why two levels?
  • Tuple level coupling gives us a means to correlate
    web documents containing similar information from
    the web tables (that cannot be identified from
    their schemas) at the expense of additional
    processing.

86
Conditions for web coupling
  • URLs with same directory name such as
    /computer/ may contain similar information.
  • Paths with /cgi-bin/ are not considered.
  • Include all conditions for web join.
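
The two URL conditions above can be checked with the standard urllib.parse module. The function names and the example.* URLs are hypothetical, and treating any shared directory name as a hint of similar content is this sketch's reading of the slide.

```python
import posixpath
from urllib.parse import urlparse


def directories(url):
    """Directory names on the path of a URL, without the final file name."""
    path = urlparse(url).path
    return {part for part in posixpath.dirname(path).split("/") if part}


def may_couple(url_a, url_b):
    """True if the URLs share a directory name and neither path goes through cgi-bin."""
    dirs_a, dirs_b = directories(url_a), directories(url_b)
    if "cgi-bin" in dirs_a or "cgi-bin" in dirs_b:
        return False  # paths with /cgi-bin/ are not considered
    return bool(dirs_a & dirs_b)


same_dir = may_couple(
    "http://a.example/computer/index.html",
    "http://b.example/pub/computer/list.html",
)
cgi = may_couple(
    "http://a.example/cgi-bin/computer/query",
    "http://b.example/computer/x.html",
)
print(same_dir, cgi)
```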

87
Construction of coupled schema (schema level)
  • When at least one pair of coupling nodes is
    identical (same URL).
  • When none of the pairs is identical.

88
Web Bags
  • Existence of identical web tuples.
  • Created due to web project operation.
  • Structure based mining
  • Used for discovering
  • Visible nodes
  • Luminous nodes
  • Luminous paths
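
How a projection gives rise to a web bag can be sketched as follows. The five tuples reuse the URLs that appear in the web table figures later in the deck, with node variables x, y, z; the tuple representation and the function web_project are assumptions.

```python
from collections import Counter

home = "http://www.panacea.org/"
web_table = [
    {"x": home, "y": "Diseases", "z": "http://www.cancer.org/desc.html"},
    {"x": home, "y": "Diseases", "z": "http://www.cancer.org/desc.html"},
    {"x": home, "y": "Diseases", "z": "http://www.disease.com/cancer/skin.htm"},
    {"x": home, "y": "Diseases", "z": "http://www.cancer.org/desc.html"},
    {"x": home, "y": "Diseases", "z": "http://www.jhu.edu/medical/research/cancer.htm"},
]


def web_project(table, keep):
    """Keep only the listed node variables; duplicates are NOT removed."""
    return [tuple((var, t[var]) for var in keep) for t in table]


bag = web_project(web_table, ["z"])   # web bag: identical tuples may repeat
counts = Counter(bag)                 # multiplicity of each projected tuple
distinct = list(dict.fromkeys(bag))   # after removal of identical tuples
print(len(bag), len(distinct))
```

The multiplicities in counts are the information that would be lost if the projection removed duplicates immediately.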

89
Definitions
  • Visibility of a web document or node D in a web
    table W measures the number of different web
    documents in W that have links to D
  • Luminosity - Reverse of visibility, the number of
    other distinct documents that are linked from D
  • Luminous paths - a set of inter-linked nodes
    which occurs a number of times in a web table
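
Read literally, the first two definitions count distinct linking documents. The sketch below follows that literal reading over a toy web table whose tuples are lists of (source, target) links; the data and the function names are invented, and the slides may intend a normalized or thresholded measure instead.

```python
# A web table as a list of web tuples; each tuple is a list of (source, target) links.
web_table = [
    [("home", "a"), ("a", "d")],
    [("home", "b"), ("b", "d")],
    [("home", "a"), ("a", "d")],  # identical tuple, as can occur in a web bag
    [("home", "c"), ("c", "e")],
]


def visibility(table, doc):
    """Number of different documents in the table that link to doc."""
    return len({src for tup in table for src, dst in tup if dst == doc and src != doc})


def luminosity(table, doc):
    """Number of other distinct documents that are linked from doc."""
    return len({dst for tup in table for src, dst in tup if src == doc and dst != doc})


vis_d = visibility(web_table, "d")
lum_home = luminosity(web_table, "home")
print(vis_d, lum_home)
```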

90
Web Schema
(Figure: web schema graph. Labels: http://www.panacea.org/; Diseases; Cancer; variables e, f, x, y, z)
91
Web Table
(Figure: web table with five web tuples. Labels: http://www.panacea.org/; Diseases; Cancer; http://www.cancer.org/desc.html; instances e0, f0, x0, y0 in every tuple; z1, z1, z2, z1, z4 across the five tuples)
92
Projected schema
93
Web Table after eliminating x and y
(Figure: web table diagram; surviving label: Cancer)
94
Projected schema
(Figure: projected schema graph. Labels: http://www.panacea.org/; Diseases; Cancer; variables e, x, y, z)
95
Web Bag
(Figure: web bag with five tuples. Labels: http://www.panacea.org/; Diseases; Cancer; x0; y0; z4; URLs http://www.cancer.org/desc.html (three times), http://www.disease.com/cancer/skin.htm, http://www.jhu.edu/medical/research/cancer.htm)
96
After removal of identical tuples
(Figure: surviving label: http://www.cancer.org/desc.html)
97
(Figure: Labels: Cancer; z1; http://www.cancer.org/desc.html; http://www.disease.com/cancer/skin.htm; http://www.jhu.edu/medical/research/cancer.htm)
98
(Figure: surviving label: http://www.cancer.org/desc.html)
99
Visible Nodes
(Figure: four tuples labeled Cancer. z1: http://www.cancer.org/desc.html; z2: http://www.disease.com/cancer/skin.htm; z1: http://www.cancer.org/desc.html; z4: http://www.jhu.edu/medical/research/cancer.htm)
100
Luminous Paths
101
More Operators . . .
  • Web schema operators
  • Schema tightness operator, Schema match operator,
    Schema search operator
  • Data visualization operators
  • Ranking operators (Global and Local), Web Nest, Web
    Un-nest, Web Coalesce, Web Expand, Web Pack, Web
    Unpack, Web Sort

102
Partitioning of web tables
  • Partitioning web tables
  • restructured easily
  • indexed easily
  • monitored easily
  • reorganized easily
  • By
  • time
  • schema tree structure
  • keywords

103
Warehouse Concept Mart (WCMart)
  • Subject oriented
  • Concept generation.
  • Manual -> Autonomous.
  • Used for
  • Ranking tuples
  • Global web coupling
  • Content based mining

104
Data Mining in Web Warehouse
  • Scalability of data
  • Text mining
  • Mining information from multiple web tables
  • Interactive web mining
  • Discover rules
  • Web Bag
  • Warehouse Concept Mart

105
Summary
  • Current status
  • Mechanism for accessing and manipulating web
    information in WHOWEDA
  • Implementing various web operators
  • Future research
  • What types of information can be summarized?
  • What types of knowledge can be mined?
  • Refine web warehouse architecture
  • www.cais.ntu.edu.sg:8000/whoweda