Web Data Management - PowerPoint PPT Presentation

About This Presentation
Title:

Web Data Management

Description:

No efficient document management. Query results cannot be further manipulated ... Language - good for publishing deeply structured document ... – PowerPoint PPT presentation

Slides: 67
Provided by: sanjay70
Learn more at: https://web.mst.edu
Tags: data | management | web

Transcript and Presenter's Notes

Title: Web Data Management


1
Class Number CS 412
Web Data MGMT and XML
Instructor Sanjay Madria
Lesson Title - Introduction
2
  • The link for the Real Player live stream is
  • http://movie.umr.edu/ramgen/encoder/liveCS412F03.rm
  • The link to view the archived Real Player lecture
    at 28 and 56 kbps is
  • http://movie.umr.edu/ramgen/CoursesF02/CS412F03/CS412Lec082803kbs2856.rm
  • (The lecture date section 082803 will change for
    each produced class)
  • The link to view the Real Player archived lecture
    at 200 kbps is
    http://movie.umr.edu/ramgen/CoursesF03/CS412F03/CS412Lec082803kbs200.rm
  • For example, to watch the Real Player lecture for
    15 September, change the date section to get
    CS412Lec091503kbs200.rm

3
Web Data Management and XML
  • Sanjay Kumar Madria
  • Department of Computer Science
  • University of Missouri-Rolla
  • madrias_at_umr.edu

4
WWW
  • Huge, widely distributed, heterogeneous
    collection of semi-structured multimedia
    documents in the form of web pages connected via
    hyperlinks.

5
World Wide Web
  • Web is fast growing
  • More business organizations putting information
    in the Web
  • Business on the highway
  • Myriad of raw data to be processed for information

6
As the WWW grows, the more chaotic it becomes
  • Web is fast growing, distributed,
    non-administered global information resource
  • WWW allows access to text, image, video, sound
    and graphic data
  • More business organizations creating web servers
  • More chaotic environment to locate information of
    interest
  • Lost in hyperspace syndrome

7
Characteristics of WWW
  • WWW is a set of directed graphs
  • Data in the WWW is heterogeneous,
    self-describing and schema-less
  • Unstructured information, deeply nested
  • No central authority to manage information
  • Dynamic versus static information
  • Web information discovery - search engines

8
Web is Growing!
  • In 1994, the WWW grew by 1758%!
  • June 1993 - 130
  • June 1994 - 1265
  • Dec. 1994 - 11,576
  • April 1995 - 15,768
  • July 1995 - 23,000
  • 2000 - !!!!!

9
COM domains are increasing!
  • As of July 1995, 6.64 million host computers on
    the Internet
  • 1.74 million are com domains
  • 1.41 million are edu domains
  • 0.30 million are net
  • 0.27 million are gov
  • 0.22 million are mil
  • 0.20 million are org

10
The number of Internet hosts exceeded...
  • 1,000 in 1984
  • 10,000 in 1987
  • 100,000 in 1989
  • 1,000,000 in 1992
  • 10,000,000 in 1996
  • 100,000,000 in 2000

11
Top web countries
  • 1. Canada (1) 80
  • 2. US (4) 140
  • 3. Ireland (3) 110
  • 4. Iceland (2) 68
  • 5. UK (14) 336
  • 6. Malta (5) 155
  • 7. Australia (6) 133
  • 8. Singapore (11) 207
  • 9. New Zealand (7) 101
  • 10. Sweden (9) 101
  • 11. Israel (12) 112
  • 12. Cyprus (8) 72
  • 13. Hong Kong (15) 148
  • 14. Norway (10) 64
  • 15. Switzerland (13) 75
  • 16. Denmark (16) 105

12
How users find web sites
  • Indexes and search engines 75
  • UseNet newsgroups 44
  • Cool lists 27
  • New lists 24
  • Listservers 23
  • Print ads 21
  • Word-of-mouth and e-mail 17
  • Linked web advertisement 4

13
Limitations of Search Engines
  • Do not exploit hyperlinks
  • Search is limited to string matching
  • Queries are evaluated on archived data rather
    than up-to-date data (no indexing on current data)
  • Low accuracy
  • Replicated results
  • No further manipulation possible

14
Limitations of Search Engines
  • ERROR 404!
  • No efficient document management
  • Query results cannot be further manipulated
  • No efficient means for knowledge discovery

15
More PROBLEMS
  • Specifying/understanding what information is
    wanted
  • High degree of variability of accessible
    information
  • Variability in conceptual vocabulary or
    ontology used to describe information
  • Complexity of querying unstructured data

16
  • Complexity of querying structured data
  • Uncontrolled nature of web-based information
    content
  • Determining which information sources to
    search/query

17
Search Engine Capabilities
  • Selection of language
  • Keywords with disjunction, adjacency, presence,
    absence, ...
  • Word stemming (Hotbot)
  • Similarity search (Excite)
  • Natural language (LycosPro)
  • Restrict by modification date (Hotbot) or range
    of dates (Alta Vista)
  • Restrict result types (e.g., must include images)
    (Hotbot)
  • Restrict by geographical source (content or
    domain) (Hotbot)
  • Restrict within various structured regions of a
    document (titles or URLs) (Lycos Pro) (summary,
    first heading, title, URL) (Opentext)

18
SEARCH RETRIEVAL
  • Search Engines

Search engine      Web covered
Hotbot             34%
AltaVista          28%
Northern Light     20%
Excite             14%
Infoseek           10%
Lycos              3%
  • using several search engines is better than
    using only one
  • Source: Lawrence, S., and Giles, C.L., Searching
    the World Wide Web, Science 280, pp. 98-100,
    1998.

19
Schemes to locate information
  • Supervised links between sites
    (library analogue: ask at the reference desk)
  • Classification of documents
    (library analogue: search in the catalog)
  • Automated searching
    (library analogue: wander around the library)

20
The most popular search engines
  • Year 2000: AltaVista, Yahoo, HotBot
  • Year 2001: Google, NorthernLight, AltaVista

21
Boolean search in AltaVista
22
Specifying field content in HotBot
23
Natural language interface in AskJeeves
24
Three examples of search strategies
  • Rank web pages based on popularity
  • Rank web pages based on word frequency
  • Match query to an expert database
  • All the major search engines use a mixed
    strategy in ranking web pages and responding to
    queries

25
Rank based on word frequency
  • Library analogue: keyword search
  • Basic factors in HotBot ranking of pages:
  • words in the title
  • keyword meta tags
  • word frequency in the document
  • document length
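
The four factors listed above can be combined into a single score. The following is only a minimal sketch: the slides do not say how HotBot weighted these factors, so the weights, the `score` function and the sample pages are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical weights: the slides name the factors, not how they were combined.
TITLE_WEIGHT = 3.0
META_WEIGHT = 2.0

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, title, meta_keywords, body):
    """Score one page for a query using title, meta tags, frequency and length."""
    title_words = set(tokenize(title))
    meta_words = set(tokenize(meta_keywords))
    body_words = tokenize(body)
    counts = Counter(body_words)
    length = max(len(body_words), 1)
    total = 0.0
    for term in tokenize(query):
        total += TITLE_WEIGHT * (term in title_words)   # word in the title
        total += META_WEIGHT * (term in meta_words)     # keyword meta tag
        total += counts[term] / length                  # frequency, normalised by length
    return total

# Hypothetical pages: (title, meta keywords, body text).
pages = {
    "a.html": ("French wine guide", "wine, france",
               "wine regions of France and wine tasting notes"),
    "b.html": ("Travel diary", "travel", "we drank some wine in Paris"),
}
ranked = sorted(pages, key=lambda p: score("wine", *pages[p]), reverse=True)
print(ranked)
```

Dividing the term count by the document length keeps long pages from winning simply because they contain more words.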

26
Alternative word frequency measures
  • Excite uses a thesaurus to search for what you
    want, rather than what you ask for
  • AltaVista allows you to look for words that occur
    within a set distance of each other
  • NorthernLight weighs results by search term
    sequence, from left to right
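
The proximity search attributed to AltaVista above can be illustrated with a small sketch. It is not AltaVista's implementation; the function name `within_distance` and the sample sentence are assumptions made for illustration.

```python
import re

def within_distance(text, word_a, word_b, max_gap):
    """Return True if word_a and word_b occur at most max_gap word positions apart."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

# Hypothetical document text.
doc = "Linux user groups meet in Iceland"
print(within_distance(doc, "linux", "iceland", 5))
print(within_distance(doc, "linux", "iceland", 3))
```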

27
Rank based on popularity
  • Library analogue: citation index
  • The Google strategy for ranking pages
  • Rank is based on the number of links to a page
  • Pages with a high rank have many other web
    pages that link to them
  • The formula is on the Google help page ?
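
As a rough illustration of ranking by the number of links to a page, the sketch below simply counts inbound links. This is a deliberate simplification of the idea on this slide, not Google's actual formula; the link list and the function name `inlink_rank` are hypothetical.

```python
from collections import Counter

def inlink_rank(links):
    """Order pages by how many distinct other pages link to them."""
    # Duplicate links and self-links are ignored.
    counts = Counter(dst for src, dst in set(links) if src != dst)
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda page: (-counts[page], page))
    return ranked, counts

# Hypothetical (source page, destination page) links.
links = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c"), ("d", "a"), ("c", "c")]
ranked, counts = inlink_rank(links)
print(ranked, dict(counts))
```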

28
More on popularity ranking
  • The Google philosophy is also applied by others,
    such as NorthernLight
  • HotBot measures the popularity of a page by how
    frequently users have clicked on it in past
    search results

29
Expert databases: Yahoo!
  • An expert database contains predefined responses
    to common queries
  • A simple approach is a subject directory, e.g. in
    Yahoo!, which contains a selection of links for
    each topic
  • The selection is small, but can be useful
  • Library analogue: trustworthy references

30
Expert databases: AskJeeves
  • AskJeeves has predefined responses to various
    types of common queries
  • These prepared answers are augmented by a
    meta-search, which searches other search engines
  • Library analogue: reference desk

31
Best wines in France: AskJeeves
32
Best wines in France: HotBot
33
Best wines in France: Google
34
Linux in Iceland: Google
35
Linux in Iceland: HotBot
36
Linux in Iceland: AskJeeves
37
Web Data Management is the Key
38
Key Objectives
  • Design a suitable data model to represent web
    information
  • Development of web algebra and query language,
    query optimization
  • Maintenance of Web data - View Maintenance
  • Development of knowledge discovery and web mining
    tools
  • Web warehouse
  • Web data integration, secondary storage,
    indexes

39
Limitations of the Web Today
  • Applications cannot consume HTML
  • HTML wrapper technology is brittle
  • Companies merge and need interoperability fast

40
Paradigm Shift
  • New Web standard: XML
  • XML generated by applications and consumed by
    applications
  • Data exchange
  • Across platforms: enterprise interoperability
  • Across enterprises
  • Web: from documents to data

41
Database challenges
  • Query optimization and processing
  • Views and transformations
  • Data warehousing and data integration
  • Mediators and query rewriting
  • Secondary storage
  • Indexes

42
DBMS needs a paradigm shift too
  • Web data differs from database data
  • self-describing, schema-less
  • structure changes without notice
  • heterogeneous, deeply nested, irregular
  • documents and data mixed
  • Designed by document experts, not database experts
  • Need web data management

43
Web Data Representation
  • HTML - Hypertext Markup Language
  • fixed grammar, no regular expressions
  • Simple representation of data
  • good for simple data and intended for human
    consumption
  • difficult to extract information
  • SGML - Standard Generalized Markup Language -
    good for publishing deeply structured documents
  • XML - Extensible Markup Language - a subset of SGML

44
Terminology
  • HTML - Hypertext Mark-up Language
  • HTTP - Hypertext Transfer Protocol
  • URL - Uniform Resource Locator
  • example - <URL> = <protocol>://<host>/<path>/<filename>#<location>
    where
  • <protocol> is http, ftp, gopher
  • <host> is the internet address
  • <location> is a textual label in the file.
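
The URL components named on this slide can be pulled apart with Python's standard `urllib.parse` module. A minimal sketch follows; the host and path come from a URL quoted elsewhere in these slides, while the `#intro` location label is an assumption added to show the location part.

```python
from urllib.parse import urlsplit

# "#intro" is a hypothetical location label, added only for illustration.
url = "http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html#intro"
parts = urlsplit(url)

print("protocol:", parts.scheme)    # http, ftp, gopher, ...
print("host:    ", parts.netloc)    # internet address
print("path:    ", parts.path)      # path and file name
print("location:", parts.fragment)  # textual label in the file
```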

45
  • Links are specified as
  • <A HREF="Destination URL">Anchor Text</A>
  • Destination URL is the URL of the destination
    document and Anchor Text is the text that appears
    as an anchor when displayed.
  • Example
  • <A HREF="http://www.ntu.edu.sg/">Nanyang
    Technological University</A>
  • Absolute and relative URLs
  • <A HREF="AtlanticStates/NYStats.html">New
    York</A> is a relative address
  • <A HREF="http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html">NCSA's
    Beginner's Guide to HTML</A> is an absolute address
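
To make the difference between relative and absolute links concrete, the sketch below extracts anchors from a page and resolves each destination against the page's own URL, using only the standard library. The base URL `http://example.org/usa/index.html` and the class name `LinkExtractor` are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from <A HREF=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # destination URL of the anchor being read
        self._text = []     # pieces of its anchor text

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # urljoin leaves absolute URLs alone and resolves relative ones.
            url = urljoin(self.base_url, self._href)
            self.links.append((url, "".join(self._text).strip()))
            self._href = None

page = ('<A HREF="AtlanticStates/NYStats.html">New York</A> '
        '<A HREF="http://www.ntu.edu.sg/">Nanyang Technological University</A>')
extractor = LinkExtractor("http://example.org/usa/index.html")  # hypothetical base
extractor.feed(page)
extractor.close()
print(extractor.links)
```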

46
World Wide Web
  • Prevalent, persistent and informative
  • HTML documents (soon, XML) created by humans or
    applications.
  • Accessed day in and day out by humans and
    applications.
  • Persistent HTML documents!!!

Can database technology help?
47
Current Research Projects
  • Web Query Systems: W3QS, WebSQL, AKIRA, NetQL,
    RAW, WebLog, Araneus
  • Semistructured Data Management: LOREL, UnQL,
    WebOQL, Florid
  • Website Management Systems: STRUDEL, Araneus
  • Web Warehouse: WHOWEDA, Xylem.com

48
Main Tasks
  • Modeling and Querying the Web
  • view the web as a directed graph
  • content-based and link-based queries
  • example - find the pages that contain the word
    "clinton" and have a link from a page containing
    the word "monica".
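
A query that mixes content and link conditions, like the example above, can be sketched over a tiny in-memory web graph. The page names, texts and the function `linked_pages` are hypothetical; a real web query system would fetch and index pages instead.

```python
# Hypothetical web graph: page -> its text and outgoing links.
pages = {
    "p1": {"text": "an article mentioning monica", "links": ["p2", "p3"]},
    "p2": {"text": "a page about clinton", "links": []},
    "p3": {"text": "a page about wine", "links": ["p2"]},
    "p4": {"text": "another clinton page", "links": []},
}

def contains(page, word):
    """Content condition: the page text contains the word."""
    return word in page["text"].lower().split()

def linked_pages(pages, source_word, target_word):
    """Pages containing target_word that are linked from a page containing source_word."""
    result = set()
    for page in pages.values():
        if contains(page, source_word):
            for target in page["links"]:              # link condition
                if target in pages and contains(pages[target], target_word):
                    result.add(target)
    return sorted(result)

print(linked_pages(pages, "monica", "clinton"))
```

Here `p4` contains the target word but is not returned, because no page containing the source word links to it.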

49
  • Information extraction and integration
  • Wrapper - a program that extracts a structured
    representation of the data (a set of tuples) from
    HTML pages.
  • Mediator - data integration software that
    accesses multiple sources through a uniform interface
  • Web Site Construction and Restructuring
  • creating sites
  • modeling the structure of web sites
  • restructuring data
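
The wrapper idea on this slide, a program that turns HTML pages into a set of tuples, can be sketched with the standard library. The HTML snippet, the movie rows and the class name `TableWrapper` are hypothetical, and the sketch assumes the data sits in plain table rows.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Extract one tuple per table row from the <td> cells of an HTML page."""

    def __init__(self):
        super().__init__()
        self.rows = []     # extracted tuples
        self._row = None   # cells of the row being read
        self._cell = None  # text pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows have only <th> cells and are skipped
                self.rows.append(tuple(self._row))
            self._row = None

# Hypothetical page of a movie database.
html = """
<table>
  <tr><th>Title</th><th>Director</th></tr>
  <tr><td>Film A</td><td>Director X</td></tr>
  <tr><td>Film B</td><td>Director Y</td></tr>
</table>
"""
wrapper = TableWrapper()
wrapper.feed(html)
wrapper.close()
print(wrapper.rows)
```

A wrapper like this depends on the exact page layout, so it stops working as soon as the site changes its markup.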

50
What to Model
  • Structure of Web sites
  • Internal structure of web pages
  • Contents of web sites in finer granularities

51
Data Representation of Web Data
  • Graph Data Models
  • Semistructured Data Models (also graph based)

52
Graph Data Model
  • Labeled graph data model where node represents
    web pages and arcs represent links between pages.
  • Labels on arcs can be viewed as attribute names.
  • Regular path expression queries
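
A regular path expression query over a labeled graph can be sketched as follows. The expression is restricted here to a sequence of labels, each optionally starred (zero or more repetitions), which is only a fragment of full regular path expressions. The graph, its labels and the function `eval_path` are hypothetical.

```python
def eval_path(graph, start, steps):
    """Return the nodes reachable from start along a path matching steps.

    graph: {node: [(arc_label, target_node), ...]}
    steps: list of (label, starred); a starred label matches zero or more arcs.
    """
    frontier = [(start, 0)]   # state = (current node, index of next step)
    seen = set(frontier)
    results = set()
    while frontier:
        node, i = frontier.pop()
        if i == len(steps):
            results.add(node)
            continue
        label, starred = steps[i]
        next_states = []
        if starred:
            next_states.append((node, i + 1))   # zero repetitions
        for arc_label, target in graph.get(node, []):
            if arc_label == label:
                next_states.append((target, i if starred else i + 1))
        for state in next_states:
            if state not in seen:               # also guards against cycles
                seen.add(state)
                frontier.append(state)
    return results

# Hypothetical site graph: nodes are pages, arc labels act as attribute names.
web = {
    "home": [("dept", "cs")],
    "cs": [("link", "a"), ("paper", "p0")],
    "a": [("link", "b")],
    "b": [("paper", "p1"), ("link", "a")],
}
# dept.link*.paper : follow "dept", then any number of "link" arcs, then "paper".
print(eval_path(web, "home", [("dept", False), ("link", True), ("paper", False)]))
```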

53
Semistructured Data Models
  • Irregular data structure; no fixed schema is known
    in advance, and it may be implicit in the data
  • Schema may be large and may change frequently
  • Schema is descriptive rather than prescriptive: it
    describes the current state of the data, but
    violations of the schema are still tolerated

54
  • Data is not strongly typed: for different objects,
    the values of the same attribute may be of
    differing types (heterogeneous sources)
  • No restriction on the set of arcs that emanate
    from a given node in a graph or on the types of
    the values of attributes
  • Ability to query the schema: arc variables
    get bound to labels on arcs, rather than to nodes
    in the graph
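
The points above can be illustrated with nested Python dictionaries standing in for a labeled graph. The records and the helper `label_paths` are hypothetical; the sketch shows one attribute holding values of different types, and a query whose answer is a set of labels rather than a set of nodes.

```python
# Hypothetical records: "name" and "phone" differ in type and structure.
people = [
    {"name": "Alice", "phone": 5551234},
    {"name": {"first": "Bob", "last": "Lee"}, "phone": "555-9876"},
]

def label_paths(obj, predicate, path=()):
    """Yield the arc-label paths that lead to atomic values satisfying predicate."""
    if isinstance(obj, dict):
        for label, child in obj.items():
            yield from label_paths(child, predicate, path + (label,))
    elif isinstance(obj, list):
        for child in obj:
            yield from label_paths(child, predicate, path)
    elif predicate(obj):
        yield path

# Schema query: under which labels does the value "Bob" appear?
bob_paths = set(label_paths(people, lambda value: value == "Bob"))
# The same attribute carries differently typed values in different objects.
phone_types = {type(person["phone"]).__name__ for person in people}
print(bob_paths, phone_types)
```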

55
Graph based Query Languages
  • Use graph to model databases
  • Support regular path expressions and graph
    construction in queries.
  • Examples:
  • GraphLog for hypertext queries
  • graph query languages for object-oriented (OO)
    databases

56
Query Languages for Semi-Structured data
  • Use labeled graphs
  • Query the schema of data
  • Ability to accommodate irregularities in the
    data, such as missing links etc.
  • Examples: Lorel (Stanford), UnQL (AT&T), STRUQL
    (AT&T)

57
Comparison of Query Systems
58
Types of Query Languages
  • First Generation
  • Second generation

59
First Generation Query Languages
  • Combine the content-based queries of search
    engines with structure-based queries
  • Combine conditions on text pattern in documents
    with graph pattern describing link structures
  • Examples - W3QL (Technion, Israel), WebSQL
    (Toronto), WebLOG (Concordia)

60
Second generation languages
  • Called web data manipulation languages
  • Web pages as atomic objects with properties that
    they contain or do not contain certain text
    patterns and they point to other objects
  • Useful for data wrapping, transformation, and
    restructuring
  • Useful for web site transformation and
    restructuring
  • Access to the internal structure of web pages:
    for example, extracting a set of tuples from the
    web pages of a movie database requires parsing
    the pages and selectively accessing certain
    subtrees of the parse tree

61
How they Differ?
  • Provide access to the structure of web objects
    they manipulate - return structure
  • Model internal structures of web documents as
    well as the external links that connect them
  • Support references to model hyperlinks, and some
    support ordered collections of records for
    more natural data representation
  • Ability to create new complex structures as a
    result of a query

62
Examples
  • WebOQL
  • STRUQL
  • Florid

63
Information Integration
  • To answer queries that may require extracting and
    combining data from multiple web sources
  • Example - movie database: data about movies,
    their star casts, directors, schedules, etc.
  • "Give me the playing times and a review of
    movies starring Frank Sinatra that are playing
    tonight in Paris"

64
Approaches
  • Web warehouse: data from multiple web sources is
    loaded into a warehouse, and all queries are
    applied to the warehouse data
  • Advantage - performance improvement
  • Disadvantage - the warehouse needs to be updated
    when the data sources change
  • Virtual warehouse: data remain in the web
    sources, and queries are decomposed at run time
    into queries to the sources
  • Data is not replicated and is fresh
  • Due to the autonomy of web sources, query
    optimization and execution methodology may differ
    and performance may be affected
  • Good when the number of sources is large, data
    changes frequently, and there is little control
    over the web sources
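
The trade-off between the two approaches can be sketched with toy classes. This is an illustration only: the class names, the `refresh` method and the sample rows are assumptions, and real sources would be remote web sites rather than in-memory lists.

```python
class Source:
    """A hypothetical autonomous web source holding rows."""
    def __init__(self, rows):
        self.rows = list(rows)

    def query(self, predicate):
        return [row for row in self.rows if predicate(row)]

class Warehouse:
    """Materialized approach: copy the data once, query the local copy."""
    def __init__(self, sources):
        self.sources = sources
        self.refresh()

    def refresh(self):
        # Must be re-run whenever a source changes.
        self.rows = [row for source in self.sources for row in source.rows]

    def query(self, predicate):
        return [row for row in self.rows if predicate(row)]   # no source access

class VirtualWarehouse:
    """Virtual approach: decompose each query into queries to the sources."""
    def __init__(self, sources):
        self.sources = sources

    def query(self, predicate):
        return [row for source in self.sources for row in source.query(predicate)]

source = Source([{"title": "Film A"}])
warehouse = Warehouse([source])
virtual = VirtualWarehouse([source])

source.rows.append({"title": "Film B"})       # the source changes
stale = warehouse.query(lambda row: True)     # still the old copy
fresh = virtual.query(lambda row: True)       # sees the new row
warehouse.refresh()
print(len(stale), len(fresh), len(warehouse.query(lambda row: True)))
```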

65
Virtual approach versus DBMS
  • In the virtual approach, the system does not
    communicate directly with a storage manager;
    instead it communicates with wrappers
  • Second, the user does not pose queries directly in
    the schema in which the data is stored, so the
    user is free from knowing that structure
  • The user poses queries to a mediated schema:
    virtual relations (not stored anywhere) designed
    for a particular application

66
Steps in data integration
  • Specification of mediated schema and
    reformulation: the mediated schema is the set of
    collection and attribute names needed to
    formulate queries
  • The data integration system translates the query
    on the mediated schema into queries to the data
    sources
  • Completeness of data in web sources
  • Differing query processing capabilities
  • Query optimization: selecting a minimal set of
    sources and minimal queries
  • Wrapper construction
  • Matching objects across sources
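
Several of these steps can be seen together in a small mediator sketch: two wrappers map source-specific attribute names onto a mediated schema, and a query on the mediated schema is answered by matching objects across the sources. All source data, attribute names and function names are hypothetical; titles are matched by a simple lowercase comparison, which is an assumption rather than a general solution.

```python
# Hypothetical sources, each with its own attribute names.
schedule_source = [
    {"film": "Film A", "showtime": "20:00", "city": "Paris"},
    {"film": "Film B", "showtime": "21:30", "city": "Paris"},
    {"film": "Film A", "showtime": "19:00", "city": "Rome"},
]
review_source = [
    {"movie_title": "film a", "cast": ["Frank Sinatra"], "text": "A classic."},
    {"movie_title": "film b", "cast": ["Someone Else"], "text": "Average."},
]

# Wrappers translate each source into the mediated schema's attribute names.
def schedule_wrapper():
    for row in schedule_source:
        yield {"title": row["film"], "time": row["showtime"], "city": row["city"]}

def review_wrapper():
    for row in review_source:
        yield {"title": row["movie_title"], "stars": row["cast"], "review": row["text"]}

def query_mediated(star, city):
    """Mediated query: (title, time, review) of movies with this star playing in this city."""
    # Sub-query to the review source, keyed by a normalized title.
    reviews = {r["title"].lower(): r for r in review_wrapper() if star in r["stars"]}
    answers = []
    # Sub-query to the schedule source, joined on the normalized title.
    for s in schedule_wrapper():
        match = reviews.get(s["title"].lower())
        if match is not None and s["city"] == city:
            answers.append((s["title"], s["time"], match["review"]))
    return answers

print(query_mediated("Frank Sinatra", "Paris"))
```

The user only sees the mediated attribute names `title`, `time` and `review`, and never the names used inside either source.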