1
Crawling the Hidden Web
  • by
  • Michael Weinberg
  • mwmw@cs.huji.ac.il

Internet DB Seminar,
The Hebrew University of Jerusalem, School
of Computer Science and Engineering,
December 2001
2
Agenda
  • Hidden Web - what is it all about?
  • Generic model for a hidden Web crawler
  • HiWE (Hidden Web Exposer)
  • LITE Layout-based Information Extraction
    Technique
  • Results from experiments conducted to test these
    techniques

3
Web Crawlers
  • Automatically traverse the Web graph, building a
    local repository of the portion of the Web that
    they visit
  • Traditionally, crawlers have only targeted a
    portion of the Web called the publicly indexable
    Web (PIW)
  • PIW: the set of pages reachable purely by
    following hypertext links, ignoring search forms
    and pages that require authentication

4
The Hidden Web
  • Recent studies show that a significant fraction
    of Web content in fact lies outside the PIW
  • Large portions of the Web are hidden behind
    search forms in searchable databases
  • HTML pages are dynamically generated in response
    to queries submitted via the search forms
  • Also referred to as the Deep Web

5
The Hidden Web Growth
  • The Hidden Web continues to grow as organizations with large amounts of
    high-quality information place their content online, providing
    web-accessible search facilities over existing databases
  • For example:
  • Census Bureau
  • Patents and Trademarks Office
  • News media companies
  • InvisibleWeb.com lists over 10,000 such databases

6
Surface Web (figure)
7
Deep Web (figure)
8
Deep Web Content Distribution (figure)
9
Deep Web Stats
  • The Deep Web is 500 times larger than the PIW!
  • Contains 7,500 terabytes of information (March
    2000)
  • More than 200,000 Deep Web sites exist
  • Sixty of the largest Deep Web sites collectively
    contain about 750 terabytes of information
  • 95% of the Deep Web is publicly accessible (no fees)
  • Google indexes about 16% of the PIW, so we search only about 0.03% of
    the pages available today

10
The Problem
  • Hidden Web contains large amounts of high-quality
    information
  • The information is buried on dynamically
    generated sites
  • Search engines that use traditional crawlers
    never find this information

11
The Solution
  • Build a hidden Web crawler
  • Can crawl and extract content from hidden
    databases
  • Enable indexing, analysis, and mining of hidden
    Web content
  • The content extracted by such crawlers can be
    used to categorize and classify the hidden
    databases

12
Challenges
  • Significant technical challenges in designing a
    hidden Web crawler
  • Should interact with forms that were designed
    primarily for human consumption
  • Must provide input in the form of search queries
  • How do we equip the crawler with input values for use in constructing
    search queries?
  • To address these challenges, we adopt the
    task-specific, human-assisted approach

13
Task-Specificity
  • Extract content based on the requirements of a
    particular application or task
  • For example, consider a market analyst interested in press releases,
    articles, etc., pertaining to the semiconductor industry and dated
    within the last ten years

14
Human-Assistance
  • Human-assistance is critical to ensure that the
    crawler issues queries that are relevant to the
    particular task
  • For instance, in the semiconductor example, the
    market analyst may provide the crawler with lists
    of companies or products that are of interest
  • The crawler will be able to gather additional
    potential company and product names as it
    processes a number of pages

15
Two Steps
  • There are two steps in achieving our goal
  • Resource discovery: identify sites and databases that are likely to be
    relevant to the task
  • Content extraction: actually visit the identified sites to submit
    queries and extract the hidden pages
  • In this presentation we do not directly address
    the resource discovery problem

16
Hidden Web Crawlers
17
User form interaction (figure)
  • (1) Download form page from the Web query front-end
  • (2) View form
  • (3) Fill out form
  • (4) Submit form
  • (5) Download response page generated from the hidden database
  • (6) View result
18
Operation Model
  • Our model of a hidden Web crawler consists of
    four components
  • Internal Form Representation
  • Task-specific database
  • Matching function
  • Response Analysis
  • Form page: the page containing the search form
  • Response page: the page received in response to a form submission

19
Generic Operational Model (figure)
  • The crawler downloads the form page and runs form analysis to build the
    Internal Form Representation
  • Match computes a set of value assignments from the representation and
    the task-specific database
  • Form submission sends each value assignment to the Web query front-end
  • The response page is downloaded and passed to Response Analysis
20
Internal Form Representation
  • Form F = ({E1, E2, ..., En}, S, M) (sketched in code below)
  • {E1, ..., En} is a set of n form elements
  • S: submission information associated with the form
  • submission URL
  • internal identifiers for each form element
  • M: meta-information about the form
  • web site hosting the form
  • set of pages pointing to this form page
  • other text on the page besides the form

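As an illustration, this representation might be captured by data structures like the following minimal Python sketch; the class and field names are ours, not from the HiWE implementation:

```python
from dataclasses import dataclass, field

@dataclass
class FormElement:
    name: str                        # internal identifier from S (e.g., HTML "name")
    label: str | None = None         # descriptive text for the element, if any
    domain: list[str] | None = None  # finite domain, or None for an infinite one

@dataclass
class Form:
    elements: list[FormElement]               # {E1, ..., En}
    submission_url: str                       # part of S
    method: str = "GET"                       # part of S (assumed field)
    meta: dict = field(default_factory=dict)  # M: site, referring pages, page text
```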
21
Task-specific Database
  • The crawler is equipped with a task-specific
    database D
  • Contains the necessary information to formulate
    queries relevant to the particular task
  • In the market analyst example, D could contain
    list of semiconductor company and product names
  • The actual format and organization of D are specific to a particular
    crawler implementation
  • HiWE uses a set of labeled fuzzy sets

22
Matching Function
  • Matching algorithm properties
  • Input: internal form representation and current contents of the
    database D
  • Output: set of value assignments [(E1, v1), ..., (En, vn)], which
    associates a value vi with each element Ei (interface sketched below)

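In code, the matching function's contract might look like this stub (names are illustrative; HiWE's concrete strategy appears in the slides that follow):

```python
def match(form: Form, database: dict) -> list[list[tuple[str, str]]]:
    """Given the internal form representation and the task-specific
    database D, return a set of value assignments, each a list of
    (element name, value) pairs [(E1, v1), ..., (En, vn)]."""
    raise NotImplementedError  # see HiWE's label matching and ranking below
```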
23
Response Analysis
  • Module that stores the response page in the
    repository
  • Attempts to distinguish between pages containing
    search results and pages containing error
    messages
  • This feedback is used to tune the matching
    function

24
Traditional Performance Metric
  • Traditional crawlers' performance metrics:
  • Crawling speed
  • Scalability
  • Page importance
  • Freshness
  • These metrics are relevant to hidden web
    crawlers, but do not capture the fundamental
    challenges in dealing with the Hidden Web

25
New Performance Metrics
  • Coverage metric
  • (relevant pages extracted) / (relevant pages present in the targeted
    hidden databases)
  • Problem: difficult to estimate how much of the hidden content is
    relevant to the task

26
New Performance Metrics
  • Ntotal: the total number of forms that the crawler submits
  • Nsuccess: number of submissions which result in a response page with
    one or more search results
  • Submission efficiency: SE = Nsuccess / Ntotal
  • Problem: the crawler is penalized if the database didn't contain any
    relevant search results

27
New Performance Metrics
  • Nvalid: number of semantically correct form submissions
  • Strict submission efficiency: SEstrict = Nvalid / Ntotal
  • Penalizes the crawler only if a form submission is semantically
    incorrect
  • Problem: difficult to evaluate, since a manual comparison is needed to
    decide whether a form submission is semantically correct

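Written out (the symbol names are our reconstruction from the descriptions above), the two submission-efficiency metrics are:

```latex
SE = \frac{N_{\mathrm{success}}}{N_{\mathrm{total}}}, \qquad
SE_{\mathrm{strict}} = \frac{N_{\mathrm{valid}}}{N_{\mathrm{total}}}
```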
28
Design Issues
  • What information about each form element should
    the crawler collect?
  • What meta-information is likely to be useful?
  • How should the task-specific database be
    organized, updated and accessed?
  • What Match function is likely to maximize
    submission efficiency?
  • How to use the response analysis module to tune
    the Match function?

29
HiWE: Hidden Web Exposer
30
Basic Idea
  • Extract descriptive information (label) for each
    element of a form
  • Task-specific database is organized in terms of
    categories, each of which is also associated with
    labels
  • Matching function attempts to match form labels to database categories
    to compute a set of candidate value assignments

31
HiWE Architecture (figure)
  • URL List: URL 1, URL 2, ..., URL N
  • LVS Table: (Label1, Value-Set1), (Label2, Value-Set2), ...,
    (Labeln, Value-Setn)
  • Crawl Manager drives crawling over the WWW; the Parser feeds it newly
    discovered links
  • Form Analyzer, Form Processor and Response Analyzer handle form
    submission, the response, and feedback
  • LVS Manager connects the LVS table to custom data sources
32
HiWE's Main Modules
  • URL List
  • contains all the URLs the crawler has discovered
    so far
  • Crawl Manager
  • controls the entire crawling process
  • Parser
  • extracts hypertext links from the crawled pages
    and adds them to the URL list
  • Form Analyzer, Form Processor, Response Analyzer
  • Together implement the form processing and
    submission operations

33
HiWE's Main Modules
  • LVS Manager
  • Manages additions and accesses to the LVS table
  • LVS table
  • HiWE's implementation of the task-specific database

34
HiWE's Form Representation
  • Form F = ({E1, ..., En}, S, M)
  • The third component M is an empty set, since the current implementation
    of HiWE does not collect any meta-information about the form
  • For each element Ei, HiWE collects a domain Dom(Ei) and a label
    label(Ei)

35
HiWEs Form Representation
  • Domain of an element
  • Set of values which can be associated with the
    corresponding form element
  • May be a finite set (e.g., domain of a selection
    list)
  • May be an infinite set (e.g., domain of a text box)
  • Label of an element
  • The descriptive information associated with the
    element, if any
  • Most forms include some descriptive text to help
    users understand the semantics of the element

36
Form Representation - Figure
Element E1
Label(E1) = "Document Type"
Dom(E1) = {Articles, Press Releases, Reports}
Element E2
Label(E2) = "Company Name"
Dom(E2) = {s | s is a text string}
Element E3
Label(E3) = "Sector"
Dom(E3) = {Entertainment, Automobile, Information Technology, Construction}
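Using the sketch classes from slide 20, this form could be instantiated as follows (the element names and URL are hypothetical):

```python
form = Form(
    elements=[
        FormElement("doc_type", label="Document Type",
                    domain=["Articles", "Press Releases", "Reports"]),
        FormElement("company", label="Company Name"),   # infinite domain (text box)
        FormElement("sector", label="Sector",
                    domain=["Entertainment", "Automobile",
                            "Information Technology", "Construction"]),
    ],
    submission_url="http://example.com/search",  # hypothetical
)
```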
37
HiWE's Task-specific Database
  • Task-specific information is organized in terms
    of a finite set of concepts or categories
  • Each concept has one or more labels and an
    associated set of values
  • For example, the label "Company Name" could be associated with the set
    of values {IBM, Microsoft, HP, ...}

38
HiWE's Task-specific Database
  • The concepts are organized in a table called the Label Value Set (LVS)
    table
  • Each entry in the LVS is of the form (L, V)
  • L: label
  • V: fuzzy set of values
  • The fuzzy set V has an associated membership function MV that assigns
    weights, in the range [0, 1], to each member of the set
  • MV(v) is a measure of the crawler's confidence that the assignment of v
    to an element E is semantically meaningful

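A minimal way to picture the LVS table is a mapping from labels to fuzzy value sets, each value carrying its weight MV(v) in [0, 1]. This is only a sketch; HiWE's actual storage format is not specified here:

```python
lvs = {
    "company name": {"IBM": 1.0, "Microsoft": 1.0, "HP": 0.8},
    "state":        {"California": 1.0, "New York": 0.9},
}
```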
39
HiWE's Matching Function
  • For elements with a finite domain
  • The set of possible values is fixed and can be exhaustively enumerated
  • In this example, the crawler can first retrieve all relevant articles,
    then all relevant press releases, and finally all relevant reports

Element E1
Label(E1) = "Document Type"
Dom(E1) = {Articles, Press Releases, Reports}
40
HiWE's Matching Function
  • For elements with an infinite domain
  • HiWE textually matches the labels of these elements with labels in the
    LVS table
  • For example, if a textbox element has the label "Enter State", which
    best matches an LVS entry with the label "State", the values associated
    with that LVS entry (e.g., "California") can be used to fill the textbox
  • How do we match form labels with LVS labels?

41
Label Matching
  • Two steps in matching form labels with LVS labels:
  • 1. Normalization: includes conversion to a common case and standard
    style
  • 2. Use of an approximate string matching algorithm to compute minimum
    edit distances
  • HiWE employs D. Lopresti and A. Tomkins' string matching algorithm,
    which takes word reordering into account (a simplified sketch follows)

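The two steps might be sketched as below. One substitution to note: HiWE uses the Lopresti-Tomkins block-edit algorithm, which handles word reordering; this sketch uses Python's difflib similarity ratio as a simple stand-in, so the threshold is a similarity score rather than an edit distance:

```python
import re
from difflib import SequenceMatcher

def normalize(label: str) -> str:
    """Step 1: convert to a common case and style (lowercase,
    strip punctuation, collapse whitespace)."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()

def label_match(form_label, lvs, threshold=0.7):
    """Step 2 (simplified): return the LVS label most similar to the
    form label, or None (playing the role of nil) if no entry is
    close enough."""
    best, best_score = None, 0.0
    for lvs_label in lvs:
        score = SequenceMatcher(None, normalize(form_label),
                                normalize(lvs_label)).ratio()
        if score > best_score:
            best, best_score = lvs_label, score
    return best if best_score >= threshold else None
```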
42
Label Matching
  • Let LabelMatch(Ei) denote the LVS entry (L, V) with the minimum
    distance to label(Ei)
  • Threshold σ
  • If all LVS entries are more than σ edit operations away from
    label(Ei), then LabelMatch(Ei) = nil

43
Label Matching
  • For each element Ei, compute a pair (Vi, Mi)
  • If Ei has an infinite domain and (L, V) is the closest matching LVS
    entry, then Vi = V and Mi = MV
  • If Ei has a finite domain, then Vi = Dom(Ei) and Mi(v) = 1 for all v
  • The set of value assignments is computed as the product of all the
    Vi's (sketch below)
  • Too many assignments?

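Combining the two cases, the candidate assignments can be computed as a cross product, as in this sketch. It builds on the illustrative classes above; `label_match` is the matcher sketched on the previous slide:

```python
from itertools import product

def value_assignments(form, lvs, label_match):
    """For each element Ei, build its weighted value set (Vi, Mi), then
    take the cross product over all elements to form the candidates."""
    per_element = []
    for e in form.elements:
        if e.domain is not None:                       # finite domain: Mi(v) = 1
            weighted = [(e.name, v, 1.0) for v in e.domain]
        else:                                          # infinite domain: use LVS
            entry = label_match(e.label, lvs)
            values = lvs.get(entry, {}) if entry else {}
            weighted = [(e.name, v, w) for v, w in values.items()]
        per_element.append(weighted)
    return list(product(*per_element))  # can blow up, hence the ranking below
```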
44
Ranking Value Assignments
  • HiWE employs an aggregation function to compute a
    rank for each value assignment
  • Uses a configurable parameter, a minimum acceptable value assignment
    rank (ρmin)
  • The intent is to improve submission efficiency by
    only using high-quality value assignments
  • We will show three possible aggregation functions

45
Fuzzy Conjunction
  • The rank of a value assignment is the minimum of
    the weights of all the constituent values.
  • Very conservative in assigning ranks. Assigns a
    high rank only if each individual weight is high

46
Average
  • The rank of a value assignment is the average of
    the weights of the constituent values
  • Less conservative than fuzzy conjunction

47
Probabilistic
  • This ranking function treats weights as probabilities
  • Mi(vi) is the likelihood that the choice of vi is useful, and
    1 - Mi(vi) is the likelihood that it is not
  • The likelihood of a value assignment being useful is then
    1 - Π(1 - Mi(vi))
  • Assigns a low rank only if all the individual weights are very low

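The three aggregation functions, written directly from the descriptions above (a sketch; each takes the weights Mi(vi) of one value assignment, and only assignments ranking at least ρmin are used):

```python
from functools import reduce

def rank_fuzzy(weights):
    """Fuzzy conjunction: the minimum of the constituent weights."""
    return min(weights)

def rank_average(weights):
    """Average: the mean of the constituent weights."""
    return sum(weights) / len(weights)

def rank_probabilistic(weights):
    """Probabilistic: 1 - prod(1 - w), treating weights as probabilities."""
    return 1.0 - reduce(lambda acc, w: acc * (1.0 - w), weights, 1.0)

# For weights (0.9, 0.4): fuzzy -> 0.4, average -> 0.65, probabilistic -> 0.94
```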
48
Populating the LVS Table
  • HiWE supports a variety of mechanisms for adding
    entries to the LVS table
  • Explicit Initialization
  • Built-in entries
  • Wrapped data sources
  • Crawling experience

49
Explicit Initialization
  • Supply labels and associated value sets at
    startup time
  • Useful to equip the crawler with labels that the
    crawler is most likely to encounter
  • In the semiconductor example, we supply HiWE with a list of relevant
    company names and associate the list with the labels "Company" and
    "Company Name"

50
Built-in Entries
  • HiWE has built-in entries for commonly used
    concepts
  • Dates and Times
  • Names of months
  • Days of week

51
Wrapped Data Sources
  • LVS Manager can query data sources through a
    well-defined interface
  • The data source must be wrapped by a program that supports two kinds
    of queries (interface sketched below)
  • Given a set of labels, return a value set
  • Given a set of values, return other values that belong to the same
    value set

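The two-query wrapper contract might be expressed as an interface like this (the method names are illustrative, not from HiWE):

```python
from abc import ABC, abstractmethod

class WrappedDataSource(ABC):
    @abstractmethod
    def values_for_labels(self, labels: set[str]) -> set[str]:
        """Given a set of labels, return a value set."""

    @abstractmethod
    def expand_values(self, values: set[str]) -> set[str]:
        """Given a set of values, return other values belonging to the
        same value set."""
```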
52
HiWE Architecture (figure; repeated from slide 31)
53
Crawling Experience
  • Finite domain form elements are a useful source
    of labels and associated value sets
  • HiWE adds this information to the LVS table
  • Effective when a similar label is associated with a finite domain
    element in one form and with an infinite domain element in another

54
Computing Weights
  • A new value added to the LVS must be assigned a suitable weight
  • Explicit initialization and built-in values have fixed weights
  • Values obtained from external data sources or through the crawler's
    own activity are assigned weights that vary with time

55
Initial Weights
  • For external data sources: computed by the respective wrappers
  • For values directly gathered by the crawler:
  • Finite domain element E with Dom(E) = {v1, ..., vn}
  • Each v in Dom(E) receives weight 1
  • Three cases arise when incorporating Dom(E) into the LVS table

56
Updating LVS Case 1
  • Crawler successfully extracts label(E) and computes LabelMatch(E) =
    (L, V)
  • Replace the (L, V) entry by the entry (L, V ∪ Dom(E))
  • Intuitively, Dom(E) provides new elements to the value set and boosts
    the weights of existing elements

57
Updating LVS Case 2
  • Crawler successfully extracts label(E), but LabelMatch(E) = nil
  • A new entry (label(E), Dom(E)) is created in the LVS

58
Updating LVS Case 3
  • Crawler cannot extract label(E)
  • For each LVS entry (L, V), compute a score measuring how well Dom(E)
    matches V
  • Identify the entry with the maximum score
  • Replace that entry with a new entry that incorporates Dom(E)
  • The confidence (weight) of the new values is derived from that score

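Cases 1 and 2 can be sketched as follows. The weight boost in case 1 is an illustrative simplification of ours, and case 3's overlap scoring is omitted, since the exact formulas are not reproduced in these slides:

```python
def update_lvs(lvs, element, label_match):
    """Fold a finite domain Dom(E) into the LVS (cases 1 and 2 only)."""
    if element.label is None:
        return  # case 3 (no label extracted): overlap scoring omitted here
    entry = label_match(element.label, lvs)
    if entry is not None:
        # Case 1: replace (L, V) with (L, V union Dom(E)); existing values
        # get their weights boosted, new values enter the fuzzy set.
        for v in element.domain:
            lvs[entry][v] = min(1.0, lvs[entry].get(v, 0.0) + 0.5)  # illustrative boost
    else:
        # Case 2: create a new entry (label(E), Dom(E)).
        lvs[element.label] = {v: 1.0 for v in element.domain}
```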
59
Configuring HiWE
  • Initialization of the crawling activity includes
  • Set of sites to crawl
  • Explicit initialization for the LVS table
  • Set of data sources
  • Label matching threshold (σ)
  • Minimum acceptable value assignment rank (ρmin)
  • Value assignment aggregation function

60
Introducing LITE
  • Layout-based Information Extraction Technique
  • Physical Layout of a page is also used to aid in
    extraction
  • For example, a piece of text that is physically
    adjacent to a form element is very likely a
    description of that element
  • Unfortunately, this semantic association is not always reflected in
    the underlying HTML of the Web page

61
Layout-based Information Extraction Technique (figure)
62
The Challenge
  • Accurate extraction of the labels and domains of
    form elements
  • Elements that are visually close on the screen may be separated
    arbitrarily in the actual HTML text
  • Even when HTML provides a facility for expressing semantic
    relationships, it is not used in the majority of pages
  • Accurate page layout is a complex process
  • Even a crude approximate layout of portions of a page can yield very
    useful semantic information

63
Form Analysis in HiWE
  • LITE-based heuristic
  • Prune the form page and isolate the elements that directly influence
    the layout
  • Approximately lay out the pruned page using a custom layout engine
  • Identify the pieces of text that are physically closest to the form
    element (these are the label candidates)
  • Rank each candidate using a variety of measures
  • Choose the highest-ranked candidate as the label (a proximity sketch
    follows below)

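The proximity step could be sketched as follows, assuming the partial layout has produced bounding boxes for the form element and nearby text pieces; the function name and the (x, y, w, h) box format are illustrative, and the real heuristic also ranks candidates by other measures:

```python
import math

def nearest_label(element_box, text_boxes):
    """Pick the text piece physically closest to a form element.
    element_box is (x, y, w, h); text_boxes is a list of (text, box)."""
    def center(box):
        x, y, w, h = box
        return (x + w / 2, y + h / 2)

    ex, ey = center(element_box)
    best, best_dist = None, math.inf
    for text, box in text_boxes:
        tx, ty = center(box)
        dist = math.hypot(ex - tx, ey - ty)  # Euclidean distance of centers
        if dist < best_dist:
            best, best_dist = text, dist
    return best
```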
64
Pruning Before Partial Layout (figure)
65
LITE - Figure
  • Key idea in LITE
  • Physical page layout embeds significant semantic information

Pipeline: the DOM Parser produces a DOM representation (accessed via the
DOM API); pruning yields the pruned page (list of elements and submission
info); partial layout extracts labels and domain values, which form the
Internal Form Representation
66
Experiments
  • A number of experiments were conducted to study
    the performance of HiWE
  • We will see how performance depends on
  • Minimum form size
  • Crawler input to LVS table
  • Different ranking functions

67
Parameter Values for Task 1
  • Task 1
  • "News articles, reports, press releases and white papers relating to
    the semiconductor industry, dated sometime in the last ten years"

68
Variation of Performance with Minimum Form Size (figure)
69
Effect of Crawler Input to LVS (figure)
70
Different Ranking Functions
Different Ranking Functions
  • When using ρfuz and ρavg, the crawler's submission efficiency is
    mostly above 80%
  • ρprob performs poorly
  • ρavg submits more forms than ρfuz (it is less conservative)

71
Label Extraction
  • The LITE-based heuristic achieved an overall accuracy of 93%
  • The test set was manually analyzed

72
Conclusion
  • Addressed the problem of extending current-day
    crawlers to build repositories that include pages
    from the Hidden Web
  • Presented a simple operational model of a hidden Web crawler
  • Described the implementation of a prototype
    crawler HiWE
  • Introduced a technique for Layout-based
    information extraction

73
Bibliography
  • S. Raghavan and H. Garcia-Molina. Crawling the Hidden Web. Stanford
    University, 2001
  • BrightPlanet.com white papers
  • D. Lopresti and A. Tomkins. Block Edit Models for Approximate String
    Matching