Automating the Extraction of Data Behind Web Forms

Provided by: saih
Learn more at: https://www.deg.byu.edu

Transcript and Presenter's Notes

1
Automating the Extraction of Data Behind Web Forms
  • Brigham Young University
  • Sai Ho Yau

2
Hurdles Against Automating Data Extraction
  • An enormous amount of information is available
    on the Web, but it is difficult to extract the
    data automatically for several reasons:
  • Web information is stored in databases.
  • The databases sit behind form interfaces.
  • Relevant information can be obtained only after a
    Web form is filled out and submitted.

3
Problems Dealing with Forms
  • No general Web form design
  • Required text fields
  • One form may lead to another
  • Resulting information embedded within forms
  • Returned error messages versus valid data
  • Elimination of possible duplicate data

4
Motivations
We want to automatically:
  • Fill in Web forms.
  • Extract information behind forms.
  • Screen out errors.
  • Eliminate duplicate data and merge resulting
    information.

5
The Framework

6
Method: Construct the Query String

7
Method: Construct the Query String

8
Method: Construct the Query String
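A minimal Python sketch of constructing a query string, assuming the form submits with GET and that its field names and chosen values have already been collected. The function name build_query_url, the example URL, and the field names are hypothetical and are not taken from the presentation.

```python
from urllib.parse import urlencode, urljoin


def build_query_url(page_url, action, fields):
    """Build the URL a browser would request when a GET form is submitted.

    page_url: URL of the page that contains the form.
    action:   value of the form's action attribute (may be relative).
    fields:   mapping of form-field name -> chosen value.
    """
    # Resolve a relative action against the page that holds the form.
    base = urljoin(page_url, action)
    # Percent-encode names and values and join them as name=value pairs.
    query = urlencode(fields)
    # Append with "&" if the action already carries a query string.
    separator = "&" if "?" in base else "?"
    return base + separator + query


url = build_query_url(
    "http://example.com/search/form.html",
    "results.cgi",
    {"make": "ford", "max price": "5000"},
)
print(url)
```

Forms that submit with POST would send the same encoded pairs in the request body instead of the URL.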

9
Returned Web Page

10
Solutions
  • Two phases to deal with the many possible
    responses to a query:
  • Sampling phase
  • Exhaustive phase

Assuming no HTTP error

11
Sampling Phase
  • Submit the default form.
  • Randomly select N form-field settings and submit
    the form N times.
  • If there is no new information, STOP and send the
    result downstream (N is set so that the
    probability of subsequent submissions yielding
    new data is less than 5%).
  • Otherwise, ENTER the Exhaustive Phase.
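
The sampling steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the presenter's implementation: the names sample_size and sampling_phase are hypothetical, submit stands in for whatever code sends the form and returns the records on the result page, and the rule used to pick N (a fixed chance p_new that a random setting yields new data) is an assumption, since the slides do not say how N is derived.

```python
import math
import random


def sample_size(p_new, risk=0.05):
    """Smallest N with (1 - p_new) ** N <= risk.

    Assumption: each random setting independently has probability p_new
    of returning data not seen before.
    """
    return math.ceil(math.log(risk) / math.log(1.0 - p_new))


def sampling_phase(fields, submit, n, rng=random):
    """Submit the default form, then n randomly chosen settings.

    fields: mapping of field name -> list of allowed values; the first
            value is treated as the default (an assumption).
    submit: callable taking a setting dict and returning an iterable of
            hashable records found on the returned page.
    Returns (records, found_new). found_new is True when any random
    submission returned a record that earlier submissions did not.
    """
    default = {name: values[0] for name, values in fields.items()}
    seen = set(submit(default))
    found_new = False
    for _ in range(n):
        setting = {name: rng.choice(values) for name, values in fields.items()}
        new_records = set(submit(setting)) - seen
        if new_records:
            found_new = True
            seen |= new_records
    return seen, found_new
```

When found_new is False the caller would stop and pass records downstream; when it is True the caller would move on to the Exhaustive Phase.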

12
Exhaustive Phase
  • Estimate the total time and quantity of data.
  • If below threshold, exhaustively obtain the rest
    of the information.
  • Otherwise, return the results of the sampling and
    report to the user the estimate of time and
    quantity of data.
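
One way to make the threshold decision concrete is sketched below in Python. The estimate used here is an assumption: it multiplies the number of possible form-field combinations by the average time and response size observed during sampling. The slides do not specify the estimator, and the names estimate_exhaustive_cost and choose_action are hypothetical.

```python
import math


def estimate_exhaustive_cost(fields, avg_seconds, avg_bytes):
    """Estimate the cost of submitting every combination of field values.

    fields:      mapping of field name -> list of allowed values.
    avg_seconds: mean time per submission observed while sampling.
    avg_bytes:   mean size of a returned page observed while sampling.
    Returns (combinations, total_seconds, total_bytes).
    """
    combinations = math.prod(len(values) for values in fields.values())
    return combinations, combinations * avg_seconds, combinations * avg_bytes


def choose_action(fields, avg_seconds, avg_bytes, max_seconds, max_bytes):
    """Return "exhaustive" if both estimates are below their thresholds,
    otherwise "report", together with the estimates to show the user."""
    _, seconds, size = estimate_exhaustive_cost(fields, avg_seconds, avg_bytes)
    if seconds < max_seconds and size < max_bytes:
        return "exhaustive", (seconds, size)
    return "report", (seconds, size)
```

The sketch counts all combinations rather than only those not yet submitted during sampling, so it slightly overestimates the remaining work.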

13
Data Retrieving Strategy
  • Locate possible duplicate information in
    subsequently retrieved Web pages during the
    Sampling and Exhaustive Phases.

14
Retrieved Web Pages

15
Data Retrieving Strategy
  • Locate possible duplicate information in
    subsequently retrieved Web pages during the
    Sampling and Exhaustive Phases.
  • Discard duplicates and merge new information.

16
Duplicates Discarded and New Information Merged

17
Data Retrieving Strategy
  • Locate possible duplicate information in
    subsequently retrieved Web pages during the
    Sampling and Exhaustive Phases.
  • Discard duplicates and merge new information.
  • Send fully merged data downstream for data
    extraction.
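
The discard-and-merge step can be illustrated with a short Python sketch. It assumes each retrieved page has already been split into text records and that two records are duplicates when they match after whitespace and letter case are normalized. That matching rule and the names normalize and merge_pages are assumptions made for illustration; the presentation does not state how duplicates are detected.

```python
def normalize(record):
    """Collapse runs of whitespace and ignore letter case for comparison."""
    return " ".join(record.split()).casefold()


def merge_pages(pages):
    """Merge records from several pages, keeping the first copy of each.

    pages: iterable of pages, each an iterable of record strings.
    Returns the records in first-seen order with duplicates discarded.
    """
    seen = set()
    merged = []
    for page in pages:
        for record in page:
            key = normalize(record)
            # Skip empty records and anything already merged.
            if key and key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


merged = merge_pages([
    ["Ford  Focus 2001", "Honda Civic"],
    ["honda civic", "Toyota Corolla"],
])
print(merged)
```

The merged list is what would be sent downstream for data extraction.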

18
Conclusions
We can automate the data extraction process by
automatically:
  • Filling in Web forms.
  • Retrieving information behind forms.
  • Handling errors.
  • Filtering duplicate data and merging the resulting
    information.