Automating the Extraction of Data Behind Web Forms

Provided by: saih
Learn more at: https://www.deg.byu.edu

1
Automating the Extraction of Data Behind Web
Forms
  • by
  • Sai Ho Yau
  • Brigham Young University

2
Introduction
  • An enormous amount of information is available
    on the Web, but it is difficult to extract the
    data automatically for several reasons:
  • Web information is stored in databases
  • Form interfaces
  • Relevant information can be obtained only after a
    Web form is filled out and submitted

3
Problems Dealing with Forms
  • No general Web form design
  • Required text fields
  • One form may lead to another
  • Resulting information embedded within forms
  • Returned error messages versus valid data
  • Elimination of possible duplicate data

4
The Framework

5
Tools
  • Languages and Internet browser used
  • JavaScript, Java, PHP, MySQL
  • Microsoft Internet Explorer
  • Platform
  • Solaris Intel (Unix), with Sun Java.

6
Method: Construct the Query String

7
Method: Construct the Query String
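
The slides do not spell out how the query string is built, so the following is only a minimal sketch of the general idea: take the form's action URL and a set of field settings, URL-encode the settings, and append them as a GET query. The function name build_query_url, the example URL, and the field names are hypothetical and do not come from the presentation.

```python
from urllib.parse import urlencode


def build_query_url(action_url, field_values):
    """Append URL-encoded form-field settings to a form's action URL (GET style)."""
    query = urlencode(field_values)
    # Reuse an existing query part if the action URL already has one.
    separator = "&" if "?" in action_url else "?"
    return f"{action_url}{separator}{query}"


# Hypothetical form action and field settings, for illustration only.
url = build_query_url(
    "http://example.com/search",
    {"make": "Ford", "max price": "5000"},
)
print(url)
```

A form that uses POST would send the same encoded string in the request body instead of the URL.
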
8
The Goal
Automatically extract data behind Web forms.
The system:
  • Fills in HTML forms
  • Retrieves data
  • Eliminates duplicates

9
Returned Web Page

10
Suggested Solution
  • Two phases to deal with many possible responses
    to a query
  • Sampling phase
  • Exhaustive phase

Assuming no HTTP error

11
Sampling Phase
  • Submit the default form.
  • Randomly select N form-field settings and submit
    the form N times.
  • If no new information is returned, STOP and send
    the result downstream (N is set so that the
    probability of subsequent submissions yielding
    new data is less than 5%).
  • Otherwise, ENTER the Exhaustive Phase.
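
The sampling steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the presenter's implementation: the submit callable (which takes a dictionary of field settings and returns the records found on the resulting page), the choice of the first listed value as each field's default, and the name sampling_phase are all hypothetical. The slides do not say how N is computed, so it is simply passed in.

```python
import random


def sampling_phase(fields, submit, n, rng=random):
    """Submit the default form, then n random settings.

    fields: dict mapping field name -> list of allowed values.
    submit: hypothetical callable, settings dict -> iterable of records.
    Returns (records_seen, found_new); found_new is True when any sampled
    submission returned a record the default submission did not.
    """
    # Assumption: the first listed value of each field is its default.
    default_settings = {name: values[0] for name, values in fields.items()}
    seen = set(submit(default_settings))
    found_new = False
    for _ in range(n):
        settings = {name: rng.choice(values) for name, values in fields.items()}
        records = set(submit(settings))
        if records - seen:
            found_new = True
            seen |= records
    return seen, found_new
```

When found_new is False the caller would stop and pass the records downstream; otherwise it would move on to the exhaustive phase.
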

12
Exhaustive Phase
  • Estimate the total time and quantity of data.
  • If below threshold, exhaustively obtain the rest
    of the information.
  • Otherwise, return the results of the sampling and
    report to the user the estimate of time and
    quantity of data.
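
One way to picture the estimate-then-decide step is the sketch below. It assumes, purely for illustration, that the cost is estimated as the number of field-setting combinations multiplied by an average time and size per submission observed during sampling; the slides do not state the actual estimator or thresholds. For simplicity it resubmits every combination rather than only those not yet tried. All names are hypothetical.

```python
from itertools import product
from math import prod


def exhaustive_phase(fields, submit, avg_seconds, avg_bytes,
                     max_seconds, max_bytes):
    """Estimate the cost of trying every combination, then run it if affordable.

    Returns (records, estimate). records is None when the estimate exceeds
    a threshold, so the caller can fall back to the sampling results and
    report the estimate to the user.
    """
    combinations = prod(len(values) for values in fields.values())
    estimate = {
        "submissions": combinations,
        "seconds": combinations * avg_seconds,
        "bytes": combinations * avg_bytes,
    }
    if estimate["seconds"] > max_seconds or estimate["bytes"] > max_bytes:
        return None, estimate

    names = list(fields)
    records = set()
    for combo in product(*(fields[name] for name in names)):
        records |= set(submit(dict(zip(names, combo))))
    return records, estimate
```
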

13
Data Retrieving Strategy
  • Locate possible duplicate information in
    subsequently retrieved Web pages during the
    Sampling and Exhaustive Phases.
  • Discard duplicates and merge new information.
  • Send fully merged data downstream.
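
The discard-and-merge step might look something like the sketch below. It assumes each retrieved page has already been parsed into records represented as tuples of strings, and that two records count as duplicates when they match after whitespace and letter case are normalized. Both assumptions, and the function names, are illustrative; the slides do not define the duplicate test.

```python
def normalize(record):
    """Collapse whitespace and ignore case so trivially different copies match."""
    return tuple(" ".join(field.split()).casefold() for field in record)


def merge_pages(pages):
    """Merge records from successive pages, keeping the first copy of each."""
    merged = []
    seen = set()
    for page in pages:
        for record in page:
            key = normalize(record)
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


# Hypothetical records from two retrieved pages.
pages = [
    [("Ford", "5000")],
    [("ford ", "5000"), ("Honda", "4000")],
]
merged = merge_pages(pages)
print(merged)
```
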

14
Conclusions
We can automatically:
  • Fill in Web forms.
  • Extract information behind forms.
  • Screen out errors.
  • Eliminate duplicate data and merge resulting
    information.