Extracting tabular data from the Web - PowerPoint PPT Presentation

Provided by: chai5


1
Extracting tabular data from the Web
2
Limitations of the current BP screen scraper.
  • Parsing is done line by line.
  • Pattern matching is not very accurate and is
    unpredictable.
  • Code for fetching & parsing HTML pages has to be
    rewritten for each website (e.g. MSAMB for
    Maharashtra, Krishi Marata Vahini for
    Karnataka, etc.).
  • Does not take care of misplaced tags.

3
Characteristics of a Solution to this problem
  • Flexible.
  • Unicode Compliant.
  • Smarter pattern matching: explore the structure
    of the HTML page rather than a single line at a
    time.

4
Possible Solutions
5
Solution 1
  • Step 1: Fetch data from the desired site.
  • Step 2: Tidy the HTML page.
  • Step 3: Construct the HTML DOM (Document Object
    Model) tree.
  • Step 4: Extract node information using the
    Document object.
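
Steps 3 and 4 can be sketched with the DOM API that ships with the JDK. This is a minimal illustration, not the presenters' code: it assumes the page has already been tidied into well-formed markup (Step 2), and the class name `DomTableExtractor`, its methods, and the sample row values are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: builds a DOM tree and reads table cells from it.
class DomTableExtractor {

    /** Step 3: parse already-tidied (well-formed) markup into a DOM Document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Markup is not well-formed", e);
        }
    }

    /** Step 4: return one list of trimmed cell texts per <tr>, in document order. */
    static List<List<String>> extractRows(Document doc) {
        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("tr");
        for (int i = 0; i < trs.getLength(); i++) {
            NodeList cells = ((Element) trs.item(i)).getElementsByTagName("td");
            List<String> row = new ArrayList<>();
            for (int j = 0; j < cells.getLength(); j++) {
                row.add(cells.item(j).getTextContent().trim());
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample page; the data row is illustrative only.
        String page = "<html><body><table>"
                + "<tr><td>APMC</td><td>Arrivals</td></tr>"
                + "<tr><td>Pune</td><td>120</td></tr>"
                + "</table></body></html>";
        System.out.println(extractRows(parse(page)));
    }
}
```

Because the rows are read from the tree rather than from individual lines, a cell that spans several source lines is still returned as one value.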

6
Solution 2
  • Similar to Solution 1
  • Use XPath to locate data (Step 4).
  • The relative position of each node in the DOM
    tree is stored as an XPath.
  • These XPaths are stored in the properties file
    instead of the entire table structure.
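
A rough sketch of this idea using the JDK's `javax.xml.xpath` package follows. The property keys (`variety`, `arrivals`), the XPath expressions, the class name `XPathLookup`, and the sample page are all assumptions made for illustration; real expressions would depend on the tree produced by tidying the actual site.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: keeps XPaths in a properties file and evaluates them on a page.
class XPathLookup {

    /** Loads key=XPath pairs from the text of a properties file. */
    static Properties loadXPaths(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read XPath properties", e);
        }
        return props;
    }

    /** Evaluates the XPath stored under key against well-formed markup; returns the trimmed string value. */
    static String lookup(String xhtml, Properties xpaths, String key) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            String expression = xpaths.getProperty(key);
            return xpath.evaluate(expression, new InputSource(new StringReader(xhtml))).trim();
        } catch (Exception e) {
            throw new IllegalStateException("XPath lookup failed for key " + key, e);
        }
    }

    public static void main(String[] args) {
        // Assumed properties content: one XPath per field of interest.
        Properties xpaths = loadXPaths(
                "variety=/html/body/table/tr[2]/td[1]\n"
              + "arrivals=/html/body/table/tr[2]/td[2]\n");
        // Made-up sample page.
        String page = "<html><body><table>"
                + "<tr><td>Variety</td><td>Arrivals</td></tr>"
                + "<tr><td>Onion</td><td>120</td></tr>"
                + "</table></body></html>";
        System.out.println(lookup(page, xpaths, "variety") + " / " + lookup(page, xpaths, "arrivals"));
    }
}
```

When a site changes its layout, only the properties file needs updating, which is the flexibility this solution is after.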

7
Solution 3
  • Tested a software product: screen-scraper
    (www.screen-scraper.com).
  • Proxy server that allows the contents of HTTP and
    HTTPS requests to be viewed
  • Engine that can be configured to extract
    information from Web sites using special patterns
    and regular expressions.
  • Embedded scripting engine that allows extracted
    data to be manipulated, written out to a file, or
    inserted into a database.
  • It can be used with PHP, Java, or any
    COM-friendly language such as Visual Basic or
    Active Server Pages.
  • Costs 90 !
  • No Unicode support.

11
Other Possible Solutions
  • XMLize the HTML content.
  • XML is more structured and well-formed.
  • Data interchange between incompatible systems.
  • Can use XSL and XSLT to convert from one form to
    another.
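
As one possible illustration of the XSLT route, the sketch below uses the JDK's `javax.xml.transform` API to turn the data rows of a well-formed table into `record` elements. The stylesheet, the output element and attribute names, the class name `TableToXml`, and the sample page are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: converts an XMLized HTML table into another XML form with XSLT.
class TableToXml {

    // Assumed stylesheet: skips the header row and emits one <record> per remaining <tr>.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<records><xsl:for-each select='//tr[position() &gt; 1]'>"
        + "<record apmc='{td[1]}' arrivals='{td[2]}'/>"
        + "</xsl:for-each></records>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies STYLESHEET to well-formed markup and returns the serialized result. */
    static String transform(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page; the data row is illustrative only.
        String page = "<html><body><table>"
                + "<tr><td>APMC</td><td>Arrivals</td></tr>"
                + "<tr><td>Pune</td><td>120</td></tr>"
                + "</table></body></html>";
        System.out.println(transform(page));
    }
}
```

A different target format would only need a different stylesheet; the Java code stays the same.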

12
Implementation
13
HTML scraper
  • The HTML scraper has 3 main steps:
  • 1. Downloading the web page using a tool like
    wget.
  • 2. Parsing the page and constructing the DOM tree.
  • 3. Querying the DOM tree to retrieve the desired
    information and inserting it into the database.

14
Implementation
  • Download the web page using wget:
  • wget --post-data=data www.agmarknet.nic.in
  • The page can be stored locally.
  • Construct the DOM tree using the JTidy API:
  • Tidy tidy = new Tidy();
  • Parse the page into a DOM tree:
  • Document doc = tidy.parseDOM(htmlfile, null);

15
  • Query the DOM tree
  • Depth First Search through the DOM tree
  • Or
  • Using the XPath APIs.
  • Store the HTML page structure in file and use
    DFS.
  • Or
  • Store XPaths and use them for querying.
  • Insert into database using JDBC.
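
The DFS option could look roughly like the sketch below. It assumes the stored page structure is a slash-separated tag path such as `html/body/table/tr/td`; that format, the class name `DomDfs`, and the sample page are assumptions, and the JDK's XML parser stands in for JTidy so the example needs no extra library.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper: depth-first search for elements at a stored tag path.
class DomDfs {

    /** Recursively visits element children and collects the text of every element whose path equals targetPath. */
    static void dfs(Node node, String path, String targetPath, List<String> out) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip text nodes, comments, etc.
            }
            String childPath = path.isEmpty() ? child.getNodeName() : path + "/" + child.getNodeName();
            if (childPath.equals(targetPath)) {
                out.add(child.getTextContent().trim());
            } else {
                dfs(child, childPath, targetPath, out); // go deeper before moving to the next sibling
            }
        }
    }

    /** Parses well-formed markup and returns the text of all elements found at targetPath. */
    static List<String> textAt(String xhtml, String targetPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            List<String> out = new ArrayList<>();
            dfs(doc, "", targetPath, out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page; the data row is illustrative only.
        String page = "<html><head><title>Rates</title></head><body><table>"
                + "<tr><td>APMC</td><td>Arrivals</td></tr>"
                + "<tr><td>Pune</td><td>120</td></tr>"
                + "</table></body></html>";
        System.out.println(textAt(page, "html/body/table/tr/td"));
    }
}
```

The collected values could then be bound to a JDBC `PreparedStatement` for the insert step; that part is left out here because it depends on the database and schema in use.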

16
DOM tree of the parsed HTML page
[Figure: DOM tree diagram. Node labels: html, head, table, tr (×4). Column labels: APMC, Arrivals, Variety, Low Rate, Mid Rate, High Rate.]
19
Statistics
  • The new parser takes less than 15 seconds per
    page, whereas the old one takes more than
    30 seconds.
  • Daily data fetching time: (20015) seconds

20
  • Parsers (using DFS) for NIC and MSAMB (both
    English and Marathi) are ready.