Learning the Common Structure of Data, by Kristina Lerman and Steven Minton

1
Learning the Common Structure of Data
Kristina Lerman and Steven Minton
  • Presentation by Jeff Roth

2
Introduction
  • Data Extraction
  • Data Expectation
  • Goals
  • Wrapper Verification
  • Wrapper Maintenance

3
Representation
  • Break the web page into tokens, a representation
    more general than characters but more specific
    than word symbols
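A minimal sketch of such a token representation, assuming each token also carries one or more syntactic types. Only the type CAPS appears in these slides; the other type names (NUMBER, ALLCAPS, LOWER, ALPHA, ALPHANUM, PUNCT) and both function names are hypothetical, chosen for illustration.

```python
import re


def tokenize(text):
    """Split text into alphanumeric runs and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def token_types(token):
    """Return the token's types, most specific first.

    Only CAPS is taken from the slides; the remaining type names
    are assumptions made for this sketch.
    """
    types = []
    if token.isdigit():
        types.append("NUMBER")
    elif token.isalpha():
        if token[0].isupper() and (len(token) == 1 or token[1:].islower()):
            types.append("CAPS")      # capitalized word, e.g. "Haven"
        elif token.isupper():
            types.append("ALLCAPS")   # e.g. "CT"
        elif token.islower():
            types.append("LOWER")
        types.append("ALPHA")         # any purely alphabetic token
    elif token.isalnum():
        types.append("ALPHANUM")
    else:
        types.append("PUNCT")
    return types
```

With this representation, a pattern can mix literal tokens and types, so "New CAPS" would describe both "New Haven" and "New York".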

4
DataPro
  • For complex fields it is sufficient to learn only
    the starting and ending sequences of a data field
  • DataPro properties:
      • Only positive examples
      • Statistical algorithm
      • Polynomial time
      • Greedy

5
Prefix Tree
  • For a given data field, the tokens are encoded in
    a prefix tree (a suffix tree would be similar)
  • Each node is a specialization of its parent.
  • Example for the data field City:
      • Node: New
      • Children: Haven, York, CAPS
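The City example can be sketched as a small prefix tree in which each token contributes both its literal form and, when capitalized, the type CAPS. This is an illustrative sketch only: the Node class, symbols_for, build_prefix_tree, and the max_depth cutoff are assumptions, not the structure used in DataPro itself.

```python
class Node:
    """One tree node: a symbol (literal token or type) plus a match count."""

    def __init__(self, symbol=None):
        self.symbol = symbol
        self.count = 0        # number of examples whose prefix reaches this node
        self.children = {}    # symbol -> Node


def symbols_for(token):
    """Literal token, plus the type CAPS for capitalized words (assumed rule)."""
    symbols = [token]
    if token.isalpha() and token[0].isupper() and token[1:].islower():
        symbols.append("CAPS")
    return symbols


def build_prefix_tree(examples, max_depth=3):
    """Encode the first max_depth tokens of every example in a prefix tree."""
    root = Node()
    for tokens in examples:
        root.count += 1
        frontier = [root]
        for token in tokens[:max_depth]:
            next_frontier = []
            for node in frontier:
                for symbol in symbols_for(token):
                    child = node.children.setdefault(symbol, Node(symbol))
                    child.count += 1
                    next_frontier.append(child)
            frontier = next_frontier
    return root


cities = [["New", "Haven"], ["New", "York"], ["New", "London"], ["Los", "Angeles"]]
tree = build_prefix_tree(cities)
new_node = tree.children["New"]
```

Here new_node has the children Haven, York, London, and CAPS. The CAPS child is the generalization that covers all three city names starting with "New".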

6
Significant(count1, count2, P, a)
  • Significance is the main measure used in the
    DataPro algorithm
  • Parameters:
      • count1, count2 - number of times a pattern of
        tokens appears in the data field examples
      • P - probability of count1 given count2
      • a - null hypothesis limit, i.e. the threshold
        for rejecting the null hypothesis
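One way such a test could look is sketched below. The slides do not give the formula, so everything here is an assumption: count1 is read as the number of occurrences of an extended pattern, count2 as the number of occurrences of its prefix, P as the chance that the extension follows the prefix at random, and a as a one-sided significance level. The sketch uses a normal approximation to the binomial distribution.

```python
from math import sqrt
from statistics import NormalDist


def significant(count1, count2, p, a=0.05):
    """Return True if count1 successes in count2 trials exceed chance.

    Assumed interpretation (not taken from the slides): under the null
    hypothesis each of the count2 trials succeeds with probability p,
    and the pattern is significant when count1 is improbably large.
    """
    if count2 <= 0 or not 0.0 < p < 1.0:
        return False
    expected = count2 * p
    std_dev = sqrt(count2 * p * (1.0 - p))
    z_score = (count1 - expected) / std_dev
    # One-sided test: reject the null hypothesis when the z-score
    # exceeds the critical value for significance level a.
    return z_score > NormalDist().inv_cdf(1.0 - a)
```

For example, 30 occurrences where 10 are expected out of 100 trials is far above chance, while 11 occurrences is not.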

7
DataPro Algorithm
  • Create root node of tree
  • For next node Q of tree:
      • Create children of Q
      • Prune generalizations
      • Determinize children
  • Extract patterns from tree

8
Wrapper Verification
  • Wrapper fragility is a common problem, yet wrapper
    verification is rare
  • Take patterns created by DataPro for the current
    wrapper and create a distribution t from the
    number of pattern matches of each pattern on the
    original web pages
  • Take a similar distribution k from the new web
    pages that are being verified
  • If t and k have approximately the same
    distribution, the wrapper is still valid;
    otherwise it needs to be updated
  • Recall: 95%, precision: 47%
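The comparison of t and k can be sketched as follows. The slides do not say how "approximately the same distribution" is decided, so this sketch substitutes a simple stand-in: total variation distance between the normalized match counts, with a hypothetical tolerance. The function names and the sample pattern strings are likewise made up for illustration.

```python
def normalize(counts):
    """Turn per-pattern match counts into proportions."""
    total = sum(counts.values())
    if total == 0:
        return {pattern: 0.0 for pattern in counts}
    return {pattern: n / total for pattern, n in counts.items()}


def wrapper_still_valid(t, k, tolerance=0.2):
    """Compare match-count distributions t (original pages) and k (new pages).

    Stand-in test (an assumption, not the presented method): the wrapper
    is treated as valid when the total variation distance between the
    two normalized distributions is at most the tolerance.
    """
    pt, pk = normalize(t), normalize(k)
    patterns = set(pt) | set(pk)
    distance = 0.5 * sum(abs(pt.get(x, 0.0) - pk.get(x, 0.0)) for x in patterns)
    return distance <= tolerance


t = {"CAPS CAPS": 40, "New CAPS": 10}                    # original pages
k_similar = {"CAPS CAPS": 38, "New CAPS": 12}            # site unchanged
k_changed = {"CAPS CAPS": 2, "New CAPS": 0, "<td>": 48}  # extraction broke
```

A statistical goodness-of-fit test could replace the distance threshold; the tolerance used here has no basis in the slides.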

9
Wrapper Maintenance
  • Take original patterns
  • Find matching start and end patterns
  • Remove sequences with unusually high or low
    length
  • Score remaining sequences based on location,
    adjacent tokens, and visibility to the user
  • Cluster the candidates by score; the highest
    scoring cluster should contain only correct
    examples of the data field
  • 62 of 77 tests yielded correct, complete data
    field examples
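The length-filtering step can be sketched as below. The slides do not define "unusually high or low", so this sketch assumes that length is measured in tokens and that a candidate is unusual when it lies more than max_z standard deviations from the mean length of the known examples. The function name and the max_z parameter are hypothetical. Scoring and clustering are not sketched because the slides give no details for them.

```python
from statistics import mean, pstdev


def filter_by_length(candidates, known_examples, max_z=2.0):
    """Drop candidate token sequences with unusually high or low length.

    Assumed criterion: keep a candidate only if its length is within
    max_z standard deviations of the mean length of known_examples,
    which must be a non-empty list of token sequences.
    """
    lengths = [len(example) for example in known_examples]
    mu = mean(lengths)
    sigma = pstdev(lengths)
    kept = []
    for candidate in candidates:
        deviation = abs(len(candidate) - mu)
        if sigma == 0:
            # All known examples have the same length: require an exact match.
            if deviation == 0:
                kept.append(candidate)
        elif deviation / sigma <= max_z:
            kept.append(candidate)
    return kept


known = [["New", "Haven"], ["New", "York"], ["Los", "Angeles"],
         ["San", "Luis", "Obispo"]]
candidates = [["Santa", "Fe"],
              ["Click", "here", "to", "see", "more", "results"],
              []]
survivors = filter_by_length(candidates, known)
```

Only the two-token candidate survives; the six-token sequence and the empty sequence are discarded as outliers.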