Wrapper Induction for Information Extraction

About This Presentation

Slides: 19

Provided by: ree2

Tags: conditions | extraction | induction | information | wrapper

Transcript and Presenter's Notes

1
Wrapper Induction for Information Extraction

2
Outline

6
Constructing wrapper by induction

Induction is the task of generalizing from
labeled examples to a hypothesis, a function for
labeling instances.
The wrapper construction problem is the
following: given a supply of example query
responses, learn a wrapper for the information
resource that generated them.
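
A minimal sketch of induction in this setting. The slides do not say which class of wrappers is learned, so this assumes a hypothetical class in which a wrapper is just a pair of delimiter strings surrounding the target value. The names `label`, `learn_wrapper`, and `apply_wrapper` and the sample pages are illustrative assumptions, not part of the presentation.

```python
def common_prefix(strings):
    """Longest string that every input starts with."""
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


def common_suffix(strings):
    """Longest string that every input ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def label(page, value):
    """Build a labeled example: the page plus the span of its target value."""
    start = page.index(value)
    return page, start, start + len(value)


def learn_wrapper(examples):
    """Generalize labeled examples to a (left, right) delimiter pair.

    The hypothesis is the longest text that precedes every labeled value
    and the longest text that follows every labeled value.
    """
    left = common_suffix([page[:start] for page, start, _ in examples])
    right = common_prefix([page[end:] for page, _, end in examples])
    if not left or not right:
        raise ValueError("examples share no usable delimiters")
    return left, right


def apply_wrapper(wrapper, page):
    """Label a new instance: extract the text between the delimiters."""
    left, right = wrapper
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]


# Hypothetical query responses from one information resource.
examples = [
    label("<html><b>Congo</b> <i>242</i></html>", "Congo"),
    label("<html><b>Spain</b> <i>34</i></html>", "Spain"),
]
wrapper = learn_wrapper(examples)
extracted = apply_wrapper(wrapper, "<html><b>Egypt</b> <i>20</i></html>")
```

Here the learned hypothesis is consistent with both examples and is then applied to an unseen page from the same resource.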

7
Wrapper Induction Details
Instances
Labels
Hypotheses
Oracles correspond to sources of example query
responses and their labels. We split the
traditional oracle into two parts.
PageOracle generates example pages.
LabelOracle produces correct labels for
these instances.
PAC analysis answers the question, "How many
examples must a learner see to be confident that
its hypothesis is good enough, i.e., that it is
probably approximately correct?"
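
To make the question concrete, here is a sketch of the standard textbook bound for a learner that outputs a hypothesis consistent with its examples, drawn from a finite hypothesis space H: m >= (1/epsilon) * (ln|H| + ln(1/delta)). This is the generic bound, not the wrapper-specific one derived in the presentation, and the function name `pac_sample_bound` is an assumption.

```python
import math


def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite space.

    With at least this many examples, the probability that a consistent
    hypothesis has error above epsilon is at most delta.
    """
    if hypothesis_count < 1:
        raise ValueError("hypothesis_count must be at least 1")
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    bound = (math.log(hypothesis_count) + math.log(1 / delta)) / epsilon
    return math.ceil(bound)


# Hypothetical numbers: 1000 candidate hypotheses, 10% error, 95% confidence.
needed = pac_sample_bound(1000, epsilon=0.1, delta=0.05)
```

The bound grows only logarithmically with the size of the hypothesis space but linearly with 1/epsilon, so tightening the accuracy requirement is what drives up the number of examples.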

9
Composing oracles

LabelOracle is provided as input.
Recognizers find instances of a particular
attribute on a page.
These recognized instances are then corroborated
to label the entire page.
For example, given a recognizer for countries and
another for country codes, corroboration produces
an oracle that labels pages containing pairs of
these attributes.
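
The country and country-code example can be sketched as follows. This assumes perfect recognizers and pages on which each tuple lists its attributes in the same order. The regular expressions, the small country lexicon, and the names `recognize_country`, `recognize_code`, and `corroborate` are illustrative assumptions rather than the presentation's actual method.

```python
import re

COUNTRIES = ("Congo", "Egypt", "Spain")  # hypothetical lexicon


def recognize_country(page):
    """Return (start, end) spans of country names found on the page."""
    pattern = "|".join(re.escape(name) for name in COUNTRIES)
    return [m.span() for m in re.finditer(pattern, page)]


def recognize_code(page):
    """Return (start, end) spans of 1 to 3 digit country codes."""
    return [m.span() for m in re.finditer(r"\b\d{1,3}\b", page)]


def corroborate(page, recognizers):
    """Merge per-attribute instances into a label for the whole page.

    Instances are sorted by position and cut into groups of one instance
    per attribute; a group whose attributes are out of order is rejected.
    """
    tagged = sorted(
        (span, index)
        for index, recognize in enumerate(recognizers)
        for span in recognize(page)
    )
    width = len(recognizers)
    if len(tagged) % width != 0:
        raise ValueError("recognized instances do not form whole tuples")
    tuples = []
    for i in range(0, len(tagged), width):
        group = tagged[i:i + width]
        if [index for _, index in group] != list(range(width)):
            raise ValueError("attributes appear in an unexpected order")
        tuples.append(tuple(span for span, _ in group))
    return tuples


def label_oracle(page):
    """A composed oracle: label a page using the two recognizers."""
    return corroborate(page, [recognize_country, recognize_code])


page = "<b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
pairs = [(page[a:b], page[c:d]) for (a, b), (c, d) in label_oracle(page)]
```

With imperfect recognizers the corroboration step would have to choose among conflicting instances, which this sketch does not attempt.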

10
Recognizer types

Perfect: accept all positive instances and reject
all negative instances of their target attribute.
Incomplete: reject all negative instances but
reject some positive instances.
Unsound: accept all positive instances but
accept some negative instances.
Unreliable: reject some positive instances and
accept some negative instances.
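
The four types follow from two yes/no questions: does the recognizer miss any positive instance, and does it accept any negative one? A small sketch, assuming instances can be compared as set members; the function name `classify_recognizer` is hypothetical.

```python
def classify_recognizer(accepted, positives):
    """Name the recognizer type from what it accepted versus the truth.

    accepted:  instances the recognizer accepted
    positives: true instances of the target attribute
    """
    accepted, positives = set(accepted), set(positives)
    misses_positive = bool(positives - accepted)   # rejects some positives
    accepts_negative = bool(accepted - positives)  # accepts some negatives
    if not misses_positive and not accepts_negative:
        return "perfect"
    if misses_positive and not accepts_negative:
        return "incomplete"
    if accepts_negative and not misses_positive:
        return "unsound"
    return "unreliable"


truth = {"Congo", "Egypt", "Spain"}
kinds = [
    classify_recognizer({"Congo", "Egypt", "Spain"}, truth),
    classify_recognizer({"Congo", "Egypt"}, truth),
    classify_recognizer({"Congo", "Egypt", "Spain", "Atlantis"}, truth),
    classify_recognizer({"Congo", "Atlantis"}, truth),
]
```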

12
Empirical evaluation I

13
Empirical evaluation II

Another experiment measures the robustness of the
system to the recognizers' error rates.
The system was tested on
(i) the OKRA email service
(ii) the BIGBOOK telephone directory.
OKRA has four attributes;
BIGBOOK has six.
The system was run with perfect recognizers
under two termination conditions:
(a) we ran the system until the PAC criterion was
satisfied
(b) we required that the learned wrapper be 100%
correct on a suite of test pages.

18
Conclusions
