Information Extraction - Introduction and Tools

1
Information Extraction - Introduction and Tools
  • V. G. Vinod Vydiswaran
  • Roll no. 02329011
  • M.Tech (1st Year)
  • KReSIT, IIT Bombay
  • 29th October 2002
  • Guided by Prof. S. Sarawagi

2
Introduction
  • What is Information Extraction (IE)?
  • Selecting desired fields from given data by
    exploiting the common patterns that appear along
    with the information.
  • Automating this process.
  • Making the process efficient by reducing the
    training data required, so as to limit the cost.

3
Motivation
  • Abundant online data available.
  • Most IE systems specific to single information
    resource.
  • IE models usually hand-coded, and hence
    error-prone.
  • Data available either in structured form or in
    highly verbose content. Proper filters needed.

4
Types of Data
  • Based on text style:
  • Structured data
  • Semi-structured text
  • Plain text
  • Based on information given to the model:
  • Labeled
  • Unlabeled

5
Structured Data
  • Relational Data
  • Data in databases, in tables
  • HTML Tags
  • Query responses translated into Relational form
    using Wrappers
  • Usually hand-coded and very specific to
    information resource

6
Wrapper Induction
  • Wrapper
  • Procedure extracting tuples from a particular
    information source
  • A function from page to set of tuples
  • Induction
  • The task of generalizing from labeled examples to
    a hypothesis function that labels instances

7
Wrapper Identification
  ExtractCCs(page P)
    skip past first occurrence of <P> in P
    while next <B> is before next <HR> in P
      for each (lk, rk) ∈ {(<B>, </B>), (<I>, </I>)}
        skip past next occurrence of lk in P
        extract attribute from P to next occurrence of rk
    return extracted tuples

  ExtractHLRT(page P, <h, t, l1, r1, ..., lK, rK>)
    skip past first occurrence of h in P
    while next l1 is before next t in P
      for each (lk, rk) ∈ {(l1, r1), ..., (lK, rK)}
        skip past next occurrence of lk in P
        extract attribute from P to next occurrence of rk
    return extracted tuples

  HLRT = Head, Left, Right, Tail
  • <HTML><HEAD>
  • <TITLE>Country Codes</TITLE>
  • </HEAD>
  • <BODY>
  • <B>Some Country Codes</B>
  • <P>
  • <B>Congo</B> <I>242</I><BR>
  • <B>Egypt</B> <I>20</I><BR>
  • <B>India</B> <I>91</I><BR>
  • <B>Spain</B> <I>34</I><BR>
  • <HR>
  • <B>End</B>
  • </BODY>
  • </HTML>
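The HLRT extraction procedure can be sketched in Python and run on the Country Codes page above. The function name and page string are illustrative; the delimiters are the ones implied by the example markup: h = `<P>`, t = `<HR>`, (l1, r1) = (`<B>`, `</B>`), (l2, r2) = (`<I>`, `</I>`).

```python
def extract_hlrt(page, h, t, pairs):
    """Extract tuples using Head-Left-Right-Tail delimiters."""
    tuples = []
    pos = page.index(h) + len(h)               # skip past the head delimiter h
    while page.index(pairs[0][0], pos) < page.index(t, pos):
        row = []                               # next l1 is before next t: another tuple
        for lk, rk in pairs:
            pos = page.index(lk, pos) + len(lk)    # skip past next occurrence of lk
            end = page.index(rk, pos)              # attribute ends at next rk
            row.append(page[pos:end])
            pos = end + len(rk)
        tuples.append(tuple(row))
    return tuples

page = ("<HTML><HEAD><TITLE>Country Codes</TITLE></HEAD><BODY>"
        "<B>Some Country Codes</B><P>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<B>India</B> <I>91</I><BR>"
        "<B>Spain</B> <I>34</I><BR>"
        "<HR><B>End</B></BODY></HTML>")

print(extract_hlrt(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")]))
# → [('Congo', '242'), ('Egypt', '20'), ('India', '91'), ('Spain', '34')]
```

Note how the head and tail delimiters matter: without stopping at `<HR>`, the spurious `<B>End</B>` in the page tail would be extracted as a country name.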

8
Constructing Wrappers
  • Problem
  • Given a supply of example query responses, learn
    a wrapper for the information resource that
    generated them.
  • Instances
  • Labels
  • Hypotheses
  • Oracles: Page Oracles and Label Oracles
  • Probably Approximately Correct (PAC) analysis
  • Input: accuracy (ε) and confidence (δ)

9
Building HLRT Wrappers
  BuildHLRT(labeled pages T = {<P1, L1>, ..., <Pn, Ln>})
    for k = 1 to K
      rk ← any common prefix of the strings following
           each (but not contained in any) attribute k
    for k = 2 to K
      lk ← any common suffix of the strings preceding
           each attribute k
    for each common suffix l1 of the pages' heads
      for each common substring h of the pages' heads
        for each common substring t of the pages' tails
          if (a) h precedes l1 in each of the pages' heads
             (b) t precedes l1 in each of the pages' tails
             (c) t occurs between h and l1 in none of
                 the pages' heads
             (d) l1 doesn't follow t in any
                 inter-tuple separator
          then return (h, t, l1, r1, ..., lK, rK)

  T ← ∅
  repeat
    Pn ← PageOracle()
    Ln ← LabelOracle(Pn)
    T ← T ∪ {<Pn, Ln>}
    w ← BuildHLRT(T)
  until Pr[E(w) < ε] > 1 − δ
  return w

  • The wrapper induction algorithm

10
Semi-Structured Text
  • Telegraphic messages
  • Advertisements
  • Rule (WHISK algorithm)
  • Pattern: *( Nghbr ) *( Digit ) Bdrm * $( Number )
  • Output: Rental {Neighbourhood $1} {Bedrooms $2}
    {Price $3}
  • Bdrm = (brs|br|bdrm|bd|bedrooms|bedroom|bed|BR)

Capitol Hill 1 br twnhme. fplc D/W W/D.
Undergrnd pkg incl $675. 3 BR, upper flr of turn
of ctry HOME. incl gar, grt N. Hill loc $995.
(206)999-9999
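A WHISK-style rule can be approximated with a regular expression. In WHISK, ( Nghbr ) is a semantic class; here it is rendered as a hypothetical alternation of two neighbourhood names from the ad, and the wildcards become non-greedy skips:

```python
import re

NGHBR = r"(Capitol Hill|N\. Hill)"                     # hypothetical semantic class
DIGIT = r"(\d+)"
BDRM = r"(?:brs|br|bdrm|bd|bedrooms|bedroom|bed|BR)"   # the slide's Bdrm class
NUMBER = r"(\d+)"

# *( Nghbr ) *( Digit ) Bdrm * $( Number ), with '*' as lazy skips
rule = re.compile(NGHBR + r".*?" + DIGIT + r"\s*" + BDRM + r".*?\$" + NUMBER,
                  re.DOTALL)

ad = ("Capitol Hill 1 br twnhme. fplc D/W W/D. Undergrnd pkg incl $675. "
      "3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc $995. "
      "(206)999-9999")

for nghbr, beds, price in rule.findall(ad):
    print({"Neighbourhood": nghbr, "Bedrooms": beds, "Price": price})
# → {'Neighbourhood': 'Capitol Hill', 'Bedrooms': '1', 'Price': '675'}
```

Only the first rental matches: in the second one the neighbourhood ("N. Hill") comes after the bedroom count, so the left-to-right pattern fails, which illustrates why WHISK learns multiple rules per extraction task.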
11
Free text
  • Newspaper reports
  • Example: Management succession
  • Input text:
  • "C. Vincent Protho, chairman and chief executive
    officer of this maker of semiconductors, was
    named to the additional post of president,
    succeeding John W. Smith, who resigned to pursue
    other interests."
  • Succession Event:
  • Person In: C. Vincent Protho
  • Person Out: John W. Smith
  • Post: president
  • Much more difficult; needs syntax analysis

12
Hidden Markov Models
  • Probability of "abb"
  • 0.3 × 1 × 0.7 × 0.5 × 0.8 × 1 (= 0.084)
  • 0.7 × 0.5 × 0.2 × 0.7 × 0.8 × 1 (= 0.0392)
  • Total = 0.084 + 0.0392 = 0.1232

Maximum Probability Path
0.3 × 1 × 0.7 × 0.5 × 0.8 × 1 (= 0.084)
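The arithmetic can be checked directly: each path's probability is the product of its transition and emission probabilities, and the total probability of the observation is the sum over all paths.

```python
# The two paths through the HMM that emit "abb" (factors as on the slide)
p1 = 0.3 * 1 * 0.7 * 0.5 * 0.8 * 1    # first path  ≈ 0.084
p2 = 0.7 * 0.5 * 0.2 * 0.7 * 0.8 * 1  # second path ≈ 0.0392

total = p1 + p2                       # probability of emitting "abb"
print(round(total, 4))
# → 0.1232
```

The first path is also the maximum-probability path, since 0.084 > 0.0392.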
13
Advantages of HMM
  • Strong statistical foundation
  • Robust handling of new data
  • Computationally efficient to develop and evaluate

Disadvantages of HMM
  • Requires an a priori notion of model topology
  • Requires large amounts of training data
14
HMM use in IE
  • Enumerating all paths takes exponential time.
  • The Viterbi algorithm, based on dynamic
    programming over a trellis structure, finds the
    most likely path in quadratic time.
  • Each state is associated with a class to be
    extracted.
  • Each state emits words from a class-specific
    unigram distribution.
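The Viterbi recursion over the trellis can be sketched as below. The toy model is hypothetical (two header-field classes with made-up probabilities); the point is the dynamic program, which does |states|² work per observation rather than enumerating paths.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # First trellis column: start probability times emission probability.
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        col = {}
        for s in states:
            # Best predecessor for s: |states| x |states| work per step.
            prob, path = max((V[-1][p][0] * trans_p[p][s], V[-1][p][1])
                             for p in states)
            col[s] = (prob * emit_p[s][o], path + [s])
        V.append(col)
    return max(V[-1].values())

# Hypothetical two-state model for tagging paper-header words.
states = ["title", "author"]
start_p = {"title": 0.8, "author": 0.2}
trans_p = {"title": {"title": 0.6, "author": 0.4},
           "author": {"title": 0.1, "author": 0.9}}
emit_p = {"title": {"learning": 0.5, "hmm": 0.4, "smith": 0.1},
          "author": {"learning": 0.1, "hmm": 0.1, "smith": 0.8}}

prob, path = viterbi(["learning", "hmm", "smith"], states,
                     start_p, trans_p, emit_p)
print(path)
# → ['title', 'title', 'author']
```

Reading off the best path labels each word with its class, which is exactly how a trained HMM performs the extraction.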

15
Training the HMM
  • Multiple states per class model the hidden
    sequence structure better
  • Each state is associated with one word and its
    associated class
  • The model can be simplified by neighbour merging
    and V-merging
  • Model structure can be learned automatically from
    data using Bayesian model merging

16
Labeled, Unlabeled and Distantly Labeled Data
  • Asking an oracle to label data is usually costly
  • Counts are used to estimate probabilities
  • Unlabeled data: Baum-Welch training algorithm
    using the trellis structure of the HMM
  • Distantly labeled data: only the relevant portion
    of the fields is used, e.g. BibTeX entries

17
Experiment
  • Problem definition
  • To extract key fields such as title, author,
    affiliation, email, abstract, keywords and
    introduction, from the header of scientific
    research papers
  • Types
  • LD: distantly labeled data interpolated with
    labeled data
  • LD: labeled and unlabeled data
  • ML: maximum likelihood

18
Results
  • Observations
  • ML and self better than full
  • Distantly labeled data helps
  • Smoothing gives best results
  • Models with multiple states per class outperform
    the ML model, with 92.9% accuracy

19
Conclusion
  • Wrapper Induction and WHISK are multi-slot
    systems with high, near-perfect accuracy on
    structured and semi-structured data.
  • On free text, the WHISK algorithm falls
    considerably short of human capability, but is
    still useful.
  • HMMs and other probabilistic models can also be
    learned for IE using efficient algorithms like
    Viterbi and Baum-Welch.

20
References
  • Wrapper Induction for Information Extraction.
    N. Kushmerick, D. S. Weld, R. Doorenbos. IJCAI-97.
  • Learning Information Extraction Rules for
    Semi-structured and Free Text. Stephen Soderland.
  • Introduction to Discrete Hidden Markov Models.
    Peter de Souza, 1997.
  • Learning Hidden Markov Model Structure for
    Information Extraction. K. Seymore, A. McCallum,
    R. Rosenfeld.

21
Thank you
  • for your interest and
  • patient hearing