Information Extraction - Introduction and Tools

1
Information Extraction - Introduction and Tools
  • V. G. Vinod Vydiswaran
  • Roll no. 02329011
  • M.Tech (1st Year)
  • KReSIT, IIT Bombay
  • 29th October 2002
  • Guided by Prof. S. Sarawagi

2
Introduction
  • What is Information Extraction (IE)?
  • Selecting desired fields from given data by
    exploiting the common patterns that appear along
    with the information.
  • Automating this process.
  • Making the process efficient by reducing the
    training data required, so as to limit the cost.

3
Motivation
  • Abundant online data available.
  • Most IE systems specific to single information
    resource.
  • IE models usually hand-coded, and hence
    error-prone.
  • Data available either in structured form or in
    highly verbose content. Proper filters needed.

4
Types of Data
  • Based on text style:
  • Structured data
  • Semi-structured text
  • Plain text
  • Based on information given to the model:
  • Labeled
  • Unlabeled

5
Structured Data
  • Relational Data
  • Data in databases, in tables
  • HTML Tags
  • Query responses translated into Relational form
    using Wrappers
  • Usually hand-coded and very specific to
    information resource

6
Wrapper Induction
  • Wrapper
  • Procedure extracting tuples from a particular
    information source
  • A function from page to set of tuples
  • Induction
  • The task of generalizing from labeled examples to
    a hypothesis function that labels instances

7
Wrapper Identification
  ExtractCCs(page P)
    skip past first occurrence of <P> in P
    while next <B> is before next <HR> in P
      for each (lk, rk) ∈ {(<B>, </B>), (<I>, </I>)}
        skip past next occurrence of lk in P
        extract attribute from P to next occurrence of rk
    return extracted tuples

  ExtractHLRT(page P, <h, t, l1, r1, ..., lK, rK>)
    skip past first occurrence of h in P
    while next l1 is before next t in P
      for each (lk, rk) ∈ {(l1, r1), ..., (lK, rK)}
        skip past next occurrence of lk in P
        extract attribute from P to next occurrence of rk
    return extracted tuples

  HLRT = Head, Left, Right, Tail
  • <HTML><HEAD>
  • <TITLE>Country Codes</TITLE>
  • </HEAD>
  • <BODY>
  • <B>Some Country Codes</B>
  • <P>
  • <B>Congo</B> <I>242</I><BR>
  • <B>Egypt</B> <I>20</I><BR>
  • <B>India</B> <I>91</I><BR>
  • <B>Spain</B> <I>34</I><BR>
  • <HR>
  • <B>End</B>
  • </BODY>
  • </HTML>
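The HLRT extraction procedure can be sketched in Python and run on the Country Codes page above. The function name and page string are illustrative; the delimiters are the ones implied by the example markup: h = `<P>`, t = `<HR>`, (l1, r1) = (`<B>`, `</B>`), (l2, r2) = (`<I>`, `</I>`).

```python
def extract_hlrt(page, h, t, pairs):
    """Extract tuples using Head-Left-Right-Tail delimiters."""
    tuples = []
    pos = page.index(h) + len(h)               # skip past the head delimiter h
    while page.index(pairs[0][0], pos) < page.index(t, pos):
        row = []                               # next l1 is before next t: another tuple
        for lk, rk in pairs:
            pos = page.index(lk, pos) + len(lk)    # skip past next occurrence of lk
            end = page.index(rk, pos)              # attribute ends at next rk
            row.append(page[pos:end])
            pos = end + len(rk)
        tuples.append(tuple(row))
    return tuples

page = ("<HTML><HEAD><TITLE>Country Codes</TITLE></HEAD><BODY>"
        "<B>Some Country Codes</B><P>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<B>India</B> <I>91</I><BR>"
        "<B>Spain</B> <I>34</I><BR>"
        "<HR><B>End</B></BODY></HTML>")

print(extract_hlrt(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")]))
# → [('Congo', '242'), ('Egypt', '20'), ('India', '91'), ('Spain', '34')]
```

Note how the head and tail delimiters matter: without stopping at `<HR>`, the spurious `<B>End</B>` in the page tail would be extracted as a country name.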

8
Constructing Wrappers
  • Problem
  • Given a supply of example query responses, learn
    a wrapper for the information resource that
    generated them.
  • Instances
  • Labels
  • Hypotheses
  • Oracles: Page Oracles and Label Oracles
  • Probably Approximately Correct (PAC) analysis
  • Input: accuracy (ε) and confidence (δ)

9
Building HLRT Wrappers
  BuildHLRT(labeled pages T = {<P1, L1>, ..., <Pn, Ln>})
    for k = 1 to K
      rk ← any common prefix of the strings following
           each (but not contained in any) attribute k
    for k = 2 to K
      lk ← any common suffix of the strings preceding
           each attribute k
    for each common suffix l1 of the pages' heads
      for each common substring h of the pages' heads
        for each common substring t of the pages' tails
          if (a) h precedes l1 in each of the pages' heads
             (b) t precedes l1 in each of the pages' tails
             (c) t occurs between h and l1 in none of
                 the pages' heads
             (d) l1 doesn't follow t in any
                 inter-tuple separator
          then return (h, t, l1, r1, ..., lK, rK)

  T ← ∅
  repeat
    Pn ← PageOracle()
    Ln ← LabelOracle(Pn)
    T ← T ∪ {<Pn, Ln>}
    w ← BuildHLRT(T)
  until Pr[E(w) < ε] > 1 − δ
  return w

  • The wrapper induction algorithm

10
Semi-Structured Text
  • Telegraphic messages
  • Advertisements
  • Rule (WHISK algorithm)
  • Pattern: *( Nghbr ) *( Digit ) Bdrm * $( Number )
  • Output: Rental {Neighbourhood $1} {Bedrooms $2}
    {Price $3}
  • Bdrm = (brs|br|bdrm|bd|bedrooms|bedroom|bed|BR)

Capitol Hill 1 br twnhme. fplc D/W W/D.
Undergrnd pkg incl $675. 3 BR, upper flr of turn
of ctry HOME. incl gar, grt N. Hill loc $995.
(206)999-9999
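A WHISK-style rule can be approximated with a regular expression. In WHISK, ( Nghbr ) is a semantic class; here it is rendered as a hypothetical alternation of two neighbourhood names from the ad, and the wildcards become non-greedy skips:

```python
import re

NGHBR = r"(Capitol Hill|N\. Hill)"                     # hypothetical semantic class
DIGIT = r"(\d+)"
BDRM = r"(?:brs|br|bdrm|bd|bedrooms|bedroom|bed|BR)"   # the slide's Bdrm class
NUMBER = r"(\d+)"

# *( Nghbr ) *( Digit ) Bdrm * $( Number ), with '*' as lazy skips
rule = re.compile(NGHBR + r".*?" + DIGIT + r"\s*" + BDRM + r".*?\$" + NUMBER,
                  re.DOTALL)

ad = ("Capitol Hill 1 br twnhme. fplc D/W W/D. Undergrnd pkg incl $675. "
      "3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc $995. "
      "(206)999-9999")

for nghbr, beds, price in rule.findall(ad):
    print({"Neighbourhood": nghbr, "Bedrooms": beds, "Price": price})
# → {'Neighbourhood': 'Capitol Hill', 'Bedrooms': '1', 'Price': '675'}
```

Only the first rental matches: in the second one the neighbourhood ("N. Hill") comes after the bedroom count, so the left-to-right pattern fails, which illustrates why WHISK learns multiple rules per extraction task.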
11
Free text
  • Newspaper reports
  • Example: Management succession
  • Input text:
  • "C. Vincent Protho, chairman and chief executive
    officer of this maker of semiconductors, was
    named to the additional post of president,
    succeeding John W. Smith, who resigned to pursue
    other interests."
  • Succession Event:
  • Person In: C. Vincent Protho
  • Person Out: John W. Smith
  • Post: president
  • Much more difficult; needs syntax analysis

12
Hidden Markov Models
  • Probability of "abb"
  • 0.3 × 1 × 0.7 × 0.5 × 0.8 × 1 (= 0.084)
  • 0.7 × 0.5 × 0.2 × 0.7 × 0.8 × 1 (= 0.0392)
  • Total = 0.084 + 0.0392 = 0.1232

Maximum Probability Path
0.3 × 1 × 0.7 × 0.5 × 0.8 × 1 (= 0.084)
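The arithmetic can be checked directly: each path's probability is the product of its transition and emission probabilities, and the total probability of the observation is the sum over all paths.

```python
# The two paths through the HMM that emit "abb" (factors as on the slide)
p1 = 0.3 * 1 * 0.7 * 0.5 * 0.8 * 1    # first path  ≈ 0.084
p2 = 0.7 * 0.5 * 0.2 * 0.7 * 0.8 * 1  # second path ≈ 0.0392

total = p1 + p2                       # probability of emitting "abb"
print(round(total, 4))
# → 0.1232
```

The first path is also the maximum-probability path, since 0.084 > 0.0392.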
13
Advantages of HMM
  • Strong statistical foundation
  • Robust handling of new data
  • Computationally efficient to develop and evaluate

Disadvantages of HMM
  • Requires an a priori notion of model topology
  • Requires large amounts of training data
14
HMM use in IE
  • Enumerating all paths takes exponential time.
  • The Viterbi algorithm, based on dynamic
    programming over a trellis structure, finds the
    most likely path in quadratic time.
  • Each state is associated with a class to be
    extracted.
  • Each state emits words from a class-specific
    unigram distribution.
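The Viterbi recursion over the trellis can be sketched as below. The toy model is hypothetical (two header-field classes with made-up probabilities); the point is the dynamic program, which does |states|² work per observation rather than enumerating paths.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # First trellis column: start probability times emission probability.
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        col = {}
        for s in states:
            # Best predecessor for s: |states| x |states| work per step.
            prob, path = max((V[-1][p][0] * trans_p[p][s], V[-1][p][1])
                             for p in states)
            col[s] = (prob * emit_p[s][o], path + [s])
        V.append(col)
    return max(V[-1].values())

# Hypothetical two-state model for tagging paper-header words.
states = ["title", "author"]
start_p = {"title": 0.8, "author": 0.2}
trans_p = {"title": {"title": 0.6, "author": 0.4},
           "author": {"title": 0.1, "author": 0.9}}
emit_p = {"title": {"learning": 0.5, "hmm": 0.4, "smith": 0.1},
          "author": {"learning": 0.1, "hmm": 0.1, "smith": 0.8}}

prob, path = viterbi(["learning", "hmm", "smith"], states,
                     start_p, trans_p, emit_p)
print(path)
# → ['title', 'title', 'author']
```

Reading off the best path labels each word with its class, which is exactly how a trained HMM performs the extraction.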

15
Training the HMM
  • Multiple states per class model the hidden
    sequence structure better
  • Each state is associated with one word and its
    associated class
  • The model can be simplified by neighbour merging
    and V-merging
  • Model structure can be learned automatically from
    data using Bayesian model merging

16
Labeled, Unlabeled and Distantly Labeled Data
  • Asking an oracle to label data is usually costly
  • Counts are used to estimate probabilities
  • Unlabeled data: Baum-Welch training algorithm
    using the trellis structure of the HMM
  • Distantly labeled data: only the relevant portion
    of the fields is used, e.g. BibTeX entries

17
Experiment
  • Problem definition
  • To extract key fields such as title, author,
    affiliation, email, abstract, keywords and
    introduction, from the header of scientific
    research papers
  • Types
  • LD: distantly labeled data interpolated with
    labeled data
  • LD: labeled and unlabeled data
  • ML: maximum likelihood

18
Results
  • Observations
  • ML and self better than full
  • Distantly labeled data helps
  • Smoothing gives best results
  • Models with multiple states per class outperform
    the ML model, with 92.9% accuracy

19
Conclusion
  • Wrapper Induction and WHISK are multi-slot
    systems with high, near-perfect accuracy on
    structured and semi-structured data.
  • On free text, the WHISK algorithm falls
    considerably short of human capability, but is
    still useful.
  • HMMs and other probabilistic models can also be
    learned for IE using efficient algorithms like
    Viterbi and Baum-Welch.

20
References
  • Wrapper Induction for Information Extraction.
    N. Kushmerick, D. S. Weld, R. Doorenbos. IJCAI-97.
  • Learning Information Extraction Rules for
    Semi-structured and Free Text. Stephen Soderland.
  • Introduction to Discrete Hidden Markov Models.
    Peter de Souza, 1997.
  • Learning Hidden Markov Model Structure for
    Information Extraction. K. Seymore, A. McCallum,
    R. Rosenfeld.

21
Thank you
  • for your interest and
  • patient hearing