Unsupervised Strategies for Information Extraction by Text Segmentation

Original file title (translated from Portuguese): ONDUX: Unsupervised, On-Demand Learning for Information Extraction. Author: Eli. Last modified by: Eli Cortez.


1
Unsupervised Strategies for Information
Extraction by Text Segmentation
  • Eli Cortez, Altigran da Silva
  • Federal University of Amazonas - BRAZIL

2
Outline
  • Information Extraction by Text Segmentation
    (IETS)
  • Scenario and Problem
  • Challenges and Motivation
  • Related Work
  • ONDUX
  • Preliminary Experiments
  • Next Steps

3
Information Extraction by Text Segmentation
  • Text documents containing implicit
    semi-structured data records
  • Addresses
  • Bibliographic References
  • Classified Ads
  • Product Descriptions

4
Information Extraction by Text Segmentation
Classified Ad
Regent Square 228,900 1028 Mifflin Ave. 6
Bedrooms 2 Bathrooms. 412-638-7273
Attributes: Neighborhood, Price, Number, Street, ..., Phone
Address
Dr. Robert A. Jacobson, 8109 Harford Road,
Baltimore, MD 21214
Bibliographic Reference
Pável Calado, Marco Cristo, Marcos André
Gonçalves, Edleno S. de Moura, Berthier
Ribeiro-Neto, Nivio Ziviani. Link-based
similarity measures for the classification of Web
documents. JASIST, v. 57 n.2, p. 208-221, January
2006
5
Information Extraction by Text Segmentation
  • Why extract information?
  • Database Storage, Query
  • Data Mining
  • Record Linkage.

<Neighborhood> Regent Square <Price>
228,900 <No.> 1028 <Street> Mifflin
Ave. <Bed.> 6 Bedrooms <Bath.> 2
Bathrooms <Phone> 412-638-7273
Classified Ad
Regent Square 228,900 1028 Mifflin Ave. 6
Bedrooms 2 Bathrooms. 412-638-7273
6
Information Extraction by Text Segmentation
  • Given an input string I representing an implicit
    textual record (e.g., a classified ad), the IETS
    task consists of
  • Segmenting I into substrings
  • Assigning to each segment a label corresponding
    to an attribute
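
A minimal sketch of what the task's output looks like, assuming the result is represented as an ordered list of (attribute, segment) pairs. The helper name `is_valid_segmentation` is hypothetical and only checks that the segments cover the input tokens in order; it does not perform extraction.

```python
from typing import List, Tuple

# An extraction is an ordered list of (attribute label, segment text) pairs.
Labeled = List[Tuple[str, str]]


def is_valid_segmentation(record: str, labeled: Labeled) -> bool:
    """True if the segments, read in order, cover the record's tokens exactly."""
    tokens = record.split()
    covered = [tok for _, segment in labeled for tok in segment.split()]
    return tokens == covered


record = ("Regent Square 228,900 1028 Mifflin Ave. 6 Bedrooms "
          "2 Bathrooms. 412-638-7273")

# The labels follow the classified-ad example from the slides.
extraction: Labeled = [
    ("Neighborhood", "Regent Square"),
    ("Price", "228,900"),
    ("No.", "1028"),
    ("Street", "Mifflin Ave."),
    ("Bed.", "6 Bedrooms"),
    ("Bath.", "2 Bathrooms."),
    ("Phone", "412-638-7273"),
]

print(is_valid_segmentation(record, extraction))
```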

7
IETS Challenges (I)
  • Information Extraction by Text Segmentation
    (IETS)
  • Borkar@SIGMOD'01, McCallum@ICML'01,
    Agichtein@SIGKDD'04, Mansuri@ICDE'06,
    Zhao@SICDM'08, Cortez@JASIST'09
  • Diversity of templates and styles
  • Attribute Ordering
  • Capitalization
  • Abbreviations.
  • Different applications share similar domains
  • E.g., addresses and ads
  • Records from both domains contain address
    information

8
IETS Challenges (II)
  • Diversity of templates and styles
  • Attribute ordering, capitalization,
    abbreviations

The same reference as it appears in three sources:
HomePage
Link-based similarity measures for the
classification of Web documents. Pável Calado.
Journal of the American Society for the
Information Science and Technology 57(2) 2006
DBLP
Pável Calado, Marco Cristo, Marcos André
Gonçalves, Edleno Silva de Moura, Berthier A.
Ribeiro-Neto, Nivio Ziviani. Link-based
similarity measures for the classification of Web
documents. JASIST 57 (2) 208-221 (2006)
ACM
Pável Calado, Marco Cristo, Marcos André
Gonçalves, Edleno S. de Moura, Berthier
Ribeiro-Neto, Nivio Ziviani. Link-based
similarity measures for the classification of Web
documents. JASIST, v. 57 n.2, p. 208-221, January
2006
9
IETS Challenges (III)
  • Existing approaches to this problem use
    machine learning techniques
  • Hidden Markov Models (HMM)
  • Conditional Random Fields (CRF)
  • Support Vector Machines (SVM) (SSVM)
  • Supervised approaches require a hand-labeled
    training set created by an expert.
  • Each generated model is particular to a given
    application
  • High computational cost

10
Related Work: (Semi-)Supervised Approaches
  • Borkar et al. @ SIGMOD 2001
  • Supervised extraction method based on Hidden
    Markov Models (HMM)
  • McCallum et al. @ ICML 2001
  • Proposed the usage of Conditional Random Fields
    (CRF), a supervised model (S-CRF)
  • Mansuri et al. @ ICDE 2006
  • Semi-supervised approach based on CRF models

All of these approaches require an expert to
create a hand-labeled training set for each
application.
11
Related Work: (Semi-)Supervised Approaches
  • Hand-labeled examples

Regent Square 228,900 1028 Mifflin Ave. 6
Bedrooms 2 Bathrooms. 412-638-7273
<Neighborhood> Regent Square </Neighborhood>
<Price> 228,900 </Price> <No> 1028 </No>
<Street> Mifflin Ave. </Street> <Bed> 6
Bedrooms </Bed> <Bath> 2 Bathrooms </Bath>
<Phone> 412-638-7273 </Phone>
CRF and HMM learn lexical, style, positioning and
sequencing features from the given examples.
Examples are source-dependent: scalability
problem. Reusing pre-existing models?
12
Related Work: Unsupervised Approaches
  • Knowledge Bases
  • Wikipedia Infobox: semi-structured records
  • DBpedia, FreeBase: structured records
13
Related Work: Unsupervised Approaches
Supervised vs. Unsupervised
  • Supervised: hand-labeled examples, source
    dependent, scalability problem, reusability
  • Unsupervised: pre-existing information, domain
    representation, easily adaptable
14
Related Work: Unsupervised Approaches
  • Agichtein et al. @ SIGKDD 2004
  • Usage of reference tables to create an
    unsupervised model using Hidden Markov Models
    (HMM)
  • Zhao et al. @ SIAM ICDM 2008
  • Usage of reference tables to create unsupervised
    CRF models (U-CRF)
  • Cortez et al. @ JASIST 2009
  • Unsupervised method to extract bibliographic
    information

Both models assume a single positioning and
ordering of attributes in all test instances.
(Distinct orderings?)
Domain-specific heuristics; not generally
applicable.
15
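Basic Concepts (I)
  • Knowledge Base
  • Set of pairs $KB = \{(a_1, O_1), \ldots, (a_n, O_n)\}$,
    where each $a_i$ is an attribute and $O_i$ is a set
    of known values (occurrences) of $a_i$
  • Building process is trivial
  • Web databases (Freebase, Googlebase)

$KB = \{(\text{Neighborhood}, O_{\text{Neighborhood}}), (\text{Street}, O_{\text{Street}}), (\text{Phone}, O_{\text{Phone}})\}$
$O_{\text{Neighborhood}} = \{\text{Regent Square}, \text{Milenight Park}\}$
$O_{\text{Street}} = \{\text{Regent St.}, \text{Morewood Ave.}, \text{Square Ave. Park}\}$
$O_{\text{Phone}} = \{\text{323 462-6252}, \text{(171) 289-7527}\}$
  • KB: domain representation
  • Hand-labeled examples: source representation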
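
The knowledge base above can be sketched as a plain mapping from attribute to a set of known values. The inverted index built by `term_index` is an assumption added for illustration (the slides do not describe how the KB is stored); it records, for each term, how many values of each attribute contain it.

```python
from collections import defaultdict

# Knowledge base from the slide: attribute -> set of known values.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def term_index(kb):
    """Map each lower-cased term to {attribute: number of values containing it}."""
    index = defaultdict(lambda: defaultdict(int))
    for attribute, occurrences in kb.items():
        for value in occurrences:
            # set() so a term repeated inside one value is counted once
            for term in set(value.lower().split()):
                index[term][attribute] += 1
    return index


index = term_index(KB)
print(dict(index["regent"]))
```

Such an index makes it cheap to ask which attributes a term is typical of, which is the kind of lookup the later steps rely on.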
16
Proposed Method
  • ONDUX (Cortez et al. @ SIGMOD 2010)
  • Blocking
  • Matching
  • Reinforcement

17
ONDUX (II)
  • Overview

18
ONDUX (III)
  • Blocking
  • Split the input text into substrings called blocks
  • Consider the co-occurrence of consecutive terms
    based on the KB

Regent Square 228,900 1028 Mifflin
Ave. 6 Bedrooms 2 Bathrooms.
412-638-7273
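
A minimal sketch of the blocking idea, assuming that two consecutive terms stay in the same block when they appear together in at least one KB value. The function names `cooccur` and `blocking` are hypothetical, and the exact co-occurrence test used by ONDUX may differ.

```python
# Toy knowledge base taken from the Basic Concepts slide.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def cooccur(kb, a, b):
    """True if terms a and b appear together in some value of the KB."""
    a, b = a.lower(), b.lower()
    for occurrences in kb.values():
        for value in occurrences:
            terms = value.lower().split()
            if a in terms and b in terms:
                return True
    return False


def blocking(text, kb):
    """Split text into blocks, merging consecutive terms that co-occur in the KB."""
    blocks = []
    for term in text.split():
        if blocks and cooccur(kb, blocks[-1][-1], term):
            blocks[-1].append(term)   # extend the current block
        else:
            blocks.append([term])     # start a new block
    return [" ".join(block) for block in blocks]


print(blocking("Regent Square 228,900 1028 Mifflin Ave.", KB))
```

With this toy KB, "Regent" and "Square" are merged because they co-occur in a Neighborhood value, while "Mifflin" and "Ave." stay apart because no KB value contains both.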
19
ONDUX (IV)
  • Matching
  • Associate each block generated in the previous
    phase with an attribute according to the
    Knowledge Base
  • We use distinct matching functions
  • Textual values: FF function (Field Frequency)
  • Numeric values: NM function (Numeric Matching)
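
A rough sketch of frequency-based matching for textual blocks. The slides only name the FF function, so the score below is a simplified stand-in and an assumption: for each term it takes the share of KB values containing that term that belong to the candidate attribute, then averages over the block's terms. The names `field_frequency` and `match` are hypothetical, and numeric matching (NM) is not covered.

```python
# Toy knowledge base taken from the Basic Concepts slide.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def field_frequency(block, attribute, kb):
    """Simplified score: average share of each term's KB occurrences in `attribute`."""
    terms = block.lower().split()
    total = 0.0
    for term in terms:
        # Number of values of each attribute that contain the term.
        counts = {
            attr: sum(term in value.lower().split() for value in occurrences)
            for attr, occurrences in kb.items()
        }
        n = sum(counts.values())
        if n:
            total += counts[attribute] / n
    return total / len(terms) if terms else 0.0


def match(block, kb):
    """Return the best-scoring attribute, or None if no term is known to the KB."""
    scores = {attr: field_frequency(block, attr, kb) for attr in kb}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


for block in ["Morewood Ave.", "Ave.", "Mifflin"]:
    print(block, "->", match(block, KB))
```

A block such as "Mifflin", whose terms never occur in the KB, stays unmatched, which is the situation the Reinforcement step is meant to handle.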

20
ONDUX (V)
  • Matching

Labels assigned to each block by Matching
(??? = unmatched)
Block           Label
Regent Square   Street
228,900         Price
1028            No.
Mifflin         ???
Ave.            Street
6 Bedrooms      Bed.
2 Bathrooms.    Bath.
412-638-7273    Phone
21
ONDUX (VI)
  • How can we deal with blocks that were incorrectly
    labeled or were not associated to any attribute?

22
ONDUX (VII)
  • Reinforcement
  • Review the labeling task performed in the
    Matching step
  • Unmatched blocks must receive the label of some
    attribute
  • Mismatched blocks must be correctly relabeled
  • How to handle these cases?
  • Using positioning and sequencing information that
    is obtained on demand.

23
ONDUX (VIII)
  • Reinforcement
  • Given the extraction output of the Matching step
  • ONDUX automatically builds a graphical structure,
    the PSM.
  • PSM: Positioning and Sequencing Model.
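
A minimal sketch of how a positioning and sequencing model could be estimated from the labels produced by Matching. It assumes the model consists of label-to-label transition probabilities and per-position label probabilities; the normalisation choices and the names `build_psm` and `fill_unmatched` are assumptions, and the rule that combines these probabilities with the matching scores is not given in the slides and is not modelled here.

```python
from collections import Counter, defaultdict


def build_psm(labelings):
    """Estimate transition (T) and position (P) probabilities from label sequences.

    Each labeling is a list of attribute labels, with None for unmatched blocks.
    """
    trans = defaultdict(Counter)      # trans[a][b]: times b directly follows a
    pos_count = defaultdict(Counter)  # pos_count[a][k]: times a is seen at position k
    pos_total = Counter()             # labeled blocks seen at position k
    for labels in labelings:
        for k, label in enumerate(labels):
            if label is not None:
                pos_count[label][k] += 1
                pos_total[k] += 1
        for prev, nxt in zip(labels, labels[1:]):
            if prev is not None and nxt is not None:
                trans[prev][nxt] += 1
    T = {a: {b: c / sum(cnt.values()) for b, c in cnt.items()}
         for a, cnt in trans.items()}
    P = {a: {k: c / pos_total[k] for k, c in cnt.items()}
         for a, cnt in pos_count.items()}
    return T, P


def fill_unmatched(labels, T):
    """Give each unmatched block the most likely successor of the previous label."""
    out = list(labels)
    for i, label in enumerate(out):
        if label is None and i > 0 and out[i - 1] in T:
            out[i] = max(T[out[i - 1]], key=T[out[i - 1]].get)
    return out


# Hypothetical Matching output for three records.
labelings = [
    ["Neighborhood", "Price", "No.", "Street"],
    ["Neighborhood", "Price", "No.", None, "Street"],
    ["Neighborhood", "No.", "Street"],
]
T, P = build_psm(labelings)
print(fill_unmatched(labelings[1], T))
```

Because the model is estimated from the Matching output of the input records themselves, no hand-labeled sequences are needed, which is what "on demand" refers to.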

24
ONDUX (IX)
  • Reinforcement
  • Extraction Result

Labels after Reinforcement
Block           Label
Regent Square   Neighborhood (was Street)
228,900         Price
1028            No.
Mifflin         Street (was ???)
Ave.            Street
6 Bedrooms      Bed.
2 Bathrooms.    Bath.
412-638-7273    Phone
25
Experiments (1)
  • Setup
  • We tested our proposed approach on
  • Bibliographic data (CORA, PersonalBib)
  • Collections are available on the Web

Test Set                         KB / Reference Table
Dataset  Attributes  Records     Source       Attributes  Records
CORA     1..13       150         Cora         1..13       350
CORA     1..13       150         PersonalBib  7           395
26
Experiments (II)
  • Evaluation
  • Metrics
  • Precision, Recall and F-Measure
  • T-Test for the statistical validation of the
    results
  • Baseline
  • Conditional Random Fields (CRF)
  • U-CRF (Unsupervised method)
  • S-CRF (Classical supervised method)
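
For reference, a small sketch of the three metrics, assuming extracted and gold values are compared as sets of (attribute, value) pairs. The function name and the toy data are hypothetical; the evaluation granularity actually used in the experiments is not stated in the slides.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F-measure over sets of (attribute, value) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: four gold pairs, three predicted pairs, two of them correct.
gold = [("Neighborhood", "Regent Square"), ("Price", "228,900"),
        ("No.", "1028"), ("Street", "Mifflin Ave.")]
predicted = [("Street", "Regent Square"), ("Price", "228,900"),
             ("No.", "1028")]

p, r, f = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f, 3))
```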

27
Experiments (III)
  • Extraction Quality

CORA includes a variety of styles and information
(journal, conference, books)
S-CRF achieves higher results than U-CRF due to
the hand-labeled training
In general, the Matching and Reinforcement steps of
ONDUX outperform the CRF models
28
Experiments (IV)
  • Extraction Quality

As discussed earlier, U-CRF is able to deal with
different attribute orderings
Due to the Matching and Reinforcement Strategies,
ONDUX outperforms CRF models
29
Conclusions and Future Work (I)
  • Partial results of our research on unsupervised
    strategies for information extraction
  • ONDUX
  • Flexible: does not assume any particular style
  • Unsupervised: does not require any human effort to
    create a training set
  • On-Demand: ordering and positioning information
    is learned through the Matching phase

30
Conclusions and Future Work (II)
  • The proposed strategy achieves good precision
    and recall
  • Comparison with the state of the art
  • Future work
  • Investigate different matching functions
  • Multi-Record Extraction
  • Active Learning and Feedback
  • Error Detection
  • Nested structures?

31
  • Questions?