Unsupervised Strategies for Information Extraction by Text Segmentation


1
Unsupervised Strategies for Information
Extraction by Text Segmentation

Eli Cortez, Altigran da Silva
Federal University of Amazonas - BRAZIL

2
Outline

Information Extraction by Text Segmentation
(IETS)
Scenario and Problem
Challenges and Motivation
Related Work
ONDUX
Preliminary Experiments
Next Steps

3
Information Extraction by Text Segmentation

Text documents containing implicit
semi-structured data records
Addresses
Bibliographic References
Classified Ads
Product Descriptions

4
Information Extraction by Text Segmentation
Classified Ad
Regent Square 228,900 1028 Mifflin Ave. 6
Bedrooms 2 Bathrooms. 412-638-7273
Neighborhood, Price, Number, Street,..., Phone
Address
Dr. Robert A. Jacobson, 8109 Harford Road,
Baltimore, MD 21214
Bibliographic Reference
Pável Calado, Marco Cristo, Marcos André
Gonçalves, Edleno S. de Moura, Berthier
Ribeiro-Neto, Nivio Ziviani. Link-based
similarity measures for the classification of Web
documents. JASIST, v. 57 n.2, p. 208-221, January
2006
5
Information Extraction by Text Segmentation

Why extract information?
Database Storage, Query
Data Mining
Record Linkage.

<Neighborhood> Regent Square <Price>
228,900 <No.> 1028 <Street> Mifflin
Ave. <Bed.> 6 Bedrooms <Bath.> 2
Bathrooms <Phone> 412-638-7273
Classified Ad
Regent Square 228,900 1028 Mifflin Ave. 6
Bedrooms 2 Bathrooms. 412-638-7273
6
Information Extraction by Text Segmentation

Given an input string I representing an implicit
textual record (e.g., a classified ad), the IETS
task consists of
Segmenting I into substrings (segments)
Assigning to each segment a label corresponding
to an attribute
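
To make the input and output of the task concrete, here is a minimal Python sketch. The (label, segment) pair representation and the helper name is_valid_extraction are assumptions made for illustration; the slides do not prescribe a data format.

```python
# Illustrative only: one possible representation of an IETS result.
# The input record is the classified ad used throughout the slides.
AD = ("Regent Square 228,900 1028 Mifflin Ave. 6 Bedrooms "
      "2 Bathrooms. 412-638-7273")

# Output: an ordered list of (attribute label, text segment) pairs.
EXTRACTION = [
    ("Neighborhood", "Regent Square"),
    ("Price", "228,900"),
    ("No.", "1028"),
    ("Street", "Mifflin Ave."),
    ("Bed.", "6 Bedrooms"),
    ("Bath.", "2 Bathrooms."),
    ("Phone", "412-638-7273"),
]


def is_valid_extraction(record, segments):
    """Return True if the segments, joined in order, reproduce the record.

    This checks the segmentation half of the task: every term of the
    input belongs to exactly one segment and the order is preserved.
    """
    joined = " ".join(text for _, text in segments)
    return joined.split() == record.split()


if __name__ == "__main__":
    print(is_valid_extraction(AD, EXTRACTION))
```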

7
IETS Challenges (I)

Information Extraction by Text Segmentation
(IETS)
Borkar@SIGMOD'01, McCallum@ICML'01,
Agichtein@SIGKDD'04, Mansuri@ICDE'06,
Zhao@SICDM'08, Cortez@JASIST'09
Diversity of templates and styles
Attribute Ordering
Capitalization
Abbreviations.
Different applications share similar domains
E.g., Addresses and Ads
Records from both domains contain address
information

8
IETS Challenges (II)

Diversity of templates and styles
Attribute Ordering Capitalization
Abbreviations.

Link-based similarity measures for the
classification of Web documents. Pável Calado.
Journal of the American Society for the
Information Science and Technology 57(2) 2006
HomePage
Pável Calado, Marco Cristo, Marcos André
Gonçalves, Edleno Silva de Moura, Berthier A.
Ribeiro-Neto, Nivio Ziviani. Link-based
similarity measures for the classification of Web
documents. JASIST 57 (2) 208-221(2006)
DBLP
Pável Calado, Marco Cristo, Marcos André
Gonçalves, Edleno S. de Moura, Berthier
Ribeiro-Neto, Nivio Ziviani. Link-based
similarity measures for the classification of Web
documents. JASIST, v. 57 n.2, p. 208-221, January
2006
ACM
9
IETS Challenges (III)

Existing approaches to this problem use
Machine Learning techniques
Hidden Markov Models (HMM)
Conditional Random Fields (CRF)
Support Vector Machines (SVM) (SSVM)

Supervised approaches require a hand-labeled
training set created by an expert.
Each generated model is particular to a given
application
High computational cost

10
Related Work: (Semi-)Supervised Approaches

Borkar et al. @ SIGMOD 2001
Supervised extraction method based on Hidden
Markov Models (HMM)
McCallum et al. @ ICML 2001
Proposed the use of Conditional Random Fields
(CRF), a supervised model (S-CRF)
Mansuri et al. @ ICDE 2006
Semi-supervised approach based on CRF models

All of these approaches require an expert to
create a hand-labeled training set for each
application.
11
Related Work: (Semi-)Supervised Approaches

Hand-labeled examples

Regent Square 228,900 1028 Mifflin Ave. 6
Bedrooms 2 Bathrooms. 412-638-7273
<Neighborhood> Regent Square </Neighborhood>
<Price> 228,900 </Price> <No> 1028 </No>
<Street> Mifflin Ave. </Street> <Bed> 6
Bedrooms </Bed> <Bath> 2 Bathrooms </Bath>
<Phone> 412-638-7273 </Phone>
CRF and HMM learn lexical, style, positioning
and sequencing features from the given
examples
Examples are source-dependent: scalability
problem. Can pre-existing models be reused?
12
Related Work: Unsupervised Approaches
Knowledge Bases
Wikipedia Infobox
Semi-structured Records
DBpedia
FreeBase
Structured Records
13
Related Work: Unsupervised Approaches
Supervised vs. Unsupervised
Supervised: Hand-labeled examples, Source-dependent, Scalability
Problem, Reusability
Unsupervised: Pre-existing information, Domain
Representation, Easily adaptable
14
Related Work: Unsupervised Approaches

Agichtein et al. @ SIGKDD 2004
Usage of Reference Tables to create an
unsupervised model using Hidden Markov Models
(HMM)
Zhao et al. @ SIAM ICDM 2008
Usage of reference tables to create unsupervised
CRF models - (U-CRF)
Cortez et al. @ JASIST 2009
Unsupervised method to extract bibliographic
information

Both models assume single positioning and
ordering of attributes in all test instances.
(Distinct Orderings ?)
Domain-specific heuristics, not general
application.
15
Basic Concepts(I1)

Knowledge Base
Set of pairs KB = {(attribute, O)}, where O is the set of known values of the attribute
Building process is trivial
Web Databases (Freebase, Google Base)

KB = {(Neighborhood, O_Neigh.), (Street, O_Street), (Phone, O_Phone)}
O_Neigh. = {Regent Square, Milenight Park}
O_Street = {Regent St., Morewood Ave., Square Ave. Park}
O_Phone = {323 462-6252, (171) 289-7527}
KB: Domain representation
Hand-labeled examples: Source representation
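
As an illustration of why building the knowledge base can be considered trivial, the following Python sketch collects attribute values from already-structured records. The function name build_kb and the record format (a list of dicts) are assumptions for this example, not part of the original slides.

```python
from collections import defaultdict


def build_kb(records):
    """Build a knowledge base {attribute: set of known values}.

    `records` is assumed to be a list of dicts taken from any
    pre-existing structured source (e.g. a Web database dump).
    """
    kb = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            if value:  # skip empty or missing values
                kb[attribute].add(value)
    return dict(kb)


# Toy records reusing the values shown on this slide.
RECORDS = [
    {"Neighborhood": "Regent Square", "Street": "Regent St.",
     "Phone": "323 462-6252"},
    {"Neighborhood": "Milenight Park", "Street": "Morewood Ave.",
     "Phone": "(171) 289-7527"},
    {"Street": "Square Ave. Park"},
]

if __name__ == "__main__":
    for attribute, values in build_kb(RECORDS).items():
        print(attribute, sorted(values))
```

No labeling of the target text is involved: the KB describes the domain, not any particular source.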
16
Proposed Method

ONDUX: Cortez et al. @ SIGMOD 2010
Blocking
Matching
Reinforcement

17
ONDUX (II)

Overview
[Figure: the ONDUX pipeline with its three numbered steps]
18
ONDUX (III)

Blocking
Split the input text into substrings called blocks
Consider the co-occurrence of consecutive terms
based on the KB

Regent Square 228,900 1028 Mifflin
Ave. 6 Bedrooms 2 Bathrooms.
412-638-7273
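
The Blocking idea can be sketched in a few lines of Python. This is a simplified reading, not the authors' implementation: it assumes two consecutive terms stay in the same block only when they appear next to each other, in the same order, inside some KB value. The names normalize, cooccurring_pairs, blocking and TOY_KB are hypothetical.

```python
# Toy knowledge base reusing the values from the Basic Concepts slide.
TOY_KB = {
    "Neighborhood": ["Regent Square", "Milenight Park"],
    "Street": ["Regent St.", "Morewood Ave.", "Square Ave. Park"],
    "Phone": ["323 462-6252", "(171) 289-7527"],
}


def normalize(term):
    """Lowercase a term and strip surrounding punctuation."""
    return term.strip(".,;:()").lower()


def cooccurring_pairs(kb):
    """Collect every pair of adjacent terms found in some KB value."""
    pairs = set()
    for values in kb.values():
        for value in values:
            terms = [normalize(t) for t in value.split()]
            pairs.update(zip(terms, terms[1:]))
    return pairs


def blocking(text, kb):
    """Split `text` into blocks of terms that co-occur in the KB."""
    pairs = cooccurring_pairs(kb)
    blocks = []
    previous = None
    for term in text.split():
        current = normalize(term)
        if blocks and (previous, current) in pairs:
            blocks[-1] += " " + term  # extend the current block
        else:
            blocks.append(term)       # start a new block
        previous = current
    return blocks


if __name__ == "__main__":
    print(blocking("Regent Square 228,900 1028 Morewood Ave.", TOY_KB))
```

Terms that never appear in the KB, such as prices, end up as single-term blocks; they are labeled (or left unmatched) in the next step.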
19
ONDUX (IV)

Matching
Associate each block generated in the previous
phase with an attribute according to the
Knowledge Base
We use distinct matching functions
Textual Values: FF Function (Field Frequency)
Numeric Values: NM Function (Numeric Matching)
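
The slides name the two matching functions without giving their formulas. The Python sketch below therefore uses assumed, simplified scores: a term-frequency ratio for textual blocks and a distance-from-the-mean score for numeric blocks. These are stand-ins chosen for illustration and may differ from the FF and NM functions actually defined for ONDUX; all names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

# Toy knowledge base (textual attributes only) from the earlier slide.
TOY_KB = {
    "Neighborhood": ["Regent Square", "Milenight Park"],
    "Street": ["Regent St.", "Morewood Ave.", "Square Ave. Park"],
}


def normalize(term):
    return term.strip(".,;:()").lower()


def term_counts(kb):
    """Per attribute, count how many KB values contain each term."""
    counts = {attribute: Counter() for attribute in kb}
    for attribute, values in kb.items():
        for value in values:
            counts[attribute].update({normalize(t) for t in value.split()})
    return counts


def field_frequency(block, attribute, counts):
    """Assumed textual score: average share of each term's KB
    occurrences that fall under `attribute`."""
    terms = [normalize(t) for t in block.split()]
    if not terms:
        return 0.0
    score = 0.0
    for term in terms:
        total = sum(c[term] for c in counts.values())
        if total:
            score += counts[attribute][term] / total
    return score / len(terms)


def numeric_matching(block, known_values):
    """Assumed numeric score: 1 at the mean of the known values,
    decreasing linearly to 0 at three standard deviations."""
    x = float(block.replace(",", ""))
    mu, sigma = mean(known_values), pstdev(known_values)
    if sigma == 0:
        return 1.0 if x == mu else 0.0
    return max(0.0, 1.0 - abs(x - mu) / (3 * sigma))


def match_block(block, counts):
    """Return the best-scoring attribute, or None if nothing matches."""
    scores = {a: field_frequency(block, a, counts) for a in counts}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With this toy KB, "Regent Square" scores the same for Neighborhood and Street, because both "Regent" and "Square" also occur in street names. Ambiguities of this kind are what the later Reinforcement step is meant to resolve.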

20
ONDUX (V)

Matching

Block -> label assigned by Matching
Regent Square -> Street
228,900 -> Price
1028 -> No.
Mifflin -> ???
Ave. -> Street
6 Bedrooms -> Bed.
2 Bathrooms. -> Bath.
412-638-7273 -> Phone
21
ONDUX (VI)

How can we deal with blocks that were incorrectly
labeled or were not associated with any attribute?

Block -> label assigned by Matching
Regent Square -> Street
228,900 -> Price
1028 -> No.
Mifflin -> ???
Ave. -> Street
6 Bedrooms -> Bed.
2 Bathrooms. -> Bath.
412-638-7273 -> Phone
22
ONDUX (VII)

Reinforcement
Review the labeling performed in the
Matching step
Unmatched blocks must receive the label of some
attribute
Mismatched blocks must be correctly relabeled
How to handle these cases?
Using positioning and sequencing information that
is obtained on demand.

23
ONDUX (VIII)

Reinforcement
Given the extraction output of the Matching step,
ONDUX automatically builds a graphical structure,
the PSM.
PSM: Positioning and Sequencing Model.
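
A rough Python sketch of how such a model could be built on demand from the Matching output, and then used to label unmatched blocks. The probability estimates (relative frequencies) and the noisy-OR combination of transition and position evidence are assumptions made for this example; the slides do not give the exact formulas, and the sketch fills in unmatched blocks only, without correcting mismatched ones.

```python
from collections import Counter, defaultdict


def build_psm(labeled_records):
    """Estimate sequencing and positioning statistics.

    `labeled_records` is a list of label sequences produced by the
    Matching step, with None marking an unmatched block.
    Returns (transitions, positions):
      transitions[a][b] = relative frequency of label b right after a
      positions[a][k]   = relative frequency of label a at position k
    """
    transitions = defaultdict(Counter)
    positions = defaultdict(Counter)
    for labels in labeled_records:
        for k, label in enumerate(labels):
            if label is not None:
                positions[label][k] += 1
        for prev, cur in zip(labels, labels[1:]):
            if prev is not None and cur is not None:
                transitions[prev][cur] += 1

    def normalize_rows(table):
        return {a: {b: n / sum(row.values()) for b, n in row.items()}
                for a, row in table.items()}

    return normalize_rows(transitions), normalize_rows(positions)


def reinforce(labels, transitions, positions):
    """Assign a label to each unmatched block (None) using the PSM."""
    result = list(labels)
    for k, label in enumerate(result):
        if label is not None:
            continue
        prev = result[k - 1] if k > 0 else None

        def score(attribute):
            t = transitions.get(prev, {}).get(attribute, 0.0)
            p = positions.get(attribute, {}).get(k, 0.0)
            return 1.0 - (1.0 - t) * (1.0 - p)  # assumed noisy-OR

        best = max(sorted(positions), key=score, default=None)
        if best is not None and score(best) > 0:
            result[k] = best
    return result
```

Because the statistics come from the records being extracted, no ordering of attributes has to be fixed in advance.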

24
ONDUX (IX)

Reinforcement
Extraction Result

Block -> label after Reinforcement
Regent Square -> Neighborhood (was Street)
228,900 -> Price
1028 -> No.
Mifflin -> Street (was ???)
Ave. -> Street
6 Bedrooms -> Bed.
2 Bathrooms. -> Bath.
412-638-7273 -> Phone
25
Experiments (I)

Setup
We tested our proposed approach on
bibliographic data (CORA, PersonalBib)
Collections are available on the Web

Test Set                          KB / Reference Table
Dataset  Attributes  Records      Source       Attributes  Records
CORA     1..13       150          Cora         1..13       350
CORA     1..13       150          PersonalBib  7           395
26
Experiments (II)

Evaluation
Metrics
Precision, Recall and F-Measure
T-Test for the statistical validation of the
results
Baseline
Conditional Random Fields (CRF)
U-CRF (Unsupervised method)
S-CRF (Classical supervised method)
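
For reference, the three metrics can be computed as in the short Python sketch below. Comparing sets of extracted values for one attribute is an assumption about the evaluation granularity; the slides do not say how matches were counted.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F-measure for one attribute.

    `predicted` and `gold` are collections of extracted and correct
    values, compared here by exact match.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"}))
```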

27
Experiments (III)

Extraction Quality

CORA includes a variety of styles and information
(journals, conferences, books)
S-CRF achieves higher results than U-CRF due to
the hand-labeled training
In general, the Matching and Reinforcement steps
of ONDUX outperform the CRF models
28
Experiments (IV)

Extraction Quality

As discussed earlier, U-CRF is not able to deal with
different attribute orderings
Due to the Matching and Reinforcement strategies,
ONDUX outperforms the CRF models
29
Conclusions and Future Work (I)

Partial results of our research on unsupervised
strategies for information extraction
ONDUX
Flexible: does not assume any particular style
Unsupervised: does not require any human effort to
create a training set
On-Demand: ordering and positioning information
is learned through the Matching phase

30
Conclusions and Future Work (II)