Title: Unsupervised Strategies for Information Extraction by Text Segmentation
1Unsupervised Strategies for Information
Extraction by Text Segmentation
- Eli Cortez, Altigran da Silva
- Federal University of Amazonas - BRAZIL
2Outline
- Information Extraction by Text Segmentation
(IETS) - Scenario and Problem
- Challenges and Motivation
- Related Work
- ONDUX
- Preliminary Experiments
- Next Steps
3Information Extraction by Text Segmentation
- Text documents containing implicit
semi-structured data records
- Addresses
- Bibliographic References
- Classified Ads
- Product Descriptions
4Information Extraction by Text Segmentation
Classified Ad
Regent Square 228,900 1028 Mifflin Ave. 6
Bedrooms 2 Bathrooms. 412-638-7273
Neighborhood, Price, Number, Street,..., Phone
Address
Dr. Robert A. Jacobson, 8109 Harford Road,
Baltimore, MD 21214
Bibliographic Reference
Pável Calado, Marco Cristo, Marcos André
Gonçalves, Edleno S. de Moura, Berthier
Ribeiro-Neto, Nivio Ziviani. Link-based
similarity measures for the classification of Web
documents. JASIST, v. 57 n.2, p. 208-221, January
2006
5Information Extraction by Text Segmentation
- Why extract information?
- Database Storage, Query
- Data Mining
- Record Linkage.
<Neighborhood> Regent Square <Price>
228,900 <No.> 1028 <Street> Mifflin
Ave, <Bed.> 6 Bedrooms <Bath.> 2
Bathrooms <Phone> 412-638-7273
Classified Ad
Regent Square 228,900 1028 Mifflin Ave. 6
Bedrooms 2 Bathrooms. 412-638-7273
6Information Extraction by Text Segmentation
- Given an input string I representing an implicit
textual record (e.g. a classified ad), the IETS
task consists of:
- Segmenting I into substrings
- Assigning to each segment a label corresponding
to an attribute
7IETS Challenges(I)
- Information Extraction by Text Segmentation
(IETS): Borkar @ SIGMOD'01, McCallum @ ICML'01,
Agichtein @ SIGKDD'04, Mansuri @ ICDE'06,
Zhao @ SICDM'08, Cortez @ JASIST'09
- Diversity of templates and styles
- Attribute Ordering
- Capitalization
- Abbreviations.
- Different applications share similar domains
- Ex. Address and Ads
- Records from both domains contain address
information
8IETS Challenges(II)
- Diversity of templates and styles
- Attribute Ordering
- Capitalization
- Abbreviations.
Link-based similarity measures for the
classification of Web documents. Pável Calado.
Journal of the American Society for the
Information Science and Technology 57(2) 2006
HomePage
Pável Calado, Marco Cristo, Marcos André
Gonçalves, Edleno Silva de Moura, Berthier A.
Ribeiro-Neto, Nivio Ziviani. Link-based
similarity measures for the classification of Web
documents. JASIST 57 (2) 208-221(2006)
DBLP
Pável Calado, Marco Cristo, Marcos André
Gonçalves, Edleno S. de Moura, Berthier
Ribeiro-Neto, Nivio Ziviani. Link-based
similarity measures for the classification of Web
documents. JASIST, v. 57 n.2, p. 208-221, January
2006
ACM
9IETS Challenges(III)
- Existing approaches to this problem use
Machine Learning techniques
- Hidden Markov Models (HMM)
- Conditional Random Fields (CRF)
- Support Vector Machines (SVM) (SSVM)
- Supervised approaches require a hand-labeled
training set created by an expert.
- Each generated model is particular to a given
application
- High computational cost
10Related Work(Semi) Supervised Approaches
- Borkar et al. @ SIGMOD 2001
- Supervised extraction method based on Hidden
Markov Models (HMM)
- McCallum et al. @ ICML 2001
- Proposed the use of Conditional Random Fields
(CRF), a supervised model (S-CRF)
- Mansuri et al. @ ICDE 2006
- Semi-supervised approach based on CRF models
All of these approaches require an expert to
create a hand-labeled training set for each
application.
11Related Work(Semi) Supervised Approaches
Regent Square 228,900 1028 Mifflin Ave. 6
Bedrooms 2 Bathrooms. 412-638-7273
<Neighborhood> Regent Square </Neighborhood>
<Price> 228,900 </Price> <No> 1028 </No>
<Street> Mifflin Ave, </Street> <Bed> 6
Bedrooms </Bed> <Bath> 2 Bathrooms </Bath>
<Phone> 412-638-7273 </Phone>
CRF and HMM learn lexical, style, positioning and
sequencing features from the given examples.
Examples are source-dependent. Scalability
problem. Reusing pre-existing models?
12Related WorkUNSupervised Approaches
Knowledge Bases
Wikipedia Infobox
Semi-structured Records
DBpedia
FreeBase
Structured Records
13Related WorkUNSupervised Approaches
Supervised vs. Unsupervised
Supervised: hand-labeled examples, source dependent,
scalability problem, reusability
Unsupervised: pre-existing information, domain
representation, easily adaptable
14Related WorkUNSupervised Approaches
- Agichtein et al. @ SIGKDD 2004
- Use of reference tables to create an
unsupervised model using Hidden Markov Models
(HMM)
- Zhao et al. @ SIAM ICDM 2008
- Use of reference tables to create unsupervised
CRF models (U-CRF)
- Cortez et al. @ JASIST 2009
- Unsupervised method to extract bibliographic
information
Both models assume a single positioning and
ordering of attributes in all test instances.
(Distinct orderings?)
Domain-specific heuristics, not generally
applicable.
15Basic Concepts(I1)
- Knowledge Base
- Set of pairs KB
- Building process trivial
- Web Databases (Freebase, Googlebase)
KB = {(Neighborhood, O_Neigh.), (Street, O_Street), (Phone, O_Phone)}
O_Neigh. = {"Regent Square", "Milenight Park"}
O_Street = {"Regent St.", "Morewood Ave.", "Square Ave. Park"}
O_Phone = {"323 462-6252", "(171) 289-7527"}
KB: domain representation
Hand-labeled examples: source representation
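A minimal sketch of how such a knowledge base could be held in memory, assuming a plain Python dictionary from attribute names to sets of known values. The helper names `tokenize` and `build_term_index` are hypothetical and only illustrate the idea of indexing KB terms by attribute; they are not taken from ONDUX itself.

```python
# Hypothetical in-memory KB: attribute name -> set of known values.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def tokenize(value):
    """Lowercase a value and split it on whitespace."""
    return value.lower().split()


def build_term_index(kb):
    """Map each term to {attribute: number of values of that attribute containing it}."""
    index = {}
    for attribute, values in kb.items():
        for value in values:
            for term in set(tokenize(value)):
                counts = index.setdefault(term, {})
                counts[attribute] = counts.get(attribute, 0) + 1
    return index


index = build_term_index(KB)
print(index["regent"])  # attributes whose values contain the term "regent"
print(index["ave."])    # attributes whose values contain the term "ave."
```

Note that a term such as "regent" appears under more than one attribute, so term membership alone cannot always decide a label.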
16Proposed Method
- ONDUX: Cortez et al. @ SIGMOD 2010
- Blocking
- Matching
- Reinforcement
17ONDUX (II)
[Figure: ONDUX overview, with its three steps numbered 1 to 3]
18ONDUX (III)
- Blocking
- Split the input text into substrings called blocks
- Consider the co-occurrence of consecutive terms
based on the KB
Regent Square 228,900 1028 Mifflin
Ave. 6 Bedrooms 2 Bathrooms.
412-638-7273
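The Blocking idea can be sketched as follows, under the assumption that two consecutive terms stay in the same block when they co-occur in at least one KB value. The toy KB, the `normalize` rule and the function names are illustrative assumptions, not the published ONDUX implementation.

```python
# Toy KB (hypothetical values) used only to illustrate blocking.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Bedrooms": {"3 Bedrooms", "6 Bedrooms"},
    "Bathrooms": {"2 Bathrooms", "3 Bathrooms"},
}


def normalize(term):
    """Lowercase a term and drop leading/trailing punctuation."""
    return term.lower().strip(".,")


def cooccur(a, b, kb):
    """True if both terms appear together in at least one KB value."""
    a, b = normalize(a), normalize(b)
    for values in kb.values():
        for value in values:
            terms = {normalize(t) for t in value.split()}
            if a in terms and b in terms:
                return True
    return False


def blocking(text, kb):
    """Split text into blocks of consecutive terms that co-occur in the KB."""
    terms = text.split()
    if not terms:
        return []
    blocks = [[terms[0]]]
    for prev, cur in zip(terms, terms[1:]):
        if cooccur(prev, cur, kb):
            blocks[-1].append(cur)   # extend the current block
        else:
            blocks.append([cur])     # start a new block
    return [" ".join(block) for block in blocks]


ad = ("Regent Square 228,900 1028 Mifflin Ave. 6 Bedrooms "
      "2 Bathrooms. 412-638-7273")
blocks = blocking(ad, KB)
print(blocks)
```

With this toy KB, "Mifflin" and "Ave." end up in separate blocks because "Mifflin" does not occur in any known value.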
19ONDUX (IV)
- Matching
- Associate each block generated in the previous
phase with an attribute according to the
Knowledge Base
- We use distinct matching functions
- Textual values: FF function (Field Frequency)
- Numeric values: NM function (Numeric Matching)
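The slides name the two matching functions but do not give their formulas, so the following is only a simplified stand-in: a term-overlap score for textual blocks and a Gaussian-shaped closeness score for numeric blocks. The KB values, the function bodies and the zero-score cutoff are all assumptions made for illustration.

```python
import math
import statistics

# Toy KB (hypothetical values); Price and No. are treated as numeric attributes.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Price": {"210,000", "235,000", "250,000"},
    "No.": {"1020", "1100", "950"},
}
NUMERIC = {"Price", "No."}


def terms_of(text):
    return [t.lower().strip(".,") for t in text.split()]


def is_numeric(text):
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False


def field_frequency(block, attribute, kb):
    """Stand-in for FF: fraction of block terms seen in the attribute's values."""
    known = {t for value in kb[attribute] for t in terms_of(value)}
    block_terms = terms_of(block)
    return sum(t in known for t in block_terms) / len(block_terms)


def numeric_matching(block, attribute, kb):
    """Stand-in for NM: Gaussian closeness to the attribute's known numbers."""
    value = float(block.replace(",", ""))
    known = [float(v.replace(",", "")) for v in kb[attribute]]
    mean = statistics.mean(known)
    std = statistics.pstdev(known) or 1.0
    return math.exp(-((value - mean) ** 2) / (2 * std ** 2))


def match(block, kb, numeric_attrs):
    """Return (label, score); label is None when no attribute scores above 0."""
    if is_numeric(block):
        scores = {a: numeric_matching(block, a, kb) for a in numeric_attrs}
    else:
        scores = {a: field_frequency(block, a, kb)
                  for a in kb if a not in numeric_attrs}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > 0 else (None, 0.0)


for block in ["228,900", "1028", "Mifflin", "Ave."]:
    print(block, "->", match(block, KB, NUMERIC))
```

Under this toy KB, "Regent Square" scores equally for Neighborhood and Street, and "Mifflin" matches nothing; matching on content alone leaves such blocks ambiguous or unlabeled.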
20ONDUX (V)
Blocks and their labels after Matching:
Street: Regent Square
Price: 228,900
No.: 1028
???: Mifflin
Street: Ave.
Bed.: 6 Bedrooms
Bath.: 2 Bathrooms.
Phone: 412-638-7273
21ONDUX (VI)
- How can we deal with blocks that were incorrectly
labeled or were not associated with any attribute?
Blocks and their labels after Matching:
Street: Regent Square
Price: 228,900
No.: 1028
???: Mifflin
Street: Ave.
Bed.: 6 Bedrooms
Bath.: 2 Bathrooms.
Phone: 412-638-7273
22ONDUX (VII)
- Reinforcement
- Review the labeling task performed in the
Matching step
- Unmatched blocks must receive a label of a given
attribute
- Mismatched blocks must be correctly labeled
- How to handle these cases?
- Using positioning and sequencing information that
is obtained On-Demand.
23ONDUX (VIII)
- Reinforcement
- Given the extraction output of the Matching step
- ONDUX automatically builds a graphical structure,
the PSM.
- PSM: Positioning and Sequencing Model.
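One way to picture a positioning and sequencing model is as two tables estimated from the Matching output: how often one attribute directly follows another, and how often an attribute appears at each block position. The sketch below assumes that representation and combines the evidence with a noisy-OR; both the combination rule and the names `build_psm` and `reinforce` are illustrative assumptions, since the slides do not specify them.

```python
from collections import Counter, defaultdict


def build_psm(label_sequences):
    """Estimate transition and position probabilities from Matching outputs.

    Each sequence lists the labels given to one record's blocks;
    None marks an unmatched block and is skipped when counting.
    """
    trans = defaultdict(Counter)  # trans[a][b]: a immediately followed by b
    pos = defaultdict(Counter)    # pos[a][k]: a observed at block position k
    for seq in label_sequences:
        for k, label in enumerate(seq):
            if label is None:
                continue
            pos[label][k] += 1
            if k + 1 < len(seq) and seq[k + 1] is not None:
                trans[label][seq[k + 1]] += 1
    T = {a: {b: n / sum(c.values()) for b, n in c.items()}
         for a, c in trans.items()}
    P = {a: {k: n / sum(c.values()) for k, n in c.items()}
         for a, c in pos.items()}
    return T, P


def reinforce(prev_label, position, match_scores, T, P):
    """Pick the label with the highest noisy-OR of matching, transition and position."""
    best_label, best_score = None, 0.0
    for label in P:
        m = match_scores.get(label, 0.0)
        t = T.get(prev_label, {}).get(label, 0.0)
        p = P[label].get(position, 0.0)
        score = 1 - (1 - m) * (1 - t) * (1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical Matching outputs for three records.
sequences = [
    ["Neighborhood", "Price", "No.", "Street", "Phone"],
    ["Neighborhood", "Price", "No.", None, "Street", "Phone"],
    ["Price", "No.", "Street", "Phone"],
]
T, P = build_psm(sequences)

# An unmatched block at position 3 that follows a "No." block.
label, score = reinforce("No.", 3, {}, T, P)
print(label, score)
```

Because the tables are built from the records being extracted, no hand-labeled ordering information is needed in this sketch.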
24ONDUX (IX)
- Reinforcement
- Extraction Result
Price No.
Street
???
Neighborhood
Street
Street
Regent Square 228,900 1028 Mifflin
Ave. 6 Bedrooms 2 Bathrooms.
412-638-7273
Bed. Bath.
Phone
25Experiments (1)
- Setup
- We tested our proposed approach on
- Bibliographic data (CORA, PersonalBib)
- Collections are available on the Web
Test Set                       | KB / Reference Table
Dataset  Attributes  Records   | Source       Attributes  Records
CORA     1..13       150       | Cora         1..13       350
CORA     1..13       150       | PersonalBib  7           395
26Experiments (II)
- Evaluation
- Metrics
- Precision, Recall and F-Measure
- T-Test for the statistical validation of the
results
- Baseline
- Conditional Random Fields (CRF)
- U-CRF (Unsupervised method)
- S-CRF (Classical supervised method)
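For reference, the three metrics can be computed as in this small sketch, assuming that an extraction counts as correct only when both the attribute and the value agree with the gold standard. The example pairs are hypothetical and are not results from the experiments.

```python
def precision_recall_f1(extracted, gold):
    """Compare two collections of (attribute, value) pairs."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold standard and extraction output for one record.
gold = {
    ("Neighborhood", "Regent Square"),
    ("Price", "228,900"),
    ("No.", "1028"),
    ("Street", "Mifflin Ave."),
}
extracted = {
    ("Street", "Regent Square"),
    ("Price", "228,900"),
    ("No.", "1028"),
    ("Street", "Ave."),
}
p, r, f = precision_recall_f1(extracted, gold)
print(p, r, f)
```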
27Experiments (III)
CORA includes a variety of styles and information
(journal, conference, books)
S-CRF achieves higher results than U-CRF due to
the hand-labeled training
In general, the Matching and Reinforcement steps of
ONDUX outperform the CRF models
28Experiments (IV)
As discussed earlier, U-CRF is not able to deal with
different attribute orderings
Due to the Matching and Reinforcement Strategies,
ONDUX outperforms CRF models
29Conclusions andFuture Work (I)
- Partial results of our research on unsupervised
strategies for information extraction
- ONDUX
- Flexible: does not assume any particular style
- Unsupervised: does not require any human effort to
create a training set
- On-Demand: ordering and positioning information
is learned through the Matching phase
30Conclusions and Future Work (II)
- The proposed strategy achieves good precision
and recall results
- Comparison with the state of the art
- As future work
- Investigate different matching functions
- Multi-Record Extraction
- Active Learning and Feedback
- Error Detection
- Nested structures?