Title: Information Extraction Based on Extraction Ontologies: Design, Deployment and Evaluation
1 Information Extraction Based on Extraction Ontologies: Design, Deployment and Evaluation
- Martin Labský, Vojtěch Svátek
- Dept. of Knowledge Engineering, UEP
- labsky,svatek_at_vse.cz
- AI Seminar, November 13th 2008
2 Agenda
- Example applications of Web IE
- Difficulties in practical applications
- Extraction Ontologies
- Extraction process
- Experimental results
- Future work and Conclusion
3 Example apps of Web IE (1/5): online products
4 Example apps of Web IE (2/5): contact information
5 Example apps of Web IE (3/5): seminars, events
6 Example apps of Web IE (4/5): bike products
7 Example apps of Web IE (5/5)
- Store the extracted results in a DB to enable structured search over documents
  - information retrieval
  - database-like querying
  - e.g. an online product search engine
  - e.g. building a contact DB
- Support for web page quality assessment
  - involved in the EU project MedIEQ to support medical website accreditation agencies
- Source documents
  - internet, intranet, emails
  - can be very diverse
9 Difficulties in practical applications (1/3)
- Requirements
  - quickly prototype IE applications
  - not necessarily with the best accuracy initially
  - often needed for a proof-of-concept application
  - more work can then be done to boost accuracy
- The extraction model changes
  - the meaning of to-be-extracted items may shift
  - new items are often added or removed
10 Difficulties in practical applications (2/3)
- Purely manual rules
  - writing extraction rules manually does not scale when more complex extraction rules need to be encoded
  - not easy to combine with trained models when training data become available in later phases
- Training data
  - trainable IE systems often require large amounts of training data; these are typically not available for the desired task
  - once training data is collected, it is not easy to adapt it to modified or additional criteria
- Wrappers
  - cannot rely on wrapper-only systems when extracting from multiple websites
  - non-wrapper systems often do not utilize regular formatting cues
11 Difficulties in practical applications (3/3)
- It seems promising to exploit, at the same time:
  - extraction knowledge from domain experts
  - training data
  - formatting regularities
13 Extraction ontologies
- An extraction ontology is a part of a domain ontology transformed to suit extraction needs
- Contains classes composed of attributes
  - more like UML class diagrams, less like ontologies where e.g. relations are standalone entities
  - also contains axioms related to classes or attributes
- Classes and attributes are augmented with extraction evidence
  - manually provided patterns for content and context
  - axioms
  - value or length ranges
  - links to trained models
(Figure: example class Person with attribute cardinalities: name 1, degree 0-5, email 0-2, phone 0-3.)
14 Extraction evidence provided by the domain expert (1)
- Patterns
  - for attributes and classes
  - for their content and context
  - patterns may be defined at the following levels:
    - word and character level
    - formatting tag level
    - level of labels (e.g. sentence breaks, POS tags)
- Attribute value constraints
  - word length constraints, numeric value ranges
  - possible to attach units to numeric attributes
- Axioms
  - may enforce relations among attributes
  - interpreted using the JavaScript scripting language
- Simple co-reference resolution rules
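To make the evidence types concrete, here is a minimal Python sketch (not the tool's actual pattern syntax; the attribute name, regexes and digit thresholds are all illustrative) of how a content pattern, a left-context pattern and a value constraint might cooperate for a hypothetical "phone" attribute:

```python
import re

# Hypothetical evidence for a "phone" attribute (illustrative only).
CONTENT = re.compile(r"(\+?\d[\d\s-]{6,14}\d)")              # value pattern
CONTEXT = re.compile(r"(phone|tel\.?|call)\s*:?\s*$", re.I)  # left-context pattern

def find_phone_candidates(text, min_digits=7, max_digits=15):
    candidates = []
    for m in CONTENT.finditer(text):
        value = m.group(1)
        digits = re.sub(r"\D", "", value)
        # value constraint: digit count within the allowed range
        if not (min_digits <= len(digits) <= max_digits):
            continue
        # context evidence: does a trigger word precede the value?
        left = text[:m.start()]
        has_context = bool(CONTEXT.search(left[-20:]))
        candidates.append((value, has_context))
    return candidates
```

The context match would not reject a candidate on its own; in the evidence model it merely raises or lowers the candidate's probability.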
15 Extraction evidence provided by the domain expert (2)
- Axioms
- class level
- attribute level
- Patterns
- class content
- attribute value
- attribute context
- class context
- Value constraints
- word length
- numeric value
16 Extraction evidence based on trained models (1)
- Links to trainable classifiers
  - may classify attributes only
  - binary or multi-class
- Trained models may use as features
  - simple word-level features (the word itself, word type, possibly POS tags)
  - all evidence provided by the expert, re-used (patterns, axioms, constraints)
  - induced binary features based on word n-grams
(Figure: example classifier definition and its usage within the ontology.)
17 Extraction evidence based on trained models (2)
- Data representation for classifiers
  - word sequence (1 word = 1 sample)
  - phrase set (sliding-window method)
- Tested trainable classifiers
  - CRF (Conditional Random Fields): http://crfpp.sourceforge.net
  - algorithms from the Weka machine learning toolkit (http://www.cs.waikato.ac.nz/ml/weka)
    - SVM (Support Vector Machine)
    - JRip (rule induction)
  - Hidden Markov Model extractor
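The phrase-set representation can be sketched as follows: every contiguous phrase of up to a maximum length becomes one sample for a binary "is this an attribute value?" classifier (the window length is an illustrative assumption):

```python
def sliding_window_phrases(tokens, max_len=3):
    # Every contiguous phrase of 1..max_len tokens becomes one
    # classification sample, identified by its start offset.
    phrases = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length <= len(tokens):
                phrases.append((start, tokens[start:start + length]))
    return phrases
```

This contrasts with the word-sequence representation, where each token is classified separately (1 word = 1 sample).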
18 Extraction evidence based on trained models (3)
- Feature induction
  - candidate features are all word n-grams of given lengths occurring inside or near training attribute values
  - pruning parameters:
    - point-wise mutual information thresholds
    - minimal absolute occurrence count
    - maximum number of features
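The induction step with all three pruning parameters can be sketched as below (a simplification: candidates are unigrams over small labeled snippets, and the threshold values are illustrative):

```python
from collections import Counter
from math import log

def induce_ngram_features(samples, n=1, min_count=2, min_pmi=0.5, max_feats=100):
    """samples: list of (tokens, is_inside_attribute_value) pairs.
    Keeps n-grams whose point-wise mutual information with attribute
    values passes the thresholds (illustrative values)."""
    ngram_counts, joint_counts = Counter(), Counter()
    n_inside = total = 0
    for tokens, inside in samples:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        n_inside += len(grams) if inside else 0
        for g in grams:
            ngram_counts[g] += 1
            if inside:
                joint_counts[g] += 1
    p_inside = n_inside / total
    feats = []
    for g, joint in joint_counts.items():
        if ngram_counts[g] < min_count:          # absolute count pruning
            continue
        pmi = log((joint / total) / ((ngram_counts[g] / total) * p_inside))
        if pmi >= min_pmi:                       # PMI threshold pruning
            feats.append((g, pmi))
    feats.sort(key=lambda x: -x[1])
    return feats[:max_feats]                     # max feature count pruning
```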
19 Probabilistic model to combine evidence
- Each piece of evidence E is equipped with 2 probability estimates with respect to the predicted attribute A:
  - evidence precision P(A|E) ... prediction confidence
  - evidence coverage P(E|A) ... necessity of the evidence (support)
- Each attribute is assigned some low prior probability P(A)
- Let E1, ..., En be the set of evidence applicable to A
- Assume conditional independence among E1, ..., En given A
- Using Bayes' formula we compute P(A | its evidence values) as
  P(A | E1, ..., En) = P(A) · Π P(Ei | A) / P(E1, ..., En),
  where the denominator P(E1, ..., En) normalizes over A and its complement.
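A Python sketch of this combination (a reconstruction from the definitions above, not the system's actual code): P(Ei) is recovered from precision and coverage via Bayes' rule, P(Ei|¬A) follows from it, and the conditional-independence products give the posterior.

```python
def combine_evidence(prior, evidence):
    """prior: P(A); evidence: list of (precision, coverage) pairs,
    i.e. (P(A|Ei), P(Ei|A)), for the pieces of evidence that fired.
    Assumes conditional independence given A and given not-A."""
    p_true = prior       # accumulates P(A) * prod P(Ei|A)
    p_false = 1 - prior  # accumulates P(not-A) * prod P(Ei|not-A)
    for prec, cov in evidence:
        p_ei = cov * prior / prec                   # P(Ei), via Bayes
        p_true *= cov                               # P(Ei|A)
        p_false *= p_ei * (1 - prec) / (1 - prior)  # P(Ei|not-A)
    return p_true / (p_true + p_false)
```

A useful sanity check: with a single piece of evidence the posterior reduces exactly to that evidence's precision, and adding a second agreeing piece pushes the posterior higher.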
20 Extraction vs. domain ontologies
- When existing domain ontologies are available
  - identify relevant parts
  - reuse classes, attributes, cardinalities, some axioms
- Transformation rules
  - reused parts of a domain ontology may require transformation to fit into the extraction ontology
  - because extraction ontologies focus on the way of presentation rather than on semantics
  - typical transformation rules were identified that can be used to transform parts of OWL-encoded ontologies
22 The extraction process (1/5)
- Tokenize, build the HTML formatting tree, apply a sentence splitter and a POS tagger
- Match patterns
- Apply trained models
- Create Attribute Candidates (ACs)
- For each created AC, compute P(AC)
  - prune ACs below a threshold
  - build a document AC lattice, score ACs by log P(AC)
(Figure: example AC lattice over the token sequence "Washington , DC".)
23 The extraction process (2/5)
- Evaluate coreference resolution rules for each pair of ACs
  - e.g. "Dr. Burns" and "John Burns" may corefer
  - possible coreferring groups are remembered
  - rules are defined in the attribute's value section
- Compute the best-scoring path BP through the AC lattice
  - using dynamic programming
- Run the wrapper induction algorithm using all ACs on BP
  - the wrapper induction algorithm is described on the next slides
  - if new local patterns are induced, apply them to
    - rescore existing ACs
    - create new ACs
  - update the AC lattice, recompute BP
- Terminate here if no instances are to be generated
  - output all ACs on BP (n-best paths supported)
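The best-path computation can be sketched as a Viterbi-style dynamic program over token positions (illustrative; the background log-score for tokens not covered by any AC is an assumed parameter):

```python
def best_path(n_tokens, acs, bg_logp=-1.0):
    """acs: list of (start, end, log_prob, label); an AC spans
    tokens start..end-1. Returns the highest-scoring non-overlapping
    sequence of ACs, with uncovered tokens scored as background."""
    best = [0.0] + [float("-inf")] * n_tokens
    back = [None] * (n_tokens + 1)
    for i in range(1, n_tokens + 1):
        # background transition: token i-1 belongs to no AC
        if best[i - 1] + bg_logp > best[i]:
            best[i] = best[i - 1] + bg_logp
            back[i] = (i - 1, None)
        # AC transitions ending at position i
        for ac in acs:
            s, e, logp, label = ac
            if e == i and best[s] + logp > best[i]:
                best[i] = best[s] + logp
                back[i] = (s, ac)
    path, i = [], n_tokens
    while i > 0:  # follow back-pointers to recover the chosen ACs
        prev, ac = back[i]
        if ac is not None:
            path.append(ac)
        i = prev
    return list(reversed(path))
```

Here two high-scoring non-overlapping ACs beat a single overlapping one, which is exactly the disambiguation the lattice search performs.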
24 The extraction process (3/5)
- Generate Instance Candidates (ICs) bottom-up
  - a triangular trellis is used to store partial ICs
  - when scoring new ICs, only axioms and patterns that can already be applied to the IC are considered; validity is not required
  - pruning parameters: absolute and relative beam size at each trellis node, maximum number of ACs that can be skipped, minimum IC probability
25 The extraction process (4/5)
- IC generation continued
  - when a new IC is created, its P(IC) is computed from 2 components
  - the first is based on the member ACs, where |IC| is the member attribute count, AC_skip is a non-member AC that is fully or partially inside the IC, and P(AC_skip) is the probability of that AC being a false positive
  - the second is a class score computed from E_C, the set of evidence known for the class C, using the same probabilistic model as for ACs
  - the scores are combined using the Prospector pseudo-Bayesian method
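The Prospector scheme combines independent posterior estimates by multiplying their odds relative to the prior odds. The slide does not give the exact formula used, so this sketch follows the classic Prospector rule:

```python
def prospector_combine(posteriors, prior):
    """Combine posterior estimates P(H|Ei) into P(H|E1..Ek) by
    multiplying posterior odds relative to the prior odds."""
    odds = lambda p: p / (1 - p)
    combined = odds(prior)
    for p in posteriors:
        combined *= odds(p) / odds(prior)
    return combined / (1 + combined)
```

Note that a posterior equal to the prior leaves the result unchanged, so uninformative component scores are neutral under this rule.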
26 The extraction process (5/5)
- Insert valid ICs into the AC lattice
  - valid ICs were assembled during the IC generation phase
  - the score of a valid IC reflects all extraction evidence of its class
  - all unpruned valid ICs are inserted into the AC lattice, scored by log P(IC)
- The best path BP is calculated through the joint IC+AC lattice (n-best supported)
  - the search algorithm allows constraints to be defined over the extracted path(s)
  - e.g. min/max count of extracted instances
  - output all ACs and ICs on BP
27 Extraction evidence based on formatting
- A simple wrapper induction algorithm
  - identify formatting regularities
  - turn them into local context patterns to boost contained ACs
- Assemble distinct formatting subtrees rooted at block elements containing ACs from the best path BP currently determined by the system
- For each subtree S, calculate its co-occurrence count C(S,Att) with attribute Att and its precision prec(Att|S)
- If both C(S,Att) and prec(Att|S) reach defined thresholds, a new local context pattern is created, with its precision set to prec(Att|S) and its recall close to 0 (in order not to harm potential singleton ACs)
(Figure: a formatting subtree TD > A_href > B learned from a known name "John Doe" (e-mail jdoe_at_web.ca) and applied to recognize the unknown name "Argentina Agosto" (e-mail aa_at_web.br).)
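The subtree-counting step can be sketched as follows (a simplification: subtrees are represented as flat path signatures, and the threshold values are illustrative):

```python
from collections import Counter

def induce_context_patterns(occurrences, min_count=2, min_prec=0.9):
    """occurrences: list of (subtree_signature, attribute_or_None),
    one entry per block element on the current best path. Returns
    (signature, attribute, precision) triples passing both thresholds."""
    total = Counter(sig for sig, _ in occurrences)
    with_att = Counter((sig, att) for sig, att in occurrences if att)
    patterns = []
    for (sig, att), c in with_att.items():
        prec = c / total[sig]  # prec(Att|S): fraction of S occurrences labeled Att
        if c >= min_count and prec >= min_prec:
            patterns.append((sig, att, prec))
    return patterns
```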
29 Experimental results: Seminar announcements
- 485 English seminar announcement text documents
- Manual: extraction ontology created based on seeing 40 randomly chosen documents, evaluated using the remaining 445
- Manual+CRF: the same extraction ontology equipped with a CRF classifier used as further extraction evidence; 10-fold cross-validation using the test set above
30 Cost of the IE system: Seminar announcements
- Creation of the extraction ontology: 1-2 person-weeks
  - annotate 40 training documents (expect 1-2 days)
  - inspecting examples in the 40 documents
  - writing patterns and axioms, iterating
- Training an inductive model in addition to the extraction ontology
  - 2-3 person-weeks to annotate training data (445 docs)
  - F-measure improvement of 2 to 6 points
- Extraction ontologies allow for fast, flexible prototyping (annotation design changes are quickly reflected)
- Then, for the parts of the extraction ontology that need accuracy improvement, obtain more training data and reuse all manual extraction evidence already provided as features
31 Experimental results: Contact information
- 109 English contact pages, 200 Spanish, 108 Czech
- Named entity counts: 7000, 5000 and 11000, respectively; instances not labeled
- Only the domain expert's evidence and formatting pattern induction were used
- The domain expert saw 30 randomly chosen documents; the rest was test data
- Instance extraction done but not evaluated
- Instance grouping
  - Vilain score: F-measure 60-70%
  - Vilain recall: % of correct links recovered
  - Vilain precision: % of recovered links that are correct
32 Experimental results: Bicycle descriptions
- Hidden Markov Model
  - trigram, naive topology
- 103 labeled web pages, 12346 named entities
- Instances not labeled; instance extraction done but not evaluated
- Single HMM for all extracted types
  - 1 background state
  - 1 target, 1 prefix and 1 suffix state type for each extracted slot
  - 1 + 3N states in total
(Figure: naive HMM topology with background state B and prefix (P), target (T), suffix (S) states for each slot.)
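The 1 + 3N state layout can be sketched as below. The transition set is an assumption about what "naive topology" means here: a background self-loop, prefix → target → suffix chains per slot, and suffix back to background.

```python
def build_naive_topology(slots):
    """One background state B plus prefix/target/suffix states per
    slot (1 + 3N states), with an assumed naive transition set."""
    states = ["B"]
    transitions = {("B", "B")}  # background self-loop
    for s in slots:
        p, t, x = f"P_{s}", f"T_{s}", f"S_{s}"
        states += [p, t, x]
        transitions |= {("B", p), (p, t), (t, t), (t, x), (x, "B")}
    return states, transitions
```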
33 Bicycle structured search interface
34 Future work
- Attempt to improve a seed extraction ontology by bootstrapping, using relevant pages retrieved from the Internet
- Adapt the structure of the extraction ontology according to the data
  - e.g. add new attributes to represent product features
35 Conclusions
- Tool and tutorial available
  - http://eso.vse.cz/labsky/ex/
- Presented an extraction ontology approach to
  - allow for fast prototyping of IE applications
  - accommodate extraction schema changes easily
  - utilize all available forms of extraction knowledge:
    - the domain expert's knowledge
    - training data
    - formatting regularities found in web pages
- Results
  - indicate that extraction ontologies can serve as a quick prototyping tool
  - the accuracy of the prototyped ontology can be improved when training data become available
36 Acknowledgements
- The research was partially supported by the EC under contract FP6-027026, Knowledge Space of Semantic Inference for Automatic Annotation and Retrieval of Multimedia Content (K-Space).
- The medical website application is carried out in the context of the EC-funded (DG-SANCO) project MedIEQ.