1
Information Extraction Based on Extraction
Ontologies: Design, Deployment and Evaluation
  • Martin Labský, Vojtěch Svátek
  • Dept. of Knowledge Engineering, University of
    Economics, Prague (UEP)
  • {labsky,svatek}@vse.cz
  • AI Seminar, November 13th 2008

2
Agenda
  1. Example applications of Web IE
  2. Difficulties in practical applications
  3. Extraction Ontologies
  4. Extraction process
  5. Experimental results
  6. Future work and Conclusion

3
Example apps of Web IE (1/5): online products
4
Example apps of Web IE (2/5): contact information
5
Example apps of Web IE (3/5): seminars, events
6
Example apps of Web IE (4/5): bike products
7
Example apps of Web IE (5/5): uses of the extracted data
  • Store the extracted results in a DB to enable
    structured search over documents
  • information retrieval
  • database-like querying
  • e.g. an online product search engine
  • e.g. building a contact DB
  • Support for web page quality assessment
  • involved in the EU project MedIEQ to support
    medical website accreditation agencies
  • Source documents
  • internet, intranet, emails
  • can be very diverse

8
Agenda
  1. Example applications of Web IE
  2. Difficulties in practical IE applications
  3. Extraction Ontologies
  4. Extraction process
  5. Experimental results
  6. Future work and Conclusion

9
Difficulties in practical applications (1/3)
  • Requirements
  • quickly prototype IE applications
  • not necessarily with the best accuracy initially
  • often needed for a proof-of-concept application
  • then more work can be done to boost accuracy
  • The extraction model changes over time
  • the meaning of to-be-extracted items may shift
  • new items are often added or removed

10
Difficulties in practical applications (2/3)
  • Purely manual rules
  • writing extraction rules manually does not scale
    when more complex extraction rules need to be
    encoded
  • not easy to combine with trained models when
    training data become available in later phases
  • Training data
  • trainable IE systems often require large amounts
    of training data; these are typically not
    available for the desired task
  • once training data have been collected, they are
    not easy to adapt to modified or additional criteria
  • Wrappers
  • cannot rely on wrapper-only systems when
    extracting from multiple websites
  • non-wrapper systems often do not utilize regular
    formatting cues

11
Difficulties in practical applications (3/3)
  • It seems promising to exploit, at the same time,
  • extraction knowledge from domain experts
  • training data
  • formatting regularities

12
Agenda
  1. Example applications of Web IE
  2. Difficulties in practical applications
  3. Extraction Ontologies
  4. Extraction process
  5. Experimental results
  6. Future work and Conclusion

13
Extraction ontologies
  • An extraction ontology is a part of a domain
    ontology transformed to suit extraction needs
  • Contains classes composed of attributes
  • more like UML class diagrams, less like
    ontologies where e.g. relations are standalone
  • also contains axioms related to classes or
    attributes
  • Classes and attributes are augmented with
    extraction evidence
  • manually provided patterns for content and
    context
  • axioms
  • value or length ranges
  • links to trained models

[Diagram: example class Person with attributes name (1),
degree (0-5), email (0-2), phone (0-3); relation: Responsible]
14
Extraction evidence provided by the domain expert (1)
  • Patterns (illustrated in the sketch below)
  • for attributes and classes
  • for their content and context
  • patterns may be defined at the following levels:
  • word and character level
  • formatting tag level
  • level of labels (e.g. sentence breaks, POS tags)
  • Attribute value constraints
  • word length constraints, numeric value ranges
  • possible to attach units to numeric attributes
  • Axioms
  • may enforce relations among attributes
  • interpreted using the JavaScript scripting language
  • Simple co-reference resolution rules
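
For concreteness, here is a minimal hypothetical sketch (in Python) of
the kinds of evidence a domain expert might attach to a speaker
attribute of a seminar announcement; the structure and all names are
invented for illustration and are not the tool's actual ontology syntax:

```python
# Hypothetical illustration of expert-supplied evidence for a "speaker"
# attribute; structure and names are invented for this sketch and do not
# reproduce the Ex system's real ontology format.
speaker_evidence = {
    # content patterns: what the value itself looks like (word/char level)
    "content_patterns": [r"(Dr|Prof)\.\s+[A-Z][a-z]+(\s+[A-Z][a-z]+)?"],
    # context patterns: cues expected near the value
    "context_patterns": [r"(speaker|presented by|talk by)\s*:?"],
    # value constraints: allowed length range in words
    "value_constraints": {"min_words": 1, "max_words": 5},
    # an axiom relating attributes of the enclosing class; in Ex these are
    # interpreted as JavaScript, shown here as a Python predicate
    "axioms": [lambda seminar: seminar["start_time"] <= seminar["end_time"]],
}
```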

15
Extraction evidence provided by the domain expert (2)
  • Axioms
  • class level
  • attribute level
  • Patterns
  • class content
  • attribute value
  • attribute context
  • class context
  • Value constraints
  • word length
  • numeric value

16
Extraction evidence based on trained models (1)
  • Links to trainable classifiers
  • may classify attributes only
  • binary or multi-class
  • Trained models may use as features
  • simple word level features (word itself, word
    type, possibly POS tags)
  • re-use all evidence provided by the expert (patterns,
    axioms, constraints)
  • induced binary features based on word n-grams

[Screenshot: classifier definition and its usage in the
extraction ontology]
17
Extraction evidence based on trained models (2)
  • Data representation for classifiers
  • word sequence (1 word = 1 sample)
  • phrase set (sliding window method)
  • Tested trainable classifiers
  • CRF (Conditional Random Fields),
    http://crfpp.sourceforge.net
  • algorithms from the Weka machine learning toolkit,
    http://www.cs.waikato.ac.nz/ml/weka
  • SVM (Support Vector Machine)
  • JRip (rule induction)
  • Hidden Markov Model extractor

18
Extraction evidence based on trained models (3)
  • Feature induction (a simplified sketch follows below)
  • candidate features are all word n-grams of given
    lengths occurring inside or near training
    attribute values
  • pruning parameters
  • point-wise mutual information thresholds
  • minimal absolute occurrence count
  • maximum number of features
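
A simplified sketch of this induction step, assuming token-level
attribute labels are available; the function and its default parameter
values are invented for illustration (it also only considers n-grams
starting inside attribute values, not those nearby):

```python
import math
from collections import Counter

def induce_ngram_features(token_seqs, attr_flags, n=2, pmi_min=2.0,
                          min_count=3, max_features=200):
    """Induce binary word n-gram features using the pruning steps above:
    a pointwise mutual information threshold, a minimal absolute
    occurrence count, and a maximum feature count.
    token_seqs: list of token lists; attr_flags: parallel boolean lists
    marking tokens inside labeled attribute values."""
    ngram_count, joint_count, total = Counter(), Counter(), 0
    for tokens, flags in zip(token_seqs, attr_flags):
        for i in range(len(tokens) - n + 1):
            g = tuple(tokens[i:i + n])
            ngram_count[g] += 1
            if flags[i]:              # n-gram starts inside an attribute value
                joint_count[g] += 1
            total += 1
    if total == 0 or not joint_count:
        return []
    p_attr = sum(joint_count.values()) / total
    scored = []
    for g, joint in joint_count.items():
        if joint < min_count:         # minimal absolute occurrence count
            continue
        # PMI(g, attr) = log P(g, attr) / (P(g) * P(attr))
        pmi = math.log((joint / total) / ((ngram_count[g] / total) * p_attr))
        if pmi >= pmi_min:
            scored.append((pmi, g))
    scored.sort(reverse=True)
    return [g for _, g in scored[:max_features]]   # maximum feature count
```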

19
Probabilistic model to combine evidence
  • Each piece of evidence E is equipped with two
    probability estimates with respect to the predicted
    attribute A
  • evidence precision P(A|E) ... prediction
    confidence
  • evidence coverage P(E|A) ... necessity of the
    evidence (support)
  • Each attribute is assigned some low prior
    probability P(A)
  • Let E1, ..., En be the set of evidence applicable to A
  • Assume conditional independence among E1, ..., En
    given A (and given ¬A)
  • Using Bayes' formula we compute P(A | E1, ..., En) as
    (see the sketch below)

    P(A | E1,...,En) = P(A) ∏i P(Ei|A) /
        [ P(A) ∏i P(Ei|A) + (1 − P(A)) ∏i P(Ei|¬A) ]

  • where P(Ei|¬A) = P(Ei) (1 − P(A|Ei)) / (1 − P(A))
    and P(Ei) = P(Ei|A) P(A) / P(A|Ei)
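
Given these definitions, both product terms can be derived from the
per-evidence precision and coverage estimates alone; a minimal Python
sketch of the combination (no smoothing or numerical guards):

```python
def combine_evidence(prior, evidence):
    """Combine independent evidence for attribute A.
    `evidence` is a list of (precision, coverage) pairs for each observed
    E_i, where precision = P(A|E_i) and coverage = P(E_i|A). A direct
    transcription of the Bayes combination above."""
    num = prior          # accumulates P(A) * prod_i P(E_i|A)
    den_neg = 1 - prior  # accumulates (1-P(A)) * prod_i P(E_i|~A)
    for precision, coverage in evidence:
        p_e = coverage * prior / precision               # P(E_i)
        p_e_not_a = p_e * (1 - precision) / (1 - prior)  # P(E_i|~A)
        num *= coverage
        den_neg *= p_e_not_a
    return num / (num + den_neg)

# e.g. a low prior combined with one precise cue and one weaker cue
print(combine_evidence(0.01, [(0.9, 0.8), (0.6, 0.3)]))
```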

20
Extraction vs. domain ontologies
  • When existing domain ontologies are available
  • identify relevant parts
  • reuse classes, attributes, cardinalities, some
    axioms
  • Transformation rules
  • reused parts of domain ontology may require
    transformation to fit into extraction ontology
  • because extraction ontologies focus on the way
    information is presented rather than on its semantics
  • we identified typical transformation rules that can
    be used to transform parts of OWL-encoded ontologies

21
Agenda
  1. Example applications of Web IE
  2. Difficulties in practical applications
  3. Extraction Ontologies
  4. Extraction process
  5. Experimental results
  6. Future work and Conclusion

22
The extraction process (1/5)
  • Tokenize, build the HTML formatting tree, apply a
    sentence splitter and a POS tagger
  • Match patterns
  • Apply trained models
  • Create Attribute Candidates (ACs)
  • for each created AC, let P_AC = P(A | E1,...,En) as
    given by the probabilistic model above
  • prune ACs whose P_AC falls below a threshold
  • build the document AC lattice, score ACs by log(P_AC)

[Diagram: fragment of an AC lattice over the token sequence
"Washington , DC"]
23
The extraction process (2/5)
  • Evaluate coreference resolution rules for each
    pair of ACs
  • e.g. Dr. Burns ↔ John Burns
  • possible coreferring groups are remembered in the
    attribute's values section
  • Compute the best scoring path BP through the AC
    lattice
  • using dynamic programming (sketched below)
  • Run the wrapper induction algorithm using all AC ∈ BP
  • the wrapper induction algorithm is described in the
    next slides
  • if new local patterns are induced, apply them to
  • rescore existing ACs
  • create new ACs
  • update the AC lattice, recompute BP
  • Terminate here if no instances are to be
    generated
  • output all AC ∈ BP (n-best paths supported)
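
A single-best sketch of this dynamic program over the AC lattice; the
background score and the (start, end, logprob) representation are
simplifying assumptions (the system itself supports n-best paths):

```python
def best_path(n_tokens, acs, bg_logprob=-1.0):
    """Best-scoring path through an AC lattice via dynamic programming
    (Viterbi-style, single best). Each AC is a (start, end, logprob)
    triple over token offsets; tokens covered by no chosen AC
    contribute a background score."""
    best = [0.0] * (n_tokens + 1)   # best[i]: best path score over tokens [0, i)
    back = [None] * (n_tokens + 1)  # AC chosen to end at i, or None (background)
    by_end = {}
    for ac in acs:
        by_end.setdefault(ac[1], []).append(ac)
    for i in range(1, n_tokens + 1):
        best[i], back[i] = best[i - 1] + bg_logprob, None  # token i-1 unlabeled
        for start, end, lp in by_end.get(i, []):
            if best[start] + lp > best[i]:
                best[i], back[i] = best[start] + lp, (start, end, lp)
    path, i = [], n_tokens          # follow back-pointers right to left
    while i > 0:
        if back[i] is None:
            i -= 1
        else:
            path.append(back[i])
            i = back[i][0]
    return list(reversed(path))

# e.g. two overlapping ACs over 5 tokens; the cheaper combination wins
print(best_path(5, [(0, 2, -0.5), (1, 4, -0.3)]))
```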

24
The extraction process (3/5)
  • Generate Instance Candidates (ICs) bottom-up
  • a triangular trellis is used to store partial ICs
  • when scoring new ICs, only consider axioms and
    patterns that can already be applied to the IC;
    validity is not required
  • pruning parameters (sketched below): absolute and
    relative beam size at each trellis node, maximum
    number of ACs that can be skipped, minimum IC
    probability
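
A sketch of the beam-pruning step at a single trellis node; the
representation of candidates as (probability, ic) pairs and the default
parameter values are invented for illustration:

```python
def prune_beam(scored_ics, abs_beam=10, rel_beam=0.01, p_min=1e-4):
    """Beam pruning of partial ICs at one trellis node: keep at most
    abs_beam candidates, dropping any whose probability falls below
    rel_beam times the best one or below the global minimum p_min."""
    kept = sorted([x for x in scored_ics if x[0] >= p_min],
                  key=lambda x: x[0], reverse=True)[:abs_beam]
    if not kept:
        return []
    best = kept[0][0]
    return [x for x in kept if x[0] >= rel_beam * best]
```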

25
The extraction process (4/5)
  • IC generation continued
  • when a new IC is created, its P(IC) is computed
    from two components
  • an attribute component:

    P_A(IC) = ( ∏(AC ∈ IC) P_AC × ∏(AC ∈ skip(IC)) P_ACskip )
              ^( 1 / (|IC| + |skip(IC)|) )

  • where |IC| is the member attribute count,
  • skip(IC) holds each non-member AC that is fully or
    partially inside the IC,
  • and P_ACskip = 1 − P_AC is the probability of such an
    AC being a false positive
  • a class component P(C | E_C), where E_C is the set of
    evidence known for the class C, computed using the
    same probabilistic model as for ACs
  • the two scores are combined using the Prospector
    pseudo-Bayesian method (sketched below)
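
One common formulation of the Prospector pseudo-Bayesian update combines
the two scores in odds space relative to the shared prior; a minimal
sketch under that assumption:

```python
def prospector_combine(prior, p_attr, p_class):
    """Prospector-style pseudo-Bayesian combination of two posterior
    estimates for the same hypothesis, each obtained relative to a
    shared prior, working in odds space:
    O(H|E1,E2) = O(H) * [O(H|E1)/O(H)] * [O(H|E2)/O(H)]."""
    def odds(p):
        return p / (1.0 - p)
    o = odds(prior) * (odds(p_attr) / odds(prior)) \
                    * (odds(p_class) / odds(prior))
    return o / (1.0 + o)

# e.g. two moderately confident components reinforce each other
print(prospector_combine(0.1, 0.6, 0.7))  # > max(0.6, 0.7)
```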

26
The extraction process (5/5)
  • Insert valid ICs into the AC lattice
  • valid ICs were assembled during the IC generation
    phase
  • the score of a valid IC reflects all extraction
    evidence of its class
  • all unpruned valid ICs are inserted into the AC
    lattice, scored by log(P(IC))
  • The best path BP is calculated through the joint
    IC/AC lattice (n-best supported)
  • the search algorithm allows constraints to be
    defined over the extracted path(s)
  • e.g. min/max count of extracted instances
  • output all ACs and ICs on BP

27
Extraction evidence based on formatting
  • A simple wrapper induction algorithm
  • identify formatting regularities
  • turn them into local context patterns to boost
    contained ACs
  • Assemble distinct formatting subtrees rooted at
    block elements containing ACs from the best path
    BP currently determined by the system
  • For each subtree S and attribute Att, calculate
    (see the sketch below)
  • C(S, Att) ... the number of occurrences of S that
    contain an AC of Att from BP
  • prec(Att|S) = C(S, Att) / C(S), where C(S) is the
    total number of occurrences of S
  • If both C(S, Att) and prec(Att|S) reach defined
    thresholds, a new local context pattern is created,
    with its precision set to prec(Att|S) and its recall
    close to 0 (in order not to harm potential singleton
    ACs)
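
A sketch of this threshold test, assuming occurrences of each formatting
subtree have already been grouped under a structural signature; all
names and default thresholds are illustrative:

```python
from collections import Counter

def induce_local_patterns(occurrences, c_min=5, prec_min=0.7):
    """Sketch of the pattern-creation test above. `occurrences` maps a
    formatting-subtree signature S to the list of attribute names found
    in each occurrence of S (None when an occurrence contains no AC
    from the current best path BP)."""
    patterns = []
    for s, atts in occurrences.items():
        for att, c_s_att in Counter(a for a in atts if a).items():
            prec = c_s_att / len(atts)   # prec(Att|S) = C(S, Att) / C(S)
            if c_s_att >= c_min and prec >= prec_min:
                # new local context pattern: precision = prec(Att|S),
                # recall kept near 0 so singleton ACs are not penalized
                patterns.append((s, att, prec, 1e-6))
    return patterns
```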

[Diagram: a formatting subtree (TD → A_href → B) learned from
known names like "John Doe" (jdoe@web.ca) and applied to
unknown names like "Argentina Agosto" (aa@web.br)]
28
Agenda
  1. Example applications of Web IE
  2. Difficulties in practical applications
  3. Extraction Ontologies
  4. Extraction process
  5. Experimental results
  6. Future work and Conclusion

29
Experimental results: Seminar announcements
  • 485 English seminar announcement text documents
  • Manual: extraction ontology created after inspecting
    40 randomly chosen documents, evaluated using the
    remaining 445
  • Manual+CRF: the same extraction ontology equipped
    with a CRF classifier used as further extraction
    evidence; 10-fold cross-validation using the test set
    above

30
Cost of the IE system: Seminar announcements
  • Creation of the extraction ontology: 1-2 person-weeks
  • annotate 40 training documents (expect 1-2 days)
  • inspecting examples in the 40 documents
  • writing patterns and axioms, iterating
  • Training an inductive model in addition to the
    extraction ontology
  • 2-3 person-weeks to annotate training data (445
    docs)
  • F-measure improvement of 2 to 6 points
  • extraction ontologies allow for fast, flexible
    prototyping (annotation design changes are quickly
    reflected)
  • then, for parts of the extraction ontology that need
    accuracy improvement, obtain more training data and
    reuse all manual extraction evidence already provided
    as features

31
Experimental results: Contact information
  • 109 English contact pages, 200 Spanish, 108 Czech
  • named entity counts: 7000, 5000 and 11000,
    respectively; instances not labeled
  • Only the domain expert's evidence and formatting
    pattern induction were used
  • the domain expert saw 30 randomly chosen documents;
    the rest served as test data
  • Instance extraction done but not evaluated
  • Instance grouping
  • Vilain F-measure: 60-70
  • Vilain recall: share of correct coreference links
    recovered
  • Vilain precision: share of recovered links that are
    correct

32
Experimental results: Bicycle descriptions
  • Hidden Markov Model
  • trigram, naive topology
  • 103 labeled web pages, 12346 named entities
  • instances not labeled; instance extraction done but
    not evaluated
  • Single HMM for all extracted types
  • 1 Background state
  • 1 Target, 1 Prefix and 1 Suffix state type for
    each extracted slot
  • 1 + 3N states for N slots (enumerated in the sketch
    below)
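
A tiny sketch enumerating this naive topology's state inventory (the
slot names are hypothetical):

```python
def hmm_states(slots):
    """One shared Background state plus Prefix/Target/Suffix states per
    slot: 1 + 3N states for N extracted slots (naive topology)."""
    states = ["B"]
    for slot in slots:
        states += [f"P_{slot}", f"T_{slot}", f"S_{slot}"]
    return states

print(hmm_states(["name", "price", "frame"]))  # 1 + 3*3 = 10 states
```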

[Diagram: HMM topology with one shared Background state (B)
and Prefix (P), Target (T), Suffix (S) states for each slot]
33
Bicycle structured search interface
34
Future work
  • Attempt to improve a seed extraction ontology by
    bootstrapping using relevant pages retrieved from
    the Internet
  • Adapt the structure of extraction ontology
    according to data
  • e.g. add new attributes to represent product
    features

35
Conclusions
  • Tool and tutorial available at
  • http://eso.vse.cz/labsky/ex/
  • Presented an extraction ontology approach to
  • allow for fast prototyping of IE applications
  • accommodate extraction schema changes easily
  • utilize all available forms of extraction
    knowledge
  • the domain expert's knowledge
  • training data
  • formatting regularities found in web pages
  • Results
  • indicate that extraction ontologies can serve as
    a quick prototyping tool
  • accuracy of the prototyped ontology can be
    improved when training data become available

36
Acknowledgements
  • The research was partially supported by the EC
    under contract FP6-027026, "Knowledge Space of
    Semantic Inference for Automatic Annotation and
    Retrieval of Multimedia Content" (K-Space).
  • The medical website application is carried out in
    the context of the EC-funded (DG-SANCO) project
    MedIEQ.