1
On the Need to Bootstrap Ontology Learning with
Extraction Grammar Learning
Georgios Paliouras
Software & Knowledge Engineering Lab
Inst. of Informatics & Telecommunications
NCSR "Demokritos"
http://www.iit.demokritos.gr/paliourg
Kassel, 22 July 2005
2
Outline
  • Motivation and state of the art
  • SKEL research
  • Vision
  • Information integration in CROSSMARC.
  • Meta-learning for information extraction.
  • Context-free grammar learning.
  • Ontology enrichment.
  • Bootstrapping ontology evolution with multimedia
    information extraction.
  • Open issues

3
Motivation
  • Practical information extraction requires a
    conceptual description of the domain, e.g. an
    ontology, and a grammar.
  • Manual creation and maintenance of these
    resources is expensive.
  • Machine learning has been used to
  • Learn ontologies based on extracted instances.
  • Learn extraction grammars, given the conceptual
    model.
  • Study how the two processes interact and the
    possibility of combining them.

4
Information extraction
  • Common approach: shallow parsing with regular
    grammars.
  • Limited use of deep analysis to improve
    extraction accuracy (HPSGs, concept graphs).
  • Linking of extraction patterns to ontologies
    (e.g. information extraction ontologies).
  • Initial attempts to combine syntax and semantics
    (Systemic Functional Grammars).
  • Learning simple extraction patterns (regular
    expressions, HMMs, tree-grammars, etc.)

5
Ontology learning
  • Deductive approach to ontology modification
    driven by linguistic rules.
  • Inductive identification of new concepts/terms.
  • Clustering, based on lexico-syntactic analysis of
    the text (subcat frames).
  • Formal Concept Analysis for term clustering and
    concept identification.
  • Clustering and merging of conceptual graphs
    (conceptual graph theory).
  • Deductive learning of extraction grammars in
    parallel with the identification of concepts.

6
Outline
  • Motivation and state of the art
  • SKEL research
  • Vision
  • Information integration in CROSSMARC.
  • Meta-learning for information extraction.
  • Context-free grammar learning.
  • Ontology enrichment.
  • Bootstrapping ontology evolution with multimedia
    information extraction.
  • Open issues

7
SKEL - vision
  • Research objective: innovative knowledge
    technologies for reducing the information
    overload on the Web.
  • Areas of research activity
  • Information gathering (retrieval, crawling,
    spidering)
  • Information filtering (text and multimedia
    classification)
  • Information extraction (named entity recognition
    and classification, role identification,
    wrappers, grammar and lexicon learning)
  • Personalization (user stereotypes and
    communities)
  • Ontology learning and population

8
Outline
  • Motivation and state of the art
  • SKEL research
  • Vision
  • Information integration in CROSSMARC.
  • Meta-learning for information extraction.
  • Context-free grammar learning.
  • Ontology enrichment.
  • Bootstrapping ontology evolution with multimedia
    information extraction.
  • Open issues

9
CROSSMARC Objectives
Develop technology for Information Integration
that can
  • crawl the Web for interesting Web pages,
  • extract information from pages of different sites
    without a standardized format (structured,
    semi-structured, free text),
  • process Web pages written in several languages,
  • be customized semi-automatically to new domains
    and languages,
  • deliver integrated information according to
    personalized profiles.

10
CROSSMARC Architecture
[Architecture diagram, centred on the shared domain ontology]
11
CROSSMARC Ontology
Ontology:

<description>Laptops</description>
<features>
  <feature id="OF-d0e5">
    <description>Processor</description>
    <attribute type="basic" id="OA-d0e7">
      <description>Processor Name</description>
      <discrete_set type="open">
        <value id="OV-d0e1041">
          <description>Intel Pentium 3</description>
        </value>
      </discrete_set>
    </attribute>
  </feature>
</features>

Lexicon:

<node idref="OV-d0e1041">
  <synonym>Intel Pentium III</synonym>
  <synonym>Pentium III</synonym>
  <synonym>P3</synonym>
  <synonym>PIII</synonym>
</node>

Greek Lexicon:

<node idref="OA-d0e7">
  <synonym>Όνομα επεξεργαστή</synonym> <!-- "Processor Name" -->
</node>
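A lexicon fragment like the one above can be read programmatically. A minimal sketch using Python's standard ElementTree, assuming the element and attribute names shown in the fragment (the nodes are wrapped in a root element for parsing; the full CROSSMARC schema is not given here):

```python
# Map each surface form (synonym) in a CROSSMARC-style lexicon
# fragment to the ontology value it refers to. Element and attribute
# names follow the fragment above; the wrapper root is an assumption.
import xml.etree.ElementTree as ET

LEXICON = """
<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
</lexicon>
"""

def synonym_index(lexicon_xml):
    """Return a dict from each synonym string to the ontology value id."""
    index = {}
    for node in ET.fromstring(lexicon_xml).iter("node"):
        for syn in node.iter("synonym"):
            index[syn.text.strip()] = node.get("idref")
    return index

index = synonym_index(LEXICON)   # index["PIII"] == "OV-d0e1041"
```

Such an index lets an extraction component normalise any surface appearance ("P3", "PIII", ...) to the single ontology value OV-d0e1041.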
12
Outline
  • Motivation and state of the art
  • SKEL research
  • Vision
  • Information integration in CROSSMARC.
  • Meta-learning for information extraction.
  • Context-free grammar learning.
  • Ontology enrichment.
  • Bootstrapping ontology evolution with multimedia
    information extraction.
  • Open issues

13
Meta-learning for Web IE
  • Motivation
  • There are many different learning methods,
    producing different types of extraction grammar.
  • In CROSSMARC we had four different approaches,
    with significant differences in the extracted
    information.
  • Proposed approach
  • Use meta-learning to combine the strengths of
    individual learning methods.

14
Meta-learning for Web IE
Stacked generalization
[Diagram: stacked generalization. The base-level dataset D is split
into folds Dj; on each D \ Dj the learners L1…LN train classifiers
C1…CN, whose predictions C1(j)…CN(j) on Dj form the meta-level
vectors MDj. Their union, the meta-level dataset MD, is used by the
meta-learner LM to train the meta-level classifier CM, which maps the
base-level predictions for a new vector x to the class value y(x).]
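The cross-validation loop of stacked generalization can be sketched as follows; the fit/predict learner interface and the toy majority-class learner are illustrative stand-ins, not CROSSMARC's actual methods:

```python
# Sketch of stacked generalization: out-of-fold predictions of the
# base learners form the meta-level dataset MD on which the
# meta-learner trains. The MajorityLearner is a toy stand-in.

class MajorityLearner:
    """Toy learner: always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label = max(set(y), key=list(y).count)
        return self
    def predict(self, X):
        return [self.label for _ in X]

def stack_train(X, y, base_factories, meta_factory, n_folds=3):
    """Train base learners and a meta-learner via cross-validation."""
    folds = [list(range(i, len(X), n_folds)) for i in range(n_folds)]
    meta_X, meta_y = [], []
    for j in range(n_folds):
        train_idx = [i for k in range(n_folds) if k != j for i in folds[k]]
        Xtr = [X[i] for i in train_idx]
        ytr = [y[i] for i in train_idx]
        learners = [f().fit(Xtr, ytr) for f in base_factories]
        for i in folds[j]:                     # out-of-fold predictions
            meta_X.append([L.predict([X[i]])[0] for L in learners])
            meta_y.append(y[i])
    bases = [f().fit(X, y) for f in base_factories]  # retrain on all data
    meta = meta_factory().fit(meta_X, meta_y)
    return bases, meta

def stack_predict(bases, meta, x):
    """Map a new vector x to the class value y(x) via the meta-learner."""
    return meta.predict([[b.predict([x])[0] for b in bases]])[0]
```

The key point the diagram makes is that the meta-level dataset is built only from out-of-fold predictions, so the meta-learner sees honest estimates of each base learner's behaviour.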
15
Meta-learning for Web IE
Information Extraction is not naturally a classification task: in IE
we deal with text documents paired with templates, and each template
is filled with instances <t(s,e), f>.

TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB

Template T:
  t(s,e)                  s, e      Field f
  Transport ZX            47, 49    Model
  15                      56, 58    screenSize
  TFT                     59, 60    screenType
  Intel <b> Pentium III   63, 67    procName
  600 MHz                 67, 69    procSpeed
  256 MB                  76, 78    ram
16
Meta-learning for Web IE
Combining Information Extraction systems

TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB

T1, filled by the IE system E1:
  t(s, e)                 s, e      f
  Transport ZX            47, 49    model
  15                      56, 58    screenSize
  TFT                     59, 60    screenType
  Intel <b> Pentium III   63, 67    procName
  600 MHz                 67, 69    procSpeed
  256 MB                  76, 78    ram
  1 GB                    81, 83    ram

T2, filled by the IE system E2:
  t(s, e)                 s, e      f
  Transport ZX            47, 49    manuf
  TFT                     59, 60    screenType
  Intel <b> Pentium       63, 66    procName
  600 MHz                 67, 69    procSpeed
  256 MB                  76, 78    ram
  1 GB                    81, 83    HDcapacity
17
Meta-learning for Web IE
Creating a stacked template

TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB

Stacked template (ST):
  s, e      t(s, e)                 Field by E1   Field by E2   Correct field
  47, 49    Transport ZX            model         manuf         model
  56, 58    15                      screenSize    -             screenSize
  59, 60    TFT                     screenType    screenType    screenType
  63, 66    Intel <b> Pentium       -             procName      -
  63, 67    Intel <b> Pentium III   procName      -             procName
  67, 69    600 MHz                 procSpeed     procSpeed     procSpeed
  76, 78    256 MB                  ram           ram           ram
  81, 83    1 GB                    ram           HDcapacity    -
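The merge that produces the stacked template can be sketched as follows, using the spans and field names from the laptop example ("-" marks a system with no prediction for a span):

```python
# Merge the templates filled by several IE systems into a stacked
# template keyed on the text spans (s, e). Spans and field names
# follow the example above; "-" marks a missing prediction.
def stacked_template(templates):
    """templates: one dict {(s, e): field} per IE system."""
    spans = sorted({span for t in templates for span in t})
    return {span: [t.get(span, "-") for t in templates] for span in spans}

T1 = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
      (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram",
      (81, 83): "ram"}
T2 = {(47, 49): "manuf", (59, 60): "screenType", (63, 66): "procName",
      (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "HDcapacity"}

ST = stacked_template([T1, T2])   # e.g. ST[(47, 49)] == ["model", "manuf"]
```

Each row of ST (one list of per-system field predictions) is then a meta-level feature vector; the hand-filled correct field supplies its class label during training.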
18
Meta-learning for Web IE
Training in the new stacking framework
[Diagram: D is a set of documents paired with hand-filled templates.
For each fold Dj, the learners L1…LN train the IE systems E1…EN on
D \ Dj; applying E1(j)…EN(j) to Dj and comparing their templates with
the hand-filled ones yields stacked templates ST1, ST2, …, whose rows
form the meta-level feature vectors MDj. Their union, the meta-level
dataset MD, is used by LM to train the meta-level classifier CM.]
19
Meta-learning for Web IE
Stacking at run-time
[Diagram: at run time, a new document d is processed by the trained
IE systems E1…EN, producing templates T1…TN. These are merged into a
stacked template; for each of its rows the meta-level classifier CM
decides the instance <t(s,e), f>, and the decisions form the final
template T.]
20
Experimental results
F1-scores (combined recall and precision) on four benchmark domains and one of the CROSSMARC domains.
Domain Best base Stacking
Courses 65.73 71.93
Projects 61.64 70.66
Laptops 63.81 71.55
Jobs 83.22 85.94
Seminars 86.23 90.03
21
Outline
  • Motivation and state of the art
  • SKEL research
  • Vision
  • Information integration in CROSSMARC.
  • Meta-learning for information extraction.
  • Context-free grammar learning.
  • Ontology enrichment.
  • Bootstrapping ontology evolution with multimedia
    information extraction.
  • Open issues

22
Learning CFGs
  • Motivation
  • To provide more complex extraction patterns for
    less structured text.
  • To learn more compact and human-comprehensible
    grammars.
  • To be able to process large corpora containing
    only positive examples.
  • Proposed approach
  • Efficient learning of context free grammars from
    positive examples, guided by Minimum Description
    Length.

23
Learning CFGs
Introducing eg-GRIDS
  • Infers context-free grammars.
  • Learns from positive examples only.
  • Overgeneralisation controlled through a
    heuristic based on MDL.
  • Two basic/three auxiliary learning operators.
  • Two search strategies
  • Beam search.
  • Genetic search.

24
Learning CFGs
Model Length (ML) = GDL (Grammar Description Length) + DDL (Derivations Description Length)
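A toy illustration of such a two-part MDL score; the coding scheme below (bits per symbol occurrence for the grammar, bits per rule choice for the derivations) is a simplification for illustration, not eg-GRIDS' exact formulation:

```python
# Toy two-part MDL score ML = GDL + DDL for a context-free grammar.
# Assumed coding scheme (a simplification, not eg-GRIDS' exact one):
# GDL charges log2(#distinct symbols) bits per symbol occurrence in
# the productions; DDL charges log2(#alternatives of the nonterminal)
# bits for every rule choice made in deriving the training sentences.
import math

def gdl(grammar):
    """Grammar description length in bits."""
    symbols = {s for rhss in grammar.values() for rhs in rhss for s in rhs}
    symbols |= set(grammar)
    bits = math.log2(len(symbols))
    n_occurrences = sum(len(rhs) + 1 for rhss in grammar.values()
                        for rhs in rhss)      # +1 for the LHS of each rule
    return n_occurrences * bits

def ddl(grammar, derivations):
    """derivations: list of rule-choice sequences, one per sentence;
    each choice is (lhs, index of the chosen alternative)."""
    return sum(math.log2(len(grammar[lhs]))
               for deriv in derivations for lhs, _ in deriv)

# Dyck-1 grammar: S -> S S | ( S ) | epsilon
grammar = {"S": [["S", "S"], ["(", "S", ")"], []]}
# Leftmost derivation of "()": choose alternative 1, then 2 (epsilon).
ml = gdl(grammar) + ddl(grammar, [[("S", 1), ("S", 2)]])
```

The trade-off the heuristic exploits is visible even in this toy: a more general grammar shrinks GDL (fewer, shorter rules) but tends to grow DDL (more choices per derivation), so minimising their sum penalises overgeneralisation.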
25
Learning CFGs
26
Experimental results
  • The Dyck language with k = 1:  S → S S | ( S ) | ε
  • Errors of omission: failures to parse sentences generated
    from the correct grammar (longer test sentences than in the
    training set); indicates an overly specific grammar.
  • Errors of commission: failures of the correct grammar to
    parse sentences generated by the inferred grammar; indicates
    an overly general grammar.
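The two error measures can be illustrated on the Dyck-1 target language; the "overly specific" recognizer below is a hypothetical stand-in for a learned grammar, not an eg-GRIDS result:

```python
# The two error measures on the Dyck-1 target language
# S -> S S | ( S ) | epsilon. The "learned" recognizer is a
# hypothetical overly specific grammar accepting only ()()...().

def dyck1(s):
    """Recognizer for the correct Dyck-1 language (balanced parentheses)."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def overly_specific(s):
    """Hypothetical learned grammar: accepts only flat sequences ()()...()."""
    return len(s) % 2 == 0 and s == "()" * (len(s) // 2)

test_set = ["", "()", "(())", "()()", "(()())"]
# Errors of omission: correct-grammar sentences the learned one misses.
omission = [s for s in test_set if dyck1(s) and not overly_specific(s)]
# Errors of commission: learned-grammar sentences the correct grammar
# rejects (none here: the overly specific language is a subset of Dyck-1).
commission = [s for s in test_set if overly_specific(s) and not dyck1(s)]
```

Here the omission list contains the nested sentences "(())" and "(()())", showing how an overly specific grammar scores on omission while its commission error stays at zero.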

27
Experimental results
28
Experimental results
29
Outline
  • Motivation and state of the art
  • SKEL research
  • Vision
  • Information integration in CROSSMARC.
  • Meta-learning for information extraction.
  • Context-free grammar learning.
  • Ontology enrichment.
  • Bootstrapping ontology evolution with multimedia
    information extraction.
  • Open issues

30
Ontology Enrichment
  • We concentrate on instances.
  • Highly evolving domain (e.g. laptop descriptions):
  • New instances characterize new concepts,
    e.g. "Pentium 2" is an instance that denotes a
    new concept if it does not exist in the ontology.
  • New surface appearances of an instance,
    e.g. "PIII" is a different surface appearance of
    "Intel Pentium 3".
  • The poor performance of many Information
    Integration systems is due to their inability
    to handle the evolving nature of the domain.
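The instance-level decision can be sketched as a lookup against the ontology's synonymy relation; the normalisation step and the synonym table below are illustrative, not CROSSMARC's actual resources:

```python
# Decide whether an extracted instance is already known to the
# ontology (possibly under another surface form) or denotes a
# candidate new concept. Normalisation and table are illustrative.
def normalise(s):
    return " ".join(s.lower().split())

SYNONYMS = {                       # surface form -> canonical instance
    "intel pentium 3": "Intel Pentium 3",
    "pentium iii": "Intel Pentium 3",
    "piii": "Intel Pentium 3",
}

def classify_instance(surface):
    canonical = SYNONYMS.get(normalise(surface))
    if canonical:
        return ("known", canonical)
    return ("candidate new concept", surface)
```

Under this sketch, "PIII" resolves to the known instance "Intel Pentium 3", while "Pentium 2" falls through as a candidate new concept to be validated by the domain expert.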

31
Ontology Enrichment
Annotating Corpus Using Domain Ontology
[Diagram: the multilingual domain ontology is used to annotate the
corpus; machine learning over these annotations trains information
extraction, which produces additional annotations; these drive
ontology enrichment / population, validated by a domain expert.]
32
Finding synonyms
  • The number of instances for validation increases
    with the size of the corpus and the ontology.
  • There is a need to support the enrichment of
    the synonymy relationship.
  • Automatically discover different surface
    appearances of an instance (the CROSSMARC
    synonymy relationship).
  • Issues to be handled:

Type             Example
Synonym          Intel pentium 3 - Intel pIII
Orthographical   Intel p3 - intell p3
Lexicographical  Hewlett Packard - HP
Combination      Intell Pentium 3 - P III
33
COCLU
  • COCLU (COmpression-based CLUstering): a model-based
    algorithm that discovers typographic similarities
    between strings (sequences of elements/letters) over
    an alphabet (ASCII characters), employing a new score
    function, CCDiff.
  • CCDiff is defined as the difference in the code
    length of a cluster (i.e., of its instances) when a
    candidate string is added. Huffman trees are used as
    models of the clusters.
  • COCLU iteratively computes the CCDiff of each new
    string from each cluster, implementing a
    hill-climbing search. The new string is added to
    the closest cluster, or a new cluster is created
    (threshold on CCDiff).
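A sketch of this clustering step: here the cost of a cluster is the total Huffman-encoded length of its strings, and CCDiff is the increase in that cost when the candidate is added. This follows the description above, but the exact coding details of COCLU are an assumption, and the threshold value is illustrative:

```python
# Sketch of COCLU's assignment step: Huffman trees as cluster models,
# CCDiff = increase in the cluster's total code length when a
# candidate string is added. Coding details are an assumption.
import heapq
from collections import Counter

def huffman_lengths(freq):
    """Code length (bits) per character for an optimal Huffman code."""
    heap = [(n, i, {c: 0}) for i, (c, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next_id,
                              {c: l + 1 for c, l in {**d1, **d2}.items()}))
        next_id += 1
    return heap[0][2]

def code_length(strings):
    """Total encoded length of a cluster's strings, in bits."""
    freq = Counter("".join(strings))
    if len(freq) < 2:
        return sum(freq.values())          # degenerate one-symbol code
    lengths = huffman_lengths(freq)
    return sum(freq[c] * lengths[c] for c in freq)

def ccdiff(cluster, s):
    return code_length(cluster + [s]) - code_length(cluster)

def assign(clusters, s, threshold):
    """Add s to the cluster with minimal CCDiff, or open a new one."""
    best = min(clusters, key=lambda c: ccdiff(c, s))
    if ccdiff(best, s) > threshold:
        clusters.append([s])
    else:
        best.append(s)

clusters = [["Intel Pentium 3", "intel pentium3"], ["Hewlett Packard", "HP"]]
# A typographic variant should cost fewer extra bits in the similar cluster:
print(ccdiff(clusters[0], "Intell Pentium 3"),
      ccdiff(clusters[1], "Intell Pentium 3"))
```

Because a typographic variant reuses the character distribution of its cluster, its CCDiff against that cluster is smaller than against an unrelated one, which is what makes the hill-climbing assignment work.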

34
Experimental results
Discovering lexical synonyms: assign an instance to a group, while proportionally decreasing the number of instances initially available in each group.
Initial 2nd iter.
15/58 48/58
28/58 56/58
40/58 57/58
Discovering new instances: hide part of the known instances; evolve the ontology and grammars to recover them.
35
Outline
  • Motivation and state of the art
  • SKEL research
  • Vision
  • Information integration in CROSSMARC.
  • Meta-learning for information extraction.
  • Context-free grammar learning.
  • Ontology enrichment.
  • BOEMIE Bootstrapping ontology evolution with
    multimedia information extraction.
  • Open issues

36
BOEMIE - motivation
  • Multimedia content grows at an increasing rate in
    public and proprietary webs.
  • Hard to provide semantic indexing of multimedia
    content.
  • Significant advances in automatic extraction of
    low-level features from visual content.
  • Little progress in the identification of
    high-level semantic features
  • Little progress in the effective combination of
    semantic features from different modalities.
  • Great effort in producing ontologies for semantic
    webs.
  • Hard to build and maintain domain-specific
    multimedia ontologies.

37
BOEMIE- approach
[Diagram: content collection (crawlers, spiders, etc.) feeds
semantics extraction; the extraction results drive ontology
evolution, in which population and enrichment are coordinated,
turning the intermediate ontology into an evolved ontology, with
other ontologies as additional input.]
38
Outline
  • Motivation and state of the art
  • SKEL research
  • Vision
  • Information integration in CROSSMARC.
  • Meta-learning for information extraction.
  • Context-free grammar learning.
  • Ontology enrichment.
  • Bootstrapping ontology evolution with multimedia
    information extraction.
  • Open issues

39
KR issues
  • Is there a common formalism to capture the
    necessary semantic, syntactic and lexical
    knowledge for IE?
  • Is that better than having separate
    representations for different tasks?
  • Do we need an intermediate formalism (e.g.
    grammar / CG / ontology)?
  • Do we need to represent uncertainty (e.g. using
    probabilistic graphical models)?

40
ML issues
  • What types and which aspects of grammars and
    conceptual structures can we learn?
  • What training data do we need? Can we reduce the
    manual annotation effort?
  • What background knowledge do we need and what is
    the role of deduction?
  • What is the role of multi-strategy learning,
    especially if complex representations are used?

41
Content-type issues
  • What is the role of semantically annotated
    content in learning, e.g. as training data?
  • What is the role of hypertext as a graph?
  • Can we extract information from multimedia
    content?
  • How can ontologies and learning help improve
    extraction from multimedia?

42
Acknowledgements
  • This is the work of many current and past members
    of SKEL.
  • CROSSMARC is joint work of the project consortium
    (NCSR Demokritos, Uni of Edinburgh, Uni of Roma
    Tor Vergata, Veltinet, Lingway).