Automating Ontology Building: Ontologies for the Semantic Web and Knowledge Management


Transcript and Presenter's Notes

Title: Automating Ontology Building: Ontologies for the Semantic Web and Knowledge Management

Automating Ontology Building: Ontologies for the
Semantic Web and Knowledge Management
  • Christopher BREWSTER
  • Department of Computer Science,
  • University of Sheffield

  • The Need for Ontologies and Taxonomies
  • Problems with Knowledge Acquisition
  • Methodological Criteria
  • Coherence
  • Multiplicity
  • Ease of Computation
  • Labels
  • Data Sources
  • Linking/ associating terms
  • Constructing Hierarchies
  • Labelling Relations
  • Conclusions

The Need
  • Ontologies and Taxonomies are needed for:
  • Ontologies for the Semantic Web
  • Central component for agent services over the
  • Knowledge acquisition for knowledge management
  • Minds of employees: intangible assets
  • Ontologies act as index to memory of an organisation
  • Many organisations have built or are building
    their own ontologies/taxonomies (e.g. BBC,
    British Council, Clifford Chance, etc.)

The Need (2)
  • Navigational Aids e.g. Yahoo, Northern Lights,
    corporate intranets, …
  • Component in LT systems
  • etc.
  • BUT complex hand-built taxonomies/ ontologies
    such as Microkosmos, Cyc, WordNet, etc. are not
    used in applications!

Problems: Knowledge Representation
  • Widely-held assumption: knowledge can be codified
    in an ontology
  • Ontology: "formal explicit specification of a
    shared conceptualisation" (Gruber)
  • "a document or file that formally defines the
    relations among terms" (Berners-Lee)

Problems: Knowledge Representation (2)
  • Continuum: Ontologies ↔ Taxonomies ↔ Other
    Semantic Networks
  • Differences lie in the degree of logical rigour,
    formality and the potential for reasoning over
    the data structure

Problems: Ontology/Taxonomy Construction
  • Current focus: formal criteria, e.g. consistency,
    completeness, conciseness (e.g. Gomez-Perez,
  • Idealised aspirations similar to those in
  • Common assumption: users will willingly
    contribute to construction of a formal ontology
    (e.g. Stutt & Motta)
  • Reality: both librarians and companies know that
    authors tag their texts inappropriately

Manual Labour
  • All current ontologies/ taxonomies are hand built
  • Yahoo, Northern Lights (browsable taxonomy)
  • Gene Ontology
  • Company internal (e.g. Arthur Andersen,
  • Computers cannot be relied on.
  • Some are mergers of existing taxonomies
  • Company merger ⇒ ontology merger (e.g.

Specific Issues (1)
  • High cost of human labour in initial development
    / editorial task
  • Category construction
  • Content association
  • Knowledge is in continuous flux: out of date on
    day of publication
  • Ontologies/Taxonomies need to be domain specific
  • General ontologies not very helpful without a lot
    of work

Specific Issues (2)
  • Ontologies/Taxonomies reflect a particular
    perspective on the world, e.g. categories like
    business opportunity
  • Categories are abstractions, derived from an
    analytic framework, e.g. nouns or business
  • Ontologies are shared conceptualisations, but it is
    often very difficult for human beings to agree on
    categorising the world (e.g. problems with global
  • Problems 1-3 imply need for automated
  • Problems 4-6 imply impossibility of such an
  • Ontology construction involves judicious
    integration of automated methods with manual

Data Sources
  • Traditionally and currently
  • Protocol analysis
  • Introspection
  • Both slow
  • Both subjective
  • Both very costly
  • Future
  • Automated Text/Corpus analysis
  • Information Extraction from texts
  • Automated ontology building must be based on
    texts, since we cannot enter people's minds
  • Further in Future
  • Integration with generated dialogue ….

Methodological Criteria
  • Set of criteria to
  • Guide choice and development of tools and
  • Contribute to evaluation of ontology construction
    methodologies by going beyond idealised abstract

1. Coherence
  • The algorithm(s) must produce output coherent for
    the user
  • Coherence appears to user as reflecting common
    sense i.e. shared conceptualisation of Gruber
  • Linguistic coherence ≠ encyclopaedic coherence
  • Tennis problem in WordNet
  • Very difficult to evaluate
  • No criteria for degree of correctness
  • Easy to spot algorithms which produce rubbish
  • Help!

2. Multiplicity
  • Algorithm(s) must allow for multiple placement of
    the same term in the ontology
  • Multiplicity ≠ semantic ambiguity
  • cat ISA mammal
  • cat ISA pet
  • Classic problem in librarianship
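
The multiplicity criterion amounts to storing the hierarchy as a directed acyclic graph rather than a tree. A minimal sketch follows; the `Taxonomy` class and its method names are hypothetical, and the extra `mammal ISA animal` link is added only for illustration.

```python
from collections import defaultdict


class Taxonomy:
    """Toy ISA store in which a term may have several parents (a DAG, not a tree)."""

    def __init__(self):
        # term -> set of direct parents
        self.parents = defaultdict(set)

    def add_isa(self, child, parent):
        """Record `child ISA parent`; repeated calls add further parents."""
        self.parents[child].add(parent)

    def ancestors(self, term):
        """All terms reachable by following ISA links upwards."""
        seen = set()
        stack = [term]
        while stack:
            node = stack.pop()
            for parent in self.parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


taxonomy = Taxonomy()
taxonomy.add_isa("cat", "mammal")
taxonomy.add_isa("cat", "pet")        # second placement of the same term
taxonomy.add_isa("mammal", "animal")  # illustrative extra link
```

Here `cat` keeps a single sense but sits under both `mammal` and `pet`, which is the distinction the slide draws between multiplicity and ambiguity.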

3. Ease of Computation
  • Algorithms must not have excessive complexity and
    consequent computational processing cost.
  • Ontologies must be kept current
  • Feedback to editors must be acceptable
  • Certain algorithms have very high complexity
    (e.g. Brown et al. 1992, where it is O(V^5), with V the
    number of types in the corpus).
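
To give a feel for why a fifth-power cost rules out frequent rebuilding, the sketch below prints order-of-magnitude operation counts for a few vocabulary sizes. The function name is hypothetical and constant factors are ignored, so the figures are only indicative.

```python
def naive_merge_cost(vocabulary_size: int) -> int:
    """Order-of-magnitude operation count for an O(V^5) procedure (constants ignored)."""
    return vocabulary_size ** 5


for v in (100, 1_000, 10_000):
    print(f"V = {v:>6}: ~{naive_merge_cost(v):.0e} operations")
```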

4. Lone Labels
  • The algorithm generates nodes with simple labels
    consisting of only one term.
  • Complex labels are not user friendly
  • Some approaches (e.g. Scatter/Gather) generate
    complex labels
  • This does not preclude synonyms acting as
    alternative labels
  • A bag of words is not acceptable

5. Data Source
  • The algorithm(s) must use texts (corpora) as
    primary data sources, AND allow the extension of
    existing ontologies.
  • Written texts are the most appropriate data
    sources (quantity, quality, accessibility)
  • Seed ontologies or existing complex data
    structures need to be taken into account, e.g.
    the company's own top-level

  • Linking terms
  • Methods to associate one word/term with another
  • Organising terms
  • Methods to organise terms into a structure e.g. a
  • Labelling term relations
  • Methods to label the relationship between terms

Linking/associating terms
  • Objective: given term a, list a set of associated
    terms b1, b2, b3, …, bn
  • Many, many techniques: correspondence analysis,
    distributional analysis, using MI in a window
    etc. ….
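
As one instance of "MI in a window", the following sketch ranks the words that co-occur with a target term by pointwise mutual information. The function name, the window size and the simple probability estimates (counts divided by corpus length) are assumptions made for illustration; real systems usually add frequency cut-offs and smoothing.

```python
import math
from collections import Counter


def associated_terms(tokens, target, window=2, top_n=5):
    """Rank words found within +/- `window` tokens of `target` by pointwise MI."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[tokens[j]] += 1

    total = len(tokens)
    p_target = word_counts[target] / total
    scores = {}
    for word, joint in pair_counts.items():
        p_joint = joint / total            # rough estimate of P(target, word)
        p_word = word_counts[word] / total
        scores[word] = math.log2(p_joint / (p_target * p_word))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


tokens = "the cat sat on the mat the cat ate the fish".split()
ranked = associated_terms(tokens, "cat", window=1)
```

In this toy corpus the frequent word `the` is ranked below `sat` and `ate`, because PMI discounts words that occur everywhere.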

Linking/associating terms: Associated Words
  • Idea of Mike Scott (among others)
  • Input: corpus a and a reference corpus b
  • Key words: words unusually frequent in corpus a
    in comparison with reference corpus b
  • Key-key words: words which are key in more than
    one text; the more texts, the more key
  • Associated words of wi: key words which co-occur
    with wi in the same texts
  • 2 factors: (i) comparison with reference corpus,
    and (ii) cross-text frequency
  • Results can be very good (e.g. when using an
    encyclopaedia) but poor when using random texts.
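
The three notions on this slide can be sketched as follows. This is a simplification: keyness is approximated here by a smoothed relative-frequency ratio, which is a stand-in chosen for brevity and not necessarily the statistic used in the original work. All function names and the `min_ratio` and `min_texts` thresholds are hypothetical.

```python
from collections import Counter


def key_words(text_tokens, reference_tokens, min_ratio=2.0):
    """Words whose relative frequency in the text is at least `min_ratio` times the reference rate."""
    text_counts, ref_counts = Counter(text_tokens), Counter(reference_tokens)
    text_total, ref_total = len(text_tokens), len(reference_tokens)
    keys = set()
    for word, count in text_counts.items():
        text_rate = count / text_total
        # Add-one smoothing so words unseen in the reference do not divide by zero.
        ref_rate = (ref_counts[word] + 1) / (ref_total + 1)
        if text_rate / ref_rate >= min_ratio:
            keys.add(word)
    return keys


def key_key_words(texts, reference_tokens, min_texts=2):
    """Words that are key in at least `min_texts` texts, mapped to that number of texts."""
    counts = Counter(w for t in texts for w in key_words(t, reference_tokens))
    return {w: n for w, n in counts.items() if n >= min_texts}


def associates(word, texts, reference_tokens):
    """Key words sharing a text with `word`, with the number of shared texts."""
    shared = Counter()
    for tokens in texts:
        keys = key_words(tokens, reference_tokens)
        if word in keys:
            shared.update(keys - {word})
    return shared


reference = "the a of and the a of and the of".split() * 10
texts = [
    "the gene protein gene".split(),
    "gene protein the cell".split(),
    "the of and a".split(),
]
```

With this toy data, `gene` and `protein` are key in two texts each, and the associates of `gene` are `protein` (two shared texts) and `cell` (one).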

Linking/associating terms: Collocational
  • Idea of Hays (among others)
  • Similarity between terms is measured by the number of
    identical words in a window (citation), plus the
    number of identical words in identical positions
  • He argues it is very effective in identifying
    similarity of meaning (95% accuracy)
  • but it works only 30% of the time (i.e. only 30%
    of citations show similarity to another citation)
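
The measure described above can be read as a simple count over two context windows. The sketch below assumes each citation is an equal-length list of context words aligned on the term of interest, with the term itself left out; that alignment, the function name and the example windows are assumptions, not details taken from the slides.

```python
def citation_similarity(citation_a, citation_b):
    """Shared word types plus words that match at the same position in both windows."""
    shared_words = len(set(citation_a) & set(citation_b))
    same_position = sum(1 for a, b in zip(citation_a, citation_b) if a == b)
    return shared_words + same_position


# Hypothetical windows around two different terms.
window_tea = ["a", "cup", "of", "hot", "with", "milk"]
window_coffee = ["a", "mug", "of", "hot", "and", "milk"]
score = citation_similarity(window_tea, window_coffee)
```

Four word types are shared and all four also sit in the same positions, so the positional component doubles the score for closely parallel contexts.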

Linking/associating terms: Syntactic Similarity
  • Grefenstette (1994)
  • Texts are shallow parsed and for each term, the
    words in specific syntactic relations are
    collected as attributes.
  • The set of attributes for each term is compared
    with the set for each other term using the
    Jaccard measure.
  • Example result
  • tissue cell growth cancer liver tumor
    resistance disease lens
  • but this also produces antonyms of the term as an
    output
Constructing Hierarchies
  • Objective: construct trees or Directed Acyclic
    Graphs from the terms in the vocabulary of the
  • Relations may or may not be specified between
  • Major problem is obtaining (candidate) labels for
    a specific cluster

Constructing Hierarchies (2)
  • Again many methods exist
  • Brown et al. (1992) merges classes based on MI
  • very high computational cost
  • No labels on nodes
  • No possibility of integrating with an existing
    data structure

Constructing Hierarchies (3)
  • McMahon and Smith (1996) combine top-down with
    bottom-up cluster formation.
  • lower computational cost,
  • still no labels
  • Scatter/Gather developed at Xerox
  • Strictly speaking only for documents not terms
  • Computationally tractable
  • Generates labels BUT consisting of many terms

Constructing Hierarchies (4)
  • Sanderson & Croft
  • Document-based lexical subsumption
  • Generates single term labels
  • Could allow use of existing hierarchy/ taxonomy
  • Problem of coherence
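
Document-based subsumption can be sketched as follows: term x is placed above term y when x occurs in (nearly) all documents that contain y, but not the other way round. The 0.8 threshold is a commonly quoted setting and should be treated as a tunable assumption; the function name and toy documents are hypothetical.

```python
from collections import defaultdict


def subsumption_pairs(doc_terms, threshold=0.8):
    """Return sorted (parent, child) pairs where P(parent|child) >= threshold and P(child|parent) < 1."""
    docs_with = defaultdict(set)
    for doc_id, terms in enumerate(doc_terms):
        for term in set(terms):
            docs_with[term].add(doc_id)

    pairs = []
    for x in docs_with:
        for y in docs_with:
            if x == y:
                continue
            both = len(docs_with[x] & docs_with[y])
            p_x_given_y = both / len(docs_with[y])
            p_y_given_x = both / len(docs_with[x])
            if p_x_given_y >= threshold and p_y_given_x < 1:
                pairs.append((x, y))
    return sorted(pairs)


docs = [{"animal", "cat"}, {"animal", "dog"}, {"animal", "cat"}, {"animal"}]
hierarchy = subsumption_pairs(docs)
```

Each parent is a single term, which satisfies the labelling criterion. Nothing in the statistic guarantees that a pair is a genuine ISA link, which is one way to read the coherence problem noted above.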

Labelling Relations
  • Most difficult challenge because
  • There is no set of commonly accepted relations
    (cf. parts of speech)
  • There is no known correlation between a relation
    (e.g. meronym) and specific patterns in texts.
  • It is an open question whether there is
    sufficient lexico-syntactic encoding in texts to
    make the establishment of relations between
    concepts extractable from texts.
  • Few methods exist ….

Labelling Relations: Synonyms and Substitutability
  • Identification of synonyms by substitutability
    tests (Church et al. 1994)
  • Uses t-test to determine the significance of the
    overlap between the syntactic objects of
    different verbs
  • Result is a table of candidate substitutes of a
    given verb
  • BUT, the result is not always one that fits
    nicely into a familiar category such as synonymy,
    antonymy, and hyponymy (ibid.)
  • Hays (1997) similar work using collocational

Labelling Relations: Hyponyms and
Lexico-syntactic patterns
  • Identification of hyponyms by lexico-syntactic
    patterns (Hearst 1992), e.g.
  • such NP as {NP,}* {(or | and)} NP, e.g. …works
    by such authors as Herrick, Goldsmith, and
  • Considerable manual effort involved in
    identifying patterns (also language specific)
  • Developed by Morin (1999) but no evaluation
Labelling Relations: Adaptiva, Ontology building
as IE

User/System Characteristics: a co-operative
model of user/system interaction in the context
of Knowledge Management
  • Characteristics of the user
  • Non-specialist
  • Can select a seed ontology
  • Can validate sentences as exemplars
  • Can label a relation exemplified
  • Characteristics of the system
  • Can process text at high speed
  • Can identify regularities
  • Can cluster patterns
  • Can establish that a relationship exists between
    term x and term y

  • Input
  • A corpus
  • A seed ontology, or some pairs of terms
  • A relation chosen or labelled by the user
  • Output
  • A set of pairs of terms associated with labelled
    lexico-syntactic patterns
  • An extended ontology
  • Key concept is an effective User Interface to
    allow user validation/ training of the system
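
The input/output contract listed above can be illustrated with a toy pattern harvester: given sentences, seed pairs and a user-chosen relation label, it records the text that links the two terms of each seed pair as a candidate pattern for the user to validate. This is only a sketch of the general idea and not the Adaptiva implementation; the function name and the "X ... Y" pattern notation are hypothetical.

```python
import re
from collections import defaultdict


def harvest_patterns(sentences, seed_pairs, relation):
    """Map (relation, pattern) to the seed pairs whose sentence context produced that pattern."""
    patterns = defaultdict(list)
    for sentence in sentences:
        lowered = sentence.lower()
        for x, y in seed_pairs:
            # Shortest stretch of text between the two seed terms, as whole words.
            match = re.search(rf"\b{re.escape(x)}\b(.*?)\b{re.escape(y)}\b", lowered)
            if match:
                connector = match.group(1).strip()
                patterns[(relation, f"X {connector} Y")].append((x, y))
    return dict(patterns)


sentences = [
    "A cat is a kind of mammal.",
    "The dog is a kind of mammal, too.",
    "Cats chase mice.",
]
seeds = [("cat", "mammal"), ("dog", "mammal")]
candidates = harvest_patterns(sentences, seeds, "ISA")
```

A pattern supported by several seed pairs, such as `X is a kind of Y` here, is a natural candidate to show the user first and then apply to the rest of the corpus to propose new pairs.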

The Adaptiva System

Adaptiva Interface

Building on Labelled Relations
  • Most existing methods inadequate
  • Need to combine methods to compensate for
    different weaknesses
  • Effective pre-processing
  • Candidate associations from statistical methods
  • Term recognition
  • Use existing ontologies (e.g. Gene Ontology) to
    provide candidate data for machine learning

Background and Foreground Knowledge in Dynamic
Ontology Construction
  • Texts as Knowledge Maintenance
  • Ontologies and Texts
  • Implicit and Explicit Knowledge
  • A Methodology
  • External resources Potentials and Limitations
  • Conclusion

A shared conceptualisation
  • An ontology is a formal, explicit specification
    of a shared conceptualisation, used to help
    programs and humans to share knowledge (Gruber
  • Shared! i.e. concepts held in common by the
    participants/ community of practice
  • Therefore the ontology is the background
    knowledge assumed by the writer/reader of a text.

Text and Knowledge
  • We want to generate ontologies from text
  • But if an ontology is shared/background knowledge,
    then a writer assumes the ontology to generate
    the text

Knowledge Maintenance
  • A text is an act of knowledge maintenance
  • Reinforcing assumptions of background knowledge
  • Altering links, associations and instantiations
    of existing concepts
  • Adding new concepts to the domain

Background Knowledge
  • If the ontology is background knowledge,
  • and background knowledge is implicit,
  • then the text(s) will not express the domain
  • Especially true of scientific papers
  • Less true of introductory textbooks, manuals,
    glossaries etc.
  • We expect to find specification of the
    ontological knowledge at the borders of a domain

Explicit vs. Implicit
  • Explicit ontological knowledge is found in
    lexico-syntactic phrases (Hearst 1992)
  • such NP as {NP,}* {(or | and)} NP, e.g. …works
    by such authors as Herrick, Goldsmith, and
  • NP, a NP that, e.g. isolation and
    characterisation of pbp, a protein that interacts
  • NP and other NPs, e.g. … malignant melanomas and
    other cancer cell types …
  • Implicit knowledge is not machine-readable, e.g. "death is
    a biological process" is implied by the Britannica
    article on death

An Approach
  • Assumption 1: no matter how large the corpus, a
    major part of the domain ontology will not be
  • Assumption 2: texts do specify explicitly
    ontological relations between terms
  • Therefore: go beyond the corpus! Seek external
    sources to compensate for the deficiencies of the

Using external sources
[Diagram: external sources, mid and high level ontology, Ontology Learner, low level ontology]

External Sources
  • Encyclopaedias
  • Textbooks and manuals
  • Glossaries (manually identified)
  • Google glossaries (i.e. automatically identified)
  • The Internet
  • There are pros and cons for each of these
    potential sources

Some data
  • Using the Gene Ontology
  • Using the subset of Nature Corpus
  • 10 pairs of terms

Example results

Overall results
  • Using the internet has the highest success rate
    but still very poor

  • Ambiguity: many terms are defined for entirely
    different domains or contexts.
  • Perspective: if the ontology has a particular
    perspective on the world, then the internet may
    not reflect that, i.e. the internet citations may
    dumb down.
  • Data sparsity: Zipf's law implies certain limits
  • Ontological coarseness: "vasoconstriction IS-A
    circulation" cannot be found

  • A domain ontology reflects background knowledge
  • This is implicit, i.e. never explicitly stated
  • No corpus will ever provide sufficient citations
    to construct the corresponding ontology
  • External sources need to be accessed

Research Issues
  • What external sources exist? What specialised
    sources exist?
  • How does one (automatically?) identify them? A
    web services application?
  • How can we determine what knowledge is absent
    from the corpus and decide to search elsewhere?
  • Can we use external sources to vote on the
    ontological statement to be derived?
  • What do we trust? The domain corpus, the
    internet, our human intuition? Why?