Background Knowledge for Ontology Construction

1
Background Knowledge for Ontology Construction
  • Blaž Fortuna,
  • Marko Grobelnik,
  • Dunja Mladenic,
  • Jožef Stefan Institute, Slovenia

2
Bag-of-words
  • Documents are encoded as vectors
  • Each element of the vector corresponds to the
    frequency of one word
  • Each word can also be weighted according to its
    importance
  • There exist various ways of selecting word
    weights. In our paper we propose a method to
    learn them!
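
A minimal sketch of the bag-of-words encoding described above, assuming whitespace tokenisation and a fixed vocabulary. The function name `bag_of_words` and the example weights are hypothetical, chosen only for illustration.

```python
from collections import Counter


def bag_of_words(text, vocabulary, weights=None):
    """Encode a document as a vector of (optionally weighted) word frequencies.

    `weights` maps a word to its importance; words without an entry keep
    weight 1.0. Tokenisation by lowercase whitespace split is an assumption.
    """
    counts = Counter(text.lower().split())
    weights = weights or {}
    return [counts[word] * weights.get(word, 1.0) for word in vocabulary]


vocabulary = ["stock", "market", "the"]
text = "The stock market and the stock price"

# Plain frequencies: "stock" twice, "market" once, "the" twice.
plain = bag_of_words(text, vocabulary)

# Hypothetical weights: boost "stock", treat "the" as noise.
weighted = bag_of_words(text, vocabulary, {"stock": 2.0, "the": 0.0})
```

Setting a weight to zero removes a noisy word from every later computation, while a larger weight makes an important word dominate comparisons between documents.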

[Figure: word weights, separating important words from noise]
3
SVM Feature selection
  • Input
      • Set of documents
      • Set of categories
      • Each document is assigned a subset of categories
  • Output
      • Ranking of words according to importance
  • Intuition
      • A word is important if it discriminates documents
        according to categories.
  • Basic algorithm
      • Learn a linear SVM classifier for each of the
        categories.
      • A word is important if it is important for
        classification into any of the categories.
  • Reference
      • Brank J., Grobelnik M., Milic-Frayling N.,
        Mladenic D. Feature selection using support
        vector machines.
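
The basic algorithm can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a plain batch subgradient-descent trainer (no bias term) stands in for a real SVM library, the importance of a word for one category is taken to be the absolute value of its component in the normal vector, and "important for any category" is read as the maximum over categories. All function names and the toy data are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Batch subgradient descent on the L2-regularised hinge loss (no bias)."""
    n = len(X[0])
    w = [0.0] * n
    for t in range(1, epochs + 1):
        eta = 1.0 / (lam * t)              # decreasing step size
        grad = [lam * wi for wi in w]      # gradient of the regulariser
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) < 1:        # margin violated: hinge term is active
                for k in range(n):
                    grad[k] -= yi * xi[k] / len(X)
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w


def rank_words(X, doc_categories, vocabulary):
    """Rank words by their largest absolute SVM weight over all categories."""
    categories = sorted(set().union(*doc_categories))
    importance = [0.0] * len(vocabulary)
    for cat in categories:
        # One-vs-rest labels: +1 if the document carries the category.
        y = [1 if cat in cats else -1 for cats in doc_categories]
        w = train_linear_svm(X, y)
        importance = [max(s, abs(wi)) for s, wi in zip(importance, w)]
    order = sorted(range(len(vocabulary)), key=lambda i: -importance[i])
    return [vocabulary[i] for i in order]


vocabulary = ["stock", "goal", "the"]
X = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]   # word frequencies
doc_categories = [{"finance"}, {"finance"}, {"sports"}, {"sports"}]
ranking = rank_words(X, doc_categories, vocabulary)
```

In this toy collection "the" occurs equally in every document, so it cannot discriminate between categories and ends up last in the ranking.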

4
Word weight learning
  • Algorithm
      • Calculate a linear SVM classifier for each category
      • Calculate word weights for each category from the
        SVM normal vectors. The weight for the i-th word
        and j-th category is
        [formula not preserved in the transcript]
      • Final word weights are calculated separately for
        each document
  • The word weight learning method is based on SVM
    feature selection.
  • Besides ranking the words, it also assigns them
    weights based on the SVM classifier.
  • Notation
      • N: number of documents
      • x_1, ..., x_N: documents
      • C(x_i): set of categories for document x_i
      • n: number of words
      • w_1, ..., w_n: word weights
      • n^j_1, ..., n^j_n: SVM normal vector for the j-th
        category
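
The per-category weight formula did not survive in this transcript, so the sketch below fills the gap with two explicit assumptions that may differ from the authors' definition: the weight of word i for category j is taken as |n^j_i|, and the final weights for a document x are taken as the sum of those per-category weights over the categories in C(x). The normal vectors are made-up values, and the function names are hypothetical.

```python
def category_word_weights(normal):
    """Per-category weights. ASSUMPTION: |n^j_i| for each word i."""
    return [abs(component) for component in normal]


def document_word_weights(categories_of_doc, normals):
    """Word weights for one document.

    ASSUMPTION: sum of the per-category weights over C(x), the set of
    categories assigned to the document.
    """
    n_words = len(next(iter(normals.values())))
    weights = [0.0] * n_words
    for category in categories_of_doc:
        per_category = category_word_weights(normals[category])
        for i, value in enumerate(per_category):
            weights[i] += value
    return weights


# Made-up SVM normal vectors over a three-word vocabulary.
normals = {
    "finance": [0.5, -0.25, 0.0],
    "slovenia": [-0.125, 0.0, 0.75],
}

w_finance_only = document_word_weights({"finance"}, normals)
w_both = document_word_weights({"finance", "slovenia"}, normals)
```

Because the weights depend on C(x), two documents containing the same words can end up with different weighted vectors when they belong to different categories.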

5
OntoGen system
  • System for semi-automatic ontology construction
  • Why semi-automatic? The system only gives
    suggestions to the user; the user always makes
    the final decision.
  • The system is data-driven and can scale to large
    collections of documents.
  • The current version focuses on constructing topic
    ontologies; the next version will be able to deal
    with more general ontologies.
  • Can import/export RDF.
  • There is a big divide between unsupervised and
    fully supervised construction tools.
  • Both approaches have weak points:
      • it is difficult to obtain the desired results
        using unsupervised methods, e.g. because of
        limited background knowledge
      • manual tools (e.g. Protégé, OntoStudio) are
        time-consuming, and the user needs to know the
        entire domain.
  • We combined these two approaches in order to
    eliminate these weaknesses:
      • the user guides the construction process,
      • the system helps the user with suggestions based
        on the document collection.

http://kt.ijs.si/blazf/examples/ontogen.html
6
How does OntoGen help?
  • By identifying the topics and relations between them
      • using k-means clustering
          • cluster of documents => topic
          • documents are assigned to clusters =>
            subject-of relation
          • We can repeat clustering on a subset of
            documents assigned to a specific topic =>
            identifies subtopics and the subtopic-of
            relation
  • By naming the topics
      • using the centroid vector
          • A centroid vector of a given topic is the
            average document from this topic (normalised
            sum of the topic's documents)
          • The most descriptive keywords for a given topic
            are the words with the highest weights in the
            centroid vector.
      • using a linear SVM classifier
          • An SVM classifier is trained to separate
            documents of the given topic from the other
            documents in the context
          • Words that are found most important for the
            classification are selected as keywords for the
            topic
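
The clustering and centroid-based naming steps can be sketched as below. This is not OntoGen's code: it assumes cosine similarity on length-normalised vectors, seeds the clusters with the first k documents so the run is deterministic, and uses a fixed number of iterations. All names and the toy data are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def centroid(vectors):
    """Normalised sum of the topic's documents."""
    return normalize([sum(column) for column in zip(*vectors)])


def kmeans(docs, k, iterations=10):
    """k-means with cosine similarity; seeded with the first k documents."""
    docs = [normalize(d) for d in docs]
    centroids = [docs[i] for i in range(k)]
    assignment = [0] * len(docs)
    for _ in range(iterations):
        # Assign each document to the most similar centroid.
        assignment = [
            max(range(k), key=lambda c: dot(d, centroids[c])) for d in docs
        ]
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids


def top_keywords(centroid_vector, vocabulary, n=2):
    """Words with the highest weights in the centroid vector."""
    ranked = sorted(zip(vocabulary, centroid_vector), key=lambda p: -p[1])
    return [word for word, _ in ranked[:n]]


vocabulary = ["stock", "market", "goal", "match"]
docs = [[2, 1, 0, 0], [0, 0, 1, 2], [1, 2, 0, 0], [0, 0, 2, 1]]

assignment, centroids = kmeans(docs, k=2)
names = [top_keywords(c, vocabulary) for c in centroids]
```

Each cluster becomes a candidate topic, and its top keywords are offered as a suggested name. Running `kmeans` again on the documents of one cluster would give subtopic suggestions in the same way.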

7
(No Transcript)
8
Topic ontology of Yahoo! Finances
9
Background knowledge in OntoGen
  • All of the methods in OntoGen are based on
    bag-of-words representation.
  • By using different word weights we can tune
    these methods to the user's needs.
  • The user needs to group the documents into
    categories. This can be done efficiently using
    active learning.

http://kt.ijs.si/blazf/examples/ontogen.html
10
Influence of background knowledge
  • Data: Reuters news articles
  • Each news article is assigned two different sets
    of tags:
      • Topics
      • Countries
  • Each set of tags offers a different view on the
    data

[Figure: the same documents shown in the Topics view and in the Countries view]
11
Links
  • OntoGen: http://ontogen.ijs.si/
  • Text Garden: http://www.textmining.net/