1
Automatic term categorization by extracting
knowledge from the Web
  • Leonardo Rigutini, Ernesto Di Iorio, Marco
    Ernandes and Marco Maggini
  • Dipartimento di Ingegneria dell'Informazione
  • Università degli Studi di Siena
  • {rigutini, diiorio, ernandes, maggini}@dii.unisi.it

2
Text mining
  • To provide semantic information for the entities
    extracted from text documents
  • Use of thesauri, gazetteers and domain-specific
    lexicons
  • Problems in maintaining these resources
  • Large amount of human effort in tracking changes
    and in adding new lexical entities

3
Term Categorization
  • A key task in the text mining research area
  • A lexicon or a more articulated structure
    (ontology) can be automatically populated by
    associating each unknown lexical entity with one
    or more semantic categories
  • The goal of term categorization is to label
    lexical entities using a set of semantic themes
    (disciplines, domains)

4
Lexicons
  • Domain-specific lexicons have been used in
    several tasks
  • Word-sense disambiguation
  • the semantic categories of the terms surrounding
    the target word help to disambiguate it
  • Query expansion
  • adding semantic information to the query makes it
    more specific and focused, increasing the
    precision of the answers

5
Lexicons
  • Cross-lingual text categorization
  • Ontologies are used to replace particular
    entities with their semantic category, thus
    reducing the temporal and geographic dependency
    of the content of documents. Entities like proper
    names or company brands depend on the country and
    the time in which the document was produced.
    Replacing them with their semantic category
    (politician or singer, computer or shoe
    manufacturer) improves the categorization of text
    documents.

6
Automatic term categorization
  • Several attempts to face the problem of automatic
    expansion of ontologies and thesauri have been
    proposed in the literature
  • F. Cerbah [1] proposed two possible approaches to
    the problem
  • Exogenous, where the sense of a term is inferred
    from the context in which it appears
  • Endogenous, in which the sense of a term relies
    only on statistical information extracted from the
    sequence of characters constituting the entity

[1] F. Cerbah, "Exogenous and endogenous approaches
to semantic categorization of unknown technical
terms," in Proceedings of the 18th International
Conference on Computational Linguistics (COLING)
7
Automatic term categorization
  • Sebastiani et al. proposed an exogenous approach
    that faced the problem as the dual of text
    categorization [2]
  • the Reuters corpus provided the knowledge base to
    classify terms
  • they tried to replicate the WordNet Domains
    ontology, selecting only the terms appearing in
    the Reuters corpus
  • Their approach showed low F1 values
  • high precision but very low recall (0.4)

[2] H. Avancini, A. Lavelli, B. Magnini,
F. Sebastiani, R. Zanoli, "Expanding
domain-specific lexicons by term categorization,"
in Proceedings of the 2003 ACM Symposium on
Applied Computing (SAC '03)
8
The proposed system
  • We propose a system to automatically categorize
    entities that exploits the Web to build an
    enriched representation of the entity, the Entity
    Context Lexicon (ECL)
  • the ECL is the list of all the words appearing in
    the context of the entity
  • for each word, some statistics are stored (term
    frequency, snippet frequency, etc.)
  • basically, an ECL is a bag-of-words
    representation of the words appearing in the
    context of the entity (a minimal data-structure
    sketch follows below)
  • The idea is that entities of the same semantic
    category should appear in similar contexts
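
The ECL can be pictured as a small bag-of-words structure that keeps, for every word found in the entity's snippets, its term frequency and its snippet frequency. The following minimal sketch is purely illustrative (class and method names are not the authors' implementation) and assumes each snippet has already been tokenized.

from collections import Counter
from dataclasses import dataclass, field

# Illustrative sketch of an Entity Context Lexicon (ECL): a bag of words
# over the snippet contexts of an entity, storing for each word its term
# frequency (total occurrences) and its snippet frequency (number of
# snippets containing it).
@dataclass
class ECL:
    entity: str
    term_freq: Counter = field(default_factory=Counter)
    snippet_freq: Counter = field(default_factory=Counter)

    def add_snippet(self, tokens):
        self.term_freq.update(tokens)          # occurrences over all snippets
        self.snippet_freq.update(set(tokens))  # one count per snippet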

9
System description
  • The system for term classification is composed of
    two modules
  • the training module is used to train the
    classifier from a set of labeled examples
  • the entity classification module is applied to
    predict the appropriate category for a given input
    entity
  • Both modules exploit the Web to build the ECL
    representation of the entities
  • They are composed of some sub-modules
  • two ECL generators, which build the ECLs
  • the classifier, which is trained to classify the
    unknown ECLs

10
System block diagram
11
The ECL generator
  • We chose to use the Web as the knowledge base to
    build the ECLs
  • The snippets returned by a search engine when the
    entity is submitted as a query report the contexts
    in which the query terms appear
  • The ECL of an entity e is simply the set of the
    context terms extracted from the snippets

12
The ECL generator
  • Given an entity e
  • it is submitted as query to a search engine
  • the related top-scored S snippets are collected
  • the terms in the snippets are used to build the
    ECL
  • for each word the term frequency and the snippet
    frequency are stored
  • In order to avoid inclusion of not significant
    terms, a stop-words list or feature selection
    technique can be used
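
The generation loop sketched below follows these steps: the entity is used as a query, the top-scored S snippets are tokenized, and the two frequencies are accumulated. The fetch_snippets callable is a hypothetical placeholder for the search-engine call (the experiments use Google), and the tokenizer and stop-word list are toy choices.

import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}  # toy stop-word list

def tokenize(text):
    # lower-cased alphabetic tokens, stop-words removed
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_ecl(entity, fetch_snippets, s=10):
    # fetch_snippets(query, s) is a hypothetical stand-in for the search
    # engine; it is assumed to return a list of snippet strings.
    term_freq, snippet_freq = Counter(), Counter()
    for snippet in fetch_snippets(entity, s)[:s]:
        tokens = tokenize(snippet)
        term_freq.update(tokens)           # term frequency
        snippet_freq.update(set(tokens))   # snippet frequency
    return {"entity": entity, "tf": term_freq, "sf": snippet_freq}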

13
The classifier
  • Each entity e is characterized by its
    corresponding ECL(e)
  • thus a set of labeled ECLs can be used to train
    an automatic classifier
  • then the trained classifier can be used to label
    the unlabeled ECLs
  • The most common classifier models can be used
  • SVM, Naive Bayes, Complement Naive Bayes and
    profile-based ones (e.g. Rocchio); a toy training
    example follows below
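
As one possible instantiation of this step, the toy example below trains a Complement Naive Bayes classifier on hand-made labeled ECLs with scikit-learn; the data and the DictVectorizer + ComplementNB choice are only an illustration of the idea, not the setup used in the experiments.

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import ComplementNB

# Toy labeled ECLs: each is a bag of words (word -> term frequency).
train_ecls = [
    {"goal": 3, "match": 2, "striker": 1},    # soccer
    {"album": 2, "guitar": 2, "concert": 1},  # music
]
train_labels = ["soccer", "music"]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(train_ecls)
clf = ComplementNB().fit(X, train_labels)

# Classify the ECL of an unknown entity.
unknown_ecl = {"goal": 2, "match": 1, "penalty": 1}
print(clf.predict(vectorizer.transform([unknown_ecl])))  # ['soccer']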

14
The CCL classifier
  • Following the idea that similar entities appear
    in similar contexts, we propose a new type of
    profile-based classifier
  • a profile for each class is built by merging the
    training ECLs associated with that class
  • a weight is computed for each term in the profile
    using a weighting function W
  • the obtained lexicon is called Class Context
    Lexicon (CCL)
  • a similarity function is used to measure the
    similarity of an unlabeled ECL with each CCL
  • When an unlabeled ECL is passed to the
    classifier
  • it is assigned to the class reporting the highest
    similarity score (a sketch of this procedure is
    given below)
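
A minimal sketch of this profile-based scheme, assuming raw term frequencies as the ECL representation and cosine as the similarity function; the function names and data structures are illustrative only.

import math
from collections import Counter, defaultdict

def build_ccls(labeled_ecls, weight_fn):
    # labeled_ecls: list of (class_label, Counter of term frequencies).
    # weight_fn is the weighting function W, mapping the merged raw counts
    # of a class to a term -> weight dict.
    merged = defaultdict(Counter)
    for label, ecl in labeled_ecls:
        merged[label].update(ecl)
    return {label: weight_fn(counts) for label, counts in merged.items()}

def cosine(a, b):
    # cosine similarity between two sparse term -> weight dicts
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(ecl, ccls, similarity=cosine):
    # assign the unlabeled ECL to the class with the highest similarity
    return max(ccls, key=lambda label: similarity(ecl, ccls[label]))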

15
The CCL classifier
  • Weighting functions
  • tf
  • tf-idf
  • snippet-frequency inverse class frequency
    (sficf), which gives a word a high score if it is
    very frequent in a class and infrequent in the
    remaining classes (an assumed formulation is
    sketched below)
  • Similarity functions
  • Euclidean similarity
  • Cosine similarity
  • Gravity similarity
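
The exact sficf formula is not given on the slides, so the sketch below is only an assumed tf-idf-style formulation of the stated intuition: a word scores high in a class when its snippet frequency there is high and few other classes contain it.

import math
from collections import Counter

def sficf(class_snippet_freqs):
    # class_snippet_freqs: class label -> Counter of snippet frequencies.
    # Assumed formulation: sficf(w, c) = sf(w, c) * log(|C| / cf(w)),
    # where cf(w) is the number of classes whose lexicon contains w.
    n_classes = len(class_snippet_freqs)
    class_freq = Counter()
    for sf in class_snippet_freqs.values():
        class_freq.update(sf.keys())
    return {
        label: {w: f * math.log(n_classes / class_freq[w]) for w, f in sf.items()}
        for label, sf in class_snippet_freqs.items()
    }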

16
Experimental results
  • We selected 8 categories
  • soccer, music, location, computer, politics,
    food, philosophy, medicine
  • For each of them we collected predefined
    gazetteers from the Web and sampled 200 entities
    per class. We performed tests varying the size of
    the learning set L_M, where M indicates the
    number of learning entities per class
  • We used Google as the search engine and we set
    the number of snippets selected for building each
    ECL to 10 (S = 10)

17
Experimental results
  • We tested all the classifiers listed previously
  • SVM, NB, CNB and CCL
  • and we used the F1 measure to evaluate the
    performance of the system (a toy computation is
    shown below)
  • First, we tested the CCL classifier by combining
    the weighting functions and the similarity
    functions listed previously
  • We selected the CCL configuration reporting the
    best performance and then compared it with the
    SVM, NB and CNB classifiers
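
As a reminder, F1 is the harmonic mean of precision and recall. The toy snippet below computes a macro-averaged F1 with scikit-learn; the averaging mode used in the actual experiments is not stated on the slides, so macro is only an assumption.

from sklearn.metrics import f1_score

# Toy illustration: F1 over predicted vs. true classes of test entities.
y_true = ["soccer", "music", "soccer", "politics"]
y_pred = ["soccer", "music", "politics", "politics"]
print(f1_score(y_true, y_pred, average="macro"))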

18
Performance of the CCL classifiers
  • We selected the CCL-sficf-gravity configuration
    as the best-performing CCL classifier

19
Overall performances
  • The CNB classifier showed the best performances,
    even if the CCL model results are comparable

20
Conclusions
  • We propose a system for Web-based term
    categorization oriented to automatic thesaurus
    construction
  • The idea is that terms from the same semantic
    category should appear in very similar contexts,
    i.e. contexts containing approximately the same
    words
  • the system builds an Entity Context Lexicon (ECL)
    for each entity using the Web as knowledge base
  • this enriched representation is used to train an
    automatic classifier
  • We tested the most common classifier models
    (SVM, Naive Bayes and Complement Naive Bayes)
  • Moreover, we propose a profile-based classifier
    called CCL that builds the class profiles by
    merging the learning ECLs

21
Conclusions
  • The experimental results show that the CNB
    classifier reports the best performances
  • However, the CCL classifier results are very
    promising and comparable with the CNB ones
  • Additional tests have been planned to consider a
    multi-label classification task and to verify the
    robustness of the system in out-of-topic cases