A New Term Weighting Method for Text Categorization (PPT transcript)
1
A New Term Weighting Method for Text Categorization
LAN Man
School of Computing, National University of Singapore
16 Apr 2007
2
Outline
  • Introduction
  • Motivation
  • Methodology of Research
  • Analysis and Proposal of a New Term Weighting Method
  • Experimental Research Work
  • Contributions and Future Work

3
Outline
  • Introduction
  • Motivation
  • Methodology of Research
  • Analysis and Proposal of a New Term Weighting Method
  • Experimental Research Work
  • Contributions and Future Work

4
Introduction: Background
  • Explosive increase of textual information
  • Organizing and accessing this information in flexible ways
  • Text Categorization (TC) is the task of classifying natural language documents into a predefined set of semantic categories

5
Introduction: Applications of TC
  • Categorize web pages by topic (directories like Yahoo!)
  • Customize online newspapers with different labels according to a particular user's reading preferences
  • Filter spam emails and forward incoming emails to the target expert by content
  • Word sense disambiguation can also be treated as a text categorization task once we view word occurrence contexts as documents and word senses as categories

6
Introduction: Two sub-issues of TC
  • TC is a discipline at the crossroads of machine learning (ML) and information retrieval (IR)
  • Two key issues in TC:
  • Text Representation
  • Classifier Construction
  • This thesis focuses on the first issue.

7
Introduction: Construction of Text Classifier
  • Approaches to building a classifier:
  • No more than 20 algorithms
  • Borrowed from Information Retrieval: Rocchio
  • Machine Learning: SVM, kNN, Decision Tree, Naïve Bayesian, Neural Network, Linear Regression, Decision Rule, Boosting, etc.
  • SVM performs better.

8
Introduction: Construction of Text Classifier
  • Little room to improve performance from the algorithm side:
  • Excellent algorithms are few.
  • The rationale is inherent to each algorithm, and the method is usually fixed for a given algorithm
  • Tuning parameters yields limited improvement

9
Introduction: Text Representation
  • Texts come in various formats, such as DOC, PDF, PostScript, HTML, etc.
  • Can a computer read them as we do?
  • Texts must be converted into a compact format so that they can be recognized and categorized by a classifier or a classifier-building algorithm.
  • This indexing procedure is also called text representation.

10
Introduction: Vector Space Model
Texts are vectors in the term space. Assumption: documents that are close together in the space are similar in meaning.
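To make the closeness assumption concrete, here is a minimal sketch (not from the slides; the whitespace tokenizer and raw term counts are simplifying assumptions) measuring similarity between two documents as the cosine of the angle between their term vectors:

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Treat each document as a term-frequency vector and return the
    cosine of the angle between the vectors: 1.0 means identical
    direction, 0.0 means no shared terms."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```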
11
Introduction: Text Representation
  • Two main issues in Text Representation:
  • 1. What should a term be? (Term Type)
  • 2. How to weight a term? (Term Weighting)

12
Introduction: 1. What Should a Term Be?
  • Sub-word level: syllables
  • Word level: single tokens
  • Multi-word level: phrases, sentences, etc.
  • Syntactic and semantic sense (meaning)

13
Introduction: 1. What Should a Term Be?
  • 1. Word senses (meanings), Kehagias 2001
  • The same word assumes different meanings in different contexts
  • 2. Term clustering, Lewis 1992
  • Group words with a high degree of pairwise semantic relatedness
  • 3. Semantic and syntactic representation, Scott and Matwin 1999
  • Relationships between words, i.e. phrases, synonyms and hypernyms

14
Introduction: 1. What Should a Term Be?
  • 4. Latent Semantic Indexing, Deerwester 1990
  • A feature reconstruction technique
  • 5. Combination approach, Peng 2003
  • Combine two types of indexing terms, i.e. words and 3-grams
  • 6. Theme Topic Mixture Model, Keller 2004
  • A graphical model
  • 7. Using keywords from summarization, Li 2003

15
Introduction: 1. What Should a Term Be?
  • In general, higher-level representations did not show good performance in most cases
  • Word level (bag-of-words) is better

16
Introduction: 2. How to Weight a Term?
  • Salton 1988 elaborated three considerations:
  • 1. Term occurrences closely represent the content of a document
  • 2. Other factors with discriminating power pick the relevant documents out from the irrelevant ones
  • 3. Consider the effect of document length

17
Introduction: 2. How to Weight a Term?
  • Simplest method: binary
  • Most popular method: tf.idf (see the formula below)
  • Combination with information-theoretic metrics or statistical methods: tf.chi2, tf.ig, tf.gr
  • Combination with a linear classifier
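For reference, the classical tf.idf weight mentioned above is conventionally written as

$$w(t, d) = tf(t, d) \cdot \log\frac{N}{df(t)}$$

where tf(t, d) is the frequency of term t in document d, N is the number of documents in the collection, and df(t) is the number of documents containing t.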

18
Outline
  • Introduction
  • Motivation
  • Methodology of Research
  • Analysis and Proposal of a New Term Weighting Method
  • Experimental Research Work
  • Contributions and Future Work

19
Motivation (1)
  • SVM is better.
  • Leopold (2002) stated that text representation dominates the performance of text categorization rather than the kernel functions of SVMs.
  • Which is the best method for an SVM-based text classifier among the widely-used ones?

20
Motivation (2)
  • Text categorization is a form of supervised learning
  • The prior information on the membership of training documents in predefined categories is useful in:
  • feature selection
  • supervised learning of the classifier

21
Motivation (2)
  • Supervised term weighting methods adopt this known information and consider the document distribution.
  • They are naturally expected to be superior to the unsupervised (traditional) term weighting methods.

22
Motivation: Three questions to be addressed
  • Q1. How can we propose a new effective term weighting method by using the important prior information given by the training data set?
  • Q2. Which is the best term weighting method for an SVM-based text classifier?

23
Motivation: Three questions to be addressed
  • Q3. Are supervised term weighting methods able to lead to better performance than unsupervised ones for TC?
  • What kinds of relationships can we find between term weighting methods and the widely-used learning algorithms, i.e. kNN and SVM, given different benchmark data collections?

24
Motivation: Sub-tasks of This Thesis
  • First, we will analyze terms' discriminating power for TC and propose a new term weighting method using the prior information of the training data set to improve the performance of TC.

25
Motivation: Sub-tasks of This Thesis
  • Second, we will explore term weighting methods for SVM-based text categorization and investigate the best method for an SVM-based text classifier.

26
Motivation: Sub-tasks of This Thesis
  • Third, we will extend the research to more general benchmark data sets and learning algorithms to examine the superiority of supervised term weighting methods and investigate the relationship between term weighting methods and different learning algorithms. Moreover, this study will be extended to a new application domain, i.e. biomedical literature classification.

27
Outline
  • Introduction
  • Motivation
  • Methodology of Research
  • Analysis and Proposal of a New Term Weighting Method
  • Experimental Research Work
  • Contributions and Future Work

28
Methodology of Research
  • Machine Learning Algorithms
  • SVM, kNN
  • Benchmark Data Corpora
  • Reuters News Corpus
  • 20Newsgroups Corpus
  • Ohsumed Corpus
  • 18 Journals Corpus

29
Methodology of Research
  • Evaluation Measures:
  • F1
  • Breakeven point
  • Significance Tests:
  • McNemar's significance test (a sketch of F1 and McNemar's test follows)
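A minimal sketch of these two measures (not part of the slides; the function names are illustrative and SciPy is assumed for the chi-square tail probability). F1 is the harmonic mean of precision and recall; McNemar's test compares two classifiers on the documents where they disagree.

```python
from scipy.stats import chi2

def f1(precision: float, recall: float) -> float:
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def mcnemar_pvalue(n01: int, n10: int) -> float:
    """McNemar's test with continuity correction.
    n01: documents misclassified by classifier A but not by B;
    n10: documents misclassified by B but not by A."""
    if n01 + n10 == 0:
        return 1.0  # the classifiers never disagree
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return chi2.sf(stat, df=1)  # chi-square tail, 1 degree of freedom
```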

30
Outline
  • Introduction
  • Motivation
  • Methodology of Research
  • Analysis and Proposal of a New Term Weighting Method
  • Experimental Research Work
  • Contributions and Future Work

31
Analysis and Proposal of a New Term Weighting Method
  • Three considerations:
  • 1. Term occurrence: binary, tf, ITF, log(tf)
  • 2. Term's discriminating power: idf
  • Note: chi2, ig (information gain), gr (gain ratio), mi (mutual information), or (odds ratio), etc.
  • 3. Document length: cosine normalization, linear normalization
  • (A sketch of these three factors follows.)
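A minimal sketch of the three kinds of factors listed above (helper names are illustrative; logtf and ITF follow the definitions log(1 + tf) and 1 - 1/(1 + tf) used in the methodology table later):

```python
import math

def term_occurrence(tf: int, variant: str = "tf") -> float:
    """Consideration 1: term-occurrence variants."""
    if variant == "binary":
        return 1.0 if tf > 0 else 0.0
    if variant == "logtf":
        return math.log(1 + tf)
    if variant == "ITF":
        return 1 - 1 / (1 + tf)  # inverse term frequency
    return float(tf)             # plain term frequency

def idf(n_docs: int, df: int) -> float:
    """Consideration 2: inverse document frequency (discriminating power)."""
    return math.log(n_docs / df)

def cosine_normalize(weights: list[float]) -> list[float]:
    """Consideration 3: document-length correction via cosine normalization."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights
```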

32
Analysis and Proposal of a New Term Weighting Method: Analysis of Terms' Discriminating Power
33
Analysis and Proposal of a New Term Weighting Method: Analysis of Terms' Discriminating Power
  • Assume they have the same tf value; t1, t2, t3 share the same idf1 value, and t4, t5, t6 share the same idf2 value.
  • Clearly, the six terms contribute differently to the semantics of documents.

34
Analysis and Proposal of a New Term Weighting Method: Terms' Discriminating Power -- idf
  • Case 1: t1 contributes more than t2 and t3; t4 contributes more than t5 and t6.
  • Case 2: t4 contributes more than t1 although idf(t4) < idf(t1).

35
Analysis and Proposal of a New Term Weighting Method: Terms' Discriminating Power -- idf, chi2, or
  • Case 3: for t1, t2, t3,
  • in idf value, t1 = t2 = t3
  • in chi2 value, t1 = t3 > t2
  • in or value, t1 > t2 > t3
  • Case 4: for t1 and t4,
  • in idf value, t1 > t4
  • in chi2 and or value, t1 < t4

36
Analysis and Proposal: A New Term Weighting Method -- rf
  • Intuitive consideration:
  • The more concentrated a high-frequency term is in the positive category than in the negative category, the more contribution it makes in selecting the positive samples from among the negative samples.

37
Analysis and Proposal: A New Term Weighting Method -- rf
  • rf: relevance frequency
  • rf is related only to the ratio of b to c; it does not involve d
  • The base of the log is 2
  • In case c = 0, set c = 1
  • The final weight of a term is tf.rf (a sketch follows)
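The rf formula itself appears on a chart not captured in this transcript; the sketch below assumes it matches the author's published definition, rf = log2(2 + b/c), with b the number of positive-category documents containing the term and c the number of negative-category documents containing it:

```python
import math

def rf(b: int, c: int) -> float:
    """Relevance frequency, assumed to be log2(2 + b/c).
    b: positive-category documents containing the term;
    c: negative-category documents containing the term, floored at 1
       per the slide's rule 'in case c = 0, set c = 1'."""
    return math.log2(2 + b / max(c, 1))

def tf_rf(tf: int, b: int, c: int) -> float:
    """Final term weight: tf.rf."""
    return tf * rf(b, c)
```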

38
Analysis and Proposal: Empirical observation
Comparison of idf, rf, chi2, or, ig and gr values of four features in category 00_acq of the Reuters Corpus

feature    idf     rf      chi2     or       ig     gr
acquir     3.553   4.368   850.66   30.668   0.125  0.161
stake      4.201   2.975   303.94   24.427   0.074  0.096
payout     4.999   1.000   10.87    0.014    0.011  0.014
dividend   3.567   1.033   46.63    0.142    0.017  0.022
39
Analysis and Proposal: Empirical observation
Comparison of idf, rf, chi2, or, ig and gr values of four features in category 03_earn of the Reuters Corpus

feature    idf     rf      chi2     or        ig     gr
acquir     3.553   1.074   81.50    0.139     0.031  0.032
stake      4.201   1.082   31.26    0.164     0.018  0.018
payout     4.999   7.820   44.68    364.327   0.041  0.042
dividend   3.567   4.408   295.46   35.841    0.092  0.095
40
Outline
  • Introduction
  • Motivation
  • Methodology of Research
  • Analysis and Proposal of a New Term Weighting Method
  • Experimental Research Work
  • Contributions and Future Work

41
Experiment Set 1: Exploring the best term weighting method for SVM-based text categorization
  • Purposes of Experiment Set 1:
  • 1. Explore the best term weighting method for SVM-based text categorization (Q2)
  • 2. Compare tf.rf with various traditional term weighting methods on SVM (Q1)

42
Experiment Set 1: Experimental Methodology -- 10 Methods

Group                              Denotation    Description
Traditional (term frequency)       binary        0 = absence, 1 = presence
Traditional (term frequency)       tf            term frequency alone
Traditional (term frequency)       logtf         log(1 + tf)
Traditional (term frequency)       ITF           1 - 1/(1 + tf)
Traditional (tf.idf and variants)  idf           idf alone
Traditional (tf.idf and variants)  tf.idf        classical tf * idf
Traditional (tf.idf and variants)  logtf.idf     log(1 + tf) * idf
Traditional (tf.idf and variants)  tf.idf-prob   tf * probabilistic idf
Supervised                         tf.chi2       tf * chi square
Supervised (our new method)        tf.rf         tf * rf
43
Experiment Set 1: Results on Reuters Corpus
44
Experiment Set 1: Results on a subset of the 20Newsgroups Corpus
45
Experiment Set 1: Conclusions (1)
  • tf.rf performs better consistently
  • No significant difference among the three term frequency variants, i.e. tf, logtf and ITF
  • No significant difference between tf.idf, logtf.idf and tf.idf-prob

46
Experiment Set 1: Conclusions (2)
  • The idf and chi2 factors, even though they consider the distribution of documents, do not improve, and may even decrease, the performance
  • binary and tf.chi2 significantly underperform the other methods

47
Experiment Set 2: Investigating supervised term weighting methods and their relationship with learning algorithms
  • Purposes of Experiment Set 2:
  • 1. Investigate supervised term weighting methods and their relationship with learning algorithms (Q3)
  • 2. Compare the effectiveness of tf.rf under more general experimental circumstances (Q1)

48
Experiment Set 2: Review
  • Supervised term weighting methods:
  • Use the prior information on the membership of training documents in predefined categories
  • Unsupervised term weighting methods:
  • Do not use this information
  • binary, tf, log(1 + tf), ITF
  • The most popular is tf.idf and its variants logtf.idf and tf.idf-prob

49
Experiment Set 2: Review -- Supervised Term Weighting Methods
  • 1. Combined with information-theoretic functions or statistical metrics
  • such as chi2, information gain, gain ratio, odds ratio, etc.
  • Used in the feature selection step
  • Select the most relevant and discriminating features for the classification task, that is, the terms with higher feature selection scores
  • The results are inconsistent and/or incomplete.

50
Experiment Set 2: Review -- Supervised Term Weighting Methods
  • 2. Interaction with a supervised text classifier
  • Linear SVM, Perceptron, kNN
  • The text classifier separates the positive test documents from the negative ones by assigning different scores to the test samples; these scores are believed to be effective in assigning more appropriate weights to terms

51
Experiment Set 2: Review -- Supervised Term Weighting Methods
  • 3. Based on statistical confidence intervals -- ConfWeight
  • Linear SVM
  • Compared with tf.idf and tf.gr (gain ratio)
  • The results failed to show that supervised methods generally perform better than unsupervised methods.

52
Experiment Set 2
  • Hypothesis:
  • Since supervised schemes consider the document distribution, they should perform better than unsupervised ones
  • Motivation:
  • 1. Are supervised term weighting methods superior to the unsupervised (traditional) methods?
  • 2. What kinds of relationships hold between term weighting methods and the learning algorithms, i.e. kNN and SVM, given different data collections?

53
Experiment Set 2: Methodology

Group                        Denotation   Description
Unsupervised Term Weighting  binary       0 or 1
Unsupervised Term Weighting  tf           term frequency
Unsupervised Term Weighting  tf.idf       classic scheme
Supervised Term Weighting    tf.rf        our scheme
Supervised Term Weighting    tf.chi2      chi square
Supervised Term Weighting    tf.ig        information gain
Supervised Term Weighting    tf.or        odds ratio
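For illustration, a hedged sketch of two of the supervised factors in this table, computed from a per-term, per-category contingency table. The cell names (tp, fp, fn, tn) and the zero guards are my own choices, not the thesis's notation, and tf.ig (information gain) is omitted for brevity:

```python
def chi_square(tp: int, fp: int, fn: int, tn: int) -> float:
    """chi2 statistic for one term and one category.
    tp/fp: positive/negative-category documents containing the term;
    fn/tn: positive/negative-category documents without it."""
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    return n * (tp * tn - fp * fn) ** 2 / denom if denom else 0.0

def odds_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    """Odds ratio (tp * tn) / (fp * fn), guarded against division by zero."""
    return (tp * tn) / max(fp * fn, 1)
```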
54
Experiment Set 2: Results on Reuters Corpus using SVM
55
Experiment Set 2: Results on Reuters Corpus using kNN
56
Experiment Set 2: Results on 20Newsgroups Corpus using SVM
57
Experiment Set 2: Results on 20Newsgroups Corpus using kNN
58
Experiment Set 2: McNemar's Significance Tests (R = Reuters, 20 = 20Newsgroups)

Algorithm  Corpus  #features  Significance test results
SVM        R       15937      (tf.rf, tf, rf) > tf.idf > (tf.ig, tf.chi2, binary) >> tf.or
SVM        20      13456      (rf, tf.rf, tf.idf) > tf >> binary >> tf.or >> (tf.ig, tf.chi2)
kNN        R       405        (binary, tf.rf) > tf >> (tf.idf, rf, tf.ig) > tf.chi2 >> tf.or
kNN        20      494        (tf.rf, binary, tf.idf, tf) >> rf >> (tf.or, tf.ig, tf.chi2)
59
Experiment Set 2: Effects of feature set size on algorithms
  • For SVM, almost all methods achieved their best performance when using the full vocabulary (13000-16000 features)
  • For kNN, the best performance was achieved at a smaller feature set size (400-500 features)
  • Possible reason: different noise resistance

60
Experiment Set 2: Conclusions (1)
  • Q1: Are supervised term weighting methods superior to the unsupervised term weighting methods?
  • -- Not always.

61
Experiment Set 2: Conclusions (1)
  • Specifically, the three supervised methods based on information theory, i.e. tf.chi2, tf.ig and tf.or, perform rather poorly in all experiments.
  • On the other hand, the newly proposed supervised method, tf.rf, achieved the best performance consistently and outperforms the other methods substantially and significantly.

62
Experiment Set 2: Conclusions (2)
  • Q2: What kinds of relationships hold between term weighting methods and learning algorithms, given different benchmark data collections?
  • A: The performance of the term weighting methods, especially the three unsupervised methods, is closely related to the learning algorithms and data corpora.

63
Experiment Set 2: Conclusions (4)
  • Summary of supervised methods:
  • tf.rf performs consistently better in all experiments
  • tf.or, tf.chi2 and tf.ig perform consistently the worst in all experiments
  • rf alone shows a performance comparable to tf.rf except on Reuters using kNN

64
Experiment Set 2: Conclusions (5)
  • Summary of unsupervised methods:
  • tf.idf performs comparably to tf.rf on the uniform-category corpus, whether using SVM or kNN
  • binary performs comparably to tf.rf on both corpora using kNN, but rather badly using SVM
  • although tf does not perform as well as tf.rf, it performs consistently well in all experiments

65
Experiment Set 3: Applications in the Biomedical Domain
  • Purpose:
  • Application to biomedical data collections
  • Motivation:
  • Explosive growth of biomedical research

66
Experiment Set 3: Data Corpora
  • The Ohsumed Corpus
  • Used by Joachims 1998
  • 18 Journals Corpus
  • Biochemistry and Molecular Biology
  • Top impact factor
  • 7903 documents from a two-year span (2004-2005)

67
Experiment Set 3: Methodology
  • Four term weighting methods:
  • binary, tf, tf.idf, tf.rf
  • Evaluation Measures:
  • Micro-averaged breakeven point
  • F1

68
Experiment Set 3: Results on the Ohsumed Corpus
69
Experiment Set 3: The best performance of SVM with four term weighting schemes on the Ohsumed Corpus

scheme   Micro-R   Micro-P   Micro-F1   Macro-F1
binary   0.6091    0.6097    0.6094     0.5757
tf       0.6578    0.6566    0.6572     0.6335
tf.idf   0.6567    0.6588    0.6578     0.6407
tf.rf    0.6810    0.6800    0.6805     0.6604
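As a consistency check, Micro-F1 is the harmonic mean of Micro-P and Micro-R; for the tf.rf row:

$$\text{Micro-}F_1 = \frac{2 \times 0.6800 \times 0.6810}{0.6800 + 0.6810} \approx 0.6805$$

which matches the value in the table.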
70
Experiment Set 3: Comparison between our study and Joachims' study
  • Our linear SVM is more accurate:
  • 68.05% (linear) vs. 60.7% (linear) and 66.1% (RBF)
  • The performances of tf.idf are essentially identical:
  • 65.78% vs. 66%
  • Our proposed tf.rf performs significantly better than the other methods.

71
Experiment Set 3: Results on 18 Journals Corpus
72
Experiment Set 3: Results on 3 subsets of the 18 Journals Corpus
73
Experiment Set 3: Conclusions
  • The comparison between our results and Joachims' results shows that tf.rf improves classification performance over tf.idf.
  • tf.rf outperforms the other term weighting methods on both data corpora.
  • tf.rf improves the classification performance of biomedical text classification.

74
Outline
  • Introduction
  • Motivation
  • Methodology of Research
  • Analysis and Proposal of a New Term
  • Weighting Method
  • Experimental Research Work
  • Contributions and Future Work

75
Contributions of Thesis
  • 1. To propose an effective supervised term weighting method, tf.rf, to improve the performance of text categorization.

76
Contributions of Thesis
  • 2. To make an extensive comparative study of different term weighting methods under controlled conditions.

77
Contributions of Thesis
  • 3. To give a deep analysis of terms' discriminating power for text categorization from qualitative and quantitative aspects.

78
Contributions of Thesis
  • 4. To investigate the relationships between term weighting methods and learning algorithms given different corpora.

79
Future Work (1)
  • 1. Extending term weighting methods to feature types other than words

80
Future Work (2)
  • 2. Applying term weighting methods to other text-related applications

81
  • Thanks for your time and suggestions.