A Neural Network Approach to Topic Spotting (PPT Transcript)
1
A Neural Network Approach to Topic Spotting
  • Presented by Loulwah AlSumait
  • INFS 795 Spec. Topics in Data Mining
  • 4.14.2005

2
Article Information
  • Published in
  • Proceedings of SDAIR-95, 4th Annual Symposium on
    Document Analysis and Information Retrieval 1995
  • Authors
  • Wiener, E.
  • Pedersen, J.O.
  • Weigend, A.S.
  • 54 citations

3
Summary
  • Introduction
  • Related Work
  • The Corpus
  • Representation
  • Term Selection
  • Latent Semantic Indexing
  • Generic LSI
  • Local LSI
  • Cluster-Directed LSI
  • Topic-Directed LSI
  • Relevancy Weighting LSI

4
Summary
  • Neural Network Classifier
  • Neural Networks for Topic Spotting
  • Linear vs. Non-Linear Networks
  • Flat Architecture vs. Modular Architecture
  • Experiment Results
  • Evaluating Performance
  • Results discussion

5
Introduction
  • Topic Spotting = Text Categorization = Text
    Classification
  • Problem of identifying which of a set of
    predefined topics are present in a natural
    language document.

Document
Topic 1
Topic 2
Topic n
6
Introduction
  • Classification Approaches
  • Expert system approach
  • manually construct a system of inference rules on
    top of a large body of linguistic and domain
    knowledge
  • can be extremely accurate
  • very time-consuming
  • brittle to changes in the data environment
  • Data-driven approach
  • induce a set of rules from a corpus of labeled
    training documents
  • works better in practice

7
Introduction - Related Work
  • Key observations from the related work:
  • a separate classifier was constructed for each
    topic
  • a different set of terms was used to train each
    classifier

8
Introduction - The Corpus
  • Reuters-22173 corpus of Reuters newswire stories
    from 1987
  • 21,450 stories
  • 9,610 for training
  • 3,662 for testing
  • mean length 90.6 words, SD 91.6
  • 92 topics appeared at least once in the training
    set
  • mean of 1.24 topics/doc (up to 14 topics for some
    documents)
  • 11,161 unique terms after preprocessing (see the
    sketch below):
  • inflectional stemming
  • stop-word removal
  • conversion to lower case
  • elimination of words that appeared in fewer than
    three documents
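As a rough illustration of the preprocessing above, the Python sketch below chains lowercasing, stop-word removal, a crude suffix stemmer, and rare-term elimination; the stop list and stemmer are simple stand-ins, not the tools used in the paper.

    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are"}  # stand-in list

    def stem(word):
        # Crude inflectional stemmer (a stand-in for the paper's stemmer).
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def preprocess(documents, min_doc_freq=3):
        # Lowercase, drop stop words, and stem each token.
        tokenized = [[stem(w) for w in doc.lower().split() if w not in STOP_WORDS]
                     for doc in documents]
        # Document frequency: the number of documents each term appears in.
        df = Counter(t for doc in tokenized for t in set(doc))
        # Eliminate words that appear in fewer than min_doc_freq documents.
        vocab = {t for t, n in df.items() if n >= min_doc_freq}
        return [[t for t in doc if t in vocab] for doc in tokenized], vocab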

9
Representations
  • Starting point
  • Document Profile = term-by-document matrix
    containing word-frequency entries

10
Representation
Thorsten Joachims. 1997. Text Categorization with
Support Vector Machines: Learning with Many
Relevant Features.
http://citeseer.ist.psu.edu/joachims97text.html
11
Representation - Term Selection
  • Select the subset of the original terms that is
    most useful for the classification task
  • It is difficult to select terms that discriminate
    between 92 classes while keeping the set small
    enough to serve as the feature set for a neural
    network
  • Divide the problem into 92 independent
    classification tasks
  • Search for the terms that best discriminate
    between documents with the topic and those
    without

12
Representation - Term Selection
  • Relevancy Score of term k for topic t (a log
    odds ratio):
    RS(k,t) = log(p_k / (1 - p_k)) - log(q_k / (1 - q_k))
    where p_k = (no. of docs w/ topic t that contain
    term k) / (total no. of docs w/ topic t), and
    q_k is the same fraction over docs without
    topic t
  • measures how unbalanced the term is across
    documents w/ or w/o the topic
  • Highly positive and highly negative scores
    indicate useful terms for discrimination
  • using about 20 terms yielded the best
    classification performance (see the sketch below)
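Reading the relevancy score as the log odds ratio suggested by the fragments above, a minimal sketch follows; the eps smoothing that keeps the logs finite is an assumption, since the slide does not show how zero counts are handled.

    import math

    def relevancy_score(term, topic_docs, other_docs, eps=1e-4):
        # p: fraction of documents with the topic that contain the term;
        # q: the same fraction over documents without the topic.
        p = sum(term in d for d in topic_docs) / len(topic_docs)
        q = sum(term in d for d in other_docs) / len(other_docs)
        # Clamp away from 0 and 1 so the log odds stay finite (assumed smoothing).
        p = min(max(p, eps), 1 - eps)
        q = min(max(q, eps), 1 - eps)
        return math.log(p / (1 - p)) - math.log(q / (1 - q))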
13
Representation - Term Selection
14
Representation - Term Selection
  • Advantages
  • little computation is required
  • the resulting features have direct
    interpretability
  • Drawbacks
  • many of the best individual predictors contain
    redundant information
  • a term which may appear to be a very poor
    predictor on its own may turn out to have great
    discriminative power in combination with other
    terms, and vice versa
  • e.g., Apple vs. Apple Computers
  • Selected Term Representation (TERMS) with 20
    features

TERMS
15
Representation - LSI
  • Transform original documents to a
    lower-dimensional space by analyzing the
    correlational structure of terms in the document
    collection
  • (Training set) apply a singular-value
    decomposition (SVD) to the original
    term-by-document matrix → get U, Σ, V
  • (Test set) transform document vectors by
    projecting them into the LSI space (see the
    sketch below)
  • Property of LSI: higher dimensions capture less
    of the variance of the original data → can be
    dropped with minimal loss
  • Found performance continues to improve up to at
    least 250 dimensions
  • Improvement slows down rapidly after about 100
    dimensions
  • Generic LSI Representation (LSI) with 200 features

LSI
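A minimal NumPy sketch of generic LSI: fit an SVD on the training term-by-document matrix, keep the top k dimensions, and fold test documents into that space via the standard projection d' = dᵀ U_k Σ_k⁻¹; the function names are illustrative.

    import numpy as np

    def fit_lsi(term_doc_matrix, k=200):
        # SVD of the terms x documents matrix: A = U @ diag(s) @ Vt.
        U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
        # Higher dimensions capture less variance, so keep only the top k.
        return U[:, :k], s[:k]

    def project(doc_vector, U_k, s_k):
        # Fold a (test) document's term-frequency vector into the LSI space.
        return (doc_vector @ U_k) / s_k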
16
Representation - LSI
[Diagram: a single SVD over the entire Reuters corpus yields the generic LSI representation with 200 features; example topics shown: Wool, Wheat, Money-supply, Gold, Barley, Zinc]
17
Representation - Local LSI
  • Global LSI performs worse as topic frequency
    decreases
  • infrequent topics are usually indicated by
    infrequent terms, and infrequent terms may be
    projected out of the LSI representation and
    treated as mere noise
  • Proposed two task-directed methods that make use
    of prior knowledge of the classification task

18
Representation - Local LSI
  • What is Local LSI?
  • modeling only the local portion of the corpus
    related to those topics
  • includes documents that use terminology related
    to the topics (not necessarily having any of the
    topics assigned)
  • performing SVD over only the local set of
    documents
  • makes the representation more sensitive to small,
    localized effects of infrequent terms
  • makes the representation more effective for
    classification of topics related to that local
    structure

19
Representation - Local LSI
  • Types of Local LSI
  • Cluster-Directed representation
  • 5 meta-topics (clusters):
  • Agriculture, Energy, Foreign Exchange,
    Government, and Metals
  • How to construct the local region?
  • Break corpus into 5 clusters → each containing
    all documents on the corresponding meta-topic
  • Perform SVD for each meta-topic region (see the
    sketch below)
  • Cluster-Directed LSI Representation (CD/LSI) with
    200 features

CD/LSI
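A sketch of the cluster-directed variant, reusing fit_lsi from the generic LSI sketch above; how the five meta-topic partitions are stored (here, a dict of column indices) is an assumption about bookkeeping, not the paper's data layout.

    def cluster_directed_lsi(term_doc_matrix, meta_topic_cols, k=200):
        # meta_topic_cols: meta-topic name -> column indices of its documents.
        # A separate SVD is fit on each meta-topic's local document set.
        return {meta: fit_lsi(term_doc_matrix[:, cols], k=k)
                for meta, cols in meta_topic_cols.items()}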
20
Representation - Local LSI
[Diagram: the single corpus-wide SVD of the generic representation, repeated for contrast with the local variants]
21
Representation - Local LSI
[Diagram: the Reuters corpus is partitioned into five meta-topic regions (Government, Agriculture, Foreign Exchange, Metal, Energy); a separate SVD on each region yields the Cluster-Directed LSI representation (CD/LSI) with 200 features; example topics per region: Agriculture - Wool, Wheat, Barley; Foreign Exchange - Money-supply; Metal - Zinc, Gold]
22
Representation - Local LSI
  • Types of Local LSI
  • Topic-Directed representation
  • a more fine-grained approach to local LSI
  • separate representation for each topic
  • How to construct the local region? (see the
    sketch below)
  • Use the 100 most predictive terms for the topic
  • Pick the N most similar documents, N = 5 x (no.
    of documents containing the topic), with
    110 ≤ N ≤ 350
  • Final documents in topic region = N documents +
    150 random documents
  • Topic-Directed LSI Representation (TD/LSI) with
    200 features
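A sketch of building one topic's local region as described above; the similarity measure (overlap with the topic's 100 most predictive terms) is a plausible stand-in, since the slide does not spell out how document similarity is computed.

    import random

    def topic_region(doc_vectors, topic_doc_ids, predictive_terms, n_random=150):
        # doc_vectors: doc id -> {term: frequency}. Score each document by how
        # much of the topic's predictive vocabulary it uses (assumed similarity).
        scores = {i: sum(vec.get(t, 0) for t in predictive_terms)
                  for i, vec in doc_vectors.items()}
        # N = 5 x (number of documents containing the topic), clamped to [110, 350].
        n = min(max(5 * len(topic_doc_ids), 110), 350)
        most_similar = sorted(scores, key=scores.get, reverse=True)[:n]
        # The final region adds random documents for contrast.
        rest = [i for i in doc_vectors if i not in set(most_similar)]
        return most_similar + random.sample(rest, min(n_random, len(rest)))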

23
Representation - Local LSI
[Diagram: the single corpus-wide SVD, repeated for contrast]
24
Representation - Local LSI
[Diagram: a separate SVD over each topic's local region (Wool, Wheat, Money-supply, Zinc, Barley, Gold) yields the Topic-Directed LSI representation (TD/LSI) with 200 features]
25
Representation - Local LSI
  • Drawbacks of Local LSI
  • the narrower the region, the lower the
    flexibility of the representation for modeling
    the classification of multiple topics
  • high computational overhead

26
Representation - Relevancy Weighting LSI
  • Use term weights to emphasize the importance of
    particular terms before applying the SVD
  • IDF weighting
  • increases the importance of low-frequency terms
  • decreases the importance of high-frequency terms
  • assumes low-frequency terms are better
    discriminators than high-frequency terms

27
Representation - Relevancy Weighting LSI
  • Relevancy Weighting
  • tunes the IDF assumption
  • emphasizes terms in proportion to their estimated
    topic discrimination power
  • Global Relevancy Weighting of term k (GRWk)
  • Final weighting of term k = IDF² x GRWk
  • all low-frequency terms are pulled up by IDF
  • poor predictors are pushed down by GRW
  • leaving only relevant low-frequency terms with
    high weights (see the sketch below)
  • Relevancy Weighted LSI Representation (REL/LSI)
    with 200 features
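A sketch of relevancy weighting applied to the rows of the term-by-document matrix before the SVD; the slide does not show how GRWk itself is computed, so it is taken here as a precomputed per-term input, and the IDF form log(N/df) is a standard choice rather than necessarily the paper's.

    import math

    def relevancy_weight_rows(term_doc_matrix, doc_freq, n_docs, grw):
        # term_doc_matrix: NumPy terms x docs matrix; doc_freq[k]: document
        # frequency of term k; grw[k]: precomputed global relevancy weight.
        weighted = term_doc_matrix.astype(float)
        for k in range(weighted.shape[0]):
            idf = math.log(n_docs / doc_freq[k])
            # Final weight of term k = IDF^2 * GRW_k, applied to its row.
            weighted[k, :] *= (idf ** 2) * grw[k]
        return weighted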

28
Neural Network Classifier (NN)
  • A NN consists of
  • processing units (neurons)
  • weighted links connecting the neurons

29
Neural Network Classifier (NN)
  • Major components of a NN model:
  • architecture: defines the functional form
    relating input to output
  • network topology
  • unit connectivity
  • activation functions, e.g. the logistic
    regression function

30
Neural Network Classifier (NN)
  • Logistic regression function:
    p = 1 / (1 + e^(-z))
  • z is a linear combination of the input features:
    z = Σ_i w_i x_i
  • p ∈ (0, 1); can be converted to a binary
    classification method by thresholding the output
    probability

31
Neural Network Classifier (NN)
  • Major components of a NN model (cont.):
  • search algorithm: the search in weight space for
    a set of weights which minimizes the error
    between the actual output and the expected
    output (the TRAINING PROCESS)
  • Backpropagation method
  • Mean squared error
  • Cross-entropy error performance function:
    C = - Σ (over all cases and outputs)
        [ d log(y) + (1 - d) log(1 - y) ]
    where d = desired output, y = actual output
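Tying the two slides above together, the sketch below trains a flat linear topic network (a single logistic output unit, no hidden layer) by gradient descent on the cross-entropy error; it is a bare-bones illustration that omits the paper's weight penalty and early stopping.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_linear_topic_network(X, d, lr=0.1, epochs=200):
        # X: docs x features (e.g., 200 LSI dimensions); d: 0/1 topic labels.
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            y = sigmoid(X @ w + b)            # estimated p(topic | document)
            # For a logistic unit with cross-entropy C, dC/dz = y - d.
            w -= lr * (X.T @ (y - d)) / len(d)
            b -= lr * np.mean(y - d)
        return w, b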

32
NN for Topic Spotting
  • Network outputs are estimates of the probability
    of topic presence given the feature vector of a
    document
  • Generic LSI representation:
  • each network uses the same representation
  • Local LSI representation:
  • a different representation for each network

33
NN for Topic Spotting
  • Linear NN
  • Output units with logistic activation and no
    hidden layer

34
NN for Topic Spotting
  • Non-Linear NN
  • Simple networks with a single hidden layer of
    logistic sigmoid units (6 to 15 units)

35
NN for Topic Spotting
  • Flat Architecture
  • Separate network for each topic
  • uses the entire training set to train each
    topic's network
  • Avoiding the overfitting problem by
  • adding a penalty term to the cross-entropy cost
    function to encourage the elimination of small
    weights
  • early stopping based on cross-validation

36
NN for Topic Spotting
  • Modular Architecture
  • decompose learning problem into smaller problems
  • Meta-Topic Network, trained on the full training
    set
  • estimates the presence probability of the five
    meta-topics in a document
  • uses 15 hidden units

37
NN for Topic Spotting
  • Modular Architecture
  • five groups of local topic networks
  • each consists of local topic networks for the
    topics in one meta-topic
  • each network is trained only on its meta-topic
    region

38
NN for Topic Spotting
  • Modular Architecture
  • five groups of local topic networks (cont.)
  • Example: the wheat network is trained on the
    Agriculture meta-topic region
  • Focuses on finer distinctions, e.g. wheat vs.
    grain
  • Doesn't waste time on easier distinctions, e.g.
    wheat vs. gold
  • Each local topic network uses 6 hidden units

39
NN for Topic Spotting
  • Modular Architecture
  • To compute topic predictions for a given document:
  • present the document to the meta-topic network
  • present the document to each of the topic networks
  • outputs of the meta-topic network weight the
    topic networks' outputs → final topic estimates
    (see the sketch below)
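One plausible reading of the combination step, sketched below, gates each local topic network's output by its meta-topic's estimated probability; the slide only says the meta-topic outputs feed the final estimates, so the multiplicative form is an assumption.

    def modular_predict(doc, meta_network, topic_networks):
        # meta_network(doc) -> {meta_topic: probability}
        # topic_networks: meta_topic -> {topic: network}, network(doc) -> probability
        meta_probs = meta_network(doc)
        estimates = {}
        for meta, nets in topic_networks.items():
            for topic, net in nets.items():
                # Gate the local estimate by the meta-topic probability (assumed).
                estimates[topic] = meta_probs[meta] * net(doc)
        return estimates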

40
Experimental Results
  • Evaluating Performance
  • Mean squared error between actual and predicted
    values is not informative enough
  • Compute precision and recall based on contingency
    tables constructed over a range of decision
    thresholds
  • How to get the decision thresholds?

41
Experimental Results
  • Evaluating Performance
  • How to get the decision Thresholds?
  • Proportional assignment (see the sketch below)
  • contingency table: actual Topic = wool vs.
    Topic ≠ wool against the predicted assignment
  • Predict Topic = wool iff output probability ≥ θ,
    where θ = output probability of the (kP)-th
    highest-ranked document (k an integer, P = prior
    probability of the wool topic)
  • Predict Topic ≠ wool iff output probability < θ
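A sketch of proportional assignment, interpreting "the (kP)-th highest-ranked document" as rank k x P x N, where P is the topic's prior proportion and N the number of ranked documents; the exact scaling on the slide is not fully legible, so treat this as an assumption.

    def proportional_threshold(output_probs, prior, k=1.0):
        # Assign the topic to roughly k * prior * N top-ranked documents:
        # the threshold is the output probability at that rank.
        ranked = sorted(output_probs, reverse=True)
        rank = max(1, round(k * prior * len(ranked)))
        return ranked[min(rank, len(ranked)) - 1]

    # Predict the topic for a document iff its output probability >= threshold.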
42
Experimental Results
  • Evaluating Performance
  • How to get the decision Thresholds?
  • Fixed recall level approach
  • determine a set of recall levels
  • analyze the ranked documents to determine which
    decision thresholds lead to the desired set of
    recall levels
  • Predict Topic = wool iff output probability ≥ θ,
    where θ = the output probability of the document
    at which the number of higher-ranked documents
    yields the target recall level
  • Predict Topic ≠ wool iff output probability < θ
43
Experimental Results
  • Performance by Microaveraging
  • add all contingency tables together across topics
    at a certain threshold
  • compute precision and recall
  • used proportional assignment for picking decision
    thresholds
  • does not weight the topics evenly
  • used for comparisons to previously reported
    results
  • the breakeven point is used as a summary value

44
Experimental Results
  • Performance by Macroaveraging (both averaging
    schemes are sketched below)
  • compute precision and recall for each topic
  • take the average across topics
  • used a fixed set of recall levels
  • summary values are obtained for particular topics
    by averaging precision over the 19 evenly spaced
    recall levels between 0.05 and 0.95
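A minimal sketch contrasting the two averaging schemes over per-topic contingency tables at one threshold; it assumes non-degenerate tables (no zero denominators) for brevity.

    def micro_macro(tables):
        # tables: list of per-topic (tp, fp, fn) contingency counts.
        # Microaveraging: sum counts across topics, then compute P/R
        # (frequent topics dominate).
        tp, fp, fn = (sum(t[i] for t in tables) for i in range(3))
        micro = (tp / (tp + fp), tp / (tp + fn))
        # Macroaveraging: compute P/R per topic, then average
        # (all topics weighted evenly).
        prec = [t[0] / (t[0] + t[1]) for t in tables]
        rec = [t[0] / (t[0] + t[2]) for t in tables]
        macro = (sum(prec) / len(prec), sum(rec) / len(rec))
        return micro, macro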

45
Experimental Results
  • Microaveraged performance
  • Breakeven points compared to the best previously
    reported algorithm:
  • a rule induction method based on heuristic
    search, with a breakeven point of 0.789

[Chart: microaveraged breakeven points of 0.820, 0.801, 0.795, and 0.775 for the compared representations]
46
Experimental Results
  • Macroaveraged performance
  • TERMS appears much closer to the other three
  • The relative effectiveness of the representations
    at low recall levels is reversed at high recall
    levels

47
  • Six techniques' performance on the 54 most
    frequent topics
  • considerable variation of performance across
    topics
  • relative ups and downs are mirrored in both plots

[Chart annotations: slight improvement from nonlinear networks; LSI performance degrades relative to TERMS as topic frequency f_t decreases]
48
Experimental Results
  • Performance of Combination of Techniques and Its
    Improvement

Columns (NN architecture): Flat Linear, Flat
Non-Linear, Modular Linear, Modular Non-Linear
(modular meta-topic networks are trained using the
LSI representation)
Rows (document representation): TERMS, LSI, CD-LSI,
TD-LSI, REL-LSI, Hybrid (CD-LSI + TERMS)
[Table: color/shape marks pair each representation
with the architectures it was tested under; the
marks do not survive transcription]
49
Experimental Results
  • Flat Networks

50
Experimental Results
  • Modular Networks
  • only 4 of the clusters were used
  • average precision was recomputed for the flat
    networks for comparison

51
(No Transcript)
52
The LSI representation is able to equal or exceed
TERMS performance for high-frequency topics, but
performs poorly for low-frequency topics
53
Task-directed LSI representations improve
performance in the low-frequency domain.
TD/LSI trade-off → computational cost;
REL/LSI trade-off → lower performance on
medium/high-frequency topics
54
Modular CD/LSI improves performance further for
low-frequency topics, because the individual
networks are trained only in the domain over which
LSI was performed
55
TERMS proves to be competitive with the more
sophisticated LSI techniques → most topics are
predictable from a small set of terms
56
Discussion
  • Rich solution: many representations and many
    models
  • Fully supervised approach
  • Results are lower than expected
  • Is the dataset responsible?
  • High computational overhead
  • Does NN deserve a place in DM tool boxes?
  • Questions?