Machine Learning in Text Categorization - PowerPoint PPT Presentation

About This Presentation

Slides: 23

Provided by: mger7

Transcript and Presenter's Notes

1
Machine Learning in Text Categorization

2
Overview

3
Introduction

Availability of large sets of digital data
Biological information, songs, TV broadcasts
The Web, text collections, etc.
Increased demand for methods to
Sort and retrieve
Filter and manage digital resources
Information Retrieval (IR) techniques
Text search - provides tools for searching for
relevant documents within these large collections.
Text classification - offers tools for converting
unstructured text collections into structured
ones.
In doing so, storage and search get easier.

4
Text Classification

It is an important ingredient for organizing
documents.
E.g., Web directories such as Yahoo
The two variants of text classification
Text clustering - finds a latent, as yet
unspecified group structure.
Classes are not known in advance.
Text categorization - structures the collection
according to a scheme provided as input.
Classes are predefined.

5
Text Categorization

Definition
A process of assigning to a text document d from
a given domain D one or more class labels from a
finite set of predefined categories.

[Figure: a document d3 assigned to one or more of the categories C1, C2, ..., Ck]

6
Automatic Text Categorization

Formal definition
Given a set of previously unseen documents
D = {d1, d2, d3, ...} and a set of predefined classes
or categories C = {c1, c2, c3, ..., ck}, a classifier
(categorizer) is a function φ: D → 2^C that maps
each document in D to an element of the set of
all subsets of C.
Gains in efficiency and manpower from automation
may significantly aid the business processes of any
organization.
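
The formal definition above can be sketched in Python as a function that returns a subset of C for each document. This is only an illustration: the category names and keyword lists are hypothetical and are not taken from the presentation, and a learned classifier would replace the hand-written keyword test.

```python
from typing import FrozenSet

# Hypothetical set of predefined categories C.
CATEGORIES = frozenset({"sports", "finance", "politics"})

# Hypothetical keyword lists, used only to make the mapping concrete.
KEYWORDS = {
    "sports": {"match", "goal", "team"},
    "finance": {"stock", "bank", "market"},
    "politics": {"election", "senate", "vote"},
}


def classify(document: str) -> FrozenSet[str]:
    """Map a document to a (possibly empty) subset of CATEGORIES."""
    tokens = set(document.lower().split())
    # A category is assigned when the document shares a token with its list.
    return frozenset(c for c in CATEGORIES if tokens & KEYWORDS[c])
```

Because the return value is a set, a document may receive zero, one, or several labels, which matches the "set of all subsets of C" in the definition.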

7
Automatic Text Categorization

8
Application of TC

Document organization
Newswire filtering - grouping of news stories
produced by news agencies into thematic classes of
interest.
Patent classification
Text filtering
Classifying an incoming stream of documents
Spam e-mail filtering
Single-label TC
Classifying into two disjoint categories:
relevant and irrelevant
Authorship attribution, etc.

9
Construction of Automatic Classifiers

Since the late 1980s and early 1990s, there have
been two popular approaches.
Knowledge engineering
Define rules to classify documents.
All the rules are defined manually.
Supervised machine learning
Builds an automatic text classifier by learning
the characteristics of the categories from a
training set.
Saves a lot of time and skilled manpower.
Classification accuracy is better than that of
classifiers built by knowledge engineering
methods.

10
Automatic Text Categorization Process

To build a categorizer, we need a set of
pre-classified documents.
Training set - pre-classified documents used for
training the categorizer.
Test set - pre-classified documents used for
testing the effectiveness of the classifier.
Basic outline for construction
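
The division of pre-classified documents into a training set and a test set can be sketched as follows. The function name, the 25% test fraction, and the fixed seed are assumptions made for illustration; the presentation does not state how the split was done.

```python
import random


def split_corpus(labeled_docs, test_fraction=0.25, seed=0):
    """Shuffle (document, label) pairs and split them into train and test sets."""
    docs = list(labeled_docs)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(docs)
    n_test = int(len(docs) * test_fraction)
    # The first n_test shuffled items are held out for testing.
    return docs[n_test:], docs[:n_test]
```

The test documents are never shown to the learner, so effectiveness measured on them reflects behavior on previously unseen documents.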

11
Automatic Text Categorization Phase-1

Building internal representations for documents
Transforms a document d into a compact vector form.
Tokenization of documents
The dimension of the vector corresponds to the number of
distinct words or tokens in the training set.
Each entry in the vector represents the weight of
a term. A document di = (f1i, ..., fmi), where m is
the dimension.
The tf-idf function (tf × idf)
The weights are normalized to limit them to
[0, 1].
Dimensionality reduction methods
Feature selection
Feature extraction
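
The document representation step can be sketched as below. The slides do not say which tf-idf variant was used, so this sketch assumes raw term frequency, idf = log(N / df), whitespace tokenization, and cosine (unit-length) normalization, which keeps every weight in [0, 1]. The function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(train_docs):
    """Return (vocabulary, list of normalized tf-idf vectors)."""
    tokenized = [doc.lower().split() for doc in train_docs]
    # One vector dimension per distinct token in the training set.
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        # Cosine normalization: non-negative weights end up in [0, 1].
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocab, vectors
```

A term that occurs in every training document gets weight 0 under this variant, since log(N / N) = 0; feature selection would typically drop such terms anyway.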

12
Automatic Text Categorization Phase-2

After phase 1, a learner automatically builds a
text classifier for the provided categories by
observing pre-classified samples in the training
set.
Various machine learning techniques
Naïve Bayesian
Support Vector Machine
K-Nearest-Neighbors
Neural Network
Decision Trees, etc.

13
Machine Learning Algorithm

14
Machine Learning Algorithms

Support Vector Machines (SVM)
Uses polynomial and RBF kernels
SVMlight decomposes a big QP (quadratic programming)
problem into smaller ones and solves them
iteratively until the solution converges.
One-against-one approach: an SVM classifier for
each pair of categories, k(k-1)/2 in total
Training scales poorly with training set size
K-Nearest-Neighbor (KNN)
Lazy learner - no offline learning phase
The class most common among the k documents most
similar to a new one is assigned.
Simple but computationally intensive
Slow categorization phase
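
A minimal sketch of the k-nearest-neighbor rule described above, assuming documents are already represented as weight vectors, cosine similarity as the similarity measure, and a simple majority vote. The slides do not specify the similarity function or the voting scheme, so both are assumptions, as are the function names.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_classify(query, training, k=3):
    """Assign the majority label among the k most similar training vectors.

    training is a list of (vector, label) pairs.
    """
    # No model is built in advance: every training vector is compared
    # to the query at classification time.
    neighbors = sorted(
        training, key=lambda item: cosine(query, item[0]), reverse=True
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

The full scan over the training set for every query is what makes the categorization phase slow, even though there is no training phase at all.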

15
Evaluation of Text Categorizers

Evaluation measures
Effectiveness - ability to take the right
classification decision.
Efficiency
Training efficiency - average time it takes to
build a classifier for a category from a training
set.
Classification efficiency - average time it takes
to classify a previously unseen document.
Measures of TC effectiveness
Precision p - the probability that dj was supposed
to be classified under ci given that it is
classified under ci
Recall ρ - the probability that dj is
classified under ci given that it was supposed to
be classified under ci

16
Evaluation of Text Categorizers

How do we estimate precision and recall?
A contingency table summarizes the
categorization results for a given category.
A classifier should be evaluated by a measure that
combines precision and recall.
Breakeven point - the value at which precision =
recall
Microaveraging - computes precision and recall
over all categories (or over the most frequent
categories) by summing the individual decisions
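
Estimating precision and recall from a per-category contingency table, and microaveraging across categories, can be sketched as follows. Here tp, fp, and fn denote the true-positive, false-positive, and false-negative counts of the table; the function names and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def precision_recall(tp, fp, fn):
    """Estimate precision and recall from contingency-table counts."""
    # Precision: fraction of documents assigned to the category that belong to it.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: fraction of documents belonging to the category that were assigned to it.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def micro_average(tables):
    """Microaverage over categories; tables maps category -> (tp, fp, fn)."""
    # Sum the counts of all categories first, then compute one global estimate.
    tp = sum(t[0] for t in tables.values())
    fp = sum(t[1] for t in tables.values())
    fn = sum(t[2] for t in tables.values())
    return precision_recall(tp, fp, fn)
```

Because the counts are pooled before dividing, categories with many documents dominate the microaveraged figures.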

17
Experimental Design (Joachims, 1997)