1
Naive Bayes model
  • Comp221 tutorial 4 (assignment 1)
  • TA: Zhang Kai

2
Outline
  • Bayes probability model
  • Naive Bayes classifier
  • Text classification
  • Digit classification
  • Assignment specifications

3
Naive Bayes classifier
  • A naive Bayes classifier is a simple
    probabilistic classifier based on applying Bayes'
    theorem with strong (naive) independence
    assumptions; more specifically, it is an
    independent feature model.

4
Naive Bayes probability model
  • Graphical illustration (figure below)
  • a class node C at the root; we want P(C | F1, ..., Fn)
  • evidence nodes Fi: the observed features, as leaves
  • conditional independence between all evidence nodes
    given the class

[Figure: class node C at the root with feature leaves F1, F2, ..., Fn]
5
Naive Bayes probability model
  • The classifier is a conditional model p(C | F_1, ..., F_n)
  • Following Bayes' rule strictly, we have
    p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}
  • Simplify this through conditional independence,
    p(F_i \mid C, F_j) = p(F_i \mid C), \quad i \neq j
  • So the conditional distribution over the class C is
    p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)
  • Z = p(F_1, ..., F_n) is constant given the features

6
Naive Bayes classifier
  • The naive Bayes classifier combines the naive Bayes
    probability model with a decision rule, such as
    the maximum a posteriori (MAP) decision rule.
  • If there are k classes and a model for each p(Fi | C)
    can be expressed by r parameters, then the naive
    Bayes model has (k - 1) + n r k parameters.
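A quick worked instance of this count (the numbers are illustrative, anticipating the digit task on later slides):

    % k = 10 classes, n = 256 features, r = 1 (one Bernoulli parameter per feature)
    \[(k - 1) + n\,r\,k = (10 - 1) + 256 \cdot 1 \cdot 10 = 2569\]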

7
Text Classification
  • Task: classify text documents into one of a set of
    pre-defined classes, such as sports, recreation,
    politics, war, economy, etc.
  • Given:
  • K groups of training texts
  • each group with a label, containing a number of
    text documents

8
Procedures
  • Computing a priori class probabilities
  • count the number of text documents ni in each
    directory/class Ci
  • count the total number of training text documents n
  • prior probability: P(Ci) = ni / n
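A minimal MATLAB sketch of this step (the names labels and K are illustrative assumptions, not from the assignment):

    % labels: n-by-1 vector, labels(t) = class index of training document t
    % K: number of classes
    prior = zeros(K, 1);
    for i = 1:K
        prior(i) = sum(labels == i) / numel(labels);   % P(Ci) = ni / n
    end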

9
  • Computing class-conditional word likelihoods
  • suppose we have chosen m key words, denoted
    w1, w2, ..., wm
  • count the number of times cji that word wj
    occurs in text class Ci
  • count the total number of words ni in class Ci
  • the class-conditional probability is then
    P(wj | Ci) = cji / ni
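One way to compute these estimates in MATLAB (the count matrix c is an assumed precomputed input, not a name from the slides):

    % c: m-by-K matrix, c(j,i) = number of times word wj occurs in class Ci
    nWords = sum(c, 1);          % 1-by-K row: ni, total word count in each class
    condProb = c ./ nWords;      % P(wj | Ci) = cji / ni (implicit expansion, R2016b+)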

10
  • Classifying a new message d
  • compute the features of d, i.e., the number of
    times each word wj occurs in d
  • P(Ci | d) = P(Ci | w1, w2, ..., wd)
    ∝ P(Ci) P(w1 | Ci) P(w2 | Ci) ... P(wd | Ci)
  • assign d to the class i that has the maximum
    posterior probability
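A direct MATLAB sketch of this MAP rule, kept as a plain product for clarity (slide 13 explains why log-probabilities are needed in practice; counts, prior and condProb follow the earlier sketches):

    % counts: m-by-1 vector, counts(j) = occurrences of word wj in document d
    score = zeros(K, 1);
    for i = 1:K
        % P(Ci) * prod_j P(wj | Ci)^counts(j), up to the constant Z
        score(i) = prior(i) * prod(condProb(:, i) .^ counts);
    end
    [~, bestClass] = max(score);   % MAP decision rule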

11
Points to note
  • Preprocessing:
  • eliminate punctuation
  • eliminate numerals
  • convert all characters to lowercase
  • eliminate all words with fewer than 4 letters
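A minimal MATLAB sketch of these four steps (regular expressions are one possible implementation, and the file name is a placeholder):

    raw = fileread('doc.txt');                     % 'doc.txt' is a placeholder name
    txt = lower(raw);                              % lowercase
    txt = regexprep(txt, '[^a-z\s]', ' ');         % drop punctuation and numerals
    words = strsplit(strtrim(txt));                % tokenize on whitespace
    words = words(cellfun(@length, words) >= 4);   % keep words of 4+ letters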

12
  • You need to build a large vocabulary and
    separately count how often each word is
    encountered. The vocabulary can be built using a
    hash table (sketched below).
  • How to choose the key words wi?
  • for each class, you can pick out the k words that
    occur most frequently
  • or, over all the training data, pick out the k
    words that appear most frequently
  • take the union of all these words as the
    key-words/features
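A sketch of the hash-table word count using MATLAB's containers.Map (words comes from the preprocessing sketch above; k is the assumed cutoff):

    freq = containers.Map('KeyType', 'char', 'ValueType', 'double');
    for t = 1:numel(words)
        w = words{t};
        if isKey(freq, w)
            freq(w) = freq(w) + 1;
        else
            freq(w) = 1;
        end
    end
    % pick the k most frequent words (keys/values come back in matching order)
    [~, order] = sort(cell2mat(values(freq)), 'descend');
    vocab = keys(freq);
    keyWords = vocab(order(1:min(k, numel(order))));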

13
  • Zero probabilities must be avoided (why?)
  • this occurs when a word has been encountered in
    one class but not in the others
  • in this case the class-conditional probability is
    zero, and a single zero factor forces the whole
    posterior product to zero, regardless of the
    other features
  • to prevent this, re-estimate the conditional
    probability as P(wj | Ci) = e / ni, with e a
    small, tunable number
  • convert all probabilities to log-probabilities
    (log-likelihoods) to avoid exceeding the dynamic
    range of the computer representation of real
    numbers
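A sketch combining both fixes (the value of e is an assumed choice; this replaces the plain product in the slide-10 sketch):

    e = 1e-3;                               % small tunable constant (assumed value)
    smoothed = max(condProb, e ./ nWords);  % floor zero entries at e/ni
    logScore = zeros(K, 1);
    for i = 1:K
        logScore(i) = log(prior(i)) + counts' * log(smoothed(:, i));
    end
    [~, bestClass] = max(logScore);         % same MAP decision, in log space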

14
Digit Classification (assignment 1)
  • USPS data set contains normalized handwritten
    digits, scanned by the U.S. Postal Service.
  • 16 x 16 grayscale images
  • 7291 training and 2007 test observations
  • Format: each line consists of the digit id (0-9)
    followed by the 256 grayscale values.
  • The test set is notoriously "difficult".
  • Download it from here

15
USPS digits
16
Setting
  • Classes: 0-9
  • Features: each pixel is used as a feature, so
    there are 16 x 16 = 256 features
  • Rather than raw pixel gray values, we can use more
    informative features, such as (detected) corners,
    crossings, slope, gravity center, etc.
  • Decide how to quantize the real-valued features
    (one option is sketched after this list)
  • Task: classify new digits into one of the classes
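One simple quantization, as an illustrative choice rather than a requirement from the slides, is to binarize each pixel at a threshold, which makes every feature Bernoulli and fits the counting estimates from the text-classification slides:

    % digit: 16-by-16-by-n array from read_usps (see slide 18)
    threshold = 0;                             % assumes pixels roughly in [-1, 1]
    X = reshape(digit, 256, []) > threshold;   % 256-by-n binary feature matrix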

17
Specifications
  • (Preliminary; assignment 1 will be posted on
    Friday)
  • You can use either MATLAB or C for programming
  • If you use C, you should create the classes
    and their members/functions as required
  • If you use MATLAB, you should write the
    functions as required
  • The input and output formats will also be fixed in
    the assignment

18
Files
  • MATLAB file to read the USPS data
  • >> [n, digit, label] = read_usps(path, file)
  • path: e.g., c:\...; file: usps_train.txt
  • n: number of digits/images obtained
  • digit: a 16 x 16 x n array
  • label: the label of each image
  • You may want to use it to read the USPS data
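A hedged usage sketch (the path keeps the slide's placeholder; the outputs follow the description above):

    [n, digit, label] = read_usps('c:\...', 'usps_train.txt');
    imagesc(digit(:, :, 1)); colormap(gray);     % inspect the first image
    title(sprintf('label = %d', label(1)));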

19
  • MATLAB file to output a series of files
  • >> output(str, i1, i2)
  • str: the common string part of the file names
  • i1 and i2 are the starting and ending integers
  • You may want to use it to write the digits into
    separate files with the naming scheme you like