1
Naive Bayes model
  • Comp221 tutorial 4 (assignment 1)
  • TA: Zhang Kai

2
Outline
  • Bayes probability model
  • Naive Bayes classifier
  • Text classification
  • Digit classification
  • Assignment specifications

3
Naive Bayes classifier
  • A naive Bayes classifier is a simple
    probabilistic classifier based on applying Bayes'
    theorem with strong (naive) independence
    assumptions; more specifically, it is an
    independent feature model.

4
Naive Bayes probability model
  • Graphical illustration (figure below)
  • a class node C at the root; we want P(C | F1, ..., Fn)
  • evidence nodes Fi: the observed features, as leaves
  • conditional independence between all evidence nodes
    given the class

[Figure: class node C at the root with feature leaves F1, F2, ..., Fn]
5
Naive Bayes probability model
  • The classifier is a conditional model p(C | F_1, ..., F_n)
  • Following Bayes' rule strictly, we have
    p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}
  • Simplify this through conditional independence,
    p(F_i \mid C, F_j) = p(F_i \mid C), \quad i \neq j
  • So the conditional distribution over the class C is
    p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)
  • Z = p(F_1, ..., F_n) is constant given the features

6
Naive Bayes classifier
  • The naive Bayes classifier combines the naive Bayes
    probability model with a decision rule, such as
    the maximum a posteriori (MAP) decision rule.
  • If there are k classes and a model for each p(Fi | C)
    can be expressed by r parameters, then the naive
    Bayes model has (k - 1) + n r k parameters.
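A quick worked instance of this count (the numbers are illustrative, anticipating the digit task on later slides):

    % k = 10 classes, n = 256 features, r = 1 (one Bernoulli parameter per feature)
    \[(k - 1) + n\,r\,k = (10 - 1) + 256 \cdot 1 \cdot 10 = 2569\]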

7
Text Classification
  • Task: classify text documents into one of a set of
    pre-defined classes, such as sports, recreation,
    politics, war, economy, etc.
  • Given:
  • K groups of training texts
  • each group with a label, containing a number of
    text documents

8
Procedures
  • Computing a priori class probabilities
  • count the number of text documents ni in each
    directory/class Ci
  • count the total number of training text documents n
  • prior probability: P(Ci) = ni / n
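A minimal MATLAB sketch of this step (the names labels and K are illustrative assumptions, not from the assignment):

    % labels: n-by-1 vector, labels(t) = class index of training document t
    % K: number of classes
    prior = zeros(K, 1);
    for i = 1:K
        prior(i) = sum(labels == i) / numel(labels);   % P(Ci) = ni / n
    end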

9
  • Computing class-conditional word likelihoods
  • suppose we have chosen m key words, denoted
    w1, w2, ..., wm
  • count the number of times cji that word wj
    occurs in text class Ci
  • count the total number of words ni in class Ci
  • the class-conditional probability is then
    P(wj | Ci) = cji / ni
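One way to compute these estimates in MATLAB (the count matrix c is an assumed precomputed input, not a name from the slides):

    % c: m-by-K matrix, c(j,i) = number of times word wj occurs in class Ci
    nWords = sum(c, 1);          % 1-by-K row: ni, total word count in each class
    condProb = c ./ nWords;      % P(wj | Ci) = cji / ni (implicit expansion, R2016b+)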

10
  • Classifying a new message d
  • compute the features of d, i.e., the number of
    times each word wj occurs in d
  • P(Ci | d) = P(Ci | w1, w2, ..., wd)
    ∝ P(Ci) P(w1 | Ci) P(w2 | Ci) ... P(wd | Ci)
  • assign d to the class i that has the maximum
    posterior probability
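A direct MATLAB sketch of this MAP rule, kept as a plain product for clarity (slide 13 explains why log-probabilities are needed in practice; counts, prior and condProb follow the earlier sketches):

    % counts: m-by-1 vector, counts(j) = occurrences of word wj in document d
    score = zeros(K, 1);
    for i = 1:K
        % P(Ci) * prod_j P(wj | Ci)^counts(j), up to the constant Z
        score(i) = prior(i) * prod(condProb(:, i) .^ counts);
    end
    [~, bestClass] = max(score);   % MAP decision rule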

11
Points to note
  • Preprocessing:
  • eliminate punctuation
  • eliminate numerals
  • convert all characters to lowercase
  • eliminate all words with fewer than 4 letters
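A minimal MATLAB sketch of these four steps (regular expressions are one possible implementation, and the file name is a placeholder):

    raw = fileread('doc.txt');                     % 'doc.txt' is a placeholder name
    txt = lower(raw);                              % lowercase
    txt = regexprep(txt, '[^a-z\s]', ' ');         % drop punctuation and numerals
    words = strsplit(strtrim(txt));                % tokenize on whitespace
    words = words(cellfun(@length, words) >= 4);   % keep words of 4+ letters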

12
  • You need to build a large vocabulary and
    separately count how often each word is
    encountered. The vocabulary can be built using a
    hash table (sketched below).
  • How to choose the key words wi?
  • for each class, you can pick out the k words that
    occur most frequently
  • or, over all the training data, pick out the k
    words that appear most frequently
  • take the union of all these words as the
    key-words/features
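A sketch of the hash-table word count using MATLAB's containers.Map (words comes from the preprocessing sketch above; k is the assumed cutoff):

    freq = containers.Map('KeyType', 'char', 'ValueType', 'double');
    for t = 1:numel(words)
        w = words{t};
        if isKey(freq, w)
            freq(w) = freq(w) + 1;
        else
            freq(w) = 1;
        end
    end
    % pick the k most frequent words (keys/values come back in matching order)
    [~, order] = sort(cell2mat(values(freq)), 'descend');
    vocab = keys(freq);
    keyWords = vocab(order(1:min(k, numel(order))));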

13
  • Zero probabilities must be avoided (why?)
  • this occurs when a word has been encountered in
    one class but not in the others
  • in this case the class-conditional probability is
    zero, and a single zero factor forces the whole
    posterior product to zero, regardless of the
    other features
  • to prevent this, re-estimate the conditional
    probability as P(wj | Ci) = e / ni, with e a
    small, tunable number
  • convert all probabilities to log-probabilities
    (log-likelihoods) to avoid exceeding the dynamic
    range of the computer representation of real
    numbers
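A sketch combining both fixes (the value of e is an assumed choice; this replaces the plain product in the slide-10 sketch):

    e = 1e-3;                               % small tunable constant (assumed value)
    smoothed = max(condProb, e ./ nWords);  % floor zero entries at e/ni
    logScore = zeros(K, 1);
    for i = 1:K
        logScore(i) = log(prior(i)) + counts' * log(smoothed(:, i));
    end
    [~, bestClass] = max(logScore);         % same MAP decision, in log space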

14
Digit Classification (assignment 1)
  • USPS data set contains normalized handwritten
    digits, scanned by the U.S. Postal Service.
  • 16 x 16 grayscale images
  • 7291 training and 2007 test observations
  • Format: each line consists of the digit id (0-9)
    followed by the 256 grayscale values.
  • The test set is notoriously "difficult".
  • Download it from here

15
USPS digits
16
Setting
  • Classes: 0-9
  • Features: each pixel is used as a feature, so
    there are 16 x 16 = 256 features
  • Rather than raw pixel gray values, we can use more
    informative features, such as (detected) corners,
    crossings, slope, gravity center, etc.
  • Decide how to quantize the real-valued features
    (one option is sketched after this list)
  • Task: classify new digits into one of the classes
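One simple quantization, as an illustrative choice rather than a requirement from the slides, is to binarize each pixel at a threshold, which makes every feature Bernoulli and fits the counting estimates from the text-classification slides:

    % digit: 16-by-16-by-n array from read_usps (see slide 18)
    threshold = 0;                             % assumes pixels roughly in [-1, 1]
    X = reshape(digit, 256, []) > threshold;   % 256-by-n binary feature matrix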

17
Specifications
  • (Preliminary; assignment 1 will be posted on
    Friday)
  • You can use either MATLAB or C for programming
  • If you use C, you should create the classes
    and their members/functions as required
  • If you use MATLAB, you should write the
    functions as required
  • The input and output formats will also be fixed in
    the assignment

18
Files
  • MATLAB file to read the USPS data
  • >> [n, digit, label] = read_usps(path, file)
  • path: e.g., c:\...; file: usps_train.txt
  • n: number of digits/images obtained
  • digit: a 16 x 16 x n array
  • label: the label of each image
  • You may want to use it to read the USPS data
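A hedged usage sketch (the path keeps the slide's placeholder; the outputs follow the description above):

    [n, digit, label] = read_usps('c:\...', 'usps_train.txt');
    imagesc(digit(:, :, 1)); colormap(gray);     % inspect the first image
    title(sprintf('label = %d', label(1)));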

19
  • MATLAB file to output a series of files
  • >> output(str, i1, i2)
  • str: the common string part of the file names
  • i1 and i2 are the starting and ending integers
  • You may want to use it to write the digits into
    separate files with the naming scheme you like