Bayesian Spam Filters - PowerPoint PPT Presentation

Slides: 23
Provided by: Jam953
Learn more at: https://www.cse.unr.edu

1
Bayesian Spam Filters
  • Key Concepts
  • Conditional Probability
  • Independence
  • Bayes' Theorem

2
Spam or Ham?
  • FROM: Terry Delaney (removed)
  • TO: (removed)
  • Subject: FDA approved on-line pharmacies! click
    here (removed)
  • Chose your product and site below
  • Canadian pharmacy (removed) - Cialis Soft Tabs -
    5.78, Viagra Professional - 4.07, Soma - 1.38,
    Human Growth Hormone - 43.37, Meridia - 3.32,
    Tramadol - 2.17, Levitra - 11.97.

3
Quick Reminders
  • Conditional probability: for events E, F with $P(F) > 0$,
    $P(E \mid F) = \dfrac{P(E \cap F)}{P(F)}$
  • Independence: E and F are independent if and only
    if $P(E \cap F) = P(E)\,P(F)$

4
Bayes' Theorem: A Quick Proof


5
Proof cont.
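For reference, one standard way to derive Bayes' theorem from the definition of conditional probability is sketched below. It is a textbook argument and may differ in detail from the proof on the slides. It assumes events E and F with $P(E) > 0$ and $0 < P(F) < 1$, and writes $F^C$ for the complement of F.

```latex
\begin{align*}
P(F \mid E) &= \frac{P(E \cap F)}{P(E)}
  && \text{definition of conditional probability} \\
P(E \cap F) &= P(E \mid F)\,P(F)
  && \text{same definition, solved for the intersection} \\
P(E) &= P(E \cap F) + P(E \cap F^C)
  && \text{$E$ splits into two disjoint parts} \\
     &= P(E \mid F)\,P(F) + P(E \mid F^C)\,P(F^C) \\
P(F \mid E) &= \frac{P(E \mid F)\,P(F)}
                    {P(E \mid F)\,P(F) + P(E \mid F^C)\,P(F^C)}
\end{align*}
```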
6
Applying Bayes' Theorem
  • Let our sample space be the set of emails.
  • Let S be the event that a message is spam; hence $S^C$
    is the event that a message is not spam.
  • Let E be the event a message contains a word w.
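With these events, Bayes' theorem gives the probability that a message is spam given that it contains w. Written out as a sketch, with $S^C$ denoting the event that a message is not spam:

```latex
\[
P(S \mid E) =
  \frac{P(E \mid S)\,P(S)}
       {P(E \mid S)\,P(S) + P(E \mid S^C)\,P(S^C)}
\]
```

None of the quantities on the right is known exactly, so each has to be estimated from example messages.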

7
Estimations
8
Estimation Continued
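The estimates used in the worked examples later in the presentation can be written as follows. This is a reconstruction from those examples rather than a copy of the slide: $p(w)$ and $q(w)$ are the fractions of spam and non-spam training messages that contain w, $S^C$ is the event that a message is not spam, and the last line assumes a message is equally likely to be spam or not, $P(S) = P(S^C) = 1/2$.

```latex
\begin{align*}
p(w) &= \frac{\text{number of spam messages containing } w}
             {\text{number of spam messages}}
        \approx P(E \mid S) \\
q(w) &= \frac{\text{number of non-spam messages containing } w}
             {\text{number of non-spam messages}}
        \approx P(E \mid S^C) \\
r(w) &= \frac{p(w)}{p(w) + q(w)} \approx P(S \mid E)
\end{align*}
```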
9
Spam based on single words?
  • Probabilities based on single words: bad idea
  • False positives AND false negatives aplenty
  • Calculate based on n words, assuming the events
    $E_i \mid S$ (and $E_i \mid S^C$) are independent and $P(S) = P(S^C)$.

10
Final Approximation
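Generalizing the two-word formula used in the later example, the n-word approximation can be sketched as below. It assumes the word events are independent within spam and within non-spam, and that a message is equally likely to be spam or not. Here $p(w_i)$ and $q(w_i)$ are the fractions of spam and non-spam messages containing word $w_i$.

```latex
\[
r(w_1, \dots, w_n) =
  \frac{\prod_{i=1}^{n} p(w_i)}
       {\prod_{i=1}^{n} p(w_i) + \prod_{i=1}^{n} q(w_i)}
\]
```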
11
How do we use this?
  • User must train the filter based on messages in
    his/her inbox to estimate probabilities
  • The program or user must define a threshold
    probability r
  • If the estimated spam probability is greater than r, the
    message is considered spam.
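The training and thresholding steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the presenter's implementation: the function names (`tokenize`, `train`, `word_score`, `is_spam`), the whitespace tokenizer, and the toy messages are all hypothetical, and spam and non-spam are assumed equally likely beforehand.

```python
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace,
    # and keep each word once per message.
    return set(text.lower().split())

def train(spam_messages, ham_messages):
    """Count how many spam and how many non-spam messages contain each word."""
    spam_counts, ham_counts = Counter(), Counter()
    for message in spam_messages:
        spam_counts.update(tokenize(message))
    for message in ham_messages:
        ham_counts.update(tokenize(message))
    return spam_counts, ham_counts

def word_score(word, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate the probability that a message containing `word` is spam,
    assuming spam and non-spam are equally likely beforehand."""
    p = spam_counts[word] / n_spam   # fraction of spam containing the word
    q = ham_counts[word] / n_ham     # fraction of non-spam containing the word
    if p + q == 0:
        return None                  # word never seen in training: no estimate
    return p / (p + q)

def is_spam(score, threshold=0.9):
    # Flag the message only when an estimate exists and exceeds the threshold.
    return score is not None and score > threshold

# Made-up training data, for illustration only.
spam = ["cheap viagra now", "viagra pharmacy online", "win money now"]
ham = ["meeting at noon", "lunch now"]

spam_counts, ham_counts = train(spam, ham)
score_now = word_score("now", spam_counts, ham_counts, len(spam), len(ham))
score_viagra = word_score("viagra", spam_counts, ham_counts, len(spam), len(ham))
print(is_spam(score_now), is_spam(score_viagra))
```

With so little data the estimates are crude; a word that never appears in non-spam gets a score of exactly 1, which is one reason practical filters train on many messages.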

12
Example
  • Suppose the filter has the following data:
  • Threshold probability: 0.9
  • Viagra occurs in 250 of 2000 spam messages
  • Viagra occurs in only 5 of 1000 non-spam
    messages
  • Let's try to estimate the probability, using the
    process we just defined

13
Example Cont.
  • Step 1: Estimate the probability that a spam message
    contains the word Viagra.
  • p(Viagra) = 250 / 2000 = 0.125
  • Step 2: Estimate the probability that a non-spam message
    contains the word Viagra.
  • q(Viagra) = 5 / 1000 = 0.005

14
Example Cont.
  • Since we are assuming that it is equally likely
    that an incoming message is or is not spam, we
    can estimate the probability with this equation:
  • $r(\text{Viagra}) = \dfrac{p(\text{Viagra})}{p(\text{Viagra}) + q(\text{Viagra})}$

15
Example Cont.
  • $r(\text{Viagra}) = \dfrac{0.125}{0.125 + 0.005} = \dfrac{0.125}{0.130} \approx 0.962$
  • Since r(Viagra) is greater than the threshold of
    0.9, we can reject this message as spam.
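The arithmetic in this example can be checked exactly with Python's `fractions` module. This is a small sketch using the counts from the example; the variable names are illustrative only.

```python
from fractions import Fraction

# Counts taken from the example: 250 of 2000 spam messages and
# 5 of 1000 non-spam messages contain the word.
p = Fraction(250, 2000)   # fraction of spam messages containing the word
q = Fraction(5, 1000)     # fraction of non-spam messages containing the word

# Equal-prior estimate r = p / (p + q), kept as an exact fraction.
r = p / (p + q)
threshold = Fraction(9, 10)

print(r, round(float(r), 3), r > threshold)
```

The exact value is 25/26, which rounds to 0.962 and lies above the 0.9 threshold.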

16
Harder Stuff
  • Single-word detection can lead to a lot of false
    positives and false negatives.
  • To counter this, most spam filters look for the
    presence of multiple words.

17
Another Example
  • 2000 spam messages, 1000 real messages
  • Viagra appears in 400 spam messages
  • Viagra appears in 60 real messages
  • Cialis appears in 200 spam and 25 real messages
  • Threshold probability: 0.9
  • Let's calculate the probability that it's spam.

18
Example Cont.
  • Step 1: Estimate the probability that a spam message
    contains the word Viagra.
  • p(Viagra) = 400 / 2000 = 0.2
  • Step 2: Estimate the probability that a non-spam message
    contains the word Viagra.
  • q(Viagra) = 60 / 1000 = 0.06

19
Example Cont.
  • Step 3: Estimate the probability that a spam message
    contains the word Cialis.
  • p(Cialis) = 200 / 2000 = 0.1
  • Step 4: Estimate the probability that a non-spam message
    contains the word Cialis.
  • q(Cialis) = 25 / 1000 = 0.025

20
Example Cont.
  • Using our approximation, we have
  • $r(\text{Viagra}, \text{Cialis}) = \dfrac{p(\text{Viagra})\,p(\text{Cialis})}{p(\text{Viagra})\,p(\text{Cialis}) + q(\text{Viagra})\,q(\text{Cialis})}$

21
Example Cont.
  • $r(\text{Viagra}, \text{Cialis}) = \dfrac{(0.2)(0.1)}{(0.2)(0.1) + (0.06)(0.025)} = \dfrac{0.02}{0.0215} \approx 0.930$
  • Since 0.930 is greater than the threshold probability of
    0.9, this message will be rejected as spam.
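The multi-word calculation can be sketched as a short Python function. This is an illustration under the example's assumptions (independent word events, equal prior odds of spam and non-spam); the name `combined_score` and the list-of-pairs input format are hypothetical choices, not part of the original slides.

```python
from math import prod

def combined_score(word_rates):
    """Combine per-word rates into one spam estimate.

    word_rates is a list of (p, q) pairs, where p is the fraction of spam
    messages containing the word and q is the fraction of non-spam messages
    containing it. Assumes independent words and equal prior odds.
    """
    spam_side = prod(p for p, _ in word_rates)
    ham_side = prod(q for _, q in word_rates)
    return spam_side / (spam_side + ham_side)

# Counts from the example: 2000 spam and 1000 real messages.
rates = [
    (400 / 2000, 60 / 1000),   # Viagra: p = 0.2, q = 0.06
    (200 / 2000, 25 / 1000),   # Cialis: p = 0.1, q = 0.025
]

score = combined_score(rates)
print(f"{score:.3f}", score > 0.9)
```

With a single (p, q) pair the function reduces to the one-word estimate p / (p + q), so the same code covers both examples.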

22
Questions?