A Bayesian Approach to filter Junk E-Mail - PowerPoint PPT Presentation

1
A Bayesian Approach to filter Junk E-Mail
  • Yasir IQBAL
  • Master Student in Computer Science
  • Universität des Saarlandes

Seminar Classification and Clustering methods
for Computational Linguistics
16.11.2004
2
Presentation Overview
  • Problem description (what is the Spam problem?)
  • Classification problem
  • Naïve Bayes Classifier
  • Logical system view
  • Feature selection and representation
  • Results
  • Precision and Recall
  • Discussion

3
Spam/junk/bulk Emails
  • The messages you waste your time throwing out
  • Spam: unsolicited messages that you do not want to
    receive
  • Junk: irrelevant to the recipient, unwanted
  • Bulk: mass mailing for business marketing (or to
    fill up the mailbox, etc.)

4
Problem examples
  • You have won!!!!, you are almost winner of ...
  • Viagra, generic Viagra available order now
  • Your order, your item has to be shipped
  • Lose your weight, no subscription required
  • Assistance required, an amount of 25 million
    US$
  • Get login and password now, age above 18
  • Check this, hi, your document has error
  • Download it, free celebrity wallpapers download

5
Motivation
  • Who, and how, should decide what is Spam?
  • How to get rid of this Spam automatically?
  • Because time is money
  • and such emails may contain offensive material

What are the computers for? Let them do the work!
6
How to fight? (techniques)
  • Rule-based filtering of emails (see the sketch
    after this list)
  • if SENDER contains "schacht" → ACTION: INBOX
  • if SUBJECT contains "Win" → ACTION: DELETE
  • if BODY contains "Viagra" → ACTION: DELETE
  • Problems: static rules, language dependent; how
    many rules, and who should define them?
  • Statistical filter (classifier) based on message
    attributes
  • Decision Trees
  • Support Vector Machines
  • Naïve Bayes Classifier (we'll discuss this one)
  • Problems: what if no features can be extracted?
    Error loss?
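A minimal sketch of such a rule-based filter in Python (the rules and
message fields are illustrative, not taken from any real system):

  # Hypothetical static rules: language dependent, hard to maintain.
  def rule_based_filter(message):
      """message is a dict with 'sender', 'subject' and 'body' fields;
      returns 'INBOX' or 'DELETE'."""
      if "schacht" in message["sender"].lower():
          return "INBOX"              # whitelist a known sender
      if "win" in message["subject"].lower():
          return "DELETE"             # static keyword rule
      if "viagra" in message["body"].lower():
          return "DELETE"
      return "INBOX"                  # default: keep the mail

  print(rule_based_filter({"sender": "news@example.com",
                           "subject": "You WIN a prize",
                           "body": "..."}))   # -> DELETE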

7
Classification tasks
  • These are a few other classification tasks
  • Text classification (like the mail messages)
  • Content management, information retrieval
  • Document classification
  • Similar to text classification
  • Speech recognition
  • what do you mean? - yeah, you understand :)
  • Named Entity Recognition
  • Reading and Bath: cities or simple verbs?
  • Biometric sensors for authentication
  • fingerprints or the face to identify someone

8
Training methods
  • Offline learning
  • some training data, prepared manually, with
    annotation (used to train the system before the test)
  • <email type="spam">hi, have you thought about online
    credit?</email>
  • <email type="normal_email">Soha! sorry, cannot
    reach at 1800</email>
  • Online learning (see the sketch after this list)
  • At run-time the user increases the knowledge of the
    system by giving feedback on its decisions.
  • Example: we can click on Spam or Not Spam in the
    Yahoo mail service.
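A minimal sketch of the online-learning idea: the per-class word counts
grow every time the user clicks "Spam" or "Not Spam" (the data
structures and messages are illustrative):

  from collections import defaultdict

  # per-class token and message counts, updated at run time
  word_counts = {"SPAM": defaultdict(int), "LEGITIMATE": defaultdict(int)}
  msg_counts = {"SPAM": 0, "LEGITIMATE": 0}

  def user_feedback(message_text, label):
      """Called when the user marks a message as 'SPAM' or 'LEGITIMATE'."""
      msg_counts[label] += 1
      for token in message_text.lower().split():
          word_counts[label][token] += 1

  user_feedback("have you thought about online credit?", "SPAM")
  user_feedback("sorry, cannot reach at 1800", "LEGITIMATE")
  print(msg_counts)   # {'SPAM': 1, 'LEGITIMATE': 1}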

9
Yahoo Mail (Online learning)
10
Model overview
  • Flow (training/test)
  • Steps
  • Training data of annotated emails
  • Emails annotated
  • A set of classes
  • In our case two possible classes
  • Can further be personalized
  • Feature extraction (text etc.)
  • Tokenization
  • Domain-specific features
  • The most informative features are selected
  • Classify (each message/email)
  • Calculate posterior probabilities
  • Evaluate results (precision/recall)

[Flow diagram: training and test data → feature extraction and selection → classify (Email? / Spam?) → evaluate]
11
Message attributes (features)
  • These are the indicators for classifying messages
    as legitimate or Spam
  • Features of the email messages:
  • Words (tokens): free, win, online, enlarge,
    weight, money, offer
  • Phrases: "FREE!!!", "only $", "order now"
  • Special characters: "$pecial", "grea8", "V i a g r
    a"
  • Mail headers: sender name, to and from email
    addresses, domain name / IP address, ...

12
Feature vector matrix (binary variables)
Words/phrases as features: 1 if the feature exists
in the message, otherwise 0 (see the sketch below)
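A minimal sketch of such a binary feature matrix (the vocabulary and
messages are made up for illustration):

  # rows = messages, columns = features,
  # entry = 1 if the feature occurs in the message, else 0
  features = ["free", "win", "online", "credit", "meeting"]

  messages = [
      "win a free online credit now",   # spam-like
      "meeting moved to 3 pm",          # legitimate-like
  ]

  matrix = [[1 if f in msg.split() else 0 for f in features]
            for msg in messages]

  print(matrix)   # [[1, 1, 1, 1, 0], [0, 0, 0, 0, 1]]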
13
Feature Selection
  • How to select the most prominent features?
  • Words/tokens, phrases, header information
  • Text of the email, HTML markup, header fields,
    email addresses
  • Removing insignificant features
  • Calculate the mutual information between each
    feature and the class (see the sketch after this list)
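A minimal sketch of scoring one binary feature by its mutual
information with the class; the counts in the example are invented:

  import math

  def mutual_information(n_11, n_10, n_01, n_00):
      """MI (in bits) between a binary feature X and the binary class C.
      n_11: feature present and SPAM, n_10: present and LEGITIMATE,
      n_01: absent and SPAM,          n_00: absent and LEGITIMATE."""
      n = n_11 + n_10 + n_01 + n_00
      mi = 0.0
      for n_xc, n_x, n_c in [
          (n_11, n_11 + n_10, n_11 + n_01),
          (n_10, n_11 + n_10, n_10 + n_00),
          (n_01, n_01 + n_00, n_11 + n_01),
          (n_00, n_01 + n_00, n_10 + n_00),
      ]:
          if n_xc > 0:
              mi += (n_xc / n) * math.log2(n * n_xc / (n_x * n_c))
      return mi

  # "free" occurs in 300 of 350 spam mails but in only 10 of 650
  # legitimate ones -> high mutual information with the class
  print(mutual_information(n_11=300, n_10=10, n_01=50, n_00=640))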

14
Conditional probability
  • Probability of an event B given an observed
    event A:
  • P(B | A) = P(A | B) · P(B) / P(A)
  • The probability of event A must be > 0 (A must
    have occurred)

[Venn diagram: B is the SPAM class, A is the feature set observed in an EMAIL; the overlap is the probability that A and B occur together. We calculate the probability that the observed features belong to the SPAM or to the EMAIL (legitimate) class; a small worked example follows.]
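A tiny worked instance of the rule above, with made-up numbers:

  # Bayes' rule: P(B | A) = P(A | B) * P(B) / P(A)
  # B = "message is SPAM", A = "message contains the word 'free'"
  p_b = 0.35           # prior: 35% of mail is spam
  p_a_given_b = 0.60   # 'free' appears in 60% of spam mails
  p_a = 0.25           # 'free' appears in 25% of all mails

  p_b_given_a = p_a_given_b * p_b / p_a
  print(p_b_given_a)   # 0.84: seeing 'free' raises P(SPAM) from 0.35 to 0.84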
15
How to apply to the problem?
  • When X = {x1, x2, x3, x4, ..., xn} is a feature vector,
  • a set of features, e.g. X = {online, credit,
    now!!!, Zinc}
  • and C = {c1, c2, c3, c4, ..., ck} is a set of classes,
  • in our case only two classes, i.e. C = {SPAM,
    LEGITIMATE}
  • P(C=ck | X=x) = P(X=x | C=ck) · P(C=ck) / P(X=x)
  • the assumption is made that each feature is
    independent of the others
  • P(SPAM | online, credit) =
    P(online | SPAM) · P(credit | SPAM) · P(SPAM) /
    (P(online) · P(credit))
    (see the sketch after this list)
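A minimal sketch of this calculation; the priors follow the next slide
(0.35 / 0.65) and the per-word probabilities are invented:

  # Naive Bayes posterior for the feature vector X = {online, credit}
  p_spam, p_legit = 0.35, 0.65

  p_word_given_spam = {"online": 0.30, "credit": 0.20}
  p_word_given_legit = {"online": 0.05, "credit": 0.04}

  # unnormalized posteriors: prior times the product of the
  # (assumed independent) feature likelihoods
  score_spam, score_legit = p_spam, p_legit
  for w in ["online", "credit"]:
      score_spam *= p_word_given_spam[w]
      score_legit *= p_word_given_legit[w]

  # normalize so that the two posteriors sum to 1
  total = score_spam + score_legit
  print(score_spam / total, score_legit / total)   # ~0.94 vs ~0.06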

16
Classification (Naïve Bayes)
  • P(C=SPAM | x1, x2, x3, ..., xn) =
    P(x1, x2, x3, ..., xn | C=SPAM) · P(SPAM) /
    P(x1, x2, x3, ..., xn)
  • Prior probability
  • Let us say we observe 35% of emails as junk/spam
  • P(SPAM) = 0.35 and P(LEGITIMATE) = 0.65
  • Posterior probability (for Spam)
  • requires the conditional probability of the features
    given the class, P(x1, x2, x3, ..., xn | C=SPAM); under
    the assumption of independence of features this
    factorizes into a product of per-feature probabilities
    P(xi | C=SPAM)

17
Classifier
  • Finally we classify:
  • Is the mail a Spam?
  • P(SPAM | X) / P(LEGITIMATE | X) > λ ?
  • The choice of λ depends on the cost we put on
    misclassification (it acts as a threshold); see the
    sketch after this list
  • What's the cost?
  • Classifying an important email as SPAM is worse
  • Classifying a SPAM as a legitimate email is not
    that bad!
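A minimal sketch of this decision rule; the threshold value and the
posterior scores are illustrative only:

  def classify(p_spam_given_x, p_legit_given_x, lam=9.0):
      """Flag a message as SPAM only if the posterior odds exceed lam.
      A large lam makes 'legitimate mail marked as SPAM' rare, at the
      cost of letting more SPAM through."""
      if p_spam_given_x / p_legit_given_x > lam:
          return "SPAM"
      return "LEGITIMATE"

  print(classify(0.94, 0.06))   # odds ~15.7 > 9  -> SPAM
  print(classify(0.80, 0.20))   # odds 4.0  <= 9  -> LEGITIMATE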

18
Experiments
  • Used feature selection to decrease the dimensionality
    of the feature space
  • Corpus of 1789 actual messages (1578 junk, 211
    legitimate)
  • Features from the text tokens
  • removed too-rare tokens
  • Added about 35 hand-crafted phrase features
  • 20 non-textual, domain-specific features
  • Non-alphanumeric characters and the percentage of
    numeric characters were handy indicators
  • Top 500 features according to the mutual information
    between classes and features (the greater this value,
    the more informative the feature)

19
Evaluation?
  • How do we know how good our classifier is?
  • Calculate precision and recall! (see the sketch
    after this list)
  • Precision is the percentage of emails classified as
    SPAM that are in fact SPAM
  • Recall is the percentage of all SPAM emails that are
    correctly classified as SPAM
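A minimal sketch of computing both numbers from gold labels and
predictions (the labels are made up):

  def precision_recall(gold, predicted):
      """gold / predicted: lists of 'SPAM' or 'LEGITIMATE' labels."""
      tp = sum(1 for g, p in zip(gold, predicted) if g == p == "SPAM")
      predicted_spam = sum(1 for p in predicted if p == "SPAM")
      actual_spam = sum(1 for g in gold if g == "SPAM")
      precision = tp / predicted_spam if predicted_spam else 0.0
      recall = tp / actual_spam if actual_spam else 0.0
      return precision, recall

  gold = ["SPAM", "SPAM", "LEGITIMATE", "SPAM", "LEGITIMATE"]
  predicted = ["SPAM", "LEGITIMATE", "LEGITIMATE", "SPAM", "SPAM"]
  print(precision_recall(gold, predicted))   # (0.666..., 0.666...)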

[Figure: ideal precision/recall curve, with precision staying at 1 over the whole recall range from 0 to 1]
20
Results
21
Conclusion
  • Using an automatic filter is very successful
  • Hand-crafted features enhance performance
  • Success on this problem is a confirmation that the
    technique can be used for other text
    categorization tasks.
  • The Spam filter could be extended to classify other
    types of emails, like business or friends
    (subclasses).

22
Discussion
  • What are we classifying? (Objects)
  • What are the features?
  • What could be the features?
  • Bayesian classification
  • Strong and weak points
  • Possible improvements?
  • Why Bayesian instead of other methods?
  • What are the questionable assumptions?
  • Subclasses?
  • How to control error loss?
  • What happens when a normal email is moved to the
    trash, or a junk mail ends up in the inbox?

23
Merci, Danke, Muchas Gracias, Ačiū, ...
  • All of you were very patient, thank you!
  • Special thanks to Irene
  • for the opportunity to talk about
    classification
  • her hard work helped me prepare this talk
  • Thanks to Sabine and Stefan for conducting this seminar
  • Thank you (for support)
  • Imran Rauf, http://www.mpi-sb.mpg.de/irauf/
  • Habib Ur Rahman
  • and now... maybe thanks to the spammers also :)

24
References
  • Sahami et al.: "A Bayesian Approach to Filtering
    Junk E-Mail", 1998.
  • Manning, Schütze: Foundations of Statistical
    Natural Language Processing, 2000.

25
Extra slides
  • :) or ?

26
  • What are we classifying? (Objects)
  • Emails (to be classified as normal or Spam)
  • What are the features? Indicators for a class
  • What could be the features? Words, phrases,
    headers
  • Naïve Bayesian classification
  • Strong and weak points
  • High throughput, simple calculation
  • The assumption of independent features might not
    always hold true
  • Possible improvements?
  • Detect feature dependencies?
  • Why Bayesian instead of other methods?
  • See the strong points

27
  • What are the questionable assumptions?
  • features are independent of each other
  • Subclasses?
  • Emails could be classified into subclasses
  • SPAM → PORNO_SPAM, BUSINESS_SPAM
  • LEGITIMATE → BUSINESS, APPOINTMENTS, etc.
  • How to control error loss?
  • What happens when a normal email is moved to the
    trash, or a junk mail ends up in the inbox?

28
Bayesian networks
[Diagram: two Bayesian networks over a CLASS node and feature nodes x1, x2, x3, ..., xn. In the first, the feature nodes influence only the CLASS parent, so the features are independent (the Naïve Bayes structure). In the second, the feature nodes influence the parent and their siblings, so the features are dependent.]