Title: A Bayesian Approach to Filter Junk E-Mail
1. A Bayesian Approach to Filter Junk E-Mail
- Yasir IQBAL
- Master's student in Computer Science
- Universität des Saarlandes
Seminar: Classification and Clustering Methods for Computational Linguistics
16.11.2004
2. Presentation Overview
- Problem description (what is the spam problem?)
- The classification problem
- Naïve Bayes classifier
- Logical system view
- Feature selection and representation
- Results
- Precision and recall
- Discussion
3. Spam/Junk/Bulk Emails
- The messages you spend your time throwing out
- Spam: unsolicited messages you do not want to receive
- Junk: irrelevant to the recipient, unwanted
- Bulk: mass mailing for business marketing (or to fill up the mailbox, etc.)
4. Problem Examples
- "You have won!!!!", "You are almost a winner of ..."
- "Viagra, generic Viagra available, order now"
- "Your order: your item has to be shipped"
- "Lose your weight, no subscription required"
- "Assistance required: an amount of 25 million US$"
- "Get login and password now, age above 18"
- "Check this: hi, your document has an error"
- "Download it: free celebrity wallpapers download"
5. Motivation
- Who should decide what is spam, and how?
- How to get rid of this spam automatically?
- Because time is money
- ... and such emails often contain offensive material
What are the computers for? Let them work!
6. How to Fight? (Techniques)
- Rule-based filtering of emails
  - if SENDER contains "schacht" then ACTION: INBOX
  - if SUBJECT contains "Win" then ACTION: DELETE
  - if BODY contains "Viagra" then ACTION: DELETE
  - Problems: static rules, language dependent; how many rules, and who should define them?
- Statistical filters (classifiers) based on message attributes
  - Decision trees
  - Support vector machines
  - Naïve Bayes classifier (discussed here)
  - Problems: what if no features can be extracted? Error loss?
7. Classification Tasks
- A few other classification tasks:
- Text classification (like the mail message)
  - Content management, information retrieval
- Document classification
  - Same as text classification
- Speech recognition
  - "What do you mean?" - "Yeah, you understand!" :)
- Named entity recognition
  - "Reading" and "Bath": cities or simple verbs?
- Biometric sensors for authentication
  - Fingerprints, the face, to identify someone
8. Training Methods
- Offline learning
  - Some training data, prepared manually, with annotation (used to train the system before testing)
  - <email type=spam>hi, have you thought online credit?</email>
  - <email type=normal_email>Soha! sorry cannot reach at 1800</email>
- Online learning
  - At run time the user increases the knowledge of the system through feedback on the given decision.
  - Example: we can click on "Spam" or "Not Spam" in the Yahoo mail service.
9. Yahoo Mail (Online Learning)
10. Model Overview
- Steps:
  - Training data of annotated emails
    - Emails annotated
    - A set of classes
    - In our case, two possible classes
    - Can further be personalized
  - Feature extraction (text, etc.)
    - Tokenization
    - Domain-specific features
    - The most prominent features are selected
  - Classify (each message/email)
    - Calculate posterior probabilities
  - Evaluate results (precision/recall)
[Diagram: training and test data feed into feature extraction and selection, then the classifier, which decides Email? or Spam?, followed by evaluation]
11. Message Attributes (Features)
- These are the indicators for classifying messages as legitimate or spam
- Features of the email messages:
  - Words (tokens): free, win, online, enlarge, weight, money, offer
  - Phrases: "FREE!!!", "only $", "order now"
  - Special characters: $pecial, grea8, "V i a g r a"
  - Mail headers: sender name, To and From email addresses, domain name / IP address, ...
12. Feature Vector Matrix (Binary Variables)
Words/phrases as features: 1 if the feature exists in the message, otherwise 0
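Such a binary matrix can be sketched in a few lines of Python; the vocabulary and the two example messages below are invented for illustration.

```python
# Binary feature-vector matrix: one row per message, one column per
# feature, entry 1 if the feature occurs in the message, else 0.
features = ["free", "win", "online", "credit", "meeting"]

def to_vector(message, features=features):
    tokens = set(message.lower().split())
    return [1 if f in tokens else 0 for f in features]

messages = [
    "Win a FREE online credit check now",
    "Agenda for the project meeting tomorrow",
]
matrix = [to_vector(m) for m in messages]
# matrix[0] -> [1, 1, 1, 1, 0]   (spam-like message)
# matrix[1] -> [0, 0, 0, 0, 1]   (legitimate message)
```

In a real filter, the columns would be the selected words, phrases, and header features rather than this toy list.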
13. Feature Selection
- How to select the most prominent features?
  - Words/tokens, phrases, header information
  - Text of the email, HTML messages, header fields, email address
- Removing insignificant features
  - Calculate the mutual information between each feature and the class
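The mutual-information criterion can be sketched as follows; the toy observation lists are invented for illustration, and features scoring highest would be kept.

```python
import math

# Mutual information I(X; C) between a binary feature X and the class
# label C, estimated from (feature_value, class_label) observations:
# I(X; C) = sum over x, c of P(x, c) * log2( P(x, c) / (P(x) P(c)) )
def mutual_information(pairs):
    n = len(pairs)
    mi = 0.0
    for x in (0, 1):
        for c in ("spam", "legit"):
            p_xc = sum(1 for fx, fc in pairs if fx == x and fc == c) / n
            p_x = sum(1 for fx, _ in pairs if fx == x) / n
            p_c = sum(1 for _, fc in pairs if fc == c) / n
            if p_xc > 0:
                mi += p_xc * math.log2(p_xc / (p_x * p_c))
    return mi

# A feature present in every spam and in no legitimate mail ...
perfect = [(1, "spam")] * 5 + [(0, "legit")] * 5
# ... versus one whose value tells us nothing about the class.
useless = [(1, "spam"), (0, "spam"), (1, "legit"), (0, "legit")]
assert mutual_information(perfect) > mutual_information(useless)
```

A perfectly predictive binary feature reaches 1 bit of mutual information here, an uninformative one 0 bits.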
14. Conditional Probability
- Probability of an event B given an observed event A:
- P(B|A) = P(A|B) P(B) / P(A)
- The probability of event A must be > 0 (A must have occurred)
[Diagram: the SPAM and EMAIL classes with events B and A (the observed feature set); P(that A and B occurred together); calculate P(that these features belong to the SPAM or EMAIL class)]
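Bayes' rule can be checked with a small numeric example. The prior P(spam) = 0.35 is the figure used later in the talk; the likelihoods of observing the token "free" are assumed values for illustration only.

```python
# Bayes' rule: P(B|A) = P(A|B) P(B) / P(A), with
# B = "message is spam" and A = "message contains the token 'free'".
p_spam = 0.35
p_free_given_spam = 0.20   # assumed likelihood
p_free_given_legit = 0.01  # assumed likelihood

# Total probability of the evidence A (law of total probability):
p_free = p_free_given_spam * p_spam + p_free_given_legit * (1 - p_spam)
p_spam_given_free = p_free_given_spam * p_spam / p_free
# Seeing "free" raises the spam probability from 0.35 to about 0.92.
```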
15. How to Apply It to the Problem?
- X = (x1, x2, x3, ..., xn) is a feature vector
  - a set of features, e.g. X = {online, credit, now!!!, Zinc}
- C = {c1, c2, c3, ..., ck} is a set of classes
  - in our case only two classes, i.e. C = {SPAM, LEGITIMATE}
- P(C = ck | X = x) = P(X = x | C = ck) P(C = ck) / P(X = x)
- The assumption is made that each feature is independent of the others
- P(SPAM | online, credit) =
  P(online | SPAM) P(credit | SPAM) P(SPAM) / (P(online) P(credit))
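The product form above can be sketched directly; every probability in this snippet is an assumed value for illustration, not an estimate from real data.

```python
# Naïve Bayes score for the slide's example P(SPAM | online, credit):
# under the independence assumption, per-word likelihoods multiply.
priors = {"SPAM": 0.35, "LEGITIMATE": 0.65}
likelihood = {  # P(word | class), assumed values
    "SPAM":       {"online": 0.30, "credit": 0.25},
    "LEGITIMATE": {"online": 0.05, "credit": 0.04},
}

def score(words, cls):
    s = priors[cls]
    for w in words:
        s *= likelihood[cls][w]
    return s  # unnormalized: P(X) is the same for both classes

words = ["online", "credit"]
scores = {c: score(words, c) for c in priors}
# Normalizing over both classes recovers the posterior P(SPAM | words):
p_spam = scores["SPAM"] / sum(scores.values())
```

Because the evidence P(X) cancels when comparing classes, the filter only needs these unnormalized scores to decide.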
16. Classification (Naïve Bayes)
- P(C = SPAM | x1, x2, x3, ..., xn) = P(x1, x2, x3, ..., xn | C = SPAM) P(SPAM) / P(x1, x2, x3, ..., xn)
- Prior probability
  - Let us say we observe 35% of emails as junk/spam
  - P(SPAM) = 0.35 and P(LEGITIMATE) = 0.65
- Posterior probability (for spam)
  - The conditional probability of certain features given a certain class, P(x1, x2, x3, ..., xn | C = SPAM), with the assumption of independence of the features
17. Classifier
- Finally, we classify:
- Is the mail spam?
- P(SPAM | X) / P(LEGITIMATE | X) > λ ?
- The choice of λ (a threshold) depends on the cost we place on misclassification
- What is the cost?
  - Classifying an important email as SPAM is worse
  - Classifying a SPAM message as legitimate email is not that bad!
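The asymmetric-cost decision rule can be sketched as below; the threshold value 9 and the example posteriors are assumptions chosen only to show the mechanics.

```python
# Decision rule from the slide: flag a mail as spam only when the
# posterior ratio P(SPAM|X) / P(LEGITIMATE|X) exceeds a threshold.
# A threshold well above 1 encodes that losing a legitimate mail is
# costlier than letting one spam message through.
def classify(p_spam_given_x, threshold=9.0):
    p_legit_given_x = 1.0 - p_spam_given_x
    ratio = p_spam_given_x / p_legit_given_x
    return "SPAM" if ratio > threshold else "LEGITIMATE"

print(classify(0.95))  # ratio 19 > 9: confidently spam
print(classify(0.80))  # ratio 4: suspicious, but kept in the inbox
```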
18. Experiments
- Used feature selection to decrease the dimensionality of the features/data
- Corpus of 1789 actual messages (1578 junk, 211 legitimate)
- Features from the text tokens
  - Removed too-rare tokens
- Added about 35 hand-crafted phrase features
- 20 non-textual domain-specific features
- Non-alphanumeric characters and the percentage of numeric characters were handy indicators
- Top 500 features according to mutual information between classes and features (the greater this value, the better)
19. Evaluation
- How do we know how good our classifier is?
- Calculate precision and recall!
- Precision is the percentage of emails classified as SPAM that are in fact SPAM
- Recall is the percentage of all SPAM emails that are correctly classified as SPAM
[Plot: ideal precision/recall curve; precision (0 to 1) on the y-axis, recall (0 to 1) on the x-axis]
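The two definitions above translate directly into code; the gold labels and predictions here are hypothetical.

```python
# Precision and recall for the spam class, computed from toy data.
gold = ["spam", "spam", "spam", "legit", "legit", "legit"]
pred = ["spam", "spam", "legit", "spam", "legit", "legit"]

tp = sum(1 for g, p in zip(gold, pred) if g == "spam" and p == "spam")
fp = sum(1 for g, p in zip(gold, pred) if g == "legit" and p == "spam")
fn = sum(1 for g, p in zip(gold, pred) if g == "spam" and p == "legit")

precision = tp / (tp + fp)  # of mails flagged as spam, how many really are
recall = tp / (tp + fn)     # of all spam mails, how many were caught
# Here both are 2/3: one spam slipped through, one legit mail was flagged.
```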
20. Results
21. Conclusion
- Using an automatic filter is very successful
- Hand-crafted features enhance performance
- Success on this problem is confirmation that the technique can be used in other text categorization tasks.
- The spam filter could be enhanced to classify other types of emails, like business or friends (subclasses).
22. Discussion
- What are we classifying? (Objects)
- What are the features?
- What could the features be?
- Bayesian classification
  - Strong and weak points
  - Possible improvements?
  - Why Bayesian instead of other methods?
  - What are the questionable assumptions?
- Subclasses?
- How to control error loss?
  - When a normal email is moved to the trash, or a junk mail lands in the inbox?
23. Merci, Danke, Muchas Gracias, Ačiū
- All of you are very patient, thank you!
- Special thanks to Irene
  - For such an opportunity to talk about classification
  - Her hard work helped me to prepare this talk
- Thanks to Sabine and Stefan for conducting this seminar
- Thank you (for support)
  - Imran Rauf, http://www.mpi-sb.mpg.de/irauf/
  - Habib Ur Rahman
- And now... maybe thanks to the spammers also :)
24. References
- Sahami et al., "A Bayesian Approach to Filtering Junk E-Mail", 1998.
- Manning, Schütze: Foundations of Statistical Natural Language Processing, 2000.
25. Extra Slides
26.
- What are we classifying? (Objects)
  - Emails (to be classified as normal or spam)
- What are the features? Indicators for a class
- What could the features be? Words, phrases, headers
- Naïve Bayesian classification
  - Strong and weak points
    - High throughput, simple calculation
    - The assumption of independent features might not always hold true
  - Possible improvements?
    - Detect feature dependencies
  - Why Bayesian instead of other methods?
    - See the strong points
27.
- What are the questionable assumptions?
  - Features are independent of each other
- Subclasses?
  - Emails could be classified into subclasses
  - SPAM: PORNO_SPAM, BUSINESS_SPAM
  - LEGITIMATE: BUSINESS, APPOINTMENTS, etc.
- How to control error loss?
  - When a normal email is moved to the trash, or a junk mail lands in the inbox?
28. Bayesian Networks
[Diagram 1: a CLASS node linked to features x1, x2, x3, ..., xn; nodes influence only the parent, features are independent (the naïve Bayes case)]
[Diagram 2: a CLASS node linked to features x1, x2, x3, ..., xn with links between siblings; nodes influence the parent and siblings, features are dependent]