Spam Email Filter Project

Provided by: tong46

1
Spam Email Filter Project
  • A comparison of three approaches: Naïve Bayesian,
    Decision Tree, and Neural Network

2
Group Members
  • Duhong Chen
  • Tongjie Chen
  • Hua Ming

3
What is Spam Email?
  • Commonly referred to as junk email or
    unsolicited commercial email.
  • Why do we need to filter spam email?
      • Undesired use of screen space, bandwidth, and
        storage space.
      • Interferes with the efficient delivery of
        legitimate messages.

4
How to get rid of Spam Email?
  • Block (Allow) List
      • too restrictive
  • Manual Deletion
      • time-consuming, inefficient
  • Automatic Deletion
      • content-based

5
Existing Software
  • Spam Inspector
  • Spam Eater
  • EmailProtect
  • Yahoo Email System's Integrated Bulk Email Filter
  • More

6
Our Approaches
  • Data Source
      • Spam email from the website
        http://www.em.ca/bruceg/spam/
      • Some personal emails from an ISU mailbox
  • Preprocess
      • Select some frequent words as attributes
      • Convert each email into a binary vector
      • Label each email as legitimate or spam
        (the class label)
  • Training and Testing
      • Train and test each of the 3 approaches on
        the same training set and testing set
  • Performance Evaluation
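
The "convert each email into a binary vector" step might look roughly like the Python sketch below. The attribute words, the helper name `to_binary_vector`, and the whitespace tokenization are all illustrative assumptions; the slides do not show the project's actual code or word list.

```python
def to_binary_vector(text, vocabulary):
    """Return a 0/1 list: 1 if the vocabulary word occurs in the text, else 0."""
    # Assumption: lowercased whitespace tokenization is enough for a sketch.
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


# Hypothetical attribute words; the slides do not list the ones actually chosen.
vocabulary = ["free", "money", "meeting", "click"]

spam_vector = to_binary_vector("Click here for FREE money", vocabulary)
ham_vector = to_binary_vector("Project meeting moved to Friday", vocabulary)

# Each example pairs a binary vector with its class label.
dataset = [(spam_vector, "spam"), (ham_vector, "legitimate")]
print(dataset)
```

Each position in the vector corresponds to one selected attribute word, so every email becomes a fixed-length input that all three classifiers can share.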

7
Pre-Process
  • Strip out all HTML and XML tags to get the
    meaningful text.
  • Extract all tokens from the subject and body of
    each email and use them as candidate attributes.
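
As a rough illustration of these two steps, the Python sketch below strips tags with the standard library's `html.parser` and then extracts lowercase alphabetic tokens. The function names and the token pattern are assumptions made for this example, not the project's implementation; a real filter would also need to handle script/style content and malformed markup.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text that appears between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(raw):
    """Drop HTML/XML tags and return the remaining text."""
    extractor = _TextExtractor()
    extractor.feed(raw)
    extractor.close()
    return " ".join(extractor.parts)


def tokenize(subject, body):
    """Lowercase alphabetic tokens from the subject and the tag-free body."""
    text = subject + " " + strip_markup(body)
    return re.findall(r"[a-z]+", text.lower())


tokens = tokenize(
    "Win BIG",
    "<html><body><b>Click</b> <a href='x'>here</a> now!</body></html>",
)
print(tokens)
```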

8
Pre-Process (cont.)
  • How do we select meaningful keywords as
    attributes? For example, select the words with
    high mutual information (MI) with the class label.
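
The slide does not give the MI formula. Assuming the standard definition for a binary word-presence attribute X and class label C, MI(X; C) = Σ P(x, c) · log( P(x, c) / (P(x) · P(c)) ), a small Python sketch could score each candidate word like this (the toy data is made up):

```python
import math
from collections import Counter


def mutual_information(has_word, labels):
    """MI in bits between a binary word-presence attribute and the class label."""
    n = len(labels)
    joint = Counter(zip(has_word, labels))
    word_counts = Counter(has_word)
    class_counts = Counter(labels)
    mi = 0.0
    # Only observed (x, c) pairs are visited, so empty cells contribute 0.
    for (x, c), count in joint.items():
        p_xc = count / n
        p_x = word_counts[x] / n
        p_c = class_counts[c] / n
        mi += p_xc * math.log2(p_xc / (p_x * p_c))
    return mi


# Toy data (made up): four emails, two spam and two legitimate.
labels = ["spam", "spam", "legitimate", "legitimate"]
has_free = [1, 1, 0, 0]  # "free" appears only in the spam emails
has_the = [1, 0, 1, 0]   # "the" appears equally often in both classes

mi_free = mutual_information(has_free, labels)
mi_the = mutual_information(has_the, labels)
print(mi_free, mi_the)
```

A word that perfectly separates the classes scores highest, while a word spread evenly across both classes scores zero, so ranking by MI and keeping the top words yields the attribute set.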

9
Training and Testing
  • About 2000 spam emails and about 200 legitimate
    emails as training set
  • About 100 spam emails and about 100 legitimate
    emails as testing set
  • Use Naïve Bayesian, Decision Tree, and Neural
    Network classifiers to learn from and classify
    the above data sets.
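
To make the train-then-test loop concrete, here is a minimal Python sketch of one of the three approaches, a Bernoulli-style Naïve Bayes over binary vectors with add-one smoothing. The smoothing choice, the function names, and the toy data are assumptions for illustration; the slides do not describe the project's actual classifier code, and the decision tree and neural network are not shown.

```python
import math


def train_naive_bayes(vectors, labels):
    """Estimate class priors and P(attribute = 1 | class) with add-one smoothing."""
    model = {}
    for c in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        prior = len(rows) / len(labels)
        # Smoothing keeps every probability strictly between 0 and 1.
        probs = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[c] = (prior, probs)
    return model


def classify(model, vector):
    """Return the class with the highest log-posterior for a binary vector."""
    def log_posterior(c):
        prior, probs = model[c]
        score = math.log(prior)
        for x, p in zip(vector, probs):
            score += math.log(p if x else 1 - p)
        return score
    return max(model, key=log_posterior)


# Toy data (made up); attributes stand for ["free", "money", "meeting"].
train_x = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
train_y = ["spam", "spam", "legitimate", "legitimate"]
model = train_naive_bayes(train_x, train_y)

test_x = [[1, 1, 0], [0, 0, 1]]
test_y = ["spam", "legitimate"]
predictions = [classify(model, v) for v in test_x]
accuracy = sum(p == y for p, y in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

Keeping the training and testing vectors fixed and swapping only the classifier is what makes the comparison between the three approaches fair.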

10
Performance Evaluation
  • Compare the percentage of emails that are
    correctly (or incorrectly) classified.
  • Classifying a spam email as legitimate is more
    tolerable than classifying a legitimate email as
    spam, so we assign different weights to the 2
    kinds of misclassification.
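
One possible way to express such a weighting is the Python sketch below. The weight values (5 versus 1) and the normalization by the worst possible cost are hypothetical choices for this example; the slides do not state which weights or which formula the project used.

```python
def weighted_error(predictions, labels,
                   legit_as_spam_weight=5.0, spam_as_legit_weight=1.0):
    """Weighted misclassification cost, normalized by the worst possible cost."""
    cost = 0.0
    worst = 0.0
    for predicted, actual in zip(predictions, labels):
        # Errors on legitimate email are charged the heavier (assumed) weight.
        weight = (legit_as_spam_weight if actual == "legitimate"
                  else spam_as_legit_weight)
        worst += weight
        if predicted != actual:
            cost += weight
    return cost / worst


labels = ["spam", "spam", "legitimate", "legitimate"]
filter_a = ["legitimate", "spam", "legitimate", "legitimate"]  # lets one spam through
filter_b = ["spam", "spam", "spam", "legitimate"]              # blocks one legitimate email

error_a = weighted_error(filter_a, labels)
error_b = weighted_error(filter_b, labels)
print(error_a, error_b)
```

Both toy filters classify 3 of 4 emails correctly, yet the one that blocks a legitimate email receives the larger weighted error, which is the asymmetry the slide describes.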

11
Required Java Packages
  • activation.jar
  • mail.jar
  • Both are available from http://www.java.sun.com

12
References
  • Learning to Filter Spam Email: A Comparison of a
    Naïve Bayesian and a Memory-Based Approach. Ion
    Androutsopoulos, Georgios Paliouras, Vangelis
    Karkaletsis, et al., 2000
  • An Evaluation of Naïve Bayesian Anti-Spam
    Filtering. Ion Androutsopoulos, John Koutsias,
    et al., 2001
  • Automatic Induction of Rules for Email
    Classification. Elisabeth Crawford and Judy Kay,
    2001