A Suffix Tree Approach to Text Classification Applied to Email Filtering

1
A Suffix Tree Approach to Text Classification
Applied to Email Filtering
  • Rajesh Pampapathi, Boris Mirkin, Mark Levene

School of Computer Science and Information
Systems, Birkbeck College, University of London
2
Introduction: Outline
  • Motivation: Examples of Spam
  • Suffix Tree construction
  • Document scoring and classification
  • Experiments and results
  • Conclusion

3
1. Standard spam mail
  • Buy cheap medications online, no prescription
    needed.
  • We have Viagra, Pherentermine, Levitra, Soma,
    Ambien, Tramadol and many more products.
  • No embarrasing trips to the doctor, get it
    delivered directly to your door.
  • Experienced reliable service.
  • Most trusted name brands.
  • For your solution click here:
    http://www.webrx-doctor.com/?rid=1000

4
5. Embedded message (plus word salad)
  • zygotes zoogenous zoometric zygosphene zygotactic
    zygoid zucchettos zymolysis zoopathy
    zygophyllaceous zoophytologist zygomaticoauricular
    zoogeologist zymoid zoophytish zoospores
    zygomaticotemporal zoogonous zygotenes zoogony
    zymosis zuza zoomorphs zythum zoonitic zyzzyva
    zoophobes zygotactic zoogenous zombies zoogrpahy
    zoneless zoonic zoom zoosporic zoolatrous
    zoophilous zymotically zymosterol
  • FreeHYSHKRODMonthQGYIHOCSupply.IHJBUMDSTIPLIBJTJUB
    IYYXFN
  • GetJIIXOLDViagraPWXJXFDUUTabletsNXZXVRCBX
    <http://healthygrow.biz/index.php?id=2>
  • zonally zooidal zoospermia zoning zoonosology
    zooplankton zoochemical zoogloeal zoological
    zoologist zooid zoosphere zoochemical
  • Safezoonal andNGASXHBPnatural
  • TestedQLOLNYQandEAVMGFCapproved
  • zonelike zoophytes zoroastrians zonular zoogloeic
    zoris zygophore zoograft zoophiles zonulas
    zygotic zymograms zygotene zootomical zymes
    zoodendrium zygomata zoometries zoographist
    zygophoric zoosporangium zygotes zumatic
    zygomaticus zorillas zoocurrent zooxanthella
    zyzzyvas zoophobia zygodactylism zygotenes
    zoopathological noZFYFEPBmas
    <http://healthygrow.biz/remove.php>

5
4. Word salads
  • Buy meds online and get it shipped to your door
    Find out more here
  • <http://www.gowebrx.com/?rid=1001>
  • a publications website accepted definition. known
    are can Commons the be definition. Commons UK
    great public principal work Pre-Budget but an can
    Majesty's many contains statements statements
    titles (eg includes have website. health, these
    Committee Select undertaken described may
    publications

6
Creating a Suffix Tree
MEET
FEET
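The tree diagram for this slide did not survive in the transcript. As a rough illustration only, the sketch below inserts every suffix of MEET and FEET into an uncompressed character trie with a frequency count on each node. The function name `build_suffix_trie` and the nested-dict node layout are assumptions made for this example, not the authors' implementation.

```python
def build_suffix_trie(strings):
    """Insert every suffix of every string into a nested-dict trie.

    Each node is {"count": int, "children": {char: node}}; count records
    how many inserted suffixes pass through that node.
    """
    root = {"count": 0, "children": {}}
    for s in strings:
        for i in range(len(s)):          # one insertion per suffix s[i:]
            node = root
            for ch in s[i:]:
                node = node["children"].setdefault(
                    ch, {"count": 0, "children": {}}
                )
                node["count"] += 1
    return root


trie = build_suffix_trie(["MEET", "FEET"])
print(sorted(trie["children"]))        # ['E', 'F', 'M', 'T']
print(trie["children"]["E"]["count"])  # 4 (EET and ET, once per word)
```

A trie built this way grows quadratically with string length, so a practical profile would presumably cap the depth or compress single-child chains.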
7
Levels of Information
  • Characters: the alphabet (and character
    frequencies) of a class.
  • Matches: between query strings and a class.
  • s = nviaXgraUgtTablets
  • t = xviagraTablets
  • Matches(s, t) = {v, ia, gra, Tab, l, ets}
  • But what about overlapping matches?
  • Trees: properties of the class as a whole.
  • size
  • density (complexity)
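One possible way to handle overlapping matches is to take every suffix of the query and find the longest prefix of it that occurs anywhere in the class text. The sketch below does this for the two strings from the slide. The helper names are hypothetical, and the output is not meant to reproduce the slide's own match list.

```python
def build_suffix_trie(text):
    """Nested-dict trie holding every suffix of text."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def longest_match(query, trie):
    """Longest prefix of query that is a substring of the indexed text."""
    node, length = trie, 0
    for ch in query:
        if ch not in node:
            break
        node = node[ch]
        length += 1
    return query[:length]


s = "nviaXgraUgtTablets"
t = "xviagraTablets"
trie = build_suffix_trie(t)

# One match per suffix of s, so overlapping matches are all kept.
matches = [longest_match(s[i:], trie) for i in range(len(s))]
print([m for m in matches if len(m) > 1])
```

Because each suffix is matched separately, "via", "ia" and "a" all appear as distinct (overlapping) matches.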

8
Document Similarity Measure
The score for a document, d, is the sum of the
scores for each suffix.
d(i) is the suffix of d beginning at the ith
letter.
tau is a tree normalisation coefficient.
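The formula on this slide is missing from the transcript. The sketch below only illustrates the verbal description: sum a per-suffix score over all suffixes of d, then normalise by tau. Using the match length as the per-suffix score, and dividing by tau, are both assumptions made for the example.

```python
def build_suffix_trie(texts):
    """Nested-dict trie holding every suffix of every training text."""
    root = {}
    for text in texts:
        for i in range(len(text)):
            node = root
            for ch in text[i:]:
                node = node.setdefault(ch, {})
    return root


def match_length(suffix, trie):
    """Placeholder suffix score: number of leading characters found in the trie."""
    node, length = trie, 0
    for ch in suffix:
        if ch not in node:
            break
        node = node[ch]
        length += 1
    return length


def document_score(d, trie, tau=1.0):
    """Sum of suffix scores over d(0), d(1), ..., normalised by tau (assumed form)."""
    return sum(match_length(d[i:], trie) for i in range(len(d))) / tau


class_trie = build_suffix_trie(["MEET", "FEET"])
print(document_score("FEET", class_trie))  # 10.0
print(document_score("MEAT", class_trie))  # 4.0
```

A document is then compared against each class profile in turn; how the two class scores are combined is the job of the decision mechanism.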
9
Substring Similarity Measure
The score for a match, m = m0 m1 m2 ... mn, is score(m).
T is the tree profile of the class.
v(m|T) is a normalisation coefficient based on the
properties of T.
p(mt) is the probability of the character mt, at
position t of the match m.
Fp is a significance function.
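The equation itself is not in the transcript. A sketch consistent with the definitions above is to walk the match down the tree, estimate a probability for each character, pass it through the significance function, sum the results and multiply by the normalisation coefficient. Estimating p as a node's count divided by the total count of its siblings is an assumption of this example, as are the names `score_match`, `significance` and `norm`.

```python
import math


def build_suffix_trie(texts):
    """Suffix trie with a frequency count on each node."""
    root = {"count": 0, "children": {}}
    for text in texts:
        for i in range(len(text)):
            node = root
            for ch in text[i:]:
                node = node["children"].setdefault(
                    ch, {"count": 0, "children": {}}
                )
                node["count"] += 1
    return root


def score_match(query, trie, significance=lambda p: 1.0, norm=1.0):
    """norm * sum of significance(p) over the matched characters of query."""
    node, total = trie, 0.0
    for ch in query:
        children = node["children"]
        if ch not in children:
            break  # the match ends at the first character not in the tree
        # Assumed estimate: frequency of this child relative to its siblings.
        p = children[ch]["count"] / sum(c["count"] for c in children.values())
        total += significance(p)
        node = children[ch]
    return norm * total


profile = build_suffix_trie(["MEET", "FEET"])
constant_score = score_match("EEL", profile)                      # 2.0
root_score = score_match("EEL", profile, significance=math.sqrt)  # about 1.414
print(constant_score, root_score)
```

With the constant significance function the score is simply the match length; other choices weight frequent characters more heavily.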
10
Decision Mechanism
11
Specifications of Fp (character level)
Note: Logit and Sigmoid need to be adjusted to
fit in the range [0, 1].
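The table of functions on this slide is not in the transcript. The sketch below shows what significance functions of this kind might look like, using the names that appear elsewhere in the slides (constant, root, logit, sigmoid). The specific rescaling used to bring logit and sigmoid into [0, 1] is an assumption of this example, not the authors' adjustment.

```python
import math


def constant(p):
    return 1.0


def root(p):
    return math.sqrt(p)


def sigmoid_adjusted(p):
    """Sigmoid rescaled linearly so that p = 0 maps to 0 and p = 1 maps to 1.

    The plain sigmoid only spans about [0.5, 0.731] for p in [0, 1].
    """
    lo = 1.0 / (1.0 + math.exp(0.0))
    hi = 1.0 / (1.0 + math.exp(-1.0))
    return (1.0 / (1.0 + math.exp(-p)) - lo) / (hi - lo)


def logit_adjusted(p, eps=1e-6):
    """Logit with p clipped to [eps, 1 - eps], then rescaled to [0, 1].

    The plain logit is unbounded at p = 0 and p = 1, hence the clipping.
    """
    p = min(max(p, eps), 1.0 - eps)
    limit = math.log((1.0 - eps) / eps)
    return (math.log(p / (1.0 - p)) + limit) / (2.0 * limit)


for f in (constant, root, sigmoid_adjusted, logit_adjusted):
    print(f.__name__, [round(f(p), 3) for p in (0.0, 0.25, 0.5, 1.0)])
```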
12
Significance function
13
Threshold Variation: Significance functions
14
Threshold Variation: Significance functions
15
Match normalisation
Permutation normalisation: relative to the set of all
strings formed by permutations of m.
Length normalisation: relative to the set of all
strings of length equal to the length of m.
16
Match normalisation
MUN: match unnormalised
MPN: match permutation normalised
MLN: match length normalised
17
Threshold Variation: match normalisation
Constant significance function: unnormalised
Constant significance function: match normalised
18
Specifications of tau
19
Tree normalisation
20
Androutsopoulos et al. (2000) Ling-Spam Corpus
21
Ling-BKS Corpus
SpamAssassin Corpus
22
Conclusions
  • Good overall classifier: an improvement on naïve
    Bayes, but there's still room for improvement
  • Can one method ever maintain 100% accuracy?
  • Extending the classifier
  • Applications to other domains, e.g. web page
    classification

23
Future Work - ODP
24
Computational Performance
25
Experimental Data Sets
  • Ling-Spam (LS): spam (481) collected by
    Androutsopoulos et al.; ham (2412) from an online
    linguists' bulletin board
  • SpamAssassin, Easy (SAe) and Hard (SAh): spam
    (1876) and ham (4176) examples donated
  • BBK: spam (652) collected by Birkbeck

26
Androutsopoulos et al. (2000) Ling-Spam Corpus
27
Androutsopoulos et al. (2000) Ling-Spam Corpus
28
SpamAssassin Corpus
32
Vector Space Model
"What then?" sang Plato's ghost, "What then?"
W. B. Yeats
Word Probability
P(w = what) = 50/1000 = 0.05
33
Creating Profiles
Mark
34
Profiles
Mark Levene
data
databases
information
search
engines
Mike Hu
data
intelligence
criminal
computational
police
35
Classification
SBM
SML
SMH
36
Naïve Bayes (similarity measure)
For a document d = d1 d2 d3 ... dm and a set of
classes c = {c1, c2, ..., cJ}
(1)
Where
(2)
(3)
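Equations (1) to (3) are not in the transcript. For orientation only, the sketch below is a textbook multinomial naïve Bayes scorer over word tokens: log prior plus the sum of log word likelihoods. Add-one smoothing, skipping words outside the training vocabulary, and the toy training data are all assumptions of this example and may differ from the formulation on the slide.

```python
import math
from collections import Counter


def train(docs_by_class):
    """docs_by_class maps a class name to a list of tokenised documents."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        model[c] = {
            "log_prior": math.log(len(docs) / total_docs),
            "counts": counts,
            "total": sum(counts.values()),
        }
    return model, vocab


def classify(doc, model, vocab):
    """Return the class with the highest log posterior (up to a constant)."""
    best_class, best_score = None, -math.inf
    for c, m in model.items():
        score = m["log_prior"]
        for w in doc:
            if w in vocab:  # words never seen in training are skipped
                score += math.log((m["counts"][w] + 1) / (m["total"] + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy data invented for illustration.
model, vocab = train({
    "spam": [["buy", "cheap", "meds"], ["cheap", "viagra"]],
    "ham": [["linguistics", "conference"], ["call", "for", "papers"]],
})
print(classify(["cheap", "meds", "online"], model, vocab))  # spam
print(classify(["conference", "papers"], model, vocab))     # ham
```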
37
Criticisms
  • Pre-processing: stop-word removal, word
    stemming/lemmatisation, punctuation and
    formatting
  • Smallest unit of consideration is a word.
  • Classes (and documents) are bags of words, i.e.
    each word is independent of all others.

38
Word Dependencies
Boris Mirkin
data
intelligence
clustering
computational
means
Mike Hu
data
intelligence
criminal
computational
means
39
Word Inflections
Intelligent
Intelligence
Intelligentsia
Intelligible
40
Success measures
  • Recall is the proportion of correctly classified
    examples of a class. If SR is spam recall, then
    (1-SR) gives the proportion of false negatives.
  • Precision is the proportion assigned to a class
    which are true members of that class. It is a
    measure of the number of true positives. If SP
    is spam precision, then (1 - SP) would give the
    proportion of false positives.

41
Success measures
  • True Positive Rate (TPR) is the proportion of
    correctly classified examples of the positive
    class. Spam is typically taken as the positive
    class, so TPR is then the number of spam
    classified as spam over the total number of spam.
  • False Positive Rate (FPR) is the proportion of
    the negative class erroneously assigned to the
    positive class.
  • Ham is typically taken as the negative class, so
    FPR is then the number of ham classified as spam
    over the total number of ham.
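
The measures defined on these two slides can be computed from four confusion counts, with spam as the positive class. The function name and the counts in the example are made up for illustration.

```python
def spam_metrics(tp, fp, tn, fn):
    """Success measures with spam as the positive class.

    tp: spam classified as spam    fn: spam classified as ham
    fp: ham classified as spam     tn: ham classified as ham
    """
    return {
        "spam_recall": tp / (tp + fn),     # SR, identical to TPR
        "spam_precision": tp / (tp + fp),  # SP
        "fpr": fp / (fp + tn),             # ham classified as spam / all ham
    }


# Invented counts: 100 spam and 200 ham messages.
m = spam_metrics(tp=90, fp=5, tn=195, fn=10)
print(m["spam_recall"])     # 0.9
print(m["fpr"])             # 0.025
print(m["spam_precision"])  # 90/95, roughly 0.947
```

Note that 1 - SP is the share of messages labelled spam that are really ham, which is a different quantity from FPR, the share of all ham that was labelled spam.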

42
Classifier Structure
Spam
Ham
  • Training Data
  • Profiling Method
  • Profile Representation
  • Similarity/Comparison Measure
  • Decision Mechanism or Classification Criterion
  • Decision

?
Ham
Spam
43
Classification using a suffix tree
  • Method of profiling is construction of the
    tree (no pre-processing, no post-processing)
  • The tree is a profile of the class.
  • Similarity measure?
  • Decision mechanism?

44
Threshold Variation: match normalisation
Constant significance function: unnormalised
Constant significance function: match normalised
SPE: spam precision error; HPE: ham precision error
45
Threshold Variation: Significance functions
Root function, no normalisation
Logit function, no normalisation
SPE: spam precision error; HPE: ham precision error
46
Threshold Variation
Constant significance function (unnormalised)
SPE: spam precision error; HPE: ham precision error