Title: A Suffix Tree Approach to Text Classification Applied to Email Filtering
1 A Suffix Tree Approach to Text Classification Applied to Email Filtering
- Rajesh Pampapathi, Boris Mirkin, Mark Levene
School of Computer Science and Information Systems, Birkbeck College, University of London
2 Introduction: Outline
- Motivation: Examples of Spam
- Suffix Tree construction
- Document scoring and classification
- Experiments and results
- Conclusion
3 1. Standard spam mail
- Buy cheap medications online, no prescription needed.
- We have Viagra, Pherentermine, Levitra, Soma, Ambien, Tramadol and many more products.
- No embarrasing trips to the doctor, get it delivered directly to your door.
- Experienced reliable service.
- Most trusted name brands.
- For your solution click here: http://www.webrx-doctor.com/?rid=1000
4 5. Embedded message (plus word salad)
- zygotes zoogenous zoometric zygosphene zygotactic zygoid zucchettos zymolysis zoopathy zygophyllaceous zoophytologist zygomaticoauricular zoogeologist zymoid zoophytish zoospores zygomaticotemporal zoogonous zygotenes zoogony zymosis zuza zoomorphs zythum zoonitic zyzzyva zoophobes zygotactic zoogenous zombies zoogrpahy zoneless zoonic zoom zoosporic zoolatrous zoophilous zymotically zymosterol
- FreeHYSHKRODMonthQGYIHOCSupply.IHJBUMDSTIPLIBJTJUBIYYXFN
- GetJIIXOLDViagraPWXJXFDUUTabletsNXZXVRCBX <http://healthygrow.biz/index.php?id=2>
- zonally zooidal zoospermia zoning zoonosology zooplankton zoochemical zoogloeal zoological zoologist zooid zoosphere zoochemical
- Safezoonal andNGASXHBPnatural
- TestedQLOLNYQandEAVMGFCapproved
- zonelike zoophytes zoroastrians zonular zoogloeic zoris zygophore zoograft zoophiles zonulas zygotic zymograms zygotene zootomical zymes zoodendrium zygomata zoometries zoographist zygophoric zoosporangium zygotes zumatic zygomaticus zorillas zoocurrent zooxanthella zyzzyvas zoophobia zygodactylism zygotenes zoopathological noZFYFEPBmas <http://healthygrow.biz/remove.php>
5 4. Word salads
- Buy meds online and get it shipped to your door Find out more here
- <http://www.gowebrx.com/?rid=1001>
- a publications website accepted definition. known are can Commons the be definition. Commons UK great public principal work Pre-Budget but an can Majesty's many contains statements statements titles (eg includes have website. health, these Committee Select undertaken described may publications
6 Creating a Suffix Tree
MEET
FEET
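The construction for the two example strings can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a simple character-level suffix trie in which every suffix of every string is inserted and each node counts how often it is passed through. The names `Node`, `build_tree` and `frequency` are hypothetical.

```python
class Node:
    """One node of a character-level suffix trie (illustrative only)."""

    def __init__(self):
        self.children = {}  # character -> Node
        self.freq = 0       # how many inserted suffixes pass through this node


def build_tree(strings):
    """Insert every suffix of every string, counting node visits."""
    root = Node()
    for s in strings:
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    return root


def frequency(root, substring):
    """Return the stored frequency of `substring`, or 0 if it is absent."""
    node = root
    for ch in substring:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.freq


tree = build_tree(["MEET", "FEET"])
print(sorted(tree.children))   # ['E', 'F', 'M', 'T']
print(frequency(tree, "EET"))  # 2: the suffix EET occurs in both strings
print(frequency(tree, "E"))    # 4: suffixes EET and ET from each string
```

The shared suffixes EET, ET and T are stored once, with their counts accumulated across both strings.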
7 Levels of Information
- Characters: the alphabet (and their frequencies) of a class.
- Matches between query strings and a class.
- s = nviaXgraUgtTablets
- t = xviagraTablets
- Matches(s, t) = v, ia, gra, Tab, l, ets
- But what about overlapping matches?
- Trees: properties of the class as a whole.
- size
- density (complexity)
8 Document Similarity Measure
The score for a document, d, is the sum of the scores for each of its suffixes.
d(i) is the suffix of d beginning at the ith letter; tau is a tree normalisation coefficient.
9 Substring Similarity Measure
The score for a match, m = m0 m1 m2 ... mn, is score(m).
T is the tree profile of the class. v(m|T) is a normalisation coefficient based on the properties of T. p(m_t) is the probability of the character, m_t, of the match m. Fp is a significance function.
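The two scoring levels described above (document score as a sum over suffixes, match score as a sum over characters) can be sketched as below. This is a hedged illustration, not the authors' implementation, and it rests on several assumptions that the slides do not state: the probability of a character is estimated as its node frequency divided by the total frequency of its sibling nodes, the match normalisation coefficient is treated as a constant `nu`, and the tree normalisation coefficient `tau` is assumed to divide the sum. All function names are hypothetical.

```python
def build_tree(strings):
    """Character-level suffix trie; each node is {'freq': int, 'children': dict}."""
    root = {"freq": 0, "children": {}}
    for s in strings:
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["children"].setdefault(ch, {"freq": 0, "children": {}})
                node["freq"] += 1
    return root


def score_match(root, suffix, phi=lambda p: 1.0, nu=1.0):
    """Score the longest prefix of `suffix` that matches a path in the tree."""
    node, total = root, 0.0
    for ch in suffix:
        child = node["children"].get(ch)
        if child is None:
            break  # the match ends at the first character not in the tree
        # Assumed estimate: frequency of this node relative to its siblings.
        siblings = sum(c["freq"] for c in node["children"].values())
        total += phi(child["freq"] / siblings)
        node = child
    return nu * total


def score_document(root, d, phi=lambda p: 1.0, nu=1.0, tau=1.0):
    """Sum the match scores of every suffix d(i), then apply tau (assumed divisor)."""
    return sum(score_match(root, d[i:], phi, nu) for i in range(len(d))) / tau


tree = build_tree(["MEET", "FEET"])
constant_score = score_document(tree, "FEE")                   # phi = 1 per matched character
linear_score = score_document(tree, "FEE", phi=lambda p: p)    # phi(p) = p
```

With the constant significance function the score simply counts matched characters over all suffixes; swapping in another `phi` weights each character by its estimated probability.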
10 Decision Mechanism
11 Specifications of Fp (character level)
Note: Logit and Sigmoid need to be adjusted to fit in the range [0,1].
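A rough sketch of what such significance functions might look like follows. The slides name the constant, root, logit and sigmoid functions but their formulas are not legible in this transcript, so the definitions below are the usual textbook forms, and the rescaling used to bring logit and sigmoid into [0,1] is purely an assumption made for illustration (clipping with a hypothetical `eps`, then linear rescaling).

```python
import math


def constant(p):
    """Every matched character contributes equally."""
    return 1.0


def root_fn(p):
    """Square root of the character probability."""
    return math.sqrt(p)


def sigmoid_adjusted(p):
    """Standard sigmoid, linearly rescaled so that 0 -> 0 and 1 -> 1 (assumed)."""
    lo = 0.5                           # sigmoid(0)
    hi = 1.0 / (1.0 + math.exp(-1.0))  # sigmoid(1)
    return (1.0 / (1.0 + math.exp(-p)) - lo) / (hi - lo)


def logit_adjusted(p, eps=1e-6):
    """Logit of p, clipped to [eps, 1 - eps] and rescaled to [0, 1] (assumed)."""
    p = min(max(p, eps), 1.0 - eps)
    bound = math.log((1.0 - eps) / eps)  # largest magnitude after clipping
    return (math.log(p / (1.0 - p)) + bound) / (2.0 * bound)


values = [f(0.25) for f in (constant, root_fn, sigmoid_adjusted, logit_adjusted)]
```

Any of these can be passed wherever a significance function is expected; other adjustments to the [0,1] range are equally possible.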
12 Significance function
13 Threshold Variation: Significance functions
14 Threshold Variation: Significance functions
15 Match normalisation
Two reference sets are used to normalise a match m: the set of all strings formed by permutations of m, and the set of all strings of length equal to the length of m.
16 Match normalisation
MUN = match unnormalised; MPN = permutation normalised; MLN = length normalised
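The three normalisation options can be illustrated with a small sketch. The formulas themselves are not legible in this transcript, so this assumes that a match's frequency in the class is divided by the total frequency of the strings in the relevant reference set (permutations of m, or strings of the same length as m). For brevity it uses a plain substring frequency table rather than a suffix tree; the function names and the toy strings are hypothetical.

```python
from collections import Counter
from itertools import permutations


def substring_counts(texts):
    """Frequency of every substring across the class's training strings."""
    counts = Counter()
    for text in texts:
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                counts[text[i:j]] += 1
    return counts


def mun(m, counts):
    """Match unnormalised: the coefficient is simply 1."""
    return 1.0


def mpn(m, counts):
    """Permutation normalised: frequency of m over that of all its permutations."""
    perms = {"".join(p) for p in permutations(m)}
    total = sum(counts[p] for p in perms)
    return counts[m] / total if total else 0.0


def mln(m, counts):
    """Length normalised: frequency of m over that of all strings of equal length."""
    total = sum(c for s, c in counts.items() if len(s) == len(m))
    return counts[m] / total if total else 0.0


counts = substring_counts(["MEET", "FEET", "TEEM"])
et_mpn = mpn("ET", counts)  # ET occurs twice, TE once
ee_mln = mln("EE", counts)  # EE is 3 of the 9 two-character substrings
```

Enumerating permutations grows factorially with match length, so a practical version would need a cheaper way to obtain the same totals.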
17 Threshold Variation: match normalisation
Constant significance function, unnormalised
Constant significance function, match normalised
18 Specifications of tau
19 Tree normalisation
20 Androutsopoulos et al. (2000) Ling-Spam Corpus
21 Ling-BKS Corpus
SpamAssassin Corpus
22 Conclusions
- Good overall classifier: an improvement on naïve Bayes, but there's still room for improvement.
- Can one method ever maintain 100% accuracy?
- Extending the classifier
- Applications to other domains: web page classification
23 Future Work - ODP
24 Computational Performance
25 Experimental Data Sets
- Ling-Spam (LS): Spam (481) collected by Androutsopoulos et al. Ham (2412) from online linguists' bulletin board.
- SpamAssassin, Easy (SAe) and Hard (SAh): Spam (1876) and ham (4176) examples donated.
- BBKSpam (652) collected by Birkbeck.
26 Androutsopoulos et al. (2000) Ling-Spam Corpus
27 Androutsopoulos et al. (2000) Ling-Spam Corpus
28 SpamAssassin Corpus
32 Vector Space Model
"What then?" sang Plato's ghost, "What then?"
W. B. Yeats
Word probability: P(w = what) = 50/1000 = 0.05
33 Creating Profiles
Mark
34 Profiles
Mark Levene: data, databases, information, search, engines
Mike Hu: data, intelligence, criminal, computational, police
35 Classification
SBM
SML
SMH
36 Naïve Bayes (similarity measure)
For a document d = d1 d2 d3 ... dm and a set of classes c = c1, c2 ... cJ
(1)
Where
(2)
(3)
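Equations (1) to (3) are not legible in this transcript. As a stand-in, the sketch below shows the standard multinomial naïve Bayes classifier over word tokens with add-one smoothing; it may differ in detail from the formulation on the slide. The training documents are toy data invented for illustration, and all function names are hypothetical.

```python
import math
from collections import Counter


def train(docs_by_class):
    """docs_by_class maps a class name to a list of tokenised documents."""
    vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        model[c] = {
            "prior": len(docs) / total_docs,  # P(c)
            "counts": counts,                 # word counts within class c
            "total": sum(counts.values()),    # total words seen in class c
        }
    return model, vocab


def log_posterior(model, vocab, c, doc):
    """log P(c) + sum of log P(word | c), with add-one smoothing."""
    m = model[c]
    logp = math.log(m["prior"])
    for w in doc:
        logp += math.log((m["counts"][w] + 1) / (m["total"] + len(vocab)))
    return logp


def classify(model, vocab, doc):
    """Pick the class with the highest (unnormalised) log posterior."""
    return max(model, key=lambda c: log_posterior(model, vocab, c, doc))


# Toy training data, invented for illustration only.
model, vocab = train({
    "spam": [["buy", "cheap", "meds"], ["cheap", "meds", "online"]],
    "ham": [["linguistics", "conference", "call"], ["call", "for", "papers"]],
})
label_a = classify(model, vocab, ["cheap", "meds"])
label_b = classify(model, vocab, ["conference", "papers"])
```

Each word contributes independently of its neighbours, which is the bag-of-words assumption the following slide criticises.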
37 Criticisms
- Pre-processing: stop-word removal, word stemming/lemmatisation, punctuation and formatting.
- Smallest unit of consideration is a word.
- Classes (and documents) are bags of words, i.e. each word is independent of all others.
38 Word Dependencies
Boris Mirkin: data, intelligence, clustering, computational, means
Mike Hu: data, intelligence, criminal, computational, means
39 Word Inflections
Intelligent
Intelligence
Intelligentsia
Intelligible
40 Success measures
- Recall is the proportion of the examples of a class that are correctly classified. If SR is spam recall, then (1 - SR) gives the proportion of false negatives.
- Precision is the proportion of the examples assigned to a class which are true members of that class, i.e. the share of true positives among them. If SP is spam precision, then (1 - SP) would give the proportion of false positives.
41 Success measures
- True Positive Rate (TPR) is the proportion of correctly classified examples of the positive class. Spam is typically taken as the positive class, so TPR is then the number of spam classified as spam over the total number of spam.
- False Positive Rate (FPR) is the proportion of the negative class erroneously assigned to the positive class. Ham is typically taken as the negative class, so FPR is then the number of ham classified as spam over the total number of ham.
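The measures defined on the two slides above can be computed directly from the four outcome counts. The sketch below takes spam as the positive class; the function name and the counts in the example are hypothetical and are not results from the experiments.

```python
def success_measures(tp, fp, fn, tn):
    """Compute the measures from outcome counts, with spam as the positive class.

    tp: spam classified as spam    fp: ham classified as spam
    fn: spam classified as ham     tn: ham classified as ham
    """
    return {
        "spam_recall": tp / (tp + fn),     # SR; identical to TPR
        "spam_precision": tp / (tp + fp),  # SP
        "tpr": tp / (tp + fn),             # spam as spam over all spam
        "fpr": fp / (fp + tn),             # ham as spam over all ham
    }


# Hypothetical counts for 100 spam and 100 ham messages.
m = success_measures(tp=90, fp=5, fn=10, tn=95)
```

Note that (1 - SP) is measured against everything labelled spam, while FPR is measured against all ham, so the two generally differ.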
42 Classifier Structure
Spam
Ham
- Training Data
- Profiling Method
- Profile Representation
- Similarity/Comparison Measure
- Decision Mechanism or Classification Criterion
- Decision
?
Ham
Spam
43 Classification using a suffix tree
- Method of profiling is construction of the tree (no pre-processing, no post-processing).
- The tree is a profile of the class.
- Similarity measure?
- Decision mechanism?
44 Threshold Variation: match normalisation
Constant significance function, unnormalised
Constant significance function, match normalised
SPE = spam precision error; HPE = ham precision error
45 Threshold Variation: Significance functions
Root function, no normalisation
Logit function, no normalisation
SPE = spam precision error; HPE = ham precision error
46 Threshold Variation
Constant significance function (unnormalised)
SPE = spam precision error; HPE = ham precision error