Title: A Unified Model of Spam Filtration

1 A Unified Model of Spam Filtration
- William Yerazunis (1), Shalendra Chhabra (2), Christian Siefkes (3), Fidelis Assis (4), Dimitrios Gunopulos (2)
- (1) Mitsubishi Electric Research Labs, Cambridge, MA
- (2) University of California, Riverside, CA
- (3) GkVI / FU Berlin, Germany
- (4) Embratel, Rio de Janeiro, Brazil
2 Spam is a Nuisance: Spam Statistics
3 The Problem of Spam Filtering
- Given a large set of email readers (who desire to communicate reciprocally with each other, without prearrangement) and another set of email spammers (who wish to communicate with readers who do not wish to communicate with the spammers), the set of filtering actions are the steps readers take to maximize their desired communication and minimize their undesired communication.
4 Motivation
- All current spam filters have very similar performance. Why?
- Can we learn anything about current filters by modeling them?
- Does the model point out any obvious flaws in our current designs?
5 What do you MEAN, "Similar Performance"?
- On Gordon Cormack's "Mr. X" test suite, all filters (heuristic and statistical) get about 99% accuracy.
- There is large variance of reported real-user results for all filters: everything from nearly perfect to unusable, on the same software.
- Conclusion: this problem is evil.
6 Method
- Examine a large number of spam filters
- Think Real Hard (the Feynman method of deduction)
- Come up with a unified model for spam filters
- Examine the model for weaknesses and scalability
- Consider remedies for the weaknesses
7 The Model
- Models information flow
- Considers stateful vs. stateless computations, and computational complexity
- Simplification: training is usually done offline, so we will assume a stationary (no learning during filtering) filter configuration for the computational model.
8 Empirical Stages in the Filtering Pipeline
- Preprocessing (a text-to-text transform)
- Tokenization
- Feature Generation
- Statistics Table Lookup
- Mathematical Combination
- Thresholding
- Everybody does it this way!
- Including SpamAssassin
11 The Obligatory Flowchart
- Note that this is now a pipeline: anyone for a scalable solution?
12 Step 1: Preprocessing, an Arbitrary Text-to-Text Transformation
- Character Set Folding / Case Folding
- Stopword Removal
- MIME Normalization / Base64 Decoding
- HTML De-commenting
- Hypertextus Interruptus
- Heuristic Tagging
  - e.g. FORGED_OUTLOOK_TAGS
- Identifying Lookalike Transformations
  - "@" instead of "a", "$" instead of "S"
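A minimal Python sketch of a few of these transforms (the function name and the particular lookalike glyph map are illustrative, not from the slides):

```python
import re

def preprocess(text):
    """A minimal sketch of Step 1 text-to-text transforms."""
    text = text.lower()                                 # case folding
    text = re.sub(r'<!--.*?-->', '', text, flags=re.S)  # HTML de-commenting
    # lookalike normalization: map common glyph substitutions back
    text = text.translate(str.maketrans({'@': 'a', '$': 's', '0': 'o'}))
    return text
```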
13 Step 2: Tokenization
- Converting incoming text (and heuristic tags) into features
- A two-step process:
  - Use a regular expression (regex) to segment the incoming text into tokens.
  - Then convert this token stream T into features.
- This conversion is not necessarily 1:1.
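The first of the two steps can be sketched as (the whitespace-delimited regex is an assumption; real filters use richer token patterns):

```python
import re

# Assumption: whitespace-delimited tokens.
TOKEN_RE = re.compile(r'\S+')

def tokenize(text):
    """Step 2a: segment incoming text into a token stream."""
    return TOKEN_RE.findall(text)
```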
14 Step 3: Converting Tokens into Features
- A Token is a piece of text that meets the requirements of the tokenizing regex.
- A Feature is a mostly-unique identifier that the filter's trained database can convert into a mathematically manipulable value.
15 Converting Tokens into Features, first part: Convert Text to an Arbitrary Integer
- For each unknown text token, convert that text into a (preferably unique) arbitrary integer.
- This conversion can be done by dictionary lookup (i.e. "foo" is the 82,934th word in the dictionary) or by hashing.
- The output of this stage is a stream of unique integer IDs. Call this row vector T.
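The hashing route can be sketched as follows (crc32 is an illustrative choice; a wider hash would make the IDs closer to unique):

```python
import zlib

def token_to_id(token):
    # Hashing gives a "mostly-unique" integer per token
    # without needing a dictionary.
    return zlib.crc32(token.encode('utf-8'))

def to_id_stream(tokens):
    """Produce the row vector T of integer token IDs."""
    return [token_to_id(t) for t in tokens]
```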
16 Second part: Matrix-based Feature Generation
- Use a rectangular feature-generation profile matrix P; each column of P is a vector containing small primes (or zeros).
- The dot product of each column of P against a segment of the token stream T is a unique encoding of a segment of the feature stream at that position.
- The number of features out per element of T is equal to the column rank of P.
- A zero element in a P column indicates that token is disregarded in that particular feature.
17 Matrix-based Feature Generation, Example: Unigram + Bigram Features
- A P matrix for unigram-plus-bigram features is
    1 2
    0 3
- If the incoming token stream T is 10, 20, 30, ... then successive dot products of T against the columns of P yield:
  - 10×1 + 20×0 = 10  <--- first position, first column
  - 10×2 + 20×3 = 80  <--- first position, second column
  - 20×1 + 30×0 = 20  <--- second position, first column
  - 20×2 + 30×3 = 130 <--- second position, second column
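The sliding dot product above can be sketched in a few lines (the function name and the column-list representation of P are assumptions for illustration):

```python
def generate_features(T, P):
    """Slide the columns of profile matrix P along token-ID stream T.

    P is given as a list of columns; each column is a vector of small
    primes (or zeros) whose length is the sliding-window size.
    """
    window = len(P[0])
    features = []
    for i in range(len(T) - window + 1):
        segment = T[i:i + window]
        for column in P:
            # dot product of this P column with the current window of T
            features.append(sum(t * p for t, p in zip(segment, column)))
    return features

# The slide's example: unigram + bigram profile, T = 10, 20, 30
# generate_features([10, 20, 30], [[1, 0], [2, 3]]) -> [10, 80, 20, 130]
```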
18 Matrix-based Feature Generation: Nice Side Effects
- Zeros in a P matrix indicate "disregard this location in the token stream."
- Identical primes in a column of a P matrix allow order-invariant feature outputs, e.g.
    1 2
    0 2
- generates the unigram features, AND the set of bigrams in an order-invariant style (i.e. "foo bar" == "bar foo").
19 Matrix-based Feature Generation: Nice Side Effects II
- Identical primes in different columns of a P matrix allow order-sensitive, distance-invariant feature outputs, e.g.
    1 2 2
    0 3 0
    0 0 3
- generates the unigram features and the set of bigrams in an order-sensitive, distance-invariant style
- (i.e. "foo bar" == "foo <skip> bar").
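Both side effects are easy to verify numerically with a single-column dot product (the helper name is illustrative):

```python
def column_feature(segment, column):
    # Dot product of one P column against a token-ID window.
    return sum(t * p for t, p in zip(segment, column))

# Equal primes within a column -> order-invariant pairs:
#   "foo bar" and "bar foo" map to the same feature value.
same1 = column_feature([10, 20], [2, 2])
same2 = column_feature([20, 10], [2, 2])

# Equal primes across columns -> order-sensitive, distance-invariant pairs:
#   "foo bar <x>" via column (2, 3, 0) equals "foo <x> bar" via (2, 0, 3).
skip1 = column_feature([10, 20, 99], [2, 3, 0])
skip2 = column_feature([10, 99, 20], [2, 0, 3])
```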
20 Step 4: Feature Weighting by Lookup
- Feature lookup tables are pre-built by the learning side of the filter, e.g.:
- Strict Probability:
    LocalWeight = TimesSeenInClass / TimesSeenOverAllClasses
- Buffered Probability:
    LocalWeight = TimesSeenInClass / (TimesSeenOverAllClasses + constant)
- Document Count Certainty:
    Weight = (TimesSeenInThisClass × DocumentsTrainedIntoThisClass) / ((TimesSeenOverAllClasses + constant) × TotalDocumentsActuallyTrained)
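These weight generators can be sketched as (function and parameter names are illustrative; the grouping of the document-count formula is reconstructed from the slide's layout, which is ambiguous):

```python
def strict_probability(seen_in_class, seen_all_classes):
    return seen_in_class / seen_all_classes

def buffered_probability(seen_in_class, seen_all_classes, constant=1.0):
    # The added constant damps the weight of rarely-seen features.
    return seen_in_class / (seen_all_classes + constant)

def doc_count_certainty(seen_in_class, docs_in_class,
                        seen_all_classes, total_docs, constant=1.0):
    # Product-form grouping assumed from the slide layout.
    return (seen_in_class * docs_in_class) / \
           ((seen_all_classes + constant) * total_docs)
```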
21 In This Model, Weight Generators Do Not Have To Be Statistical
- SpamAssassin: weights generated by a genetic optimization algorithm
- SVM: weights generated by linear algebra
- Winnow algorithm:
  - uses additive weights stored in the database
  - each feature's weight starts at 1.0000
  - weights are multiplied by a promotion/demotion factor if the results are below/above a predefined threshold during learning
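A sketch of one Winnow training step as the slide describes it (additive scoring over weights, multiplicative updates; the promote/demote factors here are illustrative values, not from the slides):

```python
def winnow_train_step(weights, features, threshold,
                      promote=1.23, demote=0.83, in_class=True):
    """One Winnow training step: score additively, update multiplicatively."""
    # Unseen features start at weight 1.0, per the slide.
    score = sum(weights.get(f, 1.0) for f in features)
    if in_class and score <= threshold:          # missed an in-class doc: promote
        for f in features:
            weights[f] = weights.get(f, 1.0) * promote
    elif not in_class and score >= threshold:    # flagged an out-of-class doc: demote
        for f in features:
            weights[f] = weights.get(f, 1.0) * demote
    return weights
```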
22 Unique Features --> Nonuniform Weights
- Because all features are unique, we can have nonuniform weighting schemes.
- Nonuniform weights can be compiled into the weighting lookup tables.
- Or a row vector of weights can be passed from the feature-generation column vectors.
23 Step 5: Weight Combination (the 1st Stateful Step)
- Winnow combining
- Bayesian combining
- Chi-squared combining
- Sorted-weight approach: only use the most extreme N weights found in the document
- Uniqueness: use only the 1st occurrence of any feature
- The output is a state vector; it may be of length 1 or longer, and may be nonuniform.
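As one concrete instance, Bayesian combining can be sketched as a simplified two-class posterior (computed in the log domain for numerical stability; the function name is illustrative):

```python
import math

def bayes_combine(local_probs):
    """Naive-Bayes-style combination of per-feature spamminess values."""
    log_spam = sum(math.log(p) for p in local_probs)
    log_ham = sum(math.log(1.0 - p) for p in local_probs)
    # Normalize via the larger exponent to avoid underflow.
    m = max(log_spam, log_ham)
    es, eh = math.exp(log_spam - m), math.exp(log_ham - m)
    return es / (es + eh)   # an output "state vector" of length 1
```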
24 Final Threshold
- Comparison to either a fixed threshold, or
- Comparison of one part of the output state to another part of the output state (useful when the underlying mathematical model includes a null hypothesis).
25 Emulation of Other Filtering Methods in the Unified Filtration Model
- Emulating whitelists and blacklists:
  - Voting-style whitelists
  - Prioritized-rule blacklists
- --> Compile entries into lookup tables with superincreasing weights for each whitelisted or blacklisted term.
26 Emulation of Other Filtering Methods in the Unified Filtration Model
- Emulating heuristic filters:
  - Use text-to-text translation to insert a text tag string for each heuristic satisfied.
  - Compile entries in the lookup table corresponding to the positive or negative weight of each heuristic text tag string; ignore all other text.
  - Sum the outputs and threshold.
- This is the SpamAssassin model.
27 Emulation of Other Filtering Methods in the Unified Filtration Model
- Emulating Bayesian filters:
  - Text-to-text translation as desired
  - Use a unigram (SpamBayes) or digram (DSPAM) P matrix to generate the features
  - Preload the lookup tables with local probabilities
  - Use Bayes' Rule or chi-squared to combine probabilities, and threshold at some value
- This is just about every statistical filter...
28 Conclusions
- Everybody's filter is more or less the same!
- But filtering on only the text limits our information horizon.
- We need to broaden the information input horizon...
- ...but from where? How?
29 Conclusions
- Better information input; examples:
  - CAMRAM: use outgoing email as an automatic whitelist.
  - Smart Squid: look at your web surfing to infer what might be legitimate email (say, receipts from online merchants).
  - Honeypots: sources of new spam and of newly compromised zombie spambots (feeding realtime blacklists, inoculation, and peer-to-peer filter systems). Large ISPs have a big statistical advantage here over single users.
30 Thank You!