Title: A Unified Model of Spam Filtration

1 A Unified Model of Spam Filtration
- William Yerazunis (1), Shalendra Chhabra (2), Christian Siefkes (3), Fidelis Assis (4), Dimitrios Gunopulos (2)
- (1) Mitsubishi Electric Research Labs, Cambridge, MA
- (2) University of California, Riverside, CA
- (3) GkVI / FU Berlin, Germany
- (4) Embratel, Rio de Janeiro, Brazil
2 Spam is a Nuisance: Spam Statistics
3 The Problem of Spam Filtering
- Given a large set of email readers (who desire to communicate reciprocally with each other, without prearrangement) and another set of email spammers (who wish to communicate with readers who do not wish to communicate with the spammers), the set of filtering actions are the steps readers take to maximize their desired communication and minimize their undesired communication.
4 Motivation
- All current spam filters have very similar performance. Why?
- Can we learn anything about current filters by modeling them?
- Does the model point out any obvious flaws in our current designs?
5 What do you MEAN, "Similar Performance"?
- On Gordon Cormack's "Mr. X" test suite, all filters (heuristic and statistical) get about 99% accuracy.
- There is large variance of reported real-user results for all filters: everything from nearly perfect to unusable, on the same software.
- Conclusion: this problem is evil.
6 Method
- Examine a large number of spam filters
- Think Real Hard (the Feynman method of deduction)
- Come up with a unified model for spam filters
- Examine the model for weaknesses and scalability
- Consider remedies for the weaknesses
7 The Model
- Models information flow
- Considers stateful vs. stateless computations, and computational complexity
- Simplification: training is usually done offline, so we will assume a stationary (no learning during filtering) filter configuration for the computational model.
8 Empirical Stages in the Filtering Pipeline
- Preprocessing (a text-to-text transform)
- Tokenization
- Feature Generation
- Statistics Table Lookup
- Mathematical Combination
- Thresholding
- Everybody does it this way!
- Including SpamAssassin
11 The Obligatory Flowchart
- Note that this is now a pipeline: anyone for a scalable solution?
12 Step 1: Preprocessing, an Arbitrary Text-to-Text Transformation
- Character Set Folding / Case Folding
- Stopword Removal
- MIME Normalization / Base64 Decoding
- HTML De-commenting
- Hypertextus Interruptus
- Heuristic Tagging
  - e.g. FORGED_OUTLOOK_TAGS
- Identifying Lookalike Transformations
  - "@" instead of "a", "$" instead of "S"
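A minimal Python sketch of a few of these transforms (the function name and the particular lookalike glyph map are illustrative, not from the slides):

```python
import re

def preprocess(text):
    """A minimal sketch of Step 1 text-to-text transforms."""
    text = text.lower()                                 # case folding
    text = re.sub(r'<!--.*?-->', '', text, flags=re.S)  # HTML de-commenting
    # lookalike normalization: map common glyph substitutions back
    text = text.translate(str.maketrans({'@': 'a', '$': 's', '0': 'o'}))
    return text
```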
13 Step 2: Tokenization
- Converting incoming text (and heuristic tags) into features
- A two-step process:
  - Use a regular expression (regex) to segment the incoming text into tokens.
  - Then convert this token stream T into features.
- This conversion is not necessarily 1:1.
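The first of the two steps can be sketched as (the whitespace-delimited regex is an assumption; real filters use richer token patterns):

```python
import re

# Assumption: whitespace-delimited tokens.
TOKEN_RE = re.compile(r'\S+')

def tokenize(text):
    """Step 2a: segment incoming text into a token stream."""
    return TOKEN_RE.findall(text)
```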
14 Step 3: Converting Tokens into Features
- A Token is a piece of text that meets the requirements of the tokenizing regex.
- A Feature is a mostly-unique identifier that the filter's trained database can convert into a mathematically manipulable value.
15 Converting Tokens into Features, first part: Convert Text to an Arbitrary Integer
- For each unknown text token, convert that text into a (preferably unique) arbitrary integer.
- This conversion can be done by dictionary lookup (i.e. "foo" is the 82,934th word in the dictionary) or by hashing.
- The output of this stage is a stream of unique integer IDs. Call this row vector T.
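The hashing route can be sketched as follows (crc32 is an illustrative choice; a wider hash would make the IDs closer to unique):

```python
import zlib

def token_to_id(token):
    # Hashing gives a "mostly-unique" integer per token
    # without needing a dictionary.
    return zlib.crc32(token.encode('utf-8'))

def to_id_stream(tokens):
    """Produce the row vector T of integer token IDs."""
    return [token_to_id(t) for t in tokens]
```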
16 Second part: Matrix-based Feature Generation
- Use a rectangular feature-generation profile matrix P; each column of P is a vector containing small primes (or zeros).
- The dot product of each column of P against a segment of the token stream T is a unique encoding of a segment of the feature stream at that position.
- The number of features out per element of T is equal to the column rank of P.
- A zero element in a P column indicates that token is disregarded in that particular feature.
17 Matrix-based Feature Generation, Example: Unigram + Bigram Features
- A P matrix for unigram-plus-bigram features is
    1 2
    0 3
- If the incoming token stream T is 10, 20, 30, ... then successive dot products of T against the columns of P yield:
  - 10×1 + 20×0 = 10  <--- first position, first column
  - 10×2 + 20×3 = 80  <--- first position, second column
  - 20×1 + 30×0 = 20  <--- second position, first column
  - 20×2 + 30×3 = 130 <--- second position, second column
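The sliding dot product above can be sketched in a few lines (the function name and the column-list representation of P are assumptions for illustration):

```python
def generate_features(T, P):
    """Slide the columns of profile matrix P along token-ID stream T.

    P is given as a list of columns; each column is a vector of small
    primes (or zeros) whose length is the sliding-window size.
    """
    window = len(P[0])
    features = []
    for i in range(len(T) - window + 1):
        segment = T[i:i + window]
        for column in P:
            # dot product of this P column with the current window of T
            features.append(sum(t * p for t, p in zip(segment, column)))
    return features

# The slide's example: unigram + bigram profile, T = 10, 20, 30
# generate_features([10, 20, 30], [[1, 0], [2, 3]]) -> [10, 80, 20, 130]
```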
18 Matrix-based Feature Generation: Nice Side Effects
- Zeros in a P matrix indicate "disregard this location in the token stream."
- Identical primes in a column of a P matrix allow order-invariant feature outputs, e.g.
    1 2
    0 2
- generates the unigram features, AND the set of bigrams in an order-invariant style (i.e. "foo bar" == "bar foo").
19 Matrix-based Feature Generation: Nice Side Effects II
- Identical primes in different columns of a P matrix allow order-sensitive, distance-invariant feature outputs, e.g.
    1 2 2
    0 3 0
    0 0 3
- generates the unigram features and the set of bigrams in an order-sensitive, distance-invariant style
- (i.e. "foo bar" == "foo <skip> bar").
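Both side effects are easy to verify numerically with a single-column dot product (the helper name is illustrative):

```python
def column_feature(segment, column):
    # Dot product of one P column against a token-ID window.
    return sum(t * p for t, p in zip(segment, column))

# Equal primes within a column -> order-invariant pairs:
#   "foo bar" and "bar foo" map to the same feature value.
same1 = column_feature([10, 20], [2, 2])
same2 = column_feature([20, 10], [2, 2])

# Equal primes across columns -> order-sensitive, distance-invariant pairs:
#   "foo bar <x>" via column (2, 3, 0) equals "foo <x> bar" via (2, 0, 3).
skip1 = column_feature([10, 20, 99], [2, 3, 0])
skip2 = column_feature([10, 99, 20], [2, 0, 3])
```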
20 Step 4: Feature Weighting by Lookup
- Feature lookup tables are pre-built by the learning side of the filter, e.g.:
- Strict Probability:
    LocalWeight = TimesSeenInClass / TimesSeenOverAllClasses
- Buffered Probability:
    LocalWeight = TimesSeenInClass / (TimesSeenOverAllClasses + constant)
- Document Count Certainty:
    Weight = (TimesSeenInThisClass × DocumentsTrainedIntoThisClass) / ((TimesSeenOverAllClasses + constant) × TotalDocumentsActuallyTrained)
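These weight generators can be sketched as (function and parameter names are illustrative; the grouping of the document-count formula is reconstructed from the slide's layout, which is ambiguous):

```python
def strict_probability(seen_in_class, seen_all_classes):
    return seen_in_class / seen_all_classes

def buffered_probability(seen_in_class, seen_all_classes, constant=1.0):
    # The added constant damps the weight of rarely-seen features.
    return seen_in_class / (seen_all_classes + constant)

def doc_count_certainty(seen_in_class, docs_in_class,
                        seen_all_classes, total_docs, constant=1.0):
    # Product-form grouping assumed from the slide layout.
    return (seen_in_class * docs_in_class) / \
           ((seen_all_classes + constant) * total_docs)
```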
21 In This Model, Weight Generators Do Not Have To Be Statistical
- SpamAssassin: weights generated by a genetic optimization algorithm
- SVM: weights generated by linear algebra
- Winnow algorithm:
  - uses additive weights stored in the database
  - each feature's weight starts at 1.0000
  - weights are multiplied by a promotion/demotion factor if the results are below/above a predefined threshold during learning
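A sketch of one Winnow training step as the slide describes it (additive scoring over weights, multiplicative updates; the promote/demote factors here are illustrative values, not from the slides):

```python
def winnow_train_step(weights, features, threshold,
                      promote=1.23, demote=0.83, in_class=True):
    """One Winnow training step: score additively, update multiplicatively."""
    # Unseen features start at weight 1.0, per the slide.
    score = sum(weights.get(f, 1.0) for f in features)
    if in_class and score <= threshold:          # missed an in-class doc: promote
        for f in features:
            weights[f] = weights.get(f, 1.0) * promote
    elif not in_class and score >= threshold:    # flagged an out-of-class doc: demote
        for f in features:
            weights[f] = weights.get(f, 1.0) * demote
    return weights
```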
22 Unique Features --> Nonuniform Weights
- Because all features are unique, we can have nonuniform weighting schemes.
- Nonuniform weights can be compiled into the weighting lookup tables.
- Or a row vector of weights can be passed from the feature-generation column vectors.
23 Step 5: Weight Combination (the 1st Stateful Step)
- Winnow combining
- Bayesian combining
- Chi-squared combining
- Sorted-weight approach: only use the most extreme N weights found in the document
- Uniqueness: use only the 1st occurrence of any feature
- The output is a state vector; it may be of length 1 or longer, and may be nonuniform.
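As one concrete instance, Bayesian combining can be sketched as a simplified two-class posterior (computed in the log domain for numerical stability; the function name is illustrative):

```python
import math

def bayes_combine(local_probs):
    """Naive-Bayes-style combination of per-feature spamminess values."""
    log_spam = sum(math.log(p) for p in local_probs)
    log_ham = sum(math.log(1.0 - p) for p in local_probs)
    # Normalize via the larger exponent to avoid underflow.
    m = max(log_spam, log_ham)
    es, eh = math.exp(log_spam - m), math.exp(log_ham - m)
    return es / (es + eh)   # an output "state vector" of length 1
```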
24 Final Threshold
- Comparison to either a fixed threshold, or
- Comparison of one part of the output state to another part of the output state (useful when the underlying mathematical model includes a null hypothesis).
25 Emulation of Other Filtering Methods in the Unified Filtration Model
- Emulating whitelists and blacklists:
  - Voting-style whitelists
  - Prioritized-rule blacklists
- --> Compile entries into lookup tables with superincreasing weights for each whitelisted or blacklisted term.
26 Emulation of Other Filtering Methods in the Unified Filtration Model
- Emulating heuristic filters:
  - Use text-to-text translation to insert a text tag string for each heuristic satisfied.
  - Compile entries in the lookup table corresponding to the positive or negative weight of each heuristic text tag string; ignore all other text.
  - Sum the outputs and threshold.
- This is the SpamAssassin model.
27 Emulation of Other Filtering Methods in the Unified Filtration Model
- Emulating Bayesian filters:
  - Text-to-text translation as desired
  - Use a unigram (SpamBayes) or digram (DSPAM) P matrix to generate the features
  - Preload the lookup tables with local probabilities
  - Use Bayes' Rule or chi-squared to combine probabilities, and threshold at some value
- This is just about every statistical filter...
28 Conclusions
- Everybody's filter is more or less the same!
- But filtering on only the text limits our information horizon.
- We need to broaden the information input horizon...
- ...but from where? How?
29 Conclusions
- Better information input; examples:
  - CAMRAM: use outgoing email as an automatic whitelist.
  - Smart Squid: look at your web surfing to infer what might be legitimate email (say, receipts from online merchants).
  - Honeypots: sources of new spam and of newly compromised zombie spambots (feeding realtime blacklists, inoculation, and peer-to-peer filter systems). Large ISPs have a big statistical advantage here over single users.
30 Thank You!