1
A Unified Model of Spam Filtration
  • William Yerazunis1, Shalendra Chhabra2, Christian
    Siefkes3, Fidelis Assis4, Dimitrios Gunopulos2
  • 1Mitsubishi Electric Research Labs, Cambridge, MA
  • 2University of California Riverside, CA
  • 3GKVI / FU Berlin, Germany
  • 4Embratel, Rio de Janeiro, Brazil

2
Spam is a Nuisance: Spam Statistics
3
The Problem of Spam Filtering
  • Given a large set of email readers (who desire
    to communicate reciprocally, without
    prearrangement, with each other) and another set
    of email spammers (who wish to communicate with
    readers who do not wish to communicate with
    them), the set of filtering actions is the steps
    readers take to maximize their desired
    communication and minimize their undesired
    communication.

4
Motivation
  • All current spam filters have very similar
    performance. Why?
  • Can we learn anything about current filters by
    modeling them?
  • Does the model point out any obvious flaws in our
    current designs?

5
What do you MEAN, Similar Performance?
  • Gordon Cormack's Mr. X test suite: all filters
    (heuristic, statistical) get about 99% accuracy.
  • Large variance of reported real-user results for
    all filters: everything from nearly perfect to
    unusable on the same software.
  • Conclusion: this problem is evil.

6
Method
  • Examine a large number of spam filters
  • Think Real Hard (the Feynman method of deduction)
  • Come up with a unified model for spam filters
  • Examine the model for weaknesses and scalability
  • Consider remedies for the weaknesses

7
The Model
  • Models information flow
  • Considers stateful vs. stateless computations and
    computational complexity
  • Simplification: training is usually done offline,
    so we assume a stationary filter configuration
    (no learning during filtering) for the
    computational model.

8
Empirical Stages in Filtering Pipeline
  • Preprocessing (text-to-text transform)
  • Tokenization
  • Feature Generation
  • Statistics Table Lookup
  • Mathematical Combination
  • Thresholding
  • Everybody does it this way!
  • Including SpamAssassin


9
The Obligatory Flowchart
  • Note that this is now a pipeline: anyone for a
    scalable solution?

10
Step 1: Preprocessing (Arbitrary Text-to-Text
Transformation)
  • Character Set Folding / Case Folding
  • Stopword Removal
  • MIME Normalization / Base64 Decoding
  • HTML Decommenting
  • Hypertextus Interruptus
  • Heuristic Tagging (e.g., FORGED_OUTLOOK_TAGS)
  • Identifying Lookalike Transformations (e.g., @
    instead of a, $ instead of S); a sketch follows
    below
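
  • A minimal Python sketch of the case-folding and
    lookalike-normalization idea (the mapping table is
    illustrative, not from the presentation):

    # Fold case, then map common lookalike characters back to
    # the letters they imitate, before tokenization.
    LOOKALIKES = str.maketrans({
        "@": "a",   # sp@m   -> spam
        "$": "s",   # $ale   -> sale
        "1": "i",   # v1agra -> viagra
        "0": "o",   # m0ney  -> money
    })

    def preprocess(text: str) -> str:
        """Step 1: arbitrary text-to-text transformation."""
        return text.lower().translate(LOOKALIKES)

    print(preprocess("Make M0NEY fast with V1AGRA, sp@mmer!"))
    # -> 'make money fast with viagra, spammer!'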

11
Step 2: Tokenization
  • Converting incoming text (and heuristic tags)
    into features
  • Two-step process:
  • First, use a regular expression (regex) to
    segment the incoming text into tokens.
  • Then convert this token stream T into features.
  • This conversion is not necessarily 1:1 (a sketch
    of the first step follows below)
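
  • A minimal tokenizer sketch in Python (the regex is
    illustrative; real filters tune it heavily):

    import re

    # One run of non-whitespace characters per token.
    TOKEN_REGEX = re.compile(r"\S+")

    def tokenize(text: str) -> list[str]:
        """Step 2a: segment text into a token stream."""
        return TOKEN_REGEX.findall(text)

    print(tokenize("Click here for FREE viagra!!!"))
    # -> ['Click', 'here', 'for', 'FREE', 'viagra!!!']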

12
Step 3: Converting Tokens into Features
  • A Token is a piece of text that meets the
    requirements of the tokenizing regex
  • A Feature is a mostly-unique identifier that
    the filter's trained database can convert into a
    mathematically manipulable value

13
Converting Tokens into Features, First Part:
Convert Text to an Arbitrary Integer
  • For each unknown text token, convert that text
    into a (preferably unique) arbitrary integer.
  • This conversion can be done by dictionary lookup
    (e.g., foo is the 82,934th word in the
    dictionary) or by hashing.
  • Output of this stage is a stream of unique
    integer IDs. Call this row vector T.
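
  • A minimal sketch of the hashing approach in Python
    (CRC32 is an illustrative choice; any fast,
    mostly-collision-free hash will do):

    import zlib

    def token_to_id(token: str) -> int:
        """Map a token to a (preferably unique) arbitrary integer."""
        return zlib.crc32(token.encode("utf-8"))

    # The row vector T: a stream of integer token IDs.
    T = [token_to_id(tok) for tok in ["foo", "bar", "baz"]]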

14
Second Part: Matrix-based Feature Generation
  • Use a rectangular feature generation profile
    matrix P; each column of P is a vector containing
    small primes (or zero)
  • The dot product of each column of P against a
    segment of the token stream T is a unique
    encoding of a segment of the feature stream at
    that position.
  • The number of features out per element of T is
    equal to the number of columns of P
  • Zero elements in a P column indicate that the
    corresponding token is disregarded in this
    particular feature

15
Matrix-based Feature Generation: Example of
Unigram + Bigram Features
  • A P matrix for unigram-plus-bigram features is

        1  2
        0  3

  • If the incoming token stream T is 10, 20, 30...
    then successive dot products of segments of T
    against the columns of P yield:
  • 10·1 + 20·0 =  10   <--- first position, first column
  • 10·2 + 20·3 =  80   <--- first position, second column
  • 20·1 + 30·0 =  20   <--- second position, first column
  • 20·2 + 30·3 = 130   <--- second position, second column
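
  • A runnable sketch of this feature generator in
    Python (it reproduces the numbers above; the names
    are illustrative):

    # Slide a window over the token stream T and take the dot
    # product of each window with each column of P.
    P = [[1, 2],
         [0, 3]]   # column 1 -> unigrams, column 2 -> bigrams

    def features(T: list[int], P: list[list[int]]) -> list[int]:
        rows, cols = len(P), len(P[0])
        out = []
        for i in range(len(T) - rows + 1):
            window = T[i:i + rows]
            for c in range(cols):   # one feature per column of P
                out.append(sum(t * P[r][c]
                               for r, t in enumerate(window)))
        return out

    print(features([10, 20, 30], P))  # -> [10, 80, 20, 130]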

16
Matrix-based Feature Generation: Nice Side Effects
  • Zeroes in a P matrix mean disregard this
    location in the token stream
  • Identical primes in a column of a P matrix allow
    order-invariant feature outputs, e.g.

        1  2
        0  2

  • generates the unigram features AND the set of
    bigrams in an order-invariant style (i.e., foo
    bar and bar foo yield the same feature)
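
  • Worked example: with foo = 10 and bar = 20, the
    second column gives 10·2 + 20·2 = 60 for foo bar
    and 20·2 + 10·2 = 60 for bar foo: the same
    feature either way.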

17
Matrix-based Feature Generation: Nice Side Effects
II
  • Identical primes in different columns of a P
    matrix allow order-sensitive, distance-invariant
    feature outputs, e.g.

        1  2  2
        0  3  0
        0  0  3

  • generates the unigram features and the set of
    bigrams in an order-sensitive, distance-invariant
    style
  • (i.e., foo bar and foo <skip> bar yield the same
    feature)
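
  • Worked example: with foo = 10, bar = 20, and any
    skipped token x, the second column gives
    2·10 + 3·20 = 80 for foo bar, and the third
    column gives 2·10 + 0·x + 3·20 = 80 for
    foo x bar: the same feature at either distance.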

18
Step 4: Feature Weighting by Lookup
  • Feature lookup tables are pre-built by the
    learning side of the filter, e.g.:
  • Strict Probability:

        LocalWeight = TimesSeenInClass / TimesSeenOverAllClasses

  • Buffered Probability:

        LocalWeight = TimesSeenInClass / (TimesSeenOverAllClasses + constant)

  • Document Count Certainty:

        Weight = (TimesSeenInThisClass · DocumentsTrainedIntoThisClass) /
                 ((TimesSeenOverAllClasses + constant) · TotalDocumentsActuallyTrained)
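
  • A minimal sketch of these weight generators in
    Python (names follow the slide; the constant is
    illustrative):

    def strict_probability(seen_in_class, seen_over_all):
        # Raw per-class fraction of this feature's occurrences.
        return seen_in_class / seen_over_all

    def buffered_probability(seen_in_class, seen_over_all, constant=10):
        # The constant damps the weights of rarely seen features.
        return seen_in_class / (seen_over_all + constant)

    def doc_count_certainty(seen_in_class, seen_over_all,
                            docs_in_class, total_docs, constant=10):
        # Also scale by how much of the training corpus actually
        # went into this class.
        return (seen_in_class * docs_in_class) / \
               ((seen_over_all + constant) * total_docs)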

19
In this Model, Weight Generators Do Not Have To
Be Statistical
  • SpamAssassin: weights generated by a genetic
    optimization algorithm
  • SVM: weights generated by linear algebra
  • Winnow algorithm:
  • uses additive weights stored in the database
  • Each feature's weight starts at 1.0000
  • Weights are multiplied by a promotion / demotion
    factor if the results are below / above a
    predefined threshold during learning (see the
    sketch below)
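
  • A minimal sketch of the Winnow update in Python
    (the threshold and the promotion / demotion
    factors are illustrative):

    PROMOTE, DEMOTE, THRESHOLD = 1.23, 0.83, 1.0

    def winnow_train(weights, features, is_spam):
        """Promote or demote the weights of the features seen in a
        misclassified document; every weight starts at 1.0."""
        score = sum(weights.get(f, 1.0) for f in features) / len(features)
        if is_spam and score < THRESHOLD:         # too low: promote
            for f in features:
                weights[f] = weights.get(f, 1.0) * PROMOTE
        elif not is_spam and score >= THRESHOLD:  # too high: demote
            for f in features:
                weights[f] = weights.get(f, 1.0) * DEMOTE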

20
Unique Features -> Nonuniform Weights
  • Because all features are unique, we can have
    nonuniform weighting schemes
  • Nonuniform weights can be compiled into the
    weighting lookup tables
  • Or a row vector of weights can be passed in from
    the feature-generation column vectors

21
Step 5: Weight Combination (First Stateful Step)
  • Winnow combining
  • Bayesian combining
  • Chi-squared combining
  • Sorted-weight approach: only use the most
    extreme N weights found in the document
  • Uniqueness: use only the first occurrence of any
    feature
  • Output: a state vector; it may be of length 1 or
    longer, and may be nonuniform. (A sketch of
    Bayesian combining follows below.)
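
  • A minimal sketch of naive Bayesian combining in
    Python (assumes per-feature spam probabilities
    from Step 4 and equal priors):

    import math

    def bayes_combine(probs):
        """Combine per-feature spam probabilities with Bayes' rule,
        assuming independence; work in logs to avoid underflow."""
        log_spam = sum(math.log(p) for p in probs)
        log_ham = sum(math.log(1.0 - p) for p in probs)
        return 1.0 / (1.0 + math.exp(log_ham - log_spam))

    print(bayes_combine([0.9, 0.8, 0.95]))  # -> ~0.9985 (spammy)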

22
Final Threshold
  • Comparison to either a fixed threshold, or
  • Comparison of one part of the output state to
    another part of the output state (useful when the
    underlying mathematical model includes a null
    hypothesis)

23
Emulation of Other Filtering Methods in the
Unified Filtration Model
  • Emulating whitelists and blacklists:
  • Voting-style whitelists
  • Prioritized-rule blacklists
  • -> Compile entries into lookup tables with
    superincreasing weights for each whitelisted or
    blacklisted term (see the sketch below)
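
  • A minimal sketch of superincreasing weights in
    Python (the entries are illustrative): priority i
    gets magnitude 2**i, so each entry outvotes the
    sum of all lower-priority entries
    (1 + 2 + ... + 2**(i-1) < 2**i).

    # Whitelist entries get positive weights, blacklist negative.
    entries = [("from:boss@example.com", +1),   # lowest priority
               ("subject:lottery", -1),
               ("from:spammer.example", -1)]    # highest priority
    table = {term: sign * 2 ** i
             for i, (term, sign) in enumerate(entries)}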

24
Emulation of Other Filtering Methods in the
Unified Filtration Model
  • Emulating heuristic filters:
  • Use text-to-text translation to insert a text tag
    string corresponding to each heuristic satisfied
  • Compile entries in the lookup table corresponding
    to the positive or negative weight of each
    heuristic text tag string; ignore all other text.
  • Sum outputs and threshold
  • This is the SpamAssassin model

25
Emulation of Other Filtering Methods in the
Unified Filtration Model
  • Emulating Bayesian filters:
  • Text-to-text translation as desired
  • Use a unigram (SpamBayes) or bigram (DSPAM) P
    matrix to generate the features
  • Preload the lookup tables with local
    probabilities
  • Use Bayes' Rule or chi-squared to combine
    probabilities, and threshold at some value
  • This is just about every statistical filter...

26
Conclusions
  • Everybody's filter is more or less the same!
  • But filtration on only the text limits our
    information horizon.
  • We need to broaden the information input horizon
  • ... but from where? How?

27
Conclusions
  • Better information input: examples
  • CAMRAM: use outgoing email as an automatic
    whitelist
  • Smart Squid: look at your web surfing to infer
    what might be legitimate email (say, receipts
    from online merchants)
  • Honeypots: sources of new spam and of newly
    compromised zombie spambots, feeding realtime
    blacklists, inoculation, and peer-to-peer filter
    systems. Large ISPs have a big statistical
    advantage here over single users.

28
Thank You!
  • Questions or comments?