A False Positive Safe Neural Network for Spam Detection


1
  • A False Positive Safe Neural Network for Spam
    Detection

Alexandru Catalin Cosoi acosoi_at_bitdefender.com
2
Does this look familiar?
3
Anatrim
4
Oh boy, it's getting worse!!!
5
Oh boy, it's getting worse!!!
6
Bad Bad Spammer!!!
  • Databases:
  • D: random legitimate text
  • D1: different rephrasings of a certain spam phrase
  • D2: different rephrasings of another spam phrase
  • Dn: different rephrasings of another spam phrase
  • Create spam message script:
  • Choose a random phrase from D1
  • Choose random text from D
  • Choose a random phrase from D2
  • Choose random text from D
  • ...
  • Choose a random phrase from Dn
  • Send message.
  • Appeared as a consequence of botnets
  • 40 samples of different subjects
  • 50 samples of different titles
  • 30 samples of different titles (part II)
  • 40 x 50 x 30 = 60,000 different combinations
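
A minimal sketch of the message script above, assuming each database is a plain list of strings. The names make_spam and count_combinations are made up for illustration; this is not the spammers' actual tool.

```python
import math
import random

def make_spam(legit_texts, phrase_dbs, rng):
    """Interleave one random phrase from each of D1..Dn with random text from D."""
    parts = []
    for i, db in enumerate(phrase_dbs):
        parts.append(rng.choice(db))                # random phrase from Di
        if i < len(phrase_dbs) - 1:
            parts.append(rng.choice(legit_texts))   # random filler text from D
    return " ".join(parts)

def count_combinations(*sizes):
    """Number of distinct messages when one item is picked from each pool."""
    return math.prod(sizes)

rng = random.Random(0)
D = ["See you at the meeting tomorrow.", "The weather was lovely today."]
D1 = ["Lose weight fast", "Drop the pounds quickly"]
D2 = ["Order now", "Buy today"]
print(make_spam(D, [D1, D2], rng))
print(count_combinations(40, 50, 30))  # the slide's 40, 50 and 30 samples
```

Even with small pools, every message differs a little from the previous one, which is why exact-match signatures struggle with this kind of wave.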

7
Features
  • Larger time frame KeyWord!!!!
  • Weak features
  • Words like Anatrim, Viagra, Xanax, Stock
  • Simple word combinations like Stock alert,
    Strong buy
  • Simple Header Heuristics (for both spam and ham)
    like valid reply, weird message id, forged
    headers
  • Example
  • Top 500 spammy words from a Bayesian dictionary
  • Some simple header heuristics from SpamAssassin's
    SARE Ninjas
  • Trainer's personal flavour
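
A rough sketch of how weak features like these could be turned into a binary input vector for the network. The function name extract_features, the tiny vocabulary and the single header check are assumptions for illustration, not the features actually used in the talk.

```python
import re

def extract_features(subject, body, headers, vocabulary, header_checks):
    """Return a 0/1 vector: one bit per word or phrase, one bit per header check."""
    text = f"{subject} {body}".lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    vector = []
    for term in vocabulary:
        tokens = term.lower().split()
        if len(tokens) == 1:
            vector.append(1 if tokens[0] in words else 0)          # single word
        else:
            vector.append(1 if " ".join(tokens) in text else 0)    # simple phrase, substring match
    for check in header_checks:
        vector.append(1 if check(headers) else 0)                  # header heuristic
    return vector

# Hypothetical feature set: two words, two phrases, one header heuristic.
vocabulary = ["anatrim", "viagra", "stock alert", "strong buy"]
header_checks = [lambda h: "Message-ID" not in h]   # assumed "weird message id" stand-in

print(extract_features("Stock alert!", "Strong buy today", {"From": "a@b.c"},
                       vocabulary, header_checks))
```

Each bit on its own says very little; the point of the slide is that the network works on the combination.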

8
Why ART?
  • Training occurs by modifying the weights of each
    neuron
  • For large amounts of data, forgetting important
    details might actually happen
  • Solves the stability-plasticity dilemma
  • Based on template detection
  • An unlimited number of templates allows an
    unlimited number of patterns
  • 2 self-organizing neural networks + a mapping
    module = a supervised self-organizing neural network

9
Adaptive Resonance Theory
  • Similar to a cluster algorithm (as many clusters
    as needed)
  • ARTMAP = ARTa + ARTb + MapField

10
ART Vigilance
  • A big value: accepts small errors, many small
    clusters, high precision
  • A small value: accepts high errors, a few big
    clusters, errors can appear
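
The effect of vigilance can be tried out on a toy example. The sketch below is a simplified textbook-style ART1 clustering of binary vectors, not the network from this talk; the function names and the beta tie-breaking constant are assumptions.

```python
def overlap(x, w):
    """Number of positions where both binary vectors are 1."""
    return sum(a & b for a, b in zip(x, w))

def art1_cluster(inputs, vigilance, beta=1.0):
    """Assign each binary vector to a template, creating templates as needed."""
    templates, labels = [], []
    for x in inputs:
        norm_x = sum(x)
        # Try templates in order of the choice function |x AND w| / (beta + |w|).
        order = sorted(range(len(templates)),
                       key=lambda j: -overlap(x, templates[j]) / (beta + sum(templates[j])))
        chosen = None
        for j in order:
            # Vigilance test: fraction of the input covered by the template.
            if norm_x and overlap(x, templates[j]) / norm_x >= vigilance:
                templates[j] = [a & b for a, b in zip(x, templates[j])]  # fast learning
                chosen = j
                break
        if chosen is None:                # no template resonates: add a new one
            templates.append(list(x))
            chosen = len(templates) - 1
        labels.append(chosen)
    return labels, templates

data = [[1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
for rho in (0.9, 0.3):
    labels, templates = art1_cluster(data, vigilance=rho)
    print(rho, labels, len(templates))
```

On this toy data the higher vigilance ends with more, tighter clusters than the lower one, which is the trade-off the slide describes.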

11
ART
12
Algorithm
13
Corpus
  • 2.5 million spam messages (sampled in waves with
    a high degree of variation) and around 1000
    simple low relevance text heuristics (not
    counting the standard header heuristics).
  • The first 1000 words (ordered by discrimination,
    but with a minimum of 10-30 hundred occurrences)
    from a Bayesian dictionary trained on this
    corpus, and also standard header heuristics.
  • Almost 1 million legitimate email messages
  • 75% of the message corpus was used for training
    the neural network, and
  • 25% was used for testing the neural network.
  • 1.5 days to train!!!!
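
A small sketch of the two preparation steps above: picking the most discriminating words that occur often enough, and a 75/25 train/test split. The discrimination score used here (distance of the spam fraction from 0.5, on raw counts) is an assumption; the slides do not say which measure was used.

```python
import random

def top_words(spam_counts, ham_counts, n_words, min_occurrences):
    """Rank words by an assumed discrimination score, skipping rare words."""
    scored = []
    for word in set(spam_counts) | set(ham_counts):
        s = spam_counts.get(word, 0)
        h = ham_counts.get(word, 0)
        total = s + h
        if total < min_occurrences:
            continue                                  # too rare to trust
        scored.append((abs(s / total - 0.5), word))   # 0.5 means no information
    scored.sort(key=lambda t: (-t[0], t[1]))          # best score first, then alphabetical
    return [word for _, word in scored[:n_words]]

def split_corpus(messages, train_fraction=0.75, seed=0):
    """Shuffle and split into training and testing parts."""
    rng = random.Random(seed)
    shuffled = list(messages)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

spam = {"viagra": 90, "the": 50, "rare": 1}
ham = {"viagra": 10, "the": 50, "meeting": 80}
print(top_words(spam, ham, n_words=2, min_occurrences=10))
train, test = split_corpus(range(100))
print(len(train), len(test))
```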

14
Results
  • FP 1 0.0001
  • FN 4 20
  • On some corpora (TREC 2006) the results were not
    so great (but with current heuristics)
  • FN 35 (?)
  • FP 2 email messages! (?)
  • At least, just a few false positives!
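
For reference, FP and FN rates such as those above are usually computed as below: FP over legitimate mail, FN over spam. The function name and the toy labels are made up for illustration and are not the talk's data.

```python
def error_rates(true_labels, predicted_labels):
    """Return (false positive rate, false negative rate) for spam/ham labels."""
    fp = fn = ham = spam = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == "spam":
            spam += 1
            if predicted != "spam":
                fn += 1            # spam that slipped through
        else:
            ham += 1
            if predicted == "spam":
                fp += 1            # legitimate mail marked as spam
    fp_rate = fp / ham if ham else 0.0
    fn_rate = fn / spam if spam else 0.0
    return fp_rate, fn_rate

truth = ["ham", "ham", "ham", "ham", "spam", "spam"]
guess = ["ham", "spam", "ham", "ham", "spam", "ham"]
print(error_rates(truth, guess))
```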

15
Conclusions
  • ART + Simple Features + Spam = Love
  • ART + False Positives + Spam = OMG!!!
  • (ART) = Heuristic Filter ARTMAP
  • Must use a lot of email messages. It is highly
    difficult to find representative samples for
    individual waves.
  • Can also be applied to other neural networks
  • Interesting PowerPoint template

16
Thanks
  • QUESTIONS?