Title: A False Positive Safe Neural Network for Spam Detection
1- A False Positive Safe Neural Network for Spam Detection
Alexandru Catalin Cosoi acosoi_at_bitdefender.com
2Does this look familiar?
3Anatrim
4Oh boy, it's getting worse!!!
5Oh boy, it's getting worse!!!
6Bad Bad Spammer!!!
- Databases
- D Random legitimate text
- D1 Different rephrases of a certain spam phrase
- D2 Different rephrases of another spam phrase
- ...
- Dn Different rephrases of another spam phrase
- Create spam message script
- Choose a random phrase from D1
- Choose random text from D
- Choose a random phrase from D2
- Choose random text from D
- ...
- Choose a random phrase from Dn
- Send message.
- Appeared as a consequence of botnets
- 40 samples of different subjects
- 50 samples of different titles
- 30 samples of different titles (part II)
- 60000 different combinations
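The generation script and the combination count above can be sketched in a few lines of Python. This is only an illustration of the idea, not the spammers' actual tool: the function names `build_spam_message` and `count_combinations`, and all sample texts, are hypothetical.

```python
import math
import random


def build_spam_message(legit_texts, phrase_dbs, rng):
    """Interleave one rephrase from each spam database (D1..Dn)
    with random legitimate text from D, as in the script above."""
    parts = []
    for i, phrases in enumerate(phrase_dbs):
        parts.append(rng.choice(phrases))          # random phrase from Di
        if i < len(phrase_dbs) - 1:
            parts.append(rng.choice(legit_texts))  # random text from D
    return "\n".join(parts)


def count_combinations(*sample_counts):
    """Number of distinct messages when one sample is picked per slot."""
    return math.prod(sample_counts)


# Hypothetical toy databases.
legit = ["The weather was fine all week.", "Please find the minutes attached."]
dbs = [
    ["Buy Anatrim now", "Get Anatrim today"],   # D1
    ["Lose weight fast", "Slim down quickly"],  # D2
]

message = build_spam_message(legit, dbs, random.Random(0))
combos = count_combinations(40, 50, 30)  # subjects x titles x titles (part II)
```

With 40, 50 and 30 samples per slot, `combos` is 40 × 50 × 30 = 60000, which matches the figure on the slide.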
7Features
- Larger time frame KeyWord!!!!
- Weak features
- Words like Anatrim, Viagra, Xanax, Stock
- Simple word combinations like Stock alert, Strong buy
- Simple Header Heuristics (for both spam and ham) like valid reply, weird message id, forged headers
- Example
- Top 500 spammy words from a Bayesian dictionary
- Some simple header heuristics from SpamAssassin's SARE Ninjas
- Trainer's personal flavour
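As a rough illustration of how such weak features could be turned into a binary input vector for the network, here is a minimal sketch. The word and phrase lists are taken from the bullets above; the two header checks (`has_valid_reply`, `has_weird_message_id`) are simplified, hypothetical stand-ins and not actual SpamAssassin or SARE rules.

```python
import re

SPAMMY_WORDS = ["anatrim", "viagra", "xanax", "stock"]
SPAMMY_PHRASES = ["stock alert", "strong buy"]


def has_valid_reply(headers):
    # Toy check (assumption): a Reply-To header that contains an address.
    return "@" in headers.get("Reply-To", "")


def has_weird_message_id(headers):
    # Toy check (assumption): Message-ID is not of the form <local@domain>.
    message_id = headers.get("Message-ID", "")
    return re.fullmatch(r"<[^<>@\s]+@[^<>@\s]+>", message_id) is None


def extract_features(headers, body):
    """Return a list of 0/1 values, one per weak feature."""
    text = body.lower()
    tokens = set(re.findall(r"[a-z0-9]+", text))
    features = [int(word in tokens) for word in SPAMMY_WORDS]
    features += [int(phrase in text) for phrase in SPAMMY_PHRASES]
    features.append(int(has_valid_reply(headers)))       # ham-leaning
    features.append(int(has_weird_message_id(headers)))  # spam-leaning
    return features


spam_vector = extract_features(
    {"Message-ID": "garbage", "Reply-To": ""},
    "STOCK ALERT: strong buy on Anatrim!",
)
ham_vector = extract_features(
    {"Message-ID": "<abc123@example.com>", "Reply-To": "bob@example.com"},
    "Lunch tomorrow?",
)
```

Each feature on its own says little about a message; the point of the approach is that the network learns which combinations of them matter.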
8Why ART?
- Training occurs by modifying the weights of each neuron
- For large amounts of data, forgetting important details might actually happen
- Solves the stability-plasticity dilemma
- Based on template detection
- An unlimited number of templates allows an unlimited number of patterns
- 2 self-organizing neural networks + a mapping module = a supervised self-organizing neural network
9Adaptive Resonance Theory
- Similar to a clustering algorithm (as many clusters as needed)
- ARTMAP = ARTa + ARTb + MapField
10ART Vigilance
- A big value: accepts only small errors, many small clusters, high precision
- A small value: accepts large errors, a few big clusters, errors can appear
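The effect of the vigilance value can be seen in a minimal ART1-style clusterer for binary vectors. This is a simplified sketch using the textbook fast-learning rule, not the network used in this work; the function name `art1_cluster`, the `beta` parameter and the toy patterns are illustrative assumptions.

```python
def art1_cluster(patterns, vigilance, beta=1e-6):
    """Cluster binary patterns; returns (labels, templates).

    A pattern joins the best-ranked template whose match score
    |pattern AND template| / |pattern| reaches the vigilance value,
    otherwise it founds a new template.
    """
    templates, labels = [], []
    for pattern in patterns:
        active = sum(pattern)
        overlaps = [sum(p & t for p, t in zip(pattern, tpl)) for tpl in templates]
        # Rank existing templates by the ART1 choice function.
        order = sorted(
            range(len(templates)),
            key=lambda j: -overlaps[j] / (beta + sum(templates[j])),
        )
        for j in order:
            match = overlaps[j] / active if active else 1.0
            if match >= vigilance:  # resonance: template is close enough
                templates[j] = [p & t for p, t in zip(pattern, templates[j])]
                labels.append(j)
                break
        else:  # no template passed the vigilance test
            templates.append(list(pattern))
            labels.append(len(templates) - 1)
    return labels, templates


patterns = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
]
strict_labels, _ = art1_cluster(patterns, vigilance=0.9)
loose_labels, _ = art1_cluster(patterns, vigilance=0.5)
```

On these four toy patterns the strict setting keeps the second pattern apart from the first (three clusters), while the loose setting merges them (two clusters).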
11ART
12Algorithm
13Corpus
- 2.5 million spam messages (sampled in waves with a high degree of variation) and around 1000 simple low-relevance text heuristics (not counting the standard header heuristics).
- The first 1000 words (ordered by discrimination, but with a minimum of 10-30 hundred occurrences) from a Bayesian dictionary trained on this corpus, and also standard header heuristics.
- Almost 1 million legitimate email messages
- 75% of the message corpus was used for training the neural network, and
- 25% was used in testing the neural network.
- 1.5 days to train!!!!
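Two of the corpus steps above, picking the most discriminating dictionary words and splitting the corpus 75/25, can be sketched as follows. The discrimination score used here (distance of a word's spam ratio from 0.5, on raw counts) is an assumption for illustration; the slides do not say how the Bayesian dictionary ranks words. All names and counts are hypothetical.

```python
import random


def top_discriminating_words(spam_counts, ham_counts, n_words, min_occurrences):
    """Rank words by how far their spam ratio is from 0.5,
    ignoring words seen fewer than min_occurrences times."""
    scored = []
    for word in set(spam_counts) | set(ham_counts):
        spam = spam_counts.get(word, 0)
        ham = ham_counts.get(word, 0)
        total = spam + ham
        if total < max(min_occurrences, 1):
            continue
        scored.append((abs(spam / total - 0.5), word))
    scored.sort(key=lambda item: (-item[0], item[1]))  # ties broken by word
    return [word for _, word in scored[:n_words]]


def split_corpus(messages, train_fraction=0.75, seed=0):
    """Shuffle and split into (training, testing) parts."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy counts.
spam_counts = {"anatrim": 90, "stock": 60, "the": 500, "xanax": 3}
ham_counts = {"stock": 40, "the": 500, "meeting": 80}
words = top_discriminating_words(spam_counts, ham_counts, n_words=2, min_occurrences=10)

train, test = split_corpus(range(1000))
```

Note that a strongly ham-leaning word such as "meeting" ranks as high as a strongly spam-leaning one under this score, since both help separate the two classes.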
14Results
- FP 1 0.0001
- FN 4 20
- On some corpora (TREC 2006) our results were not so great (but with current heuristics)
- FN 35 (?)
- FP 2 email messages! (?)
- At least, just a few false positives!
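For reference, false positive and false negative rates of the kind reported above are computed from the test set as shown in this small sketch. The labels below are made-up toy data, not results from this work, and the function name `error_rates` is hypothetical.

```python
def error_rates(true_labels, predicted_labels):
    """Return (false_positive_rate, false_negative_rate).

    A false positive is a legitimate (ham) message flagged as spam;
    a false negative is a spam message that was let through.
    """
    pairs = list(zip(true_labels, predicted_labels))
    ham_total = sum(1 for truth, _ in pairs if truth == "ham")
    spam_total = sum(1 for truth, _ in pairs if truth == "spam")
    false_pos = sum(1 for truth, pred in pairs if truth == "ham" and pred == "spam")
    false_neg = sum(1 for truth, pred in pairs if truth == "spam" and pred == "ham")
    fp_rate = false_pos / ham_total if ham_total else 0.0
    fn_rate = false_neg / spam_total if spam_total else 0.0
    return fp_rate, fn_rate


# Toy data: 4 ham and 5 spam messages.
truth = ["ham"] * 4 + ["spam"] * 5
predicted = ["ham", "ham", "ham", "spam", "spam", "spam", "spam", "spam", "ham"]
fp_rate, fn_rate = error_rates(truth, predicted)
```

The false positive rate is measured against the number of legitimate messages and the false negative rate against the number of spam messages, so the two rates can differ widely on the same test set.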
15Conclusions
- ART Simple Features Spam Love
- ART False Positives Spam OMG!!!
- (ART) Heuristic Filter ARTMAP
- Must use a lot of email messages. It is very difficult to find representative samples for individual waves.
- Can also be applied to other neural networks
- Interesting PowerPoint template
16Thanks