AntiPhish Lessons Learnt - PowerPoint PPT Presentation

1
AntiPhish Lessons Learnt
  • André Bergholz
  • Fraunhofer IAIS, St. Augustin
  • Workshop on CyberSecurity and Intelligence
    Informatics (CSI-KDD)
  • June 28th, 2009

2
Phishing
  • E-mail fraud
  • Send an official-looking email
  • Include a web link or form
  • Ask for confidential information, e.g., password,
    account details
  • The attacker uses the information to withdraw money,
    enter computer systems, etc.

3
Phishing Target Sites
  • Target customers of banks and online payment
    services
  • Obtain sensitive data from U.S. taxpayers via
    emails pretending to be from the IRS
  • Identity theft on social network sites, e.g.,
    myspace.com
  • Recently more non-financial brands have been attacked,
    including social networking, VOIP, and numerous
    large web-based email providers.

http://www.antiphishing.org/
4
Phishing Techniques
  • Upward trend in the number of phishing mails sent
  • Massive increase in phishing sites over recent years
  • Increasing sophistication
  • Link manipulation, deceptive URL spelling
  • Website address manipulation
  • Evolution of phishing methods beyond shotgun-style
    email
  • Image phishing
  • Spear phishing (targeted)
  • Voice-over-IP phishing
  • Whaling: targeting high-profile people

http://www.antiphishing.org/
5
Phishing Damage
  • Gartner (The War on Phishing Is Far From Over,
    2009):
  • 5 million US consumers affected between 09/2007
    and 2008 (39.8% increase)
  • Average loss per consumer $351 (60% decrease),
    total loss $1.8 billion
  • Top three most attacked countries: USA, UK, Italy
    (RSA Online Fraud Report, 2009)
  • 90% of internet users are fooled by good phishing
    websites (Dhamija et al., SIGCHI 2006)
  • For the individual phisher a low-skill, low-income
    business (Herley and Florencio, New Security
    Paradigms Workshop, 2008)

http://www.antiphishing.org/
6
Approaches against Phishing
  • Network- and encryption-based countermeasures:
    email authentication, two-factor
    authentication, mobile TANs, etc.
  • Blacklisting and whitelisting: lists of phishing
    sites and legitimate sites
  • Content-based filtering for websites and emails:
  • Typical formulations urging the user to enter
    confidential information
  • Design elements, trademarks, and logos of known
    brands (only relatively few brands are attacked)
  • Spoofed sender addresses and URLs
  • Invisible content inserted to fool automatic
    filtering approaches
  • Images containing the message text

7
EU-Project AntiPhish
  • Period: 01/2006 to 06/2009
  • Develop content-based phishing filters
  • Use realistic email corpora
  • Deploy in realistic workflows
  • Trainable and adaptive filters → adapt to new
    phishing attacks → anticipate attacks

8
Agenda
  → Email Classification based on Advanced Text
    Mining
  • Hidden Salting and Anticipating Evasion
  • Real-Life AntiPhish Deployment
  • Conclusions

9
Phishing Filtering as Classification Problem
  • Task: automatically classify emails based on
    content
  • Use email features relevant to detecting phishing
  • Training data: emails labeled with the classes ham,
    spam, and phishing
  • Train a classifier
  • Apply it to new emails

10
Message Preprocessing
Standardized email data file (flat
representation) → structured representation
including embedded images and attachments
11
Basic Features
  • Can be derived directly from the email itself,
    i.e., do not require information about specific
    websites
  • Structural features (4): number of body parts
    (total, discrete, composite, alternative)
  • Link features (8): number of links (total,
    internal, external, with IP numbers, deceptive,
    image), number of dots, action-word links
  • Element features (4): HTML, scripts, JavaScript,
    forms
  • Spam filter features (2): SpamAssassin (untrained)
    score and classification
  • Word list features (9): indicator word stems, e.g.,
    account, update, confirm, verify, secur, notif,
    log, click, inconvenien

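A minimal sketch of how a few of these basic features could be extracted from an HTML email body with the standard library. The regexes, the `INDICATOR_STEMS` list, and the helper name are illustrative assumptions, not the AntiPhish implementation, which works on the full MIME structure:

```python
import re

# Hypothetical list built from the indicator-word stems named above.
INDICATOR_STEMS = ["account", "update", "confirm", "verify", "secur",
                   "notif", "log", "click", "inconvenien"]

def basic_features(html_body: str) -> dict:
    """Extract a small subset of the basic features from one HTML part."""
    links = re.findall(r'href="([^"]+)"', html_body)
    lower = html_body.lower()
    return {
        "num_links": len(links),
        # Links whose host is a raw IP number, a common phishing trick.
        "num_ip_links": sum(bool(re.match(r'https?://\d+\.\d+\.\d+\.\d+', u))
                            for u in links),
        # Many dots in a URL can hide the true domain.
        "max_dots_in_link": max((u.count(".") for u in links), default=0),
        "has_form": "<form" in lower,
        "has_script": "<script" in lower,
        "num_indicator_words": sum(lower.count(stem) for stem in INDICATOR_STEMS),
    }

f = basic_features('<html><body><form><a href="http://1.2.3.4/login">'
                   'Click here to verify your account</a></form></body></html>')
```

The resulting dictionary would be one row of the feature matrix that the classifier is trained on.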
12
Dynamic Markov Chains
  • Operate on the bit representation of the natural
    language text of the email
  • Model a bit sequence as a stationary and ergodic
    Markov source with limited memory

01010010100100101110101001011010010101001010100111
01001010101010101
  • Incrementally build such an automaton / Markov
    chain to model the training sequences
  • Train one DMC for each of the classes (i.e., ham,
    spam, phishing); for a new email, see which model
    fits best
  • Has been successfully applied to spam
    classification (Bratko et al., JMLR 2006)

13
Dynamic Markov Chains: Details
  • States: two probabilities representing the
    likelihood that the source emits 1 or 0 as the next
    symbol
  • Prediction: move through the automaton, add up
    likelihoods
  • Training (incremental): states are cloned when
    reached via a frequently used transition
  • Model size reduction: use training examples that
    the model cannot already classify well enough
    (after some initial training; see also
    uncertainty sampling in active learning)
  • Features: expected cross-entropies of a message
    under either model (ham and phishing), Boolean
    membership indicators

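To make the idea concrete, here is a small sketch of bit-level Markov modeling for classification. Note the simplification: it uses a fixed context order rather than DMC's dynamic state cloning, and the Laplace smoothing and tiny training texts are assumptions for illustration:

```python
import math
from collections import defaultdict

def bits(text):
    """Yield the bit sequence of a text's UTF-8 bytes."""
    for byte in text.encode("utf-8"):
        for i in range(7, -1, -1):
            yield (byte >> i) & 1

class BitMarkovModel:
    """Fixed-order Markov model over the bit representation of text.
    Only an approximation of a DMC, which instead starts from a small
    automaton and clones states reached via frequent transitions."""

    def __init__(self, order=8):
        self.order = order
        self.counts = defaultdict(lambda: [1, 1])  # Laplace-smoothed counts

    def train(self, text):
        ctx = ()
        for b in bits(text):
            self.counts[ctx][b] += 1
            ctx = (ctx + (b,))[-self.order:]

    def cross_entropy(self, text):
        """Average number of bits needed to encode the text; the class
        whose model yields the lowest value fits the message best."""
        ctx, total, n = (), 0.0, 0
        for b in bits(text):
            c0, c1 = self.counts[ctx]
            total -= math.log2((c1 if b else c0) / (c0 + c1))
            n += 1
            ctx = (ctx + (b,))[-self.order:]
        return total / n

# One model per class (ham, spam, phishing); a new email is assigned to
# the class whose model gives the lowest cross-entropy.
ham = BitMarkovModel(); ham.train("meeting notes attached, see you tomorrow")
phish = BitMarkovModel(); phish.train("verify your account password immediately")
msg = "please verify your password"
label = "phishing" if phish.cross_entropy(msg) < ham.cross_entropy(msg) else "ham"
```

With realistic corpora each class model is trained on thousands of messages, and the per-class cross-entropies also serve as features for the downstream classifier.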
14
Latent Topic Models
  • Analyze the co-occurrence of words
  • Similar to word clustering: the number of
    topics is specified in advance
  • Common methods: LDA, PLSA
  • Probabilistic latent semantic analysis: models
    the probability of each co-occurrence as a
    mixture of conditionally independent multinomial
    distributions
  • Latent Dirichlet allocation: generative Bayesian
    version with a Dirichlet prior
  • Document: mixture of various topics

15
Latent Topic Models: Class-Specific
  • Class-Topic Model (CLTOM): an extension of LDA
    that incorporates class information
  • LDA: uniform per-document topic Dirichlet prior
    α, uniform per-topic word Dirichlet prior β
  • CLTOM: class-specific per-document topic
    Dirichlet prior αc
  • Training using EM / mean field approximation
  • Features: probabilities for each topic

16
Latent Topic Model Topics
(Figure: words of each topic sorted by probability, with the topics' relevance for phishing)
17
Feature Processing and Selection
  • Feature processing:
  • Scaling: guarantees that all features have
    values within the same range
  • Normalization: sets the length of the feature
    vectors to one, which is adequate for
    inner-product based classifiers
  • Feature selection:
  • Goal: select a subset of relevant features
  • Abstract: search in a state space (Kohavi and
    John, AI Journal 1997)
  • Operates on an independent validation set
  • Best-first search strategy: expands the current
    subset by the node with the highest estimated
    performance; stores additional nodes to overcome
    local maxima
  • Compound operators: combine the set of
    best-performing children

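The two processing steps and a simplified version of the search (plain greedy forward selection rather than best-first search with compound operators) might look like this; the toy score function stands in for validation-set performance:

```python
import math

def scale(vectors):
    """Min-max scale each feature into [0, 1] so all features share a range."""
    n = len(vectors[0])
    lo = [min(v[j] for v in vectors) for j in range(n)]
    hi = [max(v[j] for v in vectors) for j in range(n)]
    return [[(v[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
             for j in range(n)] for v in vectors]

def l2_normalize(vec):
    """Set the feature vector's length to one (for inner-product classifiers)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def greedy_forward_selection(features, score):
    """Greedily add the feature whose addition most improves the
    validation score; stop when no addition helps."""
    selected, best = set(), score(set())
    while True:
        candidates = [(score(selected | {f}), f) for f in features - selected]
        if not candidates:
            break
        s, f = max(candidates)
        if s <= best:
            break
        best, selected = s, selected | {f}
    return selected, best

# Toy validation score: "dmc" and "topics" help, every extra feature costs.
score = lambda s: len(s & {"dmc", "topics"}) - 0.1 * len(s)
chosen, perf = greedy_forward_selection({"dmc", "topics", "noise"}, score)
```

Best-first search differs from this greedy sketch by keeping a queue of already-expanded subsets so it can back out of local maxima.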
18
Evaluation Method and Test Corpus
  • Standard method: 10-fold cross-validation
  • Criteria: precision, recall, F-measure, false
    positive rate, false negative rate; accuracy for
    comparison with related work
  • Note: errors are not of equal importance
  • Test corpus: assembled by Fette et al. (WWW
    2007)
  • Ham emails: SpamAssassin corpus
  • Phishing emails: collected by Nazario
  • Total size: 7808 emails, 6951 ham (89%) and 857
    phishing (11%)

19
Overall Result
(Chart: lost ham emails = false positives, missed phishing emails = false negatives)
  • FPR reduced by 92%, FNR by 64%
  • Statistically significant difference to Fette et
    al. '07 with less than 1% error
  • Feature selection: better results with fewer
    features and less training data (20% reserved for
    validation)

20
Agenda
  • Email Classification based on Advanced Text
    Mining
  → Hidden Salting and Anticipating Evasion
  • Real-Life AntiPhish Deployment
  • Conclusions

21
Salting
  • Salting: intentional addition or distortion of
    content to evade automatic filtering
  • Can be applied to any medium (e.g., text, images,
    audio) and to any content genre (e.g., emails,
    web pages, MMS messages)
  • Visible salting: additional text, images
    containing random pixels, etc.
  • Hidden salting: not perceivable by the user
    (e.g., text in invisible color, text behind
    objects, reading-order manipulation)

22
Email source text → internal representation → drawing canvas → end user

Source text:
<html> <head> </head> <body> <h1>A story</h1> <p> Once there was ... </p> </body> </html>

Internal representation: document tree with element nodes <html>, <head>, <body>, <h1>, <p>, <p>, <em>, <a>

Rendered for the end user:
A story
Once there was a noble prince. He lived in a fancy castle. Read more
23
Hidden Salting Simulation
  • We tap into the rendering process to detect
    hidden content, i.e., manifestations of salting.
  • Intercept requests for drawing text primitives
  • Build an internal representation of the
    characters, i.e., a list of attributed glyphs in
    compositional order
  • Test for glyph visibility:
  • Clipping: the glyph is drawn within the physical
    bounds of the drawing clip.
  • Concealment: the glyph is not concealed by other
    glyphs or shapes.
  • Font color: the glyph's fill color contrasts well
    with the background color.
  • Glyph size: the glyph's size and shape are
    sufficiently large.

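Two of the four visibility tests could be sketched as follows; the luminance formula and the contrast threshold are assumptions for illustration, not the system's actual criteria:

```python
def relative_luminance(rgb):
    """Approximate relative luminance of an (r, g, b) color in 0..255."""
    r, g, b = (c / 255.0 for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def font_color_visible(fill, background, min_contrast=1.5):
    """Font-color test: does the glyph's fill color contrast with the
    background? A simple luminance ratio is used here."""
    lo, hi = sorted([relative_luminance(fill), relative_luminance(background)])
    return (hi + 0.05) / (lo + 0.05) >= min_contrast

def glyph_visible(x, y, w, h, clip_w, clip_h, fill, background):
    """Combine the clipping and font-color tests for one glyph; the
    concealment and glyph-size tests are omitted in this sketch."""
    inside_clip = x >= 0 and y >= 0 and x + w <= clip_w and y + h <= clip_h
    return inside_clip and font_color_visible(fill, background)
```

For example, a white-on-white glyph fails the font-color test, and a glyph positioned outside the drawing clip fails the clipping test, so neither would reach the simulated perceived text.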
24
Hidden Salting Simulation (cont.)
  • We feed the intercepted, visible text into a
    cognitive hidden-salting simulation model, which
    returns the simulated perceived text.
  • Reading order: detected based on a layout
    characteristic, expecting that glyphs of
    parallel lines are aligned
  • Compliance of the text with the language-specific
    distributions of character n-grams, common words,
    and word lengths
  • For details, see De Beer and Moens, Tech. Report,
    KU Leuven, 2007

25
Evasion Detection
  • Cat-and-mouse game: spammers develop new
    tricks; filter developers adapt their
    filters
  • So far: hidden salting simulation model
  • Closing the loop: identify email messages that
    are likely to make the hidden salting simulation
    system fail
  • Method: compare the simulated perceived text as
    generated by our hidden salting simulation system
    with the message text obtained by applying OCR
    to the rendered email message

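A possible sketch of such a comparison, using normalized edit distance as a stand-in for the robust text distance measures (the project's actual distance features may differ):

```python
def levenshtein(a, b):
    """Classic edit distance, row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[len(b)]

def text_distance(simulated, ocr):
    """Normalized distance in [0, 1]: small for mere OCR noise, large
    when the simulation kept content that is invisible on screen,
    hinting at a salting trick the simulation missed."""
    if not simulated and not ocr:
        return 0.0
    return levenshtein(simulated.lower(), ocr.lower()) / max(len(simulated), len(ocr))

noise_only = text_distance("your loan is approved", "your _oan is approve_")
hidden_text = text_distance("HIDDEN SALT your loan is approved", "your loan is approved")
```

A small distance means simulation and OCR agree up to OCR noise; a large distance flags the message for inspection.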
26
Evasion Detection Approach
27
Example
HTML source:
<html> <body> <font color="ffffff">INNOCENT TEXT TO TRICK FILTER </font>
<p>Your home refinance loan is approved!<br></p><br>
<p>To get your approved amount <a href="http://www.mortgagepower3.com/">go here</a>.</p>
<br><br><br>
<p>To be excluded from further notices <a href="http://www.mortgagepower3.com/remove.html">go here</a>.</p>
</body> <font color="ffffff">1gate
</html>
5297gdqK6-498jyxl3033RafD3-195RTcz6485obQU9-615LOLg9l49
(Screenshot of the email on screen omitted.)
Simulated perceived text (hidden salting simulation):
INNOCENT TEXT TO TRICK FILTER Your home refinance
loan is approved! To get your approved amount go
here. To be excluded from further notices go here.
OCR text:
Your _ome refinance loan is approve_! To get your
approve_ amount _o_o _ere. To De exclu_e_ from
furt_er notices __o _ere.
Detect difference
28
Evaluation Method
  • Method: simulate the detection of a new salting
    trick by disabling the detection of one of the
    known tricks
  • Classifier: one-class SVM
  • Training set: one class, the class of "normal"
    emails, i.e., emails that contain no or only
    known salting tricks
  • Test set: emails both with and without the
    disabled ("new") salting trick
  • Features: robust text distance measures
  • The classifier marks outliers, i.e., emails that are
    not in the one class, which indicates that they
    may contain a previously unseen salting trick
  • The classifier produces a real-valued output; we
    automatically compute the cutting threshold by
    reapplying the classifier to the training set
  • OCR engines: gocr, ocrad

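The threshold computation might be sketched as follows. A quantile over training-set scores stands in for calibrating the one-class SVM's real-valued output; the 5% flag fraction is an assumed parameter, not taken from the project:

```python
def cutting_threshold(train_scores, flag_fraction=0.05):
    """Compute the cutting threshold by reapplying the scorer to the
    training set: pick the value below which all but roughly
    `flag_fraction` of the (mostly normal) training emails fall."""
    s = sorted(train_scores)
    k = max(0, len(s) - 1 - int(flag_fraction * len(s)))
    return s[k]

def flag_outliers(scores, threshold):
    """Emails scoring above the threshold are marked as outliers, i.e.,
    candidates for a previously unseen salting trick."""
    return [i for i, sc in enumerate(scores) if sc > threshold]

th = cutting_threshold(list(range(100)))   # toy integer outlier scores 0..99
flagged = flag_outliers([10, 96, 99], th)
```

The point of the calibration is that the threshold adapts automatically to the score distribution instead of being hand-tuned per corpus.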
29
Test Data
  • 6951 ham, 2154 spam, and 4559 phishing
    messages from the SpamAssassin and Nazario corpora
  • Considered tricks: font color, font size
  • Training set: 800 messages without the trick
  • Test set: 100 messages with / 300 messages without
    the trick

30
Overall Result
31
Agenda
  • Email Classification based on Advanced Text
    Mining
  • Hidden Salting and Anticipating Evasion
  → Real-Life AntiPhish Deployment
  • Conclusions

32
Filtering a Real-Life Email Stream: Challenges
  • Fixed scenario with fixed parameters
  • Data:
  • From a present real-life stream
  • Mostly English and Italian
  • (Almost) unskewed
  • All data is unlabeled; not easy to eliminate spam
  • Very strict privacy regulations
  • Experiments: almost online

33
General Deployment Approach
Start: initial AntiPhish model M0
  • For every day t ∈ {1, . . . , n}:
  • Capture a set of emails St, sent in real time
    through spam filters
  • Select a test subset Tt ⊂ St for evaluation of
    the current AntiPhish model Mt-1
  • Select a subset At ⊂ St of emails that are
    difficult to classify, to be used for active
    learning
  • Obtain labels for the sets Tt and At
  • Evaluate the current model Mt-1 on the set Tt
  • Add the set At to the training set; train the new
    model Mt

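The daily loop above can be sketched as a skeleton; `train` and `confidence` are toy stand-ins for the real model, and stratification and labeling are omitted:

```python
import random

# Toy stand-ins for the real components (model, confidence estimate).
def train(emails):
    """Pretend model: just remembers how much data it saw."""
    return {"size": len(emails)}

def confidence(model, email):
    """Pretend classification confidence in [0, 1]."""
    return random.random()

def daily_loop(stream_by_day, test_size=250, active_size=500):
    """Skeleton of the per-day deployment procedure."""
    training_set = []
    model = train(training_set)          # M0
    for day_emails in stream_by_day:     # t = 1, ..., n
        # T_t: test subset for evaluating M_{t-1} (stratification omitted).
        test_set = day_emails[:test_size]
        # A_t: hardest-to-classify emails, used for active learning.
        ranked = sorted(day_emails[test_size:],
                        key=lambda e: confidence(model, e))
        active_set = ranked[:active_size]
        # (Labels for T_t and A_t would be obtained here, and M_{t-1}
        # evaluated on T_t; both omitted in this sketch.)
        training_set += active_set       # add A_t to the training data
        model = train(training_set)      # new model M_t
    return model

final = daily_loop([[f"mail{d}_{i}" for i in range(750)] for d in range(3)])
```

With 750 emails per day and the sizes above, the training set grows by 500 emails each day, matching the |Tt ∪ At| = 750 budget described on the next slide.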
34
Details
  • AntiPhish system is evaluated on arbitrary
    collected emails.
  • Deployment period n 20 days.
  • Used features unigram, DMC, semantic topics with
    k 25 topics, link, and lexical features
  • Every day a total of Tt È At 750 emails are
    selected.
  • An email is classified as non-ham if and only if
    it is considered with a probability of at least
    95 to be non-ham.

35
Stratified Evaluation
  • Tt Stratified sample of its underlying base set
    St
  • Idea. Better represent interesting emails
  • Two buckets Emails that are difficult or easy to
    classify
  • Basic procedure Oversample the difficult emails,
    but give them a lower weight in evaluation
  • More specifically Let St St(u) È St(c) , we
    want to sample k1 and k2 emails respectively
  • Then
  • We use a probability of p 95 (for non-ham) as
    certainty threshold.

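Assuming the standard stratified-sampling weighting (the slide's own formula is not reproduced in the transcript), the weighted error estimate could look like this:

```python
def stratified_error(difficult, easy, k1, k2, is_error):
    """Estimate the error rate over S = difficult ∪ easy from a
    stratified sample of k1 difficult and k2 easy emails. Each sampled
    email is weighted by bucket_size / sample_size, the usual stratified
    weighting, so oversampling the difficult bucket does not bias the
    estimate. `is_error` says whether the classifier got an email wrong."""
    n = len(difficult) + len(easy)
    total = 0.0
    for bucket, k in ((difficult, k1), (easy, k2)):
        sample = bucket[:k]              # deterministic here; random in practice
        weight = len(bucket) / len(sample)
        total += weight * sum(is_error(e) for e in sample)
    return total / n

# Toy check: 5 of 100 emails are errors; the weighted estimate from a
# stratified sample of 2 difficult + 10 easy emails recovers the 5% rate.
difficult = [1, 0] * 5                   # 10 difficult emails, 5 misclassified
easy = [0] * 90                          # 90 easy emails, none misclassified
est = stratified_error(difficult, easy, 2, 10, lambda e: e)
```

The weight is simply the inverse inclusion probability of each bucket, which is what makes the oversampled difficult emails count less per email.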
36
Active Learning
(Diagram: the email stream is split into uncertain and certain emails;
random over-/undersampling combines them with the previous training set
into a new training set, on which the new model is trained.)
  • Set of additional training emails per day: At,
    |At| = 500
  • 400 top-ranked emails from St having the lowest
    classification confidence
  • ... plus 100 emails randomly selected from the
    rest of St
  • Minimization of duplicates among the 400
    uncertain emails: ignore duplicates

37
Initial Dataset
  • Initial dataset: six days of 750 messages each
  • Total: 4489 messages
  • Ham: 1514 (34%)
  • Phishing: 1342 (30%)
  • Spam: 1633 (36%)
  • Non-ham: 2975 (66%)
  • Time period for the experiment: the subsequent 20 days

38
Additional Training Data Through Active Learning
39
Test Data and Evaluation
  • 250 messages per day
  • k1 = k2 = 125 difficult and easy messages
  • Sometimes fewer, because not enough difficult
    emails were found
  • Evaluation:
  • False positive rate: proportion of lost ham
    emails among all ham emails
  • False negative rate: proportion of missed non-ham
    emails among all non-ham emails

40
Test Data
41
Baseline Result
  • FPR (ham classified as non-ham): 0.34% on average
  • FNR (non-ham classified as ham): 7.09% on average
42
Result for Selected Thresholds
(Threshold in % on the predicted probability of non-ham)
43
Effect of Active Learning
(Charts: ham classified as non-ham; non-ham classified as ham)
  • Three different fixed models:
  • Initial model M0
  • Model after five days of active learning: M5
  • Model after ten days of active learning: M10

44
Effect of Active Learning
45
Spam Filter Vote as Feature
46
Phishing vs. Ham Classification
  • Phishing vs. ham (instead of non-ham vs. ham)
  • FPR 0%, FNR 7.62%

47
Identifying Potential Phishing in Spam
  • Second real-life application
  • Anti-spam operations use spam traps to gather the
    latest spam samples so that these can be better
    defended against
  • The ability to separate out the phishing leads to
    a quicker defence against such fraudulent
    activity

(Diagram: a honeypot network feeds spam and phishing into a phishing
classifier, which separates regular spam from phishing; the identified
phishing yields updated signatures for a fast update of the spam filter.)
48
Related Laboratory Experiment
  • Labeled data: phishing and regular spam from a
    probe network
  • Training: 53 phishing vs. 1060 regular spam per
    week
  • Test: 75 phishing vs. 1443 regular spam per week
    (on average)
  • Duration: June to November 2008 (26 weeks)
  • System parameters:
  • Features: DMC, semantic topics with 10 topics,
    unigram, wordlist, DMC-link
  • Threshold: neutral (50%)
  • Evaluation: sliding window strategy
  • Each week is filtered by a classifier trained on the
    previous N = 4 weeks
  • Result:
  • FPR (spam classified as phishing): 0.18%
  • FNR (phishing classified as spam): 4.89%

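The sliding-window evaluation can be sketched as follows; the majority-label model and the toy weekly data are purely illustrative stand-ins for the real classifier and corpus:

```python
def fit(train_data):
    """Toy stand-in model: remember the majority label of the training data."""
    labels = [label for _, label in train_data]
    return max(set(labels), key=labels.count)

def error_rate(model, week):
    """Fraction of the week's emails whose label differs from the prediction."""
    return sum(label != model for _, label in week) / len(week)

def sliding_window_eval(weeks, n_train=4):
    """Sliding-window strategy: classify each week with a model trained
    on the previous n_train weeks, so the filter tracks drift over time."""
    results = []
    for t in range(n_train, len(weeks)):
        train_data = [x for wk in weeks[t - n_train:t] for x in wk]
        results.append(error_rate(fit(train_data), weeks[t]))
    return results

# Six toy weeks of (message, label) pairs, 9 spam and 1 phishing each.
weeks = [[("mail", "spam")] * 9 + [("mail", "phishing")] for _ in range(6)]
rates = sliding_window_eval(weeks)
```

Because every test week lies strictly after its training window, the evaluation never leaks future data into the model, mirroring the deployment setting.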
49
Sliding Window, Training on N = 4 Weeks
(Charts: phishing classified as spam; spam classified as phishing)
50
Agenda
  • Email Classification based on Advanced Text
    Mining
  • Hidden Salting and Anticipating Evasion
  • Real-Life AntiPhish Deployment
  → Conclusions

51
Conclusions: Lessons Learnt
  • Phishing: a multi-billion-dollar activity
  • AntiPhish: phishing prevention through
    content-based email filtering
  • Advanced text-mining features boost performance:
    dynamic Markov chains, latent topic models
  • Most of these techniques are language-independent
  • Anticipatory learning: detecting new filter
    evasion techniques; requires high-speed,
    high-quality OCR
  • Real-life deployment:
  • Active learning keeps filters up to date
  • Combination with spam filters improves
    performance through incorporation of current
    blacklist information.
  • Identifying phishing in a honeypot network
    permits prioritization in spam-filter updating.