Identifying Suspicious URLs: An Application of Large-Scale Online Learning

Learn more at: https://cseweb.ucsd.edu
Transcript and Presenter's Notes


1
Identifying Suspicious URLs: An Application of
Large-Scale Online Learning
  • Justin Ma, Lawrence Saul, Stefan Savage, Geoff
    Voelker
  • Computer Science and Engineering
  • UC San Diego
  • Presentation for ICML 2009
  • June 15, 2009

2
Detecting Malicious Web Sites
  • Safe URL?
  • Web exploit?
  • Spam-advertised site?
  • Phishing site?

URL = Uniform Resource Locator
http://www.cs.mcgill.ca/icml2009/abstracts.html
http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
http://fblight.com
http://mail.ru
Predict what is safe without committing to risky actions
3
Problem in a Nutshell
  • URL features to identify malicious Web sites
  • Different classes of URLs
  • Benign, spam, phishing, exploits, scams...
  • For now, distinguish benign vs. malicious

facebook.com
fblight.com
4
Today's Talk
  • Problem
  • Approach
  • Learning to detect malicious URLs
  • Challenges: scale and non-stationarity
  • Evaluations
  • Need for large, fresh training sets
  • Online learning
  • Conclusion

5
State of the Practice
  • Current approaches
  • Blacklists
  • Learning on hand-tuned features
  • Limitations
  • Cannot learn from newest examples quickly
  • Cannot quickly adapt to newest features
  • Arms race: fast feedback cycle is critical

More automated approach?
6
Live URL Classification System
Label
Example
Hypothesis
7
Live Training Feed
  • Malicious URLs (spamming and phishing)
  • 6,000 to 7,500 per day from Web mail provider
  • Benign URLs
  • From Yahoo Web directory
  • Total of 20,000 URLs per day
  • Live collection since Jan. 5, 2009
  • Months of data
  • Two million examples after 100 days

8
Feature vector construction
http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
  • WHOIS registration: 3/25/2009
  • Hosted from: 208.78.240.0/22
  • IP hosted in: San Mateo
  • Connection speed: T1
  • Has DNS PTR record? Yes
  • Registrant: Chad ...
Feature vector: _ _ 0 0 0 1 1 1 1 0 1 1
Feature groups: host-based, lexical, real-valued
Counts shown: 60 features, 1.8 million, 1.1 million (growing, as of Day 100)
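
The feature construction on this slide can be illustrated with a minimal sketch. It covers only a bag-of-words treatment of the URL's lexical tokens and an index that grows as new tokens appear; the host-based lookups (WHOIS, IP prefix, connection speed) are left out. The function name `lexical_features`, the class `FeatureIndex`, the `host:`/`path:`/`tld:` naming scheme, and the delimiter set are assumptions for illustration, not the authors' implementation.

```python
import re
from urllib.parse import urlparse


def lexical_features(url):
    """Map a URL to sparse binary features (hypothetical naming scheme)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    features = {}
    # One binary feature per hostname token, split on dots.
    for token in host.split("."):
        if token:
            features["host:" + token] = 1.0
    # One binary feature per path token, split on common delimiters.
    for token in re.split(r"[/?.=&_-]+", parsed.path):
        if token:
            features["path:" + token] = 1.0
    # The top-level domain gets its own feature.
    if "." in host:
        features["tld:" + host.rsplit(".", 1)[1]] = 1.0
    return features


class FeatureIndex:
    """Assigns a fresh integer index to each feature name on first sight,
    so the feature space keeps growing as new URLs arrive."""

    def __init__(self):
        self.index = {}

    def vectorize(self, features):
        return {
            self.index.setdefault(name, len(self.index)): value
            for name, value in features.items()
        }


index = FeatureIndex()
x1 = index.vectorize(lexical_features("http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll"))
x2 = index.vectorize(lexical_features("http://fblight.com"))
```

Because every unseen token receives a new index, the dimensionality is not fixed in advance, which is one way to read the "GROWING" label on the slide.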
9
Live URL Classification System
Online learning
10
Practical Challenges of ML in Systems
  • Industrial concerns
  • Scale: millions of examples, features
  • Non-stationarity: examples change over time (arms
    race w/ criminals)
  • Pivotal decision: batch or online?

11
Batch vs. Online Learning
  • Batch/offline learning
  • SVM, logistic regression, decision trees, etc.
  • Multiple passes over data
  • No incremental updates
  • Potentially high memory and processing overhead
  • Online learning
  • Perceptron-style algorithms
  • Single pass over data
  • Incremental updates
  • Low memory and processing overhead

Online learning addresses scale and
non-stationarity
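
The single-pass, incremental style of online learning can be sketched as a predict-then-update loop that tracks the cumulative error rate. The `run_online` function and the placeholder `MajorityLearner` are hypothetical names introduced here; any object with `predict` and `update` methods could be substituted.

```python
class MajorityLearner:
    """Placeholder learner (an assumption for illustration): predicts
    whichever label, +1 or -1, it has seen more often so far."""

    def __init__(self):
        self.balance = 0

    def predict(self, x):
        return 1 if self.balance >= 0 else -1

    def update(self, x, y):
        self.balance += y


def run_online(learner, stream):
    """Single pass over the stream: predict each example first, then
    update on its label. Returns the cumulative error rate after each
    example."""
    mistakes = 0
    rates = []
    for t, (x, y) in enumerate(stream, start=1):
        if learner.predict(x) != y:
            mistakes += 1
        learner.update(x, y)  # incremental update, no retraining pass
        rates.append(mistakes / t)
    return rates


stream = [({}, 1), ({}, 1), ({}, -1)]
rates = run_online(MajorityLearner(), stream)
```

Each example is touched once and then discarded, so memory use depends on the model size rather than on the number of examples seen.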
12
Evaluations
  • Online learning for URL reputation
  • Need for large, fresh training sets
  • Comparing online algorithms

13
Need lots of fresh training data?
SVM trained once
SVM retrained daily
  • Fresh data helps

14
Need lots of fresh training data?
SVM trained once on 2 weeks
SVM w/ 2-week sliding window
  • Fresh data helps
  • More data helps
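
The difference between training once on two weeks and retraining on a two-week sliding window comes down to which days feed the training set. A minimal sketch, assuming examples are grouped by integer day number in a dict called `examples_by_day` (a hypothetical layout, not described in the slides):

```python
def sliding_window(examples_by_day, today, window_days=14):
    """Collect the examples from the `window_days` days before `today`
    (today itself is excluded, since it is the day being classified)."""
    start = max(0, today - window_days)
    window = []
    for day in range(start, today):
        window.extend(examples_by_day.get(day, []))
    return window


# Toy data: one labeled example per day for 30 days.
examples_by_day = {d: [("url-%d" % d, 1)] for d in range(30)}

trained_once = sliding_window(examples_by_day, today=14)    # days 0..13, fixed
retrained_day_20 = sliding_window(examples_by_day, today=20)  # days 6..19
```

A model trained once keeps using the first window, while the sliding-window variant recomputes the window and retrains as `today` advances.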

15
Which online algorithm?
  • Perceptron
  • Stochastic Gradient Descent for Logistic
    Regression
  • Confidence-Weighted Learning

16
Perceptron
Rosenblatt, 1958
  • Convergence result: number of mistakes ≤ (radius / margin)²
  • Update on each mistake
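
The mistake-driven update can be sketched with sparse feature dicts. This is the textbook Perceptron with labels in {+1, -1}; the class layout and the toy data are assumptions for illustration.

```python
def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(i, 0.0) * v for i, v in x.items())


class Perceptron:
    def __init__(self):
        self.w = {}
        self.mistakes = 0

    def predict(self, x):
        return 1 if dot(self.w, x) >= 0 else -1

    def update(self, x, y):
        # Only a mistake (non-positive margin) changes the weights.
        if y * dot(self.w, x) <= 0:
            self.mistakes += 1
            for i, v in x.items():
                self.w[i] = self.w.get(i, 0.0) + y * v


# Toy linearly separable data (hypothetical).
data = [
    ({0: 1.0, 1: 1.0}, 1),
    ({0: -1.0, 1: -1.0}, -1),
    ({0: 1.0, 1: 0.5}, 1),
    ({0: -0.5, 1: -1.0}, -1),
]
model = Perceptron()
for x, y in data:
    model.update(x, y)
```

For this toy set the largest squared norm is 2 and the separator (1, 1)/√2 gives a margin of 1.5/√2, so the bound (radius / margin)² is about 1.78 and at most one mistake is possible.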

17
Logistic Regression with SGD
Bottou, 1998
  • Log likelihood
  • Update for every example, proportional to the prediction error
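
A minimal sketch of logistic regression trained by stochastic gradient ascent on the log likelihood, in its standard textbook form rather than the slide's own notation (which the transcript does not preserve). Labels are assumed to be in {+1, -1}, and the learning rate of 0.1 is an arbitrary choice for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class LogisticSGD:
    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.w = {}

    def predict_proba(self, x):
        z = sum(self.w.get(i, 0.0) * v for i, v in x.items())
        return sigmoid(z)

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else -1

    def update(self, x, y):
        # Map the label to {0, 1}; the step is proportional to the gap
        # between the target and the predicted probability.
        target = (y + 1) / 2
        delta = target - self.predict_proba(x)
        for i, v in x.items():
            self.w[i] = self.w.get(i, 0.0) + self.learning_rate * delta * v
        return delta
```

Unlike the Perceptron, every example moves the weights, and confidently correct predictions move them only slightly because `delta` is then close to zero.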
18
Confidence-Weighted Learning
Dredze et al., 2008; Crammer et al., 2009
  • Maintain Gaussian distribution over weight vector
  • Constrained problem
  • Closed-form update

Treat features differently
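
A sketch of a confidence-weighted update, assuming the variance-constrained formulation with a diagonal covariance approximation: each weight has a mean and a variance, and the update keeps the margin at least φ times the margin variance. The class name `DiagonalCW`, the confidence parameter `eta=0.9`, and the initial variance of 1.0 are assumptions for illustration; the slides do not state which variant or settings were used.

```python
import math
from statistics import NormalDist


class DiagonalCW:
    """Confidence-weighted learner with a per-feature (diagonal) variance."""

    def __init__(self, eta=0.9, initial_variance=1.0):
        # phi is the standard normal quantile of the confidence level eta
        # (eta must exceed 0.5 so that phi is positive).
        self.phi = NormalDist().inv_cdf(eta)
        self.initial_variance = initial_variance
        self.mu = {}     # per-feature weight means
        self.sigma = {}  # per-feature weight variances

    def _variance(self, i):
        return self.sigma.get(i, self.initial_variance)

    def predict(self, x):
        score = sum(self.mu.get(i, 0.0) * v for i, v in x.items())
        return 1 if score >= 0 else -1

    def update(self, x, y):
        phi = self.phi
        m = y * sum(self.mu.get(i, 0.0) * v for i, v in x.items())
        v = sum(self._variance(i) * val * val for i, val in x.items())
        if v == 0.0:
            return 0.0
        # Closed-form step size: positive root of the quadratic obtained
        # by enforcing margin = phi * variance after the update.
        b = 1.0 + 2.0 * phi * m
        root = math.sqrt(b * b - 8.0 * phi * (m - phi * v))
        alpha = max(0.0, (-b + root) / (4.0 * phi * v))
        if alpha == 0.0:
            return 0.0
        for i, val in x.items():
            s = self._variance(i)
            # High-variance (low-confidence) features move the most.
            self.mu[i] = self.mu.get(i, 0.0) + alpha * y * s * val
            # Seeing a feature shrinks its variance.
            self.sigma[i] = 1.0 / (1.0 / s + 2.0 * alpha * phi * val * val)
        return alpha
```

The per-feature variance is what lets the learner treat features differently: rarely seen features keep a large variance and receive large updates, while frequently seen ones are adjusted cautiously.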
19
Which online algorithms?
Perceptron
20
Which online algorithms?
Perceptron
LR w/ SGD
  • Proportional update helps

21
Which online algorithms?
Perceptron
LR w/ SGD
Confidence-Weighted
  • Proportional update helps
  • Per-feature confidence really helps

22
Batch...
Batch
  • Fresh data helps
  • More data helps

23
Batch vs. Online
Batch
Confidence-Weighted
  • Fresh data helps
  • More data helps
  • Online matches batch

24
Conclusion
  • Detecting malicious URLs
  • Relevant real-world problem
  • Successful application of online learning
  • Confidence-Weighted vs. Batch
  • As accurate
  • More adaptive
  • Fewer resources
  • Future work
  • Scaling up for deployment