1
Design and evaluation of spam blocking method
based on route information
  • SHIMOKAWA Yoshitaka
  • Master's course, 2nd year
  • Chikayama/Taura Lab.

2
Introduction
3
Consideration on the spam problem
  • Spam will cost over $10 billion in 2003
  • Spam is expected to cost American corporations
    about $168 per employee for 2003
  • 40% of e-mail traffic is spam
  • Unique spam attacks rose to over 5 million by
    August 2002, up from 1.5 million in August 2001
  • -- from Ferris Research

→ Social and economic problems
4
Position of this research
  • purpose
  • design of a spam filtering method
  • position
  • evaluate the effectiveness and adequacy of the
    filtering method
  • final goal
  • design of a system based on distributed adaptive
    blacklists
  • novelty
  • use route information instead of digest values

5
Brief of the idea
[fig. senders A-E connect through relay MTAs F and G to receiver R]
  • ex) A → F: bad, B → F: good
  • route A → F → R → spam?
  • route B → F → R → legit?
6
Overview
  • Current research topics
  • Design of a method
  • Implementation and Evaluation
  • Conclusion and Future works

7
1. Current research topics
8
Common techniques to avoid spam
  • prevention of open relay
  • limited relaying
  • use of blacklists
  • rejection of known spam hosts
  • reflected in local lists manually
  • confirmation of SMTP session validity
  • SMTP commands, ex) HELO, MAIL FROM
  • text filtering
  • judged as spam if a specified word is in the mail
  • effective enough?
  • No!
  • (in accuracy and cost)

9
Classification of current research topics
(How to avoid spam?)
  Method                            How?                              What?
  Bayesian                          multiple attributes               header, body
  K nearest neighbor                multiple attributes               header, body
  Distributed adaptive blacklists   resolved by human resources       header (digest)
  Challenge/Response                negotiation between hosts         connection (negotiation)
  Reverse MX                        negotiation between MTA and DNS   connection (negotiation)
10
Bayesian filtering
  • assumption
  • spam seems to have specific words like "XXX",
    "money"
  • consider statistics from past spam/legit mail
  • apply Bayes' theorem to spam filtering
  • if P(spam|M) > λ · P(legit|M) → spam
  • M: mail (consists of words a1 ... an)
  • ai: word
  • λ: weight value (legit is more important)
  • P(spam|M) is calculated from the P(spam|ai)

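As an illustration of the decision rule above, here is a minimal sketch, not the author's implementation: per-word spam/legit counts with whitespace tokenization, Laplace smoothing, and a weight value lam are all assumed placeholders.

```python
import math
from collections import Counter

def train(spam_mails, legit_mails):
    """Count word occurrences in each corpus (whitespace tokenization for simplicity)."""
    spam_counts = Counter(w for m in spam_mails for w in m.split())
    legit_counts = Counter(w for m in legit_mails for w in m.split())
    return spam_counts, legit_counts

def is_spam(mail, spam_counts, legit_counts, lam=2.0):
    """Judge spam if P(spam|M) > lam * P(legit|M), treating words as independent."""
    n_spam = sum(spam_counts.values()) or 1
    n_legit = sum(legit_counts.values()) or 1
    log_spam = log_legit = 0.0
    for word in mail.split():
        # Laplace smoothing so unseen words do not zero out the product
        log_spam += math.log((spam_counts[word] + 1) / (n_spam + 1))
        log_legit += math.log((legit_counts[word] + 1) / (n_legit + 1))
    return log_spam > math.log(lam) + log_legit

spam_counts, legit_counts = train(["earn money XXX"], ["meeting at noon"])
print(is_spam("easy money XXX", spam_counts, legit_counts))  # True
```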
11
K nearest neighbor
k-NN in a multi-dimensional feature space
  • a mail is mapped to a feature vector,
    ex) (XXX, adult, money) → (1, 1, 0)
  • compared against the spam/legit training sets
  • classified by its nearest neighbors → spam
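A minimal k-NN sketch of the same picture; the feature words, the training vectors, and k are assumed for illustration only, not taken from the paper.

```python
from collections import Counter

VOCAB = ["XXX", "adult", "money"]  # assumed feature words

def to_vector(mail):
    """Binary word-presence features, e.g. (XXX, adult, money) -> (1, 1, 0)."""
    words = set(mail.lower().split())
    return tuple(int(w.lower() in words) for w in VOCAB)

def knn_classify(mail, training, k=3):
    """training: list of (vector, label) pairs; majority vote among the k nearest vectors."""
    v = to_vector(mail)
    # Hamming distance between binary feature vectors
    nearest = sorted(training, key=lambda t: sum(a != b for a, b in zip(v, t[0])))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((1, 1, 0), "spam"), ((1, 0, 1), "spam"), ((0, 0, 0), "legit")]
print(knn_classify("XXX adult content", training))  # spam
```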
12
Characteristics of the previous 2 methods
  • merit
  • accurate enough
  • speed
  • demerit
  • ineffective against slightly changed words
  • ex) Viagra → Voiagkra
  • ineffective if spam includes many legitimate
    words
  • time-consuming training
  • no contribution to the reduction of spam flow

13
Challenge/Response algorithm
C/R algorithm between Alice (sender) and Bob (receiver)
  • Bob's MUA returns a challenge to the unknown sender
  • ex) computation cost, manual HTTP access
  • once Alice answers, Alice is added to the whitelist
  • the mail is then accepted
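To make the flow concrete, here is a minimal challenge/response sketch; the token-based challenge, the function names, and the addresses are hypothetical stand-ins for the computation-cost or manual-HTTP challenges mentioned above.

```python
import secrets

whitelist = set()
pending = {}  # sender address -> outstanding challenge token

def receive(sender, mail):
    """Accept mail from whitelisted senders; otherwise hold it and send back a challenge."""
    if sender in whitelist:
        return "accepted"
    token = secrets.token_hex(4)
    pending[sender] = token
    return f"challenge sent: reply with {token}"

def answer_challenge(sender, token):
    """A correct answer adds the sender to the whitelist, so later mail is accepted."""
    if pending.get(sender) == token:
        whitelist.add(sender)
        del pending[sender]
        return "whitelisted"
    return "rejected"

print(receive("alice@example.com", "hi Bob"))                               # challenge sent
print(answer_challenge("alice@example.com", pending["alice@example.com"]))  # whitelisted
print(receive("alice@example.com", "hi again"))                             # accepted
```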
14
Characteristics of the C/R algorithm
  • merit
  • imposes a cost on spammers
  • prevention of spammer disguise
  • → the spam business becomes unprofitable
  • demerit
  • requires user effort for certification
  • ineffective now (needs wide diffusion)

15
Reverse MX
  • new DNS record
  • a new DNS record type: RMX
  • IP/domain certification
  • process
  • the receiving MTA looks up the sender domain's RMX record
  • if the connecting IP differs from the IP in the RMX record
  • → filter (reject the mail)

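Reverse MX never became a deployed DNS record type, so the sketch below stands in for the DNS lookup with a plain dictionary (RMX_TABLE and the addresses are hypothetical); only the comparison step from the process above is shown.

```python
# Hypothetical RMX data standing in for a DNS lookup (RMX was only a proposal)
RMX_TABLE = {"example.com": {"192.0.2.10", "192.0.2.11"}}

def check_rmx(mail_from_domain, connecting_ip):
    """Reject the mail if the connecting IP is not among the domain's authorized senders."""
    authorized = RMX_TABLE.get(mail_from_domain)
    if authorized is None:
        return "no RMX record: accept (or treat as unsure)"
    return "accept" if connecting_ip in authorized else "reject"

print(check_rmx("example.com", "192.0.2.10"))   # accept
print(check_rmx("example.com", "203.0.113.5"))  # reject
```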
16
Characteristics of Reverse MX
  • merit
  • loose certification (IP/domain level)
  • prevention of spammer disguise (ex. MAIL FROM)
  • demerit
  • no contribution to the reduction of spam
  • (prevents only IP/domain disguise)
  • ineffective now (needs wide diffusion)

17
Distributed adaptive blacklists
  • how it works
  • users join a collaborative anti-spam network
  • share spam information globally
  • merit
  • very low false positives
  • accurate enough (scales with network size)
  • reduces each user's burden
  • demerit
  • filtering cost: uses a network connection
  • filtering speed: computation/connection cost

[fig. anti-spam network connecting hosts S, A, and B]
18
Filtering process
  • clients (users) judge whether a mail is spam or not
  • the server saves the information from clients
  • clients reflect the shared information in local lists
  • a digest value of the header is used as the information
    (a sketch of this exchange follows below)

[fig. S sends spam to client A; A reports the digest 1f3204d91a to the anti-spam network server and filters it out using the local list]
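A minimal sketch of that client/server exchange, with an in-memory set standing in for the shared server database; the function names are hypothetical, and the digest computation itself is sketched under the next slide.

```python
import hashlib

shared_spam_digests = set()  # stands in for the anti-spam network server

def report_spam(header_digest):
    """A client judged a mail as spam and reports its digest to the server."""
    shared_spam_digests.add(header_digest)

def is_known_spam(header_digest):
    """Other clients check incoming mail against the shared information."""
    return header_digest in shared_spam_digests

digest = hashlib.sha1(b"generalized header of some spam mail").hexdigest()
report_spam(digest)
print(is_known_spam(digest))  # True
```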
19
Calculating process of the digest value
  • extract only the header from the mail
  • read the extracted header line by line
  • generalize the read string
  • hash the generalized string (update the hash value)

how to calculate the digest value (example)
  <header> From: spam@spam.com  To: legit@legit.com  Received: from A ...
  each generalized line updates the hash value (123abc → 456def → ...)
  <body> "Do you want money? You can easily earn 1 mil per year!"
  digest of this mail: 1fe325a8d3913
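A minimal sketch of the digest calculation described above, assuming a simple generalization step (lowercasing and collapsing whitespace) and SHA-1; the actual generalization rules and hash function are not specified on the slides.

```python
import hashlib
import re

def generalize(line):
    """Assumed generalization: lowercase and collapse whitespace (the real rules are not given)."""
    return re.sub(r"\s+", " ", line.strip().lower())

def header_digest(raw_mail):
    """Hash the generalized header lines one by one, updating the digest as on the slide."""
    header = raw_mail.split("\n\n", 1)[0]           # extract only the header
    h = hashlib.sha1()
    for line in header.splitlines():
        h.update(generalize(line).encode("utf-8"))  # update the hash value line by line
    return h.hexdigest()

mail = "From: spam@spam.com\nTo: legit@legit.com\nReceived: from A\n\nDo you want money?"
print(header_digest(mail))
```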
20
2. Design of a method
21
Concept
[fig. route information taken from the header: senders A-E connect through relay MTAs F and G to receiver R]
  • unique information for each spam/legit mail
idea
  • ex) A → F: bad, B → F: good
  • 1. user decision: spam or legit
  • 2. extract route information (A → F → R: spam?, B → F → R: legit?)
  • 3. share this information
22
Characteristics
  • merit
  • contribution to the reduction of spam flow
  • responds to new spam by updating (distributed
    adaptive blacklists)
  • low cost compared with existing cost-consuming
    open blacklists
  • demerit
  • filtering cost: network connection cost
  • filtering speed: slow

23
Received in header
e.g. header
  Received: from <sending host> by <receiving host>
  the host in "by" receives mail from the host in "from"
  → route information
24
Why route information?
  • legitimate mail
  • direct connection → a few Received headers
  • .forward, mailing list → more Received headers
  • unique to the user: the nearest MTA
  • spam mail
  • open relay (indirect) → more Received headers
  • Received headers can be disguised
  • unique to the spammer: the spammer's MTA

different characteristics → apply them to filtering
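A minimal sketch of pulling route edges (from-host, by-host) out of Received headers; the regular expression and the sample headers are simplifications, since real Received lines vary considerably.

```python
import re

# Simplified pattern: real Received headers are far messier than this
RECEIVED_RE = re.compile(r"from\s+(\S+)\s.*?by\s+(\S+)", re.IGNORECASE | re.DOTALL)

def extract_edges(received_lines):
    """Return (from_host, by_host) edges, one per Received header."""
    edges = []
    for line in received_lines:
        m = RECEIVED_RE.search(line)
        if m:
            edges.append((m.group(1), m.group(2)))
    return edges

headers = [
    "Received: from F (f.example.org [198.51.100.7]) by R.example.jp",
    "Received: from A (unknown [203.0.113.9]) by F",
]
print(extract_edges(headers))  # [('F', 'R.example.jp'), ('A', 'F')]
```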
25
Edge in route information
  • 3 ways to share route information
  • entire route information
  • point (each MTA server)
  • edge, ex) (x, y): from x, by y
  • example (right fig.)
  • edges A → B, B → C, C → D
  • calculate the spam ratio of each edge from the shared lists
  • result: the maximum spam ratio

spam ratio per edge (fig.):
  A → B: spam ratio 0.7,  flow volume 5
  B → C: spam ratio 0.99, flow volume 100 (spam edge)
  C → D: spam ratio 0.4,  flow volume 1000
→ this mail is spam!
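A minimal sketch of the "maximum spam ratio over the mail's edges" decision shown in the figure; the edge statistics below are hypothetical stand-ins for the shared lists (integer counts approximating the figure's ratios).

```python
# Shared edge statistics: (from_host, by_host) -> (spam_count, legit_count); hypothetical numbers
EDGE_STATS = {
    ("A", "B"): (7, 3),      # spam ratio 0.7
    ("B", "C"): (99, 1),     # spam ratio 0.99  <- the spam edge
    ("C", "D"): (400, 600),  # spam ratio 0.4
}

def max_spam_ratio(edges):
    """The mail's score is the maximum spam ratio over its known edges."""
    ratios = []
    for edge in edges:
        stats = EDGE_STATS.get(edge)
        if stats:
            spam, legit = stats
            ratios.append(spam / (spam + legit))
    return max(ratios, default=0.0)

print(max_spam_ratio([("A", "B"), ("B", "C"), ("C", "D")]))  # 0.99 -> judged spam
```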
26
3. Implementation and Evaluation
27
Implementation
  • create header files
  • extract route information from Received
  • output as header files (training/test set)
  • create the training set
  • extract edges from the header files
  • output as an edge file
  • evaluation with the test set
  • check if each edge in the test set is in the edge file
  • flow volume / spam ratio over threshold?

[fig. training and test sets → HeaderClassifier module → RouteParser module → edge file → SpamTester module → result]
28
HeaderClassifier module
e.g. header file
<header>
<body>
29
RouteParser module
  • create the edge file
  • extract edges from the header files
  • output as an edge file
  • elements of the edge file
  • edge name: A to B
  • counts
  • spam count: 10
  • legit count: 90
  • flow volume: 100
  • spam ratio: 0.1

example (fig.)
  -- snip --
  Received: from B by A
  Received: from C by B
  -- snip --
  → edge between A and B: spam 10 / legit 90
30
SpamTester module
  • evaluation process
  • check if each edge in the test set is also in the edge
    file (training)
  • flow volume / spam ratio > threshold?
  • classification of results
  • Black (spam) / White (legit) / Grey (unsure)
  • evaluation criteria
  • is spam and the result is spam
  • is legit but the result is spam
  • is spam but the result is legit
  • is legit and the result is legit
  • (a sketch of this classification follows below)

→ spam or legit
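A minimal sketch of the Black/White/Grey classification using the thresholds tf (flow volume) and tr (spam ratio); the edge statistics, the per-mail decision rule (black wins, following the maximum-spam-ratio idea), and the function names are assumptions for illustration.

```python
def classify_edge(spam_count, legit_count, tf=30, tr=0.97):
    """Black = spam edge, White = legit edge, Grey = not enough flow volume."""
    flow = spam_count + legit_count
    if flow < tf:
        return "grey"
    return "black" if spam_count / flow >= tr else "white"

def classify_mail(edges, edge_stats, tf=30, tr=0.97):
    """Judge spam if any edge is black (max spam ratio rule), legit if any is white, else grey."""
    labels = {classify_edge(*edge_stats[e], tf, tr) for e in edges if e in edge_stats}
    if "black" in labels:
        return "spam"
    if "white" in labels:
        return "legit"
    return "grey"

stats = {("B", "C"): (99, 1), ("C", "D"): (400, 600)}
print(classify_mail([("A", "B"), ("B", "C"), ("C", "D")], stats))  # spam
```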
31
Evaluation experiment
  • data
  • spam/legit labels are given by user decision
  • training set: legit 7500, spam 7500
  • test set: legit 500, spam 500
  • evaluation process
  • training set size (nt): 1000 - 15000
  • flow volume threshold (tf): 10, 20, 30, 40, 50, 100
  • spam ratio threshold (tr): 0.9, 0.95, 0.97, 0.99, 1.0

32
(False) detection rate vs. nt (tf = 30)
[fig. detection rate (spam) and false detection rate (legit) vs. nt, one curve per spam ratio threshold]
  • the detection rate becomes higher as the
    training set size grows
  • non-zero false detection rate
  • an amount of both spam and legit flows through some edges

33
(False) detection rate vs. tr (nt = 15000)
[fig. detection rate (spam) and false detection rate (legit) vs. tr, one curve per flow volume threshold]
  • the detection rate becomes lower as the spam
    ratio threshold grows
  • non-zero false detection rate
  • the false detection rate becomes lower as the
    flow volume threshold grows

34
Black/white rate vs. nt (tf = 30)
[fig. black/white/grey rates for spam and legit vs. nt, one curve per spam ratio threshold]
  • black/white if flow volume > tf
  • grey if flow volume < tf
  • an unignorable amount of grey

35
Example of edges (from real data)
  • is spam and the result is spam (a spam edge)
  • from 81.96.136.126
  • by 133.11.8.7 (ms.is.s.u-tokyo.ac.jp)
  • spam ratio 1.0
  • is legit but the result is spam (bad for filtering accuracy;
    e.g. an unused account, a foolish mailing list)
  • from 133.11.12.9 (venus.is.s.u-tokyo.ac.jp)
  • by 133.11.79.31 (camel.logos.t.u-tokyo.ac.jp)
  • spam ratio 0.9334
  • is legit and the result is legit
  • from rpsmtp1.aist.go.jp
  • by 150.29.246.133
  • spam ratio 0.516
36
Summary
  • with tr = 0.90, 0.95 the detection rate seems accurate,
    but many legitimate edges are used for evaluation
    (as seen from the high false positive rate)
  • with tr = 0.97, 0.99 the detection rate is about 10-30%
    → valid spam edges are about 30% of the total at
    the very most
  • with tr = 1.0 the detection rate is about 10% →
    accuracy is not so good, but it is effective against
    thick spam edges
  • with tr = 0.97, 0.99, 1.0 there are no false positives (false
    detection rate 0) → an important merit of this
    method
  • the detection rate varies by a factor of a few depending
    on the training set → network size is a
    key factor
  • grey edges are about 20-40% of the total →
    many edges are narrow and spam

37
4. Conclusion and Future Works
38
Conclusion
  • design of a method based on route information
  • use edges instead of digest values
  • contributes to the reduction of spam flow
  • evaluation of the method
  • effective against thick spam edges
  • 0% false detection rate under some conditions

39
Some considerations on further development 1
more precise filtering: whitelist known-good edges
[fig. hosts A-K and receiver R; edges C → G and G → J are marked white]
40
Some considerations on further development 2
more precise filtering: aggregate narrow edges by domain
  <narrow edges>
  host1.someisp.com → a host
  host2.someisp.com → a host
  host3.someisp.com → a host
  host4.someisp.com → a host
  host5.someisp.com → a host
Many narrow spam edges! → aggregate them into one edge:
  someisp.com → a host
(see the sketch below)
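A minimal sketch of this aggregation idea: many narrow host-level edges from one ISP are collapsed into a single domain-level edge. The domain-extraction rule, the host names, and the counts are simplifications.

```python
from collections import defaultdict

def domain_of(host):
    """Simplified: keep the last two labels, e.g. host1.someisp.com -> someisp.com."""
    return ".".join(host.split(".")[-2:])

def aggregate_edges(edge_stats):
    """Merge (from_host, by_host) statistics into (from_domain, by_host) statistics."""
    merged = defaultdict(lambda: [0, 0])
    for (frm, by), (spam, legit) in edge_stats.items():
        key = (domain_of(frm), by)
        merged[key][0] += spam
        merged[key][1] += legit
    return dict(merged)

narrow = {("host%d.someisp.com" % i, "a.host.example"): (3, 0) for i in range(1, 6)}
print(aggregate_edges(narrow))  # {('someisp.com', 'a.host.example'): [15, 0]}
```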
41
Some considerations on further development 3
reduction of spam flow
[fig. hosts A-K and receiver R; mail on the spam edge is rejected before it reaches R]
42
End of presentation
Thank you!
43
Mertz's experiment
  • experiment by David Mertz
                   Good corpus        Spam corpus
  "The Truth"      1851 / 0           0 / 1916
  Trigram Model    1849 / 2           142 / 1774
  Word Model       1847 / 4           97 / 1819
  SpamAssassin     1846 / 5           358 / 1558
  Pyzor            1847 / 0 (4 err)   971 / 943 (2 err)

46
Three ways to share route information (fig.)
  • setting: a legit user sends via legit MTA a, a spammer via spam MTA b;
    both are relayed by open relay MTA c
  • point: everything via c is rejected (legit mail included)
  • edge: via b → c reject, via a → c accept
48
Future works
  • further development
  • ex. make a whitelist of valid edges
  • final goal
  • implementation of the system
  • → apply the proposed method to distributed
    adaptive blacklists
