Title: Design and evaluation of spam blocking method based on route information
Slide 1: Design and evaluation of spam blocking method based on route information
- SHIMOKAWA Yoshitaka
- Master 2nd
- Chikayama/Taura Lab.
Slide 2: Introduction
Slide 3: Consideration on spam problem
- Spam will cost over $10 billion in 2003
- Spam is expected to cost American corporations about $168 per employee for 2003
- 40% of e-mail traffic is spam
- Unique spam attacks rose to over 5 million by August 2002, up from 1.5 million in August 2001
- -- from Ferris Research
→ Social and economic problems
Slide 4: Position of this research
- purpose
  - design of a spam filtering method
- position
  - evaluate the effect and adequacy of the filtering method
- final goal
  - design of a system based on distributed adaptive blacklists
- novelty
  - use route information instead of digest value
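Slide 5: Brief of idea
[Figure: mail flows from the Sender side (hosts A to E) through relay hosts F and G to the Receiver R. ex) edge A → F is bad, edge B → F is good. Route A → F → R: spam? Route B → F → R: legit?]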
Slide 6: Overview
- Current research topics
- Design of a method
- Implementation and Evaluation
- Conclusion and Future works
Slide 7: 1. Current research topics
Slide 8: Common techniques to avoid spam
- prevention of open relay
  - limited relay
- use of blacklist
  - rejection of known spam hosts
  - reflected in local list manually
- confirmation of SMTP session validity
  - SMTP commands, ex) HELO, MAIL FROM
- text filtering
  - judged as spam if a specified word is in the mail
- effective enough?
  - No!
  - (in terms of accuracy and cost)
Slide 9: Classification of current research topics
[Figure: classification of approaches to "How to avoid spam?", annotated with the questions "About?", "How?", "What?" and "Problem?" and the group labels "Header / Body" and "Connection (negotiation)"]
- Bayesian: multiple attributes
- K nearest neighbor: multiple attributes
- Distributed adaptive blacklists: resolved by human resources
- Challenge/Response: negotiation between hosts
- Reverse MX: negotiation between MTA and DNS
Slide 10: Bayesian filtering
- assumption
  - spam seems to contain specific words like XXX, money
  - estimated from past spam/legit mail
- apply Bayes' theorem to spam filtering
  - if P(spam | M) > λ · P(legit | M) → spam
  - M: mail (consists of a_1, ..., a_n)
  - a_i: word
  - λ: weight value (legit is more important)
  - P(spam | M) is calculated from P(spam | a_i)
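The decision rule on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions, not the filter used in the cited research: it assumes a naive Bayes model (word independence) with add-one smoothing, and the names `train`, `log_score`, `is_spam` and `lam` (for the weight λ) are hypothetical.

```python
import math
from collections import Counter

def train(mails):
    """mails: list of (words, label), label is 'spam' or 'legit'."""
    word_counts = {"spam": Counter(), "legit": Counter()}
    mail_counts = Counter()
    for words, label in mails:
        mail_counts[label] += 1
        # Count each word once per mail (presence, not frequency).
        word_counts[label].update(set(words))
    return word_counts, mail_counts

def log_score(words, label, word_counts, mail_counts):
    # log P(label) + sum_i log P(a_i | label), with add-one smoothing.
    total = sum(mail_counts.values())
    score = math.log(mail_counts[label] / total)
    for w in set(words):
        score += math.log((word_counts[label][w] + 1) / (mail_counts[label] + 2))
    return score

def is_spam(words, word_counts, mail_counts, lam=1.0):
    # P(spam | M) > lam * P(legit | M), compared in log space.
    s = log_score(words, "spam", word_counts, mail_counts)
    l = log_score(words, "legit", word_counts, mail_counts)
    return s - l > math.log(lam)
```

A larger `lam` makes the filter more conservative, which matches the slide's remark that legitimate mail is more important.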
Slide 11: K nearest neighbor
- k-NN in multi-dimensional space
- ex) attributes (XXX, adult, money) → vector (1, 1, 0)
[Figure: the mail vector plotted among spam (training set) and legit (training set) points → classified as spam]
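As a rough sketch of the k-NN idea on this slide, the following Python function classifies a word-presence vector by majority vote among its nearest training points. The squared Euclidean distance, the value of `k`, and the function name `knn_classify` are assumptions chosen for illustration.

```python
from collections import Counter

def knn_classify(vector, training_set, k=3):
    """training_set: list of (vector, label) pairs."""
    def distance(a, b):
        # Squared Euclidean distance between two attribute vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Take the k training points closest to the mail vector.
    nearest = sorted(training_set, key=lambda item: distance(vector, item[0]))[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For example, with attributes (XXX, adult, money), the vector `(1, 1, 0)` is classified by the labels of the three training mails closest to it.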
Slide 12: Characteristics of the previous 2 methods
- merit
  - accurate enough
  - speed
- demerit
  - ineffective against slightly changed words
    - ex) Viagra → Voiagkra
  - ineffective if spam includes many legitimate words
  - time-consuming training
  - no contribution to reduction of spam flow
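Slide 13: Challenge/Response algorithm
[Figure: C/R algorithm between Alice and Bob's MUA, with elements labeled W and P. The challenge is, ex) a computation cost or a manual HTTP access. After a valid response, Alice is added to the whitelist and her mail is accepted.]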
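The challenge/response flow can be sketched roughly as below. This is a simplified model under assumptions: the class name `ChallengeResponseFilter`, the random token used as the challenge, and the pending queue are hypothetical, and a real system would deliver the challenge by mail and require actual work (computation or a manual HTTP access) from the sender.

```python
import secrets

class ChallengeResponseFilter:
    def __init__(self):
        self.whitelist = set()
        self.pending = {}  # challenge token -> (sender, held mail)

    def receive(self, sender, mail):
        # Mail from whitelisted senders is accepted immediately.
        if sender in self.whitelist:
            return ("accepted", None)
        # Otherwise hold the mail and issue a challenge token.
        token = secrets.token_hex(8)
        self.pending[token] = (sender, mail)
        return ("challenged", token)

    def respond(self, token):
        # A valid response whitelists the sender and releases the mail.
        if token not in self.pending:
            return None
        sender, mail = self.pending.pop(token)
        self.whitelist.add(sender)
        return mail
```

Once Alice has answered one challenge, her later mail passes `receive` without a new challenge.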
Slide 14: Characteristics of C/R algorithm
- merit
  - imposes cost on spammers
  - prevention of spammer disguise
  - → spam business is not profitable
- demerit
  - requires user effort for certification
  - ineffective now (needs diffusion)
Slide 15: Reverse MX
- new DNS record
  - new DNS record: RMX
  - IP/domain certification
- process
  - the connected MTA looks up the RMX record
  - if the connecting IP is different from the IP in RMX
  - → filter (reject the mail)
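The check described on this slide might look like the following sketch. Because RMX is a proposed record type, no real DNS query is made here: the dictionary `RMX_RECORDS` stands in for the DNS lookup, and the function name `rmx_check` and the example addresses are assumptions.

```python
# Stand-in for DNS: sender domain -> IP addresses authorized to send for it.
RMX_RECORDS = {
    "example.com": {"192.0.2.10", "192.0.2.11"},
}

def rmx_check(mail_from, connecting_ip, rmx_records=RMX_RECORDS):
    # Take the domain part of the MAIL FROM address.
    domain = mail_from.rsplit("@", 1)[-1].lower()
    allowed = rmx_records.get(domain)
    if allowed is None:
        # No RMX record published: the check cannot decide.
        return "unknown"
    # Reject when the connecting IP is not listed for the domain.
    return "accept" if connecting_ip in allowed else "reject"
```

The `"unknown"` result reflects the point that the method only helps once domains actually publish such records.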
Slide 16: Characteristics of Reverse MX
- merit
  - loose certification (IP/domain level)
  - prevention of spammer disguise (ex. MAIL FROM)
- demerit
  - no contribution to reduction of spam
  - (it only prevents IP/domain disguise)
  - ineffective now (needs diffusion)
Slide 17: Distributed adaptive blacklists
- how it works
  - users join a collaborative anti-spam network
  - share spam information globally
- merit
  - very low false positives
  - accurate enough (depending on network size)
  - reduces each user's burden
- demerit
  - filtering cost: uses network connection
  - filtering speed: computation/connection cost
[Figure: anti-spam network connecting server S with users A and B]
Slide 18: Filtering process
- clients (users) judge spam or not
- server saves information from clients
- clients reflect shared information in local lists
- use digest value of header as information
[Figure: anti-spam network with server S and client A; the server entry "spam 1f3204d91a" is reflected in the client's local list. Labels: out, local, local list.]
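The sharing steps on this slide can be modeled in a few lines. This is an in-memory sketch under assumptions: the classes `BlacklistServer` and `Client` and their methods are hypothetical, and a real anti-spam network would exchange the digests over a network connection.

```python
class BlacklistServer:
    def __init__(self):
        self.spam_digests = set()

    def report(self, digest):
        # Save a digest that some client judged as spam.
        self.spam_digests.add(digest)

    def snapshot(self):
        # Return a copy so clients cannot modify the shared list.
        return set(self.spam_digests)

class Client:
    def __init__(self, server):
        self.server = server
        self.local_list = set()

    def report_spam(self, digest):
        self.server.report(digest)

    def sync(self):
        # Reflect the shared information in the local list.
        self.local_list |= self.server.snapshot()

    def is_spam(self, digest):
        return digest in self.local_list
```

A digest reported by one client is filtered by another client only after that client has synchronized its local list.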
Slide 19: Calculating process of digest value
- extract only header from mail
- read extracted header line by line
- generalize read string
- hash generalized string (update hash value)
[Figure: how to calculate the digest value. The <header> lines (From: spam@spam.com, To: legit@legit.com, Received: from A) update the hash one by one (123abc → 456def → ...), giving the digest of this mail, 1fe325a8d3913. The <body> reads "Do you want money? You can easily earn 1 mil per year!"]
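The four steps above can be sketched with Python's `hashlib`. The slide does not say how a line is generalized or which hash is used, so the `generalize` rule (lowercase, collapse whitespace) and the choice of SHA-1 are assumptions, as are the function names.

```python
import hashlib

def generalize(line):
    # Assumption: normalize case and whitespace; the real rule is not given.
    return " ".join(line.lower().split())

def header_digest(raw_mail):
    # Step 1: extract only the header (everything before the first blank line).
    header, _, _body = raw_mail.partition("\n\n")
    h = hashlib.sha1()
    # Steps 2-4: read line by line, generalize, and update the hash.
    for line in header.splitlines():
        h.update(generalize(line).encode("utf-8"))
    return h.hexdigest()
```

Because the body is never hashed, two mails with the same header produce the same digest regardless of their text.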
Slide 20: 2. Design of a method
Slide 21: Concept
- idea: use information from the header that is unique to each spam/legit mail
  1. user decision: spam/legit
  2. extract route information
  3. share this information
[Figure: senders A to E, relay hosts F and G, receiver R. ex) edge A → F is bad, edge B → F is good. Route A → F → R: spam? Route B → F → R: legit?]
Slide 22: Characteristics
- merit
  - contribution to reduction of spam flow
  - respond to new spam by updating (distributed adaptive blacklists)
  - low cost compared with existing cost-consuming open blacklists
- demerit
  - filtering cost: network connection cost
  - filtering speed: slow
Slide 23: Received in header
[Figure: e.g. header]
- the host in "by" received the mail from the host in "from"
- → route information
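Extracting route information from the Received lines could be done roughly as follows. This sketch uses Python's standard `email` module; the regular expression only handles the simple "from X by Y" form, and the function name `extract_route` is hypothetical. Real Received lines vary widely and can be forged, so this is an illustration rather than a robust parser.

```python
import re
from email import message_from_string

# Matches the simple form "from <host> ... by <host>".
RECEIVED_RE = re.compile(
    r"\bfrom\s+([^\s;]+).*?\bby\s+([^\s;]+)",
    re.IGNORECASE | re.DOTALL,
)

def extract_route(raw_mail):
    """Return the list of (from_host, by_host) hops, sender side first."""
    msg = message_from_string(raw_mail)
    hops = []
    for value in msg.get_all("Received", []):
        m = RECEIVED_RE.search(value)
        if m:
            hops.append((m.group(1), m.group(2)))
    # Each MTA prepends its Received line, so the newest hop comes first.
    hops.reverse()
    return hops
```

Each returned pair corresponds to one hop of the mail, which is the raw material for the route-based filtering described in the following slides.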
Slide 24: Why route information?
- legitimate mail
  - direct connection → a few Received tags
  - .forward, mailing list → more Received tags
  - unique to user: nearest MTA
- spam mail
  - open relay (indirect) → more Received tags
  - Received tag can be disguised
  - unique to spammer: spammer MTA
→ different characteristics → apply to filtering
Slide 25: Edge in route information
- 3 ways to share route information
  - entire route information
  - point (each MTA server)
  - edge, ex) (x, y): from x, by y
- example (right fig.)
  - edges: A → B, B → C, C → D
  - calculate spam ratio from shared lists
  - result: max spam ratio
[Figure: spam ratio in edge]
edge     spam ratio   flow volume
A → B    0.7          5
B → C    0.99         100           (spam edge)
C → D    0.4          1000
Result: This mail is spam!
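The example on this slide can be reproduced with a small sketch. The function name `max_spam_ratio` and the representation of the shared list as a dictionary from edge to spam ratio are assumptions for illustration.

```python
def max_spam_ratio(route, shared_list):
    """route: list of hosts in order; shared_list: {(from, by): spam ratio}."""
    # Consecutive hosts on the route form the edges.
    edges = list(zip(route, route[1:]))
    # Only edges that appear in the shared list contribute.
    known = [shared_list[e] for e in edges if e in shared_list]
    return max(known) if known else None

# Values taken from the figure on this slide.
shared = {("A", "B"): 0.7, ("B", "C"): 0.99, ("C", "D"): 0.4}
result = max_spam_ratio(["A", "B", "C", "D"], shared)  # 0.99, from edge B -> C
```

Taking the maximum means a single strongly spam edge is enough to flag the whole route.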
Slide 26: 3. Implementation and Evaluation
Slide 27: Implementation
- create header file
  - extract route information from Received
  - output as header file (training/test set)
- create training set
  - extract edge from header file
  - output as edge file
- evaluation with test set
  - check if each edge in test set is in edge file
  - flow volume / spam rate?
[Figure: pipeline. HeaderClassifier module → training set, test set; RouteParser module → edge file; SpamTester module → result]
Slide 28: HeaderClassifier module
[Figure: e.g. header file, showing the <header> and <body> parts of a mail]
Slide 29: RouteParser module
- create edge file
  - extract from header file
  - output as edge file
- element in edge file
  - edge name: A to B
  - count
    - spam count: 10
    - legit count: 90
  - flow volume: 100
  - spam ratio: 0.1
- example
  - -- snip -- Received: from B by A, Received: from C by B -- snip --
[Figure: edge between A and B, labeled spam 10 / legit 90]
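A minimal sketch of how such an edge file could be built from labeled routes is shown below. The function name `build_edge_file`, the in-memory dictionary in place of a file, and the field names are assumptions; the slide only specifies which values an edge record holds.

```python
from collections import defaultdict

def build_edge_file(labeled_routes):
    """labeled_routes: iterable of (edges, label).

    edges is a list of (from_host, by_host); label is 'spam' or 'legit'.
    """
    counts = defaultdict(lambda: {"spam": 0, "legit": 0})
    for edges, label in labeled_routes:
        # Count each edge at most once per mail.
        for edge in set(edges):
            counts[edge][label] += 1

    edge_file = {}
    for edge, c in counts.items():
        flow = c["spam"] + c["legit"]  # flow volume = all mails seen on the edge
        edge_file[edge] = {
            "spam": c["spam"],
            "legit": c["legit"],
            "flow": flow,
            "spam_ratio": c["spam"] / flow,
        }
    return edge_file
```

With 10 spam and 90 legitimate mails observed on one edge, the record has flow volume 100 and spam ratio 0.1, as in the slide's example.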
Slide 30: SpamTester module
- evaluation process
  - check if each edge in test set is also in edge file (training)
  - flow volume / spam ratio > threshold?
  - → spam or legit
- classification of result
  - Black (spam) / White (legit) / Grey (unsure)
- evaluation criteria
  - is spam and the result is spam
  - is legit but the result is spam
  - is spam but the result is legit
  - is legit and the result is legit
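One possible reading of this classification is sketched below. The exact decision rule is not spelled out on the slide, so the following are assumptions: an edge is usable when its flow volume is at least `tf`, a mail is black when the maximum spam ratio over its usable edges is at least `tr`, white when usable edges exist but stay below `tr`, and grey otherwise. The names `classify` and `evaluate` are hypothetical.

```python
from collections import Counter

def classify(edges, edge_file, tf=30, tr=0.97):
    # Keep spam ratios of known edges with enough flow volume.
    ratios = [
        edge_file[e]["spam_ratio"]
        for e in edges
        if e in edge_file and edge_file[e]["flow"] >= tf
    ]
    if not ratios:
        return "grey"  # unsure: no edge with enough evidence
    return "black" if max(ratios) >= tr else "white"

def evaluate(test_set, edge_file, tf=30, tr=0.97):
    """test_set: iterable of (edges, true_label). Counts (true_label, result)."""
    outcome = Counter()
    for edges, label in test_set:
        outcome[(label, classify(edges, edge_file, tf, tr))] += 1
    return outcome
```

The pairs counted by `evaluate` correspond to the four evaluation criteria, for example `("legit", "black")` for a legitimate mail judged as spam.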
Slide 31: Evaluation experiment
- data
  - spam/legit is by user decision
  - training set: legit 7500, spam 7500
  - test set: legit 500, spam 500
- evaluation process
  - training set size (nt): 1000 to 15000
  - flow volume threshold (tf): 10, 20, 30, 40, 50, 100
  - spam ratio threshold (tr): 0.9, 0.95, 0.97, 0.99, 1.0
Slide 32: (False) detection rate vs. nt (tf = 30)
[Figure: two panels, spam and legit; curves are grouped by spam ratio]
- detection rate becomes higher as the training set size grows
- 0% false detection rate
- some edges carry a certain amount of both spam and legit mail
Slide 33: (False) detection rate vs. tr (nt = 15000)
[Figure: two panels, spam and legit; curves are grouped by flow volume]
- detection rate becomes lower as the spam ratio threshold increases
- 0% false detection rate
- false detection rate becomes lower as the flow volume threshold increases
Slide 34: Black/white rate vs. nt (tf = 30)
[Figure: two panels, spam and legit; curves are grouped by spam ratio]
- black/white if flow volume > tf
- grey if flow volume < tf
- non-negligible amount of grey
Slide 35: Example of edges (from real data)
- is spam and the result is spam
  - from 81.96.136.126
  - by 133.11.8.7 (ms.is.s.u-tokyo.ac.jp)
  - spam ratio: 1.0
- is legit but the result is spam
  - from 133.11.12.9 (venus.is.s.u-tokyo.ac.jp)
  - by 133.11.79.31 (camel.logos.t.u-tokyo.ac.jp)
  - spam ratio: 0.9334
- is legit and the result is legit
  - from rpsmtp1.aist.go.jp
  - by 150.29.246.133
  - spam ratio: 0.516
[Figure annotations: "spam edge"; "bad for filtering accuracy"; "0.96"; "unused account"; "foolish mailing list"]
Slide 36: Summary
- for tr = 0.90, 0.95, the detection rate seems accurate, but many legitimate edges are used for evaluation (as can be seen from the high false positive rate)
- for tr = 0.97, 0.99, the detection rate is about 10-30% → valid spam edges are about 30% of the total at the very most
- for tr = 1.0, the detection rate is about 10% → accuracy is not so good, but the method is effective against thick spam edges
- for tr = 0.97, 0.99, 1.0, there are no false positives (false detection rate 0%) → an important merit of this method
- the detection rate varies by a factor of a few depending on the training set → network size is a key factor
- grey edges are about 20-40% of the total → many edges are narrow spam edges
Slide 37: 4. Conclusion and Future Works
Slide 38: Conclusion
- design of a method based on route information
  - use edge instead of digest value
  - contribute to spam flow reduction
- evaluation of the method
  - effective against thick spam edges
  - 0% false detection rate in some conditions
Slide 39: Some consideration on further development 1
- more precise filtering
[Figure: network of hosts A to K leading to receiver R; edges such as C → G and G → J are marked "white"]
Slide 40: Some consideration on further development 2
- more precise filtering
- <narrow edges>
  - host1.someisp.com → a host
  - host2.someisp.com → a host
  - host3.someisp.com → a host
  - host4.someisp.com → a host
  - host5.someisp.com → a host
- Many narrow spam edges!
- merge into one edge: someisp.com → a host
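The idea of merging narrow edges by domain could be sketched as follows. The rule used in `parent_domain` (keep the last two labels of a hostname) is a simplifying assumption that does not handle IP addresses or multi-level suffixes, and both function names are hypothetical.

```python
from collections import defaultdict

def parent_domain(host):
    # Assumption: the last two labels identify the domain (hostnames only).
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) > 2 else host

def aggregate_by_domain(edge_counts):
    """edge_counts: {(from_host, by_host): {'spam': n, 'legit': m}}."""
    merged = defaultdict(lambda: {"spam": 0, "legit": 0})
    for (src, dst), c in edge_counts.items():
        # Replace the sending host by its domain and sum the counts.
        key = (parent_domain(src), dst)
        merged[key]["spam"] += c["spam"]
        merged[key]["legit"] += c["legit"]
    return dict(merged)
```

Five narrow edges from `host1.someisp.com` to `host5.someisp.com` then become one thicker edge from `someisp.com`, whose flow volume is more likely to exceed a threshold.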
Slide 41: Some consideration on further development 3
- reduction of spam flow
[Figure: network of hosts A to K leading to receiver R; one edge is marked "spam edge" and labeled "reject"]
Slide 42: End of presentation
Thank you!
Slide 43: Mertz's experiment
- experiment by David Mertz
Technique        Good Corpus         Spam Corpus
"The Truth"      1851 / 0            0 / 1916
Trigram Model    1849 / 2            142 / 1774
Word Model       1847 / 4            97 / 1819
SpamAssassin     1846 / 5            358 / 1558
Pyzor            1847 / 0 (4 err)    971 / 943 (2 err)
Slide 44: Dist
Slide 45: For concept making
- idea: use information from the header that is unique to each spam/legit mail
  1. user decision: spam/legit
  2. extract route information
  3. share this information
[Figure: senders A to E, relay hosts F and G, receiver R. Route A → F → R: spam? Route B → F → R: legit?]
Slide 46: Three ways to share route information (fig.)
[Figure: two panels, "edge" and "point". In both, a legit user sends via legit MTA a, a spammer sends via spam MTA b, and both routes pass through open relay MTA c.]
- edge: via b-c → reject, via a-c → accept
- point: via c → reject
Slide 47: Evaluation method
- create header file
  - extract route information from Received
  - output as header file (training/test set)
- create training set
  - extract edge from header file
  - output as edge file
- evaluation with test set
  - check if each edge in test set is in edge file
  - flow volume / spam rate?
Slide 48: Future works
- further development
  - ex. make whitelist for valid edges
- final goal
  - implementation of system
  - → applying proposed method to distributed adaptive blacklists