Comment Spam Identification - PowerPoint PPT Presentation

Slides: 48
Provided by: eri5155
1
Comment Spam Identification
  • Eric Cheng, Eric Steinlauf

2
What is comment spam?
5
Total spam: 1,226,026,178
Total ham: 62,723,306
95% are spam!
Source: http://akismet.com/stats/ (retrieved 4/22/2007)
6
Countermeasures
7
Blacklisting
  • 5yx.org
  • 9kx.com
  • aakl.com
  • aaql.com
  • aazl.com
  • abcwaynet.com
  • abgv.com
  • abjg.com
  • ablazeglass.com
  • abseilextreme.net
  • actionbenevole.com
  • acvt.com
  • adbx.com
  • adhouseaz.com
  • advantechmicro.com
  • aeur.com
  • aeza.com
  • agentcom.com
  • ailh.org
  • globalplasticscrap.com
  • gowest-veritas.com
  • greenlightgo.org
  • hadjimitsis.com
  • healthcarefx.com
  • herctrade.com
  • hobbyhighway.com
  • hominginc.com
  • hongkongdivas.com
  • hpspyacademy.com
  • hzlr.com
  • idlemindsonline.com
  • internetmarketingserve.com
  • jesh.org
  • jfcp.com
  • jfss.com
  • jittersjapan.com
  • jkjf.com
  • jkmrw.com

  • rockymountainair.org
  • rstechresources.com
  • samsung-integer.com
  • sandiegonhs.org
  • screwpile.org
  • scvend.org
  • sell-in-china.com
  • sensationalwraps.com
  • sevierdesign.com
  • starbikeshop.com
  • struthersinc.com
  • swarangeet.com
  • thecorporategroup.net
  • thehawleyco.com
  • thehumancrystal.com
  • thinkaids.org
  • thisandthatgiftshop.net
  • thomsungroup.com
  • ti0.org
  • timeby.net
  • tradewindswf.com
  • tradingb2c.com
  • turkeycogroup.net
  • vassagospalace.com
  • vyoung.net
  • web-toggery.com
  • webedgewars.com
  • webshoponsalead.com
  • webtoggery.com
  • willman-paris.com
  • worldwidegoans.com
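A blacklist check of this kind can be sketched in a few lines of Python. This is only an illustration, not the presenters' code: it assumes comments are plain text containing http/https links, the `BLACKLIST` set holds just three domains from the list above, and the helper name `is_blacklisted` and the simplified URL pattern are hypothetical.

```python
import re
from urllib.parse import urlparse

# A few domains from the slide; a real list would be far longer.
BLACKLIST = {"5yx.org", "9kx.com", "aakl.com"}

# Simplified URL matcher (assumption): anything starting with http:// or https://
URL_PATTERN = re.compile(r"https?://[^\s<>]+")

def is_blacklisted(comment: str) -> bool:
    """Return True if any link in the comment points at a blacklisted domain."""
    for url in URL_PATTERN.findall(comment):
        host = (urlparse(url).hostname or "").lower()
        # Match the domain itself and any subdomain of it.
        if any(host == d or host.endswith("." + d) for d in BLACKLIST):
            return True
    return False
```

A comment linking to `http://www.9kx.com/poker` would be flagged, while a spammer who registers a fresh domain passes unnoticed, which is the usual weakness of blacklisting.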
8
Captchas
  • "Completely Automated Public Turing test to tell
    Computers and Humans Apart"

9
Other ad-hoc/weak methods
  • Authentication / registration
  • Comment throttling
  • Disallowing links in comments
  • Moderation

10
Our Approach: Naïve Bayes
  • Statistical
  • Adaptive
  • Automatic
  • Scalable and extensible
  • Works well for spam e-mail

11
Naïve Bayes
12
$P(A \mid B)\,P(B) = P(B \mid A)\,P(A) = P(A \cap B)$
13
$P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
14
$P(A \mid B) = P(B \mid A)\,P(A) / P(B)$
15
$P(\text{spam} \mid \text{comment}) = P(\text{comment} \mid \text{spam})\,P(\text{spam}) / P(\text{comment})$
17
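$P(\text{spam} \mid \text{comment}) = P(w_1 \mid \text{spam})\,P(w_2 \mid \text{spam}) \cdots P(w_n \mid \text{spam})\,P(\text{spam}) / P(\text{comment})$ (naïve assumption)
$P(w_1 \mid \text{spam})$: probability of $w_1$ occurring given a spam comment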
18
Corpus
  • Texas casino
  • Online Texas holdem
Incoming Comment
  • Texas gambling site
$P(\text{Texas} \mid \text{spam}) = 1 - (1 - 2/5)^3 = 0.784$
$P(w_1 \mid \text{spam}) = 1 - (1 - x/y)^n$
Probability of $w_1$ occurring given a spam comment,
where $x$ is the number of times $w_1$ appears in
all spam messages, $y$ is the total number of
words in all spam messages, and $n$ is the
length of the given comment
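The worked example on this slide can be reproduced with a short Python sketch. It assumes the spam corpus is exactly the two messages shown, that words are split on whitespace and compared case-insensitively, and that the corpus is non-empty. The function name `word_likelihood` is hypothetical.

```python
def word_likelihood(word, spam_messages, comment_length):
    """Estimate P(word | spam) as 1 - (1 - x/y)^n."""
    spam_words = [w.lower() for msg in spam_messages for w in msg.split()]
    x = spam_words.count(word.lower())   # occurrences of the word in all spam
    y = len(spam_words)                  # total number of words in all spam
    n = comment_length                   # number of words in the incoming comment
    return 1 - (1 - x / y) ** n

corpus = ["Texas casino", "Online Texas holdem"]
incoming = "Texas gambling site"
p = word_likelihood("Texas", corpus, len(incoming.split()))
print(round(p, 3))  # 0.784
```

Here x = 2, y = 5, and n = 3, which matches the 0.784 on the slide.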
19
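$P(\text{spam} \mid \text{comment}) = P(w_1 \mid \text{spam})\,P(w_2 \mid \text{spam}) \cdots P(w_n \mid \text{spam})\,P(\text{spam}) / P(\text{comment})$
$P(w_1 \mid \text{spam})$: probability of $w_1$ occurring given a spam comment
20
$P(\text{spam} \mid \text{comment}) = P(w_1 \mid \text{spam})\,P(w_2 \mid \text{spam}) \cdots P(w_n \mid \text{spam})\,P(\text{spam}) / P(\text{comment})$
$P(w_1 \mid \text{spam})$: probability of $w_1$ occurring given a spam comment
$P(\text{spam})$: probability of something being spam
21
$P(\text{spam} \mid \text{comment}) = P(w_1 \mid \text{spam})\,P(w_2 \mid \text{spam}) \cdots P(w_n \mid \text{spam})\,P(\text{spam}) / P(\text{comment})$
$P(w_1 \mid \text{spam})$: probability of $w_1$ occurring given a spam comment
$P(\text{spam})$: probability of something being spam
$P(\text{comment})$: ??????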
22
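$P(\text{ham} \mid \text{comment}) = P(w_1 \mid \text{ham})\,P(w_2 \mid \text{ham}) \cdots P(w_n \mid \text{ham})\,P(\text{ham}) / P(\text{comment})$
$P(\text{spam} \mid \text{comment}) = P(w_1 \mid \text{spam})\,P(w_2 \mid \text{spam}) \cdots P(w_n \mid \text{spam})\,P(\text{spam}) / P(\text{comment})$
$P(w_1 \mid \text{spam})$: probability of $w_1$ occurring given a spam comment
$P(\text{spam})$: probability of something being spam
$P(\text{comment})$: ??????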
23
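$P(\text{ham} \mid \text{comment}) \propto P(w_1 \mid \text{ham})\,P(w_2 \mid \text{ham}) \cdots P(w_n \mid \text{ham})\,P(\text{ham})$
$P(\text{spam} \mid \text{comment}) \propto P(w_1 \mid \text{spam})\,P(w_2 \mid \text{spam}) \cdots P(w_n \mid \text{spam})\,P(\text{spam})$
$P(w_1 \mid \text{spam})$: probability of $w_1$ occurring given a spam comment
$P(\text{spam})$: probability of something being spam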
24
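$\log P(\text{ham} \mid \text{comment}) = \log\big(P(w_1 \mid \text{ham})\,P(w_2 \mid \text{ham}) \cdots P(w_n \mid \text{ham})\,P(\text{ham})\big) + C$
$\log P(\text{spam} \mid \text{comment}) = \log\big(P(w_1 \mid \text{spam})\,P(w_2 \mid \text{spam}) \cdots P(w_n \mid \text{spam})\,P(\text{spam})\big) + C$
where $C = -\log P(\text{comment})$ is the same constant for both classes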
25
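$\log P(\text{ham} \mid \text{comment}) = \log P(w_1 \mid \text{ham}) + \log P(w_2 \mid \text{ham}) + \cdots + \log P(w_n \mid \text{ham}) + \log P(\text{ham}) + C$
$\log P(\text{spam} \mid \text{comment}) = \log P(w_1 \mid \text{spam}) + \log P(w_2 \mid \text{spam}) + \cdots + \log P(w_n \mid \text{spam}) + \log P(\text{spam}) + C$
where $C = -\log P(\text{comment})$ is the same constant for both classes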
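The log sums on this slide translate into a few lines of Python. The sketch below is only illustrative: the word probabilities and priors are made-up values, the function name `log_score` is hypothetical, and the small `floor` probability for unseen words is an assumption (the slides do not say how unseen words are handled).

```python
import math

def log_score(words, word_probs, prior, floor=1e-6):
    """Return log(prior) plus the sum of log P(w | class) over the words."""
    total = math.log(prior)
    for w in words:
        # Words unseen in training get a small floor probability (assumption).
        total += math.log(word_probs.get(w, floor))
    return total

# Hypothetical per-class word probabilities and class priors
p_word_spam = {"texas": 0.784, "casino": 0.5}
p_word_ham = {"texas": 0.05, "casino": 0.01}

words = ["texas", "casino"]
spam_score = log_score(words, p_word_spam, prior=0.67)
ham_score = log_score(words, p_word_ham, prior=0.33)
```

Both scores omit the shared $\log P(\text{comment})$ term, so only their difference is meaningful.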
26
Fact
$P(\text{spam} \mid \text{comment}) = 1 - P(\text{ham} \mid \text{comment})$
Abuse of notation
$P(s) = P(\text{spam} \mid \text{comment})$, $P(h) = P(\text{ham} \mid \text{comment})$
27
$P(s) = 1 - P(h)$
$m = \log P(s) - \log P(h) = \log(P(s)/P(h))$
$e^m = e^{\log(P(s)/P(h))} = P(s)/P(h)$
$e^m\,P(h) = P(s)$
28
$P(s) = 1 - P(h)$
$e^m\,P(h) = P(s)$
$e^m\,P(h) = 1 - P(h)$
$(e^m + 1)\,P(h) = 1$
$P(h) = 1/(e^m + 1)$, $P(s) = 1 - P(h)$
$m = \log P(s) - \log P(h)$
29
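$P(h) = 1/(e^m + 1)$, $P(s) = 1 - P(h)$
$m = \log P(s) - \log P(h)$
30
$P(\text{ham} \mid \text{comment}) = 1/(e^m + 1)$, $P(\text{spam} \mid \text{comment}) = 1 - P(\text{ham} \mid \text{comment})$
$m = \log P(\text{spam} \mid \text{comment}) - \log P(\text{ham} \mid \text{comment})$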
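As a minimal sketch, the conversion from the log difference m back to probabilities could look like this in Python. The function name `posteriors_from_log_scores` is hypothetical, and the branch that avoids overflow for large m is an addition that does not appear on the slides.

```python
import math

def posteriors_from_log_scores(log_spam, log_ham):
    """Return (P(spam|comment), P(ham|comment)) from the two log scores."""
    m = log_spam - log_ham
    if m > 0:
        # Rewrite 1/(e^m + 1) as e^-m / (1 + e^-m) so exp() cannot overflow.
        p_ham = math.exp(-m) / (1.0 + math.exp(-m))
    else:
        p_ham = 1.0 / (math.exp(m) + 1.0)
    return 1.0 - p_ham, p_ham
```

Because m is a difference of the two scores, any constant added to both of them (such as the omitted $\log P(\text{comment})$ term) cancels out.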
31
In practice, just compare
$\log P(\text{ham} \mid \text{comment})$ and $\log P(\text{spam} \mid \text{comment})$
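One practical reason for comparing logs, which the slide does not state, is numerical: the product of many small probabilities underflows to zero in floating point, while the sum of their logs does not. The sketch below illustrates this with made-up likelihoods for a hypothetical 400-word comment. It needs Python 3.8 or later for `math.prod`.

```python
import math

# Hypothetical comment of 400 words, each with a small per-class likelihood
spam_likelihoods = [1e-3] * 400
ham_likelihoods = [1e-4] * 400

# Direct products underflow to 0.0 in double precision, so they cannot be compared.
spam_product = math.prod(spam_likelihoods)
ham_product = math.prod(ham_likelihoods)

# Sums of logs stay finite and preserve the ordering.
spam_log = sum(math.log(p) for p in spam_likelihoods)
ham_log = sum(math.log(p) for p in ham_likelihoods)

label = "spam" if spam_log > ham_log else "ham"
```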
32
Implementation
33
Corpus
  • A collection of 50 blog pages with 1024 comments
  • Manually tagged as spam/non-spam
  • 67% are spam
  • Provided by the Informatics Institute at
    University of Amsterdam

Blocking Blog Spam with Language Model
Disagreement, G. Mishne, D. Carmel, and R.
Lempel. In AIRWeb '05 - First International
Workshop on Adversarial Information Retrieval on
the Web, at the 14th International World Wide Web
Conference (WWW2005), 2005.
34
Most popular spam words
Word        Spam        Ham
casino      0.999918    0.000082076
betting     0.999879    0.000120513
texas       0.999813    0.000187148
biz         0.999776    0.000223708
holdem      0.999738    0.000262111
poker       0.999551    0.000448675
pills       0.999527    0.000473407
pokerabc    0.999506    0.000493821
teen        0.999455    0.000544715
online      0.999455    0.000544715
bowl        0.999437    0.000562555
gambling    0.999437    0.000562555
sonneries   0.999353    0.000647359
blackjack   0.999346    0.000653516
pharmacy    0.999254    0.000745723
35
Clean words
Word        Spam          Ham
edu         0.00287339    0.997127
projects    0.00270528    0.997295
week        0.00270528    0.997295
etc         0.00270528    0.997295
went        0.00270528    0.997295
inbox       0.00270528    0.997295
bit         0.00270528    0.997295
someone     0.00255576    0.997444
bike        0.00230136    0.997699
already     0.00230136    0.997699
selling     0.00219225    0.997808
making      0.00209302    0.997907
squad       0.00184278    0.998157
left        0.00177216    0.998228
important   0.0013973     0.998603
pimps       0.000427782   0.999572
36
Implementation
  • Corpus parsing and processing
  • Naïve Bayes algorithm
  • Randomly select 70% for training, 30% for testing
  • Stand-alone web service
  • Written entirely in Python
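The random training/testing split in the list above might be done along these lines. This is a sketch under stated assumptions, not the presenters' code: the corpus is modelled as a list of `(text, is_spam)` pairs filled with placeholder data, and the function name `split_corpus` and its `seed` parameter are hypothetical.

```python
import random

def split_corpus(comments, train_fraction=0.7, seed=None):
    """Shuffle the tagged comments and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(comments)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder corpus of 1024 (text, is_spam) pairs
corpus = [(f"comment {i}", i % 3 != 0) for i in range(1024)]
train, test = split_corpus(corpus, seed=0)
```

With 1024 comments this yields 716 training and 308 test examples. Fixing the seed makes a run repeatable, which helps when comparing configurations.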

37
It's showtime!
38
Configurations
  • Separator used to tokenize comment
  • Inclusion of words from header
  • Classify based only on most significant words
  • Double count non-spam comments
  • Include article body as non-spam example
  • Boosting
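To illustrate the first configuration option, the sketch below tokenizes a comment with a configurable separator pattern. The default pattern `[^a-z]+` (split on anything that is not a lowercase letter) and the function name `tokenize` are assumptions chosen for illustration. They are not necessarily the settings used in the experiments.

```python
import re

def tokenize(comment, separator=r"[^a-z]+"):
    """Lowercase the comment and split it on the separator pattern."""
    tokens = re.split(separator, comment.lower())
    return [t for t in tokens if t]   # drop empty strings at the edges

print(tokenize("Online Texas Hold'em -- visit poker-abc.biz!"))
# ['online', 'texas', 'hold', 'em', 'visit', 'poker', 'abc', 'biz']
```

The choice of separator changes which features the classifier sees: splitting on punctuation turns `poker-abc.biz` into three separate words, while a whitespace-only separator would keep it as one token.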

39
Minimum Error Configuration
  • Separator: a-z<>
  • Header: Both
  • Significant words: All
  • Double count: No
  • Include body: No
  • Boosting: No

40
Varying Configuration Parameters
41
Varying Configuration Parameters
42
Boosting
  • Naïve Bayes is applied repeatedly to the data.
  • Produces Weighted Majority Model

bayesModels = empty list
weights = vector(1)
for i in 1 to M:
    model = naiveBayes(examples, weights)
    error = computeError(model, examples)
    weights = adjustWeights(examples, weights, error)
    bayesModels[i] = (model, error)
    if error == 0: break
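The pseudocode above stores each model together with its error, which is what a weighted majority vote needs at classification time. The slides do not give the vote weights, so the sketch below assumes the AdaBoost convention of weighting each model by log((1 - error) / error). The function name `weighted_majority` and the toy lambda "models" are hypothetical stand-ins for trained Naïve Bayes classifiers.

```python
import math

def weighted_majority(models, comment, eps=1e-10):
    """Combine (model, error) pairs into one spam/ham decision.

    Each model is a callable returning True for spam. Its vote is weighted
    by log((1 - error) / error), the AdaBoost convention (an assumption here).
    """
    total = 0.0
    for model, error in models:
        error = min(max(error, eps), 1 - eps)   # keep the log finite
        alpha = math.log((1 - error) / error)
        total += alpha if model(comment) else -alpha
    return total > 0

# Toy stand-ins for trained classifiers, each paired with its training error
models = [
    (lambda c: "casino" in c, 0.10),
    (lambda c: "http" in c, 0.40),
    (lambda c: len(c) > 200, 0.45),
]
```

With these toy models, `weighted_majority(models, "best casino online")` is True because the low-error first model outweighs the other two.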
43
Boosting
44
Future work (or what we did not do)
45
Data Processing
  • Follow links in comment and include words in
    target web page
  • More sophisticated tokenization and URL handling
    (handling 100,000...)
  • Word stemming

46
Features
  • Ability to incorporate incoming comments into
    corpus
  • Ability to mark comment as spam/non-spam
  • Assign more weight on page content
  • Adjust probability table based on page content,
    providing content-sensitive filtering

47
Comments?
No spam, please.