Title: Know your Neighbors: Web Spam Detection Using the Web Topology
Slide 1: Know your Neighbors: Web Spam Detection Using the Web Topology
Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa Murdock(1), Fabrizio Silvestri(2)
(1) Yahoo! Research Barcelona, Catalunya, Spain
(2) ISTI-CNR, Pisa, Italy
ACM SIGIR, 25 July 2007, Amsterdam
Presented by: Soumo Gorai
Slide 2: Soumo's Biography
- 4th-year CS major
- Graduating May 2008
- About me: lived in India, Australia, and the U.S.
- CS interests: databases, HCI, web programming, networking, graphics, gaming
Slide 3: Here's all that you can find on the web. NOT!
Slide 4: Here's just some of what really is out there
Slide 5: And more.
Slide 6: Why so many different things?
There is fierce competition for your attention! Publication on the web is easy, for personal publishing as well as commercial publishing, advertisements, and economic activity.
And there's lots and lots and lots of spam!
Slide 7: What's Spam?!
Slide 8: Hidden Text
Slide 9: Only hidden text? Here's a whole fake search engine!
Slide 10: Why is Spam Bad?
Costs:
- Costs for users: lower precision for some queries
- Costs for search engines: wasted storage space, network resources, and processing cycles
- Costs for publishers: resources invested in cheating rather than in improving their content
Every undeserved gain in ranking for a spammer is a loss of search precision for the search engine.
Slide 11: How Do We Detect Spam?
- Machine learning/training
- Link-based detection
- Content-based detection
- Using links and contents
- Using Web topology
Slide 12: Machine Learning/Training
Slide 13: ML Challenges
- Machine learning challenges:
  - Instances are not really independent (they form a graph)
  - The training set is relatively small
- Information retrieval challenges:
  - It is hard to determine which features are relevant
  - It is hard for search engines to provide labeled data
  - Even if they do, the labels will not reflect a consensus on what Web spam is
Slide 14: Link-based Detection
Single-level link farms can be detected by searching for groups of nodes that share their out-links [Gibson et al., 2005].
Slide 15: Why use it?
- Degree-related measures
- PageRank
- TrustRank [Gyongyi et al., 2004]
- Truncated PageRank [Becchetti et al., 2006]: similar to PageRank, but it limits how much of a page's score comes from its close neighbors. The Truncated PageRank score is a useful feature for spam detection because spam pages generally try to reinforce their PageRank scores by linking to each other.
Slide 16: Degree-based
Measures related to in-degree and out-degree:
- Edge reciprocity: the number of links that are reciprocal
- Assortativity: the ratio between the degree of a particular page and the average degree of its neighbors
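These two degree-based features can be computed directly from the link graph. The sketch below uses illustrative names and measures reciprocity as the fraction of a page's out-links that are reciprocated, one common normalization (the slide counts them; either form works as a feature).

```python
def degree_features(adj):
    """Sketch of edge-reciprocity and assortativity features.
    adj maps page -> set of out-linked pages (hypothetical structure)."""
    # build the in-link map
    inl = {v: set() for v in adj}
    for v, outs in adj.items():
        for w in outs:
            inl.setdefault(w, set()).add(v)
    feats = {}
    for v, outs in adj.items():
        degree = len(outs) + len(inl[v])
        # fraction of out-links that are reciprocated (v->w and w->v)
        recip = sum(1 for w in outs if v in adj.get(w, set()))
        reciprocity = recip / len(outs) if outs else 0.0
        # page degree divided by the average degree of its neighbors
        nbrs = outs | inl[v]
        if nbrs:
            avg = sum(len(adj.get(u, set())) + len(inl.get(u, set()))
                      for u in nbrs) / len(nbrs)
            assortativity = degree / avg if avg else 0.0
        else:
            assortativity = 0.0
        feats[v] = {'reciprocity': reciprocity, 'assortativity': assortativity}
    return feats
```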
Slide 17: TrustRank / PageRank
TrustRank: an algorithm that starts from a set of trusted nodes and, building on PageRank, measures the degree of relationship each page has with the known trusted pages; this gives each page a TrustRank score. Derived features:
- Ratio between TrustRank and PageRank
- Number of home pages
Cons: this alone is not sufficient, as there are many false positives.
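TrustRank [Gyongyi et al., 2004] is essentially PageRank with a biased teleport vector: instead of jumping to a random page, the walk restarts only at trusted seed pages, so trust flows outward along links from the seeds. A minimal sketch, with illustrative names:

```python
import numpy as np

def trustrank(adj, trusted, alpha=0.85, iters=50):
    """TrustRank sketch: PageRank whose teleport vector is concentrated
    on a seed set of trusted pages. Not the authors' implementation."""
    nodes = list(adj)
    n = len(nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    M = np.zeros((n, n))
    for v, outs in adj.items():
        if outs:
            for w in outs:
                M[idx[w], idx[v]] = 1.0 / len(outs)
        else:
            M[:, idx[v]] = 1.0 / n
    # teleport only to trusted seeds instead of uniformly to all pages
    d = np.zeros(n)
    for v in trusted:
        d[idx[v]] = 1.0 / len(trusted)
    t = d.copy()
    for _ in range(iters):
        t = alpha * (M @ t) + (1 - alpha) * d
    return dict(zip(nodes, t))
```

Pages unreachable from the seed set end up with zero trust; a low TrustRank-to-PageRank ratio then flags pages whose rank is not backed by trusted links.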
Slide 18: Content-based Detection
Most of the features reported in [Ntoulas et al., 2006]:
- Number of words in the page and title
- Average word length
- Fraction of anchor text
- Fraction of visible text
- Compression rate
- Corpus precision and corpus recall
- Query precision and query recall
- Independent trigram likelihood
- Entropy of trigrams
Slide 19: Corpus and Query
- F: the set of most frequent terms in the collection
- Q: the set of most frequent terms in a query log
- P: the set of terms in a page
Computed features:
- Corpus precision: the fraction of words (excluding stopwords) in a page that appear in the set of popular terms of the data collection.
- Corpus recall: the fraction of popular terms of the data collection that appear in the page.
- Query precision: the fraction of words in a page that appear in the set of the q most popular terms in a query log.
- Query recall: the fraction of the q most popular terms of the query log that appear in the page.
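With P, F, and Q as sets of terms, all four features are set-overlap ratios. A minimal sketch (illustrative function and parameter names):

```python
def corpus_query_features(page_terms, corpus_top, query_top,
                          stopwords=frozenset()):
    """Sketch of the four precision/recall features.
    page_terms = P, corpus_top = F, query_top = Q (all sets of terms)."""
    p = set(page_terms) - stopwords   # page terms, stopwords removed
    f, q = set(corpus_top), set(query_top)
    return {
        'corpus_precision': len(p & f) / len(p) if p else 0.0,
        'corpus_recall':    len(p & f) / len(f) if f else 0.0,
        'query_precision':  len(p & q) / len(p) if p else 0.0,
        'query_recall':     len(p & q) / len(q) if q else 0.0,
    }
```

Intuitively, keyword-stuffed spam pages pack in many popular query terms, which pushes their query precision and recall far above those of typical pages.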
Slide 20: Visual Clues
Figure: Histogram of the corpus precision in non-spam vs. spam pages.
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
Slide 21: Links AND Contents Detection
Why both?
Slide 22: Web Topology Detection
- Pages topologically close to each other are more likely to have the same label (spam/non-spam) than random pairs of pages.
- Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000].
- Spam tends to be clustered on the Web (shown in black in the figure).
Slide 23: Topological dependencies: in-links and out-links
Let SOUT(x) be the fraction of spam hosts linked to by host x, out of all labeled hosts linked to by host x. The first figure shows the histogram of SOUT for spam and non-spam hosts. We see that almost all non-spam hosts link mostly to non-spam hosts.
Let SIN(x) be the fraction of spam hosts that link to host x, out of all labeled hosts that link to x. The second figure shows the histograms of SIN for spam and non-spam hosts. In this case there is a clear separation between spam and non-spam hosts.
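SOUT(x) and SIN(x) follow directly from the definitions above: restrict each host's out-/in-neighborhood to labeled hosts, then take the spam fraction. A sketch with illustrative names (label strings are assumptions, not the authors' encoding):

```python
def spam_link_fractions(adj, labels):
    """SOUT(x) and SIN(x) sketch: fraction of the labeled out-/in-
    neighbors of host x that are spam. labels maps a host to
    'spam'/'nonspam'; unlabeled hosts are ignored."""
    inl = {v: set() for v in adj}
    for v, outs in adj.items():
        for w in outs:
            inl.setdefault(w, set()).add(v)

    def spam_frac(neighbors):
        labeled = [u for u in neighbors if u in labels]
        if not labeled:
            return 0.0
        return sum(1 for u in labeled if labels[u] == 'spam') / len(labeled)

    return {v: {'SOUT': spam_frac(adj[v]), 'SIN': spam_frac(inl[v])}
            for v in adj}
```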
Slide 24: Clustering
If the majority of hosts in a cluster are predicted to be spam, we change the prediction for all hosts in the cluster to spam. The inverse holds as well: if the majority are predicted non-spam, the whole cluster is relabeled non-spam.
Slide 25: Article Critique
- Pros:
  - Gives detailed descriptions of various detection mechanisms.
  - Integrates link and content attributes for building a system to detect Web spam.
- Cons:
  - Statistics and success rates for other content-based detection techniques are missing.
  - Some graphs had axis labels missing.
Extension: combine the regularization methods at hand (regularization: any method of preventing a model from overfitting the data) in order to improve the overall accuracy.
Slide 26: Summary
Why is Spam bad?
- Costs for users: lower precision for some queries
- Costs for search engines: wasted storage space, network resources, and processing cycles
- Costs for publishers: resources invested in cheating rather than in improving their content
How Do We Detect Spam?
- Machine learning/training
- Link-based detection
- Content-based detection
- Using links and contents
- Using Web topology
Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.