Title: Know your Neighbors: Web Spam Detection Using the Web Topology
Slide 1: Know your Neighbors: Web Spam Detection Using the Web Topology
Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa Murdock(1), Fabrizio Silvestri(2)
(1) Yahoo! Research Barcelona, Catalunya, Spain
(2) ISTI-CNR, Pisa, Italy
ACM SIGIR, 25 July 2007, Amsterdam
Presented by: Soumo Gorai
Slide 2: Soumo's Biography
- 4th-year CS major
- Graduating May 2008
- About me: lived in India, Australia, and the U.S.
- CS interests: databases, HCI, web programming, networking, graphics, gaming
Slide 3: Here's all that you can find on the web. NOT!
Slide 4: Here's just some of what really is out there
Slide 5: And more.
Slide 6: Why so many different things?
There is fierce competition for your attention! Publication on the web is easy, for personal publishing as well as commercial publishing, advertisements, and economic activity.
And there's lots and lots and lots of spam!
Slide 7: What's Spam?!
Slide 8: Hidden Text
Slide 9: Only hidden text? Here's a whole fake search engine!
Slide 10: Why is Spam Bad?
Costs:
- Costs for users: lower precision for some queries
- Costs for search engines: wasted storage space, network resources, and processing cycles
- Costs for publishers: resources invested in cheating rather than in improving their content
Every undeserved gain in ranking for a spammer is a loss of search precision for the search engine.
Slide 11: How Do We Detect Spam?
- Machine learning/training
- Link-based detection
- Content-based detection
- Using links and contents
- Using Web topology
Slide 12: Machine Learning/Training
Slide 13: ML Challenges
- Machine learning challenges:
  - Instances are not really independent (they form a graph)
  - The training set is relatively small
- Information retrieval challenges:
  - It is hard to determine which features are relevant
  - It is hard for search engines to provide labeled data
  - Even if they do, the labels will not reflect a consensus on what Web spam is
Slide 14: Link-based Detection
Single-level link farms can be detected by searching for groups of nodes that share their out-links [Gibson et al., 2005].
Slide 15: Why use it?
- Degree-related measures
- PageRank
- TrustRank [Gyongyi et al., 2004]
- Truncated PageRank [Becchetti et al., 2006]: similar to PageRank, but it limits how much of a page's score comes from its close neighbors. The Truncated PageRank score is a useful feature for spam detection because spam pages generally try to reinforce their PageRank scores by linking to each other.
Slide 16: Degree-based
Measures related to in-degree and out-degree:
- Edge reciprocity: the number of links that are reciprocal
- Assortativity: the ratio between the degree of a particular page and the average degree of its neighbors
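These two degree-based features can be computed directly from the link graph. The sketch below uses illustrative names and measures reciprocity as the fraction of a page's out-links that are reciprocated, one common normalization (the slide counts them; either form works as a feature).

```python
def degree_features(adj):
    """Sketch of edge-reciprocity and assortativity features.
    adj maps page -> set of out-linked pages (hypothetical structure)."""
    # build the in-link map
    inl = {v: set() for v in adj}
    for v, outs in adj.items():
        for w in outs:
            inl.setdefault(w, set()).add(v)
    feats = {}
    for v, outs in adj.items():
        degree = len(outs) + len(inl[v])
        # fraction of out-links that are reciprocated (v->w and w->v)
        recip = sum(1 for w in outs if v in adj.get(w, set()))
        reciprocity = recip / len(outs) if outs else 0.0
        # page degree divided by the average degree of its neighbors
        nbrs = outs | inl[v]
        if nbrs:
            avg = sum(len(adj.get(u, set())) + len(inl.get(u, set()))
                      for u in nbrs) / len(nbrs)
            assortativity = degree / avg if avg else 0.0
        else:
            assortativity = 0.0
        feats[v] = {'reciprocity': reciprocity, 'assortativity': assortativity}
    return feats
```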
Slide 17: TrustRank / PageRank
TrustRank: an algorithm that starts from a set of trusted nodes and, building on PageRank, measures the degree of relationship each page has with the known trusted pages; this gives each page a TrustRank score. Derived features:
- Ratio between TrustRank and PageRank
- Number of home pages
Cons: this alone is not sufficient, as there are many false positives.
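TrustRank [Gyongyi et al., 2004] is essentially PageRank with a biased teleport vector: instead of jumping to a random page, the walk restarts only at trusted seed pages, so trust flows outward along links from the seeds. A minimal sketch, with illustrative names:

```python
import numpy as np

def trustrank(adj, trusted, alpha=0.85, iters=50):
    """TrustRank sketch: PageRank whose teleport vector is concentrated
    on a seed set of trusted pages. Not the authors' implementation."""
    nodes = list(adj)
    n = len(nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    M = np.zeros((n, n))
    for v, outs in adj.items():
        if outs:
            for w in outs:
                M[idx[w], idx[v]] = 1.0 / len(outs)
        else:
            M[:, idx[v]] = 1.0 / n
    # teleport only to trusted seeds instead of uniformly to all pages
    d = np.zeros(n)
    for v in trusted:
        d[idx[v]] = 1.0 / len(trusted)
    t = d.copy()
    for _ in range(iters):
        t = alpha * (M @ t) + (1 - alpha) * d
    return dict(zip(nodes, t))
```

Pages unreachable from the seed set end up with zero trust; a low TrustRank-to-PageRank ratio then flags pages whose rank is not backed by trusted links.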
Slide 18: Content-based Detection
Most of the features reported in [Ntoulas et al., 2006]:
- Number of words in the page and title
- Average word length
- Fraction of anchor text
- Fraction of visible text
- Compression rate
- Corpus precision and corpus recall
- Query precision and query recall
- Independent trigram likelihood
- Entropy of trigrams
Slide 19: Corpus and Query
- F: the set of most frequent terms in the collection
- Q: the set of most frequent terms in a query log
- P: the set of terms in a page
Computed features:
- Corpus precision: the fraction of words (excluding stopwords) in a page that appear in the set of popular terms of the data collection.
- Corpus recall: the fraction of popular terms of the data collection that appear in the page.
- Query precision: the fraction of words in a page that appear in the set of the q most popular terms in a query log.
- Query recall: the fraction of the q most popular terms of the query log that appear in the page.
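With P, F, and Q as sets of terms, all four features are set-overlap ratios. A minimal sketch (illustrative function and parameter names):

```python
def corpus_query_features(page_terms, corpus_top, query_top,
                          stopwords=frozenset()):
    """Sketch of the four precision/recall features.
    page_terms = P, corpus_top = F, query_top = Q (all sets of terms)."""
    p = set(page_terms) - stopwords   # page terms, stopwords removed
    f, q = set(corpus_top), set(query_top)
    return {
        'corpus_precision': len(p & f) / len(p) if p else 0.0,
        'corpus_recall':    len(p & f) / len(f) if f else 0.0,
        'query_precision':  len(p & q) / len(p) if p else 0.0,
        'query_recall':     len(p & q) / len(q) if q else 0.0,
    }
```

Intuitively, keyword-stuffed spam pages pack in many popular query terms, which pushes their query precision and recall far above those of typical pages.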
Slide 20: Visual Clues
Figure: Histogram of the corpus precision in non-spam vs. spam pages.
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
Slide 21: Links AND Contents Detection
Why both?
Slide 22: Web Topology Detection
- Pages topologically close to each other are more likely to have the same label (spam/non-spam) than random pairs of pages.
- Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000].
- Spam tends to be clustered on the Web (shown in black in the figure).
Slide 23: Topological dependencies: in-links and out-links
Let SOUT(x) be the fraction of spam hosts linked to by host x, out of all labeled hosts linked to by host x. The first figure shows the histogram of SOUT for spam and non-spam hosts. We see that almost all non-spam hosts link mostly to non-spam hosts.
Let SIN(x) be the fraction of spam hosts that link to host x, out of all labeled hosts that link to x. The second figure shows the histograms of SIN for spam and non-spam hosts. In this case there is a clear separation between spam and non-spam hosts.
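SOUT(x) and SIN(x) follow directly from the definitions above: restrict each host's out-/in-neighborhood to labeled hosts, then take the spam fraction. A sketch with illustrative names (label strings are assumptions, not the authors' encoding):

```python
def spam_link_fractions(adj, labels):
    """SOUT(x) and SIN(x) sketch: fraction of the labeled out-/in-
    neighbors of host x that are spam. labels maps a host to
    'spam'/'nonspam'; unlabeled hosts are ignored."""
    inl = {v: set() for v in adj}
    for v, outs in adj.items():
        for w in outs:
            inl.setdefault(w, set()).add(v)

    def spam_frac(neighbors):
        labeled = [u for u in neighbors if u in labels]
        if not labeled:
            return 0.0
        return sum(1 for u in labeled if labels[u] == 'spam') / len(labeled)

    return {v: {'SOUT': spam_frac(adj[v]), 'SIN': spam_frac(inl[v])}
            for v in adj}
```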
Slide 24: Clustering
If the majority of hosts in a cluster are predicted to be spam, we change the prediction for all hosts in the cluster to spam. The inverse holds as well: if the majority are predicted non-spam, the whole cluster is relabeled non-spam.
Slide 25: Article Critique
- Pros:
  - Gives detailed descriptions of various detection mechanisms.
  - Integrates link and content attributes for building a system to detect Web spam.
- Cons:
  - Statistics and success rates for other content-based detection techniques are missing.
  - Some graphs had axis labels missing.
Extension: combine the regularization methods at hand (regularization: any method of preventing a model from overfitting the data) in order to improve the overall accuracy.
Slide 26: Summary
Why is Spam bad?
- Costs for users: lower precision for some queries
- Costs for search engines: wasted storage space, network resources, and processing cycles
- Costs for publishers: resources invested in cheating rather than in improving their content
How Do We Detect Spam?
- Machine learning/training
- Link-based detection
- Content-based detection
- Using links and contents
- Using Web topology
Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.