1
Know your Neighbors: Web Spam Detection Using the Web Topology
Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa Murdock(1), Fabrizio Silvestri(2)
1. Yahoo! Research Barcelona, Catalunya, Spain
2. ISTI-CNR, Pisa, Italy
ACM SIGIR, 25 July 2007, Amsterdam
  • Presented by
  • SOUMO GORAI

2
Soumo's Biography
  • 4th Year CS Major
  • Graduating May 2008
  • Interesting fact about me: lived in India,
    Australia, and the U.S.
  • CS Interests: Databases, HCI, Web Programming,
    Networking, Graphics, Gaming

3
Here's all that you can find on the web.
NOT!
4
Here's just some of what is really out there
5
And more.
6
Why so many different things?
There is fierce competition for your attention!
Ease of publication enables personal publishing as
well as commercial publishing, advertisements,
and economic activity.
And there's lots and lots and lots of spam!
7
What's Spam?!
8
Hidden Text
9
Only hidden text? Here's a whole fake search
engine!
10
Why is Spam bad?
  • Costs
  • Costs for users: lower precision for some
    queries
  • Costs for search engines: wasted storage space,
    network resources, and processing cycles
  • Costs for the publishers: resources invested in
    cheating rather than in improving their content

Every undeserved gain in ranking for a spammer is
a loss of search precision for the search engine.
11
How Do We Detect Spam?
  • Machine Learning/Training
  • Link-based Detection
  • Content-based Detection
  • Using Links and Contents
  • Using Web-based Topology

12
Machine Learning/Training
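The slide illustrates the supervised setup: each host is represented by a vector of link- and content-based features (introduced in the following slides) and a classifier is trained on hand-labeled hosts. A minimal sketch of that idea, assuming scikit-learn, a bagged decision tree, and invented feature values (the class weights only mimic cost-sensitive training):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Each row describes one labeled host; the feature names are illustrative:
# [log in-degree, TrustRank/PageRank ratio, corpus precision, query precision]
X = np.array([[2.1, 0.9, 0.45, 0.30],
              [5.7, 0.1, 0.95, 0.90],
              [3.0, 0.7, 0.50, 0.35],
              [6.2, 0.2, 0.90, 0.85]])
y = np.array([0, 1, 0, 1])            # 1 = spam, 0 = non-spam (hand-labeled)

# Bagged decision trees; class_weight roughly mimics cost-sensitive training
clf = BaggingClassifier(DecisionTreeClassifier(class_weight={0: 1, 1: 10}),
                        n_estimators=10, random_state=0)
clf.fit(X, y)
print(clf.predict(X))                  # per-host predictions, before any smoothing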
13
ML Challenges
  • Machine Learning Challenges
  • Instances are not really independent (graph)
  • Training set is relatively small
  • Information Retrieval Challenges
  • It is hard to find out which features are
    relevant
  • It is hard for search engines to provide labeled
    data
  • Even if they do, it will not reflect a consensus
    on what is Web Spam

14
Link-based Detection
Single-level farms can be detected by searching
for groups of nodes sharing their out-links
[Gibson et al., 2005].
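A minimal sketch of that heuristic, with made-up host names and a hypothetical shared_outlink_groups helper: hosts with identical out-link sets are grouped, and unusually large groups are candidate single-level farms.

from collections import defaultdict

def shared_outlink_groups(out_links, min_size=3):
    # out_links: dict mapping each host to the list of hosts it links to
    groups = defaultdict(list)
    for host, targets in out_links.items():
        groups[frozenset(targets)].append(host)   # same out-link set -> same group
    # large groups of hosts pointing at exactly the same targets are suspicious
    return [hosts for hosts in groups.values() if len(hosts) >= min_size]

# toy usage
farms = shared_outlink_groups({
    "a.com": ["t.com", "u.com"], "b.com": ["t.com", "u.com"],
    "c.com": ["t.com", "u.com"], "d.com": ["x.com"]})
print(farms)   # [['a.com', 'b.com', 'c.com']]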
15
Why use it?
  • Degree-related measures
  • PageRank
  • TrustRank [Gyongyi et al., 2004]
  • Truncated PageRank [Becchetti et al., 2006]:
    similar to PageRank, but it limits the
    contribution that a page's close neighbors make
    to its score. The Truncated PageRank score is
    therefore a useful feature for spam detection,
    because spam pages generally try to reinforce
    their PageRank by linking to each other (see the
    sketch after this list).
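A minimal sketch of Truncated PageRank under these assumptions: the graph is given as an adjacency list, and contributions from paths of length T or less are dropped so that a page cannot boost its own score through its close neighbors.

import numpy as np

def truncated_pagerank(adj, alpha=0.85, T=3, max_len=50):
    # adj[u] is the list of pages that page u links to
    n = len(adj)
    walk = np.full(n, 1.0 / n)           # probability of being at each page after t steps
    score = np.zeros(n)
    for t in range(1, max_len + 1):
        nxt = np.zeros(n)
        for u, outlinks in enumerate(adj):
            if outlinks:
                nxt[outlinks] += walk[u] / len(outlinks)
            else:
                nxt += walk[u] / n       # dangling page: jump uniformly
        walk = nxt
        if t > T:                         # ignore paths of length <= T (close neighbors)
            score += (1 - alpha) * (alpha ** t) * walk
    return score

# toy usage: 4 pages, where page 2 is mostly reinforced by its close neighborhood
print(truncated_pagerank([[1, 2], [2, 3], [3], [2]], T=2))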

16
Degree-based
Measures related to in-degree and out-degree:
Edge-reciprocity (the number of links that are
reciprocal).
Assortativity (the ratio between the degree of a
particular page and the average degree of its
neighbors).
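A minimal sketch of how these two degree-based measures could be computed for a single host (the function and dictionary names are mine, and the definitions are simplified):

def degree_features(in_links, out_links, host):
    # in_links / out_links: dicts mapping a host to the hosts linking to it / linked from it
    indeg, outdeg = len(in_links[host]), len(out_links[host])
    # edge reciprocity: number of the host's out-links that are reciprocated
    reciprocal = len(set(out_links[host]) & set(in_links[host]))
    # assortativity: host's degree over the average degree of its neighbors
    neighbors = out_links[host]
    avg_neighbor_deg = (sum(len(out_links[v]) for v in neighbors) / len(neighbors)
                        if neighbors else 0.0)
    assortativity = outdeg / avg_neighbor_deg if avg_neighbor_deg else 0.0
    return indeg, outdeg, reciprocal, assortativity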
17
TrustRank / PageRank
TrustRank: an algorithm that picks a set of trusted
seed nodes and, starting from PageRank, measures
the degree of relationship each page has with the
known trusted pages. This gives each page a
TrustRank score.
Features: the ratio between TrustRank and PageRank,
and the number of home pages.
Con: this alone is not sufficient, as there are
many false positives.
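A minimal sketch of the TrustRank idea described above: a biased PageRank whose random jumps land only on the hand-picked trusted hosts. The parameter values and the trustrank name are illustrative.

import numpy as np

def trustrank(adj, trusted_seeds, alpha=0.85, iters=50):
    # adj[u] is the list of hosts that host u links to
    n = len(adj)
    seed = np.zeros(n)
    seed[list(trusted_seeds)] = 1.0 / len(trusted_seeds)   # jump only to trusted hosts
    t = seed.copy()
    for _ in range(iters):
        nxt = np.zeros(n)
        for u, outlinks in enumerate(adj):
            if outlinks:
                nxt[outlinks] += t[u] / len(outlinks)
            else:
                nxt += t[u] * seed                          # dangling host: back to seeds
        t = alpha * nxt + (1 - alpha) * seed
    return t

# toy usage: hosts 0 and 1 are the hand-picked trusted seeds
print(trustrank([[1], [2], [3], [0, 4], []], trusted_seeds=[0, 1]))
# The ratio feature mentioned above would then be trustrank(...) / pagerank(...) per host.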
18
Content-based Detection
  • Most of the features reported in [Ntoulas et al.,
    2006] (a sketch of a few of them follows this list)
  • Number of words in the page and title
  • Average word length
  • Fraction of anchor text
  • Fraction of visible text
  • Compression rate
  • Corpus precision and corpus recall
  • Query precision and query recall
  • Independent trigram likelihood
  • Entropy of trigrams
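A minimal sketch computing a few of the listed features; the definitions here are simplified approximations, not the exact formulations of Ntoulas et al., 2006.

import math, re, zlib
from collections import Counter

def content_features(page_text, anchor_text):
    words = re.findall(r"[a-z']+", page_text.lower())
    n_words = len(words)
    avg_word_len = sum(map(len, words)) / n_words if n_words else 0.0
    frac_anchor = len(anchor_text) / max(len(page_text), 1)
    # repetitive (spammy) text compresses well, giving a high compression ratio
    compression = len(page_text.encode()) / max(len(zlib.compress(page_text.encode())), 1)
    # entropy of word trigrams: machine-generated text tends to look unusual here
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in trigrams.values()) if total else 0.0
    return n_words, avg_word_len, frac_anchor, compression, entropy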

19
Corpus and Query
  • F: set of most frequent terms in the collection
  • Q: set of most frequent terms in a query log
  • P: set of terms in a page
  • Computation techniques:

Corpus precision: the fraction of words (except
stopwords) in a page that appear in the set of
popular terms of the data collection.
Corpus recall: the fraction of popular terms of
the data collection that appear in the page.
Query precision: the fraction of words in a page
that appear in the set of q most popular terms
appearing in a query log.
Query recall: the fraction of the q most popular
terms of the query log that appear in the page.
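These four definitions map directly onto set operations; a minimal sketch (function and argument names are mine):

def precision_recall_features(page_terms, corpus_top_terms, query_top_terms, stopwords):
    # P: terms in the page (minus stopwords); F: popular corpus terms; Q: popular query terms
    P = {t for t in page_terms if t not in stopwords}
    F, Q = set(corpus_top_terms), set(query_top_terms)
    corpus_precision = len(P & F) / len(P) if P else 0.0
    corpus_recall    = len(P & F) / len(F) if F else 0.0
    query_precision  = len(P & Q) / len(P) if P else 0.0
    query_recall     = len(P & Q) / len(Q) if Q else 0.0
    return corpus_precision, corpus_recall, query_precision, query_recall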
20
Visual Clues
Figure: Histogram of the corpus precision in
non-spam vs. spam pages.
Figure: Histogram of the average word length in
non-spam vs. spam pages for k = 500.
Figure: Histogram of the query precision in
non-spam vs. spam pages for k = 500.
21
Links AND Contents Detection
Why Both?
22
Web Topology Detection
  • Pages topologically close to each other are more
    likely to have the same label (spam/non-spam) than
    random pairs of pages.
  • Pages linked together are more likely to be on
    the same topic than random pairs of pages
    [Davison, 2000].
  • Spam tends to be clustered on the Web (shown in
    black on the figure).

23
Topological dependencies in-links
Let SOUT(x) be the fraction of spam hosts linked
to by host x, out of all labeled hosts linked to
by host x. This figure shows the histogram of SOUT
for spam and non-spam hosts. We see that almost
all non-spam hosts link mostly to non-spam hosts.
Let SIN(x) be the fraction of spam hosts that
link to host x, out of all labeled hosts that link
to x. This figure shows the histograms of SIN for
spam and non-spam hosts. In this case there is a
clear separation between spam and non-spam hosts.
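A minimal sketch of these two features, assuming labels is a dict of hand-labeled hosts (1 = spam, 0 = non-spam) and out_links/in_links are adjacency dicts:

def sout_sin(out_links, in_links, labels):
    sout, sin = {}, {}
    for x in labels:
        out_labeled = [v for v in out_links.get(x, []) if v in labels]
        in_labeled  = [v for v in in_links.get(x, []) if v in labels]
        # fraction of spam hosts among the labeled hosts x links to / is linked from
        sout[x] = (sum(labels[v] for v in out_labeled) / len(out_labeled)
                   if out_labeled else 0.0)
        sin[x]  = (sum(labels[v] for v in in_labeled) / len(in_labeled)
                   if in_labeled else 0.0)
    return sout, sin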
24
Clustering
If the majority of a cluster is predicted to be
spam, then we change the prediction for all hosts
in the cluster to spam. The inverse holds true
as well.
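A minimal sketch of that smoothing step; the single 0.5 majority threshold is illustrative rather than the exact rule used in the paper.

def relabel_by_cluster(pred, clusters, threshold=0.5):
    # pred: dict host -> 0/1 prediction; clusters: list of lists of hosts
    smoothed = dict(pred)
    for hosts in clusters:
        spam_fraction = sum(pred[h] for h in hosts) / len(hosts)
        label = 1 if spam_fraction > threshold else 0     # majority vote per cluster
        for h in hosts:
            smoothed[h] = label
    return smoothed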
25
Article Critique
  • Pros
  • Has detailed descriptions of various detection
    mechanisms.
  • Integrates link and content attributes to build
    a system for detecting Web spam.
  • Cons
  • Lacks statistics and success rates for other
    content-based detection techniques.
  • Some graphs had axis labels missing.

Extension: combine the regularization methods at
hand (regularization being any method of preventing
a model from overfitting the data) in order to
improve the overall accuracy.
26
Summary
How Do We Detect Spam?
Why is Spam bad?
  • Costs
  • Costs for users: lower precision for some queries
  • Costs for search engines: wasted storage space,
    network resources, and processing cycles
  • Costs for the publishers: resources invested in
    cheating rather than in improving their content
  • Machine Learning/Training
  • Link-based Detection
  • Content-based Detection
  • Using Links and Contents
  • Using Web-based Topology

Every undeserved gain in ranking for a spammer
is a loss of precision for the search engine.