Know your Neighbors: Web Spam Detection using the Web Topology

1
Know your Neighbors: Web Spam Detection using the Web Topology
Carlos Castillo, Debora Donato, Aristides Gionis
Yahoo! Research Barcelona, Spain
  • SIGIR 2007

2
Outline
  • Introduction
  • Previous work
  • Dataset and Framework
  • Attributes
  • Classifiers
  • Smoothing
  • Conclusions

3
Introduction
  • What is web spam?
  • Web spam is a malicious attempt to influence the
    outcome of ranking algorithms in order to obtain
    a high rank from search engines.
  • What is the impact of web spam?
  • It damages the reputation of search engines.
  • It weakens the trust of their users.

4
Introduction
  • Our main contributions
  • This is the first paper that integrates link and
    content attributes to build a system that detects
    Web spam.
  • We investigate the use of a cost-sensitive
    classifier for this classification problem.
  • We demonstrate improvements in classification
    accuracy by exploiting dependencies among
    neighboring hosts.

5
Previous work
  • Previous work on Web spam detection has
    focused mostly on the detection of three types of
    Web spam:
  • Link spam: creation of a link structure aimed at
    affecting the outcome of a link-based ranking
    algorithm
  • Content spam: maliciously crafting the content of
    Web pages
  • Cloaking: sending different content to a search
    engine than to the regular visitors of a web site

6
Dataset
We use the publicly available WEBSPAM-UK2006
dataset. It is based on a set of pages obtained
from a crawl of the .uk domain in May 2006.
7
Framework
The foundation of our spam detection system
is a cost-sensitive decision tree. The
detection process is evaluated using
the confusion matrix.
8
Framework
We consider the following measures
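
The transcript does not list the measures, but later slides refer to the true positive rate, the false positive rate and the F-measure. A minimal sketch, assuming the standard confusion-matrix definitions with spam as the positive class (the function name `spam_measures` is illustrative and not from the slides):

```python
def spam_measures(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate, precision, F-measure).

    Assumed layout, with spam as the positive class:
      tp: spam hosts predicted as spam      fn: spam hosts predicted as normal
      fp: normal hosts predicted as spam    tn: normal hosts predicted as normal
    """
    tpr = tp / (tp + fn) if tp + fn else 0.0        # also called recall
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # F-measure: harmonic mean of precision and recall
    f = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return tpr, fpr, precision, f


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(spam_measures(tp=80, fn=20, fp=10, tn=90))
```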
9
Attributes
  • The set of features that we use to classify
    web hosts includes:
  • Link-based features
  • Content-based features

10
Attributes
  • Link-based attributes
  • Degree-related measures: in-degree, out-degree
  • PageRank: computes a score for each page
  • TrustRank: starts from a set of trusted nodes
  • Truncated PageRank: diminishes the influence
    of close neighbors
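
To make the first two link-based attributes concrete, here is a minimal sketch of degree counts and a plain power-iteration PageRank on a toy graph. The graph, the damping factor of 0.85 and the function names are assumptions for illustration; TrustRank and Truncated PageRank are not implemented here.

```python
def degrees(out_links):
    """Return (in_degree, out_degree) dictionaries for a graph given as
    {node: [nodes it links to]}. Every node must appear as a key."""
    in_deg = {v: 0 for v in out_links}
    for targets in out_links.values():
        for w in targets:
            in_deg[w] += 1
    out_deg = {v: len(targets) for v, targets in out_links.items()}
    return in_deg, out_deg


def pagerank(out_links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration (damping value is an assumption)."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Score held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}  # hypothetical graph
    print(degrees(toy))
    print(pagerank(toy))
```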

11
Attributes
  • Content-based attributes
  • Number of words in the page
  • Number of words in the title
  • Average word length
  • Fraction of anchor text
  • Fraction of visible text
  • Compression rate
  • Corpus precision and corpus recall
    (based on the k most frequent words in the dataset)
  • Query precision and query recall
    (based on the k most popular words in a query log)
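
A few of these content attributes can be sketched in a handful of lines. The definitions below are assumptions, since the slide only names the features: compression rate is taken as compressed size divided by uncompressed size (with zlib as a stand-in compressor), corpus precision as the fraction of the page's word occurrences that are among the k frequent terms, and corpus recall as the fraction of those terms that occur in the page.

```python
import zlib


def average_word_length(words):
    return sum(len(w) for w in words) / len(words) if words else 0.0


def compression_rate(text):
    # Assumption: compressed size / uncompressed size, using zlib.
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw) if raw else 0.0


def corpus_precision(words, frequent_terms):
    # Assumption: fraction of word occurrences that are frequent terms.
    if not words:
        return 0.0
    return sum(w in frequent_terms for w in words) / len(words)


def corpus_recall(words, frequent_terms):
    # Assumption: fraction of the frequent terms that occur in the page.
    if not frequent_terms:
        return 0.0
    return len(set(words) & set(frequent_terms)) / len(frequent_terms)


if __name__ == "__main__":
    page = "buy cheap pills buy cheap pills now".split()  # made-up page text
    frequent = {"buy", "now", "the", "of"}                # made-up top-k terms
    print(average_word_length(page))
    print(compression_rate(" ".join(page) * 50))
    print(corpus_precision(page, frequent), corpus_recall(page, frequent))
```

Query precision and query recall would follow the same pattern, with the term set taken from a query log instead of the corpus.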

12
Attributes
  • Quality of content features for the spam and
    the non-spam pages
  • Average word length

13
Attributes
  • Quality of content features for the spam and
    the non-spam pages
  • Corpus precision

14
Attributes
  • Quality of content features for the spam and
    the non-spam pages
  • Query precision

15
Classifiers
  • How does a decision tree classify?
  • A threshold value k is determined for each feature.
  • Instances whose feature value is less than k are
    assigned one class label.
  • Instances whose feature value is greater than k are
    assigned the other class label.
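
The threshold test at a single tree node can be sketched as follows. This is a simplified illustration that picks k by counting training errors on one feature; real tree learners typically use a purity criterion such as information gain and apply the split recursively. The function names are hypothetical.

```python
def best_threshold(values, labels):
    """Return (k, below_label, errors) for the best split on one feature,
    or None if all values are identical. Labels are "spam" or "normal"."""
    best = None
    candidates = sorted(set(values))
    # Try the midpoint between each pair of consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        k = (lo + hi) / 2
        for below, above in (("normal", "spam"), ("spam", "normal")):
            errors = sum(
                (below if v < k else above) != y for v, y in zip(values, labels)
            )
            if best is None or errors < best[2]:
                best = (k, below, errors)
    return best


def classify(value, k, below_label):
    above_label = "spam" if below_label == "normal" else "normal"
    return below_label if value < k else above_label


if __name__ == "__main__":
    # Made-up feature values and labels.
    values = [1, 2, 3, 8, 9]
    labels = ["normal", "normal", "normal", "spam", "spam"]
    k, below, errors = best_threshold(values, labels)
    print(k, below, errors, classify(7, k, below))
```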

16
Classifiers
  • Cost-sensitive decision tree
  • We impose a cost of zero for correctly classifying
    an instance.
  • We set the cost of misclassifying a spam host as
    normal to be R times the cost of misclassifying
    a normal host as spam.
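
One common way to apply such a cost matrix is to predict the label with the lowest expected cost, given an estimated spam probability. The slides do not say how the cost is folded into the tree, so the sketch below is an assumption about the mechanism, with hypothetical function names.

```python
def cost_matrix(R):
    """Cost indexed by (true label, predicted label), as on the slide:
    zero when correct, R for spam predicted as normal, 1 for the reverse."""
    return {
        ("spam", "spam"): 0,
        ("normal", "normal"): 0,
        ("spam", "normal"): R,
        ("normal", "spam"): 1,
    }


def min_cost_label(p_spam, R):
    """Pick the label with the lower expected misclassification cost."""
    cost = cost_matrix(R)
    cost_if_normal = p_spam * cost[("spam", "normal")]
    cost_if_spam = (1 - p_spam) * cost[("normal", "spam")]
    return "spam" if cost_if_spam < cost_if_normal else "normal"


if __name__ == "__main__":
    # The same host flips from normal to spam as R grows.
    print(min_cost_label(0.3, R=1), min_cost_label(0.3, R=10))
```

Under this rule a host is labeled spam when its spam probability exceeds 1 / (1 + R), so a larger R lowers the bar for flagging spam.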

17
Classifiers
According to the results for different values
of R, the value of R becomes a parameter that
can be tuned to balance the true positive rate
and the false positive rate. In our case, we
wish to maximize the F-measure.
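
The tuning step can be sketched as a sweep over candidate values of R, keeping the one with the highest F-measure. The sketch assumes that R acts as a probability threshold of 1 / (1 + R) on an estimated spam probability; the candidate values and the toy data are made up and are not results from the paper.

```python
def evaluate(p_spam, is_spam, R):
    """Return (true positive rate, false positive rate, F-measure) when a
    host is flagged as spam if its probability exceeds 1 / (1 + R)."""
    threshold = 1.0 / (1.0 + R)
    tp = fp = fn = tn = 0
    for p, y in zip(p_spam, is_spam):
        predicted_spam = p > threshold
        if predicted_spam and y:
            tp += 1
        elif predicted_spam:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return tpr, fpr, f


def best_R(p_spam, is_spam, candidates=(1, 10, 20, 30, 50)):
    # Candidate values are illustrative; ties go to the smallest R listed first.
    return max(candidates, key=lambda R: evaluate(p_spam, is_spam, R)[2])


if __name__ == "__main__":
    p = [0.9, 0.6, 0.4, 0.08, 0.3, 0.05, 0.03, 0.01]   # made-up probabilities
    y = [True, True, True, True, False, False, False, False]
    for R in (1, 10, 20, 30, 50):
        print(R, evaluate(p, y, R))
    print("best R:", best_R(p, y))
```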
18
Classifiers
Bagging improved our results by reducing the
false-positive rate, as shown in Table 3.
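
Bagging trains several classifiers on bootstrap samples of the training set and combines them by majority vote. A minimal sketch, assuming a toy one-feature threshold classifier as the base learner in place of the decision tree used in the presentation; it does not reproduce Table 3, and all names and data are illustrative.

```python
import random
from collections import Counter


def train_stump(sample):
    """Toy base learner: threshold halfway between the class means.
    sample is a list of (feature value, label) pairs."""
    spam = [v for v, y in sample if y == "spam"]
    normal = [v for v, y in sample if y == "normal"]
    if not spam or not normal:
        only = "spam" if spam else "normal"   # sample contains one class only
        return lambda value: only
    mean_spam = sum(spam) / len(spam)
    mean_normal = sum(normal) / len(normal)
    k = (mean_spam + mean_normal) / 2
    spam_is_above = mean_spam > mean_normal
    return lambda value: "spam" if (value > k) == spam_is_above else "normal"


def bagging(train, n_models=25, seed=0):
    """Train n_models stumps on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(train) examples with replacement.
        sample = [rng.choice(train) for _ in train]
        models.append(train_stump(sample))

    def predict(value):
        votes = Counter(model(value) for model in models)
        return votes.most_common(1)[0][0]

    return predict


if __name__ == "__main__":
    train = [(1, "normal"), (2, "normal"), (3, "normal"), (4, "normal"),
             (8, "spam"), (9, "spam"), (10, "spam"), (11, "spam")]  # made up
    predict = bagging(train)
    print(predict(0), predict(12))
```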
19
Classifiers
  • Table 4 shows the contribution of each type
    of feature to the classification.
  • The content features serve to reduce the
    false-positive rate.
  • The link features serve to improve the
    true-positive rate.

20
Smoothing
Similar pages tend to be linked together more
frequently than dissimilar ones. In this
section we investigate the connections between
spam hosts in order to improve the accuracy of
our classifiers.
21
Smoothing
22
Smoothing
Topological dependencies of spam nodes
23
Smoothing
Topological dependencies of spam nodes
24
Smoothing
We use the result of a graph clustering
algorithm to improve the prediction obtained from
the classification algorithm.
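
One way to use a clustering for smoothing is to average the classifier's spam scores within each cluster and relabel the whole cluster when the average is clearly high or clearly low. The sketch below assumes this rule; the thresholds, the cluster assignment and the function name are hypothetical, and the clustering algorithm itself is not shown.

```python
from collections import defaultdict


def smooth_by_cluster(p_spam, cluster_of, upper=0.75, lower=0.25):
    """Relabel hosts using the average predicted spam score of their cluster.

    p_spam:     {host: spam score in [0, 1] from the base classifier}
    cluster_of: {host: cluster id from some graph clustering algorithm}
    upper/lower are illustrative thresholds, not values from the slides.
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)

    labels = {}
    for hosts in members.values():
        average = sum(p_spam[h] for h in hosts) / len(hosts)
        for h in hosts:
            if average > upper:
                labels[h] = "spam"        # cluster is mostly spam
            elif average < lower:
                labels[h] = "normal"      # cluster is mostly normal
            else:
                # Mixed cluster: keep the base classifier's own decision.
                labels[h] = "spam" if p_spam[h] > 0.5 else "normal"
    return labels


if __name__ == "__main__":
    scores = {"a": 0.9, "b": 0.95, "c": 0.45, "d": 0.1,
              "e": 0.05, "f": 0.55, "g": 0.7, "h": 0.3}   # made-up scores
    clusters = {"a": 1, "b": 1, "c": 1, "d": 2,
                "e": 2, "f": 2, "g": 3, "h": 3}           # made-up clusters
    print(smooth_by_cluster(scores, clusters))
```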
25
Thanks!