Know your Neighbors: Web Spam Detection using the Web Topology

1
Know your Neighbors: Web Spam Detection using
the Web Topology
Carlos Castillo, Debora Donato, Aristides Gionis
Yahoo! Research Barcelona, Spain

SIGIR '07

2
Outline

Introduction
Previous work
Dataset and Framework
Attributes
Classifiers
Smoothing
Conclusions

3
Introduction

What is web spam?
Web spam is a malicious attempt to influence the
outcome of ranking algorithms in order to obtain
a high rank from search engines.
What is the impact of web spam?
It damages the reputation of search engines.
It weakens the trust of their users.

4
Introduction

Our main contributions:
This is the first paper that integrates link and
content attributes to build a system for
detecting Web spam.
We investigate the use of a cost-sensitive
classifier for this classification problem.
We demonstrate improvements in classification
accuracy by using dependencies among neighboring
hosts.

5
Previous work

Previous work on Web spam detection has
focused mostly on detecting three types of
Web spam:
Link spam: creating a link structure aimed at
affecting the outcome of a link-based ranking
algorithm.
Content spam: maliciously crafting the content of
Web pages.
Cloaking: sending different content to a search
engine than to the regular visitors of a web site.

6
Dataset
We use the publicly available WEBSPAM-UK2006
dataset. It is based on a set of pages obtained
from a crawl of the .uk domain in May 2006.
7
Framework
The foundation of our spam detection system
is a cost-sensitive decision tree. We evaluate
the detection process using the confusion
matrix.
8
Framework
We consider the following measures
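The measures themselves are not reproduced in this transcript; later slides refer to the true-positive rate, the false-positive rate and the F-measure. A minimal sketch of their standard definitions, computed from the four cells of a confusion matrix. The function name and the counts are hypothetical, chosen only for illustration.

```python
def spam_metrics(tn, fp, fn, tp):
    """Return (true_positive_rate, false_positive_rate, f_measure).

    tn: normal hosts predicted normal   fp: normal hosts predicted spam
    fn: spam hosts predicted normal     tp: spam hosts predicted spam
    """
    tpr = tp / (tp + fn) if tp + fn else 0.0        # recall on the spam class
    fpr = fp / (fp + tn) if fp + tn else 0.0        # normal hosts flagged as spam
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + tpr == 0:
        return tpr, fpr, 0.0
    f_measure = 2 * precision * tpr / (precision + tpr)
    return tpr, fpr, f_measure


# Hypothetical counts, not taken from the paper's tables.
tpr, fpr, f_measure = spam_metrics(tn=90, fp=10, fn=20, tp=80)
print(tpr, fpr, round(f_measure, 4))
```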
9
Attributes

The features that we use to classify web hosts
include:
Link-based features
Content-based features

10
Attributes

Link-based attributes
Degree-related measures: in-degree, out-degree
PageRank: computes a score for each page
TrustRank: propagates trust starting from a set
of trusted nodes
Truncated PageRank: diminishes the influence
of close neighbors
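
The first three link-based attributes can be sketched on a toy graph. This is a minimal illustration, not the feature extraction used for the dataset: the graph, the damping factor of 0.85, the iteration count and the choice of "a" as the only trusted node are all assumptions. TrustRank is approximated here as PageRank whose teleportation is restricted to the trusted set; Truncated PageRank is not covered.

```python
def degrees(out_links):
    """Return (in_degree, out_degree) dictionaries for a directed graph."""
    in_degree = {v: 0 for v in out_links}
    for targets in out_links.values():
        for t in targets:
            in_degree[t] += 1
    out_degree = {v: len(targets) for v, targets in out_links.items()}
    return in_degree, out_degree


def pagerank(out_links, damping=0.85, teleport=None, iterations=50):
    """Power iteration. `teleport` is the random-jump distribution:
    uniform gives plain PageRank, mass on trusted nodes only gives
    a TrustRank-style score."""
    nodes = sorted(out_links)
    if teleport is None:
        teleport = {v: 1.0 / len(nodes) for v in nodes}
    score = dict(teleport)
    for _ in range(iterations):
        nxt = {v: (1 - damping) * teleport[v] for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * score[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: spread its score along the teleport vector.
                for t in nodes:
                    nxt[t] += damping * score[v] * teleport[t]
        score = nxt
    return score


# Hypothetical toy graph: a, b, c form a cycle; two spam nodes link to
# each other and into c, but nothing trusted links back to them.
graph = {
    "a": ["b"], "b": ["c"], "c": ["a"],
    "spam1": ["spam2", "c"], "spam2": ["spam1"],
}
in_degree, out_degree = degrees(graph)
pr = pagerank(graph)
seed = {v: 0.0 for v in graph}
seed["a"] = 1.0                      # assumed trusted node
trust = pagerank(graph, teleport=seed)
print(in_degree, pr, trust)
```

In this toy graph the spam nodes still receive a PageRank score through uniform teleportation, but no trust, because no path leads to them from the trusted node.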

11
Attributes

Content-based attributes
Number of words in the page,
number of words in the title,
average word length
Fraction of anchor text
Fraction of visible text
Compression rate
Corpus precision and corpus recall
(based on k most frequent words in dataset)
Query precision and query recall
(based on k most popular words in query log)
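
A minimal sketch of several of the content attributes listed above, assuming plain text that has already been extracted from the page. The definitions used here are assumptions: precision is taken as the fraction of words in the page that are popular terms, recall as the fraction of popular terms that occur in the page, and compression rate as compressed size divided by uncompressed size (using zlib). The same function serves corpus or query precision and recall, depending on where the popular terms come from. The fractions of anchor text and visible text need the HTML and are left out.

```python
import zlib


def content_features(text, title, popular_terms):
    """Compute a few content-based attributes for one page."""
    words = text.lower().split()
    popular = {t.lower() for t in popular_terms}
    raw = text.encode("utf-8")
    n = len(words)
    return {
        "num_words": n,
        "num_title_words": len(title.split()),
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
        # Assumed definition: compressed bytes / uncompressed bytes.
        "compression_rate": len(zlib.compress(raw)) / len(raw) if raw else 0.0,
        # Fraction of the page's words that are popular terms.
        "precision": sum(w in popular for w in words) / n if n else 0.0,
        # Fraction of the popular terms that occur in the page.
        "recall": len(set(words) & popular) / len(popular) if popular else 0.0,
    }


# Hypothetical page and term list, for illustration only.
features = content_features(
    text="cheap pills cheap pills buy cheap pills online",
    title="cheap pills",
    popular_terms=["the", "cheap", "buy", "home"],
)
print(features)
```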

12
Attributes

Quality of content features for the spam and
the non-spam pages
Average word length

13
Attributes

Quality of content features for the spam and
the non-spam pages
Corpus precision

14
Attributes

Quality of content features for the spam and
the non-spam pages
Query precision

15
Classifiers

How does the decision tree algorithm classify?
A threshold value k is determined for each
feature.
Instances below the threshold are assigned one
class label.
Instances above the threshold are assigned
the other class label.
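
The thresholding step above can be sketched as a single-feature decision stump. This is a simplification under stated assumptions: it picks the threshold k that minimizes training errors, whereas real decision tree learners typically choose splits with an information-based criterion and apply them recursively. The function names and the data are hypothetical.

```python
def fit_stump(values, labels):
    """Find a threshold k and the label to assign above it.

    Tries the midpoint between every pair of adjacent distinct values
    and both label orientations; keeps the split with the fewest errors.
    """
    points = sorted(set(values))
    if len(points) < 2:
        raise ValueError("need at least two distinct feature values")
    best = None  # (errors, k, label_above)
    for lo, hi in zip(points, points[1:]):
        k = (lo + hi) / 2
        for label_above in (0, 1):
            errors = sum(
                (label_above if v > k else 1 - label_above) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, k, label_above)
    return best[1], best[2]


def predict_stump(value, k, label_above):
    """Instances above k get one label, the rest get the other."""
    return label_above if value > k else 1 - label_above


# Hypothetical feature values (e.g. average word length); 1 = spam.
values = [3.9, 4.2, 4.5, 4.8, 6.5, 7.0, 7.4]
labels = [0, 0, 0, 0, 1, 1, 1]
k, label_above = fit_stump(values, labels)
print(k, label_above, predict_stump(5.0, k, label_above))
```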

16
Classifiers

Cost-sensitive decision tree
We impose a cost of zero for correctly
classifying an instance.
We set the cost of misclassifying a spam host as
normal to be R times the cost of
misclassifying a normal host as spam.
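
One common way to apply such a cost matrix is to predict the label with the lowest expected cost. The slides do not say which mechanism was used, so the following is only a sketch under that assumption; `p_spam` stands for a hypothetical spam probability produced by the classifier for a host.

```python
def cost_matrix(r):
    """Costs keyed by (true_label, predicted_label), as on the slide."""
    return {
        ("spam", "spam"): 0.0,        # correct: no cost
        ("normal", "normal"): 0.0,    # correct: no cost
        ("spam", "normal"): float(r),  # missed spam: R times more costly
        ("normal", "spam"): 1.0,      # normal host flagged as spam
    }


def min_cost_label(p_spam, r):
    """Return the label with the lowest expected misclassification cost."""
    costs = cost_matrix(r)
    expected = {
        predicted: p_spam * costs[("spam", predicted)]
        + (1 - p_spam) * costs[("normal", predicted)]
        for predicted in ("spam", "normal")
    }
    return min(expected, key=expected.get)


# With R = 1 a host that is 30% likely to be spam is labelled normal;
# with R = 5 the same host is labelled spam.
label_r1 = min_cost_label(0.3, r=1)
label_r5 = min_cost_label(0.3, r=5)
print(label_r1, label_r5)
```

Under this rule a host is labelled spam whenever its spam probability exceeds 1 / (1 + R), so a larger R lowers the bar for flagging a host.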

17
Classifiers
We report results for different values of R.
The value of R becomes a parameter that
can be tuned to balance the true-positive rate
and the false-positive rate. In our case, we
wish to maximize the F-measure.
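A rough sketch of tuning R, assuming the classifier's spam probabilities stay fixed and only the decision threshold 1 / (1 + R) moves. Retraining the tree for each R, as the slide implies, may behave differently. The scored hosts and the candidate values of R are hypothetical.

```python
def evaluate(scored, r):
    """Return (tpr, fpr, f_measure) when a host is flagged as spam
    if its spam probability exceeds 1 / (1 + r)."""
    threshold = 1.0 / (1.0 + r)
    tp = fp = fn = tn = 0
    for p_spam, is_spam in scored:
        flagged = p_spam > threshold
        if is_spam and flagged:
            tp += 1
        elif is_spam:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    denom = 2 * tp + fp + fn
    f_measure = 2 * tp / denom if denom else 0.0
    return tpr, fpr, f_measure


# Hypothetical (spam probability, true label) pairs; 1 = spam.
scored = [
    (0.95, 1), (0.80, 1), (0.40, 1), (0.15, 1),
    (0.60, 0), (0.30, 0), (0.10, 0), (0.05, 0), (0.05, 0), (0.02, 0),
]
candidates = [1, 2, 5, 10, 20]
results = {r: evaluate(scored, r) for r in candidates}
best_r = max(candidates, key=lambda r: results[r][2])
for r in candidates:
    print(r, results[r])
print("best R by F-measure:", best_r)
```

On this toy data both the true-positive rate and the false-positive rate rise as R grows, and the F-measure peaks at an intermediate value of R.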
18
Classifiers
Bagging improved our results by reducing the
false-positive rate, as shown in Table 3.
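Bagging trains several classifiers on bootstrap samples of the training set and combines them by majority vote. A minimal sketch of the technique, using a one-feature threshold stump as a stand-in for the decision tree; the data, the number of models and the seed are hypothetical, and the sketch makes no claim about the effect on the false-positive rate.

```python
import random


def fit_stump(points):
    """points: list of (value, label). Returns (k, label_above)."""
    values = sorted({v for v, _ in points})
    if len(values) < 2:
        # Degenerate sample: always predict its majority label.
        majority = 1 if 2 * sum(y for _, y in points) > len(points) else 0
        return float("-inf"), majority
    best = None  # (errors, k, label_above)
    for lo, hi in zip(values, values[1:]):
        k = (lo + hi) / 2
        for label_above in (0, 1):
            errors = sum(
                (label_above if v > k else 1 - label_above) != y
                for v, y in points
            )
            if best is None or errors < best[0]:
                best = (errors, k, label_above)
    return best[1], best[2]


def bagging(points, n_models=25, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models


def predict(models, value):
    """Majority vote over all stumps; 1 = spam."""
    votes = sum(
        label_above if value > k else 1 - label_above
        for k, label_above in models
    )
    return 1 if 2 * votes > len(models) else 0


# Hypothetical training data: (feature value, label), 1 = spam.
points = [
    (3.9, 0), (4.2, 0), (4.5, 0), (4.8, 0), (5.1, 0),
    (6.5, 1), (7.0, 1), (7.4, 1), (7.9, 1),
]
models = bagging(points)
print(predict(models, 3.0), predict(models, 9.0))
```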
19
Classifiers

Table 4 shows the contribution of each type
of feature to the classification.
The content features serve to reduce the
false-positive rate.
The link features serve to improve the
true-positive rate.

20
Smoothing
Similar pages tend to be linked together more
frequently than dissimilar ones. In this
section we investigate the connections between
spam hosts in order to improve the accuracy of
our classifiers.
21
Smoothing
22
Smoothing
Topological dependencies of spam nodes
23
Smoothing
Topological dependencies of spam nodes
24
Smoothing
We use the result of a graph clustering
algorithm to improve the prediction obtained from
the classification algorithm.
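One plausible way to combine a clustering with the classifier's output is to relabel a whole cluster when its hosts are predominantly predicted as spam or as normal, and to leave mixed clusters unchanged. The slide does not give the rule, so the following is a sketch under that assumption; the clusters, the predictions and both thresholds are hypothetical.

```python
def smooth_by_cluster(predictions, clusters, lower=0.25, upper=0.75):
    """Relabel hosts using the average prediction of their cluster.

    predictions: host -> 0 (normal) or 1 (spam), from the classifier
    clusters:    list of lists of hosts, from a graph clustering step
    """
    smoothed = dict(predictions)
    for members in clusters:
        if not members:
            continue
        average = sum(predictions[h] for h in members) / len(members)
        if average > upper:
            label = 1          # mostly spam: mark the whole cluster spam
        elif average < lower:
            label = 0          # mostly normal: mark the whole cluster normal
        else:
            continue           # mixed cluster: keep the original labels
        for h in members:
            smoothed[h] = label
    return smoothed


# Hypothetical classifier output and clustering.
predictions = {
    "a1": 1, "a2": 1, "a3": 1, "a4": 1, "a5": 0,
    "b1": 0, "b2": 0, "b3": 0, "b4": 0, "b5": 1,
    "c1": 1, "c2": 0,
}
clusters = [
    ["a1", "a2", "a3", "a4", "a5"],
    ["b1", "b2", "b3", "b4", "b5"],
    ["c1", "c2"],
]
smoothed = smooth_by_cluster(predictions, clusters)
print(smoothed)
```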
25
Thanks!
