Know your Neighbors: Web Spam Detection using the Web Topology

1
Know your Neighbors: Web Spam Detection using the
Web Topology
  • By Carlos Castillo, Debora Donato, Aristides
    Gionis, Vanessa Murdock and Fabrizio Silvestri
  • Presented by Sovandy Hang
  • CS 4440, Fall 2007

2
Outline
  • About me
  • Introduction
  • Keywords
  • How the process works
  • Conclusion
  • Questions and answers

3
About Me
  • 5th-year CS and IE major
  • Graduating next summer
  • Interest: Enterprise Resource Planning
  • Thinks all software should be open source

4
Introduction
  • Web search is a part of our lives.
  • Many businesses rely on the web.
  • There is a huge economic incentive for commercial
    websites to influence search results.
  • Web spamming is cheap and often successful.
  • Web spam degrades the quality of search engine
    results.
  • Web spam is annoying.

5
Keywords
  • Web spam
  • PageRank
  • Spamdexing
  • Spamicity
  • Graph-based algorithm

6
Measurement Tool

7
How it works
  • Feature Extraction
  • Classification
  • Smoothing
  • Clustering
  • Propagation
  • Stacked Graphical Learning

8
Feature Extraction
  • The data set is obtained using a web crawler.
  • For each page, its links and contents are
    obtained.
  • From the data set, a full graph is built.
  • For each host and page, certain features are
    computed.
  • Link-based features are extracted from the host
    graph.
  • Content-based features are extracted from
    individual pages.
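
The step from crawled pages to a host graph can be sketched as follows. This is only an illustration: the input format (a list of source/target URL pairs) and the function names `build_host_graph` and `degree_features` are assumptions made for the example, not something the paper or the slides specify.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_host_graph(page_links):
    """Collapse page-level links into a weighted host graph.

    page_links: iterable of (source_url, target_url) pairs
                (a hypothetical crawl output format).
    Returns {source_host: {target_host: number_of_page_links}}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in page_links:
        src_host = urlparse(src).netloc
        dst_host = urlparse(dst).netloc
        # Links inside a single host carry no host-level information.
        if src_host and dst_host and src_host != dst_host:
            graph[src_host][dst_host] += 1
    return {host: dict(targets) for host, targets in graph.items()}


def degree_features(graph):
    """Return {host: (in_degree, out_degree)} over the host graph."""
    hosts = set(graph)
    for targets in graph.values():
        hosts.update(targets)
    in_degree = {host: 0 for host in hosts}
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    return {host: (in_degree[host], len(graph.get(host, {}))) for host in hosts}


links = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/y"),
    ("http://b.example/x", "http://c.example/"),
    ("http://a.example/1", "http://a.example/2"),  # intra-host link, ignored
]
host_graph = build_host_graph(links)
features = degree_features(host_graph)
```

Here the in-degree and out-degree of each host stand in for the simplest link-based features; content-based features would be computed separately from the page text.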

9
Link-based Features
  • Some important link-based features are:
  • Degree-related measures
  • PageRank
  • TrustRank
  • Truncated PageRank
  • Estimation of supporters
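
As a rough illustration of one of these features, PageRank can be computed with a plain power iteration. The sketch below assumes the graph is a dictionary mapping each node to a list of its out-neighbors; the damping factor of 0.85 and the fixed iteration count are conventional choices, not values taken from the slides.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph: {node: [out_neighbor, ...]} (assumed input format).
    Returns {node: score}, with scores summing to 1.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


example_graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example_graph)
top_node = max(ranks, key=ranks.get)
```

TrustRank follows the same iteration but restarts only at a set of trusted seed nodes instead of uniformly at every node.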

10
Content-based Features
  • Some important content-based features are:
  • Fraction of visible text
  • Compression rate
  • Corpus precision and corpus recall
  • Query precision and query recall
  • Independent trigram likelihood
  • Entropy of trigrams
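
Two of these features are easy to sketch with the Python standard library. The definitions below are assumptions made for illustration: compression rate is taken as compressed size divided by original size, and the entropy is computed over word trigrams. The paper's exact definitions may differ.

```python
import math
import zlib
from collections import Counter


def compression_rate(text):
    """Compressed size divided by original size (assumed definition)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def trigram_entropy(text):
    """Shannon entropy, in bits, of the word-trigram distribution."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    total = len(trigrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


repetitive = "cheap pills " * 200
varied = (
    "Web search engines rank pages using many signals, including the "
    "link structure of the web and the text that appears on each page."
)
rate_repetitive = compression_rate(repetitive)
rate_varied = compression_rate(varied)
```

Highly repetitive text, which is typical of keyword stuffing, compresses much better than ordinary prose, so a low compression rate is one hint that a page may be spam.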

11
Classification
  • Create a base classifier from link-based and
    content-based features.
  • Apply a cost-sensitive decision tree to classify
    hosts as spam or non-spam.
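
To show what "cost-sensitive" means here, the sketch below trains a decision stump (a decision tree of depth one) that minimizes total misclassification cost rather than the error count. It is a deliberately small stand-in for a full decision tree; the function name and the cost values are assumptions chosen for the example.

```python
def train_cost_sensitive_stump(samples, labels, cost_fn=5.0, cost_fp=1.0):
    """Find the (feature_index, threshold) split with the lowest total cost.

    The stump predicts spam (1) when samples[i][feature_index] > threshold.
    cost_fn: cost of labeling a spam host as non-spam (false negative).
    cost_fp: cost of labeling a non-spam host as spam (false positive).
    """
    best = None
    for feature in range(len(samples[0])):
        for threshold in sorted({row[feature] for row in samples}):
            cost = 0.0
            for row, label in zip(samples, labels):
                predicted = 1 if row[feature] > threshold else 0
                if predicted == 0 and label == 1:
                    cost += cost_fn
                elif predicted == 1 and label == 0:
                    cost += cost_fp
            if best is None or cost < best[0]:
                best = (cost, feature, threshold)
    return best[1], best[2]


# One toy feature per host; label 1 means spam.
samples = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7]]
labels = [0, 0, 1, 0, 0, 1, 1]

equal_cost_split = train_cost_sensitive_stump(samples, labels, cost_fn=1.0)
spam_averse_split = train_cost_sensitive_stump(samples, labels, cost_fn=5.0)
```

With equal costs the stump picks a high threshold and misses one spam host. When a missed spam host costs five times as much as a false alarm, the threshold drops so that every spam host is caught at the price of two false positives.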

12
Smoothing
  • Hosts are now labeled as spam or non-spam by the
    base classifier.
  • Smoothing improves on the base classifier's
    predictions.
  • A few smoothing techniques are:
  • Clustering
  • Propagation
  • Stacked graphical learning.

13
Smoothing (Cont.)
  • Based on the topological dependencies of spam
    nodes.
  • Links are not placed at random.
  • Similar pages tend to link to each other more
    frequently than dissimilar pages.
  • Or, put another way:
  • Spam tends to be clustered on the Web.
  • Non-spam nodes tend to be linked to by very few
    spam nodes, and usually link to no spam nodes.
  • Spam nodes are mainly linked to by spam nodes.

15
Smoothing - Clustering
  • Split the graph into many clusters.
  • Use the METIS graph clustering algorithm.
  • If the majority of hosts in a cluster are labeled
    spam, then all hosts in that cluster are labeled
    spam.
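
The majority rule can be sketched as below. The cluster assignment is assumed to be already available, for example from a graph partitioner such as METIS; it is not computed here, and the function name `smooth_by_cluster` is hypothetical.

```python
from collections import defaultdict


def smooth_by_cluster(predictions, cluster_of):
    """Relabel every host in a spam-majority cluster as spam.

    predictions: {host: 0 or 1} from the base classifier (1 = spam).
    cluster_of:  {host: cluster_id}, assumed to be precomputed.
    Returns a new {host: 0 or 1} mapping.
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)
    smoothed = dict(predictions)
    for hosts in members.values():
        spam_count = sum(predictions[host] for host in hosts)
        if spam_count * 2 > len(hosts):  # strict majority is spam
            for host in hosts:
                smoothed[host] = 1
    return smoothed


base_predictions = {"a": 1, "b": 1, "c": 0, "d": 0, "e": 1, "f": 0}
clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
smoothed = smooth_by_cluster(base_predictions, clusters)
```

Host `c` sits in a cluster where two of three hosts are predicted spam, so it is relabeled; the second cluster has no spam majority and is left unchanged.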

16
Smoothing - Propagation
  • Propagate predictions using random walks.
  • Start from nodes labeled as spam by the base
    classifier, then follow links forward or backward.
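
One common way to realize this idea is a random walk with restart, where the walker always restarts at a node the base classifier labeled spam. The sketch below assumes that formulation; the damping factor, the iteration count, and the function name are illustrative choices and are not taken from the slides.

```python
def propagate_spamicity(graph, spam_seeds, damping=0.85, iterations=50,
                        backward=False):
    """Random walk with restart from the spam-labeled nodes.

    graph:      {node: [out_neighbor, ...]} (assumed input format).
    spam_seeds: non-empty set of nodes labeled spam by the base classifier.
    backward:   if True, follow links in the reverse direction.
    Returns {node: score}; a higher score means closer to the spam seeds.
    """
    nodes = set(graph) | set(spam_seeds)
    for targets in graph.values():
        nodes.update(targets)
    if backward:
        reversed_graph = {node: [] for node in nodes}
        for node, targets in graph.items():
            for target in targets:
                reversed_graph[target].append(node)
        graph = reversed_graph
    restart = {
        node: (1.0 / len(spam_seeds) if node in spam_seeds else 0.0)
        for node in nodes
    }
    score = dict(restart)
    for _ in range(iterations):
        new_score = {node: (1.0 - damping) * restart[node] for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * score[node] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # Dead end: the walker jumps back to the spam seeds.
                for seed in spam_seeds:
                    new_score[seed] += damping * score[node] / len(spam_seeds)
        score = new_score
    return score


web = {
    "s1": ["s2", "x"], "s2": ["s1"], "x": ["y"], "y": [],
    "p": ["s1"], "n1": ["n2"], "n2": ["n1"],
}
forward_scores = propagate_spamicity(web, {"s1"})
backward_scores = propagate_spamicity(web, {"s1"}, backward=True)
```

Walking forward scores the hosts that the spam seed links to (`x`, `y`), while walking backward scores the hosts that link to it (`p`). Hosts unreachable in either direction (`n1`, `n2`) keep a score of zero. The scores would then be thresholded or fed to a classifier.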

17
Smoothing - Stacked Graphical Learning
  • It is a machine learning process.
  • It creates extra features in addition to the
    content-based and link-based ones.
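
A minimal sketch of how such an extra feature might be built: for each host, take the average base-classifier prediction over its linked hosts and append it to the host's feature vector, so that a second classifier can be trained on the extended vectors. The function name and the choice of the plain average are assumptions for illustration.

```python
def add_neighbor_feature(features, predictions, neighbors):
    """Append the mean neighbor prediction to each host's feature vector.

    features:    {host: [feature values]} used by the base classifier.
    predictions: {host: predicted spamicity}, e.g. 0 or 1.
    neighbors:   {host: [linked hosts]} (in-links, out-links, or both).
    Returns {host: original features + [average neighbor prediction]}.
    """
    stacked = {}
    for host, vector in features.items():
        linked = neighbors.get(host, [])
        if linked:
            average = sum(predictions[n] for n in linked) / len(linked)
        else:
            average = 0.0  # assumed default for hosts without neighbors
        stacked[host] = list(vector) + [average]
    return stacked


base_features = {"a": [0.2, 3], "b": [0.9, 1], "c": [0.8, 7]}
base_predictions = {"a": 0, "b": 1, "c": 0}
links = {"a": ["b", "c"], "b": ["c"], "c": []}
stacked_features = add_neighbor_feature(base_features, base_predictions, links)
```

Retraining the classifier on `stacked_features` lets each host's label depend on what was predicted for its neighbors, which is where the smoothing effect comes from.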

18
Conclusion
  • Based on the assumption that spammers tend to be
    linked together.
  • Using both link-based and content-based features
    enhances the detection quality.
  • It can be used on web datasets of any size.
  • The paper does not explain each step very well.

19
Useful Reading
  • Using Spam Farm to Boost PageRank by Ye Du,
    Yaoyun Shi, Xin Zhao
  • Using Annotations in Enterprise Search by Pavel
    A. Dmitriev, Nadav Eiron, Marcus Fontoura, Eugene
    Shekita

20
Questions?