Title: DiffusionRank: A Possible Penicillin for Web Spamming


1
DiffusionRank: A Possible Penicillin for Web Spamming
  • Haixuan Yang
  • Group Meeting
  • Jan. 16, 2006.

2
Outline
  • Introduction
  • DiffusionRank
  • Model Establishment
  • Computation consideration
  • Discussion on γ
  • Results
  • Conclusions

3
Introduction
  • PageRank
  • Tries to find the importance of a Web page based
    on the link structure.
  • The importance of a page i is defined recursively
    in terms of pages which point to it
  • It proves to be effective for ranking Web pages.
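
The recursive definition above can be sketched as a power iteration. This is a minimal illustration, not code from the presentation: the damping factor 0.85, the handling of pages without out-links, and the four-page graph `toy` are all assumptions.

```python
def pagerank(out_links, alpha=0.85, iters=100):
    """Power iteration on a dict {page: [pages it links to]}.

    alpha (damping factor) is an assumed value, not taken from the slides.
    """
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page receives a small uniform share (random jump).
        new = {p: (1.0 - alpha) / n for p in pages}
        for p in pages:
            # A page without out-links spreads its vote over all pages.
            targets = out_links[p] or pages
            share = alpha * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank


# Hypothetical graph: "c" is pointed to by three pages, "d" by none.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy)
```

In this toy graph the page with the most in-links, "c", ends up with the highest score, and "d", which nobody links to, with the lowest.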

4
Introduction
  • PageRank
  • Two problems
  • The incomplete information about the Web
    structure.
  • Solution: predict the Web structure as a random
    graph.
  • Web pages are manipulated by people for
    commercial interests.
  • About 70% of all pages in the .biz domain are
    spam.
  • About 35% of the pages in the .us domain belong
    to the spam category.
  • Two methods used for manipulating spam pages
  • Link stuffing
  • Keyword stuffing
  • Solution: DiffusionRank

5
An example for manipulation
The rank value of node 1 can be increased greatly!
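
The figure for this slide is not part of the transcript, so the following is only a hypothetical illustration of link stuffing: five made-up spam pages (5 to 9) are added, each linking to node 1. The graph, the damping factor 0.85, and the `pagerank` helper are assumptions.

```python
def pagerank(out_links, alpha=0.85, iters=100):
    """Plain PageRank power iteration; every page here has out-links."""
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - alpha) / n for p in pages}
        for p in pages:
            share = alpha * rank[p] / len(out_links[p])
            for q in out_links[p]:
                new[q] += share
        rank = new
    return rank


# Hypothetical base graph: nobody links to node 1.
base = {1: [2], 2: [3], 3: [2], 4: [2]}

# Link stuffing: five spam pages are created only to point at node 1.
stuffed = dict(base)
for spam_page in range(5, 10):
    stuffed[spam_page] = [1]

before = pagerank(base)[1]
after = pagerank(stuffed)[1]
print(f"rank of node 1: {before:.4f} -> {after:.4f}")
```

Even though the total score is now shared among nine pages instead of four, node 1 more than doubles its score in this sketch.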
6
Why?
  • Two reasons
  • Over-democratic
  • All pages are born equal: every page has equal
    voting ability, since the sum of each column is
    equal to one.
  • Input-independent
  • For any given non-zero initial input, the
    iteration will converge to the same stable
    distribution.
  • Heat Diffusion Model -- a natural way to avoid
    these two factors
  • Pages are not equal as some pages are born with
    high temperatures while others are born with low
    temperatures.
  • Different initial temperature distributions will
    give rise to different temperature distributions
    after a fixed time period.
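
The two PageRank properties named above can be checked numerically. The sketch below is an illustration under assumptions (a made-up four-page graph, damping factor 0.85): it builds a column-stochastic matrix, confirms that every column sums to one, and shows that two different starting vectors end at the same distribution.

```python
def google_matrix(out_links, n, alpha=0.85):
    """Column j spreads page j's vote over its out-links, plus a uniform jump."""
    m = [[(1.0 - alpha) / n for _ in range(n)] for _ in range(n)]
    for j, targets in out_links.items():
        for i in targets:
            m[i][j] += alpha / len(targets)
    return m


def iterate(m, x, steps=200):
    """Repeatedly apply the matrix to the vector x."""
    n = len(x)
    for _ in range(steps):
        x = [sum(m[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x


# Hypothetical graph; every page has at least one out-link.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
g = google_matrix(links, 4)

# Over-democratic: each page hands out exactly one unit of vote.
column_sums = [sum(g[i][j] for i in range(4)) for j in range(4)]

# Input-independent: two different starts give the same result.
from_uniform = iterate(g, [0.25, 0.25, 0.25, 0.25])
from_skewed = iterate(g, [1.0, 0.0, 0.0, 0.0])
```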

7
DiffusionRank
  • On an undirected graph
  • Assumption: the amount of heat flow from j to i
    is proportional to the heat difference between
    i and j.
  • Solution
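
The closed-form solution shown on the slide is not preserved in this transcript. As a stand-in, the stated assumption can be simulated directly: over a short time step, each edge moves heat in proportion to the temperature difference of its endpoints. The step count, the conductivity name `gamma`, and the four-node path graph are assumptions made for this sketch.

```python
def diffuse(edges, f0, gamma=1.0, steps=1000):
    """Explicit Euler integration of df_i/dt = gamma * sum_j (f_j - f_i), t in [0, 1]."""
    f = list(f0)
    dt = 1.0 / steps
    for _ in range(steps):
        flow = [0.0] * len(f)
        for i, j in edges:
            # Heat moves from the hotter endpoint to the colder one.
            delta = gamma * (f[j] - f[i]) * dt
            flow[i] += delta
            flow[j] -= delta
        f = [fi + d for fi, d in zip(f, flow)]
    return f


# Hypothetical undirected path 0 - 1 - 2 - 3.
path = [(0, 1), (1, 2), (2, 3)]
hot_left = diffuse(path, [1.0, 0.0, 0.0, 0.0])
hot_right = diffuse(path, [0.0, 0.0, 0.0, 1.0])
```

Unlike the PageRank iteration, the result after a fixed time depends on where the heat started: `hot_left` is still warmest at node 0, while `hot_right` is its mirror image.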

8
DiffusionRank
  • On a directed graph
  • Assumption: extra energy is imposed on the link
    (j, i) such that heat flows only from j to i if
    there is no link (i, j).
  • Solution
  • On a random directed graph
  • Assumption: the heat flow is proportional to the
    probability of the link (j, i).
  • Solution

9
DiffusionRank
  • On a random directed graph
  • Solution
  • The initial value f(i, 0) in f(0) is set to 1 if
    page i is trusted and 0 otherwise; trusted pages
    are chosen according to the inverse PageRank.
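
One way to read the initial-value rule is sketched below. Inverse PageRank is taken to mean PageRank computed on the graph with every link reversed, and the top `k` pages are simply treated as trusted; that shortcut, the damping factor, and the toy graph are assumptions of this sketch, not details from the slides.

```python
def pagerank(out_links, alpha=0.85, iters=100):
    """Power iteration; pages without out-links spread their vote evenly."""
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - alpha) / n for p in pages}
        for p in pages:
            targets = out_links[p] or pages
            share = alpha * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank


def inverse_pagerank(out_links):
    """PageRank on the reversed graph: pages with many out-links score high."""
    reversed_links = {p: [] for p in out_links}
    for p, targets in out_links.items():
        for q in targets:
            reversed_links[q].append(p)
    return pagerank(reversed_links)


def initial_heat(out_links, k):
    """f(0): 1 for the k pages with the highest inverse PageRank, else 0."""
    scores = inverse_pagerank(out_links)
    seeds = set(sorted(scores, key=scores.get, reverse=True)[:k])
    return {p: (1.0 if p in seeds else 0.0) for p in out_links}


# Hypothetical graph in which "hub" links out to every other page.
web = {"hub": ["a", "b", "c"], "a": ["b"], "b": ["c"], "c": ["a"]}
f0 = initial_heat(web, k=1)
```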

10
Computation consideration
  • Approximation of heat kernel
  • N = ?
  • When N > 30, the real eigenvalues of [...]
    are less than 0.01;
  • when N > 100, they are less than 0.005.
  • We use N = 100 in the paper.
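
The expression whose eigenvalues are quoted above is missing from the transcript. A standard way to approximate a heat kernel exp(γR) is the N-step product (I + (γ/N)R)^N, which fits the slide's remark about N tending to infinity; the sketch below assumes that form, takes R = P - I for a made-up column-stochastic matrix P, and compares the product against a truncated Taylor series.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def euler_kernel(r, f0, gamma, n_steps):
    """Apply (I + gamma/N * R)^N to f0 by N repeated small steps."""
    f = list(f0)
    for _ in range(n_steps):
        rf = matvec(r, f)
        f = [x + gamma / n_steps * y for x, y in zip(f, rf)]
    return f


def taylor_kernel(r, f0, gamma, terms=60):
    """Reference value of exp(gamma * R) f0 from a truncated Taylor series."""
    total = list(f0)
    term = list(f0)
    for k in range(1, terms):
        term = [gamma * y / k for y in matvec(r, term)]
        total = [t + x for t, x in zip(total, term)]
    return total


# Hypothetical 4-page graph: each column of P sums to one.
p = [
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
r = [[p[i][j] - (1.0 if i == j else 0.0) for j in range(4)] for i in range(4)]
f0 = [0.0, 0.0, 0.0, 1.0]

exact = taylor_kernel(r, f0, 1.0)
errors = {
    n: max(abs(a - b) for a, b in zip(euler_kernel(r, f0, 1.0, n), exact))
    for n in (30, 100)
}
```

The largest entry-wise error shrinks as N grows, which is the trade-off behind picking a fixed N such as 100: each extra step costs one more matrix-vector product.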

When N tends to infinity
11
Discussion on γ
  • γ can be understood as the thermal conductivity.
  • When γ = 0, the ranking value is most robust to
    manipulation since no heat is diffused, but the
    Web structure is completely ignored.
  • When γ = ∞, DiffusionRank becomes PageRank and
    can be manipulated easily.
  • When γ = 1, DiffusionRank works well in practice.
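
The two extremes of the conductivity can be reproduced on a toy example. The sketch below is an assumption-laden illustration: heat is diffused with the matrix G - I, where G is a damped, column-stochastic link matrix (damping 0.85), the diffusion is approximated with small Euler steps, and page 3 is a made-up trusted seed.

```python
def google_matrix(out_links, n, alpha=0.85):
    """Damped column-stochastic link matrix (alpha is an assumed value)."""
    m = [[(1.0 - alpha) / n for _ in range(n)] for _ in range(n)]
    for j, targets in out_links.items():
        for i in targets:
            m[i][j] += alpha / len(targets)
    return m


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def diffusion_rank(m, f0, gamma, steps_per_unit=100):
    """Approximate exp(gamma * (M - I)) f0 with small Euler steps."""
    n_steps = max(1, int(gamma * steps_per_unit))
    f = list(f0)
    for _ in range(n_steps):
        mf = matvec(m, f)
        f = [x + gamma / n_steps * (y - x) for x, y in zip(f, mf)]
    return f


def pagerank(m, n, iters=200):
    f = [1.0 / n] * n
    for _ in range(iters):
        f = matvec(m, f)
    return f


links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # hypothetical graph
g = google_matrix(links, 4)
f0 = [0.0, 0.0, 0.0, 1.0]  # page 3 is the only trusted seed

frozen = diffusion_rank(g, f0, 0.0)          # no diffusion at all
moderate = diffusion_rank(g, f0, 1.0)        # value used in practice
near_pagerank = diffusion_rank(g, f0, 100.0)  # very large conductivity
pr = pagerank(g, 4)
```

With zero conductivity the scores are just the seed vector, so links have no effect; with a very large conductivity the seed is forgotten and the scores match PageRank; at γ = 1 the seed page still holds far more heat than its PageRank score.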

12
DiffusionRank
  • Advantages
  • Can detect group-group relations
  • Can cut graphs
  • Anti-manipulation

[Figure labels: γ = 0.5 or 1; values 1 and -1]
13
DiffusionRank
  • Experiments
  • Data
  • a toy graph (6 nodes)
  • a middle-size real-world graph (18542 nodes)
  • a large-size real-world graph crawled from CUHK
    (607170 nodes)
  • Compare with TrustRank and PageRank

14
Results
  • The tendency of DiffusionRank when γ becomes
    larger
  • On the toy graph

15
Anti-manipulation on the toy graph

16
Anti-manipulation on the middle-size graph and the
large graph

17
Stability: the order difference between the ranking
results of an algorithm before and after it is
manipulated
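
The slide does not spell out how the order difference is computed. One plausible reading, assumed here, is the number of page pairs whose relative order flips between the two rankings; the function name and the scores below are made up for illustration.

```python
from itertools import combinations


def order_difference(scores_before, scores_after):
    """Count page pairs whose relative order flips between two score dicts."""
    shared = sorted(set(scores_before) & set(scores_after))
    flips = 0
    for p, q in combinations(shared, 2):
        before = scores_before[p] - scores_before[q]
        after = scores_after[p] - scores_after[q]
        # Opposite signs mean the pair swapped places.
        if before * after < 0:
            flips += 1
    return flips


# Hypothetical scores: "a"/"b" swap places, and so do "c"/"d".
before = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
after = {"a": 0.3, "b": 0.4, "c": 0.1, "d": 0.2}
flipped_pairs = order_difference(before, after)
```

Under this reading, a smaller count after manipulation means a more stable ranking algorithm.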
18
Conclusions
  • This anti-manipulation feature enables
    DiffusionRank to be a candidate as a penicillin
    for Web spamming.
  • DiffusionRank is a generalization of PageRank
    (when γ = ∞).
  • DiffusionRank can be employed to detect
    group-group relations.
  • DiffusionRank can be used to cut graphs.