PageSim: A Link-based Similarity Measure for the World Wide Web


1
PageSim: A Link-based Similarity Measure for the
World Wide Web
  • Zhenjiang Lin, Irwin King, and Michael R. Lyu
  • Computer Science and Engineering,
  • The Chinese University of Hong Kong
  • 20 Dec 2006

2
Outline
  • 1. Introduction
  • 2. Related Work
  • 3. PageSim
  • 4. Experimental Results
  • 5. Conclusion and Future Work

3
1. Introduction
  • Background
  • Similarity measures are required in many web
    applications to evaluate the similarity between
    web pages.
  • The "similar pages" service of web search
    engines
  • Web document classification
  • Web community identification.

4
1. Introduction
  • Similarity measures
  • Evaluate how similar or related two objects
    are.
  • Approaches to measuring similarity
  • Text-based
  • Cosine TFIDF [Joachims97]
  • Link-based
  • Bibliographic coupling [Kessler63]
  • Co-citation [Small73]
  • SimRank [Jeh et al 02], PageSim [Lin et al 06]
  • Hybrid

5
1. Introduction
  • Problem
  • How to evaluate similarity between web pages
    based purely on the structural information of the Web?
  • Motivation
  • Develop an effective link-based similarity
    measure for the World Wide Web.
  • Contributions
  • Propose a novel link-based similarity measure,
    PageSim.
  • More flexible and accurate

6
2. Related Work
  • What is hidden in hyperlinks?
  • (1) Similarity relationships between pages;
  • (2) The similarity relationship decreases along
    hyperlinks.

7
2. Related Work
  • Intuition of similarity
  • Similar web pages have similar neighbors (to
    compare two web pages, look at their neighbors).
  • Notations
  • G = (V, E), |V| = n: the web graph.
  • I(a) / O(a): in-link / out-link neighbors of web
    page a.
  • path(a_1, a_s): a sequence of vertices a_1, a_2, ...,
    a_s such that (a_i, a_{i+1}) ∈ E (i = 1, ..., s-1) and
    the a_i are distinct.
  • PATH(a, b): the set of all possible paths from
    page a to page b.
  • Sim(a, b): similarity score of web pages a and b.
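
The notation above maps directly onto a small adjacency structure. The following is a minimal sketch, assuming the web graph is given as a list of (source, target) hyperlink pairs; the function names `build_neighbors` and `all_paths` and the four-page example graph are illustrative and do not come from the slides.

```python
from collections import defaultdict

def build_neighbors(edges):
    """Return (in_nb, out_nb): the I(.) and O(.) neighbor sets keyed by page."""
    in_nb, out_nb = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_nb[src].add(dst)   # dst is an out-link neighbor of src
        in_nb[dst].add(src)    # src is an in-link neighbor of dst
    return in_nb, out_nb

def all_paths(out_nb, a, b):
    """Enumerate PATH(a, b): all paths from a to b whose vertices are distinct."""
    paths = []

    def dfs(node, path):
        if node == b:
            paths.append(path)
            return
        for nxt in sorted(out_nb.get(node, ())):
            if nxt not in path:          # keep the vertices distinct
                dfs(nxt, path + [nxt])

    dfs(a, [a])
    return paths

# Hypothetical four-page graph.
edges = [("a", "b"), ("a", "c"), ("c", "b"), ("b", "d")]
in_nb, out_nb = build_neighbors(edges)
paths_a_b = all_paths(out_nb, "a", "b")
print(paths_a_b)
```

Enumerating every simple path is exponential in the worst case, so on a real web graph a path-length limit would likely be needed.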

8
2. Related Work
  • Two classical methods
  • Co-citation: the more common in-link neighbors,
    the more similar.
  • Sim(a, b) = |I(a) ∩ I(b)|
  • Bibliographic coupling: the more common out-link
    neighbors, the more similar.
  • Sim(a, b) = |O(a) ∩ O(b)|
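
Both classical measures reduce to the size of a set intersection. A minimal sketch, assuming the neighbor sets are stored in plain dictionaries; the page names and links are made up for illustration.

```python
def co_citation(in_nb, a, b):
    """Sim(a, b) = |I(a) ∩ I(b)|: number of pages that link to both a and b."""
    return len(in_nb.get(a, set()) & in_nb.get(b, set()))

def bibliographic_coupling(out_nb, a, b):
    """Sim(a, b) = |O(a) ∩ O(b)|: number of pages that both a and b link to."""
    return len(out_nb.get(a, set()) & out_nb.get(b, set()))

# Hypothetical neighbor sets.
in_nb = {"a": {"x", "y"}, "b": {"y", "z"}}
out_nb = {"a": {"p", "q"}, "b": {"p", "q", "r"}}

cc = co_citation(in_nb, "a", "b")
bc = bibliographic_coupling(out_nb, "a", "b")
print(cc, bc)
```

Both scores depend only on direct neighbors, so two pages with no common in-link or out-link neighbor always score 0.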

9
2. Related Work
  • SimRank
  • Two pages are similar if they are linked to by
    similar pages.
  • (1) Sim(u, u) = 1; (2) Sim(u, v) = 0 if |I(u)| |I(v)| = 0.
  • Recursive definition:
    Sim(u, v) = C / (|I(u)| |I(v)|) · Σ_{a ∈ I(u)} Σ_{b ∈ I(v)} Sim(a, b)
  • C is a constant between 0 and 1.
  • The iteration starts with Sim(u, u) = 1 and Sim(u, v) = 0
    if u ≠ v.
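
The SimRank iteration can be sketched as a naive fixed-point computation. This is a minimal illustration, assuming the standard recursion of Jeh and Widom; the value C = 0.8, the iteration count, and the three-page graph are assumptions chosen for the example, not values from the slides.

```python
def simrank(nodes, in_nb, C=0.8, iterations=10):
    """Naive SimRank: repeatedly average the similarity of in-link neighbors."""
    # Start with Sim(u, u) = 1 and Sim(u, v) = 0 for u != v.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                    continue
                iu, iv = in_nb.get(u, ()), in_nb.get(v, ())
                if not iu or not iv:     # a page without in-links has score 0
                    new[u][v] = 0.0
                    continue
                total = sum(sim[a][b] for a in iu for b in iv)
                new[u][v] = C * total / (len(iu) * len(iv))
        sim = new
    return sim

# Hypothetical graph: page x links to both a and b.
nodes = ["x", "a", "b"]
in_nb = {"a": {"x"}, "b": {"x"}}
sim = simrank(nodes, in_nb)
print(sim["a"]["b"])
```

This all-pairs version needs quadratic memory and is only practical for small graphs.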

10
3. PageSim
  • Intuition behind PageSim
  • Similar pages have similar neighbors (both direct
    and indirect).
  • Strategies in PageSim
  • (a) Each web page contains unique feature
    information and propagates this information to
    its multi-hop neighbors.
  • (b) Important web pages contain more feature
    information, which can be represented by any
    global scoring system,
  • e.g., PageRank scores or authority scores of HITS.
  • (c) Two web pages are more similar, if they share
    more common feature information.

11
3. PageSim
  • PageSim (phase 1: feature propagation)
  • Initially, each web page contains unique
    feature information, which is represented by its
    PageRank (PR) score.
  • The feature information of a web page is
    propagated along out-link hyperlinks at decay
    rate d. The PR score of u propagated to v is
    defined by
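
The propagation formula itself is not preserved in this transcript. As a rough sketch only, the code below assumes that at every hop a page passes a fraction d of the score it received, split evenly among its out-links, and that contributions are summed over all paths with distinct vertices up to a hypothetical length limit `max_hops`. The PR values and the graph are invented for illustration; the authors' actual definition may differ.

```python
def propagate(out_nb, pr, d=0.5, max_hops=3):
    """Return fv[v][u]: the share of page u's PR score that reaches page v.

    ASSUMPTION: each hop multiplies the score by d / |O(w)|; this is one
    plausible reading of "propagated at decay rate d", not the slide's formula.
    """
    fv = {v: {v: score} for v, score in pr.items()}  # a page keeps its own feature

    def walk(source, node, amount, visited, hops):
        targets = out_nb.get(node, ())
        if not targets or hops == max_hops:
            return
        share = d * amount / len(targets)
        for nxt in targets:
            if nxt in visited:           # paths must have distinct vertices
                continue
            fv[nxt][source] = fv[nxt].get(source, 0.0) + share
            walk(source, nxt, share, visited | {nxt}, hops + 1)

    for u in pr:
        walk(u, u, pr[u], {u}, 0)
    return fv

# Hypothetical graph and PR scores.
out_nb = {"a": ["b", "c"], "b": ["c"]}
pr = {"a": 0.4, "b": 0.3, "c": 0.3}
fv = propagate(out_nb, pr)
print(fv["c"])
```

In this example page c receives feature information from a along two paths (a to c directly, and a to b to c), and the two contributions are added.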

12
3. PageSim
  • PageSim (phase 2: similarity computation)
  • A web page v stores its own feature information
    and that of other pages in its feature vector FV(v).
  • The similarity between web pages u and v is
    computed by the Jaccard measure [Jain et al 88].
  • Intuition: the more common feature information
    two web pages share, the more similar they are.
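
One common way to apply the Jaccard measure to real-valued feature vectors is the weighted form, the sum of element-wise minima divided by the sum of element-wise maxima. The sketch below assumes that form and represents each feature vector as a dictionary from source page to propagated score; the exact expression used on the slide is not preserved here, and the example vectors are hypothetical.

```python
def jaccard_similarity(fv_u, fv_v):
    """Weighted Jaccard: sum of minima over sum of maxima of two feature vectors."""
    keys = set(fv_u) | set(fv_v)
    shared = sum(min(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    total = sum(max(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    return shared / total if total else 0.0

# Hypothetical feature vectors of two pages b and c.
fv_b = {"b": 0.3, "a": 0.1}
fv_c = {"c": 0.3, "a": 0.15, "b": 0.15}

score = jaccard_similarity(fv_b, fv_c)
print(score)
```

The score lies in [0, 1], equals 1 for identical vectors, and grows as the two pages hold more feature information in common.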

13
3. PageSim
  • Case study: Sim(a, b)
  • CC: Co-citation
  • BC: Bibliographic Coupling
  • SR: SimRank
  • PS: PageSim
  • PageSim is more flexible, since it is able to
    handle more cases.

14
4. Experimental Results
  • Datasets
  • CSE Web (CW) dataset
  • A set of web pages crawled from
    http://cse.cuhk.edu.hk.
  • 22,000 pages, 180,000 hyperlinks.
  • The average numbers of in-links and out-links are
    8.6 and 7.7, respectively.
  • Google Scholar (GS) dataset
  • A set of articles crawled from the Google Scholar
    search engine.
  • Crawling starts by submitting "web mining"
    keywords to GS and then collects articles by
    following the "Cited by" hyperlinks.
  • 20,000 articles, 154,000 citations.

15
4. Experimental Results
  • Evaluation Methods
  • Cosine TFIDF similarity (for the CW dataset)
  • A commonly used text-based similarity measure.
  • Related Articles (for the GS dataset)
  • A list of articles related to a query article,
    provided by GS.
  • Can be used as ground truth.
  • Experiments
  • Testing the decay factor of PageSim
  • Evaluating the performance of the algorithms:
  • CC: Co-citation, BC: Bibliographic Coupling,
  • SR: SimRank, PS: PageSim.
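
For reference, cosine TFIDF similarity between two pages can be sketched in a few lines. The weighting below (raw term frequency times log(N / document frequency)) is one common variant and is an assumption; the slides do not say which variant was used. The tokenized documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf * idf} dictionary."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical tokenized pages.
docs = [["web", "link", "graph"], ["web", "link", "rank"], ["text", "mining"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Pages that share no terms score 0, and a page compared with itself scores 1, which is what makes the measure usable as a text-based reference for the link-based scores.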

16
4. Experimental Results
  • Results on the Decay Factor of PageSim
  • CW data (left): x-axis: decay factor d;
    y-axis: average cosine TFIDF of all pages.
  • GS data (right): x-axis: decay factor d;
    y-axis: average precision of all pages.

17
4. Experimental Results
  • Performance Evaluation of Algorithms
  • CW data (left): x-axis: decay factor d;
    y-axis: average cosine TFIDF of all pages.
  • GS data (right): x-axis: decay factor d;
    y-axis: average precision of all pages.

18
5. Conclusion and Future Work
  • Conclusion
  • Link-based similarity measures
  • Bibliographic coupling, Co-citation, and SimRank
  • PageSim
  • Feature information propagation
  • The more common feature information, the more
    similar
  • Experiments
  • Future Work
  • Testing on more datasets.
  • Integrating link-based with text-based similarity measures.