Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure


1
Extending Link-based Algorithms for Similar Web
Pages with Neighborhood Structure
  • Allen, Zhenjiang LIN
  • CSE, CUHK
  • 13 Dec 2006

2
Outline
  • 1. Introduction
  • 2. Extended Neighborhood Structure Model
  • 3. Extending Link-based Similarity Measures
  • 4. Experimental Results
  • 5. Conclusion and Future Work

3
1. Introduction
  • Background
  • Similarity measures are required in many web
    applications to evaluate the similarity between
    web pages.
  • The similar pages service of Web search
    engines
  • Web document classification
  • Web community identification.
  • Problem
  • Many link-based similarity measures have limited
    accuracy because they consider only part of the
    structural information.

4
1. Introduction
  • Motivation
  • How to improve the accuracy of link-based
    similarity measures by making full use of the
    structural information?
  • Contributions
  • Propose the Extended Neighborhood Structure (ENS)
    model.
  • bi-direction
  • multi-hop
  • Construct extended link-based similarity measures
    based on the ENS model.
  • more flexible and accurate

5
1. Introduction
  • Searching the Web
  • Keyword searching
  • Similarity searching

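[Figure: keyword searching vs. similarity searching. Submitting the keyword "news" to a search engine returns http://news.bbc.co.uk/ and http://www.cnn.com/; submitting the URL www.cnn.com returns http://news.bbc.co.uk/ and http://usnews.com/.]
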
6
1. Introduction
  • Similarity measures
  • Evaluate how similar or related two objects
    are.
  • Approaches to measuring similarity
  • Text-based
  • Cosine TFIDF [Joachims97]
  • Link-based
  • Bibliographic coupling [Kessler63]
  • Co-citation [Small73]
  • SimRank [Jeh et al 02], PageSim [Lin et al 06]
  • Hybrid

7
2. Extended Neighborhood Structure Model
  • Extended Neighborhood Structure (ENS) model
  • Question: what is hidden in hyperlinks?
  • A similarity relationship between pages.
  • The similarity relationship decreases along hyperlinks.

8
2. Extended Neighborhood Structure Model
  • Extended Neighborhood Structure (ENS) model
  • The ENS model
  • bi-direction
  • in-link
  • out-link
  • multi-hop
  • direct (1-hop)
  • indirect (2-hop, 3-hop, etc)
  • Purpose
  • Improve accuracy of link-based similarity
    measures by helping them make full use of the
    structural information of the Web.

9
3. Extending Link-based Similarity Measures
  • Intuition of similarity
  • Similar web pages have similar neighbors (to
    compare two web pages, look at their neighbors).
  • Notations
  • $G = (V, E)$, $|V| = n$: the web graph.
  • $I(a)$ / $O(a)$: in-link / out-link neighbors of web
    page $a$.
  • $path(a_1, a_s)$: a sequence of vertices $a_1, a_2, \ldots, a_s$
    such that $(a_i, a_{i+1}) \in E$ ($i = 1, \ldots, s-1$) and the
    $a_i$ are distinct.
  • $PATH(a, b)$: the set of all possible paths from
    page $a$ to page $b$.
  • $Sim(a, b)$: similarity score of web pages $a$ and $b$.
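
The notation above maps onto a small data structure. The following is a minimal Python sketch, not the authors' implementation: the function names `build_neighbors` and `all_paths`, the `max_hops` cutoff, and the toy edge list are all assumptions made for illustration.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Return (in_nb, out_nb): the I(.) and O(.) neighbor sets of every page."""
    in_nb, out_nb = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_nb[src].add(dst)  # src links to dst, so dst is in O(src)
        in_nb[dst].add(src)   # dst is linked to by src, so src is in I(dst)
    return in_nb, out_nb


def all_paths(out_nb, a, b, max_hops=3):
    """Enumerate PATH(a, b): paths with distinct vertices, at most max_hops links.

    max_hops is a hypothetical cutoff to keep the enumeration small.
    """
    found = []

    def walk(node, path):
        if node == b and len(path) > 1:
            found.append(list(path))
            return
        if len(path) - 1 >= max_hops:
            return
        for nxt in sorted(out_nb.get(node, ())):
            if nxt not in path:  # vertices on a path must be distinct
                path.append(nxt)
                walk(nxt, path)
                path.pop()

    walk(a, [a])
    return found


# Toy web graph (hypothetical): a -> b, a -> c, c -> b, b -> d
edges = [("a", "b"), ("a", "c"), ("c", "b"), ("b", "d")]
in_nb, out_nb = build_neighbors(edges)
paths_ab = all_paths(out_nb, "a", "b")
```

On this toy graph, `in_nb["b"]` holds the two pages linking to `b`, and `paths_ab` contains the direct (1-hop) path and the indirect (2-hop) path through `c`.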

10
3. Extending Link-based Similarity Measures
  • Two classical methods
  • Co-citation: the more common in-link neighbors,
    the more similar.
  • $Sim(a,b) = |I(a) \cap I(b)|$
  • Bibliographic coupling: the more common out-link
    neighbors, the more similar.
  • $Sim(a,b) = |O(a) \cap O(b)|$
  • Extended Co-citation and Bibliographic Coupling
    (ECBC)
  • ECBC: the more common neighbors, the more
    similar.
  • $Sim(a,b) = \alpha |I(a) \cap I(b)| + (1-\alpha) |O(a) \cap O(b)|$,
    where $0 \le \alpha \le 1$ is a constant.
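
The three measures on this slide can be sketched in a few lines of Python. This is an illustration under assumptions: neighbor sets are passed in as plain dictionaries, the function names are invented here, and the default weight of 0.5 is a placeholder rather than a value taken from the slides.

```python
def co_citation(in_nb, a, b):
    """Number of common in-link neighbors of a and b."""
    return len(in_nb.get(a, set()) & in_nb.get(b, set()))


def bibliographic_coupling(out_nb, a, b):
    """Number of common out-link neighbors of a and b."""
    return len(out_nb.get(a, set()) & out_nb.get(b, set()))


def ecbc(in_nb, out_nb, a, b, alpha=0.5):
    """Weighted mix of the two counts; alpha must lie in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return (alpha * co_citation(in_nb, a, b)
            + (1.0 - alpha) * bibliographic_coupling(out_nb, a, b))


# Hypothetical neighbor sets for two pages a and b
in_nb = {"a": {"x", "y"}, "b": {"y", "z"}}
out_nb = {"a": {"p", "q"}, "b": {"p", "q", "r"}}

cc = co_citation(in_nb, "a", "b")              # one shared in-link neighbor (y)
bc = bibliographic_coupling(out_nb, "a", "b")  # two shared out-link neighbors (p, q)
mixed = ecbc(in_nb, out_nb, "a", "b", alpha=0.5)
```

Setting the weight to 1 recovers Co-citation and setting it to 0 recovers Bibliographic coupling, which is why ECBC can be read as a bi-directional generalization of both.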

11
3. Extending Link-based Similarity Measures
  • SimRank
  • two pages are similar if they are linked to by
    similar pages
  • (1) $Sim(u,u) = 1$; (2) $Sim(u,v) = 0$ if
    $|I(u)|\,|I(v)| = 0$.
  • Recursive definition
  • C is a constant between 0 and 1.
  • The iteration starts with $Sim(u,u) = 1$, $Sim(u,v) = 0$
    if $u \ne v$.
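
The recursive formula itself did not survive in this transcript. As a hedged illustration, the sketch below uses the standard SimRank recurrence published by Jeh and Widom, in which the similarity of two distinct pages is C times the average similarity over all pairs of their in-link neighbors. The function name, the value C = 0.8, and the iteration count are assumptions for the example only.

```python
def simrank(in_nb, nodes, C=0.8, iterations=10):
    """Iterative SimRank over in-link neighbors.

    in_nb maps each page to the set of pages linking to it; every neighbor
    must also appear in `nodes`. C and iterations are illustrative values.
    """
    # Starting point: Sim(u, u) = 1 and Sim(u, v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                    continue
                iu, iv = in_nb.get(u, ()), in_nb.get(v, ())
                if not iu or not iv:
                    # A page without in-links is similar to no other page.
                    new[(u, v)] = 0.0
                    continue
                total = sum(sim[(x, y)] for x in iu for y in iv)
                new[(u, v)] = C * total / (len(iu) * len(iv))
        sim = new
    return sim


# Hypothetical graph: page p links to both a and b.
nodes = ["p", "a", "b"]
in_nb = {"a": {"p"}, "b": {"p"}}
scores = simrank(in_nb, nodes)
```

Here `a` and `b` share their only in-link neighbor, so their score settles at C, while `p` has no in-links and scores 0 against every other page.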

12
3. Extending Link-based Similarity Measures
  • Extended SimRank
  • two pages are similar if they have similar
    neighbors
  • (1) $Sim(u,u) = 1$; (2) $Sim(u,v) = 0$ if
    $|I(u)|\,|I(v)| = 0$.
  • Recursive definition
  • C is a constant between 0 and 1.
  • The iteration starts with $Sim(u,u) = 1$, $Sim(u,v) = 0$
    if $u \ne v$.

13
3. Extending Link-based Similarity Measures
  • PageSim
  • weighted multi-hop version of Co-citation
    algorithm.
  • (a) multi-hop in-link information, and
  • (b) importance of web pages.
  • Can be represented by any global scoring system
  • PageRank scores, or
  • Authoritative scores of HITS.

14
3. Extending Link-based Similarity Measures
  • PageSim (phase 1 feature propagation)
  • Initially, each web page holds unique feature
    information, which is represented by its
    PageRank score.
  • The feature information of a web page is
    propagated along out-link hyperlinks at decay
    rate d. The PR score of u propagated to v is
    defined by: [equation not preserved in the transcript]
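
The propagation formula is missing from this transcript, so the following is only a sketch of one plausible rule, not the authors' definition. It assumes that at every hop the score is multiplied by the decay rate d and split equally among the out-links of the current page, and that contributions are summed over all paths with distinct vertices. The function name, the `max_hops` cutoff, and the sample values are hypothetical.

```python
def propagate_feature(out_nb, u, score, d=0.5, max_hops=3):
    """Return {v: amount of u's score that reaches v}.

    Assumed rule (not taken from the slides): each hop multiplies the amount
    by d and divides it equally among the current page's out-links.
    """
    received = {}

    def walk(node, amount, visited, hops):
        if hops == max_hops:
            return
        targets = out_nb.get(node, ())
        if not targets:
            return
        share = d * amount / len(targets)
        for nxt in targets:
            if nxt in visited:  # keep vertices on a path distinct
                continue
            received[nxt] = received.get(nxt, 0.0) + share
            walk(nxt, share, visited | {nxt}, hops + 1)

    walk(u, score, {u}, 0)
    return received


# Hypothetical graph: u -> v, u -> w, v -> w; u starts with score 1.0
out_nb = {"u": {"v", "w"}, "v": {"w"}}
reached = propagate_feature(out_nb, "u", score=1.0, d=0.5)
```

In this example `w` receives score along two paths (directly from `u` and through `v`), which illustrates how multi-hop information accumulates in a page's feature vector.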

15
3. Extending Link-based Similarity Measures
  • PageSim (phase 2 similarity computation)
  • A web page v stores its own feature information
    and that of other pages in its Feature Vector FV(v).
  • The similarity between web pages u and v is
    computed by the Jaccard measure [Jain et al 88].
  • Intuition: the more common feature information
    two web pages contain, the more similar they are.
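
The exact form of the Jaccard measure is not shown in the transcript. A common choice for weighted vectors is the ratio of summed minima to summed maxima, and the sketch below assumes that form. Feature vectors are represented as dictionaries from source page to received score; the function name and sample values are hypothetical.

```python
def jaccard_similarity(fv_u, fv_v):
    """Weighted Jaccard: sum of element-wise minima over sum of maxima.

    Missing entries count as 0, so a feature held by only one page adds to
    the denominator but not to the numerator.
    """
    keys = set(fv_u) | set(fv_v)
    num = sum(min(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    den = sum(max(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0


# Hypothetical feature vectors: both pages received some of p's feature.
fv_u = {"p": 0.4, "q": 0.2}
fv_v = {"p": 0.1, "r": 0.3}
ps_uv = jaccard_similarity(fv_u, fv_v)  # 0.1 / 0.9
```

Only the shared feature `p` contributes to the numerator, matching the intuition that similarity grows with the amount of common feature information.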

16
3. Extending Link-based Similarity Measures
  • Extended PageSim (EPS)
  • Propagate the feature information of web pages
    along in-link hyperlinks at decay rate 1 - d.
  • Compute the in-link PS scores.
  • EPS(u,v) = in-link PS(u,v) + out-link PS(u,v).

17
3. Extending Link-based Similarity Measures
  • Properties
  • CC: Co-citation, BC: Bibliographic Coupling,
  • ECBC: Extended Co-citation and Bibliographic
    Coupling,
  • SR: SimRank, ESR: Extended SimRank, PS: PageSim,
    EPS: Extended PageSim.
  • Summary
  • The extended versions consider more structural
    information.
  • ESR and EPS are bi-directional multi-hop.
  • In ESR, two web pages are not similar unless
    there are intermediate pages between them, even
    if they link to each other (see Figure 1(2)).

18
3. Extending Link-based Similarity Measures
  • Case study: Sim(a,b)
  • Summary
  • The extended algorithms are more flexible.
  • EPS is able to handle more cases.

19
4. Experimental Results
  • Datasets
  • CSE Web (CW) dataset
  • A set of web pages crawled from
    http://cse.cuhk.edu.hk.
  • 22,000 pages, 180,000 hyperlinks.
  • The average numbers of in-links and out-links are
    8.6 and 7.7.
  • Google Scholar (GS) dataset
  • A set of articles crawled from the Google Scholar
    search engine.
  • Start crawling by submitting the keywords "web mining"
    to GS, and then following the "Cited by"
    hyperlinks.
  • 20,000 articles, 154,000 citations.

20
4. Experimental Results
  • Evaluation Methods
  • Cosine TFIDF similarity (for CW dataset)
  • A commonly used text-based similarity measure.
  • Related Articles (for GS dataset)
  • A list of related articles to a query article
    provided by GS.
  • Can be used as ground truth.
  • Parameter Settings

21
4. Experimental Results
  • CC, BC vs ECBC
  • CW data (left): x-axis, top N results; y-axis,
    average cosine TFIDF of all pages.
  • GS data (right): x-axis, top N results; y-axis,
    average precision of all pages.
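
The y-axis for the GS data can be read as precision over the top N results, averaged across all query pages. The sketch below assumes that reading and treats the Related Articles list as the ground-truth set. Function names and sample data are hypothetical.

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked items that appear in the ground truth."""
    if n <= 0:
        return 0.0
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / n


def mean_precision_at_n(results, ground_truth, n):
    """Average precision_at_n over every query page in `results`."""
    scores = [precision_at_n(ranked, ground_truth.get(query, set()), n)
              for query, ranked in results.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical ranked outputs of a similarity measure and ground-truth sets
results = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
ground_truth = {"q1": {"a", "c"}, "q2": {"x"}}
avg_p2 = mean_precision_at_n(results, ground_truth, n=2)
```

Sweeping `n` over a range of values and plotting the result for each algorithm would give curves of the kind the slide describes.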

22
4. Experimental Results
  • SimRank vs Extended SimRank
  • CW data (left): x-axis, top N results; y-axis,
    average cosine TFIDF of all pages.
  • GS data (right): x-axis, top N results; y-axis,
    average precision of all pages.

23
4. Experimental Results
  • PageSim vs Extended PageSim
  • CW data (left): x-axis, top N results; y-axis,
    average cosine TFIDF of all pages.
  • GS data (right): x-axis, top N results; y-axis,
    average precision of all pages.

24
4. Experimental Results
  • Overall Accuracy of Algorithms

25
5. Conclusion and Future Work
  • Conclusion
  • Extended Neighborhood Structure model
  • Bi-direction and multi-hop
  • Extend existing link-based similarity measures
  • Co-citation, Bibliographic coupling, SimRank,
    PageSim
  • Experiments
  • Future Work
  • Extend link-based algorithms based on ENS model
  • Prove the convergence of the Extended SimRank
  • Integrate link-based with text-based similarity measures

26
Publications
  • Z. Lin, M. R. Lyu, and I. King. PageSim: A novel
    link-based measure of web page similarity. In WWW
    '06: Proceedings of the 15th International
    Conference on World Wide Web, pages 1019-1020,
    Edinburgh, Scotland, 2006.
  • Z. Lin, I. King, and M. R. Lyu. PageSim: A novel
    link-based similarity measure for the World Wide
    Web. In WI '06: Proceedings of the 5th
    International Conference on Web Intelligence. ACM
    Press. To appear, 2006.
  • Z. Lin, M. R. Lyu, and I. King. Extending
    Link-based Algorithms for Similar Web Pages with
    Neighborhood Structure. Submitted to WWW '07.