Title: Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure
- Allen, Zhenjiang LIN
- CSE, CUHK
- 13 Dec 2006
Outline
- 1. Introduction
- 2. Extended Neighborhood Structure Model
- 3. Extending Link-based Similarity Measures
- 4. Experimental Results
- 5. Conclusion and Future Work
1. Introduction
- Background
- Similarity measures are required in many web applications to evaluate the similarity between web pages.
- The "similar pages" service of Web search engines
- Web document classification
- Web community identification
- Problem
- Many link-based similarity measures are not very accurate, since they consider only part of the structural information.
1. Introduction
- Motivation
- How can the accuracy of link-based similarity measures be improved by making full use of the structural information?
- Contributions
- Propose the Extended Neighborhood Structure (ENS) model.
- bi-direction
- multi-hop
- Construct extended link-based similarity measures based on the ENS model.
- more flexible and accurate
1. Introduction
- Searching the Web
- Keyword searching
- Similarity searching
[Figure: a search engine given the KEYWORDS "news" returns http://news.bbc.co.uk/ and http://www.cnn.com/; given the URL www.cnn.com, it returns http://news.bbc.co.uk/ and http://usnews.com/.]
1. Introduction
- Similarity measures
- Evaluate how similar or related two objects are.
- Approaches to measuring similarity
- Text-based
- Cosine TFIDF [Joachims97]
- Link-based
- Bibliographic coupling [Kessler63]
- Co-citation [Small73]
- SimRank [Jeh et al 02], PageSim [Lin et al 06]
- Hybrid
2. Extended Neighborhood Structure Model
- Extended Neighborhood Structure (ENS) model
- Question: what is hidden in hyperlinks?
- Similarity relationships between pages.
- The similarity relationship decreases along hyperlinks.
2. Extended Neighborhood Structure Model
- Extended Neighborhood Structure (ENS) model
- The ENS model
- bi-direction
- in-link
- out-link
- multi-hop
- direct (1-hop)
- indirect (2-hop, 3-hop, etc)
- Purpose
- Improve accuracy of link-based similarity
measures by helping them make full use of the
structural information of the Web.
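As a rough illustration of what "bi-direction" and "multi-hop" mean for a single page, the sketch below collects every page reachable within a given number of hops when in-links and out-links are both followed. The function name `ens_neighborhood`, the `max_hops` parameter, and the toy graph are assumptions made for this example; the slides do not prescribe an implementation, and treating each hop as direction-free is a simplification.

```python
from collections import deque

def ens_neighborhood(out_links, page, max_hops=2):
    """Return {neighbor: hop distance} for pages within max_hops of `page`,
    following hyperlinks in both directions (assumed simplification)."""
    # Build the in-link map by reversing every out-link.
    in_links = {}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, set()).add(src)

    hops = {page: 0}
    queue = deque([page])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand beyond the hop limit
        # Bi-direction: both in-link and out-link neighbors count as adjacent.
        adjacent = set(out_links.get(current, ())) | in_links.get(current, set())
        for nxt in adjacent:
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)
    del hops[page]  # the page is not its own neighbor
    return hops

# Hypothetical toy web: d -> a -> b -> c
toy_web = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
direct = ens_neighborhood(toy_web, "a", max_hops=1)
two_hop = ens_neighborhood(toy_web, "a", max_hops=2)
```

With `max_hops=1` only the direct in-link and out-link neighbors are found; raising the limit adds the indirect (2-hop) neighbors.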
3. Extending Link-based Similarity Measures
- Intuition of similarity
- Similar web pages have similar neighbors (to compare two web pages, look at their neighbors).
- Notations
- $G = (V, E)$, $|V| = n$: the web graph.
- $I(a)$ / $O(a)$: in-link / out-link neighbors of web page $a$.
- $path(a_1, a_s)$: a sequence of vertices $a_1, a_2, \ldots, a_s$ such that $(a_i, a_{i+1}) \in E$ $(i = 1, \ldots, s-1)$ and the $a_i$ are distinct.
- $PATH(a, b)$: the set of all possible paths from page $a$ to page $b$.
- $Sim(a, b)$: similarity score of web pages $a$ and $b$.
3. Extending Link-based Similarity Measures
- Two classical methods
- Co-citation: the more common in-link neighbors, the more similar.
- $Sim(a, b) = |I(a) \cap I(b)|$
- Bibliographic coupling: the more common out-link neighbors, the more similar.
- $Sim(a, b) = |O(a) \cap O(b)|$
- Extended Co-citation and Bibliographic Coupling (ECBC)
- ECBC: the more common neighbors, the more similar.
- $Sim(a, b) = \alpha |I(a) \cap I(b)| + (1 - \alpha) |O(a) \cap O(b)|$, where $0 \le \alpha \le 1$ is a constant.
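The three counting measures can be sketched in a few lines of Python. The helper names (`build_neighbors`, `co_citation`, `bibliographic_coupling`, `ecbc`), the default `alpha=0.5`, and the edge list are illustrative assumptions, not taken from the slides.

```python
def build_neighbors(edges):
    """Build in-link and out-link neighbor sets from (source, target) pairs."""
    in_nb, out_nb = {}, {}
    for src, dst in edges:
        out_nb.setdefault(src, set()).add(dst)
        in_nb.setdefault(dst, set()).add(src)
    return in_nb, out_nb

def co_citation(in_nb, a, b):
    # Number of pages that link to both a and b.
    return len(in_nb.get(a, set()) & in_nb.get(b, set()))

def bibliographic_coupling(out_nb, a, b):
    # Number of pages that both a and b link to.
    return len(out_nb.get(a, set()) & out_nb.get(b, set()))

def ecbc(in_nb, out_nb, a, b, alpha=0.5):
    # Weighted combination of the two counts; alpha is assumed to lie in [0, 1].
    return (alpha * co_citation(in_nb, a, b)
            + (1 - alpha) * bibliographic_coupling(out_nb, a, b))

# Hypothetical graph: p and q both cite a and b; a and b both link to x.
edges = [("p", "a"), ("p", "b"), ("q", "a"), ("q", "b"),
         ("a", "x"), ("b", "x"), ("b", "y")]
in_nb, out_nb = build_neighbors(edges)
cc = co_citation(in_nb, "a", "b")                # 2 common in-link neighbors
bc = bibliographic_coupling(out_nb, "a", "b")    # 1 common out-link neighbor
combined = ecbc(in_nb, out_nb, "a", "b", alpha=0.5)
```

Setting `alpha` to 1 or 0 recovers plain Co-citation or plain Bibliographic coupling, which is why ECBC can be read as a generalization of both.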
3. Extending Link-based Similarity Measures
- SimRank
- Two pages are similar if they are linked to by similar pages.
- (1) $Sim(u, u) = 1$; (2) $Sim(u, v) = 0$ if $|I(u)| \cdot |I(v)| = 0$.
- Recursive definition
- $C$ is a constant between 0 and 1.
- The iteration starts with $Sim(u, u) = 1$, and $Sim(u, v) = 0$ if $u \ne v$.
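For reference, the sketch below implements the standard SimRank recursion of Jeh and Widom, in which $Sim(u, v)$ for $u \ne v$ is $C / (|I(u)| |I(v)|)$ times the sum of $Sim(x, y)$ over all $x \in I(u)$ and $y \in I(v)$. The function name `simrank`, the value `C=0.8`, the fixed iteration count, and the toy graph are assumptions for illustration only.

```python
def simrank(in_nb, nodes, C=0.8, iterations=10):
    """Naive iterative SimRank over all node pairs (fine for tiny graphs only)."""
    # Iteration starts with Sim(u,u) = 1 and Sim(u,v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                    continue
                iu, iv = in_nb.get(u, ()), in_nb.get(v, ())
                if not iu or not iv:
                    # No in-link neighbors on one side: similarity is 0.
                    new[(u, v)] = 0.0
                    continue
                total = sum(sim[(x, y)] for x in iu for y in iv)
                new[(u, v)] = C * total / (len(iu) * len(iv))
        sim = new
    return sim

# Hypothetical graph: page p links to both a and b.
nodes = ["p", "a", "b"]
in_nb = {"a": {"p"}, "b": {"p"}}
scores = simrank(in_nb, nodes, C=0.8, iterations=5)
```

In this toy graph `a` and `b` share their only in-link neighbor, so their score settles at `C`, while `p` has no in-links and therefore scores 0 against every other page.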
3. Extending Link-based Similarity Measures
- Extended SimRank
- Two pages are similar if they have similar neighbors.
- (1) $Sim(u, u) = 1$; (2) $Sim(u, v) = 0$ if $|I(u)| \cdot |I(v)| = 0$.
- Recursive definition
- $C$ is a constant between 0 and 1.
- The iteration starts with $Sim(u, u) = 1$, and $Sim(u, v) = 0$ if $u \ne v$.
3. Extending Link-based Similarity Measures
- PageSim
- Weighted multi-hop version of the Co-citation algorithm.
- (a) multi-hop in-link information, and
- (b) importance of web pages.
- Can be represented by any global scoring system
- PageRank scores, or
- Authoritative scores of HITS.
3. Extending Link-based Similarity Measures
- PageSim (phase 1: feature propagation)
- Initially, each web page contains unique feature information, which is represented by its PageRank score.
- The feature information of a web page is propagated along out-link hyperlinks at decay rate $d$. The PR score of $u$ propagated to $v$ is defined by
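The propagation formula itself is not reproduced in this text, so the following is only a sketch under an assumed rule: at every hop a page passes on its incoming score multiplied by $d$ and split evenly among its out-link neighbors, and contributions are summed over all simple paths up to a hop limit. The function name `propagate_features`, the `max_hops` limit, and the PageRank values are hypothetical.

```python
def propagate_features(out_nb, pr, d=0.5, max_hops=3):
    """Return fv where fv[v][u] is the share of u's score that reaches v.

    Assumed rule (not confirmed by the slides): each hop multiplies the
    score by d and divides it evenly among the out-link neighbors.
    """
    # Every page starts with its own score as its unique feature.
    fv = {page: {page: score} for page, score in pr.items()}

    def walk(source, current, score, visited, hops):
        targets = out_nb.get(current, ())
        if hops == max_hops or not targets:
            return
        share = d * score / len(targets)
        for nxt in targets:
            if nxt in visited:
                continue  # paths must not revisit a page
            bucket = fv.setdefault(nxt, {})
            bucket[source] = bucket.get(source, 0.0) + share
            walk(source, nxt, share, visited | {nxt}, hops + 1)

    for u, score in pr.items():
        walk(u, u, score, {u}, 0)
    return fv

# Hypothetical graph u -> v, u -> w, v -> w with made-up PageRank scores.
out_nb = {"u": ["v", "w"], "v": ["w"], "w": []}
pr = {"u": 0.4, "v": 0.3, "w": 0.3}
fv = propagate_features(out_nb, pr, d=0.5)
```

Here `w` receives part of `u`'s score along two paths (directly, and through `v`), which is the multi-hop in-link information that PageSim is meant to capture.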
3. Extending Link-based Similarity Measures
- PageSim (phase 2: similarity computation)
- A web page $v$ stores its own feature information and that of other pages in its feature vector $FV(v)$.
- The similarity between web pages $u$ and $v$ is computed by the Jaccard measure [Jain et al 88].
- Intuition: the more common feature information two web pages contain, the more similar they are.
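One way to apply a Jaccard measure to weighted feature vectors is the generalized form, the sum of per-feature minima divided by the sum of per-feature maxima. Whether PageSim uses exactly this variant is an assumption here; the function name `jaccard_similarity` and the two feature vectors are made up for illustration.

```python
def jaccard_similarity(fv_u, fv_v):
    """Generalized (weighted) Jaccard between two {page: score} vectors."""
    pages = set(fv_u) | set(fv_v)
    # Shared feature information: the smaller score for each feature.
    shared = sum(min(fv_u.get(p, 0.0), fv_v.get(p, 0.0)) for p in pages)
    # All feature information: the larger score for each feature.
    total = sum(max(fv_u.get(p, 0.0), fv_v.get(p, 0.0)) for p in pages)
    return shared / total if total else 0.0

# Hypothetical feature vectors of two pages v and w.
fv_v = {"v": 0.3, "u": 0.1}
fv_w = {"w": 0.3, "u": 0.15, "v": 0.15}
score = jaccard_similarity(fv_v, fv_w)   # (0.1 + 0.15) / (0.15 + 0.3 + 0.3)
```

The score is 1 when the two vectors are identical and 0 when they share no features, matching the intuition that more common feature information means higher similarity.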
3. Extending Link-based Similarity Measures
- Extended PageSim (EPS)
- Propagating feature information of web pages along in-link hyperlinks at decay rate $1 - d$.
- Computing the in-link PS scores.
- $EPS(u, v)$ = in-link $PS(u, v)$ + out-link $PS(u, v)$.
3. Extending Link-based Similarity Measures
- Properties
- CC: Co-citation, BC: Bibliographic Coupling, ECBC: Extended Co-citation and Bibliographic Coupling, SR: SimRank, ESR: Extended SimRank, PS: PageSim, EPS: Extended PageSim.
- Summary
- The extended versions consider more structural information.
- ESR and EPS are bi-directional and multi-hop.
- In ESR, two web pages are not similar unless there are intermediate pages between them, even if they link to each other (see Figure 1(2)).
3. Extending Link-based Similarity Measures
- Case study: $Sim(a, b)$
- Summary
- The extended algorithms are more flexible.
- EPS is able to handle more cases.
4. Experimental Results
- Datasets
- CSE Web (CW) dataset
- A set of web pages crawled from http://cse.cuhk.edu.hk.
- 22,000 pages, 180,000 hyperlinks.
- The average numbers of in-links and out-links are 8.6 and 7.7.
- Google Scholar (GS) dataset
- A set of articles crawled from the Google Scholar search engine.
- Crawling starts by submitting the keywords "web mining" to GS, and then follows the "Cited by" hyperlinks.
- 20,000 articles, 154,000 citations.
4. Experimental Results
- Evaluation Methods
- Cosine TFIDF similarity (for CW dataset)
- A commonly used text-based similarity measure.
- Related Articles (for GS dataset)
- A list of articles related to a query article, provided by GS.
- Can be used as ground truth.
- Parameter Settings
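As a reminder of how the text-based yardstick for the CW dataset works, here is a minimal cosine TFIDF sketch in plain Python. It assumes one common weighting, raw term frequency times $\log(n / df)$; the slides do not state which TFIDF variant or tokenization was used, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical page contents, already tokenized.
docs = [["web", "mining", "links"],
        ["web", "mining", "text"],
        ["cooking", "recipes"]]
v0, v1, v2 = tfidf_vectors(docs)
related = cosine(v0, v1)     # pages sharing "web" and "mining"
unrelated = cosine(v0, v2)   # pages with no terms in common
```

Averaging such cosine scores between a query page and its top N results gives a text-based estimate of how good a link-based ranking is.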
4. Experimental Results
- CC, BC vs ECBC
- CW data (left): x-axis, top N results; y-axis, average cosine TFIDF of all pages.
- GS data (right): x-axis, top N results; y-axis, average precision of all pages.
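The y-axis quantity for the GS data can be read as precision over the top N results, averaged across all query pages. The sketch below follows that reading, which is an assumption about how the plots were produced; the function names and the sample rankings are hypothetical.

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked pages that appear in the ground truth."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for page in top if page in relevant) / len(top)

def average_precision_at_n(rankings, ground_truth, n):
    """Mean of precision_at_n over every query page in `rankings`."""
    scores = [precision_at_n(ranked, ground_truth.get(query, set()), n)
              for query, ranked in rankings.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical output of a similarity algorithm and a related-articles list.
rankings = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
ground_truth = {"q1": {"a", "c"}, "q2": {"e"}}
at_2 = average_precision_at_n(rankings, ground_truth, 2)
at_3 = average_precision_at_n(rankings, ground_truth, 3)
```

Evaluating this for N = 1, 2, 3, ... gives one curve per algorithm, which is the form of comparison the plots use.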
4. Experimental Results
- SimRank vs Extended SimRank
- CW data (left): x-axis, top N results; y-axis, average cosine TFIDF of all pages.
- GS data (right): x-axis, top N results; y-axis, average precision of all pages.
4. Experimental Results
- PageSim vs Extended PageSim
- CW data (left): x-axis, top N results; y-axis, average cosine TFIDF of all pages.
- GS data (right): x-axis, top N results; y-axis, average precision of all pages.
4. Experimental Results
- Overall Accuracy of Algorithms
5. Conclusion and Future Work
- Conclusion
- Extended Neighborhood Structure model
- Bi-direction and multi-hop
- Extend existing link-based similarity measures
- Co-citation, Bibliographic coupling, SimRank, PageSim
- Experiments
- Future Work
- Extend link-based algorithms based on the ENS model
- Prove the convergence of Extended SimRank
- Integrate link-based with text-based similarity measures
Publications
- Z. Lin, M. R. Lyu, and I. King. PageSim: A novel link-based measure of web page similarity. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 1019-1020, Edinburgh, Scotland, 2006.
- Z. Lin, I. King, and M. R. Lyu. PageSim: A novel link-based similarity measure for the World Wide Web. In WI '06: Proceedings of the 5th International Conference on Web Intelligence. ACM Press. To appear, 2006.
- Z. Lin, M. R. Lyu, and I. King. Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure. Submitted to WWW '07.