Title: Mining the Link Structure of the World Wide Web February 1999
1Mining the Link Structure of the World Wide Web
(February 1999)
- by
- Soumen Chakrabarti(1), Byron E. Dom(1), David
Gibson(2), Jon Kleinberg(3), Ravi Kumar(1),
Prabhakar Raghavan(1), Sridhar Rajagopalan(1),
Andrew Tomkins(1)
(1) IBM Almaden Research Center, CA
(2) CS, UC Berkeley, CA
(3) CS, Cornell University, NY
presented by Maciej Janik, CSCI 8350, 16 September 2004
2Outline
- Motivation and goal
- Algorithm description
- Initial results
- Limitations and enhancements
- Created system and evaluation
- In the search of hidden structures
- Conclusion
3Motivation
- Current scale of the Web: about 300 million web pages. Now even more.
- There is no unifying structure.
- The majority of pages do not have any structure.
- Great variety in authoring style, content, layout, etc.
4Motivation
- Search engines are index-based and limited to keyword searches.
- Keyword searches, even very narrow ones, do not always return the expected results.
- Results may be incorrect or cover too wide a range.
- A page's self-description may not include the appropriate keywords.
5Goal
- Design algorithms for mining link information.
- Develop techniques that take advantage of the social organisation of the Web.
- Effectively find authoritative pages in a given field.
- Identify knowledge hubs in a given field.
6HITS algorithm
- Computes hubs and authorities for a search topic
- Components
- sampling: build a focused collection of pages likely to contain many relevant authorities
- weight-propagation: determine the weights of hubs and authorities in an iterative process
7HITS algorithm
- Web representation - directed graph
- Concentrate on links that go to other domains.
- Local links have mainly navigational purposes
- Apply iterations of the algorithm to a reduced set of web pages.
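The preference for links that go to other domains can be sketched as a simple filter. This is a minimal illustration, not the authors' implementation: the helper names `host` and `cross_domain_links` are hypothetical, the URLs are made up, and comparing full host names is only a crude stand-in for "same domain".

```python
from urllib.parse import urlparse

def host(url):
    """Return the lowercase host part of a URL (a rough proxy for its domain)."""
    return urlparse(url).netloc.lower()

def cross_domain_links(links):
    """Keep only (source, target) pairs whose hosts differ.

    Links within one host are assumed to be mainly navigational.
    """
    return [(src, dst) for src, dst in links if host(src) != host(dst)]

# Made-up example links.
links = [
    ("http://a.example/index.html", "http://a.example/about.html"),  # local
    ("http://a.example/index.html", "http://b.example/"),            # cross-domain
    ("http://b.example/", "http://c.example/page"),                  # cross-domain
]
kept = cross_domain_links(links)
```

Only the two cross-domain links remain in `kept`; the local link is dropped before any weights are computed.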
8HITS steps - root and base set
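The figure for this slide is not reproduced in the text. As a rough sketch, assuming the usual construction in which a root set returned by a text search is expanded with the pages it links to and a capped number of pages linking to it, the base set could be built as follows. The function name, the dictionary-based link tables, and the `max_in` cap are illustrative assumptions.

```python
def expand_to_base_set(root, out_links, in_links, max_in=50):
    """Expand a root set into a base set.

    root      -- iterable of page ids returned by a text search (assumed given)
    out_links -- dict: page id -> pages it points to
    in_links  -- dict: page id -> pages pointing to it
    max_in    -- cap on in-linking pages added per root page (assumed value)
    """
    base = set(root)
    for page in root:
        # Add every page the root page points to.
        base.update(out_links.get(page, ()))
        # Add at most max_in pages that point to the root page.
        base.update(sorted(in_links.get(page, ()))[:max_in])
    return base

# Toy link tables with made-up page ids.
out_links = {"r1": ["a", "b"], "r2": ["b"]}
in_links = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}
base = expand_to_base_set(["r1", "r2"], out_links, in_links, max_in=2)
```

With `max_in=2`, the page `h3` is left out because `r1` already contributed two in-linking pages.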
9HITS steps - what to count?
- Distinguish the pattern of relevant pages
10HITS steps - how to count?
- Count OUT links
- Count IN links
11HITS - calculating weights
- Authority weight
- Hub weight
- Matrix notation: A is the adjacency matrix
- A(i, j) = 1 if the i-th page points to the j-th page (0 otherwise)
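The formulas on this slide did not survive in the text. Assuming the standard HITS update rules, with $a_p$ the authority weight and $h_p$ the hub weight of page $p$ (symbols chosen here for illustration), the weights and their matrix form can be written as:

```latex
\begin{align*}
a_p &= \sum_{q \,:\, q \to p} h_q
  && \text{(authority weight: sum over pages $q$ that point to $p$)} \\
h_p &= \sum_{q \,:\, p \to q} a_q
  && \text{(hub weight: sum over pages $q$ that $p$ points to)} \\
a &\leftarrow A^{T} h, \qquad h \leftarrow A\, a
  && \text{(matrix form, with $A(i,j) = 1$ if page $i$ points to page $j$)}
\end{align*}
```

Substituting one update into the other gives $a \leftarrow A^{T}A\,a$ and $h \leftarrow AA^{T}h$, which is why repeated application behaves like power iteration on these two matrices.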
12HITS - outcome
- Iterative multiplication (power iteration) converges to the principal eigenvector from any non-degenerate initial vector.
- Hubs and authorities are the outcome of the process.
- Eigenvector -> the result is a property of the link matrix, not an artifact of the process.
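The iteration can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' code: the function name `hits`, the fixed iteration count, the unit-length normalisation, and the toy graph are all illustrative choices.

```python
import math

def hits(adj, iterations=50):
    """Run HITS power iteration on an adjacency matrix.

    adj[i][j] is 1 if page i points to page j, else 0.
    Returns (hubs, auths), each normalised to unit Euclidean length.
    """
    n = len(adj)
    hubs = [1.0] * n   # non-degenerate starting vector
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority update: sum hub weights of pages pointing to j.
        auths = [sum(adj[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        # Hub update: sum authority weights of pages that i points to.
        hubs = [sum(adj[i][j] * auths[j] for j in range(n)) for i in range(n)]
        # Normalise so the weights do not grow without bound.
        a_norm = math.sqrt(sum(x * x for x in auths)) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hubs)) or 1.0
        auths = [x / a_norm for x in auths]
        hubs = [x / h_norm for x in hubs]
    return hubs, auths

# Toy graph: page 0 points to pages 2 and 3, page 1 points to page 2.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
hubs, auths = hits(adj)
best_hub = hubs.index(max(hubs))
best_auth = auths.index(max(auths))
```

In the toy graph, page 2 receives links from both hubs and ends up as the top authority, while page 0, which points to both authorities, ends up as the top hub.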
13Basic results
- Although HITS is purely link-based (it completely disregards page content), results are quite good for many tested queries.
- The authors tested, for example, the query "search engines":
- the algorithm returned Yahoo!, Excite, Magellan, Lycos, AltaVista
- none of these pages described itself as a search engine (at the time of the experiment)
14Encountered problems
- Starting from a narrow topic, HITS tends to drift to a more general one.
- Hub pages are a specific problem: their many links can cause the algorithm to drift, because they can point to authorities on different topics.
- Pages from a single domain / website can dominate the result if they all point to one page, which is not necessarily a good authority.
15Algorithm enhancements
- Use weighted sums for link calculation.
- Take advantage of anchor text (the text in and around the link itself).
- Break hubs into smaller pieces. Analyze each piece separately, instead of treating the whole hub page as one unit.
- Disregard or minimize the influence of links inside one domain.
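The first, second, and last enhancements above can be combined into a single link weight that replaces the plain 0/1 entry in the sums. The following is only a sketch of the idea: the formula (one plus the number of query-term matches in the anchor text), the function name `link_weight`, and the `same_domain_factor` parameter are assumptions made for illustration, not the weighting described in the slides.

```python
def link_weight(anchor_text, query_terms, same_domain, same_domain_factor=0.0):
    """Return an illustrative weight for one link.

    anchor_text        -- text in and around the link
    query_terms        -- list of topic words
    same_domain        -- True if source and target are in the same domain
    same_domain_factor -- multiplier for intra-domain links (0.0 disregards them)
    """
    words = anchor_text.lower().split()
    # Count how often each query term appears in the anchor text.
    matches = sum(words.count(term.lower()) for term in query_terms)
    weight = 1.0 + matches
    if same_domain:
        # Disregard or shrink the influence of links inside one domain.
        weight *= same_domain_factor
    return weight

w_topical = link_weight("best search engines list", ["search", "engines"], False)
w_plain = link_weight("home", ["search"], False)
w_local = link_weight("search", ["search"], True)
```

Here `w_topical` is larger than `w_plain` because the anchor text mentions the topic, and `w_local` is zero because the link stays inside one domain.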
16Results from Clever system
- Clever (HITS with enhancements) vs. AltaVista vs. Yahoo!
- Comparison test on 26 broad topics.
- 37 users ranked results (bad, fair, good, fantastic), not knowing the source:
- top 10 results from AltaVista
- 5 hubs and 5 authorities from Clever
- 10 random pages from Yahoo! (as they appear in alphabetical order in search)
17Results of experiment
- 31% of responses showed no difference in result quality.
- 19% of responses rated Yahoo! higher than Clever.
- 50% of responses rated Clever higher than Yahoo!
18Usage: trawling the Web
- Looking for hidden cyber-communities in the Web
- a group of content creators sharing a common interest, together with a set of Web pages
- a large number of small communities with narrow topics
- some not even registered in newsgroups or other catalogs
19How to find cyber communities
- Community: a small group that has a dense pattern of linkage.
- Each community has a similar linkage signature.
- Graph structure: directed bipartite graph
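The linkage signature can be illustrated with a brute-force search for complete bipartite cores: a set of pages that all link to the same set of target pages. This is a sketch for small graphs only, assuming nothing about how the authors searched at Web scale; the function name `find_cores` and the labels "fans" and "centers" are chosen here for illustration.

```python
from itertools import combinations

def find_cores(out_links, i=3, j=3):
    """Find complete bipartite cores with i linking pages and j linked pages.

    out_links -- dict: page id -> pages it points to
    Returns a list of (fans, centers) tuples in which every fan
    points to every center.
    """
    cores = []
    for fans in combinations(sorted(out_links), i):
        # Pages that all the chosen fans point to.
        common = set.intersection(*(set(out_links[f]) for f in fans))
        common -= set(fans)  # a page cannot be on both sides
        for centers in combinations(sorted(common), j):
            cores.append((fans, centers))
    return cores

# Toy graph with made-up page ids.
out_links = {
    "f1": ["c1", "c2", "c3"],
    "f2": ["c1", "c2", "c3", "x"],
    "f3": ["c3", "c2", "c1"],
    "f4": ["c1"],
}
cores = find_cores(out_links, i=3, j=3)
```

The toy graph contains exactly one core with three pages on each side. Enumerating all combinations is exponential in general, so a real crawl would need aggressive pruning before anything like this is feasible.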
20Results in communities search
- Used a web snapshot provided by Alexa.
- Found about 130,000 complete bipartite graphs of 3 webpages. Coincidence?
- Manually tested a sample of 400:
- only 5 had no common topic
- 25% of communities were not represented in the groups catalog
21Conclusions
- Mining the WWW link structure can uncover social networks.
- Hub and authority pages are more helpful for learning a topic than standard search.
- The algorithm can find useful pages even if they do not describe themselves in terms of the topic.
22Conclusions
- A similar idea of propagating weights along links underlies Google's PageRank algorithm.
- A new page category not returned by search engines: the HUB.
- It is possible and useful to find structures in the unstructured web.
23Questions?
24References
- Mining the Link Structure of the World Wide Web (1999). S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. http://www.cs.cornell.edu/home/kleinber/ieee99.ps
- Authoritative Sources in a Hyperlinked Environment. Jon M. Kleinberg. http://www.cs.cornell.edu/home/kleinber/auth.pdf
- Trawling emerging cyber-communities automatically. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. http://www8.org/w8-papers/4a-search-mining/trawling/trawling.html