Mining the Link Structure of the World Wide Web (February 1999)

1
Mining the Link Structure of the World Wide Web
(February 1999)
  • by
  • Soumen Chakrabarti(1), Byron E. Dom(1), David
    Gibson(2), Jon Kleinberg(3), Ravi Kumar(1),
    Prabhakar Raghavan(1), Sridhar Rajagopalan(1),
    Andrew Tomkins(1)

(1) IBM Almaden Research Center, CA
(2) CS, UC Berkeley, CA
(3) CS, Cornell University, NY
Presented by Maciej Janik, CSCI 8350, 16 September 2004
2
Outline
  • Motivation and goal
  • Algorithm description
  • Initial results
  • Limitations and enhancements
  • Created system and evaluation
  • In the search of hidden structures
  • Conclusion

3
Motivation
  • Scale of the Web at the time of the paper - about
    300 million web pages; now even more.
  • There is no unifying structure.
  • The majority of pages have no structure at all.
  • Great variety in authoring style, content,
    layout, etc.

4
Motivation
  • Search engines are index-based and limited to
    keyword searches.
  • Keyword search, even a very narrow one, does not
    always return the expected results.
  • Results may be incorrect or cover too wide a
    range.
  • A page's self-description may not include the
    appropriate keywords.

5
Goal
  • Design algorithms for mining link information.
  • Develop techniques that take advantage of the
    social organisation of the Web.
  • Effectively find authoritative pages in a given
    field.
  • Identify knowledge hubs in a given field.

6
HITS algorithm
  • Computes hubs and authorities for a search topic
  • Components
  • Sampling - build a focused collection of pages
    likely to contain many relevant authorities
  • Weight propagation - determine the weights of
    hubs and authorities in an iterative process

7
HITS algorithm
  • Web representation - directed graph
  • Concentrate on links that go to other domains.
  • Local links serve mainly navigational purposes.
  • Apply iterations of the algorithm to a reduced
    set of web pages.
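
The link filtering described on this slide can be sketched as follows. This is a minimal illustration, assuming links are given as (source URL, target URL) pairs and that "domain" is approximated by the URL host name; the function name cross_domain_links and the example URLs are hypothetical and not part of the original presentation.

```python
from urllib.parse import urlsplit

def cross_domain_links(links):
    """Keep only links whose source and target hosts differ.

    Assumption: the host name stands in for the "domain" of a page.
    """
    kept = []
    for src, dst in links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Drop links with unparsable hosts and purely local (navigational) links.
        if src_host and dst_host and src_host != dst_host:
            kept.append((src, dst))
    return kept

links = [
    ("http://a.example/index.html", "http://a.example/about.html"),  # local
    ("http://a.example/index.html", "http://b.example/"),
    ("http://b.example/", "http://a.example/index.html"),
]
filtered = cross_domain_links(links)
```

Only the two links that cross host boundaries remain in filtered; the local navigational link is dropped.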

8
HITS steps - root and base set
9
HITS steps - what to count?
  • Distinguish the pattern of relevant pages

10
HITS steps - how to count?
  • Count OUT links (used for the hub weight)
  • Count IN links (used for the authority weight)

11
HITS - calculating weights
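  • Authority weight - sum of the hub weights of the
    pages that point to the page
  • Hub weight - sum of the authority weights of the
    pages that the page points to
  • Matrix notation: A - adjacency matrix
  • A(i, j) = 1 if the i-th page points to the j-th
    page, 0 otherwise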
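
The formulas from this slide did not survive in the transcript. In the standard formulation of HITS they can be written as below, using the adjacency matrix A from the slide; the symbols a (authority weights) and h (hub weights) are notation assumed here, and both vectors are normalized after each update.

```latex
a_j = \sum_{i \,:\, A(i,j) = 1} h_i ,
\qquad
h_i = \sum_{j \,:\, A(i,j) = 1} a_j
\qquad\Longleftrightarrow\qquad
a = A^{T} h , \quad h = A\,a
```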

12
HITS - outcome
  • Iterative multiplication (power iteration) from
    any non-degenerate initial vector converges to
    the principal eigenvector of the link matrix
    (A^T A for authorities, A A^T for hubs).
  • Hubs and authorities are the outcome of the
    process.
  • Eigenvector -> the result is not an artifact of
    the process.
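
The power iteration can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is a small 0/1 adjacency matrix held as a list of lists, the iteration count is fixed rather than tested for convergence, and the function name hits is chosen here for illustration.

```python
import math

def hits(adj, iterations=50):
    """Run HITS power iteration on a 0/1 adjacency matrix (list of lists).

    adj[i][j] == 1 means page i links to page j.
    Returns (authority, hub) weight lists, each scaled to unit length.
    """
    n = len(adj)
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # Authority update: sum the hub weights of pages that point to j.
        auth = [sum(hub[i] for i in range(n) if adj[i][j]) for j in range(n)]
        # Hub update: sum the authority weights of pages that i points to.
        hub = [sum(auth[j] for j in range(n) if adj[i][j]) for i in range(n)]
        # Normalize so the weights do not grow without bound.
        a_norm = math.sqrt(sum(x * x for x in auth)) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub)) or 1.0
        auth = [x / a_norm for x in auth]
        hub = [x / h_norm for x in hub]
    return auth, hub

# Toy graph: pages 0 and 1 act as hubs, pages 2 and 3 as authorities.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
auth, hub = hits(adj)
```

On this toy graph the authority weight concentrates on pages 2 and 3 and the hub weight on pages 0 and 1, regardless of the (positive) starting weights.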

13
Basic results
  • Although HITS is purely link-based (it completely
    disregards page content), results are quite good
    for many tested queries.
  • The authors tested e.g. the query "search engines"
  • the algorithm returned Yahoo!, Excite, Magellan,
    Lycos, AltaVista
  • none of these pages described itself as a search
    engine (at the time of the experiment)

14
Encountered problems
  • Starting from a narrow topic, HITS tends to drift
    to a more general one.
  • Nature of hub pages - their many links can cause
    the algorithm to drift, since they can point to
    authorities on different topics.
  • Pages from a single domain / website can dominate
    the result if they all point to one page - not
    necessarily a good authority.

15
Algorithm enhancements
  • Use weighted sums for link calculation.
  • Take advantage of anchor text - the text
    surrounding the link itself.
  • Break hubs into smaller pieces. Analyze each
    piece separately, instead of the whole hub page
    as one.
  • Disregard or minimize the influence of links
    inside one domain.
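
The weighted sums and anchor-text ideas can be sketched together. The weighting rule used here (1 plus the number of query-term matches in the anchor text) is an illustrative assumption, not a formula given in the slides; the names link_weight and weighted_authority_step and the example URLs are hypothetical.

```python
def link_weight(anchor_text, query_terms):
    """Illustrative link weight: 1 plus the number of query-term matches
    in the anchor text (assumed rule, not taken from the slides)."""
    words = anchor_text.lower().split()
    matches = sum(words.count(term.lower()) for term in query_terms)
    return 1.0 + matches

def weighted_authority_step(links, hub, query_terms):
    """One weighted authority update: auth[dst] += weight * hub[src]."""
    auth = {}
    for src, dst, anchor_text in links:
        w = link_weight(anchor_text, query_terms)
        auth[dst] = auth.get(dst, 0.0) + w * hub.get(src, 0.0)
    return auth

links = [
    ("http://hub.example/list", "http://cars.example/", "classic cars club"),
    ("http://hub.example/list", "http://news.example/", "daily news"),
]
hub = {"http://hub.example/list": 1.0}
auth = weighted_authority_step(links, hub, ["cars"])
```

With the query term "cars", the link whose anchor text mentions the term contributes twice as much authority as the unrelated link from the same hub.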

16
Results from Clever system
  • Clever (HITS with enhancements) vs. AltaVista vs.
    Yahoo!
  • Comparison test on 26 broad topics.
  • 37 users ranked the results (bad, fair, good,
    fantastic) without knowing the source
  • top 10 results from AltaVista
  • 5 hubs and 5 authorities from Clever
  • 10 random pages from Yahoo! (as they appear in
    alphabetical order in search)

17
Results of experiment
  • 31% of responses showed no difference in result
    quality.
  • 19% of responses rated Yahoo! higher than
    Clever.
  • 50% of responses rated Clever higher than
    Yahoo!

18
Usage - trawling the Web
  • Looking for hidden cyber-communities on the Web
  • a group of content creators sharing a common
    interest, together with a set of Web pages
  • a large number of small communities with narrow
    topics
  • often not even registered in newsgroups or other
    catalogs

19
How to find cyber communities
  • Community - a small group with a dense pattern
    of linkage.
  • Each community has a similar linkage signature.
  • Graph structure - directed bipartite graph
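
The linkage signature can be illustrated with a brute-force search for complete bipartite cores: groups of "fan" pages that all link to the same group of "center" pages. This is a minimal sketch for tiny graphs only; the function name bipartite_cores, the default sizes and the toy data are assumptions made here, and a Web-scale trawl would need heavy pruning that is not shown.

```python
from itertools import combinations

def bipartite_cores(out_links, fans=3, centers=3):
    """Brute-force search for complete bipartite cores.

    out_links maps each page to the set of pages it links to.
    Yields (fan_tuple, center_tuple) where every fan links to every center.
    """
    pages = sorted(out_links)
    for fan_group in combinations(pages, fans):
        # Pages that every fan in the group links to.
        common = set.intersection(*(set(out_links[f]) for f in fan_group))
        # A page cannot be both a fan and a center of the same core.
        common -= set(fan_group)
        for center_group in combinations(sorted(common), centers):
            yield fan_group, center_group

out_links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3", "x"},
    "f3": {"c1", "c2", "c3"},
    "f4": {"c1", "x"},
}
cores = list(bipartite_cores(out_links))
```

In the toy data only f1, f2 and f3 share three common targets, so exactly one core is found; f4 shares a single target with them and is left out.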

20
Results in communities search
  • Used a web snapshot provided by Alexa.
  • Found about 130,000 complete bipartite graphs in
    which 3 webpages all point to the same 3 other
    pages - coincidence?
  • Manually tested a sample of 400
  • only 5% had no common topic
  • 25% of the communities were not represented in
    the groups catalog

21
Conclusions
  • Mining the WWW link structure can uncover social
    networks.
  • Hub and authority pages are more helpful for
    learning a topic than standard search results.
  • The algorithm can find a useful page even if the
    page does not describe itself in terms of the
    topic.

22
Conclusions
  • Propagating weights along links is also the
    basis of Google's PageRank algorithm.
  • Hubs - a new page category not returned by
    search engines.
  • It is possible and useful to find structure in
    the unstructured Web.

23
Questions?
24
References
  • Mining the Link Structure of the World Wide Web
    (1999) - S. Chakrabarti, B. Dom, D. Gibson, J.
    Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan,
    A. Tomkins
    http://www.cs.cornell.edu/home/kleinber/ieee99.ps
  • Authoritative Sources in a Hyperlinked
    Environment - Jon M. Kleinberg
    http://www.cs.cornell.edu/home/kleinber/auth.pdf
  • Trawling emerging cyber-communities
    automatically - R. Kumar, P. Raghavan, S.
    Rajagopalan, A. Tomkins
    http://www8.org/w8-papers/4a-search-mining/trawling/trawling.html