The CLEVER Searching

Slides: 68
Provided by: Lek2

1
The CLEVER Searching
IBM Almaden Research Center
2
The lecture outline
  • Introduction
  • The HITS algorithm (Hyperlink Induced Topic
    Search)
  • The Automatic Resource Compiler
  • The CLEVER
  • Web-communities
  • The Focused Crawler

3
Introduction
  • Hundreds of millions of pages on the WEB
  • Every day another million is added
  • More than a billion hyperlinks connecting them
  • The WEB lacks organization and structure
  • How can we find information ?

Traditional search engines !
4
Search Engines
  • maintain an index of words and pages containing
    them
  • use a ranking function to rank the pages.

5
The ranking function
  • Most search engines use rules of thumb such as:
  • The number of times the page contains the query
    word
  • The location of the word in the page
  • Giving more weight to words in titles or in a
    larger font
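
The rules of thumb above can be illustrated with a toy scoring function. This is only a sketch under stated assumptions: the function name `rule_of_thumb_score`, the weights `title_weight` and `early_bonus`, and the way position is rewarded are all hypothetical choices, not taken from any real search engine.

```python
import re


def rule_of_thumb_score(query_word, title, body, title_weight=3.0, early_bonus=1.0):
    """Toy ranking score built from the three heuristics on the slide."""
    word = query_word.lower()
    title_tokens = re.findall(r"\w+", title.lower())
    body_tokens = re.findall(r"\w+", body.lower())

    # Heuristic 3: words in the title count more than words in the body.
    score = title_weight * title_tokens.count(word)
    # Heuristic 1: number of times the page contains the query word.
    score += body_tokens.count(word)
    # Heuristic 2: location of the word; earlier first occurrence scores higher.
    if word in body_tokens:
        first = body_tokens.index(word)
        score += early_bonus * (1.0 - first / len(body_tokens))
    return score
```

A function like this is easy to fool by repeating the query word, which is the spamming weakness raised on the next slide.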

6
Some disadvantages
  • Spamming: using invisible fonts, repeating words
  • Polysemy - same word having multiple meanings
  • Synonymy - different words having the same
    meaning

A possible solution: semantic networks
7
Semantic networks
  • Double-edged sword - it helps with synonymy but
    aggravates polysemy
  • Expensive to build and maintain
  • Difficult to build: many languages, new
    terminology

Another possible solution: human-selected pages
(YAHOO)
8
Some disadvantages
  • Countless possible queries
  • Depends on individual judgement
  • Almost impossible as the WEB grows by a million
    pages a day
  • Example: fishing

The CLEVER solution: using hyperlinks
9
The HITS Algorithm(1997)
  • Let the WEB be a directed graph
  • - nodes: static HTML pages
  • - edges: links
  • The average node has seven outgoing edges

10
The HITS Algorithm(1997)
  • Two kinds of useful pages:
  • Authority page - contains a lot of information
    about the query topic
  • Hub page - contains a large number of links to
    pages containing info

11
The HITS Algorithm(1997)
  • The concept:
  • A good hub points to many good authorities
  • A good authority is pointed to by many good hubs

The goal: to find the best hubs and authorities (H&A) on the topic
12
The HITS Algorithm(1997)
  • The algorithm has two main stages:
  • The sampling stage - constructing a collection of
    pages in which we will search for H&A
  • The weight propagation stage - assigning
    numerical hub and authority scores to every page

13
The sampling stage
  • Creating a root set of 200 pages using AltaVista
  • Expanding the root set to a base set of 2000
    pages
  • Deleting links between pages in the same site

The result: a subgraph G which is rich in H&A
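
The sampling stage can be sketched roughly as below. This is an illustration under assumptions, not the original system: the dictionaries `out_links` and `in_links` stand in for link data that would really come from crawling and from a search-engine backlink query, both function names are hypothetical, and "same site" is approximated by comparing host names.

```python
from urllib.parse import urlparse


def build_base_set(root_set, out_links, in_links, max_size=2000):
    """Expand the root set with pages it links to and pages that link to it."""
    base = list(dict.fromkeys(root_set))  # de-duplicate, keep order
    seen = set(base)
    for page in root_set:
        for neighbour in out_links.get(page, []) + in_links.get(page, []):
            if neighbour not in seen and len(base) < max_size:
                seen.add(neighbour)
                base.append(neighbour)
    return base


def cross_site_edges(base, out_links):
    """Keep only links between base-set pages hosted on different sites."""
    members = set(base)
    edges = []
    for p in base:
        for q in out_links.get(p, []):
            # Assumption: two pages are on the same site if their hosts match.
            if q in members and urlparse(p).netloc != urlparse(q).netloc:
                edges.append((p, q))
    return edges
```

The edge list returned by `cross_site_edges` plays the role of the subgraph G on which the hub and authority weights are then computed.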
14
The weight propagation stage
  • Let a(p) be the authority weight of page p
  • Let h(p) be the hub weight of p
  • 1. For every page p: $h(p) \leftarrow 1$
  • 2. Repeat k times:
  • for every p: $a(p) \leftarrow \sum_{q \to p} h(q)$
  • for every p: $h(p) \leftarrow \sum_{p \to q} a(q)$
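
The two update rules can be sketched in Python as follows. This is a minimal sketch, not the presenters' implementation: the edge-list input, the function name `hits`, and the default `k=20` are assumptions, and the normalisation after each round (which keeps the weights bounded) is not shown on the slide.

```python
import math


def hits(edges, k=20):
    """edges: list of (p, q) pairs meaning page p links to page q."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}   # step 1: h(p) = 1 for every page
    auth = {n: 0.0 for n in nodes}
    for _ in range(k):              # step 2: repeat k times
        # a(p) = sum of h(q) over all q that point to p
        auth = {n: 0.0 for n in nodes}
        for q, p in edges:
            auth[p] += hub[q]
        # h(p) = sum of a(q) over all q that p points to
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += auth[q]
        hub = new_hub
        # Assumption: rescale both vectors to unit length each round.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub
```

For example, `hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])` gives `a1` the largest authority weight and `h1` the largest hub weight.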

15
Why use hubs?
  • Two main reasons
  • Hubs are places from which you can start
    searching
  • In cyberspace, competing authorities ignore each
    other and can be connected only by hubs.

16
The ARC
  • If computations were not a bottleneck,
  • what would be the most effective search
  • algorithm ?

Automatic Resource Compiler (ARC)
17
The ARC
  • ARC is a system which, given a topic that is
    broad and well represented on the WEB, will seek
    out and return a list of WEB resources that it
    considers the most authoritative.

The goal: to compile lists similar to those
provided by YAHOO or Infoseek
18
The Algorithm
  • Performs a local analysis of both text and
    links to arrive at a global consensus of
    the best resources for the topic.
  • Three phases:
  • - search and growth
  • - weighting
  • - iteration and reporting

19
Search and growth
  • Collecting a root set from Altavista using the
    query terms.
  • Augmenting the root set twice by adding:
  • - pages that point to the documents in the
    root set
  • - pages that are pointed to by documents
    in the root set

20
Weighting
  • The concept: the text around the anchor in
    a page p that links to a page q is descriptive
    of the contents of q.
  • Let w(p,q) be a positive numerical weight
    that reflects the amount of topic-related text
    in the vicinity of the anchor.

21
Iteration and reporting
  • For every page p: $h(p) \leftarrow 1$
  • Repeat k times:
  • for every p: $a(p) \leftarrow \sum_{q \to p} w(q,p)\, h(q)$
  • for every p: $h(p) \leftarrow \sum_{p \to q} w(p,q)\, a(q)$
  • Return the best 15 hubs and the best 15
    authorities
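
The weighted iteration might look like the sketch below. It is illustrative only: the function name `arc_iterate`, the dictionary input format, the default `k=10`, and the rescaling of hub weights after each round are assumptions rather than details given on the slide.

```python
def arc_iterate(weights, k=10, top=15):
    """weights: dict mapping (p, q) to w(p, q) for every link p -> q."""
    pages = {n for edge in weights for n in edge}
    hub = dict.fromkeys(pages, 1.0)      # h(p) = 1 for every page
    auth = dict.fromkeys(pages, 0.0)
    for _ in range(k):
        # a(p) = sum over q -> p of w(q, p) * h(q)
        auth = dict.fromkeys(pages, 0.0)
        for (q, p), w in weights.items():
            auth[p] += w * hub[q]
        # h(p) = sum over p -> q of w(p, q) * a(q)
        hub = dict.fromkeys(pages, 0.0)
        for (p, q), w in weights.items():
            hub[p] += w * auth[q]
        # Assumption: rescale so the numbers do not grow without bound.
        total = sum(hub.values()) or 1.0
        hub = {n: v / total for n, v in hub.items()}

    def best(scores):
        return sorted(scores, key=scores.get, reverse=True)[:top]

    return best(hub), best(auth)
```

With `weights = {("h1", "a1"): 3.0, ("h1", "a2"): 1.0, ("h2", "a2"): 1.0}`, page `a1` ranks first among authorities even though `a2` has more incoming links, because the link into `a1` carries more topic-related anchor text.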

22
Computing w(p,q)
  • What precisely is "vicinity"?
  • Let us define the anchor window as a window of B
    bytes before and after the HREF
  • The distance between Yahoo and www.yahoo.com

B is set to 50
23
Computing w(p,q) - cont.
  • How do we map occurrences of descriptive text in the
    anchor window into weights?
  • Let n(t) denote the number of matches between
    terms in the topic and in the anchor window.

$w(p,q) = 1 + n(t)$
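
A rough sketch of computing the anchor-window weight, taking w(p,q) as 1 + n(t). Several things here are assumptions: the function name `anchor_weight`, locating the link by a plain substring search for the target URL, counting characters instead of bytes, case-insensitive matching, and excluding the URL itself from the window.

```python
import re


def anchor_weight(html, href, topic_terms, window=50):
    """Return 1 + n(t) for the first link to `href` found in `html`."""
    pos = html.find(href)
    if pos == -1:
        return 0  # page p does not link to q, so there is no edge to weight
    start = max(0, pos - window)
    end = min(len(html), pos + len(href) + window)
    # Text within `window` characters before and after the HREF target.
    text = (html[start:pos] + " " + html[pos + len(href):end]).lower()
    # n(t): number of topic-term matches inside the anchor window.
    n_t = sum(len(re.findall(re.escape(term.lower()), text)) for term in topic_terms)
    return 1 + n_t
```

A link with no topic terms nearby still receives weight 1, so it is counted as an ordinary unweighted link.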
24
Analysis
  • The weighting process and the iterative computation
    take a few seconds
  • The bottleneck: computing the root and augmented
    sets.

Conclusion: the system will not field 1000s of
queries per second and produce answers in real
time.
25
Experiments
  • Well-known sources to compare with ARC:
    Yahoo and Infoseek
  • Topics:
  • - topics from the directories of YAHOO
  • - 28 topics, each containing 2-3 words
  • - Examples: cheese, classical guitar, Gulf
    War

26
Experiments-cont.
  • Volunteers
  • - each one evaluated two topics
  • - every topic was evaluated by two persons
  • The evaluation form

27
Results
  • ARC's lists are almost competitive with
    the lists of YAHOO and Infoseek
  • The combination of link and text analysis
    is successful!

28
New improvements (1999)
  • Seeking a final set of H&A that provides good
    access to a wide variety of information
  • Returning only a single point of entry into a WEB
    source
  • Identifying interesting sections of WEB pages and
    using them to determine which other pages might be
    good H&A (e.g. Mango fruit)

The improved system is called CLEVER
29
User study
  • Comparing the top ten results of CLEVER, YAHOO
    and Altavista on the same 27 topics
  • 37 participants rank every page as fantastic,
    good, fair or bad
  • Every search engine gets 1 point if its page was
    ranked good or fantastic
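
The scoring rule amounts to a simple tally, sketched below. The function name `score_engines` and the `(engine, rating)` input format are assumptions made for illustration; no ratings from the actual study are reproduced here.

```python
from collections import Counter


def score_engines(ratings):
    """ratings: iterable of (engine, rating) pairs, one per judged page."""
    points = Counter()
    for engine, rating in ratings:
        # One point per page judged "good" or "fantastic", none otherwise.
        points[engine] += 1 if rating in {"good", "fantastic"} else 0
    return dict(points)
```

Called with made-up judgements such as `[("engine_a", "good"), ("engine_a", "fantastic"), ("engine_b", "fair")]`, it returns the point total for each engine.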

30
The results(1)
31
The results(2)
CLEVER performs better than YAHOO and Altavista
32
Summary
  • HITS
  • CLEVER performs better than YAHOO and Altavista
  • The next stage - building resource lists such as
    YAHOO.
  • A useful by-product of the development of CLEVER
    is the separation of WEB pages into clusters
    (communities)

ARC
CLEVER
33
Inferring web communities from link topology
  • WWW - a decentralized, almost anarchic
    growth process.
  • - Result: a large hyperlinked corpus that
    lacks a traditional logical organization.
  • TARGET: find order in disorder, or
    extract meaningful structures from the
    hyperlinked environment.

34
Inferring web communities from link topology
  • The notion: search for hyperlinked communities
    through analysis of the link topology, as a way
    of addressing issues such as:
  • - navigation.
  • - information discovery.
  • - web sociology, etc.

35
COMMUNITIES: densely interconnected sets of
hubs and authorities.
36
Inferring web communities from link topology
  • The means for the search: the HITS algorithm.
  • Running the algorithm on a specified
    subject reveals a community of hubs
    and authorities related to the subject.

37
Inferring web communities from link topology
  • What kinds of communities are discovered by HITS?
  • How do the communities discovered depend on the
    choice of root set?
  • How quickly do the communities crystallize as the
    number of iterations grows?

38
How quickly do communities crystallize ?
  • Principal community: the set of the top 10
    authorities and top 10 hubs (marked C).
  • C(R,N): the principal community obtained by
    running HITS for N iterations with an initial
    root set of R pages.
  • Empirical result: most communities become stable
    by about 50 iterations from a root set of
    200 pages.

39
How quickly do communities crystallize ?
  • Therefore define C = C(200,50)
    and run tests over different Rs and Ns
    for 6 representative topics:
  • 1. Harvard 2. Cryptography 3. English
    literature
  • 4. Skiing 5. Optimization 6. Operations
    research
  • Look at the overlap between C(R,N) and C, and see
    which community converges more quickly (with
    smaller Ns and Rs).
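
The overlap comparison can be sketched as below, assuming hub and authority weights are already available as dictionaries from some HITS run. The function names and the choice to report overlap as a count of shared pages, separately for authorities and hubs, are assumptions for illustration.

```python
def principal_community(auth, hub, top=10):
    """Top authorities and top hubs, as a pair of sets."""
    top_auth = sorted(auth, key=auth.get, reverse=True)[:top]
    top_hub = sorted(hub, key=hub.get, reverse=True)[:top]
    return set(top_auth), set(top_hub)


def overlap(community, reference):
    """Number of authorities and hubs shared with the reference community."""
    auth, hub = community
    ref_auth, ref_hub = reference
    return len(auth & ref_auth), len(hub & ref_hub)
```

A topic whose overlap with the reference community reaches the maximum at small R and N would count as crystallizing quickly.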

40
How quickly do communities crystallize ?
  • RESULTS
  • The principal community for cryptography
    crystallizes very quickly (a central area in CS),
    but for skiing and English literature the
    process is remarkably slow. Topics like
    operations research are somewhere in between.
  • How can this be explained?

41
Main observations emerging from the test analysis
  • Communities on broad topics usually have quite a
    robust structure (relatively independent of the
    root-set choice).
  • The success of HITS depends on the breadth of the
    topic and the discipline of human knowledge under
    which it falls (the density of hyperlinks in a
    discipline like CS is greater than in the academic
    humanities).

42
Main observations emerging from the test analysis
  • IMPORTANT
  • The greatest degree of orderly structure, as
    extracted by HITS, is found in communities for
    which the number of relevant pages and the
    density of hyperlinking are the largest!
  • (cryptography vs. English literature)

43
Main observations emerging from the test analysis
  • The standard view
  • the web is becoming increasingly chaotic and
    difficult to model
  • Observation
  • HITS becomes more and more effective as the size
    of the web continues to increase
  • Consequence
  • we can make predictive statements about less
    linked communities based on current experience
    with highly-linked ones

44
The structure of communities
  • Robustness
  • On broad topics: stable and robust communities,
    despite a very small initial root set.
  • Methods used to diversify root sets:
  • - querying multiple search engines for the
    initial root set.
  • - using the same term in different languages.

45
The structure of communities
  • RESULT
  • The main communities tend to recur in all the
    experiments, regardless of how the root set is
    constructed.

46
The structure of communities
  • TOPIC GENERALIZATION
  • HITS tends to generalize topics that are not
    sufficiently broad.
  • WHAT DOES THAT MEAN?
  • It means that the principal community of such
    a subject will be relevant to a topic which
    includes, but is larger than, the initial subject
    given to HITS.

47
The structure of communities
  • TOPIC GENERALIZATION
  • A sufficiently broad topic: Michael Jordan
  • HITS gives good and relevant results.
  • Topic: Dennis Ritchie (author of C)
  • HITS results (3 top authorities):
  • www.cm.cf.ac.uk/Dave/C/CE.html
  • www.cyberdiem.com/vin/learn.html
  • www.lysator/liu.se/c/index.html
  • All pages on the C language itself!
  • But where is Ritchie ???

48
The structure of communities
  • Swallowed up by the subject to which he most
    prominently belongs
  • And there are more examples.

49
The structure of communities
  • Generalization allows an automatic
    characterization of specific subjects in terms of
    their generalizations.

50
The structure of communities
  • TREE OF TOPICS
  • Picture generalization as occurring on a tree
    of topics, where the most general topics are close
    to the root and their children are sub-topics.
  • YAHOO - an example of a hand-made searchable
    hierarchy.
  • HITS gives us a way to collect information about
    such trees automatically.

51
The structure of communities
  • OTHER FACTORS AFFECTING GENERALIZATION
  • 1) Web-centric sub-topics
  • 2) commercialization

52
The structure of communities
  • Web-centric sub-topics
  • Generality is determined by the representation
    of the topic on the WWW.
  • Many pages are mostly concerned with topics that
    involve the web itself.
  • Thus, HITS may focus on a certain community
    because it represents a more web-centric
    version of the given topic.

53
The structure of communities
  • Example: for the topic "linguistics", the top
    authorities were
  • www.cs.columbia.edu/acl//home.html
  • www.cs.columbia.edu/radev/cgi-bin/universe.cgi
  • www.ling.rochester.edu/linglinks.html
  • The first 2 are strong authorities for
    computational linguistics, which is only a sub-topic
    of the requested topic, but this sub-topic is more
    heavily linked on the web (it is more closely related
    to CS and to the web itself).
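A minimal sketch of why this happens, assuming the standard HITS updates (authority weight = sum of the hub weights of in-linking pages, hub weight = sum of the authority weights of out-linked pages, followed by normalization). The toy graph and all page names below are hypothetical: a densely linked "computational linguistics" community and a sparser "general linguistics" one. The iteration concentrates the authority weight on the denser community.

```python
import math


def normalize(weights):
    """Scale a weight dict to unit Euclidean length (no-op if all zero)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def hits(edges, iterations=50):
    """Run the HITS updates on a list of (source, target) hyperlinks."""
    pages = sorted({page for edge in edges for page in edge})
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: sum hub weights of the pages linking in.
        new_auth = {page: 0.0 for page in pages}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum (new) authority weights of the pages linked to.
        new_hub = {page: 0.0 for page in pages}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: three hubs point at two computational-linguistics
# authorities; only two hubs point at one general-linguistics authority.
edges = [
    ("cl_hub_1", "cl_auth_1"), ("cl_hub_1", "cl_auth_2"),
    ("cl_hub_2", "cl_auth_1"), ("cl_hub_2", "cl_auth_2"),
    ("cl_hub_3", "cl_auth_1"), ("cl_hub_3", "cl_auth_2"),
    ("ling_hub_1", "ling_auth"), ("ling_hub_2", "ling_auth"),
]

auth, hub = hits(edges)
top_authorities = sorted(auth, key=auth.get, reverse=True)[:2]
print(top_authorities)
```

In this toy graph the two communities are disconnected, so the weight of the sparser community shrinks with every iteration and the two top authorities both come from the denser one.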

54
The structure of communities
  • COMMERCIAL/ADVERTISING INFLUENCES
  • For topics with both commercial and individual
    involvement, the authorities in the principal
    communities are the commercialized pages.

55
The structure of communities
  • MORE OBSERVATIONS
  • 1) Infiltration of pages from AltaVista, YAHOO, and
    www.microsoft.com into innocent communities.
  • 2) Temporal influence of short-term issues
    (ex.: the Harvard Conference on the Internet and
    Society was prominent with Harvard in 1/97 but
    not in 8/97).

56
Conclusions
  • The web is less chaotic than it seems.
  • HITS gives a convenient way to analyze the web's
    link topology and to reveal the hyperlinked
    communities on the web, which appear to span a wide
    range of interests and disciplines.
  • Using HITS, one can study the sociology of the
    web and get a global understanding of how it is
    being constructed and how it behaves.

57

58
Focused Crawling
  • 2 forces shaping the future of the web
  • 1. Exploding volume of the WWW.
  • 2. Growing mass of users who use the web for
    serious research.
  • Small is beautiful: specialized search portals vs.
    one-size-fits-all portals (AltaVista)

59
Focused Crawling
  • The question: can a focused portal be built
    automatically?
  • The answer: yes, using a focused crawler.

60
Focused Crawling
  • The focused crawler seeks, acquires, indexes and
    maintains pages on a specific set of topics that
    represent a relatively narrow segment of the web.
  • For serious web users, focused portholes are
    more useful than generic portals.

61
Focused Crawling
  • Operation synopsis
  • The user sets an initial root-set of good example
    pages (using an existing taxonomy as a base).
  • Crawling from the root-set using 2 disciplines
  • - scoring the relevance of each newly reached page
    to the initial group (classifying).
  • - estimating the benefit of crawling out from the
    page's out-links (distillation).
  • Combining the power of content and links.
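The synopsis above can be sketched as a best-first crawl over a toy in-memory web. Everything here is an illustrative assumption rather than the authors' implementation: the `web` dictionary, the page names, the keyword-overlap `relevance` function (a stand-in for the trained classifier), and the rule of prioritizing out-links by the relevance of the page they were found on (a crude stand-in for distillation).

```python
import heapq


def relevance(text, topic_words):
    """Stand-in classifier: fraction of topic words present in the page text."""
    words = set(text.lower().split())
    return len(words & topic_words) / len(topic_words)


def focused_crawl(web, root_set, topic_words, budget, threshold=0.5):
    """Best-first crawl; returns (url, relevance) pairs in fetch order.

    `web` maps a URL to (page_text, out_links). Out-links inherit the
    relevance of the page that cites them as their crawl priority, and
    pages scoring below `threshold` are not expanded.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in root_set]
    heapq.heapify(frontier)
    seen = set(root_set)
    fetched = []
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        text, out_links = web[url]
        score = relevance(text, topic_words)  # classifying
        fetched.append((url, score))
        if score < threshold:
            continue  # do not crawl out of off-topic pages
        for link in out_links:
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return fetched


# Hypothetical toy web for the topic "cycling".
web = {
    "root": ("cycling clubs and bike routes", ["bike-shop", "news"]),
    "bike-shop": ("bike repair and cycling gear", ["races", "news"]),
    "news": ("world news and weather", ["stocks"]),
    "stocks": ("stock market prices", []),
    "races": ("cycling races and bike teams", []),
}

fetched = focused_crawl(web, ["root"], {"cycling", "bike"}, budget=10)
print([url for url, _ in fetched])
```

The off-topic "news" page is still fetched once, because its relevance is only known after fetching, but the crawler never follows its link to "stocks".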

62

63
Focused Crawling
  • Important notions
  • - user feedback
  • - supervised learning
  • - high harvest-rate (fraction of page fetches
    relevant to the user's interest)
  • - keeping focused on the subject
  • example: cycling
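The harvest-rate definition above can be written out directly. This is a small sketch; the function name, the per-page relevance scores, and the 0.5 cut-off for counting a page as relevant are assumptions made for illustration.

```python
def harvest_rate(relevance_scores, threshold=0.5):
    """Fraction of fetched pages whose relevance score reaches the threshold."""
    if not relevance_scores:
        return 0.0
    relevant = sum(1 for score in relevance_scores if score >= threshold)
    return relevant / len(relevance_scores)


# Four fetches, three of them judged relevant to the topic.
rate = harvest_rate([1.0, 1.0, 0.0, 1.0])
print(rate)
```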

64

65
Focused Crawling - summary
  • A learning-from-examples mechanism
  • Efficient, close-to-subject crawling that ignores
    irrelevant segments of the web
  • Reaching further relevant segments of the web
    as the search goes deeper
  • A good answer to the need for specialized
    and filtered web-libraries

66
Publications
  • S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg,
    P. Raghavan, and S. Rajagopalan. Automatic
    Resource Compilation by Analyzing Hyperlink
    Structure and Associated Text.
  • D. Gibson, J. Kleinberg, and P. Raghavan.
    Inferring Web Communities from Link Topology.
  • S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg,
    S.R. Kumar, P. Raghavan, S. Rajagopalan, and A.
    Tomkins. Hypersearching the Web.
  • S. Chakrabarti, M. van den Berg, and B. Dom. Focused
    Crawling: A New Approach to Topic-Specific
    Resource Discovery.

67
References
  • http://decweb.ethz.ch/WWW7/1898/com1898.html
  • http://www.almaden.ibm.com/cs/k53/abstract.html
  • http://www.almaden.ibm.com/cs/k53/clever.html
  • http://www.cs.berkeley.edu/soumen/0699raghavan.html
  • http://www.cs.berkeley.edu/soumen/doc/www99focus/html