Recent Results in Automatic Web Resource Discovery

1
Recent Results in Automatic Web Resource Discovery
  • Soumen Chakrabarti
  • Presentation by Cui Tao

2
Introduction
  • Classical IR
  • Indexing a collection of documents
  • Answering queries by returning a ranked list of
    relevant documents
  • Problems in retrieving online documents
  • Ambiguity
  • Context sensitivity
  • Synonymy
  • Polysemy
  • Large number of relevant Web pages

3
Introduction
  • Directory-based topic browsing
  • Tree-like structure
  • Mostly maintained by human experts
  • Advantages: exemplary, influential
  • Disadvantages: slow, subjective, and noisy

4
Introduction
  • Standard crawlers and search engines
  • 1997: covered 35-40% of 340 million Web pages
  • 1999: covered 18% of 800 million Web pages
  • Cannot be used for maintaining generic portals
    and automatic resource discovery

5
Introduction
  • Focused crawler
  • Selectively seeks out pages that are relevant
    to a pre-defined set of topics
  • Preferred by experts and researchers
  • Two modules
  • Classifier: analyzes the text in and links around
    a given Web page and automatically assigns it to
    suitable directories in a Web catalog
  • Distiller: identifies the centrality of crawled
    pages to determine visit priorities
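
The crawl loop implied by these two modules can be sketched as a priority-ordered frontier. This is a minimal illustration, not the system described in the talk: the in-memory `WEB` graph, the `TOPIC_TERMS` set, the keyword-overlap `relevance` function (a stand-in for the classifier), and the rule that a link inherits its source page's relevance (a crude stand-in for the distiller) are all assumptions made for the example.

```python
import heapq

# Hypothetical in-memory "web": url -> (page text, outgoing links).
WEB = {
    "seed": ("cycling clubs and bike routes", ["a", "b"]),
    "a": ("bike repair and cycling gear", ["c"]),
    "b": ("stock market news", ["d"]),
    "c": ("mountain bike trails", []),
    "d": ("bond yields", []),
}

TOPIC_TERMS = {"cycling", "bike"}  # assumed topic definition


def relevance(text):
    """Stand-in classifier: fraction of topic terms present in the text."""
    words = set(text.split())
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)


def focused_crawl(seed, budget):
    """Visit up to `budget` pages, following the most promising links first."""
    frontier = [(-1.0, seed)]  # max-heap emulated with negated priorities
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        score = relevance(text)
        visited.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumed priority rule: a link is as promising as its source.
                heapq.heappush(frontier, (-score, link))
    return visited
```

With a budget of four pages, the link found on the off-topic page `b` is never followed, because links found on relevant pages are always popped first.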

6
Distillation techniques
  • Google
  • Simulates a random walk on the Web
  • Ranks pages by pre-computed popularity and
    visitation rate
  • Fast
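
A random walk of this kind is commonly computed by power iteration over the link graph. The sketch below is one such formulation under stated assumptions: the damping factor of 0.85, the fixed iteration count, and the treatment of pages without out-links are choices made for the example, not details given in the slides.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Estimate visitation rates for a random surfer.

    graph: dict mapping each page to the list of pages it links to.
    Every link target must also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = graph[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new_rank[w] += share
            else:
                # Page without out-links: spread its rank over all pages.
                share = damping * rank[v] / n
                for w in nodes:
                    new_rank[w] += share
        rank = new_rank
    return rank
```

Because the scores do not depend on any query, they can be computed once ahead of time, which is what makes query-time ranking fast.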

7
Distillation techniques
  • HITS (Hyperlink Induced Topic Search)
  • Depends on a search engine
  • Combines two scores
  • Authorities: pages with useful information
    about a topic
  • Hubs: pages that contain many links to
    pages with useful information on the topic
  • Query dependent and slow
  • May lead to topic contamination or drift
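
The mutual reinforcement between hubs and authorities can be sketched as an alternating update over the link graph. This is a minimal version that assumes the query-specific subgraph has already been retrieved from a search engine; the iteration count and the L2 normalization are choices made for the example.

```python
import math


def hits(graph, iterations=20):
    """Compute hub and authority scores.

    graph: dict mapping a page to the list of pages it links to.
    """
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages that link to the page.
        auth = {p: 0.0 for p in pages}
        for p, targets in graph.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

Since every link counts equally here, a densely linked off-topic cluster in the subgraph can come to dominate the scores, which is the contamination or drift mentioned above.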

8
Distillation techniques
  • ARC and CLEVER
  • ARC (Automatic Resource Compiler): part of CLEVER
  • Root set is expanded by 2 links instead of 1 link
  • (Including all pages which are at link-distance
    two or less from at least one page in the root
    set)
  • Assigns weights to the hyperlinks based on the
    match between the query and the text surrounding
    the hyperlink in the source document
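
The link-weighting idea can be illustrated with a small sketch. The formula used here (one plus the number of query-term matches in a window of words around the link), the window size, and both function names are assumptions made for illustration; the slides do not give the exact scheme.

```python
def edge_weight(query_terms, source_words, link_position, window=5):
    """Weight a hyperlink by query matches in the text around it.

    source_words: the source document as a list of words.
    link_position: index of the word where the link occurs.
    Assumed formula: 1 + number of query-term matches within the window.
    """
    start = max(0, link_position - window)
    end = link_position + window + 1
    nearby = [w.lower() for w in source_words[start:end]]
    terms = {t.lower() for t in query_terms}
    return 1 + sum(1 for w in nearby if w in terms)


def weighted_authority_step(edges, hub):
    """One authority update in which each link counts by its weight.

    edges: dict mapping (source, target) to the link weight.
    hub: dict mapping a page to its current hub score.
    """
    auth = {}
    for (src, dst), weight in edges.items():
        auth[dst] = auth.get(dst, 0.0) + weight * hub.get(src, 0.0)
    return auth
```

A link whose surrounding text mentions the query terms then contributes more to its target's authority score than a link in unrelated text.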

9
Distillation techniques
  • Outlier filtering
  • Computes relevance weights for pages using the
    vector space model
  • All pages whose weights are below a threshold are
    pruned
  • Effectively prunes away outlier nodes in the
    neighborhood, thus avoiding contamination
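
A minimal sketch of this pruning step, assuming raw term-frequency vectors, cosine similarity against a query vector as the relevance weight, and an arbitrary threshold of 0.2; none of these specifics come from the slides.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_outliers(pages, query, threshold=0.2):
    """Keep only pages whose relevance weight reaches the threshold.

    pages: dict mapping url to page text.
    Returns a dict mapping each surviving url to its relevance weight.
    """
    query_vector = Counter(query.lower().split())
    kept = {}
    for url, text in pages.items():
        weight = cosine(Counter(text.lower().split()), query_vector)
        if weight >= threshold:
            kept[url] = weight
    return kept
```

Only the surviving pages would then be passed on to the link analysis, so off-topic neighbors cannot accumulate hub or authority scores.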

10
Topic distillation vs. Resource discovery
  • Topic distillation
  • Depends on large, comprehensive Web crawls and
    indices (post-processing)
  • Can it be used to generate a Web taxonomy?
  • Set a keyword query for each node in the taxonomy
  • Run a distillation program
  • Simple, but has some problems

11
Topic distillation vs. Resource discovery
  • Problems
  • Constructing the query involves trial and error
    and complicated thought
  • Topic: North American telecommunication
    companies
  • Query: "power suppl" "switch mode" smps
    -multiprocessor "uninterrupt power suppl" ups
    -parcel
  • The Yahoo! node /BusinessEconomy/Companies
    /Electronics/PowerSupplies
  • Queries matching the directory-based browsing
    quality of Yahoo!: 7.03 terms and 4.34 operators
  • Alta Vista queries: 2.35 terms and 0.41 operators

12
Topic distillation vs. Resource discovery
  • Problems
  • Contamination
  • Stop-sites: not automatic
  • Term weighting
  • Edge weighting: no precise algorithm to set the
    weights
  • Topic distillation by itself is not enough for
    resource discovery

13
Hypertext classification: learning from examples
  • Adding example pages and their distance-1
    neighbors to the graph to be distilled
    improves the result
  • The contents of the given examples and their
    neighbors provide a way to compute the decision
    boundary of the classification
  • NN, Bayesian, and support vector classifiers
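
To make learning from examples concrete, the sketch below trains a multinomial naive Bayes text classifier, one instance of the Bayesian family listed above. The whitespace tokenization, the add-one smoothing, and the function names `train` and `classify` are assumptions made for the example; it uses page text only and ignores link-based features.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect per-topic word counts from (text, label) example pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the label with the highest posterior log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                count = word_counts[label][word]
                # Add-one smoothing keeps unseen (label, word) pairs non-zero.
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A handful of labeled pages per topic takes the place of a hand-tuned keyword query: the decision boundary is estimated from the examples rather than written by the user.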

14
Hypertext classification
  • Link-based features are important
  • Circular topic influence
  • The topic of one page influences its text and
    its neighbor pages' topics
  • Knowledge of the linked vicinity's topic provides
    clues for the test document's topic
  • Bibliometric, more general than the simple linear
    endorsement model used in topic distillation

15
Putting it together for resource discovery

16
Conclusion
  • Emphasized the importance of scalable automatic
    resource discovery
  • Argued that common search engines are not
    adequate for resource discovery
  • Introduced the recently invented focused crawling
    system

17
Future Work
  • How to derive the training examples
    automatically?
  • How to personalize the outcome of a focused
    crawler for users?