Recent Results in Automatic Web Resource Discovery - PowerPoint PPT Presentation

Slides: 18

Provided by: cui1

Transcript and Presenter's Notes

1
Recent Results in Automatic Web Resource Discovery

2
Introduction

3
Introduction

4
Introduction

5
Introduction

Focused crawler
Can selectively seek out pages that are relevant
to a pre-defined set of topics
Preferred by experts and researchers
Two modules
Classifier analyzes the text in and links around
a given web page and automatically assigns it to
suitable directories in a web catalog
Distiller identifies the centrality of crawled
pages to determine visit priorities
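
The interplay of the two modules can be sketched as a best-first crawl loop. This is a minimal Python sketch under stated assumptions, not the system described in the slides: the in-memory `WEB` graph, the keyword-overlap `relevance` function standing in for the classifier, and the `threshold` value are all hypothetical, and the distiller is reduced to a priority queue ordered by the relevance of the citing page.

```python
import heapq

# Hypothetical toy web: url -> (page text, outgoing links)
WEB = {
    "a": ("power supply design notes", ["b", "c"]),
    "b": ("switch mode power supply tutorial", ["d"]),
    "c": ("celebrity gossip and news", ["e"]),
    "d": ("ups and smps power supply vendors", []),
    "e": ("more gossip", []),
}
TOPIC = {"power", "supply", "smps", "ups"}  # assumed topic vocabulary

def relevance(text):
    """Stand-in classifier: fraction of topic terms present in the page."""
    words = set(text.split())
    return len(words & TOPIC) / len(TOPIC)

def focused_crawl(seed, threshold=0.25):
    """Visit pages best-first; follow links only from relevant pages."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen, visited = {seed}, []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        score = relevance(text)
        visited.append((url, score))
        if score < threshold:
            continue  # off-topic page: do not expand its links
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page that cites it.
                heapq.heappush(frontier, (-score, link))
    return visited
```

Starting from `focused_crawl("a")`, the off-topic page `c` is fetched but its link to `e` is never followed, which is the selective behavior the slide describes.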

6
Distillation techniques

7
Distillation techniques

HITS (Hyperlink Induced Topic Search)
Depends on a search engine
Combine two scores
Authorities identify pages with useful
information about a topic
Hubs identify pages that contain many links to
pages with useful information on the topic
Query dependent and slow
May lead to topic contamination or drift
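
The mutual reinforcement between hubs and authorities can be sketched as a simple iteration. This is a minimal Python sketch; the adjacency-dict graph representation, the function name `hits`, and the fixed iteration count are assumptions made for illustration.

```python
def hits(graph, iterations=50):
    """graph: dict mapping a page to the list of pages it links to.
    Returns (hub_scores, authority_scores)."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for u, targets in graph.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[v] for v in graph.get(n, [])) for n in nodes}
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = sum(s * s for s in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth
```

For example, with `{"h1": ["a1", "a2"], "h2": ["a1", "a2"], "x": ["a2"]}` the page `a2` ends up with the highest authority score, and `h1` and `h2` score higher as hubs than `x`.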

8
Distillation techniques

ARC and CLEVER
ARC (Automatic Resource Compiler) is part of CLEVER
Root set is expanded by 2 links instead of 1 link
(including all pages that are at link-distance
two or less from at least one page in the root
set)
Assigns weights to the hyperlinks based on the
match between the query and the text surrounding
the hyperlink in the source document
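
The two ideas on this slide, the wider root-set expansion and the query-dependent link weights, can be sketched as follows. This is a minimal Python sketch, not the ARC implementation: the function names, the dict-based link tables, and the particular weighting (one plus the number of query-term matches in the surrounding text) are illustrative assumptions.

```python
def expand_root_set(root, out_links, in_links, distance=2):
    """Collect all pages within `distance` links (in either direction)
    of at least one page in the root set."""
    pages = set(root)
    frontier = set(root)
    for _ in range(distance):
        reached = set()
        for page in frontier:
            reached.update(out_links.get(page, []))
            reached.update(in_links.get(page, []))
        frontier = reached - pages  # only pages not seen before
        pages |= frontier
    return pages

def anchor_weight(query, surrounding_text):
    """Assumed weighting: 1 plus the number of words near the hyperlink
    that match a query term."""
    terms = set(query.lower().split())
    window = surrounding_text.lower().split()
    return 1 + sum(1 for word in window if word in terms)
```

With `distance=1` the first function reduces to the usual one-link expansion, so the two settings can be compared on the same link tables.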

9
Distillation techniques

Outlier filtering
Computes relevance weights for pages using Vector
Space Model
All pages whose weights are below a threshold are
pruned
Effectively prunes away outlier nodes in the
neighborhood, thus avoiding contamination
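
The pruning step can be sketched with a plain vector space model. This is a minimal Python sketch under stated assumptions: it uses raw term-frequency vectors and cosine similarity against a reference text, whereas the slides do not say which term weighting, reference vector, or threshold the original method uses. The names `prune_outliers` and `reference_text` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prune_outliers(pages, reference_text, threshold=0.2):
    """pages: dict url -> text. Keep only pages whose relevance weight
    against the reference text reaches the threshold."""
    reference = Counter(reference_text.lower().split())
    kept = {}
    for url, text in pages.items():
        weight = cosine(Counter(text.lower().split()), reference)
        if weight >= threshold:
            kept[url] = weight
    return kept
```

Pages that share no vocabulary with the reference text get weight 0 and are dropped before distillation runs on the remaining neighborhood.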

10
Topic distillation vs. Resource discovery

11
Topic distillation vs. Resource discovery

Query: "power suppl" "switch mode" smps
-multiprocessor "uninterrupt power suppl" ups
-parcel
The Yahoo! node: /BusinessEconomy/Companies/Electronics/PowerSupplies

12
Topic distillation vs. Resource discovery

13
Hypertext classification: learning from examples

Adding example pages and their distance-1
neighbors into the graph to be distilled will
improve the result
The contents of the given example and its
neighbors provide a way to compute the decision
boundary of classification
NN, Bayesian and support vector classifiers
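
Learning a decision boundary from example pages can be illustrated with one of the classifier families named above. This is a minimal Python sketch of a multinomial naive Bayes text classifier with Laplace smoothing; the function names and the toy training data are assumptions, and the slides do not say which variant the original work uses.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (text, label) pairs, e.g. example pages and
    their neighbors. Returns (word_counts, class_counts, vocabulary)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab

def classify_nb(model, text):
    """Return the label with the highest log posterior for the text."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Adding more example pages (and their distance-1 neighbors) to `examples` changes the word counts and therefore the boundary between the classes.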

14
Hypertext classification

Link-based features important
Circular topic influence
Topic of one page influences its text and its
neighbor pages' topics
Knowledge of the linked vicinity's topic provides
clues for the test document's topic
Bibliometric, more general than the simple linear
endorsement model used in topic distillation
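
The circular influence between a page's topic and its neighbors' topics can be sketched as an iterative relabeling. This is a minimal Python sketch and not the method from the slides: the function name, the linear mix controlled by `alpha`, and the fixed number of rounds are all hypothetical choices made to show how neighbor topics can override a weak text-only decision.

```python
def classify_with_neighbors(text_scores, neighbors, labels, rounds=3, alpha=0.5):
    """
    text_scores: page -> {topic: score from the page text alone}
    neighbors:   page -> list of linked pages
    labels:      page -> topic, for pages whose topic is already known
    """
    current = dict(labels)
    # Start each unlabeled page from its text-only best guess.
    for page, scores in text_scores.items():
        current.setdefault(page, max(scores, key=scores.get))
    for _ in range(rounds):
        updated = dict(current)
        for page, scores in text_scores.items():
            if page in labels:
                continue  # known labels stay fixed
            nbr_topics = [current[n] for n in neighbors.get(page, []) if n in current]
            combined = {}
            for topic, score in scores.items():
                # Fraction of neighbors currently assigned to this topic.
                vote = nbr_topics.count(topic) / len(nbr_topics) if nbr_topics else 0.0
                combined[topic] = alpha * score + (1 - alpha) * vote
            updated[page] = max(combined, key=combined.get)
        current = updated
    return current
```

A page whose text slightly favors one topic can be reassigned when most of its linked vicinity belongs to another, which is the kind of clue the slide refers to.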

15
Putting it together for resource discovery

16
Conclusion

Emphasized the importance of scalable automatic
resource discovery
Argued that common search engines are not
adequate for resource discovery
Introduced the recently invented focused crawling
system

17
Future Work