Information Retrieval (9)
Transcript and Presenter's Notes
1
Information Retrieval (9)
  • Prof. Dragomir R. Radev
  • radev@umich.edu

2
IR Winter 2010
14. Webometrics. The Bow-tie model.
3
Brief history of the Web
  • FTP/Gopher
  • WWW (1989)
  • Archie (1990)
  • Mosaic (1993)
  • Webcrawler (1994)
  • Lycos (1994)
  • Yahoo! (1994)
  • Google (1998)

4
Size
  • The Web is the largest repository of data, and it
    grows exponentially.
  • 320 million Web pages (Lawrence & Giles, 1998)
  • 800 million Web pages, 15 TB (Lawrence & Giles, 1999)
  • 20 billion Web pages indexed now
  • Amount of data: roughly 200 TB (Lyman et al., 2003)

5
Zipfian properties
  • In-degree
  • Out-degree
  • Visits to a page
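
These quantities follow heavy-tailed, Zipf-like (power-law)
distributions. A quick way to see this is to tally in-degrees from an
edge list and print a rank-frequency table, which should be roughly
linear in log-log space. A minimal sketch (the file name edges.txt and
its one-"src dst"-pair-per-line format are assumptions):

use strict;
use warnings;

# Tally in-degrees from a whitespace-separated edge list
# ("src dst" per line; file name and format are assumptions).
my %indeg;
open my $fh, '<', 'edges.txt' or die $!;
while (<$fh>) {
    my (undef, $dst) = split;
    $indeg{$dst}++;
}

# Rank pages by in-degree; on Zipfian data the last two columns
# (log rank, log degree) fall roughly on a straight line.
my $rank = 0;
for my $deg (sort { $b <=> $a } values %indeg) {
    $rank++;
    printf "%d\t%d\t%.3f\t%.3f\n", $rank, $deg, log($rank), log($deg);
}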

6
Bow-tie model of the Web
(Figure: bow-tie model of the Web: SCC 56M pages, IN 44M, OUT 44M,
TENDRILS 44M, DISC 17M; about 24% of pages are reachable from a
given page.)
Broder et al., WWW 2000; Dill et al., VLDB 2001
7
Measuring the size of the web
  • Using extrapolation methods
  • Random queries and their coverage by different
    search engines
  • Overlap between search engines
  • HTTP requests to random IP addresses

8
Bharat and Broder 1998
  • Based on crawls of HotBot, AltaVista, Excite, and
    InfoSeek
  • 10,000 queries in mid and late 1997
  • Estimate is 200M pages
  • Only 1.4% are indexed by all of them

9
Example (from Bharat & Broder)
A similar approach by Lawrence and Giles yields
320M pages (Lawrence and Giles 1998).
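
The arithmetic behind these overlap estimates is capture-recapture: if
engines A and B index pages independently, then |A ∩ B| / |B| ≈ |A| / N,
so N ≈ |A| · |B| / |A ∩ B|. A sketch with made-up index and overlap
sizes (the numbers below are illustrative, not from the papers):

use strict;
use warnings;

# Capture-recapture estimate of the indexable Web (sketch).
# Under independence, N ~= |A| * |B| / |A intersect B|.
my $size_a  = 100_000_000;  # pages indexed by engine A (illustrative)
my $size_b  = 120_000_000;  # pages indexed by engine B (illustrative)
my $overlap =  60_000_000;  # pages found in both, via sampled queries
printf "estimated Web size: %.0f pages\n",
       $size_a * $size_b / $overlap;   # prints 200000000
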
10
What makes Web IR different?
  • Much bigger
  • No fixed document collection
  • Users
  • Non-human users
  • Varied user base
  • Miscellaneous user needs
  • Dynamic content
  • Evolving content
  • Spam
  • Effectively infinite size: size is whatever can be
    indexed!

11
IR Winter 2010
15. Crawling the Web. Hypertext retrieval.
Web-based IR. Document closures.
Focused crawling.
12
Web crawling
  • The HTTP/HTML protocols
  • Following hyperlinks
  • Some problems:
  • Link extraction
  • Link normalization
  • Robot exclusion
  • Loops
  • Spider traps
  • Server overload
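
Link extraction and normalization are where crawlers most often go
wrong: relative links must be resolved against the page's base URL and
reduced to a canonical form before duplicate elimination. A minimal
sketch using the standard URI module (the example URLs are made up):

use strict;
use warnings;
use URI;

# Resolve relative hrefs against the base page and canonicalize them
# (lowercased scheme and host, default port dropped).
my $base = 'http://www.umich.edu/a/b/page.html';
for my $href ('../c/other.html', 'HTTP://WWW.UMICH.EDU:80/x') {
    my $abs = URI->new_abs($href, $base)->canonical;
    print "$href -> $abs\n";
}
# ../c/other.html -> http://www.umich.edu/a/c/other.html
# HTTP://WWW.UMICH.EDU:80/x -> http://www.umich.edu/x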

13
Example
  • U-M's root robots.txt file:
  • http://www.umich.edu/robots.txt
  • User-agent: *
  • Disallow: /websvcs/projects/
  • Disallow: /%7Ewebsvcs/projects/
  • Disallow: /homepage/
  • Disallow: /%7Ehomepage/
  • Disallow: /smartgl/
  • Disallow: /%7Esmartgl/
  • Disallow: /gateway/
  • Disallow: /%7Egateway/
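
In libwww-perl, checks against such a file are handled by
WWW::RobotRules: parse the robots.txt body once, then test each
candidate URL before fetching. A minimal sketch (the rules body below
is abbreviated to one of the slide's entries, and the bot name is made
up):

use strict;
use warnings;
use WWW::RobotRules;

my $rules = WWW::RobotRules->new('ExampleBot/1.0');  # bot name made up
$rules->parse('http://www.umich.edu/robots.txt', <<'ROBOTS');
User-agent: *
Disallow: /gateway/
ROBOTS

for my $url ('http://www.umich.edu/gateway/index.html',
             'http://www.umich.edu/courses/') {
    print $url, ' => ', $rules->allowed($url) ? "crawl" : "skip", "\n";
}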

14
Example crawler
  • E.g., poacher
  • http://search.cpan.org/~neilb/Robot-0.011/examples
    /poacher
  • Included in clairlib

15
ParseCommandLine();
Initialise();
$robot->run(@siteRoot);

# Initialise() - initialise global variables, contents, tables, etc.
# This function sets up various global variables such as the version
# number for WebAssay, the program name identifier, usage statement, etc.
sub Initialise
{
    $robot = new WWW::Robot(
        'NAME'      => $BOTNAME,
        'VERSION'   => $VERSION,
        'EMAIL'     => $EMAIL,
        'TRAVERSAL' => $TRAVERSAL,
        'VERBOSE'   => $VERBOSE,
    );
    $robot->addHook('follow-url-test',     \&follow_url_test);
    $robot->addHook('invoke-on-contents',  \&process_contents);
    $robot->addHook('invoke-on-get-error', \&process_get_error);
}

# follow_url_test() - tell the robot module whether it should follow a link
sub follow_url_test { ... }

# process_get_error() - hook function invoked whenever a GET fails
sub process_get_error { ... }

# process_contents() - process the contents of a URL we've retrieved
sub process_contents
{
    run_command($COMMAND, $filename) if defined $COMMAND;
}
16
Focused crawling
  • Topical locality
  • Pages that are linked tend to be similar in content
    (and vice versa; Davison 00, Menczer 02 and 04,
    Radev et al. 04)
  • The radius-1 hypothesis
  • Given that page i is relevant to a query and that
    page i points to page j, page j is also likely to
    be relevant (at least, more so than a random Web
    page)
  • Focused crawling
  • Keeping a priority queue of the most relevant pages
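
A focused crawler is then a best-first loop over that priority queue.
In this sketch, fetch_page() and extract_links() are hypothetical
stand-ins for the HTTP and HTML-parsing layers (a real crawler would
use, e.g., LWP::UserAgent and HTML::LinkExtor), and relevance() is a
toy query-term-overlap scorer, not an actual ranking function:

use strict;
use warnings;

my @query_terms = qw(information retrieval);   # toy query
my %seen;
my @queue = map { [1.0, $_] } @ARGV;           # [priority, url] seeds

while (@queue) {
    @queue = sort { $b->[0] <=> $a->[0] } @queue;  # crude priority queue
    my (undef, $url) = @{ shift @queue };
    next if $seen{$url}++;
    my $page  = fetch_page($url);
    my $score = relevance($page);
    # Radius-1 hypothesis: links on a relevant page inherit its score.
    push @queue, map  { [$score, $_] }
                 grep { !$seen{$_} } extract_links($page, $url);
}

sub relevance {    # fraction of query terms occurring in the page text
    my ($text) = @_;
    my $hits = grep { index(lc $text, $_) >= 0 } @query_terms;
    return $hits / @query_terms;
}

# Hypothetical stubs; replace with LWP::UserAgent / HTML::LinkExtor.
sub fetch_page    { return '' }
sub extract_links { return () }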

17
Challenges in indexing the web
  • Page importance varies a lot
  • Anchor text
  • User modeling
  • Detecting duplicates
  • Dealing with spam (content-based and link-based)

18
Duplicate detection
  • Shingles
  • TO BE OR
  • BE OR NOT
  • OR NOT TO
  • NOT TO BE
  • Then use the Jaccard coefficient (size of
    intersection / size of union) to determine
    similarity
  • Hashing
  • Shingling (separate lecture)
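
A minimal sketch of the shingle computation in pure Perl (word
3-shingles, no hashing; production systems hash the shingles, e.g.
with MinHash):

use strict;
use warnings;

sub shingles {     # set of 3-word shingles of a text
    my @w = split ' ', lc shift;
    my %s;
    $s{"@w[$_ .. $_ + 2]"} = 1 for 0 .. $#w - 2;
    return \%s;
}

sub jaccard {      # |intersection| / |union| of two shingle sets
    my ($x, $y) = @_;
    my $inter = grep { $y->{$_} } keys %$x;
    my %union = (%$x, %$y);
    return %union ? $inter / keys %union : 0;
}

printf "%.2f\n", jaccard(shingles('to be or not to be'),
                         shingles('to be or not to go'));  # 0.60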

19
Document closures for QA
(Figure: document closure for question answering: the query terms
"spain" and "capital" connect through L and P links to a page
containing "Madrid".)
20
Document closures for IR
(Figure: document closure for IR: the query terms "Michigan" and
"Physics" connect through L and P links to pages for the University
of Michigan and its Physics Department.)
21
The link-content hypothesis
  • Topical locality: a page is similar (?) to the page
    that points to it (?)
  • Davison (TFIDF, 100K pages)
  • 0.31 same domain
  • 0.23 linked pages
  • 0.19 sibling
  • 0.02 random
  • Menczer (373K pages, non-linear least squares
    fit)
  • Chakrabarti (focused crawling): probability of
    losing the topic

α1 = 1.8, α2 = 0.6
Van Rijsbergen 1979; Chakrabarti et al., WWW 1999;
Davison, SIGIR 2000; Menczer 2001
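
Numbers like Davison's come from comparing the term vectors of linked
page pairs. A minimal cosine-similarity sketch over raw term counts
(Davison used TFIDF weighting; the two sample strings are made up):

use strict;
use warnings;
use List::Util qw(sum);

sub tf {           # term-frequency vector of a text
    my %v;
    $v{$_}++ for split ' ', lc shift;
    return \%v;
}

sub cosine {       # cosine of the angle between two term vectors
    my ($x, $y) = @_;
    my $dot = sum(0, map { ($y->{$_} // 0) * $x->{$_} } keys %$x);
    my $nx  = sqrt(sum(0, map { $_**2 } values %$x));
    my $ny  = sqrt(sum(0, map { $_**2 } values %$y));
    return $nx && $ny ? $dot / ($nx * $ny) : 0;
}

printf "%.2f\n",
    cosine(tf('physics department university of michigan'),
           tf('the university of michigan in ann arbor'));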