Transcript and Presenter's Notes

Title: CS 430 / INFO 430 Information Retrieval


1
CS 430 / INFO 430 Information Retrieval
Lecture 15: Web Search 1
2
Course Administration
Assignment 2: grades have been sent out.
Assignment 3: has been posted.
The midterm has been graded.
If you have any questions, please come to office hours.
3
Web Search
Goal: Provide information discovery for large amounts of open access material on the web.

Challenges:
  • Volume of material -- several billion items, growing steadily
  • Items created dynamically or in databases
  • Great variety -- length, formats, quality control, purpose, etc.
  • Inexperience of users -- range of needs
  • Economic models to pay for the service
  • Mischievous Web sites
4
Strategies
  • Subject hierarchies: use of human indexing -- Yahoo! (original)
  • Web crawling and automatic indexing: general -- Infoseek, Lycos, AltaVista, Google, Yahoo! (current)
  • Mixed models: human-directed web crawling and automatic indexing -- iVia/NSDL
5
Components of Web Search Service
Components:
  • Web crawler
  • Indexing system
  • Search system
  • Advertising system

Considerations:
  • Economics
  • Scalability
  • Legal issues
6
Lectures and Classes
Discussion 6: Web crawling
Lecture 15: Web crawling
Discussion 7: Ranking Web documents
Lecture 16: Graphical methods of ranking
Lecture 17: Building a large-scale search engine
Discussion 8: File systems
Lecture 18: Advertising, spam, Web business
Lectures 23-25: User interface considerations
7
Web Searching Architecture
Documents stored on many Web servers are indexed in a single central index. (This is similar to a union catalog.) The central index is implemented as a single system on a very large number of computers.

[Diagram: the crawler retrieves Web pages; an index to all Web pages is built from them; searches run against that index.]

Examples: Google, Yahoo!
8
What is a Web Crawler?
Web crawler: A program for downloading web pages. Given an initial set of seed URLs, it recursively downloads every page that is linked from pages in the set. A focused web crawler downloads only those pages whose content satisfies some criterion. Also known as a web spider.
9
Simple Web Crawler Algorithm
Basic algorithm:
Let S be the set of URLs to pages waiting to be indexed. Initially S is the set of known seeds.
Take an element u of S and retrieve the page, p, that it references.
Parse the page p and extract the set of URLs L that it links to.
Update S: S <- (S + L) - u
Repeat as many times as necessary.
Large production crawlers may run continuously.
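The loop above maps almost directly onto code. Below is a minimal sketch in Python (not the lecture's own code), using only the standard library; politeness, robots.txt checks, and most error handling are omitted.

```python
from urllib.request import urlopen
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    S = list(seeds)                # URLs waiting to be indexed
    seen = set(seeds)              # URLs already scheduled
    while S and max_pages > 0:
        u = S.pop(0)               # take an element u of S
        try:
            p = urlopen(u, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue               # broken links, timeouts, ...
        parser = LinkExtractor()
        parser.feed(p)             # parse the page p
        L = {urljoin(u, href) for href in parser.links}  # extracted links, made absolute
        for link in L - seen:      # update S: S <- (S + L) - u
            seen.add(link)
            S.append(link)
        max_pages -= 1
```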
10
Not so Simple
  • Performance -- How do you crawl 10,000,000,000
    pages?
  • Politeness -- How do you avoid overloading
    servers?
  • Legal -- What if the owner of a page does not
    want the crawler to index it?
  • Failures -- Broken links, timeouts, spider
    traps.
  • Strategies -- How deep do we go? Depth first or
    breadth first?
  • Implementations -- How do we store and update S
    and the other data structures needed?

11
What to Retrieve
  • No web crawler retrieves everything
  • All crawlers retrieve
    • HTML (leaves and nodes in the tree)
    • ASCII clear text (only as leaves in the tree)
  • Most retrieve
    • PDF
    • PostScript
  • Importance metrics (Discussion Class 6)
    • Interest driven
    • Popularity driven
    • Location driven

12
Robots Exclusion
The Robots Exclusion Protocol: A Web site administrator can indicate which parts of the site should not be visited by a robot, by providing a specially formatted file on their site, at http://.../robots.txt.
The Robots META tag: A Web author can indicate whether a page may be indexed, or analyzed for links, through the use of a special HTML META tag.
See http://www.robotstxt.org/wc/exclusion.html
13
Robots Exclusion
Example file: /robots.txt

# Disallowed paths for all robots:
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/        # these will soon disappear
Disallow: /foo.html

# To allow Cybermapper:
User-agent: cybermapper
Disallow:
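As a usage illustration (not part of the lecture), Python's standard urllib.robotparser can read a robots.txt file like the one above and answer per-URL questions; the domain and the user-agent name below are made up.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()                          # fetch and parse the file

# "cs430-crawler" is an illustrative user-agent name
if rp.can_fetch("cs430-crawler", "https://www.example.com/cyberworld/map/"):
    print("allowed")
else:
    print("disallowed by robots.txt")
```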
14
Extracts from http://www.nytimes.com/robots.txt
# robots.txt, www.nytimes.com, 3/24/2005

User-agent: *
Disallow: /college
Disallow: /reuters
Disallow: /cnet
Disallow: /partners
Disallow: /archives
Disallow: /indexes
Disallow: /thestreet
Disallow: /nytimes-partners
Disallow: /financialtimes
Allow: /2004/
Allow: /2005/
Allow: /services/xml/

User-agent: Mediapartners-Google
Disallow:
15
The Robots META tag
The Robots META tag allows HTML authors to indicate to visiting robots whether a document may be indexed, or used to harvest more links. No server administrator action is required. Note that currently only a few robots implement this.
In this simple example:
  <meta name="robots" content="noindex, nofollow">
a robot should neither index this document, nor analyze it for links.
See http://www.robotstxt.org/wc/exclusion.html
16
High Performance Web Crawling
The web is growing fast. To crawl a billion pages a month, a crawler must download about 400 pages per second. Internal data structures must scale beyond the limits of main memory.
Politeness: A web crawler must not overload the servers that it is downloading from.
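The 400 pages/second figure is roughly a billion pages spread over the seconds in a month; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the rate quoted above
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 24 * 3600          # about 2.6 million seconds
print(pages_per_month / seconds_per_month)  # ~386 pages/second, i.e. roughly 400
```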
17
Example Mercator and Heritrix Crawlers
AltaVista was a research project and production Web search engine developed by Digital Equipment Corporation. Mercator was a high-performance crawler for production and research, developed by Allan Heydon, Marc Najork, Raymie Stata and colleagues at Compaq Systems Research Center (a continuation of the work of Digital's AltaVista group). Heritrix is a high-performance, open-source crawler developed by Raymie Stata and colleagues at the Internet Archive. (Stata is now at Yahoo!) Mercator and Heritrix are described together here, but there are major implementation differences.
18
(No Transcript)
19
Mercator/Heritrix Design Goals
Broad crawling: Large, high-bandwidth crawls to sample as much of the Web as possible given the time, bandwidth, and storage resources available.
Focused crawling: Small- to medium-sized crawls (usually less than 10 million unique documents) in which the quality criterion is complete coverage of selected sites or topics.
Continuous crawling: Crawls that revisit previously fetched pages, looking for changes and new pages, adapting the crawl rate based on parameters and estimated change frequencies.
Experimental crawling: Experiments with crawling techniques, such as the choice of what to crawl, crawl ordering, crawling using diverse protocols, and analysis and archiving of crawl results.
20
Mercator/Heritrix
Design parameters:
Extensible. Many components are plugins that can be rewritten for different tasks.
Distributed. A crawl can be distributed in a symmetric fashion across many machines.
Scalable. The size of in-memory data structures is bounded.
High performance. Performance is limited by the speed of the Internet connection (e.g., with a 160 Mbit/sec connection, about 50 million documents can be downloaded per day).
Polite. Options of weak or strong politeness.
Continuous. Supports continuous crawling.
21
Mercator/Heritrix Main Components
Scope: Determines what URIs are ruled into or out of a certain crawl. Includes the seed URIs used to start a crawl, plus the rules to determine which discovered URIs are also to be scheduled for download.
Frontier: Tracks which URIs are scheduled to be collected, and those that have already been collected. It is responsible for selecting the next URI to be tried, and prevents the redundant rescheduling of already-scheduled URIs.
Processor chains: Modular processors that perform specific, ordered actions on each URI in turn. These include fetching the URI, analyzing the returned results, and passing discovered URIs back to the Frontier.
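As a rough illustration of the processor-chain idea only (the stage names below are invented, not Heritrix's actual processor classes), each URI taken from the Frontier is passed through an ordered list of small stages that share a per-URI record:

```python
def fetch(record):
    # placeholder: download record["uri"] and store the response body
    record["body"] = "<html>...</html>"
    return record

def extract_links(record):
    # placeholder: parse record["body"] and collect outgoing URIs
    record["outlinks"] = []
    return record

def to_frontier(record):
    # placeholder: hand in-scope discovered URIs back to the Frontier
    for link in record["outlinks"]:
        print("schedule", link)
    return record

CHAIN = [fetch, extract_links, to_frontier]   # ordered processor chain

def process(uri):
    record = {"uri": uri}
    for stage in CHAIN:
        record = stage(record)                # each stage updates the shared record
```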
22
Mercator
Notation:
URL frontier: the set S of URLs to be crawled
DNS: Domain Name Service
RIS: Rewind Input Stream
23
Mercator/Heritrix Main Components
Crawling is carried out by multiple worker threads, e.g., 500 threads for a big crawl. The URL frontier stores the list of absolute URLs to download. The DNS resolver resolves domain names into IP addresses. Protocol modules download documents using the appropriate protocol (e.g., HTTP). The link extractor extracts URLs from pages and converts them to absolute URLs. The URL filter and duplicate URL eliminator determine which URLs to add to the frontier.
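A much-reduced sketch of the worker-thread model, with Python's thread-safe Queue standing in for the URL frontier (illustrative only; the fetch/extract/filter steps are left as a comment):

```python
import threading
import queue

frontier = queue.Queue()          # thread-safe stand-in for the URL frontier

def worker():
    while True:
        url = frontier.get()
        try:
            # fetch url, extract links, filter/deduplicate,
            # then frontier.put(...) the new URLs
            pass
        finally:
            frontier.task_done()

for _ in range(500):              # e.g., 500 threads for a big crawl
    threading.Thread(target=worker, daemon=True).start()
```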
24
Mercator/Heritrix The URL Frontier
A repository with two pluggable methods: add a URL, get a URL.
Most web crawlers use variations of breadth-first traversal, but ...
  • Most URLs on a web page are relative (about 80%).
  • A single FIFO queue, serving many threads, would send many simultaneous requests to a single server.
Weak politeness guarantee: Only one thread is allowed to contact a particular web server.
Stronger politeness guarantee: Maintain n FIFO queues, each for a single host, which feed the queues for the crawling threads according to rules based on priority and politeness factors.
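A simplified sketch of the stronger politeness scheme, under the assumption of one FIFO queue per host served round-robin; priority rules, crawl delays, and thread hand-off are omitted:

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteFrontier:
    """One FIFO queue per host, served round-robin."""
    def __init__(self):
        self.per_host = defaultdict(deque)   # host -> FIFO of URLs
        self.hosts = deque()                 # rotation of hosts with pending URLs

    def add(self, url):
        host = urlparse(url).netloc
        if not self.per_host[host]:
            self.hosts.append(host)          # host becomes eligible
        self.per_host[host].append(url)

    def get(self):
        host = self.hosts.popleft()          # next host in the rotation
        url = self.per_host[host].popleft()
        if self.per_host[host]:
            self.hosts.append(host)          # more URLs left for this host
        return url
```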
25
Building a Web Crawler: Links are not Easy to Extract and Record
Keeping track of the URLs that have been visited is a major component of a crawler (a small normalization sketch follows the list below).
  • Relative/Absolute
  • Many strings are equivalent and can refer to
    the same file
  • Mirrors duplicate Web sites or parts of Web
    sites
  • CGI parameters
  • Dynamic generation of pages
  • Server-side scripting
  • Server-side image maps
  • Links buried in scripting code
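A small sketch of reducing the many equivalent URL strings a page can contain to one canonical absolute form before recording them; the normalization rules shown are illustrative, not exhaustive:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def canonicalize(base, href):
    absolute = urljoin(base, href)                   # resolve relative links
    scheme, netloc, path, query, _ = urlsplit(absolute)
    netloc = netloc.lower()                          # host names are case-insensitive
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                         # drop the default port
    if not path:
        path = "/"                                   # empty path means the root
    return urlunsplit((scheme, netloc, path, query, ""))  # discard the fragment

# canonicalize("http://Example.COM/a/b.html", "../c.html#top")
#   -> "http://example.com/c.html"
```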

26
Mercator/Heritrix Duplicate URL Elimination
Duplicate URLs are not added to the URL frontier. This requires an efficient data structure to store the URL Set (all URLs that have been seen) and to check new URLs against it.
In memory: Represent each URL by an 8-byte checksum. Maintain an in-memory hash table of URLs. Requires 5 gigabytes for 1 billion URLs.
Disk based: A combination of a disk file and an in-memory cache, with batch updating to minimize disk head movement.
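A sketch of the in-memory variant, assuming the first 8 bytes of an MD5 digest stand in for Mercator's 8-byte checksum; only the fingerprints are stored:

```python
import hashlib

def fingerprint(url):
    # first 8 bytes of an MD5 digest as a stand-in for the 8-byte checksum
    return hashlib.md5(url.encode("utf-8")).digest()[:8]

seen = set()                      # in-memory URL Set holding fingerprints only

def url_seen(url):
    fp = fingerprint(url)
    if fp in seen:
        return True               # duplicate: do not add to the frontier
    seen.add(fp)
    return False
```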
27
Example Mercator's URL-Seen Test
  • All URLs that have been seen are stored in the URL Set.
  • Each URL is represented by a fixed-size checksum.
  • An in-memory cache with least-recently-used replacement answers most
    membership tests (sketched below).
  • Heydon and Najork reported in 1999 a cache hit rate of 75.7%, using an
    in-memory cache with 2^18 entries and a table of recently added URLs.
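A sketch of the cache idea: a bounded least-recently-used cache of fingerprints sits in front of the full (larger, slower) URL Set; the 2^18 capacity mirrors the figure quoted above, and the surrounding lookup logic is assumed rather than shown.

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache of URL fingerprints with least-recently-used eviction."""
    def __init__(self, capacity=2**18):
        self.capacity = capacity
        self.entries = OrderedDict()

    def __contains__(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as recently used
            return True
        return False                          # caller falls back to the full URL Set

    def add(self, key):
        self.entries[key] = True
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
```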

28
Mercator/Heritrix Domain Name Lookup
Resolving domain names to IP addresses is a major bottleneck of web crawlers. Approach:
  • Separate DNS resolver and cache on each crawling computer.
  • Create a multi-threaded version of the DNS code (BIND).
In Mercator, these changes reduced DNS look-up from 70% to 14% of each thread's elapsed time.
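A minimal sketch of per-crawler DNS caching; it shows only the cache-and-reuse idea and does not attempt to reproduce Mercator's multi-threaded resolver:

```python
import socket

_dns_cache = {}                   # host name -> IP address

def resolve(host):
    if host not in _dns_cache:
        # socket.gethostbyname blocks per call; Mercator instead used a
        # custom multi-threaded resolver
        _dns_cache[host] = socket.gethostbyname(host)
    return _dns_cache[host]
```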
29
(No Transcript)
30
Research Topics in Web Crawling
  • How frequently to crawl and what strategies to
    use.
  • Identification of anomalies and crawling traps.
  • Strategies for crawling based on the content of
    web pages (focused and selective crawling).
  • Duplicate detection.

31
Further Reading
Heritrix: http://crawler.archive.org/
Allan Heydon and Marc Najork, Mercator: A Scalable, Extensible Web Crawler. World Wide Web 2(4):219-229, December 1999. http://research.microsoft.com/najork/mercator.pdf