Design and Implementation of a High-Performance Distributed Web Crawler, by Vladislav Shkapenyuk and Torsten Suel (presentation transcript)

1
Design and Implementation of a High-Performance
Distributed Web Crawler
Vladislav Shkapenyuk, Torsten Suel
  • ??? ???
  • ???
  • moonpfe@realtime.ssu.ac.kr

2
Table of Contents
  • 1. Introduction
  • 1.1 Crawling Applications
  • 1.2 Basic Crawler Structure
  • 1.3 Requirements for a Crawler
  • 1.4 Content of this Paper

3
1. Introduction (1/2)
  • Web search technology
  • Crawling strategies, storage, indexing, ranking
    techniques, and the structural analysis of the
    web and the web graph
  • Highly efficient crawling systems are needed.
  • Explosion in the size of the WWW
  • Crawlers must download the hundreds of millions
    of web pages indexed by the major search engines.
  • Trade-offs: size vs. currency, quality vs. response time

4
1. Introduction (2/2)
  • A crawler for a large search engine has to
    address two issues.
  • 1. It has to have a good crawling strategy.
  • 2. It needs to have a highly optimized system
    architecture.
  • Must be able to download a large number of pages
    per second
  • e.g., the Mercator system of AltaVista
  • In this paper,
  • Describes the design and implementation of an
    optimized crawling system on a network of
    workstations.
  • Breadth-first crawl (sketch below)
  • The focus is on the I/O and network efficiency
    aspects of the system and on scalability issues.
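
Below is a minimal sketch, in Python, of the breadth-first crawl loop referred to above. It is illustrative only, not the authors' implementation; fetch_page and extract_links are hypothetical placeholders for the downloader and parser components discussed later.

  # Minimal breadth-first crawl loop (illustrative sketch only, not the
  # authors' implementation). fetch_page and extract_links are hypothetical
  # helpers standing in for the downloader and parser components.
  from collections import deque

  def breadth_first_crawl(seed_urls, max_pages):
      frontier = deque(seed_urls)      # FIFO queue gives breadth-first order
      seen = set(seed_urls)            # never enqueue the same URL twice
      pages = []
      while frontier and len(pages) < max_pages:
          url = frontier.popleft()
          html = fetch_page(url)       # hypothetical: download one page
          if html is None:
              continue
          pages.append((url, html))
          for link in extract_links(html, base=url):  # hypothetical parser
              if link not in seen:
                  seen.add(link)
                  frontier.append(link)
      return pages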

5
1.1 Crawling Applications (1/2)
  • Crawling strategies
  • Breadth-First Crawler
  • Start out at a small set of pages and then
    explore other pages by following links in a
    breadth-first-like fashion.
  • Recrawling Pages for Updates
  • After pages are initially acquired, they may have
    to be periodically recrawled and checked for updates.
  • Heuristics -> recrawl important pages, sites, and
    domains more frequently
  • Focused Crawling (sketch below)
  • Focus only on certain types of pages
  • Pages on a particular topic, images, MP3 files
  • The goal of a focused crawler is to find many
    pages of interest without using a lot of
    bandwidth.
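
The focused-crawling idea above can be sketched with a priority queue that orders the frontier by estimated topic relevance. This is only an illustration, not the paper's method; score_relevance is a hypothetical classifier, and fetch_page / extract_links are the same placeholders as before.

  # Focused-crawl sketch: order the frontier by estimated topic relevance.
  # score_relevance is a hypothetical classifier; fetch_page/extract_links
  # are the same hypothetical helpers as in the breadth-first sketch.
  import heapq

  def focused_crawl(seed_urls, topic, max_pages):
      frontier = [(0.0, url) for url in seed_urls]   # (negated score, URL)
      heapq.heapify(frontier)
      seen = set(seed_urls)
      relevant = []
      while frontier and len(relevant) < max_pages:
          _, url = heapq.heappop(frontier)
          html = fetch_page(url)
          if html is None:
              continue
          score = score_relevance(html, topic)       # hypothetical topic model
          if score > 0.5:
              relevant.append((url, score))
          for link in extract_links(html, base=url):
              if link not in seen:
                  seen.add(link)
                  # links found on relevant pages are explored first
                  heapq.heappush(frontier, (-score, link))
      return relevant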

6
1.1 Crawling Applications (2/2)
  • Random Walking and Sampling
  • Use random walks on the web graph to sample pages
    or estimate the size and quality of search
    engines.
  • Crawling the Hidden Web
  • Hidden Web
  • Dynamic pages
  • Can only be retrieved by posting appropriate queries
    and/or filling out forms on web pages.
  • Goal: automatic access to the Hidden Web
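
A small sketch of the random-walk sampling mentioned above, again with hypothetical helpers; it simply follows a uniformly chosen outlink at each step.

  # Random-walk sketch: repeatedly follow a randomly chosen outlink and record
  # the visited pages; such walks are one way to sample pages from the web
  # graph. fetch_page and extract_links are hypothetical, as above.
  import random

  def random_walk_sample(start_url, steps):
      samples = []
      url = start_url
      for _ in range(steps):
          html = fetch_page(url)
          if html is None:
              break
          samples.append(url)
          links = extract_links(html, base=url)
          if not links:
              break                    # dead end; a real walk might restart
          url = random.choice(links)   # jump to a uniformly chosen outlink
      return samples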

7
1.2 Basic Crawler Structure (1/2)
8
1.2 Basic Crawler Structure (2/2)
  • Two main components of a crawler (sketched below)
  • Crawling application
  • The crawling application decides what pages to
    request next given the current state and the
    previously crawled pages, and issues a stream of
    requests (URLs) to the crawling system.
  • Implements the crawling strategies
  • Crawling system
  • The crawling system downloads the requested pages
    and supplies them to the crawling application for
    analysis and storage.
  • Robot exclusion, speed control, DNS resolution
  • Both crawling system and application can be
    replicated.
  • For higher performance
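
The split between the two components can be illustrated as follows; the class and method names are assumptions for the sketch, not the paper's actual interfaces.

  # Sketch of the application/system split: the application decides what to
  # request next (the crawling strategy), the system only downloads what it is
  # asked to. Names are illustrative, not the paper's actual interfaces.
  class CrawlingSystem:
      def download(self, url):
          # in the real system: DNS resolution, robot exclusion, speed control
          return fetch_page(url)            # hypothetical downloader

  class CrawlingApplication:
      def __init__(self, seeds, system):
          self.frontier = list(seeds)       # URLs still to be requested
          self.seen = set(seeds)
          self.system = system

      def run(self, max_pages):
          crawled = 0
          while self.frontier and crawled < max_pages:
              url = self.frontier.pop(0)
              page = self.system.download(url)   # stream of requests to system
              if page is None:
                  continue
              crawled += 1                       # page analyzed/stored here
              for link in extract_links(page, base=url):
                  if link not in self.seen:      # application picks what's next
                      self.seen.add(link)
                      self.frontier.append(link)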

9
1.3 Requirements for a Crawler (1/2)
  • Flexibility
  • Use the system in a variety of scenarios, with
    few modifications
  • Low Cost and High Performance
  • Scale to several hundred pages per second and
    hundreds of millions of pages per run, and run on
    low cost hardware.
  • Robustness
  • Tolerate bad HTML, strange server behavior and
    configurations.
  • Tolerate crashes and network interruptions
    without losing the data crawled so far (see the
    checkpointing sketch below)
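
One common way to meet the crash-tolerance requirement is periodic checkpointing of the crawl state, sketched below; the file name and state layout are illustrative assumptions, not the paper's format.

  # Crash-tolerance sketch: periodically checkpoint the crawl state so a crawl
  # can resume after a crash or shutdown. File name and state layout are
  # illustrative assumptions, not the paper's format.
  import json, os

  def save_checkpoint(frontier, seen, path="crawl_state.json"):
      tmp = path + ".tmp"
      with open(tmp, "w") as f:
          json.dump({"frontier": list(frontier), "seen": list(seen)}, f)
      os.replace(tmp, path)    # atomic rename: no half-written checkpoint

  def load_checkpoint(path="crawl_state.json"):
      if not os.path.exists(path):
          return [], set()     # no checkpoint: start a fresh crawl
      with open(path) as f:
          state = json.load(f)
      return state["frontier"], set(state["seen"])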

10
1.3 Requirements for a Crawler (2/2)
  • Etiquette and Speed Control (see the politeness
    sketch below)
  • Robot exclusion (robots.txt and robots meta tags)
  • Avoid putting too much load on a single server
  • e.g., a 30-second interval between requests to
    the same server
  • Throttle the speed at the domain level
  • Manageability and Reconfigurability
  • An appropriate interface is needed to monitor the
    crawl.
  • The administrator should be able to control the
    crawl.
  • Adjust speed, add and remove components, shut
    down the system
  • After a crash or shutdown, we may want to
    continue the crawl using a different machine
    configuration.
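
A possible sketch of the etiquette and speed-control rules above, using Python's standard urllib.robotparser for robot exclusion and a per-host timestamp table for the 30-second interval; the user-agent name is a placeholder.

  # Politeness sketch: honor robots.txt and keep a minimum interval between
  # requests to the same host, using Python's standard urllib.robotparser.
  # The 30-second interval mirrors the rule mentioned on this slide; the
  # crawler name "examplebot" is a placeholder.
  import time
  from urllib.parse import urlparse
  from urllib import robotparser

  MIN_INTERVAL = 30.0                  # seconds between requests to one host
  last_hit = {}                        # host -> time of the previous request
  robots = {}                          # host -> parsed robots.txt

  def allowed(url, agent="examplebot"):
      host = urlparse(url).netloc
      if host not in robots:
          rp = robotparser.RobotFileParser()
          rp.set_url("http://" + host + "/robots.txt")
          rp.read()                    # fetch and parse robots.txt once per host
          robots[host] = rp
      return robots[host].can_fetch(agent, url)

  def wait_for_turn(url):
      host = urlparse(url).netloc
      wait = MIN_INTERVAL - (time.time() - last_hit.get(host, 0.0))
      if wait > 0:
          time.sleep(wait)
      last_hit[host] = time.time()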

11
1.4 Content of this Paper
  • Section 2 describes the architecture of our
    system and its major components.
  • Section 3 describes the data structures and
    algorithmic techniques that were used in more
    detail.
  • Section 4 presents preliminary experimental
    results.
  • Section 5 compares our design to that of other
    systems we know of.
  • Section 6 offers some concluding remarks.