Design and Implementation of a High-Performance Distributed Web Crawler, by Vladislav Shkapenyuk and Torsten Suel (presentation transcript)

1
Design and Implementation of a High-Performance
Distributed Web Crawler
Vladislav Shkapenyuk, Torsten Suel
  • ??? ???
  • ???
  • moonpfe@realtime.ssu.ac.kr

2
Table of Contents
  • 1. Introduction
  • 1.1 Crawling Applications
  • 1.2 Basic Crawler Structure
  • 1.3 Requirements for a Crawler
  • 1.4 Content of this Paper

3
1. Introduction (1/2)
  • Web search technology
  • Crawling strategies, storage, indexing, ranking
    techniques, and the structural analysis of the
    web and the web graph
  • Highly efficient crawling systems are needed.
  • Explosion in the size of the WWW
  • Crawlers must download the hundreds of millions
    of web pages indexed by the major search engines.
  • Trade-offs: size vs. currency, quality vs. response time

4
1. Introduction (2/2)
  • A crawler for a large search engine has to
    address two issues.
  • 1. It has to have a good crawling strategy.
  • 2. It needs to have a highly optimized system
    architecture.
  • Must be able to download a large number of pages
    per second
  • e.g., the Mercator system of AltaVista
  • In this paper,
  • Describes the design and implementation of an
    optimized crawling system on a network of
    workstations.
  • Breadth-first crawl (sketch below)
  • The focus is on the I/O and network efficiency
    aspects of the system and on scalability issues.
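
Below is a minimal sketch, in Python, of the breadth-first crawl loop referred to above. It is illustrative only, not the authors' implementation; fetch_page and extract_links are hypothetical placeholders for the downloader and parser components discussed later.

  # Minimal breadth-first crawl loop (illustrative sketch only, not the
  # authors' implementation). fetch_page and extract_links are hypothetical
  # helpers standing in for the downloader and parser components.
  from collections import deque

  def breadth_first_crawl(seed_urls, max_pages):
      frontier = deque(seed_urls)      # FIFO queue gives breadth-first order
      seen = set(seed_urls)            # never enqueue the same URL twice
      pages = []
      while frontier and len(pages) < max_pages:
          url = frontier.popleft()
          html = fetch_page(url)       # hypothetical: download one page
          if html is None:
              continue
          pages.append((url, html))
          for link in extract_links(html, base=url):  # hypothetical parser
              if link not in seen:
                  seen.add(link)
                  frontier.append(link)
      return pages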

5
1.1 Crawling Applications (1/2)
  • Crawling strategies
  • Breadth-First Crawler
  • Start out at a small set of pages and then
    explore other pages by following links in a
    breadth-first-like fashion.
  • Recrawling Pages for Updates
  • After pages are initially acquired, they may have
    to be periodically recrawled and checked for updates.
  • Heuristics -> recrawl important pages, sites, and
    domains more frequently
  • Focused Crawling (sketch below)
  • Focus only on certain types of pages
  • Pages on a particular topic, images, MP3 files
  • The goal of a focused crawler is to find many
    pages of interest without using a lot of
    bandwidth.
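
The focused-crawling idea above can be sketched with a priority queue that orders the frontier by estimated topic relevance. This is only an illustration, not the paper's method; score_relevance is a hypothetical classifier, and fetch_page / extract_links are the same placeholders as before.

  # Focused-crawl sketch: order the frontier by estimated topic relevance.
  # score_relevance is a hypothetical classifier; fetch_page/extract_links
  # are the same hypothetical helpers as in the breadth-first sketch.
  import heapq

  def focused_crawl(seed_urls, topic, max_pages):
      frontier = [(0.0, url) for url in seed_urls]   # (negated score, URL)
      heapq.heapify(frontier)
      seen = set(seed_urls)
      relevant = []
      while frontier and len(relevant) < max_pages:
          _, url = heapq.heappop(frontier)
          html = fetch_page(url)
          if html is None:
              continue
          score = score_relevance(html, topic)       # hypothetical topic model
          if score > 0.5:
              relevant.append((url, score))
          for link in extract_links(html, base=url):
              if link not in seen:
                  seen.add(link)
                  # links found on relevant pages are explored first
                  heapq.heappush(frontier, (-score, link))
      return relevant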

6
1.1 Crawling Applications (2/2)
  • Random Walking and Sampling
  • Use random walks on the web graph to sample pages
    or estimate the size and quality of search
    engines.
  • Crawling the Hidden Web
  • Hidden Web
  • Dynamic pages
  • Can only be retrieved by posting appropriate queries
    and/or filling out forms on web pages.
  • Goal: automatic access to the Hidden Web
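
A small sketch of the random-walk sampling mentioned above, again with hypothetical helpers; it simply follows a uniformly chosen outlink at each step.

  # Random-walk sketch: repeatedly follow a randomly chosen outlink and record
  # the visited pages; such walks are one way to sample pages from the web
  # graph. fetch_page and extract_links are hypothetical, as above.
  import random

  def random_walk_sample(start_url, steps):
      samples = []
      url = start_url
      for _ in range(steps):
          html = fetch_page(url)
          if html is None:
              break
          samples.append(url)
          links = extract_links(html, base=url)
          if not links:
              break                    # dead end; a real walk might restart
          url = random.choice(links)   # jump to a uniformly chosen outlink
      return samples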

7
1.2 Basic Crawler Structure (1/2)
8
1.2 Basic Crawler Structure (2/2)
  • Two main components of a crawler (sketched below)
  • Crawling application
  • The crawling application decides what pages to
    request next given the current state and the
    previously crawled pages, and issues a stream of
    requests (URLs) to the crawling system.
  • Implements the crawling strategies
  • Crawling system
  • The crawling system downloads the requested pages
    and supplies them to the crawling application for
    analysis and storage.
  • Robot exclusion, speed control, DNS resolution
  • Both crawling system and application can be
    replicated.
  • For higher performance
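
The split between the two components can be illustrated as follows; the class and method names are assumptions for the sketch, not the paper's actual interfaces.

  # Sketch of the application/system split: the application decides what to
  # request next (the crawling strategy), the system only downloads what it is
  # asked to. Names are illustrative, not the paper's actual interfaces.
  class CrawlingSystem:
      def download(self, url):
          # in the real system: DNS resolution, robot exclusion, speed control
          return fetch_page(url)            # hypothetical downloader

  class CrawlingApplication:
      def __init__(self, seeds, system):
          self.frontier = list(seeds)       # URLs still to be requested
          self.seen = set(seeds)
          self.system = system

      def run(self, max_pages):
          crawled = 0
          while self.frontier and crawled < max_pages:
              url = self.frontier.pop(0)
              page = self.system.download(url)   # stream of requests to system
              if page is None:
                  continue
              crawled += 1                       # page analyzed/stored here
              for link in extract_links(page, base=url):
                  if link not in self.seen:      # application picks what's next
                      self.seen.add(link)
                      self.frontier.append(link)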

9
1.3 Requirements for a Crawler (1/2)
  • Flexibility
  • Use the system in a variety of scenarios, with
    few modifications
  • Low Cost and High Performance
  • Scale to several hundred pages per second and
    hundreds of millions of pages per run, and run on
    low cost hardware.
  • Robustness
  • Tolerate bad HTML, strange server behavior and
    configurations.
  • Tolerate crashes and network interruptions
    without losing the data crawled so far (see the
    checkpointing sketch below)
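
One common way to meet the crash-tolerance requirement is periodic checkpointing of the crawl state, sketched below; the file name and state layout are illustrative assumptions, not the paper's format.

  # Crash-tolerance sketch: periodically checkpoint the crawl state so a crawl
  # can resume after a crash or shutdown. File name and state layout are
  # illustrative assumptions, not the paper's format.
  import json, os

  def save_checkpoint(frontier, seen, path="crawl_state.json"):
      tmp = path + ".tmp"
      with open(tmp, "w") as f:
          json.dump({"frontier": list(frontier), "seen": list(seen)}, f)
      os.replace(tmp, path)    # atomic rename: no half-written checkpoint

  def load_checkpoint(path="crawl_state.json"):
      if not os.path.exists(path):
          return [], set()     # no checkpoint: start a fresh crawl
      with open(path) as f:
          state = json.load(f)
      return state["frontier"], set(state["seen"])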

10
1.3 Requirements for a Crawler (2/2)
  • Etiquette and Speed Control (see the politeness
    sketch below)
  • Robot exclusion (robots.txt and robots meta tags)
  • Avoid putting too much load on a single server
  • e.g., a 30-second interval between requests to
    the same server
  • Throttle the speed at the domain level
  • Manageability and Reconfigurability
  • An appropriate interface is needed to monitor the
    crawl.
  • The administrator should be able to control the
    crawl.
  • Adjust speed, add and remove components, shut
    down the system
  • After a crash or shutdown, we may want to
    continue the crawl using a different machine
    configuration.
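
A possible sketch of the etiquette and speed-control rules above, using Python's standard urllib.robotparser for robot exclusion and a per-host timestamp table for the 30-second interval; the user-agent name is a placeholder.

  # Politeness sketch: honor robots.txt and keep a minimum interval between
  # requests to the same host, using Python's standard urllib.robotparser.
  # The 30-second interval mirrors the rule mentioned on this slide; the
  # crawler name "examplebot" is a placeholder.
  import time
  from urllib.parse import urlparse
  from urllib import robotparser

  MIN_INTERVAL = 30.0                  # seconds between requests to one host
  last_hit = {}                        # host -> time of the previous request
  robots = {}                          # host -> parsed robots.txt

  def allowed(url, agent="examplebot"):
      host = urlparse(url).netloc
      if host not in robots:
          rp = robotparser.RobotFileParser()
          rp.set_url("http://" + host + "/robots.txt")
          rp.read()                    # fetch and parse robots.txt once per host
          robots[host] = rp
      return robots[host].can_fetch(agent, url)

  def wait_for_turn(url):
      host = urlparse(url).netloc
      wait = MIN_INTERVAL - (time.time() - last_hit.get(host, 0.0))
      if wait > 0:
          time.sleep(wait)
      last_hit[host] = time.time()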

11
1.4 Content of this Paper
  • Section 2 describes the architecture of our
    system and its major components.
  • Section 3 describes the data structures and
    algorithmic techniques that were used in more
    detail.
  • Section 4 presents preliminary experimental
    results.
  • Section 5 compares our design to that of other
    systems we know of.
  • Section 6 offers some concluding remarks.