Design and Implementation of a High-Performance Distributed Web Crawler - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
Design and Implementation of a High-Performance
Distributed Web Crawler
  • Vladislav Shkapenyuk, Torsten Suel

2
Introduction
  • Presented by Kalyan Boggavarapu, graduate
    student, Lehigh University.
  • A brief introduction to the design and
    techniques of a high-performance web crawler.

3
Outline
  • Definitions
  • Small Crawler
  • Components
  • URL handling
  • Results

4
Definitions
  • Crawler: a program that visits remote sites and
    automatically downloads their contents for
    indexing.
  • A good crawler needs both a good crawling
    strategy and an efficient implementation; this
    work addresses the efficiency side.
5
Why High Performance ?
  • Recent work
  • Reduces the number of pages downloaded
    (e.g., focused crawlers).
  • Maximizes the benefit per downloaded page.
  • Our goal
  • Maximize the number of pages downloaded per
    second.

6
A Small Crawler Configuration
7
  • Crawler application: prepares the list of URLs
    to be crawled.
  • Crawler system: downloads the pages.

8
(No Transcript)
9
Components
  • The crawler system contains
  • Crawler Manager
  • Controls crawl speed and robot exclusion.
  • Downloaders
  • A high-performance asynchronous HTTP client
    capable of downloading hundreds of web pages in
    parallel.
  • DNS Resolvers
  • An optimized stub DNS resolver that forwards
    queries to local DNS servers.
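The downloaders' asynchronous design can be sketched with Python's asyncio. This is a minimal illustration of the concurrency pattern only, not the authors' code: the HTTP fetch is simulated with a short sleep so the sketch runs without network access, and all names here are hypothetical.

```python
import asyncio

async def fetch(url, sem):
    # Stand-in for a real HTTP GET; simulated with a short sleep
    # so the sketch runs offline.
    async with sem:
        await asyncio.sleep(0.01)
        return (url, 200)

async def crawl(urls, max_conns=1000):
    # Cap the number of simultaneous connections; the deck's
    # downloaders keep on the order of 1000 connections open at once.
    sem = asyncio.Semaphore(max_conns)
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

# Example: fetch 50 (simulated) pages, at most 10 at a time.
results = asyncio.run(
    crawl([f"http://host{i}/" for i in range(50)], max_conns=10))
```

The semaphore bounds concurrency while `gather` keeps all pending fetches in flight, which is the essence of an asynchronous client downloading hundreds of pages in parallel.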

10
Crawler Manager (implemented in C)
  • Background tasks: re-order URLs according to
    priority; take snapshots of the data structures.
  • Maintain a time interval of at least 30 sec
    between contacts to the same server.
  • Get IP addresses from the DNS resolvers.
  • Check the robots files: use the internally
    stored copy, or fetch one from the server, then
    exclude the disallowed URLs.
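The 30-second politeness rule can be sketched as a per-host schedule. The class and method names below are hypothetical, not from the talk; timestamps are passed in explicitly so the sketch is deterministic.

```python
import heapq
from urllib.parse import urlparse

MIN_INTERVAL = 30.0  # seconds between contacts to the same server

class PoliteScheduler:
    """Release URLs only once their host's 30 s interval has elapsed."""
    def __init__(self):
        self.next_ok = {}  # host -> earliest allowed contact time
        self.queue = []    # (ready_time, seq, url) min-heap
        self.seq = 0       # tie-breaker preserving insertion order

    def add(self, url, now):
        host = urlparse(url).netloc
        ready = max(now, self.next_ok.get(host, now))
        self.next_ok[host] = ready + MIN_INTERVAL
        heapq.heappush(self.queue, (ready, self.seq, url))
        self.seq += 1

    def pop_ready(self, now):
        # Return every URL whose scheduled time has arrived.
        out = []
        while self.queue and self.queue[0][0] <= now:
            out.append(heapq.heappop(self.queue)[2])
        return out
```

A second URL for the same host is automatically pushed 30 seconds past the first, while URLs for other hosts remain immediately available.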
11
Downloaders (implemented in Python)
  • Read the list of URLs from the Crawler Manager.
  • Keep up to 1000 connections to servers open.
  • Download the pages and write them to disk.
12
DNS Resolvers (implemented in C)
  • Problem: the standard DNS interface is
    synchronous, i.e., it handles one query at a
    time, which would be slow.
  • Solution: an asynchronous mode was implemented.
  • Is download speed limited by DNS lookup speed?
    Not in our case; we were limited by bandwidth.
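The gain from overlapping lookups instead of issuing one query at a time can be sketched with a thread pool around a blocking resolver call. `resolve_sync` here is a stand-in stub simulated with a sleep (not the authors' resolver), so the sketch runs offline; all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def resolve_sync(host):
    # Stand-in for a blocking stub-resolver call
    # (e.g. a synchronous gethostbyname); simulated with a sleep.
    time.sleep(0.05)
    return (host, "10.0.0.1")

def resolve_many(hosts, workers=64):
    # Overlap many blocking lookups instead of resolving
    # hosts strictly one after another.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(resolve_sync, hosts))
```

With 20 hosts and 20 workers this finishes in roughly one lookup's latency, whereas the synchronous loop would take 20 times as long.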
13
URL Handling
  • Each URL takes about 10 bytes.
  • Parsing and normalizing.
  • Check for already-seen URLs.
  • Data structure on disk, plus red-black trees in
    memory.
  • Hourly update, by merging in the new URLs.
  • Does this URL search slow down the overall crawl
    speed? No; new URLs are not used immediately,
    but only after some hours.
  • Why do we search for previously seen URLs?
    1) We do not want to download pages already
    downloaded; 2) we do not want to store pages
    already stored.
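The seen-URL structure can be sketched as a sorted "on-disk" list plus an in-memory set of recent URLs that is merged in periodically. The talk uses red-black trees in memory; Python's standard library has none, so a set stands in here, and the class is a hypothetical illustration, not the authors' data structure.

```python
import bisect

class SeenURLs:
    def __init__(self):
        self.on_disk = []    # sorted list standing in for the disk file
        self.recent = set()  # newly seen URLs awaiting the hourly merge

    def check_and_add(self, url):
        """Return True if the URL was already seen, else record it."""
        i = bisect.bisect_left(self.on_disk, url)
        if i < len(self.on_disk) and self.on_disk[i] == url:
            return True
        if url in self.recent:
            return True
        self.recent.add(url)
        return False

    def merge(self):
        # The hourly update: fold recent URLs into the sorted structure.
        self.on_disk = sorted(set(self.on_disk) | self.recent)
        self.recent.clear()
```

Lookups stay fast because the sorted structure is searched in logarithmic time, and writes stay cheap because new URLs accumulate in memory until the batch merge.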
14
Results
  • 120 M pages downloaded from 5 M hosts.
  • Time taken: 18 days.
  • Connection: T3.
  • The graph shows incoming bytes over time;
    crawler backups appear as spikes down to zero.
  • Speed
  • Maximum: 300 pages/sec.
  • Average: 140 pages/sec.
  • Limited by the router and by bandwidth.
  • Future work: study the scalability of the
    system.