Title: Design and Implementation of a High-Performance Distributed Web Crawler
1. Design and Implementation of a High-Performance Distributed Web Crawler
- Vladislav Shkapenyuk, Torsten Suel
2. Introduction
- Presented by Kalyan Boggavarapu, graduate student, Lehigh University.
- A brief introduction to the design and techniques of a high-performance web crawler.
3. Outline
- Definitions
- Small Crawler
- Components
- URL handling
- Results
4. Definitions
- Crawler: a program that visits remote sites and automatically downloads their contents for indexing.
- A good crawler needs both a good crawling strategy and an efficient implementation; this work addresses the efficiency side.
5. Why High Performance?
- Recent work: reduce the number of pages downloaded (e.g., focused crawlers), or maximize the benefit per downloaded page.
- Our goal: maximize the number of pages downloaded per second.
6. A Small Crawler Configuration
7.
- Crawler application: prepares the list of URLs to be crawled.
- Crawler system: downloads the pages.
8. (No transcript; figure-only slide.)
9. Components
- The crawler system contains:
- Crawler manager: controls crawl speed and robots exclusion.
- Downloaders: a high-performance asynchronous HTTP client capable of downloading hundreds of web pages in parallel.
- DNS resolvers: an optimized stub DNS resolver that forwards queries to local DNS servers.
10. Crawler Manager (C)
- Works in the background and foreground: re-orders requests according to priority and takes snapshots of its data structures.
- Maintains a time interval of at least 30 seconds between contacts to the same server.
- Gets IP addresses from the DNS resolvers.
- Checks the robots files: either internally stored copies or fetched from the servers.
- Excludes the disallowed URLs.
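The manager's politeness and robots logic above can be sketched in Python (the real manager is written in C); `ManagerSketch`, the injectable clock, and the exact interval handling are illustrative assumptions, not the paper's code:

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit

HOST_INTERVAL = 30.0  # minimum seconds between contacts to one server


class ManagerSketch:
    """Toy manager logic: robots exclusion plus per-host politeness."""

    def __init__(self, now=time.monotonic):
        self.now = now            # clock is injectable so tests need no sleeps
        self.last_contact = {}    # host -> timestamp of last request
        self.robots = {}          # host -> parsed robots.txt

    def load_robots(self, host, robots_txt):
        # The real manager stores robots files internally or fetches them
        # from the servers; here the caller supplies the text directly.
        rp = urllib.robotparser.RobotFileParser()
        rp.parse(robots_txt.splitlines())
        self.robots[host] = rp

    def may_fetch(self, url):
        host = urlsplit(url).hostname
        rp = self.robots.get(host)
        if rp is not None and not rp.can_fetch("*", url):
            return False          # excluded by robots.txt
        last = self.last_contact.get(host)
        if last is not None and self.now() - last < HOST_INTERVAL:
            return False          # too soon: respect the 30 s interval
        self.last_contact[host] = self.now()
        return True
```

`urllib.robotparser` is the standard-library robots.txt parser; the 30-second check simply remembers the last contact time per host.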
11. Downloaders (Python)
- Read the list of URLs from the crawler manager.
- Maintain up to 1000 simultaneous connections to servers.
- Download the pages and write them to disk.
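A minimal sketch of this bounded-concurrency download loop, using asyncio; the original downloaders predate asyncio, and `fetch` is injected here in place of a real HTTP client, so this illustrates only the concurrency cap:

```python
import asyncio

MAX_CONNECTIONS = 1000  # the slides mention roughly 1000 parallel connections


async def download_all(urls, fetch, limit=MAX_CONNECTIONS):
    """Download many pages concurrently, never exceeding `limit` connections.

    `fetch` is an async callable url -> bytes; a real downloader would issue
    HTTP GETs here, but it is injected so the sketch stays self-contained.
    """
    sem = asyncio.Semaphore(limit)
    results = {}

    async def worker(url):
        async with sem:       # cap the number of simultaneous connections
            results[url] = await fetch(url)

    await asyncio.gather(*(worker(u) for u in urls))
    return results
```

The semaphore is what keeps the downloader at or below its connection budget while still overlapping all the slow network waits.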
12. DNS Resolvers (C)
- Problem: the standard DNS interface is synchronous, i.e. it handles one query at a time; that would be slow.
- Solution: an asynchronous resolver was implemented.
- Is download speed limited by DNS lookup speed? Not in our case: we had limited bandwidth.
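One common way to get asynchronous-style resolution out of a one-query-at-a-time stub API is to overlap lookups in a worker pool; this Python sketch (the paper's resolver is C, and the injectable `resolver` argument is an assumption for testability) shows the idea:

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_many(hosts, resolver=socket.gethostbyname, workers=32):
    """Resolve many host names in parallel instead of one at a time.

    A stub resolver answering one query at a time would stall the crawl;
    overlapping many lookups hides DNS latency. `resolver` is injectable
    so the sketch can run without touching the network.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(hosts, pool.map(resolver, hosts)))
```

With the default `socket.gethostbyname`, each worker thread forwards its query to the local DNS server, so many lookups are in flight at once.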
13. URL Handling
- 1 URL ≈ 10 B.
- Steps: parsing, normalizing, checking against already-seen URLs.
- The seen-URL data structure lives on disk and is updated hourly by merging in the new URLs; recent URLs are held in memory in red-black trees.
- Does this URL searching slow down the overall crawling speed? No: new URLs are not used immediately, only after some hours, so lookups can be batched.
- Why do we search for previously seen URLs? 1) We do not want to download what is already downloaded. 2) We do not want to store what is already stored.
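The normalize-check-merge flow above can be sketched as follows; the class name, the use of a plain set for the in-memory side (the paper uses red-black trees), and the sorted list standing in for the on-disk structure are all simplifying assumptions:

```python
import bisect
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so duplicates compare equal:
    lowercase scheme and host, default path, drop the fragment."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.port:
        host = "%s:%d" % (host, parts.port)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


class SeenURLs:
    """Seen-URL check with batched merging: recently seen URLs sit in
    memory and are folded into the sorted on-disk structure periodically
    (the 'hourly update' of the slides)."""

    def __init__(self):
        self.on_disk = []     # sorted list, stand-in for the disk structure
        self.batch = set()    # recent URLs, merged out "hourly"

    def add_if_new(self, url):
        url = normalize(url)
        if url in self.batch or self._on_disk(url):
            return False      # already downloaded or already queued
        self.batch.add(url)
        return True

    def _on_disk(self, url):
        i = bisect.bisect_left(self.on_disk, url)
        return i < len(self.on_disk) and self.on_disk[i] == url

    def merge(self):
        # the hourly update: merge the new URLs into the disk structure
        self.on_disk = sorted(set(self.on_disk) | self.batch)
        self.batch = set()
```

Because a sorted on-disk set only needs a binary search per lookup and one bulk merge per hour, the seen-URL check stays off the critical path of downloading.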
14. Results
- 120 M pages crawled from 5 M hosts.
- Time taken: 18 days.
- Connection: T3.
- The graph shows incoming bytes over time; crawler downtime appears as drops to zero.
- Speed: max 300 pages/sec, average 140 pages/sec.
- Limited by the router and by available bandwidth.
- Future work: study the scalability of the system.