Title: Design and Implementation of a High-Performance Distributed Web Crawler
1. Design and Implementation of a High-Performance Distributed Web Crawler
- Vladislav Shkapenyuk, Torsten Suel
2. Introduction
- Presented by Kalyan Boggavarapu, graduate student, Lehigh University.
- A brief introduction to the design and techniques of a high-performance web crawler.
3. Outline
- Definitions
- Small Crawler
- Components
- URL handling
- Results
4. Definitions
- Crawler: a program that visits remote sites and automatically downloads their contents for indexing.
- A good crawler needs both a good crawling strategy and an efficient implementation; this work addresses the efficiency side.
5. Why High Performance?
- Recent work: reduce the number of pages downloaded (e.g., focused crawlers), or maximize the benefit per downloaded page.
- Our goal: maximize the number of pages downloaded per second.
6. A Small Crawler Configuration
7.
- Crawler application: prepares the list of URLs to be crawled.
- Crawler system: downloads the pages.
8. (No transcript; figure-only slide.)
9. Components
- The crawler system contains:
- Crawler manager: controls crawl speed and robots exclusion.
- Downloaders: a high-performance asynchronous HTTP client capable of downloading hundreds of web pages in parallel.
- DNS resolvers: an optimized stub DNS resolver that forwards queries to local DNS servers.
10. Crawler Manager (C)
- Works in the background and foreground: re-orders requests according to priority and takes snapshots of its data structures.
- Maintains a time interval of at least 30 seconds between contacts to the same server.
- Gets IP addresses from the DNS resolvers.
- Checks the robots files: either internally stored copies or fetched from the servers.
- Excludes the disallowed URLs.
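The manager's politeness and robots logic above can be sketched in Python (the real manager is written in C); `ManagerSketch`, the injectable clock, and the exact interval handling are illustrative assumptions, not the paper's code:

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit

HOST_INTERVAL = 30.0  # minimum seconds between contacts to one server


class ManagerSketch:
    """Toy manager logic: robots exclusion plus per-host politeness."""

    def __init__(self, now=time.monotonic):
        self.now = now            # clock is injectable so tests need no sleeps
        self.last_contact = {}    # host -> timestamp of last request
        self.robots = {}          # host -> parsed robots.txt

    def load_robots(self, host, robots_txt):
        # The real manager stores robots files internally or fetches them
        # from the servers; here the caller supplies the text directly.
        rp = urllib.robotparser.RobotFileParser()
        rp.parse(robots_txt.splitlines())
        self.robots[host] = rp

    def may_fetch(self, url):
        host = urlsplit(url).hostname
        rp = self.robots.get(host)
        if rp is not None and not rp.can_fetch("*", url):
            return False          # excluded by robots.txt
        last = self.last_contact.get(host)
        if last is not None and self.now() - last < HOST_INTERVAL:
            return False          # too soon: respect the 30 s interval
        self.last_contact[host] = self.now()
        return True
```

`urllib.robotparser` is the standard-library robots.txt parser; the 30-second check simply remembers the last contact time per host.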
11. Downloaders (Python)
- Read the list of URLs from the crawler manager.
- Maintain up to 1000 simultaneous connections to servers.
- Download the pages and write them to disk.
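A minimal sketch of this bounded-concurrency download loop, using asyncio; the original downloaders predate asyncio, and `fetch` is injected here in place of a real HTTP client, so this illustrates only the concurrency cap:

```python
import asyncio

MAX_CONNECTIONS = 1000  # the slides mention roughly 1000 parallel connections


async def download_all(urls, fetch, limit=MAX_CONNECTIONS):
    """Download many pages concurrently, never exceeding `limit` connections.

    `fetch` is an async callable url -> bytes; a real downloader would issue
    HTTP GETs here, but it is injected so the sketch stays self-contained.
    """
    sem = asyncio.Semaphore(limit)
    results = {}

    async def worker(url):
        async with sem:       # cap the number of simultaneous connections
            results[url] = await fetch(url)

    await asyncio.gather(*(worker(u) for u in urls))
    return results
```

The semaphore is what keeps the downloader at or below its connection budget while still overlapping all the slow network waits.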
12. DNS Resolvers (C)
- Problem: the standard DNS interface is synchronous, i.e. it handles one query at a time; that would be slow.
- Solution: an asynchronous resolver was implemented.
- Is download speed limited by DNS lookup speed? Not in our case: we had limited bandwidth.
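One common way to get asynchronous-style resolution out of a one-query-at-a-time stub API is to overlap lookups in a worker pool; this Python sketch (the paper's resolver is C, and the injectable `resolver` argument is an assumption for testability) shows the idea:

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_many(hosts, resolver=socket.gethostbyname, workers=32):
    """Resolve many host names in parallel instead of one at a time.

    A stub resolver answering one query at a time would stall the crawl;
    overlapping many lookups hides DNS latency. `resolver` is injectable
    so the sketch can run without touching the network.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(hosts, pool.map(resolver, hosts)))
```

With the default `socket.gethostbyname`, each worker thread forwards its query to the local DNS server, so many lookups are in flight at once.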
13. URL Handling
- 1 URL ≈ 10 B.
- Steps: parsing, normalizing, checking against already-seen URLs.
- The seen-URL data structure lives on disk and is updated hourly by merging in the new URLs; recent URLs are held in memory in red-black trees.
- Does this URL searching slow down the overall crawling speed? No: new URLs are not used immediately, only after some hours, so lookups can be batched.
- Why do we search for previously seen URLs? 1) We do not want to download what is already downloaded. 2) We do not want to store what is already stored.
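The normalize-check-merge flow above can be sketched as follows; the class name, the use of a plain set for the in-memory side (the paper uses red-black trees), and the sorted list standing in for the on-disk structure are all simplifying assumptions:

```python
import bisect
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so duplicates compare equal:
    lowercase scheme and host, default path, drop the fragment."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.port:
        host = "%s:%d" % (host, parts.port)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


class SeenURLs:
    """Seen-URL check with batched merging: recently seen URLs sit in
    memory and are folded into the sorted on-disk structure periodically
    (the 'hourly update' of the slides)."""

    def __init__(self):
        self.on_disk = []     # sorted list, stand-in for the disk structure
        self.batch = set()    # recent URLs, merged out "hourly"

    def add_if_new(self, url):
        url = normalize(url)
        if url in self.batch or self._on_disk(url):
            return False      # already downloaded or already queued
        self.batch.add(url)
        return True

    def _on_disk(self, url):
        i = bisect.bisect_left(self.on_disk, url)
        return i < len(self.on_disk) and self.on_disk[i] == url

    def merge(self):
        # the hourly update: merge the new URLs into the disk structure
        self.on_disk = sorted(set(self.on_disk) | self.batch)
        self.batch = set()
```

Because a sorted on-disk set only needs a binary search per lookup and one bulk merge per hour, the seen-URL check stays off the critical path of downloading.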
14. Results
- 120 M pages crawled from 5 M hosts.
- Time taken: 18 days.
- Connection: T3.
- The graph shows incoming bytes over time; crawler downtime appears as drops to zero.
- Speed: max 300 pages/sec, average 140 pages/sec.
- Limited by the router and by available bandwidth.
- Future work: study the scalability of the system.