1
Design and Implementation of a High-Performance
Distributed Web Crawler
  • Polytechnic University
  • Vladislav Shkapenyuk, Torsten Suel
  • 06/13/2006

2
Contents
  • 3. Implementation Details and Algorithmic
    Techniques
  • 3.4 Crawl Manager Data Structure
  • 3.5 Scheduling Policy and Manager Performance
  • 4. Experimental Results and Experiences
  • 4.1 Results of a Large Crawl
  • 4.2 Network Limits and Speed Control
  • 4.3 System Performance and Configuration
  • 4.4 Other Experiences
  • 5. Comparison with Other Systems
  • 6. Conclusions and Future Work

3
Manager Data Structures
4
3.4 Crawl Manager Data Structure
  • The crawl manager maintains a number of data
    structures for scheduling the requests on the
    downloaders
  • FIFO request queue
  • A list of request files, each containing a few hundred or a few thousand URLs
  • The files are located on disk and are not immediately loaded
  • They stay on disk as long as possible (see the sketch below)
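A minimal Python sketch of this idea, assuming request files are plain text files with one URL per line (the class name and file format here are illustrative, not taken from the paper):

```python
from collections import deque

class RequestFileQueue:
    """FIFO of request files whose URLs stay on disk until they are needed."""

    def __init__(self):
        self.files = deque()          # paths of request files, in arrival order

    def enqueue_file(self, path):
        # Only the path is kept in memory; the URLs themselves remain on disk.
        self.files.append(path)

    def load_next_file(self):
        # Load one request file into memory, only when the manager needs more URLs.
        if not self.files:
            return []
        path = self.files.popleft()
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]
```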

5
3.4 Crawl Manager Data Structure (2)
  • There are a number of FIFO host queues containing
    URLs, organized by hostname
  • Maintained in Berkeley DB using a single B-tree (hostname as the key)
  • Once a host has been selected for download, we take the first entry in the corresponding host queue and send it to a downloader (see the sketch below)
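A sketch of the per-host FIFO queues, with a plain Python dict standing in for the single Berkeley DB B-tree keyed by hostname (the function names are illustrative):

```python
from collections import deque
from urllib.parse import urlparse

# One FIFO of URLs per hostname; a dict stands in for the B-tree here.
host_queues = {}

def enqueue_url(url):
    host = urlparse(url).hostname
    host_queues.setdefault(host, deque()).append(url)

def next_url_for(host):
    # Take the first entry in the selected host's queue, to be sent to a downloader.
    queue = host_queues.get(host)
    return queue.popleft() if queue else None
```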

6
3.4 Crawl Manager Data Structure (3)
  • Three different host structures
  • Host dictionary
  • An entry for each host
  • A pointer to the corresponding host queue
  • Ready queue
  • Pointers to those hosts that are ready for download
  • Waiting queue
  • Pointers to those hosts that have recently been accessed and are now waiting for 30 seconds (see the sketch below)
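A minimal sketch of these three structures, using Python's heapq for the two priority queues (the names and in-memory representation are illustrative; the paper keeps these structures in Berkeley DB):

```python
import heapq

# Host dictionary: hostname -> per-host state, including its URL queue.
# Queue entries are (request_number, url) pairs in this sketch.
host_dict = {}

# Ready queue: min-heap of (request number of the first URL in the host queue, hostname).
ready_queue = []

# Waiting queue: min-heap of (time at which the host may be contacted again, hostname).
waiting_queue = []

def mark_ready(host, first_request_number):
    heapq.heappush(ready_queue, (first_request_number, host))

def mark_waiting(host, ready_again_at):
    heapq.heappush(waiting_queue, (ready_again_at, host))
```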

7
3.4 Crawl Manager Data Structure (4)
  • Each host pointer in the ready queue has as its
    key value the request number of the first URL in
    the corresponding host queue
  • Select the URL with the lowest request number among all URLs that are ready to be downloaded, and send it to the downloader
  • After the page has been downloaded, a pointer to the host is inserted into the waiting queue, tagged with the time at which the host becomes ready again
  • Once a host's waiting time has passed, it is transferred back into the ready queue (see the sketch below)
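A sketch of this scheduling step, reusing the heap-based structures from the previous sketch; the 30-second interval is from the paper, while the downloader callback and data layout are stand-ins:

```python
import heapq
import time

WAIT_SECONDS = 30   # politeness interval between accesses to the same host

def schedule_one(host_dict, ready_queue, waiting_queue, send_to_downloader):
    # Move hosts whose waiting time has passed back into the ready queue.
    now = time.time()
    while waiting_queue and waiting_queue[0][0] <= now:
        _, host = heapq.heappop(waiting_queue)
        queue = host_dict.get(host, {}).get("queue", [])
        if queue:                                   # hosts with empty queues are dropped
            heapq.heappush(ready_queue, (queue[0][0], host))

    if not ready_queue:
        return

    # Pick the ready host whose first URL has the lowest request number.
    _, host = heapq.heappop(ready_queue)
    _, url = host_dict[host]["queue"].pop(0)
    send_to_downloader(url)

    # The host must now wait 30 seconds; for simplicity this sketch inserts it
    # into the waiting queue right after dispatch rather than after the download.
    heapq.heappush(waiting_queue, (time.time() + WAIT_SECONDS, host))
```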

8
3.4 Crawl Manager Data Structure (5)
  • When a new host is encountered
  • Create a host structure
  • Put it into the host dictionary
  • Insert a pointer to the host into the ready queue
  • When all URLs in a queue have been downloaded
  • The host is deleted from the structures
  • Certain information in the robots files is kept
  • If a host is not responding
  • Put the host into the waiting queue for some time
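A sketch of these three cases, again with illustrative names and an in-memory dict in place of the on-disk structures; the retry delay for unresponsive hosts is an arbitrary placeholder, not a value from the paper:

```python
import heapq
import time

def on_new_host(host, host_dict, ready_queue, first_request_number):
    # Create the host structure, register it in the dictionary, and mark it ready.
    host_dict[host] = {"queue": [], "robots": None}
    heapq.heappush(ready_queue, (first_request_number, host))

def on_host_exhausted(host, host_dict, robots_cache):
    # All URLs for this host have been downloaded: delete the structure,
    # but keep the information learned from its robots file.
    robots_cache[host] = host_dict[host]["robots"]
    del host_dict[host]

def on_host_down(host, waiting_queue, retry_after=600):
    # Unresponsive host: put it into the waiting queue for some time.
    heapq.heappush(waiting_queue, (time.time() + retry_after, host))
```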

9
3.5 Scheduling Policy and Manager Performance
  • If we immediately inserted all the URLs into the Berkeley DB B-tree structure
  • The structure would quickly grow beyond main memory size
  • This would result in bad I/O behavior
  • Thus, we would like to delay inserting the URLs into the structures as long as possible (see the sketch below)
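A sketch of the resulting policy, reusing the RequestFileQueue sketch from slide 4: request files are loaded from disk only while the number of ready hosts is below a target x (the threshold name and loop shape are illustrative):

```python
def refill_if_needed(ready_queue, request_files, enqueue_url, x=10_000):
    # Load request files from disk only while fewer than x hosts are ready.
    while len(ready_queue) < x:
        urls = request_files.load_next_file()
        if not urls:                      # no request files left on disk
            break
        for url in urls:
            # enqueue_url is assumed to insert the URL into its host queue and,
            # for a newly seen host, to push the host onto the ready queue.
            enqueue_url(url)
```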

10
3.5 Scheduling Policy and Manager Performance (2)
  • Goal: significantly decrease the size of the structures
  • We have enough hosts to keep the crawl running at
    the given speed
  • The total number of host structures and corresponding URL queues at any time is about x + s·t + n_d + n_t
  • x: the number of hosts in the ready queue
  • s·t (speed times waiting time): an estimate of the number of hosts currently waiting
  • n_d: the number of hosts that are waiting because they were down
  • n_t: the number of hosts in the dictionary (ignored here)
  • The number of host structures will usually be less than 2x
  • E.g., x = 10000, which for a speed of 120 pages per second resulted in at most 16000 hosts in the manager (see the worked example below)
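As a quick check of that estimate (a worked example; only x = 10000, the 120 pages per second speed, the 30-second wait, and the 16000-host observation come from the slide):

```python
x = 10_000          # hosts in the ready queue
s = 120             # crawl speed in pages per second
t = 30              # waiting (politeness) interval in seconds

waiting_hosts = s * t                 # about 3600 hosts in their waiting interval
estimate = x + waiting_hosts          # ignoring the n_d and n_t terms
print(estimate)                       # 13600, consistent with the observed maximum of ~16000
```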

11
3.5 Scheduling Policy and Manager Performance (3)
  • The ordering is in fact the same as if we
    immediately insert all request URLs into the
    manager
  • Assume that when the number of hosts in the ready
    queue drops below x, the manager will be able to
    increase this number again to at least x before
    the downloaders actually run out of work

12
4.1 Results of a Large Crawl
  • 120 million web pages on about 5 million hosts
    (18 days)
  • In the last 4 days, the crawler was running at very low speed to download URLs from a few hundred very large host queues that remained
  • During operation, we limited the speed of the crawler to a certain rate, depending on the time of day, so that other users on campus were not inconvenienced

13
4.1 Results of a Large Crawl (2)
  • Network errors occur when a server
  • Is down
  • Does not exist
  • Behaves incorrectly
  • Is extremely slow
  • Some robots files were downloaded many times

14
4.2 Network Limits and Speed Control
  • We had to control the speed of our crawler so that the impact on other campus users was minimized (see the sketch after this list)
  • We usually limited the rate to about 80 pages per second (1 MB/s) during peak times
  • Up to 180 pages per second during the late night and early morning
  • Limits can be changed and displayed via a
    web-based Java interface
  • Connected to the Internet by a T3 link, with
    Cisco 3620 as main campus router
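A minimal token-bucket style sketch of such a speed limit; the paper does not describe the mechanism, and the time-of-day schedule, class, and function names here are illustrative:

```python
import time
from datetime import datetime

def pages_per_second_limit(now=None):
    # Illustrative schedule: 80 pages/s during the day, 180 pages/s late at night.
    hour = (now or datetime.now()).hour
    return 180 if hour >= 23 or hour < 6 else 80

class SpeedLimiter:
    # Spaces out page downloads so the crawler stays under the current target rate.

    def __init__(self):
        self.next_slot = time.monotonic()

    def wait_for_slot(self):
        interval = 1.0 / pages_per_second_limit()
        now = time.monotonic()
        self.next_slot = max(self.next_slot + interval, now)
        time.sleep(max(0.0, self.next_slot - now))
```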

15
4.2 Network Limits and Speed Control (2)
  • This data includes all traffic going in and out
    of the poly.edu domain over the 24 hours of May
    28, 2001.
  • At high crawl speeds, there is relatively little other traffic
  • A checkpoint is performed every 4 hours
  • This pattern does not show up in the outgoing bytes, since the crawler only sends out small requests
  • It is clearly visible in the number of outgoing frames, partly due to HTTP requests and the DNS system
(Charts: incoming bytes, outgoing bytes, and outgoing frames over the 24 hours)
16
4.3 System Performance and Configuration
  • Sun Ultra10 workstations and a dual-processor Sun
    E250
  • Downloader
  • Most of the CPU, little memory
  • Manager
  • Little CPU time
  • Reasonable amount (100MB) of buffer space for
    Berkeley DB
  • The downloader and the manager ran on one machine, and all other components on the other

17
5. Comparison with Other Systems
  • Mercator
  • Flexibility through pluggable components
  • Centralized crawler
  • Data can be parsed directly in memory and does not have to be written to disk and read back
  • Uses caching to catch most of the random I/O, and a fast disk system for the rest
  • Good I/O performance by hashing hostnames

18
5. Comparison with Other Systems (2)
  • Atrax
  • A recent distributed version of Mercator
  • Ties several Mercator systems together
  • We are not yet familiar with many details of Atrax
  • Uses a disk-efficient merge
  • Very similar approach for scaling
  • Uses Mercator as its basic unit of replication

19
6. Conclusions and Future Work
  • We have
  • Described the architecture and implementation
    details of our crawling system
  • Presented some preliminary experiments
  • There are obviously many possible improvements to the system
  • Future work
  • A detailed study of the scalability of the system
    and the behavior of its components