Title: Design and Implementation of a High-Performance Distributed Web Crawler
Design and Implementation of a High-Performance Distributed Web Crawler
- Polytechnic University
- Vladislav Shkapenyuk, Torsten Suel
- 06/13/2006
Contents
- 3. Implementation Details and Algorithmic Techniques
- 3.4 Crawl Manager Data Structure
- 3.5 Scheduling Policy and Manager Performance
- 4. Experimental Results and Experiences
- 4.1 Results of a Large Crawl
- 4.2 Network Limits and Speed Control
- 4.3 System Performance and Configuration
- 4.4 Other Experiences
- 5. Comparison with Other Systems
- 6. Conclusions and Future Work
Manager Data Structures
3.4 Crawl Manager Data Structure
- The crawl manager maintains a number of data structures for scheduling the requests on the downloaders
- FIFO request queue
  - A list of request files
  - Each file contains a few hundred or a few thousand URLs
  - Files are located on disk and not immediately loaded
  - They stay on disk as long as possible (see the sketch below)
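As a rough sketch (ours, not the authors' code; all names are invented) of this lazy handling, the manager can track request files by path only and parse a file's URLs just before the downloaders need more work:

    from collections import deque

    class RequestFileQueue:
        """FIFO queue of on-disk request files; URLs are loaded lazily."""
        def __init__(self):
            self.files = deque()  # paths of request files, oldest first

        def add_request_file(self, path):
            # A request file holds a few hundred or thousand URLs; we record
            # only its path and leave its contents on disk for now.
            self.files.append(path)

        def load_next_batch(self):
            # Parse one file's URLs only when the manager needs more work.
            if not self.files:
                return []
            with open(self.files.popleft()) as f:
                return [line.strip() for line in f if line.strip()]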
3.4 Crawl Manager Data Structure (2)
- There are a number of FIFO host queues containing URLs, organized by hostname
  - Stored in Berkeley DB using a single B-tree, with the hostname as key (see the sketch below)
- Once a host has been selected for download
  - We take the first entry in the corresponding host queue
  - and send it to a downloader
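A hedged sketch of this single-B-tree layout using the bsddb3 Python bindings for Berkeley DB; the key format (hostname plus an arrival sequence number) is our assumption, not taken from the paper:

    from bsddb3 import db  # Python bindings for Berkeley DB

    queues = db.DB()
    queues.open("host_queues.db", dbtype=db.DB_BTREE, flags=db.DB_CREATE)

    def enqueue(hostname, seq, url):
        # Keys sort by hostname first and arrival order second, so each
        # host's URLs form a contiguous FIFO run inside one shared B-tree.
        key = ("%s\x00%010d" % (hostname, seq)).encode()
        queues.put(key, url.encode())

    def dequeue_first(hostname):
        # Fetch and remove the oldest URL queued for this host.
        prefix = (hostname + "\x00").encode()
        cur = queues.cursor()
        try:
            rec = cur.set_range(prefix)  # first key >= prefix, or None
        except db.DBNotFoundError:
            rec = None
        if rec and rec[0].startswith(prefix):
            cur.delete()
            cur.close()
            return rec[1].decode()
        cur.close()
        return None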
3.4 Crawl Manager Data Structure (3)
- Three different host structures (see the sketch below)
  - Host dictionary
    - An entry for each host
    - With a pointer to the corresponding host queue
  - Ready queue
    - Pointers to those hosts that are ready for download
  - Waiting queue
    - Pointers to those hosts that have recently been accessed and are now waiting for 30 seconds
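A minimal in-memory sketch of the three structures (ours; we use priority heaps where the paper just says queues, and all names are invented):

    import heapq
    from collections import deque

    class Host:
        def __init__(self, name):
            self.name = name
            self.queue = deque()  # FIFO host queue of (request_no, url) pairs

    host_dict = {}      # host dictionary: hostname -> Host (and its URL queue)
    ready_queue = []    # heap of (request number of first queued URL, hostname)
    waiting_queue = []  # heap of (time the host may be contacted again, hostname)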
3.4 Crawl Manager Data Structure (4)
- Each host pointer in the ready queue has as its key value the request number of the first URL in the corresponding host queue
- Select the URL with the lowest request number among all URLs that are ready to be downloaded, and send it to the downloader
- After the page has been downloaded, a pointer to the host is inserted into the waiting queue together with its waiting time
- Once a host's waiting time has passed, it is transferred back into the ready queue (see the sketch below)
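Continuing the sketch from the previous slide, the selection and re-queueing steps might look as follows (WAIT and the function names are our inventions; host deletion is covered on the next slide):

    import heapq, time

    WAIT = 30  # seconds a host must wait between accesses

    def wake_ready_hosts():
        # Transfer hosts whose waiting time has passed back to the ready queue.
        now = time.time()
        while waiting_queue and waiting_queue[0][0] <= now:
            _, name = heapq.heappop(waiting_queue)
            host = host_dict[name]
            heapq.heappush(ready_queue, (host.queue[0][0], name))

    def next_request():
        # Pick the ready host whose first URL has the lowest request number.
        wake_ready_hosts()
        _, name = heapq.heappop(ready_queue)
        return host_dict[name], host_dict[name].queue.popleft()

    def on_page_downloaded(host):
        # After a download the host waits before it may be contacted again.
        if host.queue:
            heapq.heappush(waiting_queue, (time.time() + WAIT, host.name))
        else:
            del host_dict[host.name]  # all URLs done; drop the structures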
3.4 Crawl Manager Data Structure (5)
- When a new host is encountered
  - Create a host structure
  - Put it into the host dictionary
  - Insert a pointer to the host into the ready queue
- When all URLs in a queue have been downloaded
  - The host is deleted from the structures
  - Certain information from its robots file is kept
- If a host is not responding
  - Put the host into the waiting queue for some time (see the sketch below)
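The remaining transitions, again as our continuation of the same sketch (the penalty period for unresponsive hosts is a placeholder value):

    def add_url(hostname, request_no, url):
        # A new host gets a structure, an entry in the host dictionary, and
        # a pointer in the ready queue; known hosts just grow their queue.
        host = host_dict.get(hostname)
        if host is None:
            host = Host(hostname)
            host_dict[hostname] = host
            heapq.heappush(ready_queue, (request_no, hostname))
        host.queue.append((request_no, url))

    def penalize_down_host(host, penalty=600):
        # A host that is not responding goes back into the waiting queue
        # for some time instead of being retried immediately.
        heapq.heappush(waiting_queue, (time.time() + penalty, host.name))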
3.5 Scheduling Policy and Manager Performance
- If we immediately inserted all the URLs into the Berkeley DB B-tree structure, it would
  - Quickly grow beyond main memory size
  - Result in bad I/O behavior
- Thus, we would like to delay inserting the URLs into the structures as long as possible
3.5 Scheduling Policy and Manager Performance (2)
- Goal: significantly decrease the size of the structures
- While making sure we have enough hosts to keep the crawl running at the given speed
- The total number of host structures and corresponding URL queues at any time is about x + s*t + n_d + n_t, where
  - x is the number of hosts in the ready queue
  - s*t is an estimate of the number of hosts currently waiting (crawl speed s times the wait time t)
  - n_d is the number of hosts that are waiting because they were down
  - n_t is the number of remaining hosts in the dictionary (ignored in the estimate)
- The number of host structures will usually be less than 2x
- E.g., x = 10000, which for a speed of 120 pages per second resulted in at most 16000 hosts in the manager
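As a quick sanity check of this estimate (our arithmetic, omitting n_d and n_t and using the t = 30 s wait time from Section 3.4):

    x, s, t = 10000, 120, 30   # ready hosts, pages per second, wait seconds
    print(x + s * t)           # 13600, below the at most 16000 hosts observed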
3.5 Scheduling Policy and Manager Performance (3)
- The resulting ordering is in fact the same as if we immediately inserted all requested URLs into the manager
- This assumes that whenever the number of hosts in the ready queue drops below x, the manager is able to raise it back to at least x before the downloaders actually run out of work
4.1 Results of a Large Crawl
- 120 million web pages on about 5 million hosts (18 days)
- In the last 4 days, the crawler was running at very low speed to download URLs from the few hundred very large host queues that remained
- During operation, we limited the speed of the crawler to a certain rate, depending on the time of day, so that other users on campus were not inconvenienced
4.1 Results of a Large Crawl (2)
- Network errors: a server may
  - Be down
  - Not exist
  - Behave incorrectly
  - Be extremely slow
- Some robots files were downloaded many times
4.2 Network Limits and Speed Control
- We had to control the speed of our crawler so that the impact on other campus users was minimized
- Rates were usually limited to about 80 pages per second (1 MB/s) during peak times
- Up to 180 pages per second during the late night and early morning
- Limits can be changed and displayed via a web-based Java interface (see the sketch below)
- The campus is connected to the Internet by a T3 link, with a Cisco 3620 as the main campus router
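The paper does not spell out the limiting mechanism itself, so the following token-bucket limiter is purely our sketch of how such per-second speed control could work; set_rate is the hook a control interface (such as the web-based one above) might call:

    import time

    class RateLimiter:
        """Token bucket limiting downloads to a pages-per-second rate."""
        def __init__(self, pages_per_second):
            self.rate = float(pages_per_second)
            self.allowance = self.rate
            self.last = time.time()

        def set_rate(self, pages_per_second):
            # Changed at runtime, e.g. 80 pages/s at peak, 180 at night.
            self.rate = float(pages_per_second)

        def acquire(self):
            # Block until one download "token" is available.
            while True:
                now = time.time()
                self.allowance = min(self.rate,
                                     self.allowance + (now - self.last) * self.rate)
                self.last = now
                if self.allowance >= 1.0:
                    self.allowance -= 1.0
                    return
                time.sleep((1.0 - self.allowance) / self.rate)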
4.2 Network Limits and Speed Control (2)
- This data includes all traffic going in and out of the poly.edu domain over the 24 hours of May 28, 2001
- At high speed, there is relatively little other traffic
- A checkpoint is performed every 4 hours
- The crawl hardly shows up in the outgoing bytes, since the crawler only sends out small requests
- It is clearly visible in the number of outgoing frames, partly due to HTTP requests and the DNS system
[Figure: poly.edu traffic over 24 hours - incoming bytes, outgoing bytes, outgoing frames]
4.3 System Performance and Configuration
- Hardware: Sun Ultra 10 workstations and a dual-processor Sun E250
- Downloader
  - Uses most of the CPU, little memory
- Manager
  - Uses little CPU time
  - Needs a reasonable amount (100 MB) of buffer space for Berkeley DB
- The downloader and the manager run on one machine, and all other components on the other
5. Comparison with Other Systems
- Mercator
  - Flexibility through pluggable components
  - A centralized crawler
  - Data can be parsed directly in memory and does not have to be written to disk and read back
  - Uses caching to catch most of the random I/O, plus a fast disk system
  - Gets good I/O performance by hashing hostnames
5. Comparison with Other Systems (2)
- Atrax
  - A recent distributed version of Mercator
  - Ties several Mercator systems together
  - We are not yet familiar with many details of Atrax
  - Uses a disk-efficient merge
  - A very similar approach to scaling
  - Uses Mercator as its basic unit of replication
6. Conclusions and Future Work
- We have
  - Described the architecture and implementation details of our crawling system
  - Presented some preliminary experiments
- There are obviously many possible improvements to the system
- Future work
  - A detailed study of the scalability of the system and the behavior of its components