1
Design and Implementation of a High-Performance
Distributed Web Crawler
  • Polytechnic University
  • Vladislav Shkapenyuk, Torsten Suel
  • 06/13/2006

2
Contents
  • 3. Implementation Details and Algorithmic
    Techniques
  • 3.4 Crawl Manager Data Structure
  • 3.5 Scheduling Policy and Manager Performance
  • 4. Experimental Results and Experiences
  • 4.1 Results of a Large Crawl
  • 4.2 Network Limits and Speed Control
  • 4.3 System Performance and Configuration
  • 4.4 Other Experiences
  • 5. Comparison with Other Systems
  • 6. Conclusions and Future Work

3
Manager Data Structures
4
3.4 Crawl Manager Data Structure
  • The crawl manager maintains a number of data
    structures for scheduling the requests on the
    downloaders
  • FIFO request queue
  • A list of request files, each containing a few hundred or a few thousand URLs
  • The files are located on disk and are not immediately loaded
  • They stay on disk as long as possible (see the sketch below)
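A minimal Python sketch of this idea, assuming request files are plain text files with one URL per line (the class name and file format here are illustrative, not taken from the paper):

```python
from collections import deque

class RequestFileQueue:
    """FIFO of request files whose URLs stay on disk until they are needed."""

    def __init__(self):
        self.files = deque()          # paths of request files, in arrival order

    def enqueue_file(self, path):
        # Only the path is kept in memory; the URLs themselves remain on disk.
        self.files.append(path)

    def load_next_file(self):
        # Load one request file into memory, only when the manager needs more URLs.
        if not self.files:
            return []
        path = self.files.popleft()
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]
```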

5
3.4 Crawl Manager Data Structure (2)
  • There are a number of FIFO host queues containing
    URLs, organized by hostname
  • Maintained in Berkeley DB using a single B-tree (hostname as the key)
  • Once a host has been selected for download, we take the first entry in the corresponding host queue and send it to a downloader (see the sketch below)
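A sketch of the per-host FIFO queues, with a plain Python dict standing in for the single Berkeley DB B-tree keyed by hostname (the function names are illustrative):

```python
from collections import deque
from urllib.parse import urlparse

# One FIFO of URLs per hostname; a dict stands in for the B-tree here.
host_queues = {}

def enqueue_url(url):
    host = urlparse(url).hostname
    host_queues.setdefault(host, deque()).append(url)

def next_url_for(host):
    # Take the first entry in the selected host's queue, to be sent to a downloader.
    queue = host_queues.get(host)
    return queue.popleft() if queue else None
```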

6
3.4 Crawl Manager Data Structure (3)
  • Three different host structures
  • Host dictionary
  • An entry for each host
  • A pointer to the corresponding host queue
  • Ready queue
  • Pointers to those hosts that are ready for download
  • Waiting queue
  • Pointers to those hosts that have recently been accessed and are now waiting for 30 seconds (see the sketch below)
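A minimal sketch of these three structures, using Python's heapq for the two priority queues (the names and in-memory representation are illustrative; the paper keeps these structures in Berkeley DB):

```python
import heapq

# Host dictionary: hostname -> per-host state, including its URL queue.
# Queue entries are (request_number, url) pairs in this sketch.
host_dict = {}

# Ready queue: min-heap of (request number of the first URL in the host queue, hostname).
ready_queue = []

# Waiting queue: min-heap of (time at which the host may be contacted again, hostname).
waiting_queue = []

def mark_ready(host, first_request_number):
    heapq.heappush(ready_queue, (first_request_number, host))

def mark_waiting(host, ready_again_at):
    heapq.heappush(waiting_queue, (ready_again_at, host))
```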

7
3.4 Crawl Manager Data Structure (4)
  • Each host pointer in the ready queue has as its
    key value the request number of the first URL in
    the corresponding host queue
  • Select the URL with the lowest request number among all URLs that are ready to be downloaded, and send it to the downloader
  • After the page has been downloaded, a pointer to the host is inserted into the waiting queue, tagged with the time at which the host becomes ready again
  • Once a host's waiting time has passed, it is transferred back into the ready queue (see the sketch below)
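A sketch of this scheduling step, reusing the heap-based structures from the previous sketch; the 30-second interval is from the paper, while the downloader callback and data layout are stand-ins:

```python
import heapq
import time

WAIT_SECONDS = 30   # politeness interval between accesses to the same host

def schedule_one(host_dict, ready_queue, waiting_queue, send_to_downloader):
    # Move hosts whose waiting time has passed back into the ready queue.
    now = time.time()
    while waiting_queue and waiting_queue[0][0] <= now:
        _, host = heapq.heappop(waiting_queue)
        queue = host_dict.get(host, {}).get("queue", [])
        if queue:                                   # hosts with empty queues are dropped
            heapq.heappush(ready_queue, (queue[0][0], host))

    if not ready_queue:
        return

    # Pick the ready host whose first URL has the lowest request number.
    _, host = heapq.heappop(ready_queue)
    _, url = host_dict[host]["queue"].pop(0)
    send_to_downloader(url)

    # The host must now wait 30 seconds; for simplicity this sketch inserts it
    # into the waiting queue right after dispatch rather than after the download.
    heapq.heappush(waiting_queue, (time.time() + WAIT_SECONDS, host))
```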

8
3.4 Crawl Manager Data Structure (5)
  • When a new host is encountered
  • Create a host structure
  • Put it into the host dictionary
  • Insert a pointer to the host into the ready queue
  • When all URLs in a queue have been downloaded
  • The host is deleted from the structures
  • Certain information in the robots files is kept
  • If a host is not responding
  • Put the host into the waiting queue for some time
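A sketch of these three cases, again with illustrative names and an in-memory dict in place of the on-disk structures; the retry delay for unresponsive hosts is an arbitrary placeholder, not a value from the paper:

```python
import heapq
import time

def on_new_host(host, host_dict, ready_queue, first_request_number):
    # Create the host structure, register it in the dictionary, and mark it ready.
    host_dict[host] = {"queue": [], "robots": None}
    heapq.heappush(ready_queue, (first_request_number, host))

def on_host_exhausted(host, host_dict, robots_cache):
    # All URLs for this host have been downloaded: delete the structure,
    # but keep the information learned from its robots file.
    robots_cache[host] = host_dict[host]["robots"]
    del host_dict[host]

def on_host_down(host, waiting_queue, retry_after=600):
    # Unresponsive host: put it into the waiting queue for some time.
    heapq.heappush(waiting_queue, (time.time() + retry_after, host))
```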

9
3.5 Scheduling Policy and Manager Performance
  • If we immediately inserted all the URLs into the Berkeley DB B-tree structure
  • The structure would quickly grow beyond main memory size
  • This would result in bad I/O behavior
  • Thus, we would like to delay inserting the URLs into the structures as long as possible (see the sketch below)
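A sketch of the resulting policy, reusing the RequestFileQueue sketch from slide 4: request files are loaded from disk only while the number of ready hosts is below a target x (the threshold name and loop shape are illustrative):

```python
def refill_if_needed(ready_queue, request_files, enqueue_url, x=10_000):
    # Load request files from disk only while fewer than x hosts are ready.
    while len(ready_queue) < x:
        urls = request_files.load_next_file()
        if not urls:                      # no request files left on disk
            break
        for url in urls:
            # enqueue_url is assumed to insert the URL into its host queue and,
            # for a newly seen host, to push the host onto the ready queue.
            enqueue_url(url)
```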

10
3.5 Scheduling Policy and Manager Performance (2)
  • Goal: significantly decrease the size of the structures
  • We have enough hosts to keep the crawl running at
    the given speed
  • The total number of host structures and corresponding URL queues at any time is about x + s·t + n_d + n_t
  • x: the number of hosts in the ready queue
  • s·t (speed times waiting time): an estimate of the number of hosts currently waiting
  • n_d: the number of hosts that are waiting because they were down
  • n_t: the number of hosts in the dictionary (ignored here)
  • The number of host structures will usually be less than 2x
  • E.g., x = 10000, which for a speed of 120 pages per second resulted in at most 16000 hosts in the manager (see the worked example below)
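As a quick check of that estimate (a worked example; only x = 10000, the 120 pages per second speed, the 30-second wait, and the 16000-host observation come from the slide):

```python
x = 10_000          # hosts in the ready queue
s = 120             # crawl speed in pages per second
t = 30              # waiting (politeness) interval in seconds

waiting_hosts = s * t                 # about 3600 hosts in their waiting interval
estimate = x + waiting_hosts          # ignoring the n_d and n_t terms
print(estimate)                       # 13600, consistent with the observed maximum of ~16000
```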

11
3.5 Scheduling Policy and Manager Performance (3)
  • The ordering is in fact the same as if we
    immediately insert all request URLs into the
    manager
  • Assume that when the number of hosts in the ready
    queue drops below x, the manager will be able to
    increase this number again to at least x before
    the downloaders actually run out of work

12
4.1 Results of a Large Crawl
  • 120 million web pages on about 5 million hosts
    (18 days)
  • In the last 4 days, the crawler was running at very low speed to download URLs from a few hundred very large host queues that remained
  • During operation, we limited the speed of the crawler to a certain rate, depending on the time of day, so that other users on campus were not inconvenienced

13
4.1 Results of a Large Crawl (2)
  • Network errors occur when a server
  • Is down
  • Does not exist
  • Behaves incorrectly
  • Is extremely slow
  • Some robots files were downloaded many times

14
4.2 Network Limits and Speed Control
  • We had to control the speed of our crawler so that the impact on other campus users was minimized (see the sketch after this list)
  • We usually limited the rate to about 80 pages per second (1 MB/s) during peak times
  • Up to 180 pages per second during the late night and early morning
  • Limits can be changed and displayed via a
    web-based Java interface
  • Connected to the Internet by a T3 link, with
    Cisco 3620 as main campus router
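A minimal token-bucket style sketch of such a speed limit; the paper does not describe the mechanism, and the time-of-day schedule, class, and function names here are illustrative:

```python
import time
from datetime import datetime

def pages_per_second_limit(now=None):
    # Illustrative schedule: 80 pages/s during the day, 180 pages/s late at night.
    hour = (now or datetime.now()).hour
    return 180 if hour >= 23 or hour < 6 else 80

class SpeedLimiter:
    # Spaces out page downloads so the crawler stays under the current target rate.

    def __init__(self):
        self.next_slot = time.monotonic()

    def wait_for_slot(self):
        interval = 1.0 / pages_per_second_limit()
        now = time.monotonic()
        self.next_slot = max(self.next_slot + interval, now)
        time.sleep(max(0.0, self.next_slot - now))
```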

15
4.2 Network Limits and Speed Control (2)
  • This data includes all traffic going in and out
    of the poly.edu domain over the 24 hours of May
    28, 2001.
  • At high crawl speeds, there is relatively little other traffic
  • A checkpoint is performed every 4 hours
  • This pattern does not show up in the outgoing bytes, since the crawler only sends out small requests
  • It is clearly visible in the number of outgoing frames, partly due to HTTP requests and the DNS system
(Charts: incoming bytes, outgoing bytes, and outgoing frames over the 24 hours)
16
4.3 System Performance and Configuration
  • Sun Ultra10 workstations and a dual-processor Sun
    E250
  • Downloader
  • Most of the CPU, little memory
  • Manager
  • Little CPU time
  • Reasonable amount (100MB) of buffer space for
    Berkeley DB
  • The downloader and the manager ran on one machine, and all other components on the other

17
5. Comparison with Other Systems
  • Mercator
  • Flexibility through pluggable components
  • Centralized crawler
  • Data can be parsed directly in memory and does not have to be written to disk and read back
  • Uses caching to catch most of the random I/O, and a fast disk system for the rest
  • Good I/O performance by hashing hostnames

18
5. Comparison with Other Systems (2)
  • Atrax
  • A recent distributed version of Mercator
  • Ties several Mercator systems together
  • We are not yet familiar with many details of Atrax
  • Uses a disk-efficient merge
  • Very similar approach for scaling
  • Uses Mercator as its basic unit of replication

19
6. Conclusions and Future Work
  • We have
  • Described the architecture and implementation
    details of our crawling system
  • Presented some preliminary experiments
  • There are obviously many possible improvements to the system
  • Future work
  • A detailed study of the scalability of the system
    and the behavior of its components