1
Implementation Issues of Distributed Crawlers
  • A Review of Implementation Issues in Distributed Crawling Architectures
  • by Ditesh Kumar Loh Jin Tiam
  • Faculty of Computer Science and Information Technology
  • Universiti Malaya

2
  • What are crawlers?
  • Networked software systems that perform indexing
    services of certain resources.
  • Most crawlers focus on web content (e.g. HTML
    pages).
  • They generally serve as a backend to search engines.

3
  • Functionality of Crawlers

  • Download resource
  • Store resource
  • Retrieve pointer
  • Perform pointer analysis
  • Store new pointers
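
The five steps above can be sketched as a simple loop. This is only an illustration under stated assumptions: `FAKE_WEB`, `LinkParser` and `crawl` are hypothetical names, the "web" is an in-memory dictionary so that the sketch runs without network access, and "pointer analysis" is reduced to resolving relative links and dropping ones already seen.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://a.com/": '<a href="/about">About</a> <a href="http://b.com/">B</a>',
    "http://a.com/about": '<a href="/">Home</a>',
    "http://b.com/": "<p>No links here</p>",
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag (the "pointers")."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch=FAKE_WEB.get):
    store = {}                # stored resources, keyed by URL
    frontier = deque([seed])  # stored pointers that still have to be visited
    seen = {seed}
    while frontier:
        url = frontier.popleft()
        body = fetch(url)                 # 1. download resource
        if body is None:
            continue
        store[url] = body                 # 2. store resource
        parser = LinkParser()
        parser.feed(body)                 # 3. retrieve pointers
        for href in parser.links:
            link = urljoin(url, href)     # 4. pointer analysis (resolve, de-duplicate)
            if link not in seen:
                seen.add(link)
                frontier.append(link)     # 5. store new pointers
    return store


pages = crawl("http://a.com/")
```

A real crawler would replace `fetch` with an HTTP client and add politeness delays and error handling, none of which are shown here.
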
4
  • Issues
  • Architectures
  • Coverage, overlap and communication overhead
  • Assignment of Responsibility
  • Scalability
  • URL Partitioning

5
  • Distributed Architectures
  • One server, multiple crawlers

[Diagram: one server and three crawlers]
6
  • Architectures (many crawlers-one server)
  • Advantages
  • Well-known architecture, easy to code.
  • Easy to partition responsibility.
  • Fault tolerance is easy to provide (e.g. if a crawler
    dies, it can be restarted).
  • Scalability is not an issue while the number of
    crawlers is small.
  • Process and performance monitoring become easier.
  • Less network traffic.

7
  • Architectures (many crawlers-one server)
  • Disadvantages
  • The server is a potential bottleneck.
  • Not practical when the number of crawlers is large.
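
A minimal sketch of the many-crawlers, one-server arrangement, assuming the server owns both the URL queue and the collected results. The `Server` class, the `crawler` function and the example.com URLs are hypothetical, and no real downloading takes place. It is meant only to show why the server is easy to reason about and also why it can become the bottleneck: every request for work and every result passes through it.

```python
import queue
import threading


class Server:
    """Central coordinator: owns the URL frontier and the collected results."""

    def __init__(self, urls):
        self.frontier = queue.Queue()
        for url in urls:
            self.frontier.put(url)
        self.results = {}
        self.lock = threading.Lock()

    def next_url(self):
        # Hand out one unit of work, or None when nothing is left.
        try:
            return self.frontier.get_nowait()
        except queue.Empty:
            return None

    def report(self, crawler_id, url):
        # All results funnel through this single lock.
        with self.lock:
            self.results[url] = crawler_id


def crawler(crawler_id, server):
    while (url := server.next_url()) is not None:
        server.report(crawler_id, url)  # a real crawler would download here


server = Server([f"http://example.com/page{i}" for i in range(30)])
workers = [threading.Thread(target=crawler, args=(i, server)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because the queue hands each URL to exactly one crawler, no URL is crawled twice, and restarting a dead crawler only requires starting another thread against the same server.
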

8
  • Architectures
  • Peer-to-peer

[Diagram: three crawlers, no server]
9
  • Architectures (p2p)
  • Advantages
  • Scalable if coded properly.
  • Automatic partitioning of responsibility.
  • No central point of failure.
  • Fault-tolerant (e.g. if a crawler dies, it can be
    replaced by another crawler).

10
  • Architectures (p2p)
  • Disadvantages
  • Hard to code properly.
  • Network traffic is relatively higher.
  • Harder to collate the information collected.
  • Harder to monitor performance.

11
  • Balancing and Contravariance (URL Partitioning)
  • First rule: given a set of resources to crawl,
    the highest efficiency is attained if no two
    crawlers ever crawl the same subset (balancing).

[Diagram: Crawlers A, B, C and D, with the sites a.com, b.com, c.com and d.com divided among them]
12
  • Balancing and Contravariance (ctd.)
  • Second rule: if the total resource size changes,
    then the subset to be crawled by each crawler
    must change accordingly (contravariance).

[Diagram: Crawlers A through H]
13
  • Balancing and Contravariance (ctd.)
  • Four issues. First, ensure all crawlers get
    roughly an equal number of crawling assignments.
  • Second, ensure that the assignments never clash.
  • Third, it must be possible for any crawler to
    know who is responsible to crawl a particular
    resource.

14
  • Balancing and Contravariance (ctd.)
  • Fourth, each crawler must be able to adapt to new
    crawling assignments (to ensure contravariance).
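
One way to meet the third and fourth requirements together is rendezvous (highest-random-weight) hashing. The slides do not name this technique, so it is an assumption here, and `score`, `owner` and the site names are hypothetical. Any crawler can compute the owner of a host locally, and when a crawler joins, only the hosts that the newcomer wins change hands.

```python
import hashlib


def score(crawler, host):
    """Deterministic pseudo-random weight for a (crawler, host) pair."""
    digest = hashlib.sha256(f"{crawler}|{host}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(host, crawlers):
    """The crawler with the highest score for this host is responsible for it."""
    return max(crawlers, key=lambda c: score(c, host))


hosts = [f"site{i}.example" for i in range(1000)]
before = ["A", "B", "C", "D"]
after = before + ["E"]  # a new crawler joins the swarm

old = {h: owner(h, before) for h in hosts}
new = {h: owner(h, after) for h in hosts}
moved = [h for h in hosts if old[h] != new[h]]
```

Every host in `moved` now belongs to the new crawler `E`; the assignments of all other hosts are untouched, so each existing crawler simply gives up part of its share.
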

15
  • Balancing and Contravariance (ctd.)
  • A good technique is hashing the resource name and
    partitioning those hashes between the crawlers.
  • Papers suggest using domain names as the hash
    source.
  • A better way is to use the IP address (benefit:
    able to geographically separate crawling;
    disadvantage: a DNS lookup is needed for each host).
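
A minimal sketch of hash-based partitioning, assuming the host name is the hash source and that crawlers are numbered from 0. The function name `crawler_for`, the choice of SHA-256 and the modulo step are assumptions for illustration, not something the slides prescribe.

```python
import hashlib
from collections import Counter
from urllib.parse import urlsplit


def crawler_for(url, num_crawlers):
    """Map a URL to a crawler index by hashing its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers


# Hypothetical hosts, used only to look at how evenly the load spreads.
urls = [f"http://site{i}.example/index.html" for i in range(2000)]
load = Counter(crawler_for(u, 4) for u in urls)
```

All URLs on the same host land on the same crawler, assignments never clash, and any crawler can compute the owner without asking anyone. Note that a plain modulo reassigns most hosts whenever `num_crawlers` changes. To hash on the IP address instead, the host would first have to be resolved, which is the extra lookup cost mentioned above.
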

16
  • Coverage, overlap, communication overhead
  • Coverage is U/I
  • Overlap is N/U
  • Communication overhead is E/N
  • Goal
  • Maximize coverage
  • Minimize overlap
  • Minimize communication overhead
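
The slide does not define the symbols, so the following worked example rests on an assumed reading: U is the number of unique pages downloaded, I the number of pages that should be downloaded, N the total number of downloads (duplicates included) and E the number of messages exchanged between crawlers. The function and parameter names are hypothetical.

```python
def crawl_metrics(unique_pages, target_pages, total_downloads, messages):
    """Compute the three ratios under the assumed meaning of U, I, N and E."""
    return {
        "coverage": unique_pages / target_pages,    # U / I
        "overlap": total_downloads / unique_pages,  # N / U
        "overhead": messages / total_downloads,     # E / N
    }


# Made-up numbers, purely for illustration.
m = crawl_metrics(unique_pages=800, target_pages=1000,
                  total_downloads=1000, messages=250)
```

Under this reading, a coverage of 1 means every target page was fetched, an overlap of 1 means no page was fetched twice, and an overhead of 0 means the crawlers never exchanged messages.
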

17
  • Assignment of Responsibility
  • Firewall: ignores links to resources outside the
    domain of the crawling agent (CA). Zero
    communication overhead and overlap, but lower
    coverage.
  • Cross-over: retrieves resources whose links are
    outside the domain of the CA. Zero communication
    overhead and greater coverage, but a high overlap
    ratio.
  • Exchange: sends resource links to the relevant CA.
    Maximized coverage and reduced overlap, but high
    communication overhead.
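
The three models differ only in what a crawling agent does with a link that belongs to another agent's domain. A minimal sketch, where `handle_link`, the mode strings and the returned action strings are all hypothetical:

```python
def handle_link(mode, link_owner, me):
    """Return what agent `me` does with a link that `link_owner` is responsible for."""
    if link_owner == me:
        return "crawl"                   # inside own domain: always crawl
    if mode == "firewall":
        return "ignore"                  # no messages, no overlap, coverage may suffer
    if mode == "cross-over":
        return "crawl"                   # no messages, but another agent may fetch it too
    if mode == "exchange":
        return f"send to {link_owner}"   # costs one message, avoids overlap
    raise ValueError(f"unknown mode: {mode}")
```

For example, `handle_link("exchange", "B", "A")` returns `"send to B"`, while the same link in firewall mode is dropped.
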

18
  • Assignment of Responsibility
  • Best solution: a combination of the cross-over
    and exchange models.
  • The cross-over model is used when the
    communication overhead far outweighs the cost of
    the actual retrieval.
  • The exchange model is used when the communication
    overhead is just a fraction of the actual
    retrieval cost.
  • Intuitively, we expect the exchange model to be
    used most of the time.
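
The combined policy can be sketched as a cost comparison made for each foreign link. The slides give no threshold, so the plain "greater than" test below is an assumption, as are the function name `choose_mode` and the idea of expressing both costs in the same unit (for example milliseconds).

```python
def choose_mode(comm_cost, fetch_cost):
    """Pick how to handle a link owned by another agent.

    comm_cost:  estimated cost of sending the link to its owner
    fetch_cost: estimated cost of retrieving the resource locally
    """
    return "cross-over" if comm_cost > fetch_cost else "exchange"
```

With a cheap message and an expensive download, `choose_mode(comm_cost=2, fetch_cost=300)` selects the exchange model, which matches the expectation that exchange is the common case.
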

19
  • Scalability
  • Definition: the number of pages crawled per second
    per crawler should be (almost) independent of the
    number of crawlers.
  • Throughput then grows linearly with the number of
    agents (which makes performance easy to measure).
  • Load must be handled transparently by cleanly
    adding new agents to the crawling swarm.
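
A small worked example of checking this definition against measurements. The numbers are made up for illustration, and `per_agent_rate` and `efficiency` are hypothetical names. The idea is to divide total throughput by the number of agents and compare the result with the single-agent rate.

```python
def per_agent_rate(total_pages_per_sec, num_agents):
    """Pages crawled per second per agent."""
    return total_pages_per_sec / num_agents


# Hypothetical measurements: (number of agents, total pages per second).
measurements = [(1, 50.0), (2, 98.0), (4, 192.0), (8, 376.0)]

baseline = per_agent_rate(measurements[0][1], measurements[0][0])
efficiency = {n: per_agent_rate(total, n) / baseline for n, total in measurements}
```

An efficiency that stays close to 1.0 as agents are added indicates that the per-agent rate is nearly independent of the number of agents, which is the same as total throughput growing almost linearly.
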

20
  • Scalability
  • Crawling agents can have multiple threads. Google
    uses 300 threads/agent, Mercator uses 100
    threads/agent, Polytechnic uses 1000
    threads/agent.
  • The issue is OS thread support versus fork() support.
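
A minimal sketch of one agent running several download threads, using Python's standard `concurrent.futures.ThreadPoolExecutor`. The `fetch` function is a stand-in that performs no network I/O, and `run_agent`, the example.com URLs and the thread count of 8 are assumptions chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    # Stand-in for a blocking download; a real agent would do network I/O here.
    return url, len(url)


def run_agent(urls, num_threads):
    """Process all URLs using a fixed pool of worker threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return dict(pool.map(fetch, urls))


urls = [f"http://example.com/{i}" for i in range(50)]
results = run_agent(urls, num_threads=8)
```

Threads suit this workload because a crawler spends most of its time waiting on the network; how many threads an agent can sustain depends on the operating system, which is the trade-off the slide points at.
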

21
  • Conclusion
  • Distributed crawlers are harder to design, code
    and monitor.
  • Yet a distributed crawler architecture has
    significant benefits that justify building it.
  • Issues that require further attention include
    using time information for next-crawl prediction
    and crawling relatively dynamic resources.