Implementation Issues of Distributed Crawlers

Provided by: diteshG

1
Implementation Issues of Distributed Crawlers

A Review of Implementation Issues in Distributed Crawling Architectures
by Ditesh Kumar Loh Jin Tiam
Faculty of Computer Science and Information Technology
Universiti Malaya

2

What are crawlers?
Networked software systems that index certain resources.
Most crawlers focus on web content (e.g. HTML pages).
They generally serve as a backend to search engines.

3

Functionality of Crawlers

Download Resource
Store Resource
Retrieve Pointer
Perform Pointer Analysis
Store New Pointers
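
The five steps above can be read as a single loop. The following is a minimal Python sketch, assuming that a "pointer" is a hyperlink and that "pointer analysis" means resolving, filtering and de-duplicating links. The names `crawl`, `LinkExtractor` and the caller-supplied `fetch` function are hypothetical and do not come from the slides.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags (the 'pointers' in a page)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Run the five-step loop; `fetch(url)` returns HTML text or None."""
    frontier = deque([seed])   # stored pointers waiting to be crawled
    seen = {seed}              # every pointer discovered so far
    pages = {}                 # stored resources, keyed by URL
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)                      # 1. download resource
        if html is None:
            continue
        pages[url] = html                      # 2. store resource
        parser = LinkExtractor()
        parser.feed(html)                      # 3. retrieve pointers
        for href in parser.links:
            link = urljoin(url, href)          # 4. pointer analysis: resolve,
            if link.startswith("http") and link not in seen:  # filter, dedupe
                seen.add(link)
                frontier.append(link)          # 5. store new pointers
    return pages
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP library; a real crawler would also need politeness delays and robots.txt handling, which are left out here.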

4

Issues
Architectures
Coverage, overlap and communication overhead
Assignment of Responsibility
Scalability
URL Partitioning

5

Distributed Architectures
One server, multiple crawlers

[Diagram: one server and three crawlers]

6

Architectures (many crawlers-one server)
Advantages
Well-known architecture, easy to code.
Easy to partition responsibility.
Easy to make fault-tolerant (e.g. if a crawler dies, it can be restarted).
Scalability is not an issue.
Process and performance monitoring become easier.
Less network traffic.

7

Architectures (many crawlers-one server)
Disadvantages
The server is a potential bottleneck.
Not practical when the number of crawlers is large.
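
To make the trade-off concrete, here is a minimal sketch of what the central server might look like. `CrawlServer` and its methods are hypothetical names, not taken from the slides, and the sketch assumes crawlers call into the server through some transport that is not shown. The single lock illustrates the bottleneck, and `crawler_died` illustrates why restarting a dead crawler is simple in this architecture.

```python
import threading
from collections import deque


class CrawlServer:
    """Hypothetical central coordinator: hands out URLs and tracks who holds them."""

    def __init__(self, seeds):
        self._lock = threading.Lock()      # every crawler contends on this lock
        self._pending = deque(seeds)       # URLs not yet assigned
        self._seen = set(seeds)            # URLs ever queued
        self._in_progress = {}             # url -> id of the crawler holding it

    def next_url(self, crawler_id):
        """Assign the next pending URL, or return None if there is none."""
        with self._lock:
            if not self._pending:
                return None
            url = self._pending.popleft()
            self._in_progress[url] = crawler_id
            return url

    def report(self, url, new_links):
        """Mark `url` as done and queue any links not seen before."""
        with self._lock:
            self._in_progress.pop(url, None)
            for link in new_links:
                if link not in self._seen:
                    self._seen.add(link)
                    self._pending.append(link)

    def crawler_died(self, crawler_id):
        """Requeue the work of a dead crawler so a restarted one can take it."""
        with self._lock:
            lost = [u for u, c in self._in_progress.items() if c == crawler_id]
            for url in lost:
                del self._in_progress[url]
                self._pending.appendleft(url)
```

Because all assignment state lives in one place, partitioning and monitoring are easy, but every request from every crawler passes through the same lock and the same machine.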

8

Architectures
Peer-to-peer

[Diagram: three peer crawlers, no server]

9

Architectures (p2p)
Advantages
Scalable if coded properly.
Automatic partitioning of responsibility.
No central point of failure.
Fault-tolerant (e.g. if a crawler dies, it can be replaced by another crawler).

10

Architectures (p2p)
Disadvantages
Hard to code properly.
Relatively higher network traffic.
Harder to collate the collected information.
Harder to monitor performance.

11

Balancing and Contravariance (URL Partitioning)
First rule: given a set of resources to crawl, the highest efficiency is attained if no two crawlers ever crawl the same subset (balancing).

[Diagram: Crawlers A, B, C and D dividing a.com, b.com, c.com and d.com among themselves]

12

Balancing and Contravariance (ctd.)
Second rule: if the total resource size changes, then the subset crawled by each crawler must change accordingly (contravariance).

[Diagram: eight crawlers, A through H]

13

Balancing and Contravariance (ctd.)
Four issues:
First, ensure all crawlers get roughly an equal number of crawling assignments.
Second, ensure that the assignments never clash.
Third, it must be possible for any crawler to know who is responsible for crawling a particular resource.

14

Balancing and Contravariance (ctd.)
Fourth, each crawler must be able to adapt to new
crawling assignments (to ensure contravariance).

15

Balancing and Contravariance (ctd.)
A good technique is to hash the resource name and partition the hashes among the crawlers.
Papers suggest using the domain name as the hash source.
A better way is to use the IP address (benefit: crawling can be separated geographically; disadvantage: requires a reverse-DNS lookup).
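
One way to partition hashes among crawlers is a consistent-hashing ring. The slides do not name a specific scheme, so this choice is an assumption made for illustration. The sketch below hashes the host name, as the papers mentioned above suggest; `HostPartitioner` and its methods are hypothetical names, and the ring is assumed to hold at least one crawler.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _hash(key):
    """Stable 64-bit hash (Python's built-in hash() of str varies between runs)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class HostPartitioner:
    """Maps host names to crawlers with a consistent-hashing ring."""

    def __init__(self, crawlers, replicas=100):
        self._replicas = replicas   # points per crawler; more points, better balance
        self._ring = []             # sorted list of (point, crawler)
        for crawler in crawlers:
            self.add(crawler)

    def add(self, crawler):
        """Place `replicas` points for the new crawler on the ring."""
        for i in range(self._replicas):
            bisect.insort(self._ring, (_hash(f"{crawler}#{i}"), crawler))

    def remove(self, crawler):
        """Drop a crawler; its hosts fall to the next points on the ring."""
        self._ring = [(p, c) for p, c in self._ring if c != crawler]

    def owner(self, url):
        """Return the crawler responsible for the URL's host."""
        host = urlsplit(url).hostname or url
        # First ring point at or after the host's hash, wrapping around.
        i = bisect.bisect_left(self._ring, (_hash(host),))
        return self._ring[i % len(self._ring)][1]
```

Every crawler that knows the same list of crawlers computes the same `owner` locally, which covers the third issue. Each host has exactly one owner, so assignments never clash. When a crawler is added, a host either keeps its owner or moves to the new crawler, so existing crawlers only shrink their share. Hashing the IP address instead would only change the key passed to `_hash`.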

16

Coverage, overlap, communication overhead
Coverage is U/I
Overlap is N/U
Communication overhead is E/N
Goals:
Maximize coverage
Minimize overlap
Minimize communication overhead
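
The slides do not define the symbols in these ratios. The sketch below computes the three metrics under one plausible reading, which is an assumption and not a definition from the slides: U is the number of unique pages downloaded, I is the number of pages that should be crawled, N is the total number of downloads including duplicates, and E is the number of links exchanged between crawlers. `CrawlStats` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class CrawlStats:
    unique_pages: int      # U (assumed): distinct pages downloaded
    target_pages: int      # I (assumed): pages that should be crawled
    total_downloads: int   # N (assumed): downloads, duplicates included
    exchanged_links: int   # E (assumed): links sent between crawlers

    def coverage(self):
        """U/I: fraction of the target that was actually reached."""
        return self.unique_pages / self.target_pages

    def overlap(self):
        """N/U: downloads per unique page; 1.0 means no duplicate work."""
        return self.total_downloads / self.unique_pages

    def communication_overhead(self):
        """E/N: links exchanged per page downloaded."""
        return self.exchanged_links / self.total_downloads
```

Under this reading, the best possible values are a coverage of 1.0, an overlap of 1.0 and a communication overhead of 0.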

17

Assignment of Responsibility
Firewall: ignore resource links outside the domain of the crawling agent (CA). Zero communication overhead and zero overlap, but lower coverage.
Cross-over: retrieve resource links outside the domain of the CA. Zero communication overhead and greater coverage, but a high overlap ratio.
Exchange: send resource links to the relevant CA. Maximized coverage and reduced overlap, but high communication overhead.
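
The three modes differ only in what a crawling agent does with a link that falls outside its own domain. A minimal sketch of that decision follows; `handle_link`, `my_hosts`, `local_queue` and `outbox` are hypothetical names, and the transport that would deliver the outbox to other agents is not shown.

```python
from urllib.parse import urlsplit


def handle_link(link, my_hosts, mode, local_queue, outbox):
    """Route a discovered link according to the agent's mode.

    my_hosts: set of host names this crawling agent (CA) is responsible for.
    local_queue: list of URLs this CA will download itself.
    outbox: list of URLs to be sent to the responsible CA.
    """
    host = urlsplit(link).hostname
    if host in my_hosts:
        local_queue.append(link)       # own domain: always crawl locally
    elif mode == "firewall":
        pass                           # drop foreign links: coverage suffers
    elif mode == "cross-over":
        local_queue.append(link)       # crawl it anyway: overlap rises
    elif mode == "exchange":
        outbox.append(link)            # hand it over: communication cost
    else:
        raise ValueError(f"unknown mode: {mode}")
```

Each branch maps directly to one of the costs listed above: the firewall branch loses pages, the cross-over branch may duplicate another agent's work, and the exchange branch generates messages.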

18

Assignment of Responsibility
Best solution: a combination of the cross-over and exchange models.
The cross-over model is used when the communication overhead far outweighs the cost of the actual retrieval.
The exchange model is used when the communication overhead is just a fraction of the actual retrieval cost.
Intuitively, we expect the exchange model to be used most of the time.

19

Scalability
Definition: the number of pages crawled per second per crawler should be (almost) independent of the number of crawlers.
Throughput then grows linearly with the number of agents (easy to measure performance).
Load must be handled transparently by cleanly adding new agents to the crawling swarm.

20

Scalability
Crawling agents can have multiple threads: Google uses 300 threads/agent, Mercator uses 100 threads/agent, and Polytechnic uses 1000 threads/agent.
The issue is OS thread support versus fork()-ing support.
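
A multi-threaded agent can be sketched with Python's standard thread pool. This is only an illustration of the idea, not how any of the systems named above are built; `crawl_batch` and the caller-supplied `fetch` function are hypothetical, and the default of 100 threads simply reuses the Mercator figure quoted above.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_batch(urls, fetch, threads=100):
    """Download `urls` concurrently; returns {url: result or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Submit everything first so downloads overlap while we wait.
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:   # one failed download must not stop the rest
                results[url] = exc
    return results
```

Threads suit this workload because each worker spends most of its time waiting on the network. A process-based variant, closer to the fork() approach, could swap in `ProcessPoolExecutor` from the same module, at the cost of requiring `fetch` and its results to be picklable.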

21

Conclusion
Distributed crawlers are harder to design, code and monitor.
Yet a distributed crawler architecture has significant benefits that justify building it.
Issues that require further attention include using time information for next-crawl prediction, crawling relatively dynamic resources, etc.
