1
Web crawler
  • By PowerGroup

2
Group Members
  • 1. Thanida Limsirivallop  47541164
  • 2. Lucksamon Sivapattarakumpon  47541404
  • 3. Patrapee Suwannan  47542212
  • 4. Rataruch Tongpradith  47542246
  • Website: http://pirun.ku.ac.th/b4754116

3
Web Crawler Definition
  • A crawler is an automatic program (sometimes called a "robot") that explores the World Wide Web, following links and searching for information or building a database. Such programs are often used to build automated indexes for the Web, allowing users to do keyword searches for Web documents.
  • Web crawlers are programs that exploit the graph structure of the Web to move from page to page.

4
Crawling The Web
  • Beginning -> A key motivation for designing Web crawlers has been to retrieve Web pages and add them or their representations to a local repository.
  • Simplest form -> A crawler starts from a seed page and then uses the external links within it to visit other pages. The process repeats, with the new pages offering more external links to follow, until a sufficient number of pages are identified or some higher-level objective is reached (a minimal sketch of this loop follows this list).
  • General-purpose search engines
  • Serving as entry points to Web pages, they strive for coverage that is as broad as possible. They use Web crawlers to maintain their index databases, amortizing the cost of crawling and indexing over the millions of queries they receive.
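
The following is a minimal sketch of this seed-and-follow loop, not taken from the presentation: it assumes a single seed URL, a simple page limit as the stopping objective, and plain anchor-tag link extraction.

    # Minimal sketch of the seed-and-follow crawl loop described above.
    # The seed URL, page limit, and stopping rule are illustrative assumptions.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=50):
        frontier = deque([seed_url])   # URLs waiting to be fetched
        visited = set()                # URLs already fetched
        pages = {}                     # url -> HTML content (the local repository)
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except Exception:
                continue               # skip pages that fail to download
            pages[url] = html
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                frontier.append(urljoin(url, link))  # resolve relative links
        return pages

    # Example: pages = crawl("https://example.com/", max_pages=10)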

5
Crawling Infrastructure
Figure: basic sequential crawler
6
Web Crawler Requirements
  • The goal of the proposed crawler is to re-create the look and feel of a website as it existed on the crawl date.
  • The tool should be extensible to adapt to future changes in web standards.
  • General requirements:
  • Comprehensive downloading (Saving): the look and feel of the page is mirrored exactly, down to every image, link, and dynamic element.
  • Scope of the crawl: which links to follow and to what depth.
  • An intuitive and extensible interface (Interface): the crawler should use an intuitive graphical user interface.
  • Command-line interface, caching of pages, and rights.
  • Security of the server and browser (Niceness): the robots exclusion protocol must be obeyed. This requires downloading the robots.txt file before crawling the rest of the website.
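
A minimal sketch of the robots.txt check behind the Niceness requirement, using Python's standard urllib.robotparser; the user-agent string and URLs are illustrative assumptions.

    # Sketch of the robots-exclusion check: fetch robots.txt once per site,
    # then consult it before downloading any other page on that site.
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleCrawler"  # illustrative user-agent string

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # download and parse robots.txt before crawling the rest of the site

    def allowed(url):
        return rp.can_fetch(USER_AGENT, url)

    # Only fetch a page if the robots exclusion protocol permits it, e.g.
    # if allowed("https://example.com/some/page.html"): download it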

7
Web Crawler Requirements (cont.)
  • Presentation of dynamic elements (Dynamic page image): the crawler must download Shockwave Flash files and other content listed in EMBED tags. Care must be taken not to load the same file multiple times (a sketch of this check follows this list).
  • Accurate look and feel (Presentation): the most important aspect of archiving web sites; all links should be accurate. Almost always, re-crawling an archive is needed to rewrite all links to be internal.
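
As a hedged sketch of the "do not load the same file twice" point, the snippet below extracts EMBED (and OBJECT) sources and skips anything already downloaded; the helper names and the OBJECT handling are assumptions, not part of the original tool.

    # Sketch: extract src attributes from EMBED (and OBJECT) tags and
    # skip any file that has already been downloaded in this crawl.
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class EmbedExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.sources = []

        def handle_starttag(self, tag, attrs):
            if tag in ("embed", "object"):
                for name, value in attrs:
                    if name in ("src", "data") and value:
                        self.sources.append(value)

    downloaded = set()  # URLs of embedded files fetched so far

    def embedded_files_to_fetch(page_url, html):
        parser = EmbedExtractor()
        parser.feed(html)
        new_urls = []
        for src in parser.sources:
            absolute = urljoin(page_url, src)
            if absolute not in downloaded:   # avoid loading the same file twice
                downloaded.add(absolute)
                new_urls.append(absolute)
        return new_urls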

8
Dominos: A New Web Crawler's Design
  • This paper describes the design and implementation of a real-time distributed Web crawling system running on a cluster of machines, and introduces a high-availability crawling system called Dominos.
  • Dominos is a dynamic system, which accounts for its highly flexible deployment, maintainability and enhanced fault tolerance. Finally, the paper discusses the experimental results obtained, comparing them with other documented systems.

9
An Investigation of Web Crawler Behavior: Characterization and Metrics
  • This paper presents a characterization study of search-engine crawlers, using Web-server access logs from five academic sites in three different countries.
  • The results and observations provide useful insights into crawler behavior and serve as the basis of ongoing work on the automatic detection of Web crawlers.

10
Crawler-Friendly Web Servers
  • This paper studies how to make web servers more crawler-friendly, evaluating simple, easy-to-incorporate modifications to web servers that yield significant bandwidth savings.
  • This paper proposes that web servers export meta-data describing their pages so that crawlers can efficiently create and maintain large, fresh repositories.

11
The Evolution of the Web and Implications for an Incremental Crawler
  • This paper studies how to build an effective incremental crawler. The crawler selectively and incrementally updates its index and/or local collection of web pages, instead of periodically refreshing the collection in batch mode.
  • Based on the results, it discusses various design choices for a crawler and the possible trade-offs, and then proposes an architecture for an incremental crawler which combines the best strategies identified.

12
SharpSpider: Spidering the Web through Web Services
  • This paper presents SharpSpider, a distributed C# spider designed to address the issues of scalability, decentralisation and continuity of a Web crawl.
  • Fundamental to the design of SharpSpider is the publication of an API for use by other services on the network. Such an API grants access to a constantly refreshed index built after successive crawls of the Web.

13
The Anatomy of a Large-Scale Hypertextual Web Search Engine
  • This paper presents Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext, and provides an in-depth description of a large-scale web search engine.
  • Google is designed to be a scalable search engine. The primary goal is to provide high-quality search results over a rapidly growing World Wide Web. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

14
Incremental Web Search: Tracking Changes in the Web
  • This paper presents algorithms and techniques for detecting web page changes, extracting web pages, and evaluating web change.
  • It also presents an application built with these techniques and algorithms, named Web Daily News Assistant (WebDNA), currently deployed on the NYU web site.
  • The change of web documents is modeled using survival analysis. Modeling web changes is useful for web crawler scheduling and web caching.

15
An Investigation of Documents from the World Wide Web
  • This paper reports on an examination of pages from the World Wide Web, analyzing data collected by the Inktomi Web Crawler. The analyses of HTML cover evolution, improving Web content, control of HTML, sociological insights, user studies, content analyses, and structure analysis. Several tools were used to perform the data collection.

16
A Crawler-based Study of Spyware on the Web
  • Crawling the Web, downloading content from a large number of sites, and then analyzing it to determine whether it is malicious. In this way, several important questions can be answered, for example:
  • How much spyware is on the Internet?
  • Where is that spyware located (e.g., game sites, children's sites, adult sites, etc.)?
  • How likely is a user to encounter spyware through random browsing?
  • What kinds of threats does that spyware pose?
  • What fraction of executables on the Internet are infected with spyware?

17
Estimating Frequency of Change
  • Estimating the change frequency of data to improve Web crawlers and Web caches and to help data mining, by developing several frequency estimators and identifying various scenarios (a simple illustrative estimator follows below).
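
The paper develops several estimators; purely as an illustration, one simple estimator assumes a Poisson change model and regular visits, and is sketched below (the function name and parameters are assumptions, not the paper's notation).

    # Illustrative change-frequency estimator under a Poisson change model
    # (a simplified stand-in for the estimators developed in the paper).
    # If changes arrive with rate lam, the probability that a page changed
    # during an access interval I is 1 - exp(-lam * I).
    import math

    def estimate_change_rate(detected_changes, visits, interval_days):
        """Estimate changes per day from `detected_changes` out of `visits`
        accesses made every `interval_days` days."""
        p = detected_changes / visits          # observed fraction of visits with a change
        if p >= 1.0:
            return float("inf")                # changed on every visit: rate not identifiable
        return -math.log(1.0 - p) / interval_days

    # Example: a page found changed on 6 of 10 weekly visits
    # rate = estimate_change_rate(6, 10, 7)    # about 0.13 changes per day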

18
Collaborative Web Crawler over High-speed Research Network
  • A distributed web crawler that utilizes existing research networks.
  • Distributed web crawling is a distributed
    computing technique whereby Internet search
    engines employ many computers to index the
    Internet via web crawling. The idea is to spread
    out the required resources of computation and
    bandwidth to many computers and networks.

19
Crawling-based Classification
  • The categorization of a database is determined
    by its distribution of documents across
    categories.

20
Mercator: A Scalable, Extensible Web Crawler
  • The design features a crawler core for handling the main crawling tasks, with extensibility provided through protocol and processing modules (an illustrative sketch of this idea follows this list).
  • Users may supply new modules for performing customized crawling tasks.
  • We have used Mercator for a variety of purposes, including performing random walks on the web, crawling our corporate intranet, and collecting statistics about the web at large.
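
Mercator itself is written in Java; the sketch below is only a hypothetical illustration of the pluggable protocol/processing-module idea, and none of the names correspond to Mercator's actual API.

    # Hypothetical sketch of a pluggable crawler core: protocol modules are
    # selected by URL scheme, and every processing module sees each page.
    from urllib.parse import urlparse
    from urllib.request import urlopen

    protocol_modules = {}    # scheme -> fetch function
    processing_modules = []  # functions called on every downloaded page

    def register_protocol(scheme, fetcher):
        protocol_modules[scheme] = fetcher

    def register_processor(processor):
        processing_modules.append(processor)

    # Default HTTP(S) protocol modules
    register_protocol("http", lambda url: urlopen(url, timeout=10).read())
    register_protocol("https", lambda url: urlopen(url, timeout=10).read())

    # A user-supplied processing module, e.g. for collecting statistics
    register_processor(lambda url, content: print(url, len(content), "bytes"))

    def crawl_one(url):
        scheme = urlparse(url).scheme
        fetcher = protocol_modules.get(scheme)
        if fetcher is None:
            return                      # no module handles this scheme
        content = fetcher(url)
        for process in processing_modules:
            process(url, content)       # link extraction, statistics, indexing, ...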

21
Parallel Crawlers
  • A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization, and to avoid repeated downloads of the same page.
  • To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes (one common policy is sketched below).
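
One common assignment policy, sketched below as an assumption rather than the paper's specific scheme, is to hash the host name so that every URL of a given site belongs to exactly one crawling process.

    # Sketch of a URL assignment policy for a parallel crawler: hash the host
    # name so that every URL of a given site is owned by exactly one process.
    import hashlib
    from urllib.parse import urlparse

    NUM_PROCESSES = 4  # illustrative number of crawling processes

    def assigned_process(url):
        host = urlparse(url).netloc.lower()
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_PROCESSES

    # A process only downloads URLs assigned to it and forwards the rest,
    # so a URL discovered by two processes is still fetched only once, e.g.
    # assigned_process("http://example.com/a") == assigned_process("http://example.com/b")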

22
Efficient Crawling Through URL Ordering
  • The paper defines several different kinds of importance metrics and builds three models to evaluate crawlers.
  • It then evaluates several combinations of importance and ordering metrics, using the Stanford Web pages (a sketch of one such ordering follows below).
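
As a hedged illustration of ordering the frontier by an importance metric, the sketch below uses a running backlink count; the data structures and the `visited` set it relies on are assumptions, not the paper's implementation.

    # Sketch: order the crawl frontier by an importance metric.
    # Here the metric is a running backlink count; heapq is a min-heap,
    # so scores are stored negated to pop the highest-scoring URL first.
    import heapq
    from collections import defaultdict

    backlinks = defaultdict(int)  # url -> number of discovered in-links
    frontier = []                 # heap of (-importance, url)

    def add_link(url):
        backlinks[url] += 1
        heapq.heappush(frontier, (-backlinks[url], url))

    def next_url(visited):
        while frontier:
            _, url = heapq.heappop(frontier)
            if url not in visited:
                return url        # most-referenced unvisited URL so far
        return None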

23
Efficient URL Caching for World Wide Web Crawling
  • URL caching is very effective.
  • Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, a cache of recently seen URLs avoids repeated work (a sketch follows below).
  • The paper recommends a cache size of between 100 and 500 entries per crawling thread.
  • The size of the cache needed to achieve top performance depends on the number of threads.
24
Multicasting a Web Repository
  • Proposing an alternative to multiple crawlers: a single central crawler builds a database of Web pages and provides a multicast service for clients that need a subset of this Web image.

25
Distributed High-performance Web Crawlers
  • Distributing the workload across multiple machines by dividing and/or duplicating the pieces of the crawl across the cluster.
  • The program runs simultaneously on two or more computers that communicate with each other over a network.

26
Parallel Crawling for Online Social Networks
  • A centralized queue, implemented as a database table, is conveniently used to coordinate the operation of all the crawlers and prevent redundant crawling (a sketch follows this list).
  • This offers two tiers of parallelism, allowing multiple crawlers to be run on each of multiple agents, where the crawlers are not affected by any potential failure of the other crawlers.
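
As a hedged sketch of the queue-as-a-database-table idea (the table schema and the SQLite choice are assumptions, not the paper's implementation), each crawler claims URLs with a conditional update so that no URL is crawled twice.

    # Sketch of a centralized crawl queue held in a database table.
    import sqlite3

    db = sqlite3.connect("crawl_queue.db", isolation_level=None)  # autocommit
    db.execute("""CREATE TABLE IF NOT EXISTS queue (
                      url TEXT PRIMARY KEY,
                      status TEXT DEFAULT 'pending')""")

    def enqueue(url):
        # INSERT OR IGNORE keeps a URL from being queued twice.
        db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))

    def claim_next(worker_id):
        # Loop until one pending URL is atomically claimed, or the queue is empty.
        while True:
            row = db.execute("SELECT url FROM queue WHERE status = 'pending' LIMIT 1").fetchone()
            if row is None:
                return None
            cur = db.execute(
                "UPDATE queue SET status = ? WHERE url = ? AND status = 'pending'",
                ("claimed:" + worker_id, row[0]))
            if cur.rowcount == 1:   # the conditional UPDATE succeeded: URL is ours
                return row[0]
            # otherwise another crawler claimed it first; try the next URL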

27
Finding replicated web collections
  • Improving web crawling by avoiding redundant crawling in the Google system, and proposing a new algorithm for efficiently identifying similar collections that form what the authors call a similar cluster.

28
Learnable Web Crawler
  • This section briefly explains a key characteristic of the web crawler: its learnable ability. Knowledge bases are built from the previous crawling. These knowledge bases are:
  • Seed URLs
  • Topic Keywords
  • URL Prediction

29
The Algorithm (No KB)

  Crawling_with_no_KB (topic)
    seed_urls <- Search (topic, t)
    keywords <- topic
    foreach url in seed_urls
      url_topic <- url.title + url.description
      url_score <- sim (keywords, url_topic)
      enqueue (url_queue, url, url_score)
    while |url_queue| > 0
      url <- dequeue_url_with_max_score (url_queue)
      page <- fetch_new_document (url)
      page_score <- sim (keywords, page)
      foreach link in extract_urls (page)
        link_score <- a * sim(keywords, link.anchortext) + (1 - a) * page_score
        enqueue (url_queue, link, link_score)

30
The Algorithm (With KB)

  Crawling_with_KB (KB, topic)
    seed_urls <- get_seed_url (KB, topic, t)
    keywords <- get_topic_keyword (KB, topic)
    foreach url in seed_urls
      url_score <- get_pred_score (KB, topic, url)
      enqueue (url_queue, url, url_score)
    while |url_queue| > 0
      url <- dequeue_url_with_max_score (url_queue)
      page <- fetch_new_document (url)
      page_score <- sim (keywords, page)
      foreach link in extract_urls (page)
        pred_link_score <- get_pred_score (KB, topic, link)
        link_score <- a * (ß * sim(keywords, link.anchortext) + (1 - ß) * pred_link_score)
                      + (1 - a) * page_score
        enqueue (url_queue, link, link_score)

31
Overall Process

  Learnable_Crawling (topic)
    if (no KB)
      Collection <- Crawling_with_no_KB (topic)
    else
      Collection <- Crawling_with_KB (KB, topic)
    /* To learn from the previous crawling */
    KB.seed_urls <- learn_seed_URL (Collection)
    KB.keywords <- learn_topic_keyword (Collection)
    KB.url_predict <- learn_URL_prediction (Collection)

32
Learning Analysis
34

THANK YOU FOR YOUR ATTENTION.