1
Web crawler
  • By PowerGroup

2
Group Members
  • 1. Thanida Limsirivallop  47541164
  • 2. Lucksamon Sivapattarakumpon  47541404
  • 3. Patrapee Suwannan  47542212
  • 4. Rataruch Tongpradith  47542246
  • Website: http://pirun.ku.ac.th/b4754116

3
Web Crawler Definition
  • A crawler is an automatic program (sometimes called a "robot") that explores the World Wide Web, following links and searching for information or building a database. Such programs are often used to build automated indexes for the Web, allowing users to do keyword searches for Web documents.
  • Web crawlers are programs that exploit the graph structure of the Web to move from page to page.

4
Crawling The Web
  • Beginning -> A key motivation for designing Web crawlers has been to retrieve Web pages and add them or their representations to a local repository.
  • Simplest form -> A crawler starts from a seed page and then uses the external links within it to visit other pages. The process repeats, with the new pages offering more external links to follow, until a sufficient number of pages are identified or some higher-level objective is reached (a minimal sketch of this loop follows this list).
  • General-purpose search engines
  • Serving as entry points to Web pages, they strive for coverage that is as broad as possible. They use Web crawlers to maintain their index databases, amortizing the cost of crawling and indexing over the millions of queries they receive.
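
The following is a minimal sketch of this seed-and-follow loop, not taken from the presentation: it assumes a single seed URL, a simple page limit as the stopping objective, and plain anchor-tag link extraction.

    # Minimal sketch of the seed-and-follow crawl loop described above.
    # The seed URL, page limit, and stopping rule are illustrative assumptions.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=50):
        frontier = deque([seed_url])   # URLs waiting to be fetched
        visited = set()                # URLs already fetched
        pages = {}                     # url -> HTML content (the local repository)
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except Exception:
                continue               # skip pages that fail to download
            pages[url] = html
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                frontier.append(urljoin(url, link))  # resolve relative links
        return pages

    # Example: pages = crawl("https://example.com/", max_pages=10)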

5
Crawling Infrastructure
Figure: basic sequential crawler
6
Web Crawler Requirements
  • The goal of the proposed crawler is to re-create the look and feel of a website as it existed on the crawl date.
  • The tool should be extensible to adapt to future changes in web standards.
  • General requirements:
  • Comprehensive downloading (Saving): the look and feel of the page is mirrored exactly, down to every image, link, and dynamic element.
  • Scope of the crawl: which links to follow and to what depth.
  • An intuitive and extensible interface (Interface): the crawler should use an intuitive graphical user interface.
  • Command-line interface, caching of pages, and rights.
  • Security of the server and browser (Niceness): the robots exclusion protocol must be obeyed. This requires downloading the robots.txt file before crawling the rest of the website.
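
A minimal sketch of the robots.txt check behind the Niceness requirement, using Python's standard urllib.robotparser; the user-agent string and URLs are illustrative assumptions.

    # Sketch of the robots-exclusion check: fetch robots.txt once per site,
    # then consult it before downloading any other page on that site.
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleCrawler"  # illustrative user-agent string

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # download and parse robots.txt before crawling the rest of the site

    def allowed(url):
        return rp.can_fetch(USER_AGENT, url)

    # Only fetch a page if the robots exclusion protocol permits it, e.g.
    # if allowed("https://example.com/some/page.html"): download it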

7
Web Crawler Requirements (cont.)
  • Presentation of dynamic elements (Dynamic page image): the crawler must download Shockwave Flash files and other content listed in EMBED tags. Care must be taken not to load the same file multiple times (a sketch of this check follows this list).
  • Accurate look and feel (Presentation): the most important aspect of archiving web sites; all links should be accurate. Almost always, re-crawling an archive is needed to rewrite all links to be internal.
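
As a hedged sketch of the "do not load the same file twice" point, the snippet below extracts EMBED (and OBJECT) sources and skips anything already downloaded; the helper names and the OBJECT handling are assumptions, not part of the original tool.

    # Sketch: extract src attributes from EMBED (and OBJECT) tags and
    # skip any file that has already been downloaded in this crawl.
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class EmbedExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.sources = []

        def handle_starttag(self, tag, attrs):
            if tag in ("embed", "object"):
                for name, value in attrs:
                    if name in ("src", "data") and value:
                        self.sources.append(value)

    downloaded = set()  # URLs of embedded files fetched so far

    def embedded_files_to_fetch(page_url, html):
        parser = EmbedExtractor()
        parser.feed(html)
        new_urls = []
        for src in parser.sources:
            absolute = urljoin(page_url, src)
            if absolute not in downloaded:   # avoid loading the same file twice
                downloaded.add(absolute)
                new_urls.append(absolute)
        return new_urls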

8
Dominos: A New Web Crawler's Design
  • This paper describes the design and implementation of a real-time distributed Web crawling system running on a cluster of machines, and introduces a high-availability crawling system called Dominos.
  • Dominos is a dynamic system, which accounts for its highly flexible deployment, maintainability and enhanced fault tolerance. Finally, the paper discusses the experimental results obtained, comparing them with other documented systems.

9
An Investigation of Web Crawler Behavior: Characterization and Metrics
  • This paper presents a characterization study of search-engine crawlers, using Web-server access logs from five academic sites in three different countries.
  • The results and observations provide useful insights into crawler behavior and serve as the basis of ongoing work on the automatic detection of Web crawlers.

10
Crawler-Friendly Web Servers
  • This paper studies how to make web servers more crawler-friendly, evaluating simple, easy-to-incorporate modifications to web servers that yield significant bandwidth savings.
  • This paper proposes that web servers export meta-data describing their pages so that crawlers can efficiently create and maintain large, fresh repositories.

11
The Evolution of the Web and Implications for an Incremental Crawler
  • This paper studies how to build an effective incremental crawler. The crawler selectively and incrementally updates its index and/or local collection of web pages, instead of periodically refreshing the collection in batch mode.
  • Based on the results, it discusses various design choices for a crawler and the possible trade-offs, and then proposes an architecture for an incremental crawler which combines the best strategies identified.

12
SharpSpider: Spidering the Web through Web Services
  • This paper presents SharpSpider, a distributed C# spider designed to address the issues of scalability, decentralisation and continuity of a Web crawl.
  • Fundamental to the design of SharpSpider is the publication of an API for use by other services on the network. Such an API grants access to a constantly refreshed index built after successive crawls of the Web.

13
The Anatomy of a Large-Scale Hypertextual Web Search Engine
  • This paper presents Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext, and provides an in-depth description of a large-scale web search engine.
  • Google is designed to be a scalable search engine. The primary goal is to provide high-quality search results over a rapidly growing World Wide Web. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

14
Incremental Web Search: Tracking Changes in the Web
  • This paper presents algorithms and techniques for detecting web page changes, extracting web pages, and evaluating web change.
  • It also presents an application built with these techniques and algorithms, named Web Daily News Assistant (WebDNA), currently deployed on the NYU web site.
  • The change of web documents is modeled using survival analysis. Modeling web changes is useful for web crawler scheduling and web caching.

15
An Investigation of Documents from the World Wide Web
  • This paper reports on an examination of pages from the World Wide Web, analyzing data collected by the Inktomi Web Crawler. The analyses of HTML cover evolution, improving Web content, control of HTML, sociological insights, user studies, content analyses, and structure analysis. Several tools were used to perform the data collection.

16
A Crawler-based Study of Spyware on the Web
  • Crawling the Web, downloading content from a large number of sites, and then analyzing it to determine whether it is malicious. In this way, several important questions can be answered, for example:
  • How much spyware is on the Internet?
  • Where is that spyware located (e.g., game sites, children's sites, adult sites, etc.)?
  • How likely is a user to encounter spyware through random browsing?
  • What kinds of threats does that spyware pose?
  • What fraction of executables on the Internet are infected with spyware?

17
Estimating Frequency of Change
  • Estimating the change frequency of data to improve Web crawlers and Web caches and to help data mining, by developing several frequency estimators and identifying various scenarios (a simple illustrative estimator follows below).
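
The paper develops several estimators; purely as an illustration, one simple estimator assumes a Poisson change model and regular visits, and is sketched below (the function name and parameters are assumptions, not the paper's notation).

    # Illustrative change-frequency estimator under a Poisson change model
    # (a simplified stand-in for the estimators developed in the paper).
    # If changes arrive with rate lam, the probability that a page changed
    # during an access interval I is 1 - exp(-lam * I).
    import math

    def estimate_change_rate(detected_changes, visits, interval_days):
        """Estimate changes per day from `detected_changes` out of `visits`
        accesses made every `interval_days` days."""
        p = detected_changes / visits          # observed fraction of visits with a change
        if p >= 1.0:
            return float("inf")                # changed on every visit: rate not identifiable
        return -math.log(1.0 - p) / interval_days

    # Example: a page found changed on 6 of 10 weekly visits
    # rate = estimate_change_rate(6, 10, 7)    # about 0.13 changes per day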

18
Collaborative Web Crawler over High-speed Research Network
  • A distributed web crawler that utilizes existing research networks.
  • Distributed web crawling is a distributed
    computing technique whereby Internet search
    engines employ many computers to index the
    Internet via web crawling. The idea is to spread
    out the required resources of computation and
    bandwidth to many computers and networks.

19
Crawling-based Classification
  • The categorization of a database is determined
    by its distribution of documents across
    categories.

20
Mercator: A Scalable, Extensible Web Crawler
  • The design features a crawler core for handling the main crawling tasks, with extensibility provided through protocol and processing modules (an illustrative sketch of this idea follows this list).
  • Users may supply new modules for performing customized crawling tasks.
  • We have used Mercator for a variety of purposes, including performing random walks on the web, crawling our corporate intranet, and collecting statistics about the web at large.
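
Mercator itself is written in Java; the sketch below is only a hypothetical illustration of the pluggable protocol/processing-module idea, and none of the names correspond to Mercator's actual API.

    # Hypothetical sketch of a pluggable crawler core: protocol modules are
    # selected by URL scheme, and every processing module sees each page.
    from urllib.parse import urlparse
    from urllib.request import urlopen

    protocol_modules = {}    # scheme -> fetch function
    processing_modules = []  # functions called on every downloaded page

    def register_protocol(scheme, fetcher):
        protocol_modules[scheme] = fetcher

    def register_processor(processor):
        processing_modules.append(processor)

    # Default HTTP(S) protocol modules
    register_protocol("http", lambda url: urlopen(url, timeout=10).read())
    register_protocol("https", lambda url: urlopen(url, timeout=10).read())

    # A user-supplied processing module, e.g. for collecting statistics
    register_processor(lambda url, content: print(url, len(content), "bytes"))

    def crawl_one(url):
        scheme = urlparse(url).scheme
        fetcher = protocol_modules.get(scheme)
        if fetcher is None:
            return                      # no module handles this scheme
        content = fetcher(url)
        for process in processing_modules:
            process(url, content)       # link extraction, statistics, indexing, ...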

21
Parallel Crawlers
  • A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization, and to avoid repeated downloads of the same page.
  • To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes (one common policy is sketched below).
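
One common assignment policy, sketched below as an assumption rather than the paper's specific scheme, is to hash the host name so that every URL of a given site belongs to exactly one crawling process.

    # Sketch of a URL assignment policy for a parallel crawler: hash the host
    # name so that every URL of a given site is owned by exactly one process.
    import hashlib
    from urllib.parse import urlparse

    NUM_PROCESSES = 4  # illustrative number of crawling processes

    def assigned_process(url):
        host = urlparse(url).netloc.lower()
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_PROCESSES

    # A process only downloads URLs assigned to it and forwards the rest,
    # so a URL discovered by two processes is still fetched only once, e.g.
    # assigned_process("http://example.com/a") == assigned_process("http://example.com/b")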

22
Efficient Crawling Through URL Ordering
  • The paper defines several different kinds of importance metrics and builds three models to evaluate crawlers.
  • It then evaluates several combinations of importance and ordering metrics, using the Stanford Web pages (a sketch of one such ordering follows below).
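
As a hedged illustration of ordering the frontier by an importance metric, the sketch below uses a running backlink count; the data structures and the `visited` set it relies on are assumptions, not the paper's implementation.

    # Sketch: order the crawl frontier by an importance metric.
    # Here the metric is a running backlink count; heapq is a min-heap,
    # so scores are stored negated to pop the highest-scoring URL first.
    import heapq
    from collections import defaultdict

    backlinks = defaultdict(int)  # url -> number of discovered in-links
    frontier = []                 # heap of (-importance, url)

    def add_link(url):
        backlinks[url] += 1
        heapq.heappush(frontier, (-backlinks[url], url))

    def next_url(visited):
        while frontier:
            _, url = heapq.heappop(frontier)
            if url not in visited:
                return url        # most-referenced unvisited URL so far
        return None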

23
Efficient URL Caching for World Wide Web Crawling
  • URL caching is very effective.
  • Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, a cache of recently seen URLs avoids repeated work (a sketch follows below).
  • The paper recommends a cache size of between 100 and 500 entries per crawling thread.
  • The size of the cache needed to achieve top performance depends on the number of threads.
24
Multicasting a Web Repository
  • Proposing an alternative to multiple crawlers: a single central crawler builds a database of Web pages and provides a multicast service for clients that need a subset of this Web image.

25
Distributed High-performance Web Crawlers
  • Distributing the workload across multiple machines by dividing and/or duplicating the pieces of the crawl across the cluster.
  • The program runs simultaneously on two or more computers that communicate with each other over a network.

26
Parallel Crawling for Online Social Networks
  • A centralized queue, implemented as a database table, is conveniently used to coordinate the operation of all the crawlers and prevent redundant crawling (a sketch follows this list).
  • This offers two tiers of parallelism, allowing multiple crawlers to be run on each of multiple agents, where the crawlers are not affected by any potential failure of the other crawlers.
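
As a hedged sketch of the queue-as-a-database-table idea (the table schema and the SQLite choice are assumptions, not the paper's implementation), each crawler claims URLs with a conditional update so that no URL is crawled twice.

    # Sketch of a centralized crawl queue held in a database table.
    import sqlite3

    db = sqlite3.connect("crawl_queue.db", isolation_level=None)  # autocommit
    db.execute("""CREATE TABLE IF NOT EXISTS queue (
                      url TEXT PRIMARY KEY,
                      status TEXT DEFAULT 'pending')""")

    def enqueue(url):
        # INSERT OR IGNORE keeps a URL from being queued twice.
        db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))

    def claim_next(worker_id):
        # Loop until one pending URL is atomically claimed, or the queue is empty.
        while True:
            row = db.execute("SELECT url FROM queue WHERE status = 'pending' LIMIT 1").fetchone()
            if row is None:
                return None
            cur = db.execute(
                "UPDATE queue SET status = ? WHERE url = ? AND status = 'pending'",
                ("claimed:" + worker_id, row[0]))
            if cur.rowcount == 1:   # the conditional UPDATE succeeded: URL is ours
                return row[0]
            # otherwise another crawler claimed it first; try the next URL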

27
Finding replicated web collections
  • Improving web crawling by avoiding redundant crawling in the Google system, and proposing a new algorithm for efficiently identifying similar collections that form what the authors call a similar cluster.

28
Learnable Web Crawler
  • This section briefly explains a key characteristic of the web crawler: its learnable ability. Knowledge bases are built from the previous crawling. These knowledge bases are:
  • Seed URLs
  • Topic Keywords
  • URL Prediction

29
The Algorithm (No KB)

  Crawling_with_no_KB (topic)
    seed_urls <- Search (topic, t)
    keywords <- topic
    foreach url in seed_urls
      url_topic <- url.title + url.description
      url_score <- sim (keywords, url_topic)
      enqueue (url_queue, url, url_score)
    while |url_queue| > 0
      url <- dequeue_url_with_max_score (url_queue)
      page <- fetch_new_document (url)
      page_score <- sim (keywords, page)
      foreach link in extract_urls (page)
        link_score <- a * sim(keywords, link.anchortext) + (1 - a) * page_score
        enqueue (url_queue, link, link_score)

30
The Algorithm (With KB)

  Crawling_with_KB (KB, topic)
    seed_urls <- get_seed_url (KB, topic, t)
    keywords <- get_topic_keyword (KB, topic)
    foreach url in seed_urls
      url_score <- get_pred_score (KB, topic, url)
      enqueue (url_queue, url, url_score)
    while |url_queue| > 0
      url <- dequeue_url_with_max_score (url_queue)
      page <- fetch_new_document (url)
      page_score <- sim (keywords, page)
      foreach link in extract_urls (page)
        pred_link_score <- get_pred_score (KB, topic, link)
        link_score <- a * (ß * sim(keywords, link.anchortext) + (1 - ß) * pred_link_score)
                      + (1 - a) * page_score
        enqueue (url_queue, link, link_score)

31
Overall Process

  Learnable_Crawling (topic)
    if (no KB)
      Collection <- Crawling_with_no_KB (topic)
    else
      Collection <- Crawling_with_KB (KB, topic)
    /* To learn from the previous crawling */
    KB.seed_urls <- learn_seed_URL (Collection)
    KB.keywords <- learn_topic_keyword (Collection)
    KB.url_predict <- learn_URL_prediction (Collection)

32
Learning Analysis
34

THANK YOU FOR YOUR ATTENTION.