Web Crawler - PowerPoint PPT Presentation

About This Presentation
Title:

Web Crawler

Slides: 17
Provided by: yxie

Transcript and Presenter's Notes

Title: Web Crawler


1
Web Crawler
  • Dr. Ying Xie

2
Information Retrieval Process
Documents
Information Need
Formulation
Indexing
Query Rep
Inverted index
Ranking
Ranked List
Learning
User Relevance Feedback
3
Web Search Process
Crawler
Information Need
Formulation
Indexing
Query Rep
Inverted index and web graph
Ranking
Ranked List
Learning
User Relevance Feedback
4
Web Crawler: Definition
  • A Web crawler (also called a spider or robot) is a
    program that fetches information from the World
    Wide Web in an automated manner

5
Web Crawler Usages
  • It is mainly used to create a copy of all the
    visited pages for later processing (indexing and
    retrieval) by a search engine
  • It is also used to gather specific types of
    information from the WWW
  • -- harvesting email addresses (for spam
    purposes)
  • -- event extraction
  • -- detection of infectious disease
    outbreaks
  • -- coursework watchdog
  • Archive the WWW (www.archive.org)
  • In summary, a Web crawler is for finding, checking,
    and gathering material from the WWW.

6
(No Transcript)
7
Basic Crawling Algorithm
  • G ← a group of seed URLs
  • Repeat until G is empty
  •   choose a URL U from G and remove it   // crawling strategy
  •   download the webpage w from U
  •   for each link l embedded in w
  •     if l has not been crawled before and is not already in G,
  •       add l to G
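
The loop above can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation: the `fetch_links` callback, the `choose` parameter, and the in-memory `TOY_WEB` dictionary are assumptions introduced here so the example runs without network access.

```python
def crawl(seeds, fetch_links, choose=lambda frontier: 0, max_pages=100):
    """Basic crawling loop.

    fetch_links(url) is a hypothetical callback that returns the links
    embedded in the page at url. choose(frontier) returns the index of
    the next URL to crawl; this is where the crawling strategy plugs in.
    """
    frontier = list(seeds)   # G: URLs waiting to be crawled
    seen = set(seeds)        # every URL that has ever been added to G
    visited = []             # URLs downloaded so far, in crawl order

    while frontier and len(visited) < max_pages:
        url = frontier.pop(choose(frontier))
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:      # neither crawled nor already queued
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical in-memory "web": page name -> links found on that page.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

oldest_first = crawl(["a"], lambda u: TOY_WEB.get(u, []))
newest_first = crawl(["a"], lambda u: TOY_WEB.get(u, []),
                     choose=lambda frontier: len(frontier) - 1)
print(oldest_first)
print(newest_first)
```

Only the `choose` function changes between the two calls; the rest of the loop is identical, which is why the crawling strategy can be treated as a separate design decision.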

8
Crawling Strategy
  • Random
  • Breadth First Search
  • Depth First Search
  • Priority Search

9
Breadth First Search
  • Group G is implemented as a queue
  • A queue is a linear data structure in which an
    element can enter only at the rear and leave
    only at the front: first in, first out (FIFO)
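
As a small sketch of the queue-based frontier, the following uses `collections.deque` from the Python standard library. The `SITE` link graph and the function name `bfs_order` are made up for illustration.

```python
from collections import deque


def bfs_order(seeds, links):
    """Return the crawl order when the frontier G is a FIFO queue."""
    frontier = deque(seeds)
    seen = set(seeds)
    order = []
    while frontier:
        url = frontier.popleft()          # leave at the front
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)      # enter at the rear
    return order


# Hypothetical site structure: page -> pages it links to.
SITE = {
    "home": ["about", "blog"],
    "about": ["team"],
    "blog": ["post1", "post2"],
}
print(bfs_order(["home"], SITE))
# ['home', 'about', 'blog', 'team', 'post1', 'post2']
```

Pages one link away from the seed are all crawled before any page two links away.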

10
Depth First Search
  • Group G is implemented as a stack
  • A stack is a linear data structure in which an
    element can enter and leave only at the top:
    last in, first out (LIFO)
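
A stack-based frontier can be sketched with a plain Python list, whose `append` and `pop` both operate on the end of the list. The `SITE` link graph and the function name `dfs_order` are hypothetical.

```python
def dfs_order(seeds, links):
    """Return the crawl order when the frontier G is a LIFO stack."""
    frontier = list(seeds)
    seen = set(seeds)
    order = []
    while frontier:
        url = frontier.pop()              # leave from the top
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)      # enter at the top
    return order


# Hypothetical site structure: page -> pages it links to.
SITE = {
    "home": ["about", "blog"],
    "about": ["team"],
    "blog": ["post1", "post2"],
}
print(dfs_order(["home"], SITE))
# ['home', 'blog', 'post2', 'post1', 'about', 'team']
```

The crawler follows the most recently discovered link first, so it finishes everything reachable through "blog" before returning to "about".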

11
Priority Search
  • Group G is implemented as a priority queue.
  • A priority queue is a data structure in which
    the element with the highest priority leaves
    first.
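
One possible way to build such a frontier is on top of the standard `heapq` module. The class name `PriorityFrontier` and the numeric priorities are assumptions for this sketch. Because `heapq` is a min-heap, priorities are negated so that the largest value comes out first, and a counter breaks ties in insertion order.

```python
import heapq
from itertools import count


class PriorityFrontier:
    """Frontier that always hands out the highest-priority URL."""

    def __init__(self):
        self._heap = []
        self._tie = count()   # insertion counter used to break ties

    def push(self, url, priority):
        # Negate the priority: heapq pops the smallest tuple first.
        heapq.heappush(self._heap, (-priority, next(self._tie), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = PriorityFrontier()
frontier.push("http://www.foobar.com/low", 0.2)
frontier.push("http://www.foobar.com/high", 0.9)
frontier.push("http://www.foobar.com/mid", 0.5)

popped = [frontier.pop() for _ in range(len(frontier))]
print(popped)
```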

12
Focused Crawling
  • Priority Search is used to implement focused
    crawling.
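
The sketch below shows one way priority search could drive a focused crawl. The slides do not say how priorities are assigned; here it is assumed that each newly found link inherits the topical relevance of the page that linked to it, and that relevance is simply the fraction of words on the page that belong to a topic vocabulary. The `PAGES` data, `relevance`, and `focused_crawl` are all hypothetical.

```python
import heapq


def relevance(text, topic_terms):
    """Fraction of words in text that are topic terms (assumed scoring rule)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def focused_crawl(seed, pages, topic_terms, max_pages=10):
    """Crawl pages (url -> (text, links)), preferring links from relevant pages."""
    heap = [(-1.0, seed)]     # (negated priority, url); the seed goes first
    seen = {seed}
    visited = []
    while heap and len(visited) < max_pages:
        _, url = heapq.heappop(heap)
        text, links = pages[url]
        visited.append(url)
        score = relevance(text, topic_terms)
        for link in links:
            if link not in seen and link in pages:
                seen.add(link)
                # Ties in score fall back to comparing the URL strings.
                heapq.heappush(heap, (-score, link))
    return visited


# Hypothetical mini-web about two topics.
PAGES = {
    "hub": ("links", ["sports", "health"]),
    "sports": ("football scores today", ["match"]),
    "health": ("flu outbreak flu vaccine", ["clinic"]),
    "match": ("football match report", []),
    "clinic": ("flu clinic hours", []),
}
visited = focused_crawl("hub", PAGES, {"flu", "outbreak", "vaccine"})
print(visited)
```

In this toy run the link found on the on-topic "health" page is crawled before the off-topic "sports" page, even though "sports" was discovered earlier.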

13
Robots Exclusion
  • The Robots Exclusion Protocol: A Web site
    administrator can indicate which parts of the
    site should not be visited by a robot, by
    providing a specially formatted file on their
    site, at http://.../robots.txt.
  • The Robots META tag: A Web author can indicate
    whether a page may be indexed, or analyzed for
    links, through the use of a special HTML META
    tag.
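
As an illustration of the META tag mechanism, the sketch below reads a page's robots directives with the standard `html.parser` module. It assumes the commonly used `noindex`, `nofollow`, and `none` directive values; the class and function names are invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.may_index = True    # default when no directive is present
        self.may_follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() != "robots":
            return
        content = attrs.get("content") or ""
        directives = {d.strip().lower() for d in content.split(",")}
        if "noindex" in directives or "none" in directives:
            self.may_index = False
        if "nofollow" in directives or "none" in directives:
            self.may_follow = False


def robots_meta(html):
    """Return (may_index, may_follow) for an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.may_index, parser.may_follow


page = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head></html>'
print(robots_meta(page))
```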

14
Robot Exclusion Protocol
  • In a nutshell, when a robot visits a Web site,
    say http://www.foobar.com/, it first checks for
    http://www.foobar.com/robots.txt.
  • If it can find this document, it will analyze its
    contents for records like
  • User-agent: *
  • Disallow: /
  • to see if it is allowed to retrieve the
    document.
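
Python's standard library ships a parser for this file format, `urllib.robotparser`. The sketch below feeds it a robots.txt body directly instead of downloading one, so it runs offline. The file contents, the `ExampleBot` user-agent name, and the `/private/` path are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for www.foobar.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "ExampleBot", "http://www.foobar.com/index.html"))
print(allowed(ROBOTS_TXT, "ExampleBot", "http://www.foobar.com/private/a.html"))
```

A live crawler would typically call `set_url(...)` and `read()` on the parser to fetch the real file once per site, then reuse the parsed rules for every URL on that site.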

15
robots.txt example
16
Robot Traps
  • Because there is no editorial control over the
    Internet, Web crawlers should protect themselves
    from ill-formed HTML and misleading sites.
  • - ill-formed HTML: e.g., a page with 68 kB of null
    characters
  • - misleading sites: CGI scripts can be used
    to generate an infinite number of pages dynamically.
  • Solutions
  • - Eliminate URLs with non-textual data types
  • - Check URL length
  • - Maintain statistics for each website. If
    the number of pages from a website becomes
    exceedingly large, remove the URLs coming from
    that website.
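
The three solutions can be combined into a single filter that is consulted before a URL is added to the frontier. This is only a sketch: the thresholds, the extension list, and the `TrapGuard` class are assumptions chosen for illustration, not values from the slides.

```python
import posixpath
from collections import Counter
from urllib.parse import urlsplit

# Assumed list of extensions treated as non-textual content.
NON_TEXT_EXTENSIONS = {".jpg", ".png", ".gif", ".zip", ".exe", ".mp3"}


class TrapGuard:
    """Rejects URLs that look like crawler traps."""

    def __init__(self, max_url_length=256, max_pages_per_host=1000):
        self.max_url_length = max_url_length
        self.max_pages_per_host = max_pages_per_host
        self.pages_per_host = Counter()   # per-site statistics

    def should_crawl(self, url):
        if len(url) > self.max_url_length:           # URL length check
            return False
        parts = urlsplit(url)
        extension = posixpath.splitext(parts.path)[1].lower()
        if extension in NON_TEXT_EXTENSIONS:         # non-textual data type
            return False
        if self.pages_per_host[parts.netloc] >= self.max_pages_per_host:
            return False                             # site has grown too large
        return True

    def record(self, url):
        """Call after a page has been downloaded."""
        self.pages_per_host[urlsplit(url).netloc] += 1


guard = TrapGuard(max_url_length=60, max_pages_per_host=2)
print(guard.should_crawl("http://www.foobar.com/a.html"))
print(guard.should_crawl("http://www.foobar.com/logo.png"))
print(guard.should_crawl("http://www.foobar.com/cgi-bin/cal?" + "x" * 100))
```

The per-host limit is a blunt instrument: it also cuts off large legitimate sites, so in practice the threshold would need tuning.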