Implementing a Web Crawler and Indexer

Slides: 10
Provided by: YOR46

1
Implementing a Web Crawler and Indexer
  • Group 6
  • Jena Gray, Brent Karstoff,
  • Sandy Leung, Ellie Rosen
  • Hugo Wong

2
Overview
  • Created a Web Crawler that logs URLs and words in
    an Index for use by a Search Engine
  • Web Crawler uses a Breadth First Search
  • Scans the designated number of pages, beginning
    with the seed page, as entered by the user
  • Results from the Web Crawler are stored by a
    2-level Indexer (Lexicon and Inverted List)
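
The breadth-first traversal with a page limit described above can be sketched as follows. This is a minimal illustration, not the group's code: the class name BfsCrawlSketch, the method crawlOrder, and the in-memory links map (a stand-in for fetching a page and extracting its URLs) are all assumptions. It also uses generics and List.of, so it needs a newer Java than the 1.4 release mentioned in the deck.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class BfsCrawlSketch {
    // Returns the pages visited in breadth-first order, starting at the seed
    // and stopping once maxPages pages have been visited.
    // 'links' is a hypothetical stand-in for "fetch the page and extract its URLs".
    static List<String> crawlOrder(String seed, int maxPages,
                                   Map<String, List<String>> links) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();      // URLs already queued or visited
        Queue<String> queue = new ArrayDeque<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.remove();          // take from the front of the queue
            visited.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) {             // add() is false for a URL seen before
                    queue.add(next);              // new URLs go to the back of the queue
                }
            }
        }
        return visited;
    }
}
```

Because new URLs always join the back of the queue, every page one link away from the seed is visited before any page two links away.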

3
Design

4
Web Crawler
  • Accesses the seed page
  • Scans the page for additional URLs
  • Adds new URLs to the bottom of the Queue
  • Gets the remaining page contents
  • Removes HTML Tags
  • Writes the URL ID and words from the page to a
    content file
  • Calls the Indexer
  • Retrieves the next page from the top of the queue
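
The per-page steps above (scan for URLs, remove HTML tags, write the URL ID and words) might look roughly like the sketch below. It is only an illustration under stated assumptions: the class PageTextSketch, its method names, the regular expressions, and the "URL ID followed by words" line layout are hypothetical, and a regex is a simplification that a real HTML parser would handle more robustly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageTextSketch {
    // Assumed simplification: only absolute, double-quoted href values are picked up.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Scans the page for additional URLs.
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    // Removes HTML tags, then splits the remaining text into lowercase words.
    static List<String> extractWords(String html) {
        String text = TAG.matcher(html).replaceAll(" ");
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Builds one content-file line: the URL ID followed by the page's words.
    static String contentLine(int urlId, List<String> words) {
        return urlId + " " + String.join(" ", words);
    }
}
```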

5
Indexer
  • Opens the Content File created by the Web Crawler
  • Creates a Lexicon Object to store the words and
    their frequency of occurrence
  • Creates an Inverted List Node for each new
    instance of the word
  • Pointers are assigned connecting the word from
    the Lexicon to the instances in the Inverted List
  • Binary Searches are used to establish correct
    placement of the words
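
One way to picture the Lexicon, the Inverted List nodes, and the binary search placement is the sketch below. The class and field names (IndexSketch, LexiconEntry, PostingNode) are hypothetical, and the choice of a sorted list plus a singly linked chain of nodes is an assumption about the structure, not a reconstruction of the group's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class IndexSketch {
    // One node per occurrence of a word; nodes for the same word are chained.
    static class PostingNode {
        final int urlId;
        PostingNode next;
        PostingNode(int urlId) { this.urlId = urlId; }
    }

    // A lexicon entry: the word, its frequency, and pointers into its node chain.
    static class LexiconEntry implements Comparable<LexiconEntry> {
        final String word;
        int frequency;
        PostingNode head;
        PostingNode tail;
        LexiconEntry(String word) { this.word = word; }
        @Override
        public int compareTo(LexiconEntry other) { return word.compareTo(other.word); }
    }

    // Kept sorted by word so that binary search can be used.
    private final List<LexiconEntry> lexicon = new ArrayList<>();

    // Records one occurrence of 'word' on the page identified by 'urlId'.
    void add(String word, int urlId) {
        LexiconEntry probe = new LexiconEntry(word);
        int pos = Collections.binarySearch(lexicon, probe);
        LexiconEntry entry;
        if (pos >= 0) {
            entry = lexicon.get(pos);          // word already in the lexicon
        } else {
            entry = probe;
            lexicon.add(-pos - 1, entry);      // insert at the sorted position
        }
        entry.frequency++;
        PostingNode node = new PostingNode(urlId);
        if (entry.head == null) {
            entry.head = node;
        } else {
            entry.tail.next = node;
        }
        entry.tail = node;
    }

    // Looks up a word by binary search; returns null if it is absent.
    LexiconEntry find(String word) {
        int pos = Collections.binarySearch(lexicon, new LexiconEntry(word));
        return pos >= 0 ? lexicon.get(pos) : null;
    }

    // Returns the lexicon's words in their stored (sorted) order.
    List<String> words() {
        List<String> out = new ArrayList<>();
        for (LexiconEntry e : lexicon) {
            out.add(e.word);
        }
        return out;
    }
}
```

A miss in Collections.binarySearch returns -(insertion point) - 1, which is why the sketch inserts a new word at index -pos - 1 to keep the lexicon sorted.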

6
Storage
  • After the designated number of pages has been
    searched, the Web Crawler writes the visited
    pages to a file
  • The Web Crawler calls createStorageFiles from the
    Indexer
  • The Indexer writes the contents of the Lexicon
    and Inverted List to individual files
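
Writing the Lexicon and the Inverted List to separate files could be sketched as below. Everything here is an assumption made for illustration: the deck does not show createStorageFiles or its file formats, so the method writeIndexFiles, the SortedMap input, and the line layouts ("word frequency line-number" in one file, space-separated URL IDs in the other) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

class StorageSketch {
    // Writes one lexicon file and one inverted-list file.
    // 'index' maps each word to the URL IDs of its occurrences (hypothetical input shape).
    static void writeIndexFiles(SortedMap<String, List<Integer>> index,
                                Path lexiconFile, Path invertedFile) throws IOException {
        List<String> lexiconLines = new ArrayList<>();
        List<String> invertedLines = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : index.entrySet()) {
            List<Integer> urlIds = e.getValue();
            // Lexicon line: word, frequency, and the 0-based line of its
            // entry in the inverted-list file (the on-disk "pointer").
            lexiconLines.add(e.getKey() + " " + urlIds.size() + " " + invertedLines.size());
            StringBuilder ids = new StringBuilder();
            for (int id : urlIds) {
                if (ids.length() > 0) {
                    ids.append(' ');
                }
                ids.append(id);
            }
            invertedLines.add(ids.toString());
        }
        Files.write(lexiconFile, lexiconLines, StandardCharsets.UTF_8);
        Files.write(invertedFile, invertedLines, StandardCharsets.UTF_8);
    }
}
```

Storing a line number in the lexicon file plays the same role on disk as the in-memory pointer from a Lexicon word to its Inverted List entries.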

7
Testing and Analysis
  • Our Web Crawler has successfully crawled upwards
    of 350 pages
  • The Indexer is functional and successfully
    outputs results to the designated files
  • We recommend running the program on a computer
    with Java 1.4 or later installed

8
Program Demonstration
  • http://brentk.dyndns.org:5800/

9
Thank You!