The Anatomy of a large-scale hypertextual Web search engine, by Sergey Brin and Lawrence Page

1
The Anatomy of a large-scale hypertextual Web search engine
by Sergey Brin, Lawrence Page
appearing in Computer Networks and ISDN Systems, 1998
  • Presented by
  • Damon Sutherland
  • University of Alabama
  • in partial fulfillment of the requirements for
  • Internet Algorithms course, Fall 2005

2
Introduction to web searches
  • First automated web bots searched linearly and
    indexed URLs and titles only
  • Hard to search for specific items
  • In late 1995 AltaVista launched the first search
    engine with natural-language queries
  • By late 1996 Lycos had indexed 60 million pages
  • Yahoo was initially released in 1994 as a list of
    its creators' favorite sites

3
Overview of query processing
  • User types "computer"
  • Google:
  • finds related pages (based on anchor text and
    words in the page)
  • retrieves snippets from the top related pages
  • returns the results to the user in ranked order

4
Motivation
  • Increase the relevance of query results
  • People generally view only the first few tens of
    results

6
Motivation
  • Increase the relevance of query results
  • As of late 1997, only one of the four major
    search engines returned a link to itself in the
    top 10 results when queried for its own name.

7
Motivation
  • Scalable
  • by number of web pages indexed

8
Motivation
  • Scalable
  • web queries per day

9
How to find related pages
  • By the text on a page
  • Google parses the source code and breaks the text
    into a series of word occurrences
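
The parsing step above can be sketched as follows. This is a minimal illustration only, not the paper's parser: the names `TextExtractor` and `word_occurrences` are hypothetical, and it records just each word and its position, assuming a simple lowercase alphanumeric tokenizer.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping script and style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def word_occurrences(html):
    """Return a list of (word, position) pairs for the text of one page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Assumption: words are runs of lowercase letters and digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [(word, position) for position, word in enumerate(words)]
```

Calling `word_occurrences("<h1>Dogs</h1><p>I love dogs!</p>")` yields one entry per word, so "dogs" appears twice with different positions. A real index would also keep details such as font size or capitalization, which this sketch omits.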

10
How to find related pages
  • Anchor Text is the description of the link by the
    page author.
  • <A HREF="page2.htm">I love dogs!</A>
  • Google believes the Anchor Text is as important
    as the page text.

11
Anchor text
  • Anchor Text increases relevance
  • Unlike other search engines, Google associates
    the Anchor Text with the page the link points to.
  • This allows Google to return pages that cannot be
    crawled, e.g., pictures, programs, etc.
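
The idea of crediting Anchor Text to the link target can be sketched as follows. This is an illustrative assumption, not Google's implementation: `AnchorCollector` and `index_anchor_text` are hypothetical names, and the input is assumed to be a small dictionary mapping each crawled URL to its HTML.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects (target_url, anchor_text) pairs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page that contains them.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_anchor_text(pages):
    """Map each link target to the anchor texts that point at it."""
    index = defaultdict(list)
    for url, html in pages.items():
        collector = AnchorCollector(url)
        collector.feed(html)
        collector.close()
        for target, text in collector.anchors:
            index[target].append(text)
    return dict(index)
```

Because the text is stored under the target URL, a picture that was never crawled can still be found through the words other authors used to describe it.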

12
Anchor text, contd.
  • Google ranking can be manipulated
  • A large number of pages using the same Anchor
    Text can influence how a page ranks for that
    phrase.
  • This is called a Google bomb.

Source: http://www.litigiousbastards.com/

13
How to compute importance of pages
  • Google creates a web citation map
  • details the relationship of a significant
    sample of hyperlinks on the web
  • a link to a node is a vote for that node

14
Web citation graph
  • Compute PageRank of each graph node
  • your rank is high when several high-rank nodes
    link to you, or
  • many nodes link to you
  • Details in a subsequent talk
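
As a rough preview, the PageRank formula given in the paper, PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), can be iterated on a small graph. The sketch below assumes the paper's damping factor d = 0.85 and a fixed number of iterations. The function name `pagerank` and the dictionary-of-lists graph format are illustrative choices, and pages without outgoing links simply pass on no rank here.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to (no duplicates).
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    rank = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        new_rank = {node: 1.0 - d for node in nodes}
        for source, targets in links.items():
            if not targets:
                continue  # assumption: a page with no out-links passes on nothing
            share = d * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

For example, with `links = {"a": ["c"], "b": ["c"], "c": ["a"]}`, page "c" ends up ranked highest because two pages vote for it, "a" comes second because its single vote comes from the high-ranked "c", and "b" stays at the minimum because nothing links to it.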

15
Model / System description
Brin, Page. (1998) Fig. 1

16
Model / System comparison
Heydon, Najork. (1999) Fig. 1

17
  • In 1998
  • Indexed 26 million pages in 9 days
  • The last 11 million in less than 3 days
  • The HTTPWorker equivalent averages 48.5 pages per
    second.
  • In 2005
  • Indexed 8.1 billion web pages, 1 billion images,
    and 1 billion Usenet posts.
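
The 1998 throughput figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers on this slide, and the helper name `pages_per_second` is illustrative. Since the last 11 million pages took "less than 3 days", the 3-day rate is a lower bound.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_second(pages, days):
    """Average crawl rate for a given page count and duration in days."""
    return pages / (days * SECONDS_PER_DAY)


# Whole crawl: 26 million pages in 9 days.
overall_rate = pages_per_second(26_000_000, 9)

# Final stretch: 11 million pages in at most 3 days (a lower bound on the rate).
final_rate_lower_bound = pages_per_second(11_000_000, 3)

# How long 11 million pages take at the quoted 48.5 pages per second.
hours_at_quoted_rate = 11_000_000 / 48.5 / 3600
```

The overall rate comes out at about 33 pages per second and the 3-day lower bound at about 42. At the quoted 48.5 pages per second, 11 million pages take roughly 63 hours, which is consistent with "less than 3 days".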

18
Future work
  • Boolean operators (AND, OR, -)
  • User context (location, etc.)
  • Scale to 100 000 000 pages
  • Use text around links as well as Anchor Text
  • Proxy caches to build search databases


19
Personal observations
  • Google has become widespread
  • It's become its own verb: "Just google it."
  • It has launched map direction services, research
    journal searches, etc.
  • This paper is old.
  • Google indexed 2 billion webpages 4 years ago and
    is up to 8 billion now.

20
Related work
  • J. Cho, H. Garcia-Molina, and L. Page, Efficient
    crawling through URL ordering, in Proc. of the
    7th International World Wide Web Conference (WWW
    98), Brisbane, Australia, April 14-18, 1998.
  • R. Weiss, B. Velez, M. A. Sheldon, C. Namprempre,
    P. Szilagyi, A. Duda, and D. K. Gifford,
    HyPursuit: a hierarchical network search engine
    that exploits content-link hypertext clustering,
    in Proc. of the 7th ACM Conference on Hypertext,
    New York, 1996.
  • C. Cooper and A. Frieze, Crawling on simple
    models of web graphs, in Internet Mathematics 1,
    57-90, 2003.