Title: The Anatomy of a largescale hypertextual Web search engine by Sergey Brin, Lawrence Page appearing i
1 The Anatomy of a large-scale hypertextual Web search engine
by Sergey Brin and Lawrence Page
appearing in Computer Networks and ISDN Systems, 1998
- Presented by
- Damon Sutherland
- University of Alabama
- in partial fulfillment of the requirements for the Internet Algorithms course, Fall 2005
2 Introduction to web searches
- First automated web bots searched linearly and indexed URLs and titles only
- Hard to search for specific items
- By late 1995 AltaVista launched the first search engine with natural language queries
- By late 1996 Lycos had indexed 60 million pages
- Yahoo was initially released in 1994 as a list of its creators' favorite sites
3 Overview of query processing
- User types "computer"
- Google
- finds related pages (based on anchor text and the words in the page)
- retrieves snippets from the top related pages
- returns the results to the user in ranked order
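The three steps on this slide can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not Google's implementation: the `Page` record, its precomputed `rank` field, the `search` function, and the single-word query are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str         # words on the page itself
    anchor_text: str  # text of links pointing at this page (assumed pre-collected)
    rank: float       # assumed precomputed importance score


def search(query, pages, top_k=10, snippet_width=40):
    """Return (url, snippet) pairs for pages related to a one-word query, best first."""
    term = query.lower()
    # Step 1: find related pages, matching on page words or on anchor text.
    related = [
        p for p in pages
        if term in p.text.lower().split() or term in p.anchor_text.lower().split()
    ]
    # Step 2: order the related pages by their importance score.
    related.sort(key=lambda p: p.rank, reverse=True)
    # Step 3: retrieve a short snippet from each of the top pages.
    results = []
    for page in related[:top_k]:
        snippet = page.text[:snippet_width] if page.text else page.anchor_text
        results.append((page.url, snippet))
    return results


pages = [
    Page("a.example", "computer science department", "", 0.5),
    Page("b.example", "dogs and cats", "computer store", 0.9),
    Page("c.example", "gardening tips", "", 0.7),
]
for url, snippet in search("computer", pages):
    print(url, "|", snippet)
```

Note that `b.example` is returned even though the word appears only in the anchor text pointing at it, which is the behavior the later anchor-text slides describe.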
4 Motivation
- Increase the relevance of queries
- People generally view the first tens of results
6 Motivation
- Increase the relevance of queries
- As of late 1997, only one of the four major search engines returned a link to itself in the top 10 results when queried for its own name.
7 Motivation
- Scalable
- by number of web pages indexed
8 Motivation
- Scalable
- web queries per day
9 How to find related pages
- By the text on a page
- Google parses the source code and breaks the text
into a series of word occurrences
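A minimal sketch of this parsing step, using Python's standard `html.parser` module. The `WordOccurrenceParser` class and the `(word, position)` record are illustrative assumptions; the real system records richer information per occurrence, and this sketch does not skip script or style content.

```python
import re
from html.parser import HTMLParser


class WordOccurrenceParser(HTMLParser):
    """Collect (word, position) pairs from the text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.occurrences = []
        self._position = 0

    def handle_data(self, data):
        # Called for each run of text between tags; tags themselves are skipped.
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.occurrences.append((word, self._position))
            self._position += 1


def word_occurrences(html_source):
    parser = WordOccurrenceParser()
    parser.feed(html_source)
    parser.close()
    return parser.occurrences


print(word_occurrences("<html><body><h1>Dogs</h1><p>I love dogs!</p></body></html>"))
# prints [('dogs', 0), ('i', 1), ('love', 2), ('dogs', 3)]
```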
10 How to find related pages
- Anchor Text is the description of the link by the page author.
- <A HREF="page2.htm">I love dogs!</A>
- Google believes the Anchor Text is as important as the page text.
11 Anchor text
- Anchor Text increases relevance
- Unlike other search engines, Google associates the Anchor Text with the page the link points to.
- This allows Google to return pages that cannot be crawled, i.e., pictures, programs, etc.
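The idea of filing Anchor Text under the link's target can be sketched as follows. This is a simplified illustration, not the paper's data structure: `AnchorCollector`, `build_anchor_index`, and the `example.com` URLs are hypothetical names chosen for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (target_url, anchor_text) pairs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_anchor_index(pages):
    """Map each link target to the anchor texts that point at it."""
    index = defaultdict(list)
    for url, html in pages.items():
        collector = AnchorCollector(url)
        collector.feed(html)
        collector.close()
        for target, text in collector.anchors:
            index[target].append(text)
    return index


pages = {
    "http://example.com/page1.htm":
        '<A HREF="page2.htm">I love dogs!</A> <a href="dog.jpg">photo of a dog</a>',
}
index = build_anchor_index(pages)
print(index["http://example.com/dog.jpg"])  # prints ['photo of a dog']
```

Because the text is stored under the target URL, the image `dog.jpg` becomes searchable by words even though it was never crawled or parsed itself.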
12 Anchor text, cont'd.
- Google ranking can be manipulated
- A large number of pages using the same Anchor Text can influence how the linked page ranks for that text.
- Called a Google bomb.
source: http://www.litigiousbastards.com/
13 How to compute importance of pages
- Google creates a web citation map
- details the relationship of a significant sample of hyperlinks on the web
- a link to a node is a vote for that node
14 Web citation graph
- Compute PageRank of each graph node
- your rank is high when several high-rank nodes link to you, or when many nodes link to you
- Details: subsequent talk
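As a preview of those details, the sketch below iterates the formula given in the Brin and Page paper, PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), with damping factor d = 0.85, where T1..Tn are the pages linking to A and C(T) is the number of links leaving T. The `pagerank` function, the fixed iteration count, and the four-node graph are assumptions made for illustration; this is not the production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterate the PageRank formula on a dict mapping node -> list of link targets."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    rank = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the (1 - d) base term.
        new_rank = {node: 1.0 - damping for node in nodes}
        for source, targets in links.items():
            if not targets:
                continue  # a node without out-links passes on no votes
            # A node splits its vote evenly among the pages it links to.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for node, score in sorted(pagerank(links).items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

In this graph `c` ranks highest because many nodes link to it, and `a` ranks second because its single in-link comes from the high-rank node `c`, which matches the two conditions on the slide.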
15 Model / System description
Brin, Page. (1998) Fig. 1
16 Model / System comparison
Heydon, Najork. (1999) Fig. 1
17
- In 1998
- Downloaded 26 million pages in about 9 days
- The last 11 million in less than 3 days
- The HTTPWorker equivalent averages 48.5 pages per second.
- In 2005
- Indexed 8.1 billion web pages, 1 billion images, and 1 billion Usenet posts.
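The 48.5 pages per second figure can be checked with a short calculation. The slide says only "less than 3 days"; the 63-hour duration used below is the figure reported in the Brin and Page paper, and `pages_per_second` is a hypothetical helper name.

```python
def pages_per_second(pages, hours):
    """Average crawl throughput over the given number of hours."""
    return pages / (hours * 3600)


# Last 11 million pages in 63 hours (duration taken from the paper).
print(f"{pages_per_second(11_000_000, 63):.1f} pages per second")      # prints 48.5 pages per second

# Whole crawl: 26 million pages in roughly 9 days.
print(f"{pages_per_second(26_000_000, 9 * 24):.1f} pages per second")  # prints 33.4 pages per second
```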
18 Future work
- Boolean operators: AND, -, OR
- User context (location, etc.)
- Scale to 100 000 000 pages
- Use text around links as well as Anchor Text
- Proxy caches to build search databases
19 Personal observations
- Google has become widespread
- It's become its own verb: "Just google it."
- It's launched map direction services, research journal searches, etc.
- This paper is old.
- Google indexed 2 billion webpages 4 years ago and is up to 8 billion now.
20 Related work
- J. Cho, H. Garcia-Molina, and L. Page, Efficient crawling through URL ordering, in Proc. of the 7th International World Wide Web Conference (WWW 98), Brisbane, Australia, April 14-18, 1998.
- R. Weiss, B. Velez, M.A. Sheldon, C. Manprempre, P. Szilagyi, A. Duda, and D. K. Gifford, HyPursuit: a hierarchical network search engine that exploits content-link hypertext clustering, in Proc. of the 7th ACM Conference on Hypertext, New York, 1996.
- C. Cooper and A. Frieze, Crawling on Simple Models of Web Graphs, in Internet Mathematics 1, 57-90, 2003.