Sergey Brin, Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine

Transcript and Presenter's Notes

1
Sergey Brin, Lawrence Page: The Anatomy of a
Large-Scale Hypertextual Web Search Engine
  • Rogier Brussee
  • ICI, 21-11-2005

2
Context
  • Written in 1997, when Brin and Page were PhD students
  • Indexes $24 \cdot 10^6$ pages ($\ll 10^{100}$)
  • Describes their efforts to create a web search
    engine open for academia
  • AltaVista, Lycos and Yahoo ruled; the Internet bubble
    was still growing.

3
Disclaimer
  • What Google does now ≠ what Google did in 1997. Can only
    guess what still applies
  • Principles sound right, probably survived
  • Lots of room for tweaking (a dark art)
  • Data structures described down to the bit level have
    probably changed.
  • Scaled up tremendously
  • Index $> 10^{10}$ pages (?) ($\ll 10^{100}$)
  • So did hardware and OS.
  • Business model changed
  • "Ads should not drive search results" is still
    stated policy.

4
What does Google do
  • Preprocess
      • Crawl
      • Index words, anchors and links in docs
      • Invert the index (i.e. sort)
      • Value content (PageRank + looks weights)
  • At query time
      • Look up the query
      • Rank results (PageRank + IR measure)

5
Google Architecture (in 1997)

6
Ranking I
  • Google weights words differently depending on:
      • Capitalisation
      • Font size (relative to the page average)
      • Whether they appear in the title, an anchor, ...
  • For phrases, the proximity of the words also matters
  • This gives an IR score (the precise formula is not given)
  • And then there is PageRank!
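Since the precise formula is not given, the following is only a guess at the general shape of such an IR score: each kind of word occurrence gets a weight, and the score is a weighted sum of capped counts. All names (`TYPE_WEIGHTS`, `COUNT_CAP`, `ir_score`), the weight values, and the cap are invented for this sketch. Proximity is not modelled.

```python
# Hypothetical weights per kind of occurrence; the values are made up.
TYPE_WEIGHTS = {
    "title": 5.0,
    "anchor": 4.0,
    "large_font": 3.0,
    "capitalised": 2.0,
    "plain": 1.0,
}

# Assumption: occurrences beyond this count add nothing to the score.
COUNT_CAP = 3


def count_weight(count, cap=COUNT_CAP):
    """Grow linearly with the count, then level off at the cap."""
    return min(count, cap)


def ir_score(hit_counts, type_weights=TYPE_WEIGHTS):
    """Weighted sum over occurrence types; `hit_counts` maps type -> count."""
    return sum(
        type_weights[kind] * count_weight(count)
        for kind, count in hit_counts.items()
    )


# One title hit plus ten plain hits: the plain hits are capped at three.
print(ir_score({"title": 1, "plain": 10}))
```

Capping the counts means that repeating a word many times in plain text cannot outweigh a single occurrence in a more prominent position by much.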

7
Ranking II
  • Together these determine the rank you see when googling
  • No single factor is dominant

8
PageRank
  • Named after Lawrence Page.
  • Measure of the collectively defined importance of a
    web page
  • Probabilistic model of a user surfing at random
    before Google gives a recommendation
  • PageRank is the probability of finding the user at a
    page in this model after an infinite number of clicks
  • Quantitative version of the effect of information
    scent
  • Really pioneered by ants!
  • "Go to the ant, thou sluggard; consider her ways,
    and be wise" (Proverbs 6:6)

9
Ant model for PageRank
[Figure: an ant on a page with outgoing links 1, 2, ..., k; labels d, (1-d) and (1-d)/n]
Chance d to follow a link. Chance (1-d) to jump to a
random page out of n pages.
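The ant model can be tried out directly by letting one ant walk for many steps and counting where it spends its time. This is a minimal simulation sketch: the function name `simulate_ant`, the toy graph, the value d = 0.85, and the rule that an ant on a page without links always jumps are assumptions of this example, not taken from the slides.

```python
import random
from collections import Counter


def simulate_ant(links, d=0.85, steps=100_000, seed=0):
    """Estimate the fraction of time a random ant spends on each page.

    `links` maps every page to the list of pages it links to.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    page = rng.choice(pages)
    visits = Counter()
    for _ in range(steps):
        if links[page] and rng.random() < d:
            # Chance d: follow one of the k links on the current page.
            page = rng.choice(links[page])
        else:
            # Chance (1-d), or no links at all: jump to a random page.
            page = rng.choice(pages)
        visits[page] += 1
    return {p: visits[p] / steps for p in pages}


toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(simulate_ant(toy_web))
```

In this toy graph page b receives the fewest visits, since only half of a's outgoing links point to it.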
10
Mathematical Explanation
  • We have an initial ant distribution $p = (p_1, \ldots, p_n)$
    on n pages
  • Normalise $\sum_i p_i = 1$; we have $p_i > 0$.
  • We have a Markov chain with transition
    probability
      • $t_{ij} = d/k_j + (1-d)/n$ if one of the $k_j$
        links on page j points to page i
      • $t_{ij} = (1-d)/n$ otherwise
  • This gives the transition matrix $T = (t_{ij})$,
    $i, j = 1, \ldots, n$
  • Note $t_{ij} > 0$ and $\sum_i t_{ij} = 1$.
  • After one round the ant distribution is
      • $Tp = (\sum_j t_{ij} p_j)_{i = 1, \ldots, n}$
  • Note $(Tp)_i > 0$ and $\sum_i (Tp)_i = 1$.
  • After m rounds the distribution is $T^m p$.
  • Define $p^{(0)} = \lim_{m \to \infty} T^m p$ (the limit exists)
  • $T p^{(0)} = p^{(0)}$: $p^{(0)}$ is the stationary
    distribution of the Markov chain
  • PageRank is the stationary distribution of the Markov
    chain
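The limit can be approximated numerically by applying T repeatedly until the distribution stops changing (power iteration). The sketch below follows the transition probabilities t_ij given above. The function name `pagerank`, the value d = 0.85, the stopping tolerance, and the toy graph are assumptions of this example. It also assumes that every page has at least one outgoing link and that all links on a page point to distinct pages, as the column sums require.

```python
def pagerank(links, d=0.85, tol=1e-12, max_rounds=1000):
    """Approximate the stationary distribution by repeated application of T.

    `links` maps every page j to the list of pages it links to; each list
    must be non-empty and contain no duplicates (assumption of this sketch).
    """
    pages = sorted(links)
    n = len(pages)
    # Start from the uniform ant distribution.
    p = {page: 1.0 / n for page in pages}
    for _ in range(max_rounds):
        # The (1-d)/n term of t_ij, summed over all j (the p_j sum to 1).
        new_p = {page: (1.0 - d) / n for page in pages}
        # The d/k_j term for every link from page j to page i.
        for j in pages:
            k_j = len(links[j])
            for i in links[j]:
                new_p[i] += d * p[j] / k_j
        change = sum(abs(new_p[page] - p[page]) for page in pages)
        p = new_p
        if change < tol:
            break
    return p


toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for page, rank in sorted(pagerank(toy_web).items()):
    print(page, round(rank, 4))
```

In this toy graph page c, which is linked from both a and b, ends up with the highest rank, and the ranks sum to 1.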