The PageRank Citation Ranking: Bringing Order to the Web. Page L., Brin S., Motwani R., Winograd T.

1
The PageRank Citation Ranking: Bringing Order to
the Web
Page L., Brin S., Motwani R., Winograd T.
Stanford Digital Library Technologies Project
http://dbpubs.stanford.edu/pub/1999-66
  • Presented by Zheng Zhao
  • Originally designed by Soumya Sanyal
  • http://ranger.uta.edu/gdas/Courses/Spring2005/DBIR/slides/The%20PageRank%20Citation%20Ranking%20-%20Redone.ppt

2
Outline
  • Paper Citations and the Web: Motivation
  • PageRank: Why should it be considered?
  • More PageRank: Nuts and bolts
  • PageRank Unleashed: Looking under the hood
  • Convergence and Random Walks: Why does it work?
  • Implementation: Getting your hands dirty
  • Personalized PageRank: The invisible source
  • Applications: What wasn't apparent already
  • Conclusions

3
Paper Citations and the Web: Motivation
  • Academic citations link to other well-known
    papers
  • But academic papers are peer reviewed and subject
    to quality control
  • The web of academic documents is homogeneous in
    quality, usage, and citation length
  • Most web pages link to other web pages as well
  • The quality of a web page, though, is subjective
    to the user
  • The importance of a page is a quantity that isn't
    intuitively possible to capture

4
Contd.
  • A user wants to see what is most applicable to
    her needs first.
  • The job of the retrieval system is to present the
    more relevant documents up front.
  • The notion of quality, or relative importance, of
    a web page therefore becomes more significant
  • The average quality experienced by a user is
    higher than the quality of the average web page.
  • Notation used:
  • Backlinks (inedges): links that point to a
    certain page
  • Forward links (outedges): links that emanate
    from that page

5
PageRank: Why should it be considered?
  • Think of a color palette
  • Colors are formed by mixing one or more
    colors
  • The amount and intensity of each color you mix
    ultimately govern the color of the final mixture,
    not the number of colors!
  • Now think of a web page
  • A number of back links (inedges) point to this
    web page
  • Say a certain back link came from Yahoo! and
    another came from an obscure home page.
  • Think of the importance of the Yahoo! page as
    opposed to the importance of the home page.
  • Now say the importance of the Yahoo! page was
    mapped to the amount (intensity) of one color and
    that of the home page to another color
  • What counts is the importance of back links
    rather than their number.



6
More PageRank: Nuts and bolts
  • For any web page u, let $F_u$ be the set of pages
    u links to (forward links), $B_u$ the set of
    pages that link to u (back links), and
    $N_u = |F_u|$ the number of forward links
  • $R(u)$: rank of page u; $c$: normalization
    constant
  • Note: $c < 1$ to cover for pages with no outgoing
    links
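
In this notation, the simplified ranking defined in the PageRank paper can be written as below. Each page v divides its rank evenly among its $N_v$ forward links, and page u collects the shares sent by its back links:

```latex
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}
```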

7
Contd..
  • So what does the overall picture look like?
  • A is a square matrix whose rows and columns
    correspond to the web pages u and v
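
A minimal sketch of how such a matrix could be built, assuming the paper's convention that the entry for (u, v) is 1/N_u when u links to v and 0 otherwise. The four-page graph and the names `links`, `index`, and `link_matrix` are hypothetical, chosen only for illustration.

```python
# Hypothetical four-page link graph: each page maps to the pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pages = sorted(links)
index = {page: i for i, page in enumerate(pages)}


def link_matrix(links, index):
    """Return A with A[u][v] = 1/N_u if u links to v, else 0."""
    n = len(index)
    A = [[0.0] * n for _ in range(n)]
    for u, targets in links.items():
        for v in targets:
            # Page u spreads its weight evenly over its N_u forward links.
            A[index[u]][index[v]] = 1.0 / len(targets)
    return A


A = link_matrix(links, index)
for page, row in zip(pages, A):
    print(page, row)
```

Every row of this small matrix sums to 1 because each page here has at least one forward link; a page without forward links would leave an all-zero row.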

8
Contd.. (Matrices Revisited)
  • Eigenvectors and eigenvalues
  • Given that A is a matrix and R is a vector over
    all the web pages, the dominant eigenvector is
    the one associated with the maximal eigenvalue.
  • It can be found by iterating the previous
    equation until the recurrence converges.
  • The eigenvectors associated with a given
    eigenvalue, together with the zero vector, form
    what is called the eigenspace of that eigenvalue.
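
The iteration mentioned above is commonly known as power iteration. The following is a small sketch under stated assumptions: the 2x2 matrix is a hypothetical example (eigenvalues 3 and 1), the function name `power_iteration` is illustrative, and the starting vector is assumed to have a nonzero component along the dominant eigenvector.

```python
def power_iteration(A, steps=100):
    """Approximate the dominant eigenvector of A by repeated multiplication."""
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)  # arbitrary nonzero starting vector
    scale = 0.0
    for _ in range(steps):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        scale = sum(abs(v) for v in y)  # L1 norm of A x
        x = [v / scale for v in y]      # renormalize so that ||x||_1 = 1
    # Since ||x||_1 = 1 before each multiplication, the last scale
    # approximates the magnitude of the dominant eigenvalue.
    return scale, x


# Hypothetical symmetric matrix with eigenvalues 3 and 1.
eigenvalue, eigenvector = power_iteration([[2.0, 1.0], [1.0, 2.0]])
print(eigenvalue, eigenvector)
```

The error shrinks by roughly the ratio of the second-largest to the largest eigenvalue at each step, which is why the gap between the two matters for convergence speed.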

9
Contd.. (A Walk Through Example)
  • Let's take an example

[Figure: matrix $A^T$ for the example]
10
Contd..
  • Matrix Notation
  • $R = c A R = M R$
  • $c$: eigenvalue
  • $R$: eigenvector of $A$
  • $A x = \lambda x$
  • $(A - \lambda I)\,x = 0$

[Figure: $A$, $R$, and normalized $R$ for the example]
11
Contd.. (Markov Chains)
  • Random surfer model
  • Description of a random walk through the Web
    graph
  • The link matrix is interpreted as a transition
    matrix; the rank of a page is the asymptotic
    probability that a surfer is currently browsing
    that page
  • The above notion is fundamental to any Markovian
    system. For a discrete version of the above, the
    following is assumed.
  • $R_t = M R_{t-1}$, where $M$ is the transition
    matrix for a first-order Markov chain (stochastic)
  • The question is: does it converge to some sensible
    solution (as $t \to \infty$) regardless of the
    initial ranks?
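
To make the convergence question concrete, here is a small sketch of the update rule on a hypothetical three-page graph that is strongly connected and aperiodic. The graph, the column-stochastic layout of `M`, and the helper names `step` and `iterate` are assumptions made for illustration.

```python
def step(M, R):
    """One step of the chain: R_t = M R_{t-1}."""
    n = len(R)
    return [sum(M[i][j] * R[j] for j in range(n)) for i in range(n)]


def iterate(M, R, steps=200):
    """Apply the update a fixed number of times."""
    for _ in range(steps):
        R = step(M, R)
    return R


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
# M[i][j] is the probability of moving from page j to page i,
# so every column sums to 1.
M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

from_a = iterate(M, [1.0, 0.0, 0.0])            # surfer starts on page a
from_uniform = iterate(M, [1 / 3, 1 / 3, 1 / 3])  # surfer starts anywhere
print(from_a)
print(from_uniform)
```

For this particular graph both starting points settle on the same ranks, approximately (0.4, 0.2, 0.4). That outcome depends on the graph being well behaved, which is not true of the Web in general.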

12
Contd..(Issues..)
  • The above equation would converge were it not for
    a little problem
  • This problem is called the Rank Sink Problem.
  • The sink accumulates rank, but never distributes
    it!

13
Contd..
  • In general, many web pages lack either backlinks
    or forward links.
  • This results in dangling edges in the graph
  • No parent → rank 0
  • $M^T$, the $T$-th power of $M$, converges to a
    matrix whose last column is all zero
  • No children → no solution
  • $M^T$ converges to the zero matrix
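
A tiny sketch of both effects, assuming a hypothetical three-page chain a -> b -> c in which page a has no parent and page c has no children. The matrix layout (entry `M[i][j]` is the probability of moving from page j to page i) is an assumption for illustration.

```python
def step(M, R):
    """One update R_t = M R_{t-1}."""
    n = len(R)
    return [sum(M[i][j] * R[j] for j in range(n)) for i in range(n)]


# Hypothetical chain a -> b -> c.
# Column c is all zero because page c has no forward links.
M = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

R = [1 / 3, 1 / 3, 1 / 3]
history = [R]
for _ in range(3):
    R = step(M, R)
    history.append(R)
    print(R, "total rank:", sum(R))
```

Page a drops to rank 0 after the first step because nothing links to it, and the total rank drains away through page c until nothing is left.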

14
Contd..(More Random Surfer)
  • Q: How do we escape from this?
  • A: We actually escape from it.
  • Say a surfer is randomly clicking and hopping
    from one page to the other.
  • If this surfer keeps going back to the same set
    of pages, she will get bored (in reality too) and
    try and escape from this set of pages.
  • Hence, we associate an escape factor E to
    account for this boredom.
  • How do we model this escape probability?
  • We take E to be a vector over all the web
    pages that accounts for each page's escape
    probability.

15
Contd..
  • Given this escape vector, how do we combine
    it with the original model?
  • In matrix notation
    where
  • It can be rewritten as
  • Hence
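
Following the PageRank paper, the model with the escape vector can be written as below, where $R'$ is the rank vector normalized so that $\|R'\|_1 = 1$ and $\mathbf{1}$ is the vector of all ones. This is a reconstruction from the paper in the notation used here; the normalization is what allows $E$ to be absorbed into the matrix.

```latex
\begin{aligned}
R' &= c\,(A R' + E), \qquad \text{where } \|R'\|_1 = 1 \\
R' &= c\,(A + E \times \mathbf{1})\,R' \\
\text{hence } & R' \text{ is an eigenvector of } (A + E \times \mathbf{1})
\end{aligned}
```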

16
PageRank Unleashed: Looking under the hood
The main algorithm
  • What can we say about d and ε?
  • d1 is called the eigengap and it controls the
    rate of convergence
  • ε is the convergence threshold
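
A sketch of the iteration in Python, following the loop given in the PageRank paper: multiply by the link matrix, measure the rank lost in that step, hand the lost rank back through E, and stop once the L1 change is at most ε. Several details are assumptions for illustration only: the function name `pagerank`, the four-page graph, a damping constant c = 0.85 folded into the matrix, and an E that sums to 1 so that total rank is conserved.

```python
def pagerank(A, E, S, eps=1e-10, max_iter=1000):
    """Iterate R <- A R + lost * E until the L1 change is at most eps."""
    n = len(S)
    R = list(S)
    for _ in range(max_iter):
        R_next = [sum(A[i][j] * R[j] for j in range(n)) for i in range(n)]
        # Rank lost in this step (the paper's d): ||R_i||_1 - ||R_{i+1}||_1.
        lost = sum(map(abs, R)) - sum(map(abs, R_next))
        R_next = [r + lost * e for r, e in zip(R_next, E)]
        delta = sum(abs(new - old) for new, old in zip(R_next, R))
        R = R_next
        if delta <= eps:
            break
    return R


c = 0.85  # assumed damping constant, folded into the matrix below
# Hypothetical graph: a -> b, a -> c, b -> c, c -> a, d -> c.
# Column j holds the forward links of page j, scaled by c.
A = [
    [0.0,   0.0, c,   0.0],
    [c / 2, 0.0, 0.0, 0.0],
    [c / 2, c,   0.0, c],
    [0.0,   0.0, 0.0, 0.0],
]
E = [0.25] * 4  # uniform escape vector, assumed to sum to 1
S = [0.25] * 4  # initial ranks
ranks = pagerank(A, E, S)
print(ranks)
```

In this example page d has no backlinks, so its rank comes entirely from E, while page c, which three pages link to, ends up with the highest rank.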

17
Convergence and Random Walks: Why does it work?
  • Irreducible, aperiodic Markov chains with a
    primitive transition probability matrix
  • What is the issue all about?
  • We need a transition matrix model that is
    guaranteed to converge, and to converge to a
    unique stationary distribution vector.

18
Contd..
  • Adding the escape vector E allows us to
    make the original matrix A both primitive and
    stochastic
  • This guarantees convergence
  • What about the addition of new links?
  • Are link analysis algorithms based on
    eigenvectors stable, in the sense that results
    don't change significantly?
  • The connectivity of a portion of the graph is
    changed arbitrarily
  • How will it affect the results of the algorithms?
  • Ng et al. (2001), IJCAI, and Bianchini et al.
    (2002), WWW02
  • It is possible to perturb a symmetric matrix by a
    quantity that grows as d1 that produces a
    constant perturbation of the dominant eigenvector

19
Contd..
  • Convergence Experiment(s)
  • Expander graphs and d1 (every subset S has a
    neighborhood bounded below by some factor α
    times |S|)
  • Rapidly mixing random walk: convergence is
    guaranteed in time logarithmic in the size of
    the graph

20
Implementation: Getting your hands dirty
  • In 1998
  • 24 million web pages
  • Crawler builds an index of links
  • To do this in 5 days, 50 Web pages/second need to
    be crawled
  • 11 is the average outdegree, 550 links/second
  • 75 million unique URLs to be compared against
  • URLs are hashed to unique integer ID
  • No dangling links are kept initially
  • Vector E will help in convergence issues also
  • Weights were kept for 75 million URLs at 4
    bytes/weight (300MB)
  • Access to the link database is linear since it is
    sorted
  • '99: 800 million pages; '00: 2 billion; '01: 4
    billion
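
The figures on this slide can be checked with a few lines of arithmetic. The variable names are illustrative. The crawl rate comes out slightly above the rounded 50 pages/second quoted above.

```python
pages = 24_000_000           # web pages crawled
seconds = 5 * 24 * 60 * 60   # five days in seconds

pages_per_second = pages / seconds
links_per_second = 50 * 11             # rounded crawl rate x average outdegree
weight_bytes = 75_000_000 * 4          # one 4-byte weight per unique URL

print(round(pages_per_second, 1))      # pages/second needed for a 5-day crawl
print(links_per_second)                # links/second at 50 pages/second
print(weight_bytes / 1_000_000, "MB")  # memory for the weight vector
```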

21
Personalized PageRank: The invisible source
  • $\|E\|_1 = 0.15$
  • Web pages are valued because they exist!
  • Web pages with many related links receive an
    overly high ranking
  • The other extreme: E for just one web page
  • Netscape home page and John McCarthy's home page
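
A sketch contrasting the two extremes on a hypothetical four-page graph: a uniform E, and an E that puts all of its weight on a single page. The function name `personalized_pagerank`, the graph, and the split into a link step weighted by c = 0.85 plus an escape step weighted by 0.15 are assumptions for illustration. The sketch also assumes every page has at least one forward link.

```python
def personalized_pagerank(links, E, c=0.85, steps=200):
    """Iterate R <- c * (link step) + (1 - c) * E for a fixed number of steps."""
    pages = sorted(links)
    R = {p: 1.0 / len(pages) for p in pages}
    for _ in range(steps):
        nxt = {p: (1.0 - c) * E[p] for p in pages}  # escape contribution
        for u in pages:
            for v in links[u]:
                # Page u passes an equal share of its rank to each forward link.
                nxt[v] += c * R[u] / len(links[u])
        R = nxt
    return R


# Hypothetical graph; page d has no backlinks.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

uniform = personalized_pagerank(links, {p: 0.25 for p in links})
only_d = personalized_pagerank(links, {"a": 0.0, "b": 0.0, "c": 0.0, "d": 1.0})
print(uniform["d"], only_d["d"])
```

With E concentrated on page d, that page receives the whole escape mass and its rank rises fourfold compared with the uniform case. The pages it links to benefit as well.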

22
Applications: What wasn't apparent already
  • Estimating Web Traffic
  • How PageRank corresponds to actual usage
  • Internet proxy cache from NLANR compared to
    PageRank
  • 2.6 million pages intersect with the 75 million
    URLs indexed for PageRank
  • Web-based email access is one plausible reason
    for this disparity
  • People look at certain pages but never link to
    them
  • Backlink Predictor
  • PageRank is a better predictor for future
    citation counts than citation counts themselves.
  • Experiment starts out with one URL and no other
    information
  • Goal is to crawl web pages in the order of their
    importance
  • Importance is an evaluation function on the
    citation count (number of backlinks)
  • PageRank escapes local minima; citation counts
    get stuck in these.

23
Conclusions
  • In essence, the importance of one page being
    dependent on the importance of its predecessors
    is like a peer review.
  • NASDAQ, 17th February 2005: 197.41. Need I
    say more?