Title: The PageRank Citation Ranking: Bringing Order to the Web. Page L., Brin S., Motwani R., Winograd T.
1 The PageRank Citation Ranking: Bringing Order to the Web
Page L., Brin S., Motwani R., Winograd T. Stanford Digital Library Technologies Project
http://dbpubs.stanford.edu/pub/1999-66
- Presented by Zheng Zhao
- Originally designed by Soumya Sanyal
- http://ranger.uta.edu/gdas/Courses/Spring2005/DBIR/slides/The%20PageRank%20Citation%20Ranking%20-%20Redone.ppt
2 Outline
- Paper Citations and the Web: Motivation
- PageRank: Why it should be considered?
- More PageRank: Nuts and bolts
- PageRank Unleashed: Looking under the hood
- Convergence and Random Walks: Why does it work?
- Implementation: Getting your hands dirty
- Personalized PageRank: The invisible source
- Applications: What wasn't apparent already
- Conclusions
3 Paper Citations and the Web: Motivation
- Academic citations link to other well-known papers
- But academic papers are peer reviewed and subject to quality control
- The web of academic documents is homogeneous in quality, usage, and citation length
- Most web pages link to other web pages as well
- The quality of a web page, though, is subjective to the user
- The importance of a page is a quantity that is not intuitive to capture
4 Contd.
- A user wants to see what is most applicable to her needs first.
- The job of the retrieval system is to present the more relevant documents up front.
- The notion of quality or relative importance of a web page therefore becomes more significant.
- The average quality experienced by a user is higher than the quality of the average web page.
- Notation used
- Backlinks (inedges): links that point to a given page
- Forward links (outedges): links that emanate from that page
5 PageRank: Why it should be considered?
- Think of a color palette
- Colors are formed by mixing one or more colors
- The amount and intensity of each color you mix, not the number of colors, ultimately governs the color of the final mixture!
- Now think of a web page
- A number of backlinks (inedges) point to this web page
- Say one backlink comes from Yahoo! and another from an obscure home page.
- Compare the importance of the Yahoo! page with the importance of the home page.
- Now map the importance of the Yahoo! page to the amount (intensity) of one color, and that of the home page to another color
- What matters is the importance of the backlinks rather than their number.
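6 More PageRank: Nuts and bolts
- For any web page $u$, let $F_u$ be the set of pages $u$ links to (forward links), $B_u$ the set of pages that link to $u$ (backlinks), and $N_u = |F_u|$ the number of forward links
- $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$
- $R(u)$: rank of page $u$; $c$: normalization constant
- Note: $c < 1$ to cover for pages with no outgoing links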
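As a minimal sketch of the rank equation on this slide, the Python below repeatedly applies R(u) = c * sum over backlinks v of R(v) / N_v to a made-up three-page graph. The graph, the function names, and the choice c = 1 (every page in this toy graph has an outgoing link, so no rank is lost) are assumptions for illustration and do not come from the slides.

```python
# Hypothetical toy graph: each page maps to the pages it links to (its forward links).
forward = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}


def backlinks(forward):
    """Invert the forward-link map: back[u] lists the pages that link to u."""
    back = {u: [] for u in forward}
    for v, targets in forward.items():
        for u in targets:
            back[u].append(v)
    return back


def rank_step(rank, forward, c=1.0):
    """Apply R(u) = c * sum(R(v) / N_v for v in B_u) once to every page."""
    back = backlinks(forward)
    return {
        u: c * sum(rank[v] / len(forward[v]) for v in back[u])
        for u in forward
    }


# Start from equal ranks and iterate until the values settle.
rank = {u: 1.0 / len(forward) for u in forward}
for _ in range(100):
    rank = rank_step(rank, forward)

print(rank)  # roughly {'a': 0.4, 'b': 0.2, 'c': 0.4}
```

Page b ends up with half the rank of a and c because it receives only half of a's rank, while c collects the other half of a's rank plus all of b's.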
7 Contd.
- So what does the overall picture look like?
- $A$ is a square matrix whose rows and columns correspond to web pages $u$ and $v$
8 Contd. (Matrices Revisited)
- Eigenvectors and eigenvalues
- Given a matrix $A$ and a vector $R$ over all the web pages, the dominant eigenvector is the one associated with the maximal eigenvalue.
- It can be found by applying the previous equation repeatedly until the iteration converges.
- The set of eigenvalues of a matrix is called its spectrum; the eigenvectors that share one eigenvalue span that eigenvalue's eigenspace.
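The idea of finding the dominant eigenvector by repeated application can be sketched with plain power iteration. The example assumes the paper's definition of the link matrix (A[u][v] = 1/N_u when u links to v, 0 otherwise) and iterates with its transpose; the toy graph and every function name are hypothetical.

```python
pages = ["a", "b", "c"]
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical toy graph


def link_matrix(pages, links):
    """Build A with A[u][v] = 1/N_u if u links to v, else 0."""
    index = {p: i for i, p in enumerate(pages)}
    A = [[0.0] * len(pages) for _ in pages]
    for u, targets in links.items():
        for v in targets:
            A[index[u]][index[v]] = 1.0 / len(targets)
    return A


def transpose(m):
    """Swap rows and columns."""
    return [list(column) for column in zip(*m)]


def dominant_eigenpair(m, iterations=200):
    """Power iteration for a nonnegative matrix, normalizing in the L1 norm."""
    n = len(m)
    x = [1.0 / n] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        y = [sum(m[i][j] * x[j] for j in range(n)) for i in range(n)]
        # x sums to 1 and is nonnegative, so sum(y) estimates the eigenvalue.
        eigenvalue = sum(y)
        x = [value / eigenvalue for value in y]
    return eigenvalue, x


eigenvalue, R = dominant_eigenpair(transpose(link_matrix(pages, links)))
print(eigenvalue, R)  # eigenvalue close to 1, R close to [0.4, 0.2, 0.4]
```

Normalizing by the L1 norm keeps the vector a probability distribution over pages, which matches how ranks are read later in the deck.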
9 Contd. (A Walk-Through Example)
[Figure: walk-through example, labeled $A^T$]
10 Contd.
- Matrix notation
- $R = c A R = M R$
- $c$: eigenvalue
- $R$: eigenvector of $A$
- $A x = \lambda x$
- $(A - \lambda I) x = 0$
[Figure: matrix $A$ and the normalized rank vector $R$]
11 Contd. (Markov Chains)
- Random surfer model
- Describes a random walk through the Web graph
- Interpreted as a transition matrix whose asymptotic probabilities give the chance that a surfer is currently browsing each page
- This notion is fundamental to any Markovian system. For the discrete case, the following is assumed.
- $R_t = M R_{t-1}$, where $M$ is the transition matrix of a first-order Markov chain (stochastic)
- The question is: does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?
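The random surfer reading can be illustrated by simulating the walk directly: the fraction of time spent on each page should approach the stationary ranks. This is only a sketch on a hypothetical three-page graph that is well behaved (no sinks, no dangling pages); the seed, step count, and names are arbitrary choices.

```python
import random

# Hypothetical toy graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}


def surf(links, steps, start="a", seed=0):
    """Follow uniformly random forward links and record visit frequencies."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    visits = {page: 0 for page in links}
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(links[page])  # click a random outgoing link
    return {p: count / steps for p, count in visits.items()}


frequencies = surf(links, steps=200_000)
print(frequencies)  # close to the stationary ranks 0.4, 0.2, 0.4 for a, b, c
```

For this graph the stationary distribution can be checked by hand: b receives half of a's rank, and a and c carry equal rank, which gives 0.4, 0.2, 0.4.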
12 Contd. (Issues)
- The above equation would converge were it not for a little problem
- This problem is called the rank sink problem.
- The sink accumulates rank, but never distributes it!
13 Contd.
- In general, many web pages lack either backlinks or forward links.
- This results in dangling edges in the graph
- No parent ⇒ rank 0
- As $T$ grows, $M^T$ converges to a matrix whose last column is all zero
- No children ⇒ no solution
- As $T$ grows, $M^T$ converges to the zero matrix
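Both failure modes are easy to reproduce with the plain update R(u) = sum of R(v)/N_v over backlinks v. The two tiny graphs below are hypothetical: in the first, pages b and c form a loop that only takes rank in; in the second, page b has no forward links, so rank that reaches it disappears.

```python
def step(rank, links):
    """One update R(u) = sum of R(v) / N_v over the pages v that link to u."""
    new = {u: 0.0 for u in links}
    for v, targets in links.items():
        for u in targets:
            new[u] += rank[v] / len(targets)
    return new


def iterate(links, rounds):
    """Start from equal ranks and apply the update a fixed number of times."""
    rank = {u: 1.0 / len(links) for u in links}
    for _ in range(rounds):
        rank = step(rank, links)
    return rank


# Rank sink: b and c link only to each other, while a feeds rank into the loop.
sink = iterate({"a": ["b"], "b": ["c"], "c": ["b"]}, rounds=50)

# Dangling page: b has no forward links, so its rank is never passed on.
dangling = iterate({"a": ["b"], "b": []}, rounds=50)

print(sink["a"], sink["b"] + sink["c"])  # a holds nothing, the loop holds everything
print(sum(dangling.values()))            # all rank has leaked away
```

In the sink case the loop ends up holding all of the rank and page a none, and the values inside the loop keep swapping rather than settling. In the dangling case the total rank drops to zero after two updates.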
14 Contd. (More Random Surfer)
- How do we escape from this?
- A: We actually escape from it.
- Say a surfer is randomly clicking and hopping from one page to another.
- If this surfer keeps returning to the same set of pages, she will get bored (in reality too) and try to escape from this set of pages.
- Hence, we associate an escape factor $E$ to account for this boredom.
- How do we model this escape probability?
- We take $E$ to be a vector over all the web pages that accounts for each page's escape probability.
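15 Contd.
- Given this escape vector, how do we associate it with the original model?
- Following the paper: $R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c E(u)$
- In matrix notation: $R' = c (A R' + E)$, where $\|R'\|_1 = 1$
- It can be rewritten as $R' = c (A + E \times \mathbf{1}) R'$, with $\mathbf{1}$ the vector of all ones
- Hence $R'$ is an eigenvector of $(A + E \times \mathbf{1})$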
16 PageRank Unleashed: Looking under the hood
The main algorithm
- What can we say about $d$ and $\epsilon$?
- d1 is called the eigengap and it controls the rate of convergence
- $\epsilon$ is the convergence threshold
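The iteration can be sketched roughly as follows: follow links, measure the rank d lost in that step, give it back through the escape vector E, and stop once the L1 change falls to the threshold epsilon. This is a sketch under stated assumptions, not the paper's exact listing: the damping factor c = 0.85 applied to the link step, the convention that E sums to 1, the toy graph, and all names are hypothetical.

```python
def pagerank(links, E, c=0.85, epsilon=1e-10, max_iterations=1000):
    """Iterate R <- c*A*R + d*E until the L1 change drops to epsilon.

    links: page -> list of pages it links to
    E:     page -> source weight (assumed here to sum to 1)
    c:     hypothetical damping factor applied to the link-following step
    """
    pages = list(links)
    R = {u: 1.0 / len(pages) for u in pages}            # initial ranks
    for _ in range(max_iterations):
        new = {u: 0.0 for u in pages}
        for v, targets in links.items():                # new = c * A * R
            for u in targets:
                new[u] += c * R[v] / len(targets)
        d = sum(R.values()) - sum(new.values())         # rank lost in this step
        for u in pages:                                 # return it through E
            new[u] += d * E[u]
        delta = sum(abs(new[u] - R[u]) for u in pages)  # L1 change
        R = new
        if delta <= epsilon:                            # converged
            break
    return R


# A graph with a rank sink (b and c only link to each other) and a uniform E.
graph = {"a": ["b"], "b": ["c"], "c": ["b"]}
uniform_E = {page: 1.0 / len(graph) for page in graph}
ranks = pagerank(graph, uniform_E)
print(ranks)
```

With the escape term in place, page a keeps a small positive rank (0.15 / 3 = 0.05 under these assumptions) instead of dropping to zero, and the total rank stays at 1.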
17 Convergence and Random Walks: Why does it work?
- Irreducible, aperiodic Markov chains with a primitive transition probability matrix
- What is the issue all about?
- We need a transition matrix model that is guaranteed to converge, and that converges to a unique stationary distribution vector.
18 Contd.
- Adding the escape vector $E$ makes the original matrix $A$ both primitive and stochastic
- This guarantees convergence
- What about the addition of new links?
- Are link analysis algorithms based on eigenvectors stable, in the sense that their results don't change significantly?
- Suppose the connectivity of a portion of the graph is changed arbitrarily
- How will this affect the results of the algorithms?
- Ng et al. (2001), IJCAI, and Bianchini et al. (2002), WWW02
- It is possible to perturb a symmetric matrix by a quantity that grows as d1 and that produces a constant perturbation of the dominant eigenvector
19 Contd.
- Convergence experiment(s)
- Expander graphs and d1 (every subset $S$ has a neighborhood larger than some factor $\alpha$ times $|S|$)
- Rapidly mixing random walk: convergence is guaranteed in time logarithmic in the size of the graph
20 Implementation: Getting your hands dirty
- In 1998
- 24 million web pages
- Crawler builds an index of links
- To do this in 5 days, about 50 web pages/second need to be crawled
- With an average outdegree of 11, that is 550 links/second
- 75 million unique URLs to be compared against
- URLs are hashed to unique integer IDs
- No dangling links are kept initially
- Vector $E$ also helps with convergence issues
- Weights were kept for 75 million URLs at 4 bytes/weight (300 MB)
- Access to the link database is linear since it is sorted
- '99: 800 million pages; '00: 2 billion; '01: 4 billion
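The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are made up for illustration.

```python
pages = 24_000_000                  # pages crawled in 1998
crawl_seconds = 5 * 24 * 3600       # five days in seconds

pages_per_second = pages / crawl_seconds  # a little over the slide's rounded 50
links_per_second = 50 * 11                # rounded rate times average outdegree

urls = 75_000_000
bytes_per_weight = 4
weights_megabytes = urls * bytes_per_weight / 1_000_000  # decimal megabytes

print(round(pages_per_second, 1))  # 55.6
print(links_per_second)            # 550
print(weights_megabytes)           # 300.0
```

The exact crawl rate is closer to 56 pages per second, so the slide's 50 pages/second and 550 links/second are rounded figures.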
21 Personalized PageRank: The invisible source
- $\|E\|_1 = 0.15$
- Web pages are valued because they exist!
- Web pages with many related links receive an overly high ranking
- The other extreme: $E$ concentrated on just one web page
- Netscape home page and John McCarthy's home page
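The two extremes can be compared on a toy graph: a uniform E spreads the escape rank over every page, while an E concentrated on one page biases the ranking toward that page and whatever it links to. This is a sketch under assumptions: E is written as a distribution summing to 1, with a hypothetical damping factor of 0.85 so that the source contributes 0.15 of the total rank; the four-page graph and all names are invented.

```python
def pagerank(links, E, c=0.85, iterations=200):
    """Damped link-following plus redistribution of the lost rank through E."""
    pages = list(links)
    R = {u: 1.0 / len(pages) for u in pages}
    for _ in range(iterations):
        new = {u: 0.0 for u in pages}
        for v, targets in links.items():
            for u in targets:
                new[u] += c * R[v] / len(targets)
        d = sum(R.values()) - sum(new.values())   # rank lost in this step
        R = {u: new[u] + d * E[u] for u in pages}
    return R


# Hypothetical graph: nobody links to "personal", which links to "blog".
links = {
    "portal": ["news"],
    "news": ["portal", "blog"],
    "blog": ["news"],
    "personal": ["blog"],
}

uniform_E = {u: 1.0 / len(links) for u in links}
single_E = {u: 1.0 if u == "personal" else 0.0 for u in links}

uniform = pagerank(links, uniform_E)
personalized = pagerank(links, single_E)

print(uniform["personal"], personalized["personal"])  # about 0.0375 and 0.15
```

Under the uniform source the unlinked page only receives its share of the escape rank; when E sits entirely on it, the page receives all of the escape rank and passes part of it on to the page it links to.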
22 Applications: What wasn't apparent already
- Estimating web traffic
- How PageRank corresponds to actual usage
- Internet proxy cache from NLANR compared to PageRank
- 2.6 million pages intersect with the 75 million URLs indexed for PageRank
- Web-based email access is one plausible reason for this disparity
- People look at certain pages but never link to them
- Backlink predictor
- PageRank is a better predictor of future citation counts than citation counts themselves.
- The experiment starts out with one URL and no other information
- The goal is to crawl the Web in order of page importance
- Importance is an evaluation function on the citation count (number of backlinks)
- PageRank escapes local minima; citation counts get stuck in them.
23 Conclusions
- In essence, making the importance of one page depend on the importance of its predecessors is like a peer review.
- NASDAQ, 17th February 2005: 197.41. Need I say more?